Subset a file by row and column numbers - bash

We want to subset a text file on rows and columns, where the row and column numbers are read from files, excluding the header (row 1) and the row names (col 1).
inputFile.txt: tab-delimited text file
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
82 0 0 1 1 1 0
37 0 1 0 0 1 0
18 0 1 0 0 1 0
53 0 0 1 0 0 0
57 1 1 1 1 1 1
subsetCols.txt: comma-separated with no spaces, one row, numbers ordered. In the real data we have 500K columns and need to subset ~10K.
1,4,6
subsetRows.txt: comma-separated with no spaces, one row, numbers ordered. In the real data we have 20K rows and need to subset ~300.
1,3,7
Current solution using cut and awk loop (Related post: Select rows using awk):
# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt
# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput
Output file: result.txt
1 4 6
3 3 3
7 7 7
Question:
This solution works fine for small files, but for bigger files (50K rows and 200K columns) it is taking too long: 15 minutes plus and still running. I think cutting the columns works fine; selecting the rows is the slow bit.
Any better way?
Real input files info:
# $fileInput:
# Rows = 20127
# Cols = 533633
# Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers
More information about the file: the file contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns) to make the data more manageable (smaller) as input for other statistical software like R.
System:
$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
Update: The solution provided below by @JamesBrown was mixing the order of columns on my system because I am using a different version of awk; my version is GNU Awk 3.1.7.

Even though in If programming languages were countries, which country would each language represent? they say that...
Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.
... whenever you see yourself piping sed, cut, grep, awk, etc, stop and say to yourself: awk can make it alone!
So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.
awk -v cols="1 4 6" -v rows="1 3 7" '
BEGIN{
    split(cols,c); for (i in c) col[c[i]]    # extract cols to print
    split(rows,r); for (i in r) row[r[i]]    # extract rows to print
}
(NR-1 in row){
    for (i=2;i<=NF;i++)
        (i-1) in col && line=(line ? line OFS $i : $i); # pick columns
    print line; line=""                                 # print them
}' file
With your sample file:
$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7
With your sample file, and inputs as variables, split on comma:
awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' $fileInput
I am quite sure this will be way faster. You can for example check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk over grep and others.

One for GNU awk version 4.0 or later, as the column ordering relies on for and PROCINFO["sorted_in"]. The row and col numbers are read from files:
$ awk '
BEGIN {
PROCINFO["sorted_in"]="@ind_num_asc";
}
FILENAME==ARGV[1] { # process rows file
n=split($0,t,",");
for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] { # process cols file
m=split($0,t,",");
for(i=1;i<=m;i++) c[t[i]]
}
FILENAME==ARGV[3] && ((FNR-1) in r) { # process data file
for(i in c)
printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt
1 4 6
3 3 3
7 7 7
Some performance gain could probably come from moving the ARGV[3] processing block to the top, before blocks 1 and 2, and adding a next to its end, for example:
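Here is an untested sketch of that reordering, with the same logic as the script above:
$ awk '
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
}
FILENAME==ARGV[3] && ((FNR-1) in r) {     # data file block first: matching lines skip the tests below via next
    for(i in c)
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
    next
}
FILENAME==ARGV[1] { n=split($0,t,","); for(i=1;i<=n;i++) r[t[i]] }   # rows file
FILENAME==ARGV[2] { m=split($0,t,","); for(i=1;i<=m;i++) c[t[i]] }   # cols file
' subsetRows.txt subsetCols.txt inputFile.txt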

Not to take anything away from either of the excellent answers above, but because this problem involves a large set of data, I am posting a combination of the 2 answers to speed up the processing.
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
n = split(cols, c, /,/)
split(rows, r, /,/)
for (i in r)
row[r[i]]
}
(NR-1) in row {
for (i=1; i<=n; i++)
printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
PS: This should work with older awk versions or non-GNU awk as well.

To refine @anubhava's solution, we can get rid of searching over 10k values for each row to see whether we are on the right row, by taking advantage of the fact that the input is already sorted:
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
n = split(cols, c, /,/)
split(rows, r, /,/)
j=1;
}
(NR-1) == r[j] {
j++
for (i=1; i<=n; i++)
printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt

Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.
This should slice columns 20,000 to 30,000.
import csv
with open('foo.txt') as f:
    gwas = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    for row in gwas:
        print(row[20001:30001])

Related

How to sum rows in a tsv file using awk?

My input:
Position A B C D No
1 0 0 0 0 0
2 1 0 1 0 0
3 0 6 0 0 0
4 0 0 0 0 0
5 0 5 0 0 0
I have a TSV file, like the above, where I wish to sum the rows of numbers in the ABCD columns only, not the Position column.
The desired output would be a TSV with two columns, with Position and Sum as the header in the first row:
Position Sum
1 0
2 2
3 6
4 0
5 5
So far I have:
awk 'BEGIN{print"Position\tSum"}{if(NR==1)next; sum=$2+$3+$4+$5 printf"%d\t%d\n",$sum}' infile.tsv > outfile.tsv
You were very close, try this:
awk 'BEGIN{print"Position\tSum"}{if(NR==1)next; sum=$2+$3+$4+$5; printf "%d\t%d\n",$1,sum; }' infile.tsv > outfile.tsv
But I say it's way cleaner with newlines and spaces:
awk '
BEGIN {
    print "Position\tSum";
}
{
    if (NR==1) {
        next;
    }
    sum = $2 + $3 + $4 + $5;
    printf "%d\t%d\n", $1, sum;
}'
a minimalist script can be
$ awk '{print $1 "\t" (NR==1?"Sum":$2+$3+$4+$5)}' file
Could you please try the following. You were trying to hard-code field numbers, which will not work in many cases, so I am using a loop approach (where we skip the first field and sum the remaining value columns).
awk 'FNR==1{print $1,"sum";next} {for(i=2;i<NF;i++){sum+=$i};print $1,sum;sum=""}' Input_file
Change awk ' to awk 'BEGIN{OFS="\t"} and keep the rest of the code the same in case you need the output in TAB-separated form.
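For example, the tab-separated output version would be:
awk 'BEGIN{OFS="\t"} FNR==1{print $1,"sum";next} {for(i=2;i<NF;i++){sum+=$i};print $1,sum;sum=""}' Input_file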

Sum of all rows of all columns - Bash

I have a file like this
1 4 7 ...
2 5 8
3 6 9
And I would like to have as output
6 15 24 ...
That is the sum of all the lines for each column. I know that to sum all the lines of a certain column (say column 1) you can do this:
awk '{sum+=$1;}END{print $1}' infile > outfile
But I can't do it automatically for all the columns.
One more awk
awk '{for(i=1;i<=NF;i++)$i=(a[i]+=$i)}END{print}' file
Output
6 15 24
Explanation
{for (i=1;i<=NF;i++) Loop over every field on the line
$i=(a[i]+=$i) Add the field to its running total in a[i] and set the field to that total
END{print} Print the last line, which now contains the sums
As with the other answers, this will retain the order of the fields regardless of the number of them.
You want to sum every column differently. Hence, you need an array, not a scalar:
$ awk '{for (i=1;i<=NF;i++) sum[i]+=$i} END{for (i in sum) print sum[i]}' file
6
15
24
This stores sum[column] and finally prints it.
To have the output in the same line, use:
$ awk '{for (i=1;i<=NF;i++) sum[i]+=$i} END{for (i in sum) printf "%d%s", sum[i], (i==NF?"\n":" ")}' file
6 15 24
This uses the trick printf "%d%s", sum[i], (i==NF?"\n":" "): print the number plus a separator character. If we are on the last field, that character is a newline; otherwise, just a space.
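The same trick in isolation, as a minimal illustration:
$ awk 'BEGIN{n=3; for (i=1;i<=n;i++) printf "%d%s", i, (i==n?"\n":" ")}'
1 2 3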
There is a very simple command called numsum to do this:
numsum -c FileName
-c --- Print out the sum of each column.
For example:
cat FileName
1 4 7
2 5 8
3 6 9
Output :
numsum -c FileName
6 15 24
Note:
If the command is not installed on your system, you can install it with:
apt-get install num-utils
echo "1 4 7
2 5 8
3 6 9 " \
| awk '{for (i=1;i<=NF;i++){
sums[i]+=$i;maxi=i}
}
END{
for(i=1;i<=maxi;i++){
printf("%s ", sums[i])
}
print}'
output
6 15 24
My recollection is that you can't rely on for (i in sums) to produce the keys in any particular order, but maybe this is "fixed" in newer versions of gawk.
In case you're using an old-line Unix awk, this solution will keep your output in the same column order, regardless of how "wide" your file is.
IHTH
AWK Program
#!/usr/bin/awk -f
{
print($0);
len=split($0,a);
if (maxlen < len) {
maxlen=len;
}
for (i=1;i<=len;i++) {
b[i]+=a[i];
}
}
END {
for (i=1;i<=maxlen;i++) {
printf("%s ", b[i]);
}
print ""
}
Output
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
3 6 9 12 15
Your answer is almost correct; it just misses printing sum (it prints $1 instead). Try this:
awk '{sum+=$1;} END{print sum;}' infile > outfile

How to sum up matrices in multiple files using bash or awk

If I have an arbitrary number of files, say n files, and each file contains a matrix, how can I use bash or awk to sum up all the matrices in each file and get an output?
For example, if n=3, and I have these 3 files with the following contents
$ cat mat1.txt
1 2 3
4 5 6
7 8 9
$ cat mat2.txt
1 1 1
1 1 1
1 1 1
$ cat mat3.txt
2 2 2
2 2 2
2 2 2
I want to get this output:
$ cat output.txt
4 5 6
7 8 9
10 11 12
Is there a simple one liner to do this?
Thanks!
$ awk '{for (i=1;i<=NF;i++) total[FNR","i]+=$i;} END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print "";}}' mat1.txt mat2.txt mat3.txt
4 5 6
7 8 9
10 11 12
This will automatically adjust to different size matrices. I don't believe that I have used any GNU features so this should be portable to OSX and elsewhere.
How it works:
This command reads each line from each matrix, one matrix at a time.
For each line read, the following command is executed:
for (i=1;i<=NF;i++) total[FNR","i]+=$i
This loops over every column on the line and adds it to the array total.
GNU awk has multidimensional arrays but, for portability, they are not used here. awk's arrays are associative and this creates an index from the file's line number, FNR, and the column number i, by combining them together with a comma. The result should be portable.
After all the matrices have been read, the results in total are printed:
END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print ""}}
Here, j loops over each line up to the total number of lines, FNR. Then i loops over each column up to the total number of columns, NF. For each row and column, the total is printed via printf "%3i ",total[j","i]. This prints the total as a 3-character-wide integer. If your numbers are floats or are bigger, adjust the format accordingly.
At the end of each row, the print "" statement causes a newline character to be printed.
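For example, a float-friendly variant of the same command could look like this (the %8.2f width is an arbitrary choice; adjust it to your data):
awk '{for (i=1;i<=NF;i++) total[FNR","i]+=$i;} END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%8.2f ",total[j","i]; print "";}}' mat1.txt mat2.txt mat3.txt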
You can use awk with paste:
awk -v n=3 '{for (i=1; i<=n; i++) printf "%s%s", ($i + $(i+n) + $(i+n*2)),
(i==n)?ORS:OFS}' <(paste mat{1,2,3}.txt)
4 5 6
7 8 9
10 11 12
GNU awk has multi-dimensional arrays.
gawk '
{
for (i=1; i<=NF; i++)
m[i][FNR] += $i
}
END {
for (y=1; y<=FNR; y++) {
for (x=1; x<=NF; x++)
printf "%d ", m[x][y]
print ""
}
}
' mat{1,2,3}.txt

Grouping elements by two fields on a space delimited file

I have this data in a space-delimited file, ordered by column 2, then 3, and then 1 (I used Linux sort to do that):
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
I want to create a new file (leaving the old file as is)
0 2 0,1,2
1 4 1,2
Basically, put fields 2 and 3 first and group the elements of field 1 (as a comma-separated list) by them. Is there a way to do that with an awk, sed, or bash one-liner, so as to avoid writing a Java or C++ app for it?
Since the file is already ordered, you can print each accumulated group as the key changes:
awk '
seen==$2 FS $3 { line=line "," $1; next }
{ if(seen) print seen, line; seen=$2 FS $3; line=$1 }
END { print seen, line }
' file
0 2 0,1,2
1 4 1,2
This will preserve the order of output.
With your input and output, this line may help:
awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}
{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' file
test:
kent$ cat f
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
kent$ awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' f
0 2 0,1,2
1 4 1,2
awk 'a[$2, $3]++ { p = p "," $1; next } p { print p } { p = $2 FS $3 FS $1 } END { if (p) print p }' file
Output:
0 2 0,1,2
1 4 1,2
The solution assumes the data in the second and third columns is sorted.
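If the input were not already grouped, a pre-sort along these lines (my assumption about the intended keys, matching the OP's column 2, then 3, then 1 ordering) would produce it:
sort -k2,2n -k3,3n -k1,1n file > sorted_file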
Using awk:
awk '{k=$2 OFS $3} !(k in a){a[k]=$1; b[++n]=k; next} {a[k]=a[k] "," $1}
END{for (i=1; i<=n; i++) print b[i],a[b[i]]}' file
0 2 0,1,2
1 4 1,2
Yet another take:
awk -v SUBSEP=" " '
{group[$2,$3] = group[$2,$3] $1 ","}
END {
for (g in group) {
sub(/,$/,"",group[g])
print g, group[g]
}
}
' file > newfile
The SUBSEP variable is the string that awk uses to join the subscripts in its (simulated) multidimensional arrays.
http://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional
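A minimal illustration of SUBSEP in action:
$ awk -v SUBSEP=" " 'BEGIN { x["a","b"] = 1; for (k in x) print k }'
a b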
This might work for you (GNU sed):
sed -r ':a;$!N;/(. (. .).*)\n(.) \2.*/s//\1,\3/;ta;s/(.) (.) (.)/\2 \3 \1/;P;D' file
This appends the first column of the subsequent record to the first record until the second and third keys change. Then the fields in the first record are re-arranged and printed out.
This uses the data presented but can be adapted for more complex data.

How can I use awk to sort columns by the last value of a column?

I have a file like this (with hundreds of lines and columns)
1 2 3
4 5 6
7 88 9
and I would like to re-order the columns based on the last line's values (or a specific line's values)
1 3 2
4 6 5
7 9 88
How can I use awk (or other) to accomplish this task?
Thank you in advance for your help
EDIT: I would like to thank everybody and to apologize if I wasn't enough clear.
What I would like to do is:
take a line (for example the last one);
reorder the columns of the matrix using the sorted values of the chosen line to determine the order.
So, the last line is 7 88 9, which sorted is 7 9 88; the three columns then have to be reordered so that, in this case, the last two columns are swapped.
A four-column more generic example, based on the last line again:
Input:
1 2 3 4
4 5 6 7
7 88.0 9 -3
Output:
4 1 3 2
7 4 6 5
-3 7 9 88.0
Here's a quick, dirty and improvable solution (edited because the OP clarified that the numbers are floating point).
$ cat test.dat
1 2 3
4 5 6
.07 .88 -.09
$ awk "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
3 1 2
6 4 5
-.09 .07 .88
To see what's going on there (or at least part of it):
$ echo "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
{print $3,$1,$2} test.dat
To make it work for an arbitrary line, replace tail -n1 with tail -n+$L|head -n1
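For example, to base the ordering on line 2, the pipeline becomes (an untested sketch; only the tail part changes):
$ L=2
$ awk "{print $(printf '$%d%.0s\n' \
    $(i=0; for x in $(tail -n+$L test.dat | head -n1); do
      echo $((++i)) $x
    done |
    sort -k2g) | paste -sd,)}" test.dat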
This problem can be elegantly solved using GNU awk's array sorting feature. GNU awk allows you to control array traversal using PROCINFO. So two passes of the file are required, the first pass to split the last record into an array and the second pass to loop through the indices of the array in value order and output fields based on indices. The code below probably explains it better than I do.
awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"};
NR == FNR {for (x in arr) delete arr[x]; split($0, arr)};
NR != FNR{sep=""; for (x in arr) {printf sep""$x; sep=" "} print ""}' file.txt file.txt
4 1 3 2
7 4 6 5
-3 7 9 88.0
Update:
Create a file called transpose.awk like this:
{
    for (i=1; i<=NF; i++) {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str OFS a[i,j];
        }
        print str
    }
}
Now here is the script that should do work for you:
awk -f transpose.awk file | sort -n -k $(awk 'NR==1{print NF}' file) | awk -f transpose.awk
1 3 2
4 6 5
7 9 88
I am using transpose.awk twice here: once to transpose rows to columns, then I do a numeric sort by the last column, and then I transpose rows to columns again. It may not be the most efficient solution, but it works as per the OP's requirements.
The transposing awk script is courtesy of @ghostdog74, from An efficient way to transpose a file in Bash.
