How can I use awk to sort columns by the last value of a column? - bash

I have a file like this (with hundreds of lines and columns)
1 2 3
4 5 6
7 88 9
and I would like to re-order columns basing on the last line values (or a specific line values)
1 3 2
4 6 5
7 9 88
How can I use awk (or other) to accomplish this task?
Thank you in advance for your help
EDIT: I would like to thank everybody and to apologize if I wasn't enough clear.
What I would like to do is:
take a line (for example the last one);
reorder the columns of the matrix using the sorted values of the chosen line to derermine the order.
So, the last line is 7 88 9, which sorted is 7 9 88, then the three columns have to be reordered in a way such that, in this case, the last two columns are swapped.
A four-column more generic example, based on the last line again:
Input:
1 2 3 4
4 5 6 7
7 88.0 9 -3
Output:
4 1 3 2
7 4 6 5
-3 7 9 88.0

Here's a quick, dirty and improvable solution: (edited because OP clarified that numbers are floating point).
$ cat test.dat
1 2 3
4 5 6
.07 .88 -.09
$ awk "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
3 1 2
6 4 5
-.09 .07 .88
To see what's going on there (or at least part of it):
$ echo "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
{print $3,$1,$2} test.dat
To make it work for an arbitrary line, replace tail -n1 with tail -n+$L|head -n1

This problem can be elegantly solved using GNU awk's array sorting feature. GNU awk allows you to control array traversal using PROCINFO. So two passes of the file are required, the first pass to split the last record into an array and the second pass to loop through the indices of the array in value order and output fields based on indices. The code below probably explains it better than I do.
awk 'BEGIN{PROCINFO["sorted_in"] = "#val_num_asc"};
NR == FNR {for (x in arr) delete arr[x]; split($0, arr)};
NR != FNR{sep=""; for (x in arr) {printf sep""$x; sep=" "} print ""}' file.txt file.txt
4 1 3 2
7 4 6 5
-3 7 9 88.0

Update:
Create a file called transpose.awk like this:
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str OFS a[i,j];
}
print str
}
}
Now here is the script that should do work for you:
awk -f transpose.awk file | sort -n -k $(awk 'NR==1{print NF}' file) | awk -f transpose.awk
1 3 2
4 6 5
7 9 88
I am using transpose.awk twice here. Once to transpose rows to columns then I am doing numeric sorting by last column and then again I am transposing rows to columns. It may not be most efficient solution but it is something that works as per the OP's requirements.
transposing awk script courtesy of: #ghostdog74 from An efficient way to transpose a file in Bash

Related

awk with loops to reorder columns

I am trying to reorder the columns of a file writting a awk programn. The file looks like:
My little program to reorder the columns is:
awk -v column=number 'BEGIN {FS=","; ORS="\n"; OFS=","; n=column} {for (i=1; i<=NF; i++){if (i!=n) $(i+1)=$i else $1=$i} {print $0}' file_name
I would like to put first the column given with number and then the remaing ones, but it does not work
You are overwriting fields as you iterate. You should instead "bubble" the value from position in column to the first position.
Consider how would you move column 3 here:
1 2 3 4
to
3 1 2 4
For example with this input file:
$ cat table
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
You could do it like this (separators changed to whitespace for readability):
$ awk -v col=3 '{val=$col; for (i=col; i>1; i--) $i=$(i-1); $1=val; print $0}' table
3 1 2 4
4 2 3 5
5 3 4 6
6 4 5 7
You can simply print the required column first and then the rest of columns.
$ awk -v column=3 -F "," '{n=column; printf $column; for (i=1; i<NF; i++){if (i!=n) printf ","$i} print ","$NF}' file
Input:
10,Hello,meow,20,30
hello,world,34,meow,60
Output:
meow,10,Hello,20,30
34,hello,world,meow,60

how to sort values of 1st column and print only corresponding values of third column using awk

I tried to sort the file using awk '{print $0|"sort -t',' -nk1 "}' but I want to print only the third column of sorted file. Input file:
1 4 7 9
9 7 4 1
4 6 8 9
1 2 3 4
5 4 5 2
the expected output:
3
7
8
5
4
Try this -
sort file|awk '{print $3}'
3
7
8
5
4
Two simple ways "to print only the third column of sorted file":
with tr + cut command:
sort -n file | tr -s ' ' | cut -d' ' -f3
with awk:
sort -n file | awk '{print $3}'
Well, since I already started doing it using Gnu awk's asorti:
$ awk '
{ a[$1 "," $3]=$3 } # get the data to a hash (*)
END { n=asorti(a) # sort a by the index
for(i=1;i<=n;i++) { # for each ordered index
split(a[i],b,",") # split and
print b[2] # print the latter part
}
}' file
3
7
8
5
4
(*) If using only $1 as the key there will be a collision when $1=1. For this data using $1 "," $3 won't produce a collision and will sort the $3 as well. However, there could be a collision. The correct way would be to keep a count of the keys and have a sub for loop to print those out (or have keys and values in different arrays indexed with NR). That will be left as an exersize.

Subset a file by row and column numbers

We want to subset a text file on rows and columns, where rows and columns numbers are read from a file. Excluding header (row 1) and rownames (col 1).
inputFile.txt Tab delimited text file
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
82 0 0 1 1 1 0
37 0 1 0 0 1 0
18 0 1 0 0 1 0
53 0 0 1 0 0 0
57 1 1 1 1 1 1
subsetCols.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 500K columns, and need to subset ~10K.
1,4,6
subsetRows.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 20K rows, and need to subset about ~300.
1,3,7
Current solution using cut and awk loop (Related post: Select rows using awk):
# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt
# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput
Output file: result.txt
1 4 6
3 3 3
7 7 7
Question:
This solution works fine for small files, for bigger files 50K rows and 200K columns, it is taking too long, 15 minutes plus, still running. I think cutting the columns works fine, selecting rows is the slow bit.
Any better way?
Real input files info:
# $fileInput:
# Rows = 20127
# Cols = 533633
# Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers
More information about the file: file contains GWAS genotype data. Every row represents sample (individual) and every column represents SNP. For further region based analysis we need to subset samples(rows) and SNPs(columns), to make the data more manageable (small) as an input for other statistical softwares like r.
System:
$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
Update: Solution provided below by #JamesBrown was mixing the orders of columns in my system, as I am using different version of awk, my version is: GNU Awk 3.1.7
Even though in If programming languages were countries, which country would each language represent? they say that...
Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.
... whenever you see yourself piping sed, cut, grep, awk, etc, stop and say to yourself: awk can make it alone!
So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.
awk -v cols="1 4 6" -v rows="1 3 7" '
BEGIN{
split(cols,c); for (i in c) col[c[i]] # extract cols to print
split(rows,r); for (i in r) row[r[i]] # extract rows to print
}
(NR-1 in row){
for (i=2;i<=NF;i++)
(i-1) in col && line=(line ? line OFS $i : $i); # pick columns
print line; line="" # print them
}' file
With your sample file:
$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7
With your sample file, and inputs as variables, split on comma:
awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' $fileInput
I am quite sure this will be way faster. You can for example check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk over grep and others.
One in Gnu awk version 4.0 or later as column ordering relies on for and PROCINFO["sorted_in"]. The row and col numbers are read from files:
$ awk '
BEGIN {
PROCINFO["sorted_in"]="#ind_num_asc";
}
FILENAME==ARGV[1] { # process rows file
n=split($0,t,",");
for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] { # process cols file
m=split($0,t,",");
for(i=1;i<=m;i++) c[t[i]]
}
FILENAME==ARGV[3] && ((FNR-1) in r) { # process data file
for(i in c)
printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt
1 4 6
3 3 3
7 7 7
Some performance gain could probably come from moving the ARGV[3] processing block to the top berore 1 and 2 and adding a next to it's end.
Not to take anything away from both excellent answers. Just because this problem involves large set of data I am posting a combination of 2 answers to speed up the processing.
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
n = split(cols, c, /,/)
split(rows, r, /,/)
for (i in r)
row[r[i]]
}
(NR-1) in row {
for (i=1; i<=n; i++)
printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
PS: This should work with older awk version or non-gnu awk as well.
to refine #anubhava solution we can
get rid of searching over 10k values for each row
to see if we are on the right row by takeing advantage of the fact the input is already sorted
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
n = split(cols, c, /,/)
split(rows, r, /,/)
j=1;
}
(NR-1) == r[j] {
j++
for (i=1; i<=n; i++)
printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.
This should slice columns 20,000 to 30,000.
import csv
with open('foo.txt') as f:
gwas = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
for row in gwas:
print(row[20001:30001]

Sum of all rows of all columns - Bash

I have a file like this
1 4 7 ...
2 5 8
3 6 9
And I would like to have as output
6 15 24 ...
That is the sum of all the lines for all the columns. I know that to sum all the lines of a certain column (say column 1) you can do like this:
awk '{sum+=$1;}END{print $1}' infile > outfile
But I can't do it automatically for all the columns.
One more awk
awk '{for(i=1;i<=NF;i++)$i=(a[i]+=$i)}END{print}' file
Output
6 15 24
Explanation
{for (i=1;i<=NF;i++) Set field to 1 and increment through
$i=(a[i]+=$i) Set the field to the sum + the value in field
END{print} Print the last line which now contains the sums
As with the other answers this will retain the order of the fields regardless of the number of them.
You want to sum every column differently. Hence, you need an array, not a scalar:
$ awk '{for (i=1;i<=NF;i++) sum[i]+=$i} END{for (i in sum) print sum[i]}' file
6
15
24
This stores sum[column] and finally prints it.
To have the output in the same line, use:
$ awk '{for (i=1;i<=NF;i++) sum[i]+=$i} END{for (i in sum) printf "%d%s", sum[i], (i==NF?"\n":" ")}' file
6 15 24
This uses the trick printf "%d%s", sum[i], (i==NF?"\n":" "): print the digit + a character. If we are in the last field, let this char be new line; otherwise, just a space.
There is a very simple command called numsum to do this:
numsum -c FileName
-c --- Print out the sum of each column.
For example:
cat FileName
1 4 7
2 5 8
3 6 9
Output :
numsum -c FileName
6 15 24
Note:
If the command is not installed in your system, you can do it with this command:
apt-get install num-utils
echo "1 4 7
2 5 8
3 6 9 " \
| awk '{for (i=1;i<=NF;i++){
sums[i]+=$i;maxi=i}
}
END{
for(i=1;i<=maxi;i++){
printf("%s ", sums[i])
}
print}'
output
6 15 24
My recollection is that you can't rely on for (i in sums) to produce the keys any particular order, but maybe this is "fixed" in newer versions of gawk.
In case you're using an old-line Unix awk, this solution will keep your output in the same column order, regardless of how "wide" your file is.
IHTH
AWK Program
#!/usr/bin/awk -f
{
print($0);
len=split($0,a);
if (maxlen < len) {
maxlen=len;
}
for (i=1;i<=len;i++) {
b[i]+=a[i];
}
}
END {
for (i=1;i<=maxlen;i++) {
printf("%s ", b[i]);
}
print ""
}
Output
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
3 6 9 12 15
Your answer is correct. It is just missed to print "sum". Try this:
awk '{sum+=$1;} END{print sum;}' infile > outfile

how to sum up matrices in multiple files using bash or awk

If I have an arbitrary number of files, say n files, and each file contains a matrix, how can I use bash or awk to sum up all the matrices in each file and get an output?
For example, if n=3, and I have these 3 files with the following contents
$ cat mat1.txt
1 2 3
4 5 6
7 8 9
$cat mat2.txt
1 1 1
1 1 1
1 1 1
$ cat mat3.txt
2 2 2
2 2 2
2 2 2
I want to get this output:
$ cat output.txt
4 5 6
7 8 9
10 11 12
Is there a simple one liner to do this?
Thanks!
$ awk '{for (i=1;i<=NF;i++) total[FNR","i]+=$i;} END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print "";}}' mat1.txt mat2.txt mat3.txt
4 5 6
7 8 9
10 11 12
This will automatically adjust to different size matrices. I don't believe that I have used any GNU features so this should be portable to OSX and elsewhere.
How it works:
This command reads from each line from each matrix, one matrix at a time.
For each line read, the following command is executed:
for (i=1;i<=NF;i++) total[FNR","i]+=$i
This loops over every column on the line and adds it to the array total.
GNU awk has multidimensional arrays but, for portability, they are not used here. awk's arrays are associative and this creates an index from the file's line number, FNR, and the column number i, by combining them together with a comma. The result should be portable.
After all the matrices have been read, the results in total are printed:
END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print ""}}
Here, j loops over each line up to the total number of lines, FNR. Then i loops over each column up to the total number of columns, NF. For each row and column, the total is printed via printf "%3i ",total[j","i]. This prints the total as a 3-character-wide integer. If you numbers are float or are bigger, adjust the format accordingly.
At the end of each row, the print "" statement causes a newline character to be printed.
You can use awk with paste:
awk -v n=3 '{for (i=1; i<=n; i++) printf "%s%s", ($i + $(i+n) + $(i+n*2)),
(i==n)?ORS:OFS}' <(paste mat{1,2,3}.txt)
4 5 6
7 8 9
10 11 12
GNU awk has multi-dimensional arrays.
gawk '
{
for (i=1; i<=NF; i++)
m[i][FNR] += $i
}
END {
for (y=1; y<=FNR; y++) {
for (x=1; x<=NF; x++)
printf "%d ", m[x][y]
print ""
}
}
' mat{1,2,3}.txt

Resources