How to sum up on a column - bash

I have a file that looks like this :
A B C 128 D
Z F R 18 -
M W A 1 -
T B D 21 P
Z F R 11 -
L W A 10 D
I am looking for a way to sum up the column 4 ( for the lines witch column 5 look like D) here in this example will be: 128 + 10 = 138 .
I managed to sum up all the 4th column with this command :
cat file.txt |awk '{total+= $4} END {print total}'

You just omitted the pattern to select which lines your action applies to.
awk '$5 == "D" {total += $4} END {print total}' file.txt
In awk, the pattern is applied to each line of input, and if the pattern matches, the action is applied. If there is no pattern (as in your attempt), the line is unconditionally processed.

A solution with datamash and sort:
cat file.txt | sort -k5 | datamash -W -g5 sum 4
sort -k5 for sorting according the 5 column.
datamash uses -W to specify that whitespace is the separator, -g5 to group by 5th column, and finally sum 4 to get sum of 4th column.
It gives this output:
- 30
D 138
P 21

Related

awk insert rows of one file as new columns to every nth rows of another file

Let's keep n=3 here, and say I have two files:
file1.txt
a b c row1
d e f row2
g h i row3
j k l row4
m n o row5
o q r row6
s t u row7
v w x row8
y z Z row9
file2.txt
1 2 3
4 5 6
7 8 9
I would like to merge the two files into a new_file.txt:
new_file.txt
a b c 2 3
d e f 2 3
g h i 2 3
j k l 5 6
m n o 5 6
o q r 5 6
s t u 8 9
v w x 8 9
y z Z 8 9
Currently I do this as follows (there are also slow bash for or while loop solutions, of course): awk '1;1;1' file2.txt > tmp2.txt and then something like awk 'FNR==NR{a[FNR]=$2" "$3;next};{$NF=a[FNR]};1' tmp2.txt file1.txt > new_file.txt for the case listed in my question.
Or put these in one line: awk '1;1;1' file2.txt | awk 'FNR==NR{a[FNR]=$2" "$3;next};{$NF=a[FNR]};1' - file1.txt > new_file.txt. But these do not look elegant at all...
I am looking for a more elegant one liner (perhaps awk) that can effectively do this.
In the real case, let's say for example I have 9 million rows in input file1.txt and 3 million rows in input file2.txt and I would like to append columns 2 and 3 of the first row of file2.txt as the new last columns of the first 3 rows of file1.txt, columns 2 and 3 of the second row of file2.txt as the same new last columns of the next 3 rows of file1.txt, etc, etc.
Thanks!
Try this, see mywiki.wooledge - Process Substitution for details on <() syntax
$ # transforming file2
$ cut -d' ' -f2-3 file2.txt | sed 'p;p'
2 3
2 3
2 3
5 6
5 6
5 6
8 9
8 9
8 9
$ # then paste it together with required fields from file1
$ paste -d' ' <(cut -d' ' -f1-3 file1.txt) <(cut -d' ' -f2-3 file2.txt | sed 'p;p')
a b c 2 3
d e f 2 3
g h i 2 3
j k l 5 6
m n o 5 6
o q r 5 6
s t u 8 9
v w x 8 9
y z Z 8 9
Speed comparison, time shown for two consecutive runs
$ perl -0777 -ne 'print $_ x 1000000' file1.txt > f1
$ perl -0777 -ne 'print $_ x 1000000' file2.txt > f2
$ du -h f1 f2
95M f1
18M f2
$ time paste -d' ' <(cut -d' ' -f1-3 f1) <(cut -d' ' -f2-3 f2 | sed 'p;p') > t1
real 0m1.362s
real 0m1.154s
$ time awk '1;1;1' f2 | awk 'FNR==NR{a[FNR]=$2" "$3;next};{$NF=a[FNR]};1' - f1 > t2
real 0m12.088s
real 0m13.028s
$ time awk '{
if (c==3) c=0;
printf "%s %s %s ",$1,$2,$3;
if (!c++){ getline < "f2"; f4=$2; f5=$3 }
printf "%s %s\n",f4,f5
}' f1 > t3
real 0m13.629s
real 0m13.380s
$ time awk '{
if (c==3) c=0;
main_fields=$1 OFS $2 OFS $3;
if (!c++){ getline < "f2"; f4=$2; f5=$3 }
printf "%s %s %s\n", main_fields, f4, f5
}' f1 > t4
real 0m13.265s
real 0m13.896s
$ diff -s t1 t2
Files t1 and t2 are identical
$ diff -s t1 t3
Files t1 and t3 are identical
$ diff -s t1 t4
Files t1 and t4 are identical
Awk solution:
awk '{
if (c==3) c=0;
main_fields=$1 OFS $2 OFS $3;
if (!c++){ getline < "file2.txt"; f4=$2; f5=$3 }
printf "%s %s %s\n", main_fields, f4, f5
}' file1.txt
c - variable reflecting nth coefficient
getline < file - reads the next record from file
f4=$2; f5=$3 - contain the values of the 2nd and 3rd fields from currently read record of file2.txt
The output:
a b c 2 3
d e f 2 3
g h i 2 3
j k l 5 6
m n o 5 6
o q r 5 6
s t u 8 9
v w x 8 9
y z Z 8 9
This is still a lot slower than Sundeep's cut&paste code on the 100,000 lines test (8s vs 21s on my laptop) but perhaps easier to understand than the other Awk solution. (I had to play around for a bit before getting the indexing right, though.)
awk 'NR==FNR { a[FNR] = $2 " " $3; next }
{ print $1, $2, $3, a[1+int((FNR-1)/3)] }' file2.txt file1.txt
This simply keeps (the pertinent part of) file2.txt in memory and then reads file1.txt and writes out the combined lines. That also means it is limited by available memory, whereas Roman's solution will scale to basically arbitrarily large files (as long as each line fits in memory!) but slightly faster (I get 28s real time for Roman's script with Sundeep's 100k test data).

awk Count number of occurrences

I made this awk command in a shell script to count total occurrences of the $4 and $5.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat ta.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is #### (number) in shell. But I want to get rid of > ag.txt && cat ag.txt | wc -l and instead get output in shell like AG = ####.
This is input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this in the shell or in file for a single occurrences not other patterns.
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you might want to confirm whether you really need -F" ". By default, awk splits on whitespace. This option would only be required if your fields contain embedded tabs, I think.
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line, and increments a counter that is the index on an array (a[]) whose key is build from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
This only validates things that already have array indices, which are NULL per BEGIN.
The parentheses in the increment condition are not required, and are included only for clarity.
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1

Sorting lines in one file given the order in another file

Given a file1:
13 a b c d
5 f a c d
7 d c g a
14 a v s d
and a file2:
7 x
5 c
14 a
13 i
I would like to sort file1 considering the same order of the first column in file2, so that the output should be:
7 d c g a
5 f a c d
14 a v s d
13 a b c d
Is it possible to do this in bash or should I use some "higher" language like python?
Use awk to put the line number from file2 as an extra column in front of file1. Sort the result by that column. Then remove that prefix column
awk 'FNR == NR { lineno[$1] = NR; next}
{print lineno[$1], $0;}' file2 file1 | sort -k 1,1n | cut -d' ' -f2-
Simple solution
for S in $(cat file2 | awk '{print $1}'); do grep $S file1; done

Custom Sort Multiple Files

I have 10 files (1Gb each). The contents of the files are as follows:
head -10 part-r-00000
a a a c b 1
a a a dumbbell 1
a a a f a 1
a a a general i 2
a a a glory 2
a a a h d 1
a a a h o 4
a a a h z 1
a a a hem hem 1
a a a k 3
I need to sort the file based on the last column of each line (descending order), which is of variable length. If there is a match on the numerical value then sort alphabetically by the 2nd last column. The following BASH command works on small datasets (not complete files) and takes 3 second to sort only 10 lines from one file.
cat part-r-00000 | awk '{print $NF,$0}' | sort -nr | cut -f2- -d' ' > FILE
I want the output in a separate FILE. Can someone help me out to speed up the process?
No, once you get rid of the UUOC that's as fast as it's going to get. Obviously you need to add the 2nd-last field to everything too, e.g. something like:
awk '{print $NF,$(NF-1),$0}' part-r-00000 | sort -k1,1nr -k2,2 | cut -f3- -d' '
Check the sort args, I always get mixed up with those..
Reverse order, sort and reverse order:
awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}' file | sort -nr | awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}'
Output:
a a a h o 4
a a a k 3
a a a general i 2
a a a glory 2
a a a h z 1
a a a hem hem 1
a a a dumbbell 1
a a a h d 1
a a a c b 1
a a a f a 1
You can use a Schwartzian transform to accomplish your task,
awk '{print -$NF, $(NF-1), $0}' input_file | sort -n | cut -d' ' -f3-
The awk command prepends each record with the negative of the last field and the second last field.
The sort -n command sorts the record stream in the required order because we used the negative of the last field.
The cut command splits on spaces and cuts the first two fields, i.e., the ones we used to normalize the sort
Example
$ echo 'a a a c b 1
a a a dumbbell 1
a a a f a 1
a a a general i 2
a a a glory 2
a a a h d 1
a a a h o 4
a a a h z 1
a a a hem hem 1
a a a k 3' | awk '{print -$NF, $(NF-1), $0}' | sort -n | cut -d' ' -f3-
a a a h o 4
a a a k 3
a a a glory 2
a a a general i 2
a a a f a 1
a a a c b 1
a a a h d 1
a a a dumbbell 1
a a a hem hem 1
a a a h z 1
$

AWK -- How to do selective multiple column sorting?

In awk, how can I do this:
Input:
1 a f 1 12 v
2 b g 2 10 w
3 c h 3 19 x
4 d i 4 15 y
5 e j 5 11 z
Desired output, by sorting numerical value at $5:
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
Note that the sorting should only affecting $4, $5, and $6 (based on value of $5), in which the previous part of table remains intact.
This could be done in multiple steps with the help of paste:
$ gawk '{print $1, $2, $3}' in.txt > a.txt
$ gawk '{print $4, $5, $6}' in.txt | sort -k 2 -n b.txt > b.txt
$ paste -d' ' a.txt b.txt
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
Personally, I find using awk to safely sort arrays of columns rather tricky because often you will need to hold and sort on duplicate keys. If you need to selectively sort a group of columns, I would call paste for some assistance:
paste -d ' ' <(awk '{ print $1, $2, $3 }' file.txt) <(awk '{ print $4, $5, $6 | "sort -k 2" }' file.txt)
Results:
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
This can be done in pure awk, but as #steve said, it's not ideal. gawk has limited sort functions, and awk has no built-in sort at all. That said, here's a (rather hackish) solution using a compare function in gawk:
[ghoti#pc ~/tmp3]$ cat text
1 a f 1 12 v
2 b g 2 10 w
3 c h 3 19 x
4 d i 4 15 y
5 e j 5 11 z
[ghoti#pc ~/tmp3]$ cat doit.gawk
### Function to be called by asort().
function cmp(i1,v1,i2,v2) {
split(v1,a1); split(v2,a2);
if (a1[2]>a2[2]) { return 1; }
else if (a1[2]<a2[2]) { return -1; }
else { return 0; }
}
### Left-hand-side and right-hand-side, are sorted differently.
{
lhs[NR]=sprintf("%s %s %s",$1,$2,$3);
rhs[NR]=sprintf("%s %s %s",$4,$5,$6);
}
END {
asort(rhs,sorted,"cmp"); ### This calls the function we defined, above.
for (i=1;i<=NR;i++) { ### Step through the arrays and reassemble.
printf("%s %s\n",lhs[i],sorted[i]);
}
}
[ghoti#pc ~/tmp3]$ gawk -f doit.gawk text
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
[ghoti#pc ~/tmp3]$
This keeps your entire input file in arrays, so that lines can be reassembled after the sort. If your input is millions of lines, this may be problematic.
Note that you might want to play with the printf and sprintf functions to set appropriate output field separators.
You can find documentation on using asort() with functions in the gawk man page; look for PROCINFO["sorted_in"].

Resources