awk Count number of occurrences - bash

I made this awk command in a shell script to count the total occurrences of each $4/$5 pair.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat ta.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
Each command prints a number (####) in the shell. But I want to get rid of the > ag.txt && cat ag.txt | wc -l part and instead get output in the shell like AG = ####.
This is the input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this, in the shell or in a file, counting only the single-letter pairs and not other patterns:
AG 2
CT 1
TC 1
TA 1

Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you might want to confirm whether you really need -F" ". A single space is actually awk's default FS, so awk already splits fields on runs of whitespace and the option is redundant here.
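For instance, with the default FS, leading blanks are ignored and runs of whitespace collapse:
$ echo '  alpha   beta ' | awk '{print NF, $1}'
2 alpha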
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line, and increments a counter in an array (a[]) whose key is built from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
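If you do want a stable order, just pipe the result through sort:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt | sort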
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
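As a quick illustration of the in membership test used here:
$ awk 'BEGIN {x["A"]; print ("A" in x), ("B" in x)}'
1 0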
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
    v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
    a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
This only counts pairs whose keys already exist in the array; the BEGIN block creates those entries with null values.
The parentheses in the increment condition are not required, and are included only for clarity.
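Assuming you save the script above as, say, countpairs.awk (the name is arbitrary), you can make it executable and run it directly; output order may vary as noted earlier:
$ chmod +x countpairs.awk
$ ./countpairs.awk file.txt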

Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1
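If your awk lacks interval-expression support (the {2}), which is true of some older awks and of old gawk without --re-interval, the same check can be spelled out:
$ awk '$4$5 ~ /^[[:upper:]][[:upper:]]$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file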

Related

How to get the number of different lines in bash [duplicate]

I have a command (cmd1) that greps through a log file to filter out a set of numbers. The numbers are
in random order, so I use sort -gr to get a reverse sorted list of numbers. There may be duplicates within
this sorted list. I need to find the count for each unique number in that list.
For example, if the output of cmd1 is:
100
100
100
99
99
26
25
24
24
I need another command that I can pipe the above output to, so that I get:
100 3
99 2
26 1
25 1
24 2
How about this:
$ echo "100 100 100 99 99 26 25 24 24" \
| tr " " "\n" \
| sort \
| uniq -c \
| sort -k2nr \
| awk '{printf("%s\t%s\n",$2,$1)}'
The result is:
100 3
99 2
26 1
25 1
24 2
uniq -c does exactly what you want here (it's specified by POSIX, and works in GNU coreutils 8.23 at least), assuming sorted input.
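A minimal demonstration (the exact count padding may differ between implementations):
$ printf '%s\n' 100 100 99 | uniq -c
      2 100
      1 99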
If order is not important:
# echo "100 100 100 99 99 26 25 24 24" | awk '{for(i=1;i<=NF;i++)a[$i]++}END{for(o in a) printf "%s %s ",o,a[o]}'
26 1 100 3 99 2 24 2 25 1
Numerically sort the numbers in reverse, then count the duplicates, then swap the left and the right words. Align into columns.
printf '%d\n' 100 99 26 25 100 24 100 24 99 \
| sort -nr | uniq -c | awk '{printf "%-8s%s\n", $2, $1}'
100 3
99 2
26 1
25 1
24 2
In Bash, we can use an associative array to count instances of each input value. Assuming we have the command $cmd1, e.g.
#!/bin/bash
cmd1='printf %d\n 100 99 26 25 100 24 100 24 99'
Then we can count values in the array variable a, declared associative, using the ++ arithmetic operator on the relevant array entries:
declare -A a
while read -r i
do
    ((++a["$i"]))
done < <($cmd1)
We can print the resulting values:
for i in "${!a[#]}"
do
echo "$i ${a[$i]}"
done
If the order of output is important, we might need an external sort of the keys:
for i in $(printf '%s\n' "${!a[@]}" | sort -nr)
do
echo "$i ${a[$i]}"
done
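Putting those pieces together, a complete sketch (reusing the placeholder cmd1 from above):
#!/bin/bash
cmd1='printf %d\n 100 99 26 25 100 24 100 24 99'
declare -A a                   # counts, keyed by input value
while read -r i
do
    ((++a["$i"]))              # bump the count for this value
done < <($cmd1)
for i in $(printf '%s\n' "${!a[@]}" | sort -nr)
do
    echo "$i ${a[$i]}"
done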
In case you have input stored in my_file you can do:
sort -nr my_file | uniq -c | awk ' { t = $1; $1 = $2; $2 = t; print; } '
Otherwise, just pipe the input to be processed into the same command.
Explanation:
sort -nr sorts the input numerically (-n) in reverse order (-r)
uniq -c count duplicates and shows the count side-by-side
awk '{ t = $1; $1 = $2; $2 = t; print; }' swaps the two columns

distribute data in both increment and decrement order

I have a file which has n rows; I want its data distributed across 7 files in the order shown below.
** My input file has n rows; this is just an example.
Input file
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
.
.
28
Output file
1 2 3 4 5 6 7
14 13 12 11 10 9 8
15 16 17 18 19 20 21
28 27 26 25 24 23 22
So if I open the first file, it should have these rows:
1
14
15
28
Similarly, if I open the second file, it should have these rows:
2
13
16
27
And similarly for the other files.
Can anybody please help? With the code below the data gets distributed, but not in the required order.
awk '{print > ("te1234"++c".txt");c=(NR%n)?c:0}' n=7 test6.txt
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
EDIT: Since the OP has changed the sample Input_file completely, adding this solution now; again, this is written and tested with the shown samples only.
With xargs + a single awk (the recommended one):
xargs -n7 < Input_file |
awk '
FNR%2!=0{
  for(i=1;i<=NF;i++){
    print $i >> (i".txt")
    close(i".txt")
  }
  next
}
FNR%2==0{
  for(i=NF;i>0;i--){
    count++
    print $i >> (count".txt")
    close(count".txt")
  }
  count=""
}'
Initial solution:
xargs -n7 < Input_file |
awk '
FNR%2==0{
  for(i=NF;i>0;i--){
    val=(val?val OFS:"")$i
  }
  $0=val
  val=""
}
1' |
awk '
{
  for(i=1;i<=NF;i++){
    print $i >> (i".txt")
    close(i".txt")
  }
}'
The above could also be done with a single awk. For input that is already grouped into rows (like the original sample), you could try the following simpler version, written and tested with the shown samples in GNU awk:
awk '{for(i=1;i<=NF;i++){print $i >> (i".txt");close(i".txt")}}' Input_file
The output file counter could descend for each second group of seven:
awk 'FNR%n==1 {asc=!asc}
{
out="te1234" (asc ? ++c : c--) ".txt";
print >> out;
close(out)
}' n=7 test6.txt
$ ls
file tst.awk
$ cat tst.awk
{ rec = (cnt % 2 ? $1 sep rec : rec sep $1); sep=FS }
!(NR%n) {
    ++cnt
    nf = split(rec,flds)
    for (i=1; i<=nf; i++) {
        out = "te1234" i ".txt"
        print flds[i] >> out
        close(out)
    }
    rec=sep=""
}
$ awk -v n=7 -f tst.awk file
$ ls
file te12342.txt te12344.txt te12346.txt tst.awk
te12341.txt te12343.txt te12345.txt te12347.txt
$ cat te12341.txt
1
14
15
28
$ cat te12342.txt
2
13
16
27
If you can have input that's not an exact multiple of n then move the code that's currently in the !(NR%n) block into a function and call that function there and in an END section.
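A sketch of that refactor, under the same assumptions as above (untested against ragged input):
function flush(    i, nf, flds, out) {
    if (rec == "") return      # nothing buffered, e.g. END after an exact multiple of n
    ++cnt
    nf = split(rec, flds)
    for (i=1; i<=nf; i++) {
        out = "te1234" i ".txt"
        print flds[i] >> out
        close(out)
    }
    rec = sep = ""
}
{ rec = (cnt % 2 ? $1 sep rec : rec sep $1); sep = FS }
!(NR % n) { flush() }
END { flush() }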
This might work for you (GNU sed & parallel):
parallel 'echo {1}~14w file{1}; echo {2}~14w file{1}' ::: {1..7} :::+ {14..8} |
sed -n -f - file &&
paste file{1..7}
Create a sed script to write files named fileN, where N is 1 thru 7 (see the first set of parameters in the parallel command above, and also the paste command).
The sed script uses the n~m address, where n is the first line to match and m is the step between matches thereafter.
The distributed files are created first and the paste command then joins them all together to produce a single output file (tab separated by default, use paste -d option to get desired delimiter).
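A quick illustration of the n~m step address on its own (GNU sed):
$ seq 10 | sed -n '2~3p'
2
5
8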
Alternative using Bash & sed:
for ((n=1,m=14;n<=7;n++,m--));do echo "$n~14w file$n";echo "$m~14w file$n";done |
sed -nf - file &&
paste file{1..7}

Subtract corresponding lines

I have two files, file1.csv
3 1009
7 1012
2 1013
8 1014
and file2.csv
5 1009
3 1010
1 1013
In the shell, I want to subtract the count in the first column in the second file from that in the first file, based on the identifier in the second column. If an identifier is missing in the second column, the count is assumed to be 0.
The result would be
-2 1009
-3 1010
7 1012
1 1013
8 1014
The files are huge (several GB). The second columns are sorted.
How would I do this efficiently in the shell?
Assuming that both files are sorted on the second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014
join will join sorted files.
-j2 will join on the second column.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 is the same as -a1 but applied to file2.
-oauto is in this case the same as -o0,1.1,2.1, which will print the join column first, and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
The output from join is three columns like:
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
This output is piped to awk, which subtracts column three from column two and reformats the result.
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010
It reads the first file in memory so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2 # previous $2 = current $2 -> subtract
3 1010 2 # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1 # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed Morton
Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk
{ split(prev,p) }
$2 == p[2] {
print p[1] - $1, p[2]
prev = ""
next
}
p[2] != "" {
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
split(prev,p)
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
All those options are required:
-j2 says to join based on the second column of each file
-o auto says to make every row have the same format, beginning with the join key
-e 0 says that missing values should be substituted with zero
-a 1 and -a 2 include rows that are absent from one file or another
the filenames (I've used names based on the question number here)
Now we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc program:
sed -re 's/.*/c&-n[ ]np/e'
This takes the three values on each line and rearranges them into a dc command for the subtraction. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
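You can verify a single generated line by hand:
$ echo 'c 1009 3 5 -n[ ]np' | dc
-2 1009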
We can then pipe all these lines into dc, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
> | sed -e 's/.*/c& -n[ ]np/' \
> | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014
¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.
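If in doubt, force the C locale on both the sort and the join so they agree, e.g.:
LC_ALL=C sort -k2,2 file1 > file1.sorted
LC_ALL=C join -j2 -o auto -e 0 -a 1 -a 2 file1.sorted file2.sorted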
TL;DR
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.
This assumes the files are blank-separated; if the separator is a ",", use the argument -F ','.
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
{Discounts[$2]=$1; ids[$2]++}
END { for (id in ids) print Inits[id] - Discounts[id] " " id }
' file1.csv file2.csv
For the memory issue (this could be done as one series of pipes, but I prefer a temporary file):
awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
| sort -k2 \
> file.tmp
awk 'Last != $2 {
    if (NR != 1) print Result " " Last
    Last = $2; Result = $1
    next
}
{ Result += $1 }
END { print Result " " Last }
' file.tmp
rm file.tmp
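For reference, the same logic as a single pipeline, without the temporary file (a sketch of the parenthetical above):
awk 'FNR==NR{print;next}{print -1*$1 " " $2}' file1 file2 |
sort -k2 |
awk 'Last != $2 { if (NR != 1) print Result " " Last; Last=$2; Result=$1; next }
     { Result += $1 }
     END { print Result " " Last }'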

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide each by a constant, then subtract one from the other, and output a new third file with the resulting values?
I want the output file to have the form
#outputheader
0 123/c-422/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself. Does anyone know how to do it, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
Output:
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0
(That last "etc. 0" line comes from the literal "etc." rows in the sample files.)

Shell script to find common values and write in particular pattern with subtraction math to range pattern

Shell script to get common values from two files and write them in a pattern to a new file, AND also have the first value of each range pattern reduced by 1.
$ cat file1
2
3
4
6
7
8
10
12
13
16
20
21
22
23
27
30
$ cat file2
2
3
4
8
10
12
13
16
20
21
22
23
27
Script that works:
awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 | sort | awk 'NR==1 {s=l=$1; next} $1!=l+1 {if(l == s) print l; else print s ":" l; s=$1} {l=$1} END {if(l == s) print l; else print s ":" l; s=$1}'
Script output:
2:4
8
10
12:13
16
20:23
27
Desired output:
1:4
8
10
11:13
16
19:23
27
Similar to sputnick's, except using comm to find the intersection of the file contents.
comm -12 <(sort file1) <(sort file2) |
sort -n |
awk '
function print_range() {
if (start != prev)
printf "%d:", start-1
print prev
}
FNR==1 {start=prev=$1; next}
$1 > prev+1 {print_range(); start=$1}
{prev=$1}
END {print_range()}
'
1:4
8
10
11:13
16
19:23
27
Try doing this:
awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 |
sort |
awk 'NR==1 {s=l=$1; next}
     $1!=l+1 {if (l == s) print l; else print s-1 ":" l; s=$1}
     {l=$1}
     END {if (l == s) print l; else print s-1 ":" l}'
