Bash - how to print the number of times a value in a column occurs at the end of the row - bash

I have a tab delimited file...
123 1:2334523 yes
127 1:332443 yes
113 1:332443 no
115 1:55434 no
115 1:55434 no
115 1:55434 yes
I would like to count the number of times the value in column 2 appears in column 2 and then print it to the the end of the row like...
123 1:2334523 yes 1
127 1:332443 yes 2
113 1:332443 no 2
115 1:55434 no 3
115 1:55434 no 3
115 1:55434 yes 3
So in column 2 1:332443 appears twice and 1:55434 appears 3 times.
I assume this should be relatively easy in Awk or sed but have not managed to figure it out.

You could do it like this:
awk 'NR == FNR { ++ctr[$2]; next } { print $0 "\t" ctr[$2]; }' filename filename
Because we need to know the counters before printing, we need two passes over the file, that's why filename is mentioned twice. The awk code then is:
NR == FNR { # if the record number is the same as the record number in the
# current file (that is: in the first pass)
++ctr[$2] # count how often field 2 showed up
next # don't do anything else for the first pass
}
{ # then in the second pass:
print $0 "\t" ctr[$2]; # print the line, a tab, and the counter.
}

Here is an awk that read the file just once:
awk '{a[NR]=$0;b[$2]++;c[NR]=$2} END {for (i=1;i<=NR;i++) print a[i]"\t"b[c[i]]}' file
123 1:2334523 yes 1
127 1:332443 yes 2
113 1:332443 no 2
115 1:55434 no 3
115 1:55434 no 3
115 1:55434 yes 3
This store all the data inn to an array a. Count the number of fields 2 in array b, then store index inn array c.
At then end, do print out the arrays.

Related

Insert rows using awk

How can I insert a row using awk?
My file looks as:
1 43
2 34
3 65
4 75
I would like to insert three rows with "?" So my desire file looks as:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I am trying with the below script.
awk '{if(NR<=3){print "NR ?"}} {printf" " NR $2}' file.txt
Here's one way to do it:
$ awk 'BEGIN{s=" "; for(c=1; c<4; c++) print c s "?"}
{print c s $2; c++}' ip.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
$ awk 'BEGIN {printf "1 ?\n2 ?\n3 ?\n"} {printf "%d", $1 + 3; printf " %s\n", $2}' file.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
You could also add the 3 lines before awk, e.g.:
{ seq 3; cat file.txt; } | awk 'NR <= 3 { $2 = "?" } $1 = NR' OFS='\t'
Output:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I would do it following way using GNU AWK, let file.txt content be
1 43
2 34
3 65
4 75
then
awk 'BEGIN{OFS=" "}NR==1{print 1,"?";print 2,"?";print 3,"?"}{print NR+3,$2}' file.txt
output
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
Explanation: I set output field separator (OFS) to 7 spaces. For 1st row I do print three lines which consisting of subsequent number and ? sheared by output field separator. You might elect to do this using for loop, especially if you expect that requirement might change here. For every line I print number of row plus 4 (to keep order) and 2nd column ($2). Thanks to use of OFS, you would need to make only one change if requirement regarding number of spaces will be altered. Note that construct like
{if(condition){dosomething}}
might be written in GNU AWK in more concise manner as
(condition){dosomething}
(tested in gawk 4.2.1)

Handling ties when ranking in bash

Let's say I have a list of numbers that are already sorted as below
100
222
343
423
423
500
What I want is to create a rank field such that same values are assigned the same rank
100 1
222 2
343 3
423 4
423 4
500 5
I have been using the following piece of code to mimic a rank field
awk '{print $0, NR}' file
That gives me below, but it's technically a rownumber.
100 1
222 2
343 3
423 4
423 5
500 6
How do I go about this? I am an absolute beginner in bash so would really appreciate if you could add a little explanation for learning sake.
That's a job for awk:
$ awk '{if($0!=p)++r;print $0,r;p=$0}' file
Output:
100 1
222 2
343 3
423 4
423 4
500 5
Explained:
$ awk '{ # using awk
if($0!=p) # if the value does not equal the previous value
++r # increase the rank
print $0,r # output value and rank
p=$0 # store value for next round
}' file
Could you please try following.
awk 'prev==$0{--count} {print $0,++count;prev=$1}' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk code from here.
prev==$0 ##Checking condition if variable prev is equal to current line then do following.
{
--count ##Subtract count variable with 1 here.
}
{
print $0,++count ##Printing current line and variable count with increasing value of it.
prev=$1 ##Setting value of prev to 1st field of current line.
}
' Input_file ##Mentioning Input_file name here.
another awk
$ awk '{print $1, a[$1]=a[$1]?a[$1]:++c}' file
100 1
222 2
343 3
423 4
423 4
500 5
where the file is not need to be sorted, for example after adding a new 423 at the end of the file
$ awk '{print $1, a[$1]=a[$1]?a[$1]:++c}' file
100 1
222 2
343 3
423 4
423 4
500 5
423 4
increment the rank counter a for new value observed, otherwise use the registered value for the key. since c is initialized to zero, pre-increment the value. This will use the same rank value for the same key regardless or the position.

awk Count number of occurrences

I made this awk command in a shell script to count total occurrences of the $4 and $5.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat ta.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is #### (number) in shell. But I want to get rid of > ag.txt && cat ag.txt | wc -l and instead get output in shell like AG = ####.
This is input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this in the shell or in file for a single occurrences not other patterns.
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you might want to confirm whether you really need -F" ". By default, awk splits on whitespace. This option would only be required if your fields contain embedded tabs, I think.
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line, and increments a counter that is the index on an array (a[]) whose key is build from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
This only validates things that already have array indices, which are NULL per BEGIN.
The parentheses in the increment condition are not required, and are included only for clarity.
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide it by a value then minus one from the other and then output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-422/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself, does anyone know how to do this or could explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0

search for a string , and add if it matches

I have a file that has 2 columns as given below....
101 6
102 23
103 45
109 36
101 42
108 21
102 24
109 67
and so on......
I want to write a script that adds the values from 2nd column if their corresponding first column matches
for example add all 2nd column values if it's 1st column is 101
add all 2nd column values if it's 1st colummn is 102
add all 2nd column values if it's 1st colummn is 103 and so on ...
i wrote my script like this , but i'm not getting the correct result
awk '{print $1}' data.txt > col1.txt
while read line
do
awk ' if [$1 == $line] sum+=$2; END {print "Sum for time stamp", $line"=", sum}; sum=0' data.txt
done < col1.txt
awk '{array[$1]+=$2} END { for (i in array) {print "Sum for time stamp",i,"=", array[i]}}' data.txt
Pure Bash :
declare -a sum
while read -a line ; do
(( sum[${line[0]}] += line[1] ))
done < "$infile"
for index in ${!sum[#]}; do
echo -e "$index ${sum[$index]}"
done
The output:
101 48
102 47
103 45
108 21
109 103

Resources