Sum values based on word in line in bash [duplicate] - bash

This question already has answers here:
Calculate sum of column using reference of other column in awk
(2 answers)
Closed 4 years ago.
I have a script which outputs a number followed by a space and then a letter, e.g.:
2685 R
2435 M
984 D
924 A
353 R
291 A
593 D
577 A
476 R
769 M
629 R
179 D
Is there an easy way to sum the numbers based on the letter/word after the number? I would like this to be the output:
1792 A
1756 D
3204 M
4143 R
I have tried sort, awk and sed combinations but can't find a nice solution to sum.
Thanks.

Following awk may help you here.
awk '{a[$2]+=$1} END{for(i in a){print a[i],i}}' Input_file
In case you want to sort it as per 2nd field then append | sort -k2 to above code too.

Related

Handling ties when ranking in bash

Let's say I have a list of numbers that are already sorted as below
100
222
343
423
423
500
What I want is to create a rank field such that same values are assigned the same rank
100 1
222 2
343 3
423 4
423 4
500 5
I have been using the following piece of code to mimic a rank field
awk '{print $0, NR}' file
That gives me below, but it's technically a rownumber.
100 1
222 2
343 3
423 4
423 5
500 6
How do I go about this? I am an absolute beginner in bash so would really appreciate if you could add a little explanation for learning sake.
That's a job for awk:
$ awk '{if($0!=p)++r;print $0,r;p=$0}' file
Output:
100 1
222 2
343 3
423 4
423 4
500 5
Explained:
$ awk '{ # using awk
if($0!=p) # if the value does not equal the previous value
++r # increase the rank
print $0,r # output value and rank
p=$0 # store value for next round
}' file
Could you please try following.
awk 'prev==$0{--count} {print $0,++count;prev=$1}' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk code from here.
prev==$0 ##Checking condition if variable prev is equal to current line then do following.
{
--count ##Subtract count variable with 1 here.
}
{
print $0,++count ##Printing current line and variable count with increasing value of it.
prev=$1 ##Setting value of prev to 1st field of current line.
}
' Input_file ##Mentioning Input_file name here.
another awk
$ awk '{print $1, a[$1]=a[$1]?a[$1]:++c}' file
100 1
222 2
343 3
423 4
423 4
500 5
where the file is not need to be sorted, for example after adding a new 423 at the end of the file
$ awk '{print $1, a[$1]=a[$1]?a[$1]:++c}' file
100 1
222 2
343 3
423 4
423 4
500 5
423 4
increment the rank counter a for new value observed, otherwise use the registered value for the key. since c is initialized to zero, pre-increment the value. This will use the same rank value for the same key regardless or the position.

How to do cumulative and consecutive sums for every column in a tab file (UNIX environment)

I have a tabulated file something like that
Q8VYA50 210 69 2 8 3
Q8VYA50 208 69 1 2 8 3
Q9C8G30 316 182 4 4 7
P335430 657 98 1 10 7
That I would like to do is to apply a cumulative sum from the 4rd column up to NF and print in every column the result of the sum for this column and the original value of previous columns if any. So that, the desired output would be
Q8VYA50 210 69 2 10 13
Q8VYA50 208 69 1 3 11 14
Q9C8G30 316 182 4 8 15
P335430 657 98 1 11 18
I have tried to do it through different ways by means of sum function inside an awk script including for-loop specifying the fields where must apply the cumulative sum. However, the result obtained is wrong.
Are there some way to do it correctly by Unix (Bash)? Thanks in advance!
This is one way I have tried to do #Inian
gawk 'BEGIN {FS=OFS="\t"} {
for (i=4;i<=NF;i++)
{
sum[i]+=$i; print $1,$2,$3,$i
}
}' "input_file"
Other way is to do for every column manually. $4,$5+$4,$6+$5+$4,$7+$6+$5+$4 and so on, but I think is a "seedy" method.
Following awk may help you here.
awk '{for(i=5;i<=NF;i++){$i+=$(i-1)}} 1' OFS="\t" Input_file

getting the sum of the out put in unix [duplicate]

This question already has answers here:
Summing values of a column using awk command
(2 answers)
Closed 5 years ago.
I am trying to get the sum of my output in bash shell using only awk. One of the problems I am getting is that I only need to use awk in this.
This is the code I am using for getting the output:
awk '{print substr($7, 9, 4)}' emp.txt
This is the output I am getting: (output omitted)
7606
6498
7947
4044
1657
3872
4834
8463
9280
2789
9104
this is how I am trying to do the sum of the numbers: awk '(s = s + substr($7, 9, 4)) {print s}' emp.txt
The problem is that it is not giving me the right output (which should be 9942686) but instead giving me the series sum (as shown below).
(output omitted)
9890696
9898643
9902687
9904344
9908216
9913050
9921513
9930793
9933582
9942686
Am I using the code the wrong way? Or is there any other method of doing it with awk and I am doing it the wrong way?
Here is the sample file I am working on:
Brynlee Watkins F 55 Married 2016 778-555-6498 62861
Malcolm Curry M 24 Married 2016 604-555-7947 54647
Aylin Blake F 45 Married 2015 236-555-4044 80817
Mckinley Hodges F 50 Married 2015 604-555-1657 46316
Rylan Dorsey F 51 Married 2017 778-555-3872 77160
Taylor Clarke M 23 Married 2015 604-555-4834 46624
Vivaan Hooper M 26 Married 2016 778-555-8463 80010
Gibson Rowland M 42 Married 2017 236-555-9280 59874
Alyson Mahoney F 51 Single 2017 778-555-2789 71394
Catalina Frazier F 53 Married 2016 604-555-9104 79364
EDIT: I want to get the sum of the numbers that are repeating in the output. Let's say the repeating numbers are 4826 and 0028 in the output and both of them repeated 2 times. I only want the sum of these numbers (each repetition must be counted as the individual. hence these are counted as 4). So the desired output for these 4 numbers shall be 9708
Will Duffy M 33 Single 2017 236-555-4826 47394
Nolan Reed M 27 Single 2015 604-555-0028 46622
Anya Horn F 54 Married 2017 236-555-4826 73270
Cynthia Davenport F 29 Married 2015 778-555-0028 59687
Oscar Medina M 43 Married 2016 778-555-7864 73688
Angelina Herrera F 37 Married 2017 604-555-7910 82061
Peyton Reyes F 35 Married 2017 236-555-8046 51920
END { print s }
Since you only need the total sum printed once, do it under the END pattern.
awk '{s = s + substr($7, 9, 4)} END {print s}' emp.txt
Could you please try following awk and let me know if this helps you. It will look always look for last digits after -:
awk -F' |-' '{sum+=$(NF-1)} END{print sum}' Input_file
EDIT:
awk -F' |-' '
{
++a[$(NF-1)];
b[$(NF-1)]=b[$(NF-1)]?b[$(NF-1)]+$(NF-1):$(NF-1)
}
END{
for(i in a){
if(a[i]>1){
print i,b[i]}
}}
' Input_file
Output will be as follows:
4826 9652
0028 56

awk Count number of occurrences

I made this awk command in a shell script to count total occurrences of the $4 and $5.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat ta.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is #### (number) in shell. But I want to get rid of > ag.txt && cat ag.txt | wc -l and instead get output in shell like AG = ####.
This is input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this in the shell or in file for a single occurrences not other patterns.
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you might want to confirm whether you really need -F" ". By default, awk splits on whitespace. This option would only be required if your fields contain embedded tabs, I think.
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line, and increments a counter that is the index on an array (a[]) whose key is build from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
This only validates things that already have array indices, which are NULL per BEGIN.
The parentheses in the increment condition are not required, and are included only for clarity.
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide it by a value then minus one from the other and then output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-422/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself, does anyone know how to do this or could explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0

Resources