How to sum values grouped by a field in a file using a shell script - bash

I have a file containing each student's roll number, name, subject, obtained marks and total marks:
10 William English 80 100
10 William Math 50 100
10 William IT 60 100
11 John English 90 100
11 John Math 75 100
11 John IT 85 100
How can I get the grouped sum (total obtained marks) for every student in the shell? I want this output:
William 190
John 250
I have tried this:
cat student.txt | awk '{sum += $14}END{print sum" "$1}' | sort | uniq -c | sort -nr | head -n 10
This does not work like a group-by sum.

With one awk command:
awk '{a[$2]+=$4} END {for (i in a) print i,a[i]}' file
Output
William 190
John 250
If you want to sort the output, you can pipe to sort, e.g. descending by numerical second field:
awk '{a[$2]+=$4} END {for (i in a) print i,a[i]}' file | sort -rnk2
or ascending by student name:
awk '{a[$2]+=$4} END {for (i in a) print i,a[i]}' file | sort
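Because for (i in a) visits keys in no guaranteed order, the names may not come out in the order they appear in the file. A minimal sketch that remembers first-seen order instead, assuming the same five-column layout (sum_by_name, order and n are names made up for this example):

```shell
# Hypothetical helper: sum column 4 per name (column 2), keeping first-seen order.
sum_by_name() {
  awk '
    !($2 in sum) { order[++n] = $2 }   # record each new name once, in input order
    { sum[$2] += $4 }                  # accumulate the obtained marks
    END { for (i = 1; i <= n; i++) print order[i], sum[order[i]] }
  ' "$@"
}

# Usage: reads the named files, or standard input when none are given.
printf '%s\n' '10 William English 80 100' '10 William Math 50 100' '11 John IT 90 100' | sum_by_name
```

This avoids the extra sort when the original order is what you want, at the cost of one more small array.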

You need to use an associative array in awk.
Try
awk '{ a[$2]=a[$2]+$4 } END {for (i in a) print i, a[i]}' file
a[$2]=a[$2]+$4 <-- Create an associative array with $2 as the index and the running sum of $4 as the value
END <-- Runs after all records have been processed
for (i in a) print i, a[i] <-- Print each index and its value
Demo:
$ awk '{ a[$2]=a[$2]+$4 } END {for (i in a) print i, a[i]}' temp.txt
William 190
John 250
$ cat temp.txt
10 William English 80 100
10 William Math 50 100
10 William IT 60 100
11 John English 90 100
11 John Math 75 100
11 John IT 85 100
$
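Grouping on the name alone merges two students who happen to share a name. A sketch that groups on roll number and name together, assuming the roll number is unique per student (sum_by_student and the sample rows are invented for illustration):

```shell
# Hypothetical helper: total of column 4 per (roll number, name) pair.
sum_by_student() {
  awk '
    { key = $1 " " $2; total[key] += $4 }          # composite key: roll number + name
    END { for (k in total) print k, total[k] }     # unordered, so sort afterwards
  ' "$@" | sort -k1,1n
}

# Usage: two different students named William stay separate.
printf '%s\n' '10 William English 80 100' '12 William Math 60 100' '10 William Math 50 100' | sum_by_student
```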

Related

How to get the number of different lines in bash [duplicate]

I have a command (cmd1) that greps through a log file to filter out a set of numbers. The numbers are
in random order, so I use sort -gr to get a reverse sorted list of numbers. There may be duplicates within
this sorted list. I need to find the count for each unique number in that list.
For e.g. if the output of cmd1 is:
100
100
100
99
99
26
25
24
24
I need another command that I can pipe the above output to, so that, I get:
100 3
99 2
26 1
25 1
24 2
How about:
$ echo "100 100 100 99 99 26 25 24 24" \
| tr " " "\n" \
| sort \
| uniq -c \
| sort -k2nr \
| awk '{printf("%s\t%s\n",$2,$1)}'
The result is:
100 3
99 2
26 1
25 1
24 2
uniq -c works for GNU uniq 8.23 at least, and does exactly what you want (assuming sorted input).
If order is not important:
# echo "100 100 100 99 99 26 25 24 24" | awk '{for(i=1;i<=NF;i++)a[$i]++}END{for(o in a) printf "%s %s ",o,a[o]}'
26 1 100 3 99 2 24 2 25 1
Numerically sort the numbers in reverse, then count the duplicates, then swap the left and the right words. Align into columns.
printf '%d\n' 100 99 26 25 100 24 100 24 99 \
| sort -nr | uniq -c | awk '{printf "%-8s%s\n", $2, $1}'
100 3
99 2
26 1
25 1
24 2
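Since the input is already sorted, equal values are adjacent, so the counting can also be done in a single streaming pass without uniq or an array. A rough sketch of that idea (count_runs, prev and n are illustrative names, not from the answers above):

```shell
# Hypothetical helper: count runs of identical consecutive lines.
count_runs() {
  awk '
    NR > 1 && $0 != prev { print prev, n; n = 0 }   # value changed: emit the finished run
    { prev = $0; n++ }                              # extend the current run
    END { if (NR) print prev, n }                   # emit the last run, if any input was read
  ' "$@"
}

# Usage: input must already be sorted (or at least grouped).
printf '%s\n' 100 100 100 99 99 26 25 24 24 | count_runs
```

The output keeps the order of the input, so a reverse-sorted list stays reverse-sorted.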
In Bash, we can use an associative array to count instances of each input value. Assuming we have the command $cmd1, e.g.
#!/bin/bash
cmd1='printf %d\n 100 99 26 25 100 24 100 24 99'
declare -A a    # associative array: value -> count
Then we can count values in the array variable a using the ++ mathematical operator on the relevant array entries:
while read i
do
((++a["$i"]))
done < <($cmd1)
We can print the resulting values:
for i in "${!a[@]}"
do
echo "$i ${a[$i]}"
done
If the order of output is important, we might need an external sort of the keys:
for i in $(printf '%s\n' "${!a[@]}" | sort -nr)
do
echo "$i ${a[$i]}"
done
In case you have input stored in my_file you can do:
sort -nr my_file | uniq -c | awk ' { t = $1; $1 = $2; $2 = t; print; } '
Otherwise just pipe the input to be processed to the same cmd.
Explanation:
sort -nr sorts the input numerically (-n) in reverse order (-r)
uniq -c counts duplicates and shows the count next to each value
awk '{ t = $1; $1 = $2; $2 = t; print; }' swaps the two columns

A UNIX Command to Find the Name of the Student who has the Second Highest Score

I am new to Unix programming. Could you please help me solve this problem?
For example, if the input file has the content below
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
The output will be
ABC
I tried something like this
sort -k3,3 -rn -t" " | head -n2 | awk '{print $2}'
Using awk (asorti is a gawk extension):
awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}'
Demo:
$ cat file.txt
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
$ awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}' file.txt
ABC
$
Explanation:
NR>1 <-- Skip the first record (the header)
{arr[$3]=$2} <-- Create an associative array with the score as index and the name as value
END <-- Runs after the whole file has been read
n=asorti(arr,arr_sorted) <-- Sort the indices of arr (i.e. the scores) into arr_sorted; n is the number of elements
print arr[arr_sorted[n-1]] <-- arr_sorted[n-1] is the second-highest score; print the corresponding name from arr
Note that asorti compares indices as strings by default, so this relies on all scores having the same number of digits, and students with equal scores overwrite each other in arr.
Your attempt is almost correct; it needs just one change.
Try this:
sort -k3,3 -rn -t" " | head -n2 | tail -n1 | awk '{print $2}'
head -n2 keeps the two highest rows, so add tail -n1 to keep only the second one.
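For comparison, here is a sketch that finds the runner-up in one pass with plain awk, without sorting and without gawk extensions. It assumes a one-line header and numeric scores in column 3; a student who ties with the top score is reported as second. The names second_highest, best and second are made up for this example.

```shell
# Hypothetical helper: print the name with the second-highest score (column 3).
second_highest() {
  awk '
    NR == 1 { next }                              # skip the header line
    seen == "" || $3 + 0 > best {                 # first data row, or a new top score
      second = best; second_name = best_name      # the old top becomes the runner-up
      best = $3 + 0; best_name = $2; seen = 1
      next
    }
    second_name == "" || $3 + 0 > second {        # better than the current runner-up
      second = $3 + 0; second_name = $2
    }
    END { if (second_name != "") print second_name }
  ' "$@"
}

# Usage with the sample data from the question.
printf '%s\n' 'RollNo Name Score' '234 ABC 70' '567 QWE 12' '457 RTE 56' '234 XYZ 80' '456 ERT 45' | second_highest
```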

Print names alphabetically and how many appearances for each name

I have a file that includes names, one on each line. I want to print the names alphabetically, but (and here is where it gets confusing at least for me) next to each name I must print the number of appearances of that name with exactly one space between the name and the number of appearances.
For example if the file includes these names:
Barry
Don
John
Sam
Harry
Don
Don
Sam
it must print
Barry 1
Don 3
Harry 1
John 1
Sam 2
Any ideas?
sort | uniq -c will get you very close, just with the columns reversed.
$ sort file | uniq -c
1 Barry
3 Don
1 Harry
1 John
2 Sam
If you really need them in the prescribed order you could swap them with awk.
$ sort test.txt | uniq -c | awk '{print $2, $1}'
Barry 1
Don 3
Harry 1
John 1
Sam 2
With awk:
% awk '{
a[$1]++
}
END{
for (i in a) {
print i, a[i]
}
}' file
Output:
Barry 1
Harry 1
Don 3
John 1
Sam 2
Given:
$ cat file
Barry
Don
John
Sam
Harry
Don
Don
Sam
You can do:
$ awk '{a[$1]++} END { for (e in a) print e, a[e] }' file | sort
Barry 1
Don 3
Harry 1
John 1
Sam 2

awk Count number of occurrences

I made this awk command in a shell script to count total occurrences of the $4 and $5.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat tc.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is #### (number) in shell. But I want to get rid of > ag.txt && cat ag.txt | wc -l and instead get output in shell like AG = ####.
This is input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this, in the shell or in a file, counting only these single-letter pairs and not other patterns:
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you can drop -F" ". A single space is already awk's default field separator, and it makes awk split on any run of spaces and tabs.
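A quick way to check how a given separator behaves is to count fields on a line that mixes spaces and a tab. This is only an illustrative sketch; the sample line and variable names are invented:

```shell
# One line with a mix of spaces and a tab between three words.
line=$(printf 'a \t b   c')

default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')        # default separator
space_nf=$(printf '%s\n' "$line" | awk -F" " '{ print NF }')    # explicit single space, same as the default
tab_nf=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')     # split on tabs only

echo "default=$default_nf space=$space_nf tab=$tab_nf"
```

The first two counts agree, while splitting on tabs alone yields fewer fields because the spaces stay inside them.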
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line and increments a counter stored in an array (a[]) whose key is built from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
This counts only pairs that already exist as array indices, which BEGIN creates with empty values.
The parentheses in the increment condition are not required, and are included only for clarity.
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
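The same count-everything-then-select approach can be wrapped in a small shell function that takes the wanted pairs as an argument and prints them in the AG = #### style asked for in the question. This is a sketch under the assumption that the pairs are passed as one space-separated string; count_pairs and the awk variable pairs are hypothetical names.

```shell
# Hypothetical helper: count_pairs "AG CT ..." [file...]
count_pairs() {
  wanted=$1
  shift
  awk -v pairs="$wanted" '
    { cnt[$4 $5]++ }                           # count every $4$5 combination
    END {
      n = split(pairs, want, " ")              # keep the order given on the command line
      for (i = 1; i <= n; i++)
        printf "%s = %d\n", want[i], cnt[want[i]]   # pairs never seen print as 0
    }
  ' "$@"
}

# Usage with two lines in the question's format.
printf '%s\n' '>seq1 284 284 A G 27 100 16 11 16 11' '>seq1 185 185 T AAA 24 100 10 14 10 14' | count_pairs "AG TA"
```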
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1
