Use of an awk filter to get student record details in descending order of total score - shell

Student details are stored in a file as follows:
Roll_no,name,socre1,score2
101,ABC,50,55
102,XYZ,48,54
103,CWE,42,34
104,ZSE,65,72
105,FGR,31,45
106,QWE,68,45
Q. Write the Unix command to display the Roll_no and name of the students whose total score is greater than 100. The student details are to be displayed sorted in descending order of total score.
The total score is to be calculated as follows:
totalscore = score1 + score2
The file also contains the header (Roll_no,name,socre1,score2).
My solution:
awk 'BEGIN {FS=",";OFS=" "} {if(NR>1){if($3+$4>100){s[$1]=$2}}} END{for (i in s) {print i,h[i]}}' stu.txt| sort -rk 2n
I cannot figure out how to sort according to the total score.
Please help!
Output:
104 ZSE
106 QWE
101 ABC
102 XYZ

Could you please try the following. To keep the calculation simple: first get the total for all lines where it is greater than 100, then sort in reverse order by the total as the OP wants, and finally print only the first 2 columns with cut.
awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file |
sort -t, -nr -k3 |
cut -d',' -f 1-2
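With the sample data above, this pipeline should produce:
104,ZSE
106,QWE
101,ABC
102,XYZ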
Or, in case you want space-delimited output, try the following.
awk 'BEGIN{FS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file |
sort -nr -k3 |
cut -d' ' -f 1-2
Explanation: a detailed explanation of the above.
awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file |   ##Start the awk program, setting FS and OFS to comma. When the sum of the 3rd and 4th columns is greater than 100, print the 1st and 2nd fields along with that sum. The output is passed as input to the next command.
sort -t, -nr -k3 |   ##Sort the output, setting the delimiter to comma and sorting in reverse numeric order on the 3rd column, then send the output as input to the next command.
cut -d',' -f 1-2   ##Take the first 2 fields, with comma as the delimiter, to get the roll number and name.
OR
sort -t, -nr -k3 < <(awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file) |
cut -d',' -f 1-2
Or, in case you need the output space-delimited, try the following.
sort -nr -k3 < <(awk 'BEGIN{FS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file) |
cut -d' ' -f 1-2

$ awk 'BEGIN {OFS=FS=","}
NR==1 {print $0, "total"; next}
{if(($5=$3+$4)>100) print | "sort -t, -k5nr"}' file
Roll_no,name,socre1,score2,total
104,ZSE,65,72,137
106,QWE,68,45,113
101,ABC,50,55,105
102,XYZ,48,54,102
without header and individual scores
$ awk 'BEGIN{OFS=FS=","}
NR>1 && ($3+=$4)>100{print $1,$2,$3}' file | sort -t, -k3nr
104,ZSE,137
106,QWE,113
101,ABC,105
102,XYZ,102
or
$ awk 'BEGIN{OFS=FS=","}
NR>1 && ($3+=$4)>100 && NF--' file | sort -t, -k3nr
104,ZSE,137
106,QWE,113
101,ABC,105
102,XYZ,102
without the final score and not comma delimited
$ awk -F, 'NR>1 && ($3+=$4)>100 && NF--' file | sort -k3nr | cut -d' ' -f1,2
104 ZSE
106 QWE
101 ABC
102 XYZ
This reads as written:
if the line number is greater than one (skip the header), AND
if field 3 + field 4 > 100 (the sum is assigned back to field 3), then
decrement the field count so that the last field (score2) won't be printed; since OFS is left at its default, rebuilding the record when NF changes makes the output space-delimited.
Then sort the results on the third field (the total),
and cut off that total so only roll number and name remain.

you were close:
awk 'BEGIN {FS=OFS=","} {if(NR>1){if($3+$4>100){s[$1]=$2}}} END{for (i in s) {print i,s[i]}}' stu.txt| sort -rk 2n

Related

Cut and sort delimited dates from stdout via pipe

I am trying to split some strings from stdout to get the dates out of them, but I have two cases:
full.20201004T033103Z.vol93.difftar.gz
full.20201007T033103Z.vol94.difftar.gz
Which should produce 20201007T033103Z, the date nearest to now (the newest).
Or:
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200929T033103Z.to.20200908T033103Z.vol10.difftar.gz
This should take the second date (after .to.), not the first one, and print only the newest date: 20200908T033103Z
What I tried:
cat dates_file | awk -F '.to.' 'NF > 1 {print $2}' | cut -d\. -f1 | sort -r -t- -k3.1,3.4 -k2,2 | head -1
This only works for the second case and does not cover the first; also, I am not sure about the date-sorting logic.
Here is some sample data:
full.20201004T033103Z.vol93.difftar.gz
full.20201004T033103Z.vol94.difftar.gz
full.20201004T033103Z.vol95.difftar.gz
full.20201004T033103Z.vol96.difftar.gz
full.20201004T033103Z.vol97.difftar.gz
full.20201004T033103Z.vol98.difftar.gz
full.20201004T033103Z.vol99.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.manifest
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol10.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol11.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol12.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol13.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol14.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol15.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol16.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol17.difftar.gz
To get the most recent date from your sample data you can use this awk:
awk '{
  sub(/^(.*\.to|[^.]+)\./, "")
  gsub(/\..+$|[TZ]/, "")
}
$0 > max {
  max = $0
}
END {
  print max
}' file
20201004033103

AWK : To print data of a file in sorted order of result obtained from columns

I have an input file that looks somewhat like this:
PlayerId,Name,Score1,Score2
1,A,40,20
2,B,30,10
3,C,25,28
I want to write an awk command that checks for players with a sum of scores greater than 50 and outputs the PlayerId and PlayerName in sorted order of their total score.
When I try the following:
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k5
It does not work and seemingly sorts them on the basis of their ids.
1 A
3 C
Whereas the correct output I'm expecting is (since player A has a total score of 60, and C has a total of 53, and we want the output sorted in ascending order):
3 C
1 A
In addition to this, what confuses me a bit is that when I try to sort on the basis of score1, i.e. column 3, but intend to print only the corresponding ids and names, it doesn't work either.
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k3
And it outputs:
1 A
3 C
But if $3, the field the data is being sorted on, is included in the print,
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50)print $1,$2,$3}' | sort -k3
It produces the correct ordering (but includes the unwanted score1 field in the display):
3 C 25
1 A 40
But what if one wants to only print the id and name fields ?
Actually I'm new to awk commands, and probably I'm not using the sort command correctly. It would be really helpful if someone could explain.
I think this is what you're trying to do:
$ awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' file |
sort -k1,1n | cut -d' ' -f2-
3 C
1 A
You have to print the sum so you can sort by it and then the cut removes it.
If you wanted the header output too then it'd be the following. The extra (NR>1) flag printed first is 0 for the header line and 1 for the data lines, so the numeric sort keeps the header on top, and the cut strips both the flag and the sum:
$ awk 'BEGIN{FS=","} {sum=$3+$4} (NR==1) || (sum>50){print (NR>1),sum,$1,$2}' file |
sort -k1,2n | cut -d' ' -f3-
PlayerId Name
3 C
1 A
If you outsource the sorting, you need to carry the auxiliary values along and cut them out later; some of the complication is due to preserving the header.
$ awk -F, 'NR==1 {print s "\t" $1 FS $2; next}
(s=$3+$4)>50 {print s "\t" $1 FS $2 | "sort -n" }' file | cut -f2
PlayerId,Name
3,C
1,A

Unix: Get the latest entry from the file

I have a file containing names and dates. I want to keep only the entry with the latest date for each name. How do I do it?
for example:
>cat user.txt
"a","03-May-13
"b","13-May-13
"a","13-Aug-13
"a","13-May-13
I am using the command sort -u user.txt. It gives the following output:
"a","11-May-13
"a","13-Aug-13
"a","13-May-13
"b","13-May-13
but I want the following output.
"a","13-Aug-13
"b","13-May-13
Can someone help?
Thanks.
Try this:
sort -t, -k2 user.txt | awk -F, '{a[$1]=$2}END{for(e in a){print e, a[e]}}' OFS=","
Explanation:
Sort the entries by the date field in ascending order and pipe the sorted result to awk, which simply uses the first field as a key; only the last of the entries with an identical key is kept and finally output.
EDIT
Okay, so the entries can't be sorted lexicographically. The date needs to be converted to a timestamp so it can be compared numerically; use the following:
awk -F",\"" '{ cmd=" date --date " $2 " +%s "; cmd | getline ts; close(cmd); print ts, $0, $2}' user.txt | sort -k1 | awk -F"[, ]" '{a[$2]=$3}END{for(e in a){print e, a[e]}}' OFS=","
If you are using macOS, use gdate instead:
awk -F",\"" '{ cmd=" gdate --date " $2 " +%s "; cmd | getline ts; close(cmd); print ts, $0, $2}' user.txt | sort -k1 | awk -F"[, ]" '{a[$2]=$3}END{for(e in a){print e, a[e]}}' OFS=","
I think you need to sort by year, month and day.
Can you try this?
awk -F"\"" '{print $2"-"$4}' data.txt | sort -t- -k4 -k3M -k2 | awk -F- '{kv[$1]=$2"-"$3"-"$4}END{for(k in kv){print k,kv[k]}}'
For me this does the job. I am sorting on the month and then applying the logic that @neevek used. So far I have been unable to find a case where this fails, but I am not sure if it is a foolproof solution.
sort -t- -k2 -M user1.txt | awk -F, '{a[$1]=$2}END{for(e in a){print e, a[e]}}' OFS=","
Can someone tell me if this solution has any issues?
How about this?
grep `cut -d'"' -f4 user.txt | sort -t- -k 3 -k 2M -k 1n | tail -1` user.txt
Explanation: extract the dates (the fourth field when cutting with a double-quote delimiter), sort them by year, month and day as you have done, take the latest with tail -1, and then grep the original file for that date.
Edit: fixed to sort by month.

Cut | Sort | Uniq -d -c | but?

The given file is in the format below:
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
I need to find the duplicates and count them (duplicates being categorized by fields 1, 2, 5 and 14). Then I need to insert the first occurrence of each duplicate, with all of its fields, into a database, tagging the duplicate count in another column. For this I cut the 4 mentioned fields, sort, and find the duplicates using uniq -d, with -c for the counts. Now, coming back after sorting out the duplicates and their counts, I need the output in the form below.
3,GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
where 3 is the number of repeated duplicates for fields 1, 2, 5 and 14, and the rest of the fields can come from any of the duplicate rows.
This way the duplicates should be removed from the original file and shown in the above format, while the remaining lines in the original file, being unique, go through as they are.
What I have done is..
awk '{printf("%5d,%s\n", NR,$0)}' renewstatus_2012-04-19.txt > n_renewstatus_2012-04-19.txt
cut -d',' -f2,3,6,15 n_renewstatus_2012-04-19.txt |sort | uniq -d -c
but this needs to point back to the original file to get the lines for the duplicate occurrences.
Let me not confuse things; this needs a different point of view, and my brain is clinging to my approach. I need a cigar.
Any thoughts?
sort has an option -k
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
uniq has an option -f
-f, --skip-fields=N
avoid comparing the first N fields
so sort and uniq with field numbers (work out NUM yourself and test this command, please):
awk -F"," '{print $0,$1,$2,...}' file.txt | sort -k NUM,NUM2 | uniq -f NUM3 -c
Using awk's associative arrays is a handy way to find unique/duplicate rows:
awk '
BEGIN {FS = OFS = ","}
{
    key = $1 FS $2 FS $5 FS $14
    if (key in count)
        count[key]++
    else {
        count[key] = 1
        line[key] = $0
    }
}
END {for (key in count) print count[key], line[key]}
' filename
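Run against the full file this should print each distinct f1,f2,f5,f14 combination once, prefixed with its number of occurrences, e.g. the 3,GLSH,... line the question asks for once that key appears three times in the input.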
SYNTAX :
awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14]=$0}{count[$1,$2,$5,$14]++}END{for(i in count){if(count[i] > 1)file="dupes";else file="uniq";print uniq[i],","count[i] > file}}' renewstatus_2012-04-19.txt
Calculation:
sym@localhost:~$ cut -f16 -d',' uniq | sort | uniq -d -c
124275 1 -----> SUM OF UNIQ ( 1 )ENTRIES
sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -d -c
3860 2
850 3
71 4
7 5
3 6
sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -u -c
1 7
10614 ------> SUM OF DUPLICATE ENTRIES MULTIPLIED WITH ITS COUNTS
sym@localhost:~$ wc -l renewstatus_2012-04-19.txt
134889 renewstatus_2012-04-19.txt ---> TOTAL LINE COUNTS OF THE ORIGINAL FILE, MATCHED EXACTLY WITH (124275+10614) = 134889

Remove all occurrences of a duplicate line

If I want to remove lines where certain fields are duplicated then I use sort -u -k n,n.
But this keeps one occurrence. If I want to remove all occurrences of the duplicate, is there any quick bash or awk way to do this?
E.g. I have:
1 apple 30
2 banana 21
3 apple 9
4 mango 2
I want:
2 banana 21
4 mango 2
I could presort and then use a hash in Perl, but for very large files this is going to be slow.
This will keep your output in the same order as your input:
awk '{seen[$2]++; a[++count]=$0; key[count]=$2} END {for (i=1;i<=count;i++) if (seen[key[i]] == 1) print a[i]}' inputfile
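With the example input above this should print, keeping the input order:
2 banana 21
4 mango 2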
Try sort -k <your fields> | awk '{print $3, $1, $2}' | uniq -f2 -u | awk '{print $2, $3, $1}' to remove all lines that are duplicated (without keeping any copies). If you don't need the last field, change that first awk command to just cut -f 1-5 -d ' ', change the -f2 in uniq to -f1, and remove the second awk command.
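For the example data above saved as file (with field 2 as the duplicated key), that would be something like:
sort -k2,2 file | awk '{print $3, $1, $2}' | uniq -f2 -u | awk '{print $2, $3, $1}'
2 banana 21
4 mango 2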
