Remove all occurrences of a duplicate line - bash

If I want to remove lines where certain fields are duplicated, I use sort -u -k n,n.
But this keeps one occurrence. If I want to remove all occurrences of a duplicate, is there a quick bash or awk way to do this?
Eg I have:
1 apple 30
2 banana 21
3 apple 9
4 mango 2
I want:
2 banana 21
4 mango 2
I will presort and then use a hash in Perl, but for very large files this is going to be slow.

This will keep your output in the same order as your input:
awk '{seen[$2]++; a[++count]=$0; key[count]=$2} END {for (i=1;i<=count;i++) if (seen[key[i]] == 1) print a[i]}' inputfile
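For example, run against the sample data from the question (a sketch; `data.txt` is a hypothetical file name):

```shell
# Build the sample input from the question (hypothetical file name)
printf '%s\n' '1 apple 30' '2 banana 21' '3 apple 9' '4 mango 2' > data.txt

# Buffer every line, count each key ($2), and in END print only the
# lines whose key was seen exactly once -- input order is preserved
result=$(awk '{seen[$2]++; a[++count]=$0; key[count]=$2}
              END {for (i=1;i<=count;i++) if (seen[key[i]] == 1) print a[i]}' data.txt)
echo "$result"
# 2 banana 21
# 4 mango 2
```

Note that this holds the whole file in memory; a two-pass variant like `awk 'NR==FNR{seen[$2]++; next} seen[$2]==1' data.txt data.txt` trades a second read of the file for much lower memory use.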

Try sort -k <your fields> | awk '{print $3, $1, $2}' | uniq -f2 -u | awk '{print $2, $3, $1}' to remove all lines that are duplicated (without keeping any copies). If you don't need the last field, change that first awk command to just cut -f 1-5 -d ' ', change the -f2 in uniq to -f1, and remove the second awk command.
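Applied to the sample data (key in field 2), the rotate-and-`uniq -u` pipeline might look like this sketch:

```shell
printf '%s\n' '1 apple 30' '2 banana 21' '3 apple 9' '4 mango 2' > data.txt

# Rotate the key field to the end, let uniq -f2 -u drop every line whose
# key repeats, then rotate the fields back into their original order
result=$(sort -k2,2 data.txt | awk '{print $3, $1, $2}' | uniq -f2 -u | awk '{print $2, $3, $1}')
echo "$result"
# 2 banana 21
# 4 mango 2
```

Unlike the awk answer above, the result comes out in key-sorted order rather than in the original input order.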


AWK : To print data of a file in sorted order of result obtained from columns

I have an input file that looks somewhat like this:
PlayerId,Name,Score1,Score2
1,A,40,20
2,B,30,10
3,C,25,28
I want to write an awk command that checks for players whose sum of scores is greater than 50 and outputs the PlayerId and PlayerName in sorted order of their total score.
When I try the following:
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k5
It does not work and seemingly sorts them on the basis of their ids.
1 A
3 C
Whereas the correct output I'm expecting is : ( since Player A has sum of scores=60, and C has sum of scores=53, and we want the output to be sorted in ascending order )
3 C
1 A
In addition to this, what confuses me a bit is that when I try to sort on the basis of score1, i.e. column 3, but intend to print only the corresponding ids and names, it doesn't work either.
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k3
And outputs :
1 A
3 C
But if the $3 on which the data is being sorted is included in the print,
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50)print $1,$2,$3}' | sort -k3
It produces the correct output ( but includes the unwanted score1 parameter in display )
3 C 25
1 A 40
But what if one wants to only print the id and name fields ?
Actually I'm new to awk commands, and probably I'm not using the sort command correctly. It would be really helpful if someone could explain.
I think this is what you're trying to do:
$ awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' file |
sort -k1,1n | cut -d' ' -f2-
3 C
1 A
You have to print the sum so you can sort by it and then the cut removes it.
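Putting it together on the sample file (a sketch assuming the input is in a file named `file`):

```shell
printf '%s\n' 'PlayerId,Name,Score1,Score2' '1,A,40,20' '2,B,30,10' '3,C,25,28' > file

# Emit the sum as a sortable first field, sort on it numerically,
# then cut strips it back out of the final output
result=$(awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' file |
         sort -k1,1n | cut -d' ' -f2-)
echo "$result"
# 3 C
# 1 A
```

The header line passes harmlessly through the filter because `$3+$4` on non-numeric fields evaluates to 0, which fails the `sum>50` test.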
If you wanted the header output too then it'd be:
$ awk 'BEGIN{FS=","} {sum=$3+$4} (NR==1) || (sum>50){print (NR>1),sum,$1,$2}' file |
sort -k1,2n | cut -d' ' -f3-
PlayerId Name
3 C
1 A
If you outsource the sorting, you need to carry the auxiliary value along and cut it out later; some of the complication comes from preserving the header.
$ awk -F, 'NR==1 {print s "\t" $1 FS $2; next}
(s=$3+$4)>50 {print s "\t" $1 FS $2 | "sort -n" }' file | cut -f2
PlayerId,Name
3,C
1,A
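The trick is that the data rows go through a `sort -n` pipe inside awk, which only emits its output when awk exits, while the header is printed directly, so the header stays on top. A runnable sketch (an `fflush()` is added here as an assumption of mine, to guarantee the header is flushed before the sorted rows even when stdout is a pipe rather than a terminal):

```shell
printf '%s\n' 'PlayerId,Name,Score1,Score2' '1,A,40,20' '2,B,30,10' '3,C,25,28' > file

# Header is printed (and flushed) straight to stdout; qualifying rows are
# fed to an internal "sort -n" pipe that only emits when awk exits, so
# the header stays on top; cut then drops the tab-separated sum column
result=$(awk -F, 'NR==1 {print s "\t" $1 FS $2; fflush(); next}
                  (s=$3+$4)>50 {print s "\t" $1 FS $2 | "sort -n"}' file | cut -f2)
echo "$result"
# PlayerId,Name
# 3,C
# 1,A
```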

Use of Awk filter to get the students records details in descending order of total score

Student details are stored in a file system as follows:
Roll_no,name,socre1,score2
101,ABC,50,55
102,XYZ,48,54
103,CWE,42,34
104,ZSE,65,72
105,FGR,31,45
106,QWE,68,45
Q. Write the unix command to display the Roll_no and name of the students whose total score is greater than 100; the student details are to be displayed sorted in descending order of the total score.
The total score is to be calculated as follows:
totalscore=score1+score2
The file also contains the header (Roll_no,name,socre1,score2).
My solution:
awk 'BEGIN {FS=",";OFS=" "} {if(NR>1){if($3+$4>100){s[$1]=$2}}} END{for (i in s) {print i,h[i]}}' stu.txt| sort -rk 2n
I am not getting how to sort according to the total score. Please help, guys!
output:-
104 ZSE
106 QWE
101 ABC
102 XYZ
Could you please try the following. To keep the calculation simple: first get the total for all lines where it is greater than 100, then sort in reverse order by total as per the OP, then print only the first 2 columns with cut.
awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file |
sort -t, -nr -k3 |
cut -d',' -f 1-2
OR, in case you want space-delimited output, try the following.
awk 'BEGIN{FS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file |
sort -nr -k3 |
cut -d' ' -f 1-2
Explanation: a detailed walk-through of the above.
awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file | ##Start the awk program, setting FS and OFS to comma. For each line whose 3rd+4th column sum is greater than 100, print the 1st and 2nd fields along with that sum, passing the output as input to the next command.
sort -t, -nr -k3 | ##Sort that output, setting the delimiter to comma, in reverse numeric order on the 3rd column, sending the output to the next command.
cut -d',' -f 1-2 ##Keep only the first 2 fields (roll number and name), again with comma as the delimiter.
OR
sort -t, -nr -k3 < <(awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file) |
cut -d',' -f 1-2
OR, in case you need space-delimited output, try the following.
sort -nr -k3 < <(awk 'BEGIN{FS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file) |
cut -d' ' -f 1-2
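For example, with the data from the question in `stu.txt` (a sketch):

```shell
printf '%s\n' 'Roll_no,name,socre1,score2' '101,ABC,50,55' '102,XYZ,48,54' \
              '103,CWE,42,34' '104,ZSE,65,72' '105,FGR,31,45' '106,QWE,68,45' > stu.txt

# The header fails the numeric test ($3+$4 evaluates to 0); data rows
# carry their total as a third column, sort orders on it descending,
# and cut drops the total from the final output
result=$(awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' stu.txt |
         sort -t, -nr -k3 |
         cut -d',' -f 1-2)
echo "$result"
# 104,ZSE
# 106,QWE
# 101,ABC
# 102,XYZ
```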
$ awk 'BEGIN {OFS=FS=","}
NR==1 {print $0, "total"; next}
{if(($5=$3+$4)>100) print | "sort -t, -k5nr"}' file
Roll_no,name,socre1,score2,total
104,ZSE,65,72,137
106,QWE,68,45,113
101,ABC,50,55,105
102,XYZ,48,54,102
without header and individual scores
$ awk 'BEGIN{OFS=FS=","}
NR>1 && ($3+=$4)>100{print $1,$2,$3}' file | sort -t, -k3nr
104,ZSE,137
106,QWE,113
101,ABC,105
102,XYZ,102
or
$ awk 'BEGIN{OFS=FS=","}
NR>1 && ($3+=$4)>100 && NF--' file | sort -t, -k3nr
104,ZSE,137
106,QWE,113
101,ABC,105
102,XYZ,102
without the final score and not comma delimited
$ awk -F, 'NR>1 && ($3+=$4)>100 && NF--' file | sort -k3nr | cut -d' ' -f1,2
104 ZSE
106 QWE
101 ABC
102 XYZ
reads as written
if line number is greater than one (skip header) AND
if field 3 + field 4 > 100 (assigned back to field 3) then
if both conditions are satisfied decrement field count so that last field won't be printed.
sort the results based on the third field,
remove the last field.
you were close:
awk 'BEGIN {FS=OFS=","} {if(NR>1){if($3+$4>100){s[$1]=$2}}} END{for (i in s) {print i,s[i]}}' stu.txt| sort -rk 2n

How to obtain the value for the 3rd one from the bottom in bash?

I have a line like this
3672975 3672978 3672979
awk '{print $1}' will return the first number 3672975
If I still want the first number, but indicating it is the 3rd one from the bottom, how should I adjust awk '{print $-3}'?
The reason is, I have hundreds of numbers, and I always want to obtain the 3rd one from the bottom.
Can I use awk to obtain the total number of items first, then do the subtraction?
$NF is the last field, $(NF-1) is the one before the last etc., so:
$ awk '{print $(NF-2)}'
for example:
$ echo 3672975 3672978 3672979 | awk '{print $(NF-2)}'
3672975
Edit:
$ echo 1 10 100 | awk '{print $(NF-2)}'
1
or with cut and rev
echo 1 2 3 4 | rev | cut -d' ' -f 3 | rev
2
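Since NF holds the current line's field count, the same idea extends to any offset from the end; a quick sketch:

```shell
# NF is the number of fields on the current line, so $(NF-2) is the
# third field from the end no matter how many fields there are
result=$(echo 'a b c d e' | awk '{print NF, $(NF-2)}')
echo "$result"
# 5 c
```

For lines with fewer than three fields, NF-2 drops to 0 (where $0 is the whole line) or goes negative (a fatal error), so a guard such as `NF>=3` may be wanted.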

uniq -c unable to count unique lines

I am trying to count unique occurrences of numbers in the 3rd column of a text file, a very simple command:
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | uniq -c
which should say something like
1 10103
2 2093
3 109
but instead puts out nonsense, where the same number is counted multiple times, like
20 1
1 2
1 1
1 2
14 1
1 2
I've also tried
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | sed -e 's/ //g' -e 's/\t//g' | uniq -c
I've tried every combination I can think of from the uniq man page. How can I correctly count the unique occurrences of numbers with uniq?
uniq -c counts contiguous repeats, so to count them all you need to sort the input first. However, with awk you don't need to sort at all:
$ awk '{count[$3]++} END{for(c in count) print count[c], c}' file
will do
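For example (the order of `for (c in count)` is unspecified in awk, so a trailing sort is added here, as an assumption of mine, purely to make the display deterministic):

```shell
printf '%s\n' 'a b 10' 'c d 20' 'e f 10' 'g h 30' 'i j 10' 'k l 20' > file

# Tally occurrences of the 3rd field in a hash -- no sorting needed;
# the sort at the end only stabilizes the unspecified for-in order
result=$(awk '{count[$3]++} END{for(c in count) print count[c], c}' file | sort -k2n)
echo "$result"
# 3 10
# 2 20
# 1 30
```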
awk-free version with cut, sort and uniq:
cut -f 3 bisulfite_seq_set0_v_set1.tsv | sort | uniq -c
uniq operates on adjacent matching lines, so the input has to be sorted first.
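The difference is easy to see on a toy input (a sketch; the awk at the end just strips uniq's leading padding):

```shell
# Unsorted: uniq -c restarts its count at every value change
unsorted=$(printf '%s\n' 1 2 1 1 | uniq -c | awk '{print $1, $2}')
# Sorted first: each value forms one contiguous run, so counts are correct
sorted=$(printf '%s\n' 1 2 1 1 | sort | uniq -c | awk '{print $1, $2}')
echo "$unsorted"
# 1 1
# 1 2
# 2 1
echo "$sorted"
# 3 1
# 1 2
```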

how awk takes the result of a unix command as a parameter?

Say there is an input file with tab-delimited fields, where the first field is an integer:
1 abc
1 def
1 ghi
1 lalala
1 heyhey
2 ahb
2 bbh
3 chch
3 chchch
3 oiohho
3 nonon
3 halal
3 whatever
First, I need to compute the counts of the unique values in the first field, which will be:
5 for 1, 2 for 2, and 6 for 3
Then I need to find the max of these counts; in this case, it's 6.
Now I need to pass "6" to another awk script as a parameter.
I know I can use the command below to get a list of counts:
cut -f1 input.txt | sort | uniq -c | awk -F ' ' '{print $1}' | sort
but how do I pick out that max count and pass it to the next awk command as a parameter, not as an input file?
This is nothing very specific to awk.
Either a program reads from stdin, in which case you pass the input through a pipe:
prg1 | prg2
or the program expects its input as an argument, in which case you use command substitution:
prg2 $(prg1)
Note that with command substitution prg1 must finish before prg2 starts, whereas in a pipeline both run concurrently.
Some programs allow both possibilities, though a huge amount of data is rarely passed as an argument.
This AWK script replaces your whole pipeline:
awk -v parameter="$(awk '{a[$1]++} END {for (i in a) {if (a[i] > max) {max = a[i]}}; print max}' inputfile)" '{print parameter}' otherfile
where '{print parameter}' is a stand-in for your other AWK script and "otherfile" is the input for that script.
Note: It is extremely likely that the two AWK scripts could be combined into one which would be less of a hack than doing it in a way such as that outlined in your question (awk feeding awk).
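A two-pass sketch of that combined version (assuming, as an illustration, that the "other script" merely needs the max count available as a variable):

```shell
printf '1\tabc\n1\tdef\n1\tghi\n1\tlalala\n1\theyhey\n2\tahb\n2\tbbh\n3\tchch\n3\tchchch\n3\toiohho\n3\tnonon\n3\thalal\n3\twhatever\n' > input.txt

# First pass (NR==FNR) tallies field-1 counts and tracks the running max;
# on the second pass over the same file, "max" is available to use
result=$(awk -F'\t' 'NR==FNR {if (++a[$1] > max) max = a[$1]; next}
                     FNR==1  {print "max count:", max}' input.txt input.txt)
echo "$result"
# max count: 6
```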
You can use the shell's $() command substitution:
awk -f script -v num=$(cut -f1 input.txt | sort | uniq -c | awk -F ' ' '{print $1}' | sort | tail -1) < input_file
(I added the tail -1 to ensure that at most one line is used.)
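Spelled out on a small tab-delimited file (a sketch; note that a plain lexical `sort` before `tail -1` only works while the counts are single digits, so `sort -n` is used here instead as a more robust assumption):

```shell
printf '1\tabc\n1\tdef\n2\tahb\n3\tchch\n3\tnonon\n3\thalal\n' > input.txt

# Command substitution runs the counting pipeline first; its single-line
# result is then handed to the outer awk as the variable "num"
num=$(cut -f1 input.txt | sort | uniq -c | awk '{print $1}' | sort -n | tail -1)
result=$(awk -v num="$num" 'FNR==1 {print "threshold is", num}' input.txt)
echo "$result"
# threshold is 3
```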
