uniq -c unable to count unique lines - shell

I am trying to count unique occurrences of numbers in the 3rd column of a text file with what should be a very simple command:
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | uniq -c
which should say something like
1 10103
2 2093
3 109
but instead puts out nonsense, where the same number is counted multiple times, like
20 1
1 2
1 1
1 2
14 1
1 2
I've also tried
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | sed -e 's/ //g' -e 's/\t//g' | uniq -c
I've tried every combination I can think of from the uniq man page. How can I correctly count the unique occurrences of numbers with uniq?

uniq -c only counts contiguous repeats, so to count all occurrences you need to sort the input first. With awk, however, you don't need to:
$ awk '{count[$3]++} END{for(c in count) print count[c], c}' file
will do the counting in a single pass.
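For example, on a small made-up stand-in for the TSV (the real file isn't shown), with the values to count in column 3:

```shell
# Made-up stand-in for the question's TSV (tab-separated, 3 columns).
printf 'a\tb\t1\nc\td\t2\ne\tf\t1\n' > file

# Count occurrences of each distinct value in column 3; note that
# "for (c in count)" iterates in an unspecified order.
awk '{count[$3]++} END{for(c in count) print count[c], c}' file
```

Pipe the result through sort if a fixed output order matters.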

awk-free version with cut, sort and uniq:
cut -f 3 bisulfite_seq_set0_v_set1.tsv | sort | uniq -c
uniq operates on adjacent matching lines, so the input has to be sorted first.
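A tiny illustration of the adjacency requirement (made-up values standing in for column 3):

```shell
# Made-up column-3 values; equal values are NOT adjacent.
printf '2\n1\n2\n2\n1\n' > col3.txt

# uniq -c alone only collapses adjacent repeats, giving fragmented counts.
uniq -c col3.txt

# Sorting first makes equal values adjacent, so the counts are correct.
sort col3.txt | uniq -c
```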


AWK: To print data of a file in sorted order of result obtained from columns

I have an input file that looks somewhat like this:
PlayerId,Name,Score1,Score2
1,A,40,20
2,B,30,10
3,C,25,28
I want to write an awk command that checks for players with a sum of scores greater than 50 and outputs the PlayerId and PlayerName in sorted order of their total score.
When I try the following:
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k5
It does not work and seemingly sorts them on the basis of their ids.
1 A
3 C
Whereas the correct output I'm expecting is (since Player A has a sum of scores of 60, C has a sum of 53, and we want the output sorted in ascending order):
3 C
1 A
In addition to this, what confuses me a bit is that when I try to sort on the basis of score1, i.e. column 3, but intend to print only the corresponding ids and names, it doesn't work either.
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k3
And outputs :
1 A
3 C
But if $3, the field the data is being sorted on, is included in the print,
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50)print $1,$2,$3}' | sort -k3
it produces the correct output (but includes the unwanted score1 field in the display):
3 C 25
1 A 40
But what if one wants to print only the id and name fields?
Actually I'm new to awk, and probably I'm not using the sort command correctly. It would be really helpful if someone could explain.
I think this is what you're trying to do:
$ awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' file |
sort -k1,1n | cut -d' ' -f2-
3 C
1 A
You have to print the sum so you can sort by it and then the cut removes it.
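Run end-to-end on the sample data from the question (reconstructed here), the decorate-sort-undecorate pipeline gives:

```shell
# The question's sample data.
cat > players.csv <<'EOF'
PlayerId,Name,Score1,Score2
1,A,40,20
2,B,30,10
3,C,25,28
EOF

# Print the sum first (decorate), sort numerically on it, then cut it off.
# The header row's sum evaluates to 0, so sum>50 filters it out.
awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' players.csv |
  sort -k1,1n | cut -d' ' -f2-
# 3 C
# 1 A
```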
If you wanted the header output too then it'd be:
$ awk 'BEGIN{FS=","} {sum=$3+$4} (NR==1) || (sum>50){print (NR>1),sum,$1,$2}' file |
sort -k1,2n | cut -d' ' -f3-
PlayerId Name
3 C
1 A
If you outsource the sorting, you need to carry the auxiliary sort value along and cut it out later; some of the complication here is due to preserving the header.
$ awk -F, 'NR==1 {print s "\t" $1 FS $2; next}
(s=$3+$4)>50 {print s "\t" $1 FS $2 | "sort -n" }' file | cut -f2
PlayerId,Name
3,C
1,A

BASH choosing and counting distinct based on two columns

Hey guys, so I've got this dummy data:
115,IROM,1
125,FOLCOM,1
135,SE,1
111,ATLUZ,1
121,ATLUZ,2
121,ATLUZ,2
142,ATLUZ,2
142,ATLUZ,2
144,BLIZZARC,1
166,STEAD,3
166,STEAD,3
166,STEAD,3
168,BANDOI,1
179,FOX,1
199,C4,2
199,C4,2
Desired output:
IROM,1
FOLCOM,1
SE,1
ATLUZ,3
BLIZZARC,1
STEAD,1
BANDOI,1
FOX,1
C4,1
which comes from counting the distinct game IDs (the 115, 125, etc.). So for example the
111,ATLUZ,1
121,ATLUZ,2
121,ATLUZ,2
142,ATLUZ,2
142,ATLUZ,2
will be
ATLUZ,3
since it has 3 distinct game IDs (111, 121, 142).
I tried using
cut -d',' -f 2 game.csv|uniq -c
where I got the following output:
1 IROM
1 FOLCOM
1 SE
5 ATLUZ
1 BLIZZARC
3 STEAD
1 BANDOI
1 FOX
2 C4
How do I fix this using bash?
Before executing the cut command, do a uniq. This removes the redundant (adjacent duplicate) lines; then follow with your command, i.e. apply cut to extract the 2nd field and uniq -c to count the occurrences:
uniq game.csv | cut -d',' -f 2 | uniq -c
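A sketch on a shortened version of the sample data. Note this relies on duplicate rows being adjacent, as they are in the question's file; if that's not guaranteed, sort game.csv first (or use sort -u) instead of the leading uniq:

```shell
# Shortened sample; the duplicate rows (121,ATLUZ,2) are adjacent.
cat > game.csv <<'EOF'
115,IROM,1
111,ATLUZ,1
121,ATLUZ,2
121,ATLUZ,2
142,ATLUZ,2
EOF

# Collapse adjacent duplicate rows, then count the names in column 2.
uniq game.csv | cut -d',' -f2 | uniq -c
```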
Could you please try the following too, in a single awk:
awk -F, '
!a[$1,$2,$3]++{
  b[$1,$2,$3]++
}
!f[$2]++{
  g[++count]=$2
}
END{
  for(i in b){
    split(i,array,",")
    c[array[2]]++
  }
  for(q=1;q<=count;q++){
    print c[g[q]],g[q]
  }
}' SUBSEP="," Input_file
It gives the output in the same order as the 2nd field's first occurrence in Input_file, as follows:
1 IROM
1 FOLCOM
1 SE
3 ATLUZ
1 BLIZZARC
1 STEAD
1 BANDOI
1 FOX
1 C4
Using GNU datamash:
datamash -t, --sort --group 2 countunique 1 < input
Using awk:
awk -F, '!a[$1,$2]++{b[$2]++}END{for(i in b)print i FS b[i]}' input
Using sort, cut, uniq:
sort -u -t, -k2,2 -k1,1 input | cut -d, -f2 | uniq -c
Test run:
$ cat input
111,ATLUZ,1
121,ATLUZ,1
121,ATLUZ,2
142,ATLUZ,2
115,IROM,1
142,ATLUZ,2
$ datamash -t, --sort --group 2 countunique 1 < input
ATLUZ,3
IROM,1
As you can see, 121,ATLUZ,1 and 121,ATLUZ,2 are correctly considered to be just one game ID.
Less elegant, but you may use awk as well. If it is not guaranteed that the same ID+NAME combos always come consecutively, you have to count them all by reading the whole file before producing output:
awk -F, '{c[$1,$2]+=1}END{for (ck in c){split(ck,ca,SUBSEP);g[ca[2]]+=1}for(gk in g){print gk,g[gk]}}' game.csv
This first counts every [COL1,COL2] pair, then for each COL2 counts how many distinct [COL1,COL2] pairs it appears in.
This also does the trick. The only thing is that your output is not sorted.
awk 'BEGIN{ FS = OFS = "," }{ a[$2 FS $1] }END{ for ( i in a ){ split(i, b, "," ); c[b[1]]++ } for ( i in c ) print i, c[i] }' yourfile
Output:
BANDOI,1
C4,1
STEAD,1
BLIZZARC,1
FOLCOM,1
ATLUZ,3
SE,1
IROM,1
FOX,1

How can awk take the result of a unix command as a parameter?

Say there is an input file with tab-delimited fields, where the first field is an integer:
1 abc
1 def
1 ghi
1 lalala
1 heyhey
2 ahb
2 bbh
3 chch
3 chchch
3 oiohho
3 nonon
3 halal
3 whatever
First, I need to compute the counts of the unique values in the first field, which are:
5 for 1, 2 for 2, and 6 for 3
Then I need to find the max of these counts, in this case 6.
Now I need to pass "6" to another awk script as a parameter.
I know I can use the command below to get a list of counts:
cut -f1 input.txt | sort | uniq -c | awk -F ' ' '{print $1}' | sort
but how do I get that maximum count and pass it to the next awk command as a parameter rather than as an input file?
This is nothing specific to awk.
Either a program can read from stdin, then you can pass the input with a pipe:
prg1 | prg2
or your program expects input as parameter, then you use
prg2 $(prg1)
Note that in both cases prg1 is processed before prg2.
Some programs allow both possibilities, though huge amounts of data are rarely passed as arguments.
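A minimal illustration of both routes, with sort -n | tail -1 standing in for prg1 (the max-finding step) and echo for prg2:

```shell
# Route 1: pass data through a pipe; the consumer reads stdin.
printf '5\n2\n9\n' | sort -n | tail -1
# 9

# Route 2: command substitution runs the producer first and hands
# its output to the consumer as an argument.
echo "max is $(printf '5\n2\n9\n' | sort -n | tail -1)"
# max is 9
```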
This AWK script replaces your whole pipeline:
awk -v parameter="$(awk '{a[$1]++} END {for (i in a) {if (a[i] > max) {max = a[i]}}; print max}' inputfile)" '{print parameter}' otherfile
where '{print parameter}' is a standin for your other AWK script and "otherfile" is the input for that script.
Note: It is extremely likely that the two AWK scripts could be combined into one which would be less of a hack than doing it in a way such as that outlined in your question (awk feeding awk).
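As a sketch of that combination: awk can read two files in one invocation, so the max can be computed from the first file and applied while processing the second, with no shell plumbing. Here {print max, $0} is a hypothetical stand-in for the real second script, and both input files are made up:

```shell
# Made-up tab-delimited input like the question's, plus a second file.
printf '1\tabc\n1\tdef\n2\tahb\n' > input.txt
printf 'x\ny\n' > otherfile

# Pass 1 (NR==FNR): count field 1 in input.txt, tracking the max count.
# Pass 2: process otherfile with max already known.
awk 'NR==FNR {if (++a[$1] > max) max = a[$1]; next}
     {print max, $0}' input.txt otherfile
# 2 x
# 2 y
```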
You can use the shell's $() command substitution:
awk -f script -v num="$(cut -f1 input.txt | sort | uniq -c | awk -F ' ' '{print $1}' | sort -n | tail -1)" < input_file
(I added the tail -1 so that at most one value is used, made the final sort numeric with -n so that value is the maximum, and quoted the substitution.)

Cut | Sort | Uniq -d -c | but?

The given file is in the below format.
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
I need to take out duplicates and count them (duplicates being categorized by fields 1, 2, 5, 14). Then I need to insert into the database the entire record of the first duplicate occurrence, tagging the count of dups in another column. For this I need to cut all 4 mentioned fields, sort, find the dups using uniq -d, and use -c for the counts. Now, coming back after sorting out the dups and their counts, I need the output to be in the below form:
3,GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
where three is the number of repeated dups for f1,2,5,14, and the rest of the fields can come from any of the dup rows.
This way dups should be removed from the original file and shown in the above format.
The remaining lines in the original file will be unique ones; they go as they are.
What I have done is:
awk '{printf("%5d,%s\n", NR,$0)}' renewstatus_2012-04-19.txt > n_renewstatus_2012-04-19.txt
cut -d',' -f2,3,6,15 n_renewstatus_2012-04-19.txt |sort | uniq -d -c
but this needs to point back again to the original file to get the full lines for the dup occurrences...
Let me not confuse things; this needs a different point of view, and my brain is clinging to my approach.
Any thoughts?
sort has an option -k
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
uniq has an option -f
-f, --skip-fields=N
avoid comparing the first N fields
so sort and uniq with field numbers (choose the NUM values for your data and test this command yourself, please):
awk -F"," '{print $0,$1,$2,...}' file.txt | sort -k NUM,NUM2 | uniq -f NUM3 -c
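To make the uniq -f behaviour concrete (made-up two-field data; uniq -f1 skips the first field when comparing lines):

```shell
# Made-up data: field 2 is the value of interest.
printf 'x 1\ny 1\nz 2\n' > demo.txt

# Sort on field 2, then let uniq ignore field 1 when comparing,
# so lines sharing field 2 are merged and counted.
sort -k2 demo.txt | uniq -f1 -c
```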
Using awk's associative arrays is a handy way to find unique/duplicate rows:
awk '
BEGIN {FS = OFS = ","}
{
key = $1 FS $2 FS $5 FS $14
if (key in count)
count[key]++
else {
count[key] = 1
line[key] = $0
}
}
END {for (key in count) print count[key], line[key]}
' filename
SYNTAX :
awk -F, '
!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq) {
  uniq[$1,$2,$5,$14] = $0
}
{ count[$1,$2,$5,$14]++ }
END {
  for (i in count) {
    if (count[i] > 1) file = "dupes"; else file = "uniq"
    print uniq[i], "," count[i] > file
  }
}' renewstatus_2012-04-19.txt
Calculation:
sym#localhost:~$ cut -f16 -d',' uniq | sort | uniq -d -c
124275 1 -----> SUM OF UNIQ ( 1 )ENTRIES
sym#localhost:~$ cut -f16 -d',' dupes | sort | uniq -d -c
3860 2
850 3
71 4
7 5
3 6
sym#localhost:~$ cut -f16 -d',' dupes | sort | uniq -u -c
1 7
10614 ------> SUM OF DUPLICATE ENTRIES MULTIPLIED WITH ITS COUNTS
sym#localhost:~$ wc -l renewstatus_2012-04-19.txt
134889 renewstatus_2012-04-19.txt ---> TOTAL LINE COUNTS OF THE ORIGINAL FILE, MATCHED EXACTLY WITH (124275+10614) = 134889

Remove all occurrences of a duplicate line

If I want to remove lines where certain fields are duplicated then I use sort -u -k n,n.
But this keeps one occurrence. If I want to remove all occurrences of the duplicate is there any quick bash or awk way to do this?
E.g. I have:
1 apple 30
2 banana 21
3 apple 9
4 mango 2
I want:
2 banana 21
4 mango 2
I will presort and then use a hash in Perl, but for very large files this is going to be slow.
This will keep your output in the same order as your input:
awk '{seen[$2]++; a[++count]=$0; key[count]=$2} END {for (i=1;i<=count;i++) if (seen[key[i]] == 1) print a[i]}' inputfile
Try sort -k <your fields> | awk '{print $3, $1, $2}' | uniq -f2 -u | awk '{print $2, $3, $1}' to remove all lines that are duplicated (without keeping any copies). If you don't need the last field, change that first awk command to just cut -f 1-5 -d ' ', change the -f2 in uniq to -f1, and remove the second awk command.
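Another route, not from the answers above but a common awk idiom: read the file twice, counting field 2 on the first pass and printing only the lines whose field-2 value occurred exactly once on the second:

```shell
# The question's sample data.
cat > fruit.txt <<'EOF'
1 apple 30
2 banana 21
3 apple 9
4 mango 2
EOF

# Pass 1 (NR==FNR) counts field 2; pass 2 prints lines whose
# field-2 value occurred exactly once.
awk 'NR==FNR {c[$2]++; next} c[$2]==1' fruit.txt fruit.txt
# 2 banana 21
# 4 mango 2
```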
