get unique combination of values of two columns - shell

I can get the unique values from col using below command
cut -d',' -f3 file.txt | uniq -c.
This gives me unique values in field 3.
But if I want to get unique combination of two fields, how can I get that ?
input
A,B,C
B,C,D
D,B,C
H,C,D
K,C,D
output
2 B,C
3 C,D

You can specify range of fields using -f 2-3 or -f 2,3
cut -d',' -f2-3 file.txt | sort | uniq -c
uniq does not detect repeated lines unless they are adjacent. Input should be sorted before using uniq command
Output
2 B,C
3 C,D

Another option you may find provides greater flexibility in processing the input is awk. You can use a concatenation of the fields at issue as an index for an array to sum the occurrences of each unique combination of fields and then output the results using the END rule, e.g.
awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' file
Example Use/Output
With your example file in input you would have:
$ awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' input
3 C,D
2 B,C
awk arrays are associative rather than indexed, but you can preserve the order of appearance using a 3rd array if needed. Or you can simply pipe the output to sort for whatever order you like.

Related

AWK : To print data of a file in sorted order of result obtained from columns

I have an input file that looks somewhat like this:
PlayerId,Name,Score1,Score2
1,A,40,20
2,B,30,10
3,C,25,28
I want to write an awk command that checks for players with sum of scores greater than 50 and outputs the PlayerId,and PlayerName in sorted order of their total score.
When I try the following:
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k5
It does not work and seemingly sorts them on the basis of their ids.
1 A
3 C
Whereas the correct output I'm expecting is : ( since Player A has sum of scores=60, and C has sum of scores=53, and we want the output to be sorted in ascending order )
3 C
1 A
In addition to this,what confuses me a bit is when I try to sort it on the basis of score1, i.e. column 3 but intend to print only the corresponding ids and names, it dosen't work either.
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k3
And outputs :
1 A
3 C
But if the $3 with respect to what the data is being sorted is included in the print,
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50)print $1,$2,$3}' | sort -k3
It produces the correct output ( but includes the unwanted score1 parameter in display )
3 C 25
1 A 40
But what if one wants to only print the id and name fields ?
Actually I'm new to awk commands, and probably I'm not using the sort command correctly. It would be really helpful if someone could explain.
I think this is what you're trying to do:
$ awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' file |
sort -k1,1n | cut -d' ' -f2-
3 C
1 A
You have to print the sum so you can sort by it and then the cut removes it.
If you wanted the header output too then it'd be:
$ awk 'BEGIN{FS=","} {sum=$3+$4} (NR==1) || (sum>50){print (NR>1),sum,$1,$2}' file |
sort -k1,2n | cut -d' ' -f3-
PlayerId Name
3 C
1 A
if you outsource sorting, you need to have the auxiliary values and need to cut it out later, some complication is due to preserve the header.
$ awk -F, 'NR==1 {print s "\t" $1 FS $2; next}
(s=$3+$4)>50 {print s "\t" $1 FS $2 | "sort -n" }' file | cut -f2
PlayerId,Name
3,C
1,A

Awk to (random) sample a file by id-uniques criteria

I'm learning AWK to read a big file which format is similar to this MasterFile:
Beth|4.00|0|
Dan|3.75|0|
Kathy|4.00|10|
Mark|5.00|20|
Mary|5.50|22|
Susie|4.25|18|
Jise|5.62|0|
Mark|5.60|23.3|
Mary|8.50|42|
Susie|8.75|8.8|
Jise|3.62|0.8|
Beth|3.21|10|
Dan|8.39|20|
I would like to sample by unique values (size K) from the first column with size N (I choose it).
What I have done is following: I select unique values from first column and save it as IDfile.txt. Later, I take K random values from that archive and I match it with the MasterFile. I mean:
awk -F\| 'BEGIN{srand()}{print rand() " " $0}' IDfile | sort -n | tail -n K| awk -F'[[:blank:]|]+' 'BEGIN{OFS="|"}{$1="";sub(/\|/,"")}'1>tmp | awk -F\| 'NR==FNR{a[$1];next} {for (i in a) if(index($0,i)) print $0}' tmp MasterFile
But the output has repeated values and the result that I'd like to get is like to (assuming that K=3):
Beth|4.00|0|
Mark|5.60|23.3|
Mary|5.50|22|
I know that my code is far from efficient [or nice] and I'm open to suggestions [].
Thanks!
this is the one of the right ways to do this
$ sort -t'|' -u -k1,1 file | shuf -n3
Mark|5.00|20|
Kathy|4.00|10|
Jise|5.62|0|
change -n3 to whatever number of unique entries you need.

Sort a file in unix by the absolute value of a field

I want to sort this file by the absolute value of the Linear regression (p) column in descending order. My attempt to do this didnt quite work. Im not sure what it fails. I found this code from http://www.unix.com/shell-programming-and-scripting/168144-sort-absolute-value.html.
awk -F',' '{print ($2>=0)?$2:-$2, $0}' OFS=',' mycsv1.csv | sort -n -k8,8 | cut -d ',' -f2-
X var,Y var,MIC (strength),MIC-p^2 (nonlinearity),MAS (non-monotonicity),MEV (functionality),MCN (complexity),Linear regression (p)
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
...
Please help me to understand the awk script to sort this file.
You could use sed and sort for this and follow the #hek2mgl's very smart logic of adding and removing a field at the end to retain the original number:
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' file | sort -t, -k9,9 -nr | cut -f1-8 -d,
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' => creates field 9 as the absolute value of field 8
sort -t, -k9,9 -nr => sorts by the newly created field, numeric and descending order
cut -f1-8 -d, => removes the 9th field, restoring the output to its original format, with the desired sorting order
Here is the output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
Take three steps:
(1) Temporarily create a 9th field which contains the abs value of field 8:
LC_COLLATE=C awk -F, 'NR>1{v=$NF;sub(/-/,"",v);printf "%s%s%s%s",$0,FS,v,RS}' file
^ ------ make sure this is set since sorting, especially the decimal point
depends on the local.
(2) Sort that output based on the 9th field:
command_1 | sort -t, -k9r
(3) Pipe that back to awk to remove the last field. NF-- decreases the number of fields which will effectively remove the last field. 1 is always true, that makes awk print the line:
command_2 | cut -d, -f1-8
Output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
Could get awk to do it all:
awk -F, 'NR>1{n[substr($NF,1,1)=="-"?substr($NF,2):$NF]=$0}NR==1;END{asorti(n,out);for(i in out)print n[out[i]]}' file

Why does coreutils sort give a different result when I use a different field delimiter?

When using sort on the command line, why does the sorted order depend on which field delimiter I use? As an example,
$ # The test file:
$ cat test.csv
2,az,a,2
3,a,az,3
1,az,az,1
4,a,a,4
$ # sort based on fields 2 and 3, comma separated. Gives correct order.
$ LC_ALL=C sort -t, -k2,3 test.csv
4,a,a,4
3,a,az,3
2,az,a,2
1,az,az,1
$ # replace , by ~ as field separator, then sort as before. Gives incorrect order.
$ tr "," "~" < test.csv | LC_ALL=C sort -t"~" -k2,3
2~az~a~2
1~az~az~1
4~a~a~4
3~a~az~3
The second case not only gets the ordering wrong, but is inconsistent between field 2 (where az < a) and field 3 (where a < az).
There is a mistake in -k2,3. That means that sort should sort starting at the 2nd field and ending at the 3rd field. That means that the delimiter between them is also part of what is to be sorted and therefore counts as character. That's why you encounter different sorts with different delimiters.
What you want is the following:
LC_ALL=C sort -t"," -k2,2 -k3,3 file
And:
tr "," "~" < file | LC_ALL=C sort -t"~" -k2,2 -k3,3
That means sort should sort the 2nd field and is the 2nd field has dublicates sort the 3rd field.

How to remove duplicates by column (inverse ordering)

I've looking for this in here, but did not found the exact case. Sorry if it is duplicated, but I couldn't find it.
I have a huge file in Debian that contains 4 columns separated by "#", with the following format:
username#source#date#time
For example:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I want to print unique rows based on the first two columns, and if duplicates found, it has to print the last event based on date/time. With the list above, the result should be:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I have tested it using two commands:
cat file | sort -u -t# -k1,2
cat file | sort -r -u -t# -k1,2
But both of them print the following:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40 --> Wrong line, it is older than the duplicate one
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
Is there any way to do it?
Thanks!
This should work
tac file | awk -F# '!a[$1,$2]++' | tac
Output
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
First, you need sort the input file to ensure the order of lines, e.g. for duplicate username#source you will get ordered times. Best is sort reverse, so last event comes first. This can be done with an simple sort, like:
sort -r < yourfile
This will produce from your input the next:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A222222#Juniper#2014-08-07#14:31:40
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
reverse-ordered lines, where for the each username#source combination the latest event comes first.
next, you need somewhat filter the sorted lines, to get only the first event. This can be done, with several tools, like awk or uniq or perl and such,
So, the solution
sort -r <yourfile | uniq -w16
or
sort -r <yourfile | awk -F# '!seen[$1,$2]++'
or
sort -r yourfile | perl -F'#' -lanE 'say $_ unless $seen{"$F[0],$F[1]"}++'
all the above will print the next
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
Finally you can re-sort the unique lines as you want and needed.
awk -F\# '{ p = ($1 FS $2 in a ); a[$1 FS $2] = $0 }
!p { keys[++k] = $1 FS $2 }
END { for (k = 1; k in keys; ++k) print a[keys[k]] }' file
Output:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
If you know for a fact that the first column is always 7 chars long, and second column also 7 chars long, you can extract unique lines considering only the first 16 characters with:
uniq file -w 16
Since you want the latter duplicate, you can reverse the data using tac prior to uniq and then reverse the output again:
tac file | uniq -w 16 | tac
Update: As commented below, uniq needs the lines to be sorted. In which case this starts to become contrived, and the awk based suggestions are better. Something like this would still work though:
sort -s -t"#" -k1,2 file | tac | uniq -w 16 | tac

Resources