bash sort on multiple fields and deduplicating

I want to sort data like the content below on the first field and then on the date in the third field, and then keep only the latest entry for each ID (field 1), irrespective of the second field.
id1,description1,2013/11/20
id2,description2,2013/06/11
id2,description3,2012/10/28
id2,description4,2011/12/04
id3,description5,2014/02/09
id3,description6,2013/12/05
id4,description7,2013/12/05
id5,description8,2013/08/14
So the expected output will be
id1,description1,2013/11/20
id2,description2,2013/06/11
id3,description5,2014/02/09
id4,description7,2013/12/05
id5,description8,2013/08/14

You can use this awk:
> cat file
id1,description1,2013/11/20
id1,description1,2013/11/19
id2,description2,2013/06/11
id2,description3,2012/10/28
id2,description4,2011/12/04
id3,description5,2014/02/09
id3,description6,2013/12/05
id4,description7,2013/12/05
id5,description8,2013/08/14
> sort -t, -k1,1 -k3,3r file | awk -F, '!a[$1]++'
id1,description1,2013/11/20
id2,description2,2013/06/11
id3,description5,2014/02/09
id4,description7,2013/12/05
id5,description8,2013/08/14
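A commented sketch of the same pipeline, spelling out what each stage does (YYYY/MM/DD dates sort correctly as plain strings):
# Sort by id (field 1), and within each id by date (field 3) in reverse,
# so the newest row for each id comes first.
sort -t, -k1,1 -k3,3r file |
# !a[$1]++ is true only the first time an id is seen,
# so only the newest row per id is printed.
awk -F, '!a[$1]++'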

Call sort twice; the first time, sort by the date. On the second call, sort uniquely on the first field, but do so stably so that items with the same id remain sorted by date.
sort -t, -k3,3r data.txt | sort -t, -su -k1,1
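The same two-pass idea with the flags spelled out (a sketch):
# Pass 1: order every row by date (field 3), newest first.
sort -t, -k3,3r data.txt |
# Pass 2: -u keeps only one row per id (field 1); -s makes the sort
# stable, so within each id the newest-first order from pass 1 is
# preserved and the row that is kept is the newest one.
sort -t, -su -k1,1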

Try this:
sort -u file | awk -F, '{ if (map[$1] == "") { print $0; map[$1] = "printed" } }'
Explanation:
I use sort to sort the lines (it could not be simpler), and I use awk to record in a map whether the first-column value has already been printed.
If it has not (map[$1] == ""), I print the line and store "printed" in map[$1], so the next time the same $1 comes around the test is no longer true.
Note that plain sort -u orders whole lines lexically, so awk then keeps the lexically smallest line per id; that matches the "latest date" requirement here only because the sample descriptions happen to sort in the same order as the dates.
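The same bookkeeping can be written with the shorter idiom from the first answer (a sketch):
sort -u file | awk -F, '!seen[$1]++'
!seen[$1]++ is true only the first time a given $1 turns up, exactly like the map[$1] == "" test above.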

Related

Unix - smallest value in column

I have a txt file which is just three columns of numbers separated by space. I need to use "sort" to display the smallest value of column 3 and only that value.
I tried
sort -k3 file.txt|head -1
but it shows the whole first line (all three columns), not just the value in column 3.
This is what's expected: sort -k3 file.txt | head -1 says "show me the first line of output".
Use just plain sort -k3 file.txt | head to get the first 10 lines.
What were you expecting or wanting?
In response to the comment: No worries! We're all beginners at the beginning :-)
sort -r file.txt will sort in reverse order, and as @shellter says, sort -r -k3 file.txt | awk 'NR==1{print $3}' will print the third value on the first line.
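To get the smallest value and only that value, sort numerically in ascending order and print just the third field of the first line (a sketch, assuming whitespace-separated numeric columns):
sort -n -k3,3 file.txt | awk 'NR==1 {print $3}'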

Sort by length of column

Need help sorting by length of the 4th column with a Unix command.
Example data (all data is made up, and not actual).
5032:Stack:overflows@business.com:123:JamesPeterson
3200:Admin:admin@me.com:12ej3dij23i2j32:AdminAdmin
1024:GregoryJames:greg@admin.com:12329232:GregJames
Preferred output (ordered by the length of the 4th column, longest first):
3200:Admin:admin@me.com:12ej3dij23i2j32:AdminAdmin
1024:GregoryJames:greg@admin.com:12329232:GregJames
5032:Stack:overflows@business.com:123:JamesPeterson
Use awk to prepend a column containing the length of the 4th column, sort by that, then remove it.
awk -F: '{printf("%d %s\n", length($4), $0)}' input.txt | sort -nr | cut -d' ' -f2- > output.txt
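For the sample data, the intermediate decorated stream (after the sort, before cut strips the length column) looks like this:
15 3200:Admin:admin@me.com:12ej3dij23i2j32:AdminAdmin
8 1024:GregoryJames:greg@admin.com:12329232:GregJames
3 5032:Stack:overflows@business.com:123:JamesPeterson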

Sort a file in unix by the absolute value of a field

I want to sort this file by the absolute value of the Linear regression (p) column in descending order. My attempt to do this didn't quite work, and I'm not sure why it fails. I found this code at http://www.unix.com/shell-programming-and-scripting/168144-sort-absolute-value.html.
awk -F',' '{print ($2>=0)?$2:-$2, $0}' OFS=',' mycsv1.csv | sort -n -k8,8 | cut -d ',' -f2-
X var,Y var,MIC (strength),MIC-p^2 (nonlinearity),MAS (non-monotonicity),MEV (functionality),MCN (complexity),Linear regression (p)
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
...
Please help me to understand the awk script to sort this file.
You could use sed and sort for this, following @hek2mgl's very smart logic of adding and removing a field at the end to retain the original number of fields:
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' file | sort -t, -k9,9 -nr | cut -f1-8 -d,
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' => creates field 9 as the absolute value of field 8
sort -t, -k9,9 -nr => sorts by the newly created field, numeric and descending order
cut -f1-8 -d, => removes the 9th field, restoring the output to its original format, with the desired sorting order
Here is the output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
Take three steps:
(1) Temporarily create a 9th field which contains the abs value of field 8:
LC_COLLATE=C awk -F, 'NR>1{v=$NF;sub(/-/,"",v);printf "%s%s%s%s",$0,FS,v,RS}' file
(Make sure LC_COLLATE=C is set, since the sort order, and in particular the handling of the decimal point, depends on the locale.)
(2) Sort that output based on the 9th field:
command_1 | sort -t, -k9r
(3) Pipe that back to awk to remove the last field: NF-- decreases the number of fields, which effectively removes the last field, and the always-true pattern 1 makes awk print the line (cut -d, -f1-8 works just as well):
command_2 | awk -F, -v OFS=, '{NF--}1'
Output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
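Putting the three steps together as a single pipeline (a sketch; the header line is dropped by the NR>1 condition in step 1):
LC_COLLATE=C awk -F, 'NR>1{v=$NF;sub(/-/,"",v);printf "%s%s%s%s",$0,FS,v,RS}' file |
sort -t, -k9r |
awk -F, -v OFS=, '{NF--}1'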
You could also get GNU awk to do it all (asorti() is a gawk extension): index each line by the absolute value of its last field, sort the indices, and print them in descending order.
awk -F, 'NR>1{n[substr($NF,1,1)=="-"?substr($NF,2):$NF]=$0}NR==1;END{m=asorti(n,out);for(i=m;i>=1;i--)print n[out[i]]}' file

find uniq lines in file, but ignore certain columns

So I've looked around now for a few hours but haven't found anything helpful.
I want to sort through a file that has a large number of lines formatted like
Values1, values2, values3, values4, values5, values6,
but I want to return only the lines that are uniquely related to
Values1, values2, values3, values6
As in, I have multiple instances of Values1, values2, values3, values6 where the only difference is values4, values5, and I don't want to return all of those, rather just one instance of the line (preferably the line with the largest value of values4, values5, but that's not a big deal).
I have tried using
uniq -s ##
but that doesn't work because my values' lengths are variable.
I have also tried
sort -u -k 1,3
but that doesn't seem to work either.
Mainly my issue is that my values are variable in length. I'm not that concerned with sorting by values6, but it would be nice.
Any help would be greatly appreciated.
With awk, you can print the first time the "key" is seen:
awk '
{ key = $1 OFS $2 OFS $3 OFS $6 }
!seen[key]++
' file
The magic !seen[key]++ is an awk idiom. It returns true only the first time that key is encountered. It then increments the value so that it won't be true for any subsequent encounter.
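If the input really is comma-separated, as in the question's sample, set the field separator and build the key the same way (a sketch; SUBSEP is awk's built-in subscript separator):
awk -F', *' '!seen[$1 SUBSEP $2 SUBSEP $3 SUBSEP $6]++' file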
An alternative to awk:
cut -d" " -f1-3,6 filename | sort -u
Extract only the required fields, then sort unique. (Note this outputs just those four fields, not the whole line.)
If you absolutely mustn't use the very clean cut method suggested by @karafka, then with a csv file as input you could use uniq -f <num>, which skips the first <num> fields for the uniqueness comparison.
Since uniq expects blanks as separators, we need to change the separators and also reorder the columns to meet your requirements:
sed 's/,/\t/g' textfile.csv | awk '{ print $4,$5,$1,$2,$3,$6 }' | \
sort -k3,6 | uniq -f 2 | \
awk 'BEGIN{OFS=","} { print $3,$4,$5,$1,$2,$6 }'
This way, only the $4 and $5 values from the first line (after sorting) are kept for each unique combination.

uniq -u with specific column

I want to print the lines whose value in a particular column appears only once. In the example below, val2 and val3 appear only once.
Input
val1,1
val2,2
val1,3
val3,4
Output
val2,2
val3,4
uniq -u does not seem to have an option for specifying a column. I also tried sort -t, -k1,1 -u, but that prints one row for every key, including the repeated ones.
awk -F, '{c[$1]++; t[$1]=$0} END {for (k in c) if (c[k]==1) print t[k]}' file
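Note that for (k in c) visits keys in an unspecified order. A two-pass variant keeps the original input order by reading the file twice (a sketch):
awk -F, 'NR==FNR {c[$1]++; next} c[$1]==1' file file
The first pass only counts each key; the second pass prints the lines whose key occurs exactly once.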
Sounds like a problem for awk. Assume that the command that produces
val1,1
val2,2
val1,3
val3,2
is called foo; then pipe it into awk like so:
foo | awk -F, '$2 == 2 {print}'
