I want to sort this file by the absolute value of the Linear regression (p) column in descending order. My attempt didn't quite work, and I'm not sure why it fails. I found this code at http://www.unix.com/shell-programming-and-scripting/168144-sort-absolute-value.html.
awk -F',' '{print ($2>=0)?$2:-$2, $0}' OFS=',' mycsv1.csv | sort -n -k8,8 | cut -d ',' -f2-
X var,Y var,MIC (strength),MIC-p^2 (nonlinearity),MAS (non-monotonicity),MEV (functionality),MCN (complexity),Linear regression (p)
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
...
Please help me to understand the awk script to sort this file.
You could use sed and sort for this, following @hek2mgl's very smart logic of adding and removing a field at the end to retain the original number:
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' file | sort -t, -k9,9 -nr | cut -f1-8 -d,
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' => creates field 9 as the absolute value of field 8
sort -t, -k9,9 -nr => sorts by the newly created field, numeric and descending order
cut -f1-8 -d, => removes the 9th field, restoring the output to its original format, with the desired sorting order
Here is the output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
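The same add-a-key, sort, remove-the-key idea also works with the key placed at the front, which is what the command in the question was attempting (it read $2 instead of column 8 and called sort without -t,). The following is only a sketch under some assumptions: the value sits in column 8, the first line is a header that should stay on top, and the function name sort_by_abs and the three sample rows are chosen here for illustration.

```shell
# Sample rows copied from the question (header plus three data lines).
sample='X var,Y var,MIC (strength),MIC-p^2 (nonlinearity),MAS (non-monotonicity),MEV (functionality),MCN (complexity),Linear regression (p)
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648'

# Hypothetical helper: reads CSV on stdin, writes it sorted by |column 8|, descending.
sort_by_abs() {
    # Pass the header line through untouched.
    IFS= read -r header
    printf '%s\n' "$header"
    # Prefix each data row with column 8 minus any leading "-",
    # sort numerically (descending) on that prefix, then cut the prefix off.
    awk -F, -v OFS=, '{ v = $8; sub(/^-/, "", v); print v, $0 }' |
        LC_ALL=C sort -t, -k1,1nr |
        cut -d, -f2-
}

result=$(printf '%s\n' "$sample" | sort_by_abs)
printf '%s\n' "$result"
```

Stripping the sign with sub() rather than negating the number keeps the key as the original text, so awk does not reformat it when printing.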
Take three steps:
(1) Temporarily create a 9th field which contains the abs value of field 8:
LC_COLLATE=C awk -F, 'NR>1{v=$NF;sub(/-/,"",v);printf "%s%s%s%s",$0,FS,v,RS}' file
^ ------ make sure this is set, since sorting, especially of the decimal point,
depends on the locale.
(2) Sort that output based on the 9th field:
command_1 | sort -t, -k9,9nr
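(3) Remove the temporary last field again. The command below does this with cut, keeping fields 1 to 8. An awk alternative is awk -F, -v OFS=, '{NF--}1': NF-- decreases the number of fields, which effectively removes the last field (in awks that rebuild the record when NF changes, such as gawk), and 1 is always true, which makes awk print the line: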
command_2 | cut -d, -f1-8
Output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
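GNU awk can do it all (asorti is a gawk extension, and its third argument needs gawk 4.0 or later). Note that rows whose absolute values are identical overwrite each other in the array:
awk -F, 'NR==1{print;next}{k=$NF;sub(/^-/,"",k);n[k]=$0}END{c=asorti(n,out,"@ind_num_desc");for(i=1;i<=c;i++)print n[out[i]]}' file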
So I've looked around now for a few hours but haven't found anything helpful.
I want to sort through a file that has a large number of lines formatted like
Values1, values2, values3, values4, values5, values6,
but I want to return only the lines that are uniquely related to
Values1, values2, values3, values6
That is, I have multiple lines with the same Values1, values2, values3, values6 that differ only in values4, values5. I don't want all of those returned, just one instance of the line (preferably the one with the largest values4, values5, but that's not a big deal).
I have tried using
uniq -s ##
but that doesn't work because the lengths of my values are variable.
I have also tried
sort -u -k 1,3
but that doesn't seem to work either.
Mainly, my issue is that my values are variable in length. I'm not that concerned with sorting by values6, but it would be nice.
Any help would be greatly appreciated.
With awk, you can print the first time the "key" is seen:
awk '
{ key = $1 OFS $2 OFS $3 OFS $6 }
!seen[key]++
' file
The magic !seen[key]++ is an awk idiom. seen[key]++ returns the current count for that key (0 the first time) and then increments it, so the negated expression is true only the first time the key is encountered. A true pattern with no action prints the line.
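To make the idiom easier to follow, here is a sketch with the test written out as an explicit if. The sample rows and the function name first_per_key are made up for illustration; the key is the same combination of fields 1, 2, 3 and 6 used above.

```shell
# Hypothetical sample in the question's "value, value, ..." layout.
sample='a, b, c, 1, 2, z,
a, b, c, 3, 4, z,
a, b, d, 5, 6, z,
a, b, c, 7, 8, y,'

# Long-hand equivalent of: awk '{ key = ... } !seen[key]++'
first_per_key() {
    awk '
    {
        key = $1 OFS $2 OFS $3 OFS $6
        if (!(key in seen)) {   # first time this key appears
            print               # keep the whole line
        }
        seen[key] = 1           # remember the key for later lines
    }'
}

result=$(printf '%s\n' "$sample" | first_per_key)
printf '%s\n' "$result"
```

The second input line is dropped because its key matches the first line; the other three lines each carry a new key and are kept.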
An alternative to awk:
cut -d" " -f1-3,6 filename | sort -u
This extracts only the required fields (dropping values4 and values5 from the output), then sorts them and keeps the unique lines.
If you absolutely mustn't use the very clean cut method suggested by @karafka, then with a csv file as input you could use uniq -f <num>, which skips the first <num> columns for the uniqueness comparison.
Since uniq expects blanks as separators we need to change this and also reorder the columns to meet your requirements.
sed 's/,/\t/g' textfile.csv | awk '{ print $4,$5,$1,$2,$3,$6}' | \
sort -k3,6 | uniq -f 2 | \
awk 'BEGIN{OFS=",";} { print $3,$4,$5,$1,$2,$6}'
This way, for each key only the $4 and $5 values from the first line (after the sort) are printed.
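If the kept line should be the one with the largest values4 (ties broken by the largest values5), one option is to sort those two columns in descending numeric order within each key and then keep the first line per key. This is a sketch only, assuming plain comma-separated input without spaces and numeric values4 and values5; the sample rows and the function name largest_per_key are hypothetical.

```shell
# Hypothetical CSV sample: columns are values1..values6.
sample='a,b,c,1,2,z
a,b,c,3,4,z
a,b,d,5,6,z
a,b,c,7,8,y
a,b,c,3,9,z'

largest_per_key() {
    # Group by columns 1, 2, 3 and 6; inside each group put the largest
    # column 4 (then column 5) first; then keep the first line of each group.
    LC_ALL=C sort -t, -k1,1 -k2,2 -k3,3 -k6,6 -k4,4nr -k5,5nr |
        awk -F, '!seen[$1, $2, $3, $6]++'
}

result=$(printf '%s\n' "$sample" | largest_per_key)
printf '%s\n' "$result"
```

Because the original column order is never changed, no final reordering step is needed.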
I want to print the lines whose value in a particular column appears only once. In the example below, val2 and val3 appear only once.
Input
val1,1
val2,2
val1,3
val3,4
Output
val2,2
val3,4
uniq -u does not seem to have an option for specifying a column. I also tried sort -t, -k1,1 -u, but that prints every row once.
awk -F, '{c[$1]++; t[$1]=$0} END {for(k in c) {if (c[k]==1) print t[k]}}'
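One thing to be aware of with the command above: for (k in c) visits the keys in an unspecified order, so the output lines may not come out in input order. If that matters, a possible variant is to remember every line and its key, and replay them in order at the end. A minimal sketch, with the function name unique_in_column chosen here for illustration and the whole input held in memory:

```shell
# Sample input from the question.
sample='val1,1
val2,2
val1,3
val3,4'

unique_in_column() {
    awk -F, '
    { count[$1]++; key[NR] = $1; line[NR] = $0 }   # tally keys, remember lines
    END {
        for (i = 1; i <= NR; i++)                  # replay in input order
            if (count[key[i]] == 1)
                print line[i]
    }'
}

result=$(printf '%s\n' "$sample" | unique_in_column)
printf '%s\n' "$result"
```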
Sounds like a problem for awk. Assume that the command that produces
val1,1
val2,2
val1,3
val3,2
is called foo; then pipe it into awk like so:
foo | awk -F, '$2 == 2 {print}'