sort lines by the last "element" [non-csv text file] - bash

for lines with same number of columns separated by a dot delimiter, like
aa.bb
cc.dd
...
it's easy to sort by last column
sort -t. -k2,2 file
If the text file has a varying number of "columns", like
aa.b.xb
cc.dd
xx.cc.aa
a.b.c.d.e
...
then how can I sort the lines by the last "column", to get
xx.cc.aa
cc.dd
a.b.c.d.e
aa.b.xb
...

You can make use of the Schwartzian transform in bash.
awk -F. '{print $NF "\t" $0}' file | sort -k1,1 | cut -f2-
First extract the last column and prepend it to the line delimited by
a tab character.
Then sort the lines with the 1st (prepended) column.
Finally, remove the 1st column with the cut command.
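A minimal, self-contained sketch of the same decorate-sort-undecorate pipeline, run on the sample lines from the question. The temporary file and the `sorted` variable are illustrative names, and `LC_ALL=C` is an assumption made here so the ordering does not depend on the locale. The tab is passed to `sort -t` explicitly so that a line containing spaces is not split into extra fields.

```shell
#!/usr/bin/env bash
# Sample input from the question, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'aa.b.xb' 'cc.dd' 'xx.cc.aa' 'a.b.c.d.e' > "$tmp"

# Decorate:   prepend the last dot-separated field plus a tab.
# Sort:       order by the prepended field only (tab-delimited).
# Undecorate: drop the prepended field again.
sorted=$(awk -F. '{print $NF "\t" $0}' "$tmp" |
         LC_ALL=C sort -t "$(printf '\t')" -k1,1 |
         cut -f2-)

printf '%s\n' "$sorted"
rm -f "$tmp"
```

On this input the lines come out in the order asked for in the question: xx.cc.aa, cc.dd, a.b.c.d.e, aa.b.xb.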

Related

Filter records from one file based on a values present in another file using Unix

I have an input CSV file.
Input feed
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
There is an output error csv file which is generated from this input file which has the Primary Key
Error File
Pk,Error_Reason
D,Failure
E, Failure
F, Failure
I want to extract all the records from the input file and save it into a new file for which there is a Primary key entry in Error file.
Basically my new file should look like this:
New Input feed
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
I am a beginner in Unix and I have tried Awk command.
The Approach I have tried is, get all the primary key values into a file.
awk -F"," '{print $1}' error.csv >> error_pk.csv
Now I need to filter out the records from the input.csv for all the primary key values present in error_pk.csv
Using awk. As there is a leading space in the error file, it needs to be trimmed off first; I'm using sub for that. Then, since the titles of the first column are not identical (PK vs Pk), that needs to be handled separately with FNR==1:
$ awk -F, ' # set separator
NR==FNR { # process the first file
sub(/^ */,"") # trim leading space
a[$1] # hash the first column
next
}
FNR==1 || ($1 in a)' error input # output the header record and any record whose key was hashed
Output:
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
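For reference, here is a self-contained sketch of the same two-file awk idiom run on the sample data. It differs from the answer above in one respect, which is an assumption on my part rather than something the question requires: it trims blanks around the key field itself instead of at the start of the line. The file names and the `bad` array are illustrative.

```shell
#!/usr/bin/env bash
dir=$(mktemp -d)
cat > "$dir/input.csv" <<'EOF'
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
EOF
cat > "$dir/error.csv" <<'EOF'
Pk,Error_Reason
D,Failure
E, Failure
F, Failure
EOF

filtered=$(awk -F, '
    { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }  # trim blanks around the key
    NR == FNR { if (FNR > 1) bad[key]; next }        # error file: remember keys, skip its header
    FNR == 1 || (key in bad)                         # input file: print header and matching rows
' "$dir/error.csv" "$dir/input.csv")

printf '%s\n' "$filtered"
rm -rf "$dir"
```

Skipping the header of the error file avoids storing "Pk" as a key, so the header of the input file is printed only because of the FNR == 1 test.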
You can use join.
First remove everything after the comma from the second file.
Then join on the first field from both files:
cat <<EOF >file1
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
EOF
cat <<EOF >file2
PK,Error_Reason
D,Failure
E,Failure
F,Failure
EOF
join -t, -11 -21 <(sort -k1 file1) <(cut -d, -f1 file2 | sort -k1)
If you need the file to be sorted according to file1, you can number the lines in first file, join the files, re-sort using the line numbers and then remove the numbers from the output:
join -t, -12 -21 <(nl -w1 -s, file1 | sort -t, -k2) <(cut -d, -f1 file2 | sort -k1) |
sort -t, -k2,2n | cut -d, -f1,3-
You can use grep -f with a file with search items. Cut off at the ,.
grep -Ef <(sed -r 's/([^,]*).*/^\1,/' file2) file1
When you want a header in the output,

Unix- smallest value in column

I have a txt file which is just three columns of numbers separated by space. I need to use "sort" to display the smallest value of column 3 and only that value.
I tried
sort -k3 file.txt|head -1
but it shows the first value of all three columns.
This is what's expected. sort -k3 file.txt | head -1 says "show me the first line of output"
Use just plain sort -k3 file.txt | head to get the first 10 lines.
What were you expecting or wanting?
In response to the comment: No worries! We're all beginners at the beginning :-)
sort -r file.txt will sort in reverse order, and as @shellter says, sort -r -k3 file.txt | awk 'NR==1{print $3}' will print the third value on the first line.
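To get only the smallest value of column 3, as the question asks, a sketch along these lines should work. The sample numbers are hypothetical, since the question does not show the file, and `LC_ALL=C` is assumed so that "." is read as the decimal point. Without -n, sort compares the column as text, so "10" would sort before "7".

```shell
#!/usr/bin/env bash
# Hypothetical sample data: three space-separated numeric columns.
tmp=$(mktemp)
printf '%s\n' '1 2 30' '4 5 7' '7 8 -2.5' '3 3 10' > "$tmp"

# -k3,3 restricts the key to column 3; -n compares it numerically.
# head -n 1 keeps the line with the smallest key; awk prints only column 3.
smallest=$(LC_ALL=C sort -n -k3,3 "$tmp" | head -n 1 | awk '{print $3}')

printf '%s\n' "$smallest"
rm -f "$tmp"
```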

Sort by length of column

Need help sorting by length of the 4th column with a Unix command.
Example data (all data is made up, and not actual).
5032:Stack:overflows#business.com:123:JamesPeterson
3200:Admin:admin#me.com:12ej3dij23i2j32:AdminAdmin
1024:GregoryJames:greg#admin.com:12329232:GregJames
Preferred format (Because the length of 4th column is the longest).
3200:Admin:admin#me.com:12ej3dij23i2j32:AdminAdmin
1024:GregoryJames:greg#admin.com:12329232:GregJames
5032:Stack:overflows#business.com:123:JamesPeterson
Use awk to add a column containing the length of the column, sort by that, then remove it.
awk -F: '{printf("%d %s\n", length($4), $0)}' input.txt | sort -nr | cut -d' ' -f2- > output.txt
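The same idea as a self-contained run on the example data. This is only a sketch: the `by_length` variable and the temporary file are illustrative, and the key is restricted to the prepended length with -k1,1nr so the rest of the line does not enter the numeric comparison.

```shell
#!/usr/bin/env bash
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
5032:Stack:overflows#business.com:123:JamesPeterson
3200:Admin:admin#me.com:12ej3dij23i2j32:AdminAdmin
1024:GregoryJames:greg#admin.com:12329232:GregJames
EOF

# Decorate with the length of field 4, sort on that number only
# (descending), then strip the length again.
by_length=$(awk -F: '{print length($4) " " $0}' "$tmp" |
            sort -k1,1nr |
            cut -d' ' -f2-)

printf '%s\n' "$by_length"
rm -f "$tmp"
```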

Sort a file in unix by the absolute value of a field

I want to sort this file by the absolute value of the Linear regression (p) column in descending order. My attempt to do this didn't quite work, and I'm not sure why it fails. I found this code at http://www.unix.com/shell-programming-and-scripting/168144-sort-absolute-value.html.
awk -F',' '{print ($2>=0)?$2:-$2, $0}' OFS=',' mycsv1.csv | sort -n -k8,8 | cut -d ',' -f2-
X var,Y var,MIC (strength),MIC-p^2 (nonlinearity),MAS (non-monotonicity),MEV (functionality),MCN (complexity),Linear regression (p)
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
...
Please help me to understand the awk script to sort this file.
You could use sed and sort for this, following @hek2mgl's very smart logic of adding and removing a field at the end to retain the original number:
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' file | sort -t, -k9,9 -nr | cut -f1-8 -d,
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' => creates field 9 as the absolute value of field 8
sort -t, -k9,9 -nr => sorts by the newly created field, numeric and descending order
cut -f1-8 -d, => removes the 9th field, restoring the output to its original format, with the desired sorting order
Here is the output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
Take three steps:
(1) Temporarily create a 9th field which contains the abs value of field 8:
LC_COLLATE=C awk -F, 'NR>1{v=$NF;sub(/-/,"",v);printf "%s%s%s%s",$0,FS,v,RS}' file
^ ------ make sure this is set, since sorting, especially of the decimal point,
depends on the locale.
(2) Sort that output based on the 9th field:
command_1 | sort -t, -k9r
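(3) Remove the temporary last field again, here with cut. Piping back to awk also works in implementations that rebuild the record when NF changes (gawk, for example): in awk -F, -v OFS=, '{NF--}1', NF-- decreases the number of fields, which effectively removes the last field, and 1 is always true, which makes awk print the line: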
command_2 | cut -d, -f1-8
Output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
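Could get awk to do it all (asorti is a GNU awk extension, so this needs gawk; the loop runs backwards to get descending order):
awk -F, 'NR>1{n[substr($NF,1,1)=="-"?substr($NF,2):$NF]=$0}NR==1;END{k=asorti(n,out);for(i=k;i>=1;i--)print n[out[i]]}' file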
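A portable sketch that keeps the header row in place and sorts only the data rows, using the sample rows from the question. It is one possible arrangement, not the method of any answer above: the absolute value is prepended as a temporary first field instead of appended as a ninth one, and `LC_ALL=C` is assumed so that "." is the decimal point. The `sorted` variable and temporary file are illustrative names.

```shell
#!/usr/bin/env bash
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
X var,Y var,MIC (strength),MIC-p^2 (nonlinearity),MAS (non-monotonicity),MEV (functionality),MCN (complexity),Linear regression (p)
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
EOF

sorted=$(
    # The header passes through unsorted.
    head -n 1 "$tmp"
    # Data rows: prepend |last field| as text (strip a leading minus),
    # sort numerically descending on it, then remove it again.
    tail -n +2 "$tmp" |
        awk -F, '{ v = $NF; sub(/^-/, "", v); print v "," $0 }' |
        LC_ALL=C sort -t, -k1,1nr |
        cut -d, -f2-
)

printf '%s\n' "$sorted"
rm -f "$tmp"
```

Stripping the minus sign as text, rather than negating the number in awk, keeps all digits of the original value in the sort key.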

Unix - Sorting file name with a key but not knowing its position

I would like to sort those files using Unix commands:
MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz
The result I am expecting here is MyFile_fdfdsf_20140326.txt
So I'd like to get the file with the newest date.
I can't use 'sort -k', as the position of the key (the date) may vary
But in my file name there are always two "_" delimiters and a dot '.' for the file extension
Any help would be appreciated :)
Then use -t to indicate the field separator and set it to _:
sort -t'_' -k3
See an example of sorting the file names if they are in a file. I used -n for numeric sort and -r for reverse order:
$ sort -t'_' -nk3 file
MyFile_dfgfdklm_19990101.tar.gz
MyFile_4fg5d6_20100301.csv
MyFile_fdfdsf_20140326.txt
$ sort -t'_' -rnk3 file
MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz
From man sort:
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
-n, --numeric-sort
compare according to string numerical value
-r, --reverse
reverse the result of comparisons
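Since the question asks for only the newest file, a sketch that combines these options with head might look like this. The `names` and `newest` variables are illustrative, and the key is written as -k3,3nr, which limits the comparison to the third field and applies -n and -r to that key only.

```shell
#!/usr/bin/env bash
# File names from the question, one per line.
names='MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz'

# Field 3 starts with the 8-digit date; -n reads the leading digits,
# -r puts the largest (newest) first, head keeps only that line.
newest=$(printf '%s\n' "$names" | sort -t_ -k3,3nr | head -n 1)

printf '%s\n' "$newest"
```

This relies on the dates being written as YYYYMMDD, so that numeric order and chronological order coincide.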
Update
Thank you for your answer. It's perfect. But out of curiosity, what if
I had an unknown number of delimiters, but the date was always after
the last "_" delimiter? MyFile_abc_def_...20140326.txt
sort -t'_' -nk??? file – user3464809
You can trick it a little bit: print the last field, sort and then remove it.
awk -F_ '{print $NF, $0}' a | sort | cut -d' ' -f2-
See an example:
$ cat a
MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz
MyFile_dfgfdklm_asdf_asdfsadfas_19940101.tar.gz
MyFile_dfgfdklm_asdf_asdfsadfas_29990101.tar.gz
$ awk -F_ '{print $NF, $0}' a | sort | cut -d' ' -f2-
MyFile_dfgfdklm_asdf_asdfsadfas_19940101.tar.gz
MyFile_dfgfdklm_19990101.tar.gz
MyFile_4fg5d6_20100301.csv
MyFile_fdfdsf_20140326.txt
MyFile_dfgfdklm_asdf_asdfsadfas_29990101.tar.gz
