How to remove duplicates by column (inverse ordering) - bash

I've been looking for this here, but did not find this exact case. Sorry if it is a duplicate.
I have a huge file in Debian that contains 4 columns separated by "#", with the following format:
username#source#date#time
For example:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I want to print unique rows based on the first two columns and, if duplicates are found, print the last event based on date/time. With the list above, the result should be:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I have tested it using two commands:
cat file | sort -u -t# -k1,2
cat file | sort -r -u -t# -k1,2
But both of them print the following:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40 --> Wrong line, it is older than the duplicate one
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
Is there any way to do it?
Thanks!

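This should work, provided the lines for each username#source pair already appear in chronological order in the file: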
tac file | awk -F# '!a[$1,$2]++' | tac
Output
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
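The tac pipeline keeps whichever line comes last in the file for each key, so it only returns the latest event when the file is already in chronological order per key. A minimal sketch that compares the date and time explicitly instead, assuming ISO-style dates (YYYY-MM-DD) and zero-padded times so that plain string comparison matches chronological order. The function name latest_per_key is made up for illustration:

```shell
# Hypothetical helper: keep the newest line per username#source pair,
# printed in the order each pair first appears in the input.
latest_per_key() {
  awk -F'#' '
    {
      key = $1 FS $2          # username#source
      stamp = $3 " " $4       # date and time, compared as a string
      if (!(key in best)) order[++n] = key
      if (!(key in best) || stamp > best[key]) {
        best[key] = stamp
        line[key] = $0
      }
    }
    END { for (i = 1; i <= n; i++) print line[order[i]] }
  ' "$@"
}

# Usage: latest_per_key file
```

This holds one line per distinct key in memory, which is usually fine even for a huge file as long as the number of distinct username#source pairs is moderate.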

First, you need to sort the input file so that, for each duplicate username#source pair, the lines are ordered by time. It is best to sort in reverse, so the last event comes first. This can be done with a simple sort:
sort -r < yourfile
From your input, this produces:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A222222#Juniper#2014-08-07#14:31:40
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
These are reverse-ordered lines where, for each username#source combination, the latest event comes first.
Next, you need to filter the sorted lines to keep only the first event for each combination. This can be done with several tools, such as awk, uniq, or perl.
So, the solution is:
sort -r <yourfile | uniq -w16
or
sort -r <yourfile | awk -F# '!seen[$1,$2]++'
or
sort -r yourfile | perl -F'#' -lanE 'say $_ unless $seen{"$F[0],$F[1]"}++'
All of the above print:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
Finally, you can re-sort the unique lines however you need.
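If the final order should match the original file, one option is to tag each line with its line number before sorting and sort on that number at the end. A rough sketch, assuming GNU or POSIX sort and that no field contains a "#"; the function name latest_keep_order is only illustrative:

```shell
# Hypothetical helper: newest line per username#source pair,
# output in the order those lines had in the original file.
latest_keep_order() {
  awk '{ print NR "#" $0 }' "$@" |                    # prepend line number as field 1
    LC_ALL=C sort -t'#' -k2,2 -k3,3 -k4,4r -k5,5r |   # group by key, newest first
    awk -F'#' '!seen[$2, $3]++' |                     # keep the first (newest) line per key
    sort -t'#' -k1,1n |                               # back to original line order
    cut -d'#' -f2-                                    # drop the line number
}

# Usage: latest_keep_order yourfile
```

Sorting on explicit keys rather than the whole line makes it clearer which columns decide the grouping and which decide what counts as "latest".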

awk -F\# '{ p = ($1 FS $2 in a ); a[$1 FS $2] = $0 }
!p { keys[++k] = $1 FS $2 }
END { for (k = 1; k in keys; ++k) print a[keys[k]] }' file
Output:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30

If you know for a fact that the first column is always 7 chars long, and second column also 7 chars long, you can extract unique lines considering only the first 16 characters with:
uniq -w 16 file
Since you want the latter duplicate, you can reverse the data using tac prior to uniq and then reverse the output again:
tac file | uniq -w 16 | tac
Update: As commented below, uniq needs the lines to be sorted. In which case this starts to become contrived, and the awk based suggestions are better. Something like this would still work though:
sort -s -t"#" -k1,2 file | tac | uniq -w 16 | tac
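Since the -w 16 variants depend on both key columns being exactly 7 characters wide, it may be worth checking that assumption before relying on them. A small sketch; the function name keys_are_fixed_width is invented here:

```shell
# Hypothetical check: succeed only if every line has a 7-character
# username and a 7-character source, so the key is the first 16 characters.
keys_are_fixed_width() {
  awk -F'#' '
    length($1) != 7 || length($2) != 7 { bad = 1; exit }
    END { exit bad }
  ' "$@"
}

# Usage: keys_are_fixed_width file && echo "uniq -w 16 is safe to use"
```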

Related

Merge unsorted lines from two files based on similar part

I am wondering whether it is possible to merge information from two files based on a shared part. file1 contains the IDs of sequences after a BLAST run, and file2 contains taxonomic names corresponding to the first two numbers in each sequence name.
file 1:
>301-89_IDNAGNDJ_171582
>301-88_ALPEKDJF_119660
>301-88_ALPEKDJF_112039
...
file2:
301-89--sample1
301-88--sample2
...
output:
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
The files are unsorted, and file1 contains multiple lines whose first two numbers match the first two numbers of a single line in file2. I am looking for tips on how to do this. Is it possible, and which command or language should I use?
(mawk/nawk/gawk -e/-ce/-Pe) '
FNR == !_ {
_ = ! ( ___=match(FS=FNR==NR ? "[-][-]" : "[>_]", "[>-]"))
$_ = $_
} FNR == NR { __[$!_]="--"$NF; next } sub("$", __[$___])' file2.txt file1.txt
———————————————————————————
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
Using awk
$ awk -F"[_-]" 'BEGIN{OFS="-"}NR==FNR{a[$2]=$4;next}{print $0,a[$2]}' file2 OFS="--" file1
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
I am wondering whether it is possible to merge information from two files based on a shared part
Yes ...
The files are unsorted
... but only if they're sorted.
It's easier if we transform them so the delimiters are consistent, and then format it back together later:
sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 produces
301-88 ALPEKDJF_112039
301-88 ALPEKDJF_119660
301-89 IDNAGNDJ_171582
...
which we can just pipe through sort -k1
sed 's/--/ /' f2 produces
301-89 sample1
301-88 sample2
...
which we can sort the same way
join sorted1 sorted2 (with the sorted results of the previous steps) produces
301-88 ALPEKDJF_112039 sample2
301-88 ALPEKDJF_119660 sample2
301-89 IDNAGNDJ_171582 sample1
...
and finally we can format those 3 fields as you originally wanted, by piping through
sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'
If it's reasonable to sort them on the fly, we can just do that using process substitution:
$ join \
<( sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' f1 | sort -k1 ) \
<( sed 's/--/ /' f2 | sort -k1 ) \
| sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
>301-89_IDNAGNDJ_171582--sample1
...
If it's not reasonable to sort the files - on the fly or otherwise - you're going to end up building a hash in memory, like the awk answer is doing. Give them both a try and see which is faster.
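For comparison, here is a more spelled-out sketch of the in-memory hash approach. It assumes the key is everything between the leading ">" and the first "_" in file1, and everything before "--" in file2. The function name merge_by_prefix is made up for this example:

```shell
# Hypothetical helper: append the sample name from the mapping file
# to each ID line, preserving the order of the ID file.
# $1: mapping file (lines like 301-89--sample1)
# $2: ID file      (lines like >301-89_IDNAGNDJ_171582)
merge_by_prefix() {
  awk '
    NR == FNR {                      # first file: build prefix -> name table
      split($0, kv, "--")
      name[kv[1]] = kv[2]
      next
    }
    {                                # second file: look up each prefix
      prefix = $0
      sub(/^>/, "", prefix)          # drop the leading ">"
      sub(/_.*/, "", prefix)         # keep only the part before the first "_"
      print $0 "--" name[prefix]
    }
  ' "$1" "$2"
}

# Usage: merge_by_prefix file2 file1
```

Only file2 is held in memory, so this works without sorting either file as long as the mapping table fits in RAM.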

AWK : To print data of a file in sorted order of result obtained from columns

I have an input file that looks somewhat like this:
PlayerId,Name,Score1,Score2
1,A,40,20
2,B,30,10
3,C,25,28
I want to write an awk command that checks for players with sum of scores greater than 50 and outputs the PlayerId,and PlayerName in sorted order of their total score.
When I try the following:
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k5
It does not work and seemingly sorts them on the basis of their ids.
1 A
3 C
Whereas the correct output I'm expecting is : ( since Player A has sum of scores=60, and C has sum of scores=53, and we want the output to be sorted in ascending order )
3 C
1 A
In addition to this, what confuses me a bit is that when I try to sort on the basis of score1, i.e. column 3, but intend to print only the corresponding ids and names, it doesn't work either.
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k3
And outputs :
1 A
3 C
But if the $3 with respect to what the data is being sorted is included in the print,
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50)print $1,$2,$3}' | sort -k3
It produces the correct output ( but includes the unwanted score1 parameter in display )
3 C 25
1 A 40
But what if one wants to only print the id and name fields ?
Actually I'm new to awk commands, and probably I'm not using the sort command correctly. It would be really helpful if someone could explain.
I think this is what you're trying to do:
$ awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' file |
sort -k1,1n | cut -d' ' -f2-
3 C
1 A
You have to print the sum so you can sort by it and then the cut removes it.
If you wanted the header output too then it'd be:
$ awk 'BEGIN{FS=","} {sum=$3+$4} (NR==1) || (sum>50){print (NR>1),sum,$1,$2}' file |
sort -k1,1n -k2,2n | cut -d' ' -f3-
PlayerId Name
3 C
1 A
If you outsource the sorting, you need to print the auxiliary value and cut it out later; the extra complication comes from preserving the header.
$ awk -F, 'NR==1 {print s "\t" $1 FS $2; next}
(s=$3+$4)>50 {print s "\t" $1 FS $2 | "sort -n" }' file | cut -f2
PlayerId,Name
3,C
1,A

Awk to (random) sample a file by id-uniques criteria

I'm learning AWK to read a big file which format is similar to this MasterFile:
Beth|4.00|0|
Dan|3.75|0|
Kathy|4.00|10|
Mark|5.00|20|
Mary|5.50|22|
Susie|4.25|18|
Jise|5.62|0|
Mark|5.60|23.3|
Mary|8.50|42|
Susie|8.75|8.8|
Jise|3.62|0.8|
Beth|3.21|10|
Dan|8.39|20|
I would like to draw a sample of K unique values (I choose K) from the first column, which has N values in total.
What I have done is the following: I select the unique values from the first column and save them as IDfile.txt. Then I take K random values from that file and match them against the MasterFile. I mean:
awk -F\| 'BEGIN{srand()}{print rand() " " $0}' IDfile | sort -n | tail -n K| awk -F'[[:blank:]|]+' 'BEGIN{OFS="|"}{$1="";sub(/\|/,"")}'1>tmp | awk -F\| 'NR==FNR{a[$1];next} {for (i in a) if(index($0,i)) print $0}' tmp MasterFile
But the output has repeated values, and the result that I'd like to get looks like this (assuming K=3):
Beth|4.00|0|
Mark|5.60|23.3|
Mary|5.50|22|
I know that my code is far from efficient [or nice] and I'm open to suggestions.
Thanks!
This is one of the right ways to do this:
$ sort -t'|' -u -k1,1 file | shuf -n3
Mark|5.00|20|
Kathy|4.00|10|
Jise|5.62|0|
Change -n3 to whatever number of unique entries you need.
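Note that sort -u always keeps the same row for each ID, so only the choice of IDs is random. If the row picked for each ID should be random as well, a possible variation is to shuffle first. This is only a sketch, assuming GNU shuf is available; the function name sample_unique_ids is hypothetical:

```shell
# Hypothetical helper: print one randomly chosen row for each of K
# randomly chosen IDs (first "|"-separated column).
# $1: K, the number of IDs to sample; $2: input file
sample_unique_ids() {
  shuf "$2" |                      # random row order
    awk -F'|' '!seen[$1]++' |      # first row seen per ID, i.e. a random one
    shuf -n "$1"                   # K of those rows, chosen uniformly
}

# Usage: sample_unique_ids 3 MasterFile
```

The second shuf matters: without it, IDs that occur on more rows would tend to come first and be over-represented.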

Sort a file in unix by the absolute value of a field

I want to sort this file by the absolute value of the Linear regression (p) column in descending order. My attempt to do this didn't quite work, and I'm not sure why it fails. I found this code at http://www.unix.com/shell-programming-and-scripting/168144-sort-absolute-value.html.
awk -F',' '{print ($2>=0)?$2:-$2, $0}' OFS=',' mycsv1.csv | sort -n -k8,8 | cut -d ',' -f2-
X var,Y var,MIC (strength),MIC-p^2 (nonlinearity),MAS (non-monotonicity),MEV (functionality),MCN (complexity),Linear regression (p)
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
...
Please help me to understand the awk script to sort this file.
You could use sed and sort for this, following @hek2mgl's very smart logic of adding a field at the end and removing it afterwards so that the original number is retained:
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' file | sort -t, -k9,9 -nr | cut -f1-8 -d,
sed -E 's/,([-]?)([0-9.]+)$/,\1\2,\2/' => creates field 9 as the absolute value of field 8
sort -t, -k9,9 -nr => sorts by the newly created field, numeric and descending order
cut -f1-8 -d, => removes the 9th field, restoring the output to its original format, with the desired sorting order
Here is the output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
Take three steps:
(1) Temporarily create a 9th field which contains the abs value of field 8:
LC_COLLATE=C awk -F, 'NR>1{v=$NF;sub(/-/,"",v);printf "%s%s%s%s",$0,FS,v,RS}' file
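Make sure LC_COLLATE=C is set, since sorting, especially of the decimal point, depends on the locale. (As written, the prefix applies only to the awk process; to affect the sort in step 2, set it on the sort command as well.)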
(2) Sort that output based on the 9th field:
command_1 | sort -t, -k9r
(3) Pipe that through cut to remove the temporary 9th field, keeping only the original fields 1-8:
command_2 | cut -d, -f1-8
Output:
AT1G01030,AT3G06520,0.61732,0.17639545,0.23569,0.58557,4.0,0.6640215
AT1G01030,AT1G55280,0.57287,0.20705527,0.19536,0.52857,4.0,0.6048262
AT1G01030,AT1G80040,0.56268,0.22935495,0.18583998,0.52728,4.0,-0.5773431
AT1G01030,AT1G32310,0.67958,0.4832027,0.32644996,0.63247,4.0,-0.44314474
AT1G01030,AT5G30490,0.56509,0.37536618,0.16172999,0.51847,4.0,-0.43557298
AT1G01030,AT5G42580,0.61579,0.5019064,0.30105,0.58143,4.0,0.33746648
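With gawk you could do it all in one command (asorti is a gawk extension; rows with identical absolute values would overwrite each other):
awk -F, 'NR>1{n[substr($NF,1,1)=="-"?substr($NF,2):$NF]=$0}NR==1;END{c=asorti(n,out);for(i=c;i>=1;i--)print n[out[i]]}' file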
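The add-a-key, sort, remove-the-key idea used in these answers can also be written so that the header row stays on top and the column count does not matter. A minimal sketch, assuming the first line is a header, the value to sort by is the last comma-separated field, and no field contains an embedded comma. The function name sort_by_abs_last_field is invented for illustration:

```shell
# Hypothetical helper: sort a CSV by the absolute value of its last
# field, descending, keeping the header line first.
# $1: CSV file
sort_by_abs_last_field() {
  head -n 1 "$1"                                      # header passes through untouched
  tail -n +2 "$1" |
    awk -F, -v OFS=, '{ v = $NF; sub(/^-/, "", v); print v, $0 }' |  # prepend |value|
    LC_ALL=C sort -t, -k1,1nr |                       # numeric, descending, "." as decimal point
    cut -d, -f2-                                      # drop the temporary key
}

# Usage: sort_by_abs_last_field mycsv1.csv
```

Putting the temporary key in front rather than at the end means the cut step is the same whatever the number of columns.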

Unix - Sorting file name with a key but not knowing its position

I would like to sort those files using Unix commands:
MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz
The result I am expecting here is MyFile_fdfdsf_20140326.txt
So I'd like to get the file with the newest date.
I can't use 'sort -k', as the position of the key (the date) may vary
But in my file name there are always two "_" delimiters and a dot '.' for the file extension
Any help would be appreciated :)
Then use -t to indicate the field separator and set it to _:
sort -t'_' -k3
See an example of sorting the file names if they are in a file. I used -n for numeric sort and -r for reverse order:
$ sort -t'_' -nk3 file
MyFile_dfgfdklm_19990101.tar.gz
MyFile_4fg5d6_20100301.csv
MyFile_fdfdsf_20140326.txt
$ sort -t'_' -rnk3 file
MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz
From man sort:
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
-n, --numeric-sort
compare according to string numerical value
-r, --reverse
reverse the result of comparisons
Update
Thank you for you answer. It's perfect. But out of curiosity, what if
I had an unknown number of delimiters, but the date was always after
the last "_" delimiter. MyFile_abc_def_...20140326.txt sort -t''
-nk??? file – user3464809
You can trick it a little bit: print the last field, sort and then remove it.
awk -F_ '{print $NF, $0}' a | sort | cut -d' ' -f2-
See an example:
$ cat a
MyFile_fdfdsf_20140326.txt
MyFile_4fg5d6_20100301.csv
MyFile_dfgfdklm_19990101.tar.gz
MyFile_dfgfdklm_asdf_asdfsadfas_19940101.tar.gz
MyFile_dfgfdklm_asdf_asdfsadfas_29990101.tar.gz
$ awk -F_ '{print $NF, $0}' a | sort | cut -d' ' -f2-
MyFile_dfgfdklm_asdf_asdfsadfas_19940101.tar.gz
MyFile_dfgfdklm_19990101.tar.gz
MyFile_4fg5d6_20100301.csv
MyFile_fdfdsf_20140326.txt
MyFile_dfgfdklm_asdf_asdfsadfas_29990101.tar.gz
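Since the original question only asks for the newest file, the same trick can be reduced to printing a single name. A small sketch, assuming the date always follows the last "_" as YYYYMMDD and that file names contain no tab characters; the function name newest_by_trailing_date is made up here:

```shell
# Hypothetical helper: print the name whose trailing date is the latest.
# Reads names from the given file(s) or from standard input.
newest_by_trailing_date() {
  awk -F_ '{ print $NF "\t" $0 }' "$@" |   # put the date part in front as a sort key
    LC_ALL=C sort |                        # 8-digit dates sort correctly as plain strings
    tail -n 1 |                            # the last line has the latest date
    cut -f2-                               # remove the key again
}

# Usage: ls | newest_by_trailing_date
```

A tab is used as the separator between key and name so that names containing spaces or extra underscores come through unchanged.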

Resources