I'm scratching my head about the results of sorting two columns with unix sort.
Here's some dummy data in a file called test:
A 2e-12
A 3e-14
A 1e-15
A 1.2e-13
B 1e-13
B 1e-14
C 4e-12
C 3e-12
I would like to sort by column 1 first, then column 2, to produce:
A 1e-15
A 3e-14
A 1.2e-13
A 2e-12
B 1e-14
B 1e-13
C 3e-12
C 4e-12
If I give it just the second column to sort on, it will sort the scientific notation correctly:
sort -g -k2 test
A 1e-15
B 1e-14
A 3e-14
B 1e-13
A 1.2e-13
A 2e-12
C 3e-12
C 4e-12
This stack question addresses a similar problem, but it seems that my test only breaks down when I ask for two columns to sort on.
This other example looks really close to what I want, but when I give separate -k it doesn't alter the behavior for my test set.
These trials:
sort -k1,1 -g test
sort -k1,1 -g -k1,2 test
sort -k1,1 -g -k2,1 test
Produce:
A 1.2e-13
A 1e-15
A 2e-12
A 3e-14
B 1e-13
B 1e-14
C 3e-12
C 4e-12
And these trials:
sort -g -k2 -k1 test
sort -g -k2 -k1,1 test
sort -g -k2,2 -k1,1 test
sort -k1,1 -g -k2,2 test
Produce:
A 1e-15
B 1e-14
A 3e-14
B 1e-13
A 1.2e-13
A 2e-12
C 3e-12
C 4e-12
I have tested with LANG=C and LC_ALL=C without luck. I'm running this on Red Hat and the version is GNU coreutils 8.22.
I figured it out while writing the stack question, so I thought I'd just go ahead and post the question with my solution.
I was confused about what the -kn,n meant and actually using sort with the --debug flag helped me find the answer.
This question pretty much nails it on the head: always use -kX,X to make sure I'm only considering one field at a time, and then specify g in the numeric field.
sort -k1,1 -k2,2g test
A 1e-15
A 3e-14
A 1.2e-13
A 2e-12
B 1e-14
B 1e-13
C 3e-12
C 4e-12
Yay!
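As a minimal sketch of the fix above (assuming GNU sort and a POSIX shell; the temporary file and the variable names are illustrative, not from the original post), the sample data can be recreated and sorted with one bounded key per field:

```shell
# Recreate the sample data from the question in a temporary file.
data=$(mktemp)
printf '%s\n' 'A 2e-12' 'A 3e-14' 'A 1e-15' 'A 1.2e-13' \
              'B 1e-13' 'B 1e-14' 'C 4e-12' 'C 3e-12' > "$data"

# -k1,1  : the first key starts and ends at field 1 (plain text comparison).
# -k2,2g : the second key is field 2 only, compared as a general number,
#          so values in scientific notation are ordered by magnitude.
# LC_ALL=C is set only to keep the text comparison independent of the locale.
sorted=$(LC_ALL=C sort -k1,1 -k2,2g "$data")
printf '%s\n' "$sorted"

rm -f "$data"
```

Adding --debug to the same command makes GNU sort mark which part of each line is used for each key, which is one way to spot a key such as -k1 that runs to the end of the line.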
Related
I have a problem when merging two files with the Ubuntu 'join' command. This has bothered me for a week. I often use this command for file editing, and this is the first time I have run into this problem. I have two files that look like this (space-separated here for the example):
file1
A A A
B B B
...
file2
A D D
B E E
...
The files have 20+ columns and hundreds of rows (UTF-8 character set, tab-separated). I use the join command to merge the two files based on the first column. The result should look like this:
A A A D D
B B B E E
But I got this:
A A A
D D
B B B
E E
I tried writing out the two files and then merging them with the 'paste' command and got the same result. Also, I didn't see any newline character at the start or end of each line. I tried sed -i 's/\n\t/\t/g' file, but it doesn't work.
They still don't show up on the same line! Have you run into this problem before? What could be the possible reason?
My commands look like this:
join -1 3 -2 1 -o 1.1,1.2,1.3,1.4,2.2,2.3,2.4 -t $'\t' <(sort -k3,3 file1) <(sort -k1,1 file2)
paste -d"\t" file1 file2
I need a script that sorts a text file and removes the duplicates.
Most, if not all, of the examples out there use the sort file1 | uniq > file2 approach.
In man sort, though, there is a -u option that does this at the time of sorting.
Is there a reason to use one over the other? Maybe availability of the -u option? Or a memory/speed concern?
They should be equivalent in the simple case, but will behave differently if you're using the -k option to define only certain fields of the input line to use as sort keys. In that case, sort -u will suppress lines which have the same key even if other parts of the line differ, whereas uniq will only suppress lines that are exactly identical.
$ cat example
foo baz
quux ping
foo bar
$ sort -k 1,1 --stable example # use just the first word as sort key
foo baz
foo bar
quux ping
$ sort -k 1,1 --stable -u example # suppress lines with the same first word
foo baz
quux ping
but
$ sort -k 1,1 --stable example | uniq
foo baz
foo bar
quux ping
I'm not sure that it's about availability. Most systems I've ever seen have both sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001 and its sort has the -u option.
Technically, using a pipe (|) runs uniq as a second process alongside sort, so it is going to be slightly more resource intensive, as it requests an additional PID from the OS.
If you go to the source code for sort, which comes in the coreutils package, you can see that it actually just skips printing duplicates as it's printing its own sorted list and doesn't make use of the independent uniq code.
To see how it works follow the link to sort's source and see the functions below this comment:
/* If uniquified output is turned on, output only the first of
an identical series of lines. */
Although I believe sort -u should be faster, the performance gains are really going to be minimal unless you're running sort | uniq on huge files, since uniq has to read through the entire sorted output again.
One difference is that 'uniq -c' can count (and print) the number of matches. You lose this ability when you use 'sort -u' for sorting.
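A small sketch of the counting behaviour mentioned above, assuming a POSIX shell. The input letters are made up for illustration, and awk is used only to normalise the padding that uniq -c puts in front of each count:

```shell
# Five illustrative input lines: three "a" and two "b", in mixed order.
# sort groups identical lines together so that uniq -c can count each run.
counts=$(printf '%s\n' b a b a a | LC_ALL=C sort | uniq -c | awk '{print $1 ":" $2}')
printf '%s\n' "$counts"   # count:line pairs, here 3:a then 2:b

# sort -u keeps one copy of each line but has no way to report the counts.
distinct=$(printf '%s\n' b a b a a | LC_ALL=C sort -u)
printf '%s\n' "$distinct"
```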
They should be functionally equivalent, and sort -u should be more efficient.
I'm guessing the examples you're looking at simply didn't consider (or didn't have) "sort -u" as an option.
Does uniq sort?
I do not think so: at least on Ubuntu 18.04 and CentOS 6, it does not. It only removes consecutive duplicates.
You can simply conduct a mini experiment.
Let the file sample.txt be:
a
a
a
b
b
b
a
a
a
b
b
b
cat sample.txt | uniq will output:
a
b
a
b
while cat sample.txt | sort -u will output:
a
b
sort | uniq may be functionally equivalent to sort -u.
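The mini experiment above can be run without creating a file. This is a sketch assuming a POSIX shell; the variable names are illustrative only:

```shell
# Same content as sample.txt: runs of a, b, a, b.
sample=$(printf '%s\n' a a a b b b a a a b b b)

# uniq alone only collapses adjacent duplicates, so both runs of each letter survive.
only_uniq=$(printf '%s\n' "$sample" | uniq)

# Sorting first makes all duplicates adjacent; both pipelines then keep one line each.
sort_then_uniq=$(printf '%s\n' "$sample" | LC_ALL=C sort | uniq)
sort_u=$(printf '%s\n' "$sample" | LC_ALL=C sort -u)

printf 'uniq only:\n%s\n' "$only_uniq"
printf 'sort | uniq:\n%s\n' "$sort_then_uniq"
printf 'sort -u:\n%s\n' "$sort_u"
```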
I have several files in a directory in my unix system that I need to sort. The problem I'm having is that when using the sort -f command it sorts in the order a A b B c C etc. ls does the same ordering. Is there a way I can make it sort with the uppercase letter coming first? i.e. sort in the order A a B b C c ...
With multiple passes through the sort command I think this is what you want:
ls | sort -f -r | sort
Sorry for the vague title, I couldn't think of a better one...
I have 2 tab-delimited files with identical first columns (different numbers of total columns). I would like to sort both files by their first column.
I think I could do this either with the -t\t option or the -k1,12 option (since the first column is never longer than 12 characters). Both options produce the same (wrong) output.
Even though both files have the same first column, they are sorted differently. Notice that for file1 I get 23, 29, 2, while for file2 I get 2, 23, 29.
$ head file1 | sort -t\t | cut -f1
rs1000000
rs10000010
rs10000012
rs10000013
rs10000017
rs10000023
rs10000029
rs1000002
rs10000030
$ head file2 | sort -t\t | cut -f1
rs1000000
rs10000010
rs10000012
rs10000013
rs10000017
rs1000002
rs10000023
rs10000029
rs10000030
How can I sort both files such that the first column is in the same order in each?
Thank you!
sort -t $'\t' -k 1,1
Use $'\t' to have the shell interpret \t as a tab since sort doesn't parse escape sequences. Use -k to tell it to only sort on the first field rather than the entire line.
You might also want the -V flag if you want 2 to sort in between 0 and 10.
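A short sketch of both suggestions, assuming GNU sort (the V ordering is a GNU extension). The three IDs come from the question, while the second column "x" and the variable names are placeholders. The tab is built with printf so the example does not depend on the shell supporting $'\t':

```shell
# A literal tab character, usable in any POSIX shell.
tab=$(printf '\t')

# Three tab-separated rows; only the first column matters here.
rows=$(printf '%s\tx\n' rs10000023 rs1000002 rs10000010)

# Key restricted to field 1, compared as plain text (C locale for repeatability).
by_first=$(printf '%s\n' "$rows" | LC_ALL=C sort -t "$tab" -k1,1 | cut -f1)

# Same key with version ordering, so the embedded numbers compare by value.
by_version=$(printf '%s\n' "$rows" | LC_ALL=C sort -t "$tab" -k1,1V | cut -f1)

printf 'text order:\n%s\n' "$by_first"       # rs10000010, rs1000002, rs10000023
printf 'version order:\n%s\n' "$by_version"  # rs1000002, rs10000010, rs10000023
```

Because the key stops at the first tab, the content of the remaining columns no longer affects the order, so two files with the same first column come out in the same order.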
Assuming there is a text file:
10 A QAZ
5 A EDC
14 B RFV
3 A WSX
7 B TGB
I want to sort it with the second column as the main column with alphabet order and the first column as the secondary column with numeric order. The following is the expected result:
3 A WSX
5 A EDC
10 A QAZ
7 B TGB
14 B RFV
I tried sort -k 2,2 -k 1,1 a.txt -n and sort -k 2,2 -k 1,1 a.txt but both give the wrong results. How should I solve this problem? Thanks.
This should work:
sort -b -k2,2 -k1,1n
The -b is essential: without it, the output is wrong, because the leading blanks before the second column are counted as part of that field and sort compares them along with the text. See man sort (or here) for details.
Also, check your locale settings. They can influence how sort works.
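A minimal sketch of that command applied to the sample data from the question, assuming a POSIX shell. The variable names are illustrative, and LC_ALL=C is set only to rule out locale effects:

```shell
# The five sample rows from the question.
input=$(printf '%s\n' '10 A QAZ' '5 A EDC' '14 B RFV' '3 A WSX' '7 B TGB')

# -b     : ignore leading blanks when locating and comparing keys
# -k2,2  : primary key is field 2 only, compared as text
# -k1,1n : secondary key is field 1 only, compared numerically
result=$(printf '%s\n' "$input" | LC_ALL=C sort -b -k2,2 -k1,1n)
printf '%s\n' "$result"
```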
This might work for you:
sort -k1.5,1.8 -k1.1,1.4n file