Unexpected bash sort behavior

If I create a text file containing the following lines:
>TESTTEXT_10000000
>TESTTEXT_1000000
>TESTTEXT_10000002
>TESTTEXT_10000001
and perform sort myfile, my output is
>TESTTEXT_1000000
>TESTTEXT_10000000
>TESTTEXT_10000001
>TESTTEXT_10000002
However, if I append /1 and /2 to my lines the sort output changes drastically, and I do not know why.
Input:
>TESTTEXT_10000000/1
>TESTTEXT_1000000/1
>TESTTEXT_10000002/1
>TESTTEXT_10000001/1
Output:
>TESTTEXT_10000000/1
>TESTTEXT_1000000/1
>TESTTEXT_10000001/1
>TESTTEXT_10000002/1
Input:
>TESTTEXT_10000000/2
>TESTTEXT_1000000/2
>TESTTEXT_10000002/2
>TESTTEXT_10000001/2
Output:
>TESTTEXT_10000000/2
>TESTTEXT_10000001/2
>TESTTEXT_1000000/2
>TESTTEXT_10000002/2
Is the forward slash being recognised as a separator? Using --field-separator did not alter the behaviour. If so, why is 1000000/2 in between the 10000001/2 and 10000002/2 entries? Using the human sort, numeric sort or other options never brought about consistency. Can anyone help me out here?
:edit:
Because it seems to be relevant, considering the answers, the value of LC_ALL on this machine is en_GB.UTF-8

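In a locale such as en_GB.UTF-8, the collation rules can ignore punctuation such as / on the first comparison pass, so 1000000/2 is compared as 10000002 and lands between 10000001/2 and 10000002/2. With LC_ALL=C, sort compares bytes instead, and / sorts just before 0.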
In your use case you would probably be able to use version sort (-V):
sort -V myfile
Alternatively, you can specify the separator and the keys to sort on:
sort -t/ -k1,1 myfile
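As a rough illustration of why the locale matters here, the sketch below sorts the /2 sample twice: once on a key with every non-alphanumeric character stripped (a hand-rolled approximation of collation that ignores punctuation, not what the locale literally implements), and once byte-wise with LC_ALL=C. The variable names are made up for this example.

```shell
# Sample lines from the question (the /2 variant).
input='>TESTTEXT_10000000/2
>TESTTEXT_1000000/2
>TESTTEXT_10000002/2
>TESTTEXT_10000001/2'

tab=$(printf '\t')

# Approximation only: build a key without punctuation, sort on that key
# byte-wise, then drop the key again.
ignoring_punct=$(printf '%s\n' "$input" |
  LC_ALL=C awk '{ key = $0; gsub(/[^A-Za-z0-9]/, "", key); print key "\t" $0 }' |
  LC_ALL=C sort -t "$tab" -k1,1 |
  cut -f2-)

# Byte-wise comparison: "/" (0x2F) sorts just before "0" (0x30).
bytewise=$(printf '%s\n' "$input" | LC_ALL=C sort)

printf 'punctuation ignored:\n%s\n\nLC_ALL=C:\n%s\n' "$ignoring_punct" "$bytewise"
```

The first ordering reproduces the output reported in the question; the second is the order the question expected.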

Related

How to sort characters after a period

Need help on how to sort characters or numbers after a period(.)
test2.rod1
test1.rod1
test3.rod1
test1.mor2
test2.mor2
test3.mor2
zbcd1.abc1
abcd2.abc1
dbcd3.abc1
I would like to sort by whatever comes after the period (.). The result should look something like this:
abcd2.abc1
dbcd3.abc1
zbcd1.abc1
test1.mor2
test2.mor2
test3.mor2
test2.rod1
test1.rod1
test3.rod1
If you're using a system with Unix-like utilities, such as macOS, Linux, or BSD, you can use the system sort command. The trick is to specify the field delimiter, which in your case is a period. The option is either -t or --field-separator. So the following should work:
sort -t. -k 2 test.dat
This assumes that your data is in a file called test.dat.
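A small self-contained version of the same idea, feeding the sample data through a pipe instead of a file. LC_ALL=C is an assumption added here so the ordering is byte-wise and reproducible; the variable names are only for illustration.

```shell
# Sample data from the question.
data='test2.rod1
test1.rod1
test3.rod1
test1.mor2
test2.mor2
test3.mor2
zbcd1.abc1
abcd2.abc1
dbcd3.abc1'

# -t. splits fields on the period; -k2 sorts on field 2 through end of line.
by_suffix=$(printf '%s\n' "$data" | LC_ALL=C sort -t. -k2)

printf '%s\n' "$by_suffix"
```

Lines whose suffix is identical fall back to a comparison of the whole line, so within each group the names come out in order (test1, test2, test3).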

Unix sort: sort by specific character following another character

I have a file that contains information in the following form:
"dog/3/cat/6/fish/2/78/90"
(we'll not worry about the last two values here)
Is it possible to sort the contents of the file by the numeric value after the odd-numbered slashes with the Unix sort command?
For instance, the output might look like this:
dog/4/house/3/frog/89/100
dog/3/mouse/2/chicken/12/68/80
dog/2/cat/5/bird/12/77/90
This should give you what you want, I think:
sort -t/ -k2,2nr -k4,4nr -k6,6nr
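To see the command in action, here is a sketch that runs it over the three sample lines from the question, fed in a different order. LC_ALL=C and the variable names are assumptions added for reproducibility.

```shell
# The three example records from the question, deliberately out of order.
records='dog/2/cat/5/bird/12/77/90
dog/4/house/3/frog/89/100
dog/3/mouse/2/chicken/12/68/80'

# Fields are split on "/"; fields 2, 4 and 6 are compared numerically (n)
# in descending order (r), each one acting as a tie-breaker for the previous.
ranked=$(printf '%s\n' "$records" | LC_ALL=C sort -t/ -k2,2nr -k4,4nr -k6,6nr)

printf '%s\n' "$ranked"
```

Writing each key as -kN,N keeps it confined to a single field; a bare -k2 would extend to the end of the line.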

unix sort -n -t"," gives unexpected result

unix numeric sort gives strange results, even when I specify the delimiter.
$ cat example.csv # here's a small example
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
$ cat example.csv | sort -n --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
For this example, sort gives the same result regardless of whether you specify the delimiter. I know that if I set LC_ALL=C, sort starts to give the expected behavior again. But I do not understand why the default environment settings, as shown below, would make this happen.
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
I've read in many other questions (e.g. here, here, and here) how to avoid this behavior in sort, but still, this behavior is incredibly weird and unpredictable and has caused me a week of heartache. Can someone explain why sort with default environment settings on Mac OS X (10.8.5) would behave this way? In other words: what is sort doing (with the locale variables set to en_US.UTF-8) to get that result?
I'm using
sort 5.93 November 2005
$ type sort
sort is /usr/bin/sort
UPDATE
I've discussed this on the GNU coreutils list and now understand why sort with the default English UTF-8 locale settings gave the output it did. In that locale, the comma character "," is considered part of a number (so as to allow commas as thousands (or, e.g., hundreds) separators), and sort defaults to "being greedy" when it interprets a line, so it read the example numbers as approximately
581.491...
590.000...
590.001...
591.492...
600.001...
601.492...
1213.08...
1214.15...
1226.76...
1250.45...
This was not what I had intended, and chepner is right that to get the result I actually want, I need to tell sort to key on only the first field. By default, sort interprets more of the line as the key rather than just the first field.
This behavior of sort is discussed in the GNU coreutils FAQ and is further specified in the POSIX description of sort.
So, as Eric Blake on the GNU coreutils list put it, if the field separator is also a numeric character (which a comma is), then "Without -k to stop things, [the field-separator] serves as BOTH a separator AND a numeric character - you are sorting on numbers that span multiple fields."
I'm not sure this is entirely correct, but it's close.
sort -n -t, will try to sort numerically by the given key(s). In this case, the key is a tuple consisting of an integer and a float. Such tuples cannot be sorted numerically.
If you explicitly specify which single keys to sort on with
sort -k1,1n -k2,2n -t,
it should work. Now you are explicitly telling sort to first sort on the first field (numerically), then on the second field (also numerically).
I suspect that -n is useful as a global option only if each line of the input consists of a single numerical value. Otherwise, you need to use the -n option in conjunction with the -k option to specify exactly which fields are numbers.
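The explicit-key form can be tried on the sample data from the question as follows. This is a sketch: LC_ALL=C is added as an assumption so that the comma is never read as a thousands separator, and the variable names are illustrative.

```shell
# example.csv from the question, inlined.
csv='58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035'

# Key 1 is field 1 only, key 2 is field 2 only, both compared numerically.
sorted=$(printf '%s\n' "$csv" | LC_ALL=C sort -t, -k1,1n -k2,2n)

printf '%s\n' "$sorted"
```

The four 12,... rows now come first, followed by 58, 59 and 60, with the second column ascending inside each group.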
Use sort --debug to find out what's going on.
I've used that to explain in detail your issue at:
http://lists.gnu.org/archive/html/coreutils/2013-10/msg00004.html
If you use
cat example.csv | sort
instead of
cat example.csv | sort -n --field-separator=,
then it gives the correct output for this sample. Note that plain sort compares the lines as text rather than as numbers, so this is not a general substitute for a numeric sort.
Note: I tested with "sort (GNU coreutils) 7.4"

Removing duplicate lines from text file

I'm trying to remove all duplicate lines from a file and using this command:
sort text.txt | uniq -u > ALL.txt
But am getting this error:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `http://lestarsmagazine.com/2011/10/07/adja-ndoye-ex-mannequin-\253-balla-gaye-adja-diallo-mara-ndiaye-l\222alcool-la-drogue-et-moi-\273/2691278-3806038/ | 0\r' and `http://sopfree.com/slight-conditioning/ | 0\r'.
What do I need to change the command to in order to work around this problem?
LC_ALL='C' sort text.txt | LC_ALL='C' uniq > ALL.txt
Edit: Removed the '-u'. From your description, it sounds like you shouldn't be using it. You may have misunderstood the man page: that option prints only the lines that are not repeated in the input, rather than collapsing duplicates into one.
The problem is not that your command is incorrect, but rather your data. From the error, it looks like the line separators in text.txt are incorrect or mangled. I'd strongly suggest you review your data (even just opening it in a text editor and saving it back out again may fix it) or post it here so someone else can review it.
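The difference between uniq and uniq -u is easy to check on a toy input. The four-line sample and the variable names below are made up for illustration.

```shell
# Toy input with one duplicated line ("b").
lines='b
a
b
c'

# uniq collapses adjacent duplicates, keeping one copy of each line.
deduped=$(printf '%s\n' "$lines" | LC_ALL=C sort | LC_ALL=C uniq)

# uniq -u keeps only lines that occur exactly once, so "b" disappears.
only_singletons=$(printf '%s\n' "$lines" | LC_ALL=C sort | LC_ALL=C uniq -u)

printf 'uniq:\n%s\n\nuniq -u:\n%s\n' "$deduped" "$only_singletons"
```

If a single copy of every line is the goal, sort -u does the same as sort piped into uniq in one step.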

linux sort unexpected output

I use sort file
ABC
AB-C
ABCDEFG-HI
I get
ABC
AB-C
ABCDEFG-HI
Why does sort order the strings this way? How do I make it sort '-' alphabetically?
The solution provided by @cnicutar is correct, but the reason needs explanation, which is why I'm giving a new answer.
After the discussion with @cnicutar, where in the end I suspected a bug in coreutils' sort, I found that this sorting behavior is expected:
At that point sort appears broken because case is folded and punctuation is ignored because ‘en_US.UTF-8’ specifies this behavior.
So to sort, your input seems to be mapped as follows:
ABC -> ABC
AB-C -> ABC
ABCDEFG-HI -> ABCDEFGHI
If you want pure ASCII sorting, you need to call LC_ALL=C sort (this temporarily sets the locale to C when calling sort, which means "standard" behavior without localization; you can also use POSIX instead of C).
On other Unixes this behavior seems to be different (tested on Mac OS X, whose userland tools are derived from FreeBSD), but LC_ALL=C sort should yield the same behavior across all POSIX systems.
I remember this :)) try
[cnicutar@aiur ~]$ LANG=POSIX sort
ABC
AB-C
ABCDEFG-HI
AB-C
ABC
ABCDEFG-HI
Alternatively LANG=C should work.
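A non-interactive version of the same check, as a sketch. It uses LC_ALL rather than LANG because LC_ALL takes precedence over the other locale variables; the variable names are illustrative.

```shell
# The three strings from the question.
words='ABC
AB-C
ABCDEFG-HI'

# Byte-wise comparison: "-" (0x2D) sorts before "C" (0x43),
# so AB-C comes ahead of ABC.
c_order=$(printf '%s\n' "$words" | LC_ALL=C sort)

printf '%s\n' "$c_order"
```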
