sort on pipe-delimited fields not behaving as expected - sorting

Consider this tiny text file:
ab
a
If we run it through sort(1), we get
a
ab
because of course a comes before ab.
But now consider this file:
ab|c
a|c
If we run it through sort -t'|', we again expect a to sort before ab, but it does not! (Try it under your version of Unix and see.)
What I think is happening here is that the -t option to sort is not really delimiting fields -- it may be changing the way (say) the start of field 2 would be found, but it's not changing the way field 1 ends. a|c sorts after ab|c because '|' comes after 'b' in ASCII. (It's as if the -t'|' argument is ignored, because you get the same result without it.)
So is this a bug in sort or in my understanding of it? And is there a way to sort on the first pipe-delimited field properly?
This question came up in my attempt to answer another SO question, Join Statement omitting entries .

sort's default behavior is to treat everything from field 1 to the end of the line as the sort key. If you want it to sort on field 1 first, then field 2, you need to specify that explicitly.
$ sort -k1,1 -k2,2 -t'|' <<< $'ab|c\na|c'
a|c
ab|c
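To see the difference side by side, here is a minimal sketch that runs the two-line sample from the question with and without an explicit key. It assumes GNU sort and forces the C locale so that the comparison is by byte value; the variable names are only for illustration.

```shell
# Two-line sample from the question.
input=$(printf 'ab|c\na|c\n')

# No -k: the whole line is the key, so '|' (0x7C) ends up compared with 'b' (0x62).
whole_line=$(printf '%s\n' "$input" | LC_ALL=C sort -t'|')

# -k1,1: the key stops at the end of field 1, so "a" is compared with "ab".
first_field=$(printf '%s\n' "$input" | LC_ALL=C sort -t'|' -k1,1)

printf 'whole line as key:\n%s\n\n' "$whole_line"
printf 'first field as key:\n%s\n' "$first_field"
```

Only the second invocation puts a|c ahead of ab|c, because -t on its own changes how fields are found but not which part of the line is compared.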

Related

Terminal: SORT command; how to sort correctly?

I have written a shell script that gets all the file names from a folder, and all its sub-folders, and copies them to the clipboard after sorting (removing all paths; I just need a simple file list of the thousands of randomly named files within).
What I can’t figure out is how to get the SORT command to sort properly. Meaning, the way a spreadsheet would sort things. Or the way your Mac finder sorts things.
Underscores > numbers > letters (regardless of case)
Anyone know how to do this? Sort -n only works for files starting with numbers, sort -f was close but separated the lower case and capitals in a weird way, and anything starting with a number was all over the place. Sort -V was the closest, but anything started with an underscore went to the bottom instead of the top… I’m about to lose my mind. 🤣
I’ve been trying to figure this out for a week, and no combination of anything I have tried gets the sort command to actually, ya know, sort properly.
Help?
If I understand the problem correctly, you want the "natural sort order" as described in Natural sort order - Wikipedia, Sorting for Humans : Natural Sort Order, and macos - How does finder sort folders when they contain digits and characters?.
Using Linux sort(1) you need the -V (--version-sort) option for "natural" sort. You also need the -f (--ignore-case) option to disregard the case of letters. So, assuming that the file names are stored one-per-line in a file called files.txt you can produce a list (mostly) sorted in the way that you want with:
sort -Vf files.txt
However, sort -Vf sorts underscores after digits and letters on my system. I've tried using different locales (see How to set locale in the current terminal's session?), but with no success. I can't see a way to change this with sort options (but I may be missing something).
The characters . and ~ seem to consistently sort before numbers and letters with sort -V. A possible hack to work around the problem is to swap underscore with one of them, sort, and then swap again. For example:
tr '_~' '~_' <files.txt | LC_ALL=C sort -Vf | tr '_~' '~_'
seems to do what you want on my system. I've explicitly set the locale for the sort command with LC_ALL=C ... so it should behave the same on other systems. (See Why doesn't sort sort the same on every machine?.)
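As a rough illustration of the swap trick, the sketch below runs both variants on a handful of made-up file names (they stand in for files.txt and are not from the question). It assumes GNU sort with -V support, and it would also swap any literal ~ in real names.

```shell
# Hypothetical file names, one per line.
names=$(printf '%s\n' 'Beta' '10file' '_alpha' 'apple' '2file')

# Version sort with case folding only.
plain=$(printf '%s\n' "$names" | LC_ALL=C sort -Vf)

# Swap '_' and '~' before sorting, then swap them back afterwards.
swapped=$(printf '%s\n' "$names" | tr '_~' '~_' | LC_ALL=C sort -Vf | tr '_~' '~_')

printf 'sort -Vf alone:\n%s\n\n' "$plain"
printf 'with the tr swap:\n%s\n' "$swapped"
```

On a GNU system the underscore name should come out last in the first listing and first in the second, with 2file ahead of 10file in both.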
It appears you want to sort in dictionary order and fold case, so it would be sort -df.

Bash: Sort file numerically, but only where the first field matches a pattern

Due to poor past naming practices, I'm left with a list of names that is proving to be a challenge to work with. The bottom line is that I want the most current name (by date) to be placed in a variable. All the names are listed (unsorted) in a file called bar.txt.
In this case I can't rename, and there's no way to get the actual dates of the images; these names are all I have to go on. The names can follow one of several patterns;
foo
YYYYMMDD-foo
YYYYMMDD##-foo
foo can be anything from a single character to a long string of letters/numbers/symbols. I am interested only in the names matching the second use case, YYYYMMDD-foo, as those are from after we started tagging consistently.
I would like to end up with a variable containing the most recent name that follows the pattern YYYYMMDD-foo.
I know sort -k1 -n < bar.txt, but then I'm not sure how to isolate the second pattern's results to extract what I need.
How do I sort the file to ignore anything but the second pattern, and return the most current date?
Sample
Given that bar.txt looks like this;
test
2017120901-develop-BUILD-31
20170326-TEST-1.2.0
20170406-BUILD-40-1.2.0-test
2010818_001
I would want to extract 20170406-BUILD-40-1.2.0-test
Your requirement has two parts: 1) keep only the names in a certain format, and 2) sort those and take the latest one. I am using Awk and GNU sort together to achieve it:
awk -F'-' 'length($1) == 8' bar.txt | sort -nrk1 | head -n 1
20170406-BUILD-40-1.2.0-test
The solution works by keeping only those lines whose first '-'-separated column is exactly 8 characters long, which corresponds to the YYYYMMDD layout. Once the lines are filtered, sort orders them numerically in reverse on the first field, and head returns the first line, which is the most recent.
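Since the question asks for the result in a variable, here is a slightly stricter sketch of the same idea. It swaps the length check for a grep pattern that requires exactly eight digits followed by a hyphen (it does not check that the digits form a valid date), and it recreates the sample bar.txt in a temporary directory; the names workdir and latest are assumptions for illustration.

```shell
# Recreate the sample bar.txt from the question in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'test' '2017120901-develop-BUILD-31' '20170326-TEST-1.2.0' \
  '20170406-BUILD-40-1.2.0-test' '2010818_001' > "$workdir/bar.txt"

# Keep names that start with exactly eight digits and a '-', sort on that
# date field numerically with the newest first, and keep the first line.
latest=$(grep -E '^[0-9]{8}-' "$workdir/bar.txt" \
  | LC_ALL=C sort -t'-' -k1,1nr \
  | head -n 1)

printf '%s\n' "$latest"
rm -rf "$workdir"
```

Using -t'-' with -k1,1 limits the numeric comparison to the date field instead of the whole line.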

Sorting two files that have the same column gives different sorting

I am sorry for the title but I didn't know how to explain this:
I am trying to sort two files because I want to merge them, they look like this:
test1.txt
rs1010735 224915429
rs1010805 38189142
rs10108 114516330
rs1010863 185432942
rs1010891 110712154
rs1010910 61212213
rs1011124 7533164
and
test2.txt
rs1010735 C
rs1010805 T
rs1010863 T
rs1010891 T
rs10108 C
rs1010910 A
rs1011124 A
I use sort -k1 test1.txt and sort -k1 test2.txt and got this:
test1_sort.txt
rs1010735 224915429
rs1010805 38189142
rs10108 114516330
rs1010863 185432942
rs1010891 110712154
rs1010910 61212213
rs1011124 7533164
and
test2_sort.txt
rs1010735 C
rs1010805 T
rs1010863 T
rs1010891 T
rs10108 C
rs1010910 A
rs1011124 A
Why is the sorting different if both first columns have the same values?
I also tried sort -n -s -k1,1 but got the same result.
Add spaces:
$ sort -k 1,1 /tmp/2
rs1010735 C
rs10108 C
rs1010805 T
rs1010863 T
rs1010891 T
rs1010910 A
rs1011124 A
$ sort -k 1,1 /tmp/1
rs1010735 224915429
rs10108 114516330
rs1010805 38189142
rs1010863 185432942
rs1010891 110712154
rs1010910 61212213
rs1011124 7533164
There are two issues here.
Locale-aware sorting
At base, the problem here is that you are sorting according to your "locale", which is presumably en_US.UTF-8 (or some other Unicode locale). In theory, a locale-aware sort will produce an ordering which is what would be expected according to the normal sorting rules for that location, while a non-locale-aware sort will sort according to the "arbitrary" character codes for each character.
In a locale-aware sort, for example, it would be common for a word starting with a capital letter to come just before (or just after) the same word starting with a lower-case letter, whereas a non-locale-aware sort will put all the words starting with a capital letter before any word starting with a lower-case letter. Also, in an English-speaking locale, you would probably find words starting with an ä intermingled with words starting with a, whereas in a Swedish locale, you'd find them after words starting with z because in Swedish, ä is the 28th letter (it comes after å and before ö, in case you're interested).
For all that to work, the locale descriptions on your machine need to actually describe the sorting order which would be expected in each locale, and particularly with the default locale, which should correspond to what you would expect. As can be seen from this example, that is sometimes not the case. Indeed, it sometimes produces bizarrely unexpected results.
What is happening in your example is that the locale description for your locale says that whitespace does not participate in sortation. It also indicates that digits come before letters. Now, consider a subset of your data (with both files combined):
rs10108 114516330
rs1010805 38189142
rs1010863 185432942
rs10108 C
rs1010805 T
rs1010863 T
If we eliminate the whitespace altogether, that would be:
rs10108114516330
rs101080538189142
rs1010863185432942
rs10108C
rs1010805T
rs1010863T
And if we then sort that according to normal alphabetic rules, with digits first, we get:
rs101080538189142
rs1010805T
rs10108114516330
rs1010863185432942
rs1010863T
rs10108C
Or, putting the whitespace back:
rs1010805 38189142
rs1010805 T
rs10108 114516330
rs1010863 185432942
rs1010863 T
rs10108 C
Those are the rules sort is following, and the result is that the two lines whose first field is rs10108 do not get sorted together. Counter-intuitive, ¿no?
Probably the correct solution would be to tell whoever built the locale files for your distribution that the normal rule is "nothing (visible) comes before something", which was the alphabetization rule we were taught in school. In other words, a space (nothing visible) comes before any character. Or you could try to fix the collation files yourself.
But in practical terms, the solution is to tell sort to do a non-locale-aware sort by default. I do that by putting:
export LC_COLLATE=C
in my bash startup files. (C is the special name of the locale corresponding to the programming language "C", in which symbols are sorted by their internal character codes.) You could also just type that every time you want to sort something:
LC_COLLATE=C sort test1.txt
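Applied to the merge in the question, that could look like the sketch below. It uses a three-line subset of the sample data in a temporary directory, and it sets LC_ALL=C rather than LC_COLLATE=C because LC_ALL, when set, takes precedence over LC_COLLATE. Running join for the merge step is an assumption about what "merge" means here; join is given the same locale because it expects its inputs in the collation order it uses itself.

```shell
# Subset of the question's data, deliberately unsorted.
tmpdir=$(mktemp -d)
printf '%s\n' 'rs1010805 38189142' 'rs10108 114516330' 'rs1010735 224915429' \
  > "$tmpdir/test1.txt"
printf '%s\n' 'rs1010805 T' 'rs1010735 C' 'rs10108 C' > "$tmpdir/test2.txt"

# Sort both files on the first field only, by byte value.
LC_ALL=C sort -k1,1 "$tmpdir/test1.txt" > "$tmpdir/test1_sort.txt"
LC_ALL=C sort -k1,1 "$tmpdir/test2.txt" > "$tmpdir/test2_sort.txt"

# Merge on the first field, using the same collation as the sort.
merged=$(LC_ALL=C join "$tmpdir/test1_sort.txt" "$tmpdir/test2_sort.txt")

printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

Both files now come out in the same order (rs1010735, rs10108, rs1010805), so every line finds its partner.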
The meaning of the -k argument
The -k argument to sort has the basic syntax:
-kstart[,end]
where the positions start (and optionally end) define a range of text to use as a sort key. If end is not specified, the range continues to the end of the line.
The simplest form of a position is just a field number, such as 1, meaning "the first field". But -k1 does nothing, because it means, precisely, "use the text from the first field to the end of the line", which is essentially the same as saying "use the entire line as a sort key", which is the default. So anytime you see -k1 you should know that it is not doing what is expected.
Explicitly specifying the end would be more precise: -k1,1 means the sort key is the text from the (start of) the first field to the (end of) the first field, or in other words, the first field. That would be better, but it wouldn't say anything explicit about how to sort two lines which had the same first field. The standard sort utility is not "stable" by default: when all the keys compare equal, it falls back to a last-resort comparison of the entire line (GNU sort's -s option turns that off), so the order of two such lines depends on text you never named as a key. It would generally be better to add more secondary sort fields:
sort -k1,1 -k2,2
which means, effectively, "sort by the first field, but if the first fields are equal, then compare the second fields."
Fields are split at whitespace (even if whitespace is ignored for sortation), so the above is different from sort -k1,2 in that it is guaranteed to put lines with the same value in the first field in consecutive positions.
Appendix: Why locales ignore whitespace in sorting
Unfortunately, sort -k1,1 -k2,2 also might not do what you want, particularly if you do it in the "C" locale, because of the historic definition of sort fields used by sort. Unless an explicit delimiter is specified with the -t option, sort fields start with each whitespace character which follows a non-whitespace character. Consequently, all fields except the first field start with whitespace. That's fine if they all start with the same whitespace, but often fields have been lined up by explicitly adding the right number of space characters. And that almost always produces incorrect sorting on fields other than the first field.
Since that is not generally what is wanted, sort provides a way of suppressing this annoying behaviour: the b sort-key flag. Sort-key flags are attached to a position in the -k specification, and b affects only the position it is attached to, so to skip the blanks at the start of a key it has to follow the start position. This flag tells sort to ignore leading whitespace in a sort key. Also, you can specify -b as a command-line option before any -k option to specify that all sort keys should be treated as having the b flag. That would suggest that the correct invocation of sort would be:
sort -k1,1 -k2b,2
or
sort -b -k1,1 -k2,2
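A small sketch of what the b flag changes, assuming GNU sort in the C locale. The two sample lines are made up: they share the first field and differ only in how many spaces precede the second. Here b is attached to the start position of the key (-k2b,2), which is where it needs to be to skip leading blanks.

```shell
# Same first field; the second field is padded with one or two spaces.
rows=$(printf 'x  b\nx a\n')

# Without b, field 2 includes its leading blanks: "  b" sorts before " a".
with_blanks=$(printf '%s\n' "$rows" | LC_ALL=C sort -k1,1 -k2,2)

# With b on the start position, the comparison is "b" against "a".
key_flag=$(printf '%s\n' "$rows" | LC_ALL=C sort -k1,1 -k2b,2)

# A global -b is inherited by keys that carry no flags of their own.
global_flag=$(printf '%s\n' "$rows" | LC_ALL=C sort -b -k1,1 -k2,2)

printf 'no b:\n%s\n\n-k2b,2:\n%s\n\n-b:\n%s\n' \
  "$with_blanks" "$key_flag" "$global_flag"
```

The first listing keeps "x  b" ahead of "x a" because a space has a lower byte value than any letter; the other two put "x a" first.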
Some people believe that it is irritating to have to specify b all the time (since it is almost always what you want), and complicated to explain to users why they have to do it. As a consequence, it may appear easier to set up the locale definitions to ignore whitespace, which will certainly cause leading whitespace to be ignored. The problem with that "solution" is that it produces results which are at least as confusing as the results caused by having sort include the spaces between fields in the field definition, but which are rather more difficult to fix because there is no simple way to modify a locale's collation order.

gnu/unix sort numerical only using first column?

With regular strings, if the first field matches, we sort by the next field and so on, and things work as we expect.
echo -e 'a c\na b' | sort #regular string sort
a b
a c
With numbers, if the first field matches, we…switch to string sort on subsequent fields? Why? I would think it would compare each field numerically.
echo -e '1 22\n1 3' | sort -n #numeric sort
1 22
1 3
FYI, using sort (GNU coreutils) 5.97 on RHEL 5.5.
What am I missing here? I know I can use -k to pick the field I want to sort on, but that drastically reduces the flexibility of input allowed, as it requires the user to know the numbers of fields.
Thanks!
Sadly you haven't missed anything. This apparently simple task - split lines into fields and then sort numerically on all of them - can't be done by the unix sort program. You just have to figure out how many columns there are and name them all individually as keys.
What's happening when you specify -n and no other options is that the whole line is being passed to the "convert string to number" routine, which converts the number at the start of the line and ignores the rest. The split into fields is not done at all.
Your first example, without -n, is also doing whole-line comparison. It's not comparing "a" to "a" then "b" to "c". It's comparing "a b" to "a c".
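If the number of columns is not known in advance, one possible workaround is to count the fields on the first line and build the key list from that. This is only a sketch: it assumes every line has the same number of whitespace-separated fields, and the sample rows and variable names are made up for illustration.

```shell
# Sample rows; the first two are from the question.
input=$(printf '1 22\n1 3\n10 1\n2 5\n')

# Count the fields on the first line.
nfields=$(printf '%s\n' "$input" | awk 'NR == 1 { print NF }')

# Build one numeric key per field: -k1,1n -k2,2n ...
keys=""
i=1
while [ "$i" -le "$nfields" ]; do
  keys="$keys -k$i,${i}n"
  i=$((i + 1))
done

# $keys is left unquoted on purpose so that it splits into separate options.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort $keys)

printf '%s\n' "$sorted"
```

With every field named as its own numeric key, 1 3 sorts ahead of 1 22, and 2 5 ahead of 10 1.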

Sort ignores an apostrophe - sometimes (except when it is the only column used); WHY?

This happens to me both on Linux and on cygwin, so I suspect it is not a bug. Still, I don't understand it. Can anyone explain?
Consider the following file (tab-delimited, and that's a regular apostrophe)
(I create it with cat to ensure that it wasn't non-printing characters that were the source of the problem)
$cat > temp
cat 1389
cat' 1747
ca't 3175
cat 46848484
ca't 720
$sort temp
<gives the exact same output as cat temp>
$sort -k1,1 temp
cat 1389
cat 46848484
cat' 1747
ca't 3175
ca't 720
Why do I have to ignore the second column in order to sort correctly?
I pulled up the manual for sort and noticed the following:
* WARNING * The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.
As it turns out, locales actually specify how lexicographic ordering works for a given locale. This makes a lot of sense, but for some reason it trips over multi field files...
(see also:)
Unusual behaviour of linux's sort command
Why does the sort command sort differently if there are trailing fields?
There are a couple of things you can do:
You can sort naively by byte value using
LC_ALL="C" sort temp
This will give a more logical result, but it might not be the one you actually want.
You could try to get sort to do a more basic lexicographical ordering by setting the locale to C and telling it you want dictionary ordering:
LC_ALL="C" sort -d temp
To have sort output your locale information and highlight the sort key, you can use
sort --debug temp
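For comparison, here is a sketch of the byte-value approach on the sample data from the question, rebuilt with printf so that the separators are real tabs. It assumes GNU sort; the variable names are only for illustration.

```shell
# Tab-delimited sample; printf expands \t and \n in the format string.
data=$(printf "cat\t1389\ncat'\t1747\nca't\t3175\ncat\t46848484\nca't\t720\n")

# Whole-line comparison by byte value.
bytewise=$(printf '%s\n' "$data" | LC_ALL=C sort)

# Key restricted to the first tab-delimited field.
tab=$(printf '\t')
by_first_field=$(printf '%s\n' "$data" | LC_ALL=C sort -t "$tab" -k1,1)

printf 'whole line:\n%s\n\n' "$bytewise"
printf 'first field only:\n%s\n' "$by_first_field"
```

In the C locale the apostrophe is an ordinary byte (0x27) that sorts before every letter, so both invocations give the same order regardless of what follows in the second column: the ca't lines, then cat, then cat'.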
Personally I'm really curious to know what rule is being specified that makes sort behave unintuitively across multiple fields.
They're supposed to specify correct lexicographic order in the given language and dialect. Do the locales' functions simply not handle the multiple field case at all, or are they taking some kind of different interpretation on the "meaning" of the line?

Resources