Bash: how does 'sort' sort paths?

How does sort work? I have this file:
/test# cat foobar
html/lib/ORM/aaa.php
html/lib/ORMBase/ormbase_aaa.php
html/lib/ORM/zzz.php
html/lib/ORMBase/ormbase_zzz.php
And this is the output of sort:
/test# cat foobar | sort
html/lib/ORM/aaa.php
html/lib/ORMBase/ormbase_aaa.php
html/lib/ORMBase/ormbase_zzz.php
html/lib/ORM/zzz.php
I tried a lot of options: -f, -i, -t/... and I don't get it. I want to understand why sort thinks this is sorted.
NB: It works fine with this other sample:
/test# cat foobar2
a/a/a
a/ab/a
a/ab/b
a/a/ab
a/abc/a
/test# cat foobar2 | sort
a/a/a
a/a/ab
a/ab/a
a/ab/b
a/abc/a

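sort tries to be clever with regard to localization. In most non-C locales (en_US.UTF-8, for example) the collation rules ignore punctuation such as / on the first comparison pass, so html/lib/ORM/zzz.php is compared roughly as if it were htmllibORMzzzphp, and that lands after the ORMBase entries. The man page has a short warning about this:
*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.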
So, to fix your issue:
$ cat foobar | LC_ALL=C sort
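To make the effect concrete, here is a minimal sketch that runs the four paths from the question through a byte-order sort. The helper name sort_bytes is made up for this example; the only thing it does is pin LC_ALL=C for the one sort call.

```shell
#!/bin/sh
# sort_bytes is a hypothetical helper name used only in this sketch.
# It sorts by raw byte values, whatever locale the caller has set.
sort_bytes() {
    LC_ALL=C sort "$@"
}

# The four paths from the question. In byte order '/' (0x2F) comes
# before 'B' (0x42), so both ORM/ entries sort ahead of ORMBase/.
printf '%s\n' \
    'html/lib/ORM/aaa.php' \
    'html/lib/ORMBase/ormbase_aaa.php' \
    'html/lib/ORM/zzz.php' \
    'html/lib/ORMBase/ormbase_zzz.php' | sort_bytes
```

Setting the variable on the command line like this affects only that one sort process, so the rest of the shell session keeps its normal locale.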

Related

GNU `sort` command fails to sort with stable and (general) numeric sorting turned on

I came across a rather strange situation when using GNU sort 8.4 and 8.24 with different sorting methods:
Specifying stable and numeric sorting returns the original list:
$ printf '"A"\n"C"\n"B"\n' | sort -sn -k1,1
"A"
"C"
"B"
$ printf '"B"\n"A"\n"C"\n' | sort -sn -k1,1
"B"
"A"
"C"
...whereas specifying only a single sorting method works fine:
$ printf '"B"\n"A"\n"C"\n' | sort -n -k1,1
"A"
"B"
"C"
$ printf '"B"\n"A"\n"C"\n' | sort -g -k1,1
"A"
"B"
"C"
$ printf '"B"\n"A"\n"C"\n' | sort -s -k1,1
"A"
"B"
"C"
Question: Is the stable sort truly incompatible with (general) numeric sorting, or am I missing something here?
In that case, I would have expected an error as shown below:
$ printf '"B"\n"A"\n"C"\n' | sort -gn -k1,1
sort: options '-gn' are incompatible
Thanks in advance, any insight as to why this occurs is greatly appreciated!
Numeric sort sorts by the longest numeric prefix of the sort field, ignoring leading whitespace. The numeric prefix is allowed to be empty: "An empty digit string shall be treated as zero".
Stable sort retains the original order for lines whose keys compare equal, so if you stable numeric sort lines not starting with numbers, the output will be identical to the input.
The quote above is from the POSIX standard. The full documentation for GNU sort can be read with info sort, if the documentation is correctly installed on your machine, or via the URL at the bottom of the sort man page, which also covers the -n option.
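A small sketch of the two cases side by side, assuming a sort that follows the POSIX rule of comparing whole lines when all keys are equal (GNU sort does). The function name sample_lines is invented for the example, and LC_ALL=C is set only to keep the fallback comparison predictable.

```shell
#!/bin/sh
# sample_lines is a hypothetical helper that prints the lines from the question.
# None of them starts with a digit, so -n sees an empty numeric prefix (zero).
sample_lines() {
    printf '%s\n' '"B"' '"A"' '"C"'
}

# All keys compare equal, so sort falls back to comparing the whole
# lines byte by byte: the output is "A", "B", "C".
sample_lines | LC_ALL=C sort -n -k1,1

# -s disables that fallback, so lines with equal keys keep their
# input order: the output is "B", "A", "C".
sample_lines | LC_ALL=C sort -s -n -k1,1
```

So -s and -n are not incompatible. The apparent sorting in the plain -n case comes from the last-resort comparison, not from the numeric key.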
The sort utility man page does not document the behavior of the -n option when used on non-numeric input. Any attempt to explain the behavior would be speculation without checking the source. Even then, the answer may only apply to that particular implementation.

list words from file using shell script in alphabetical order and with no punctuation

I am using Shell script and bash commands.
I have to generate an alphabetically ordered list of the words in a file that contains many sentences; I am using song lyrics to work this out. I can return each word in alphabetical order, but the output still includes some apostrophes, question marks and full stops. To do this I use:
cat lyrics01.txt | tr "\"' " '\n' | sort -u >> lyrics01.wl
I know this starts a new line at each space, apostrophe and double quote, but I need it to delete the punctuation and simply list the words in alphabetical order.
I have tried implementing this part:
-d ',.;:-+=()'
after the 'tr' from my original code but it will not work. Any help for a simpler way or even to solve this would be much appreciated.
Assuming you want lines split on words but not split on punctuation so that "The world isn't fair." becomes
The
world
isnt
fair
and not
The
world
isn
t
fair
<blank line>
the following should do what you want
sed 's/[[:punct:]]*//g;s/ /\n/g' lyrics01.txt | sort -u >> lyrics01.wl
Try sed as below:
sed 's/\([[:punct:] ]\)/\n/g' lyrics01.txt | sort -u >> lyrics01.wl
This replaces every punctuation mark or space with a newline character, so "isn't" is split into "isn" and "t".
All of the examples seem to remove the single quote from the word "isn't"
If that is not what you want, I've tested and come up with this :
$ cat test.txt
The
world
isn't
fair.
Isn't it ?
$ sed "s/ /\n/g" test.txt | sed "s/[[:punct:]]$/\n/g" | grep .
The
world
isn't
fair
Isn't
it
$
It's not sorted, but it shows that you can retain punctuation that is not at the end of a word.
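For comparison, here is a sketch that stays with tr, as in the question, instead of sed. It uses two separate tr calls because deleting and translating cannot be done in one invocation. The function name word_list is made up for this example, and LC_ALL=C is an assumption that keeps the character classes and the sort order predictable (upper case sorts before lower case).

```shell
#!/bin/sh
# word_list is a hypothetical helper: text on stdin, unique words on stdout.
# Step 1: delete every punctuation character, apostrophes included.
# Step 2: turn each run of whitespace into a single newline.
# Step 3: drop the empty line that leading whitespace would leave behind.
# Step 4: sort and remove duplicate words.
word_list() {
    LC_ALL=C tr -d '[:punct:]' |
        LC_ALL=C tr -s '[:space:]' '\n' |
        sed '/^$/d' |
        LC_ALL=C sort -u
}

printf '%s\n' "The world isn't fair. Isn't it ?" | word_list
```

With the sample sentence this prints Isnt, The, fair, isnt, it and world, one per line. To write the list to a file, redirect the output, for example: word_list < lyrics01.txt > lyrics01.wl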

What is the difference between 'sort -u' and 'uniq'?

I need a script that sorts a text file and removes the duplicates.
Most, if not all, of the examples out there use the sort file1 | uniq > file2 approach.
In the man sort though, there is an -u option that does this at the time of sorting.
Is there a reason to use one over the other? Maybe availability of the -u option? Or memory/speed concerns?
They should be equivalent in the simple case, but will behave differently if you're using the -k option to define only certain fields of the input line to use as sort keys. In that case, sort -u will suppress lines which have the same key even if other parts of the line differ, whereas uniq will only suppress lines that are exactly identical.
$ cat example
foo baz
quux ping
foo bar
$ sort -k 1,1 --stable example # use just the first word as sort key
foo baz
foo bar
quux ping
$ sort -k 1,1 --stable -u example # suppress lines with the same first word
foo baz
quux ping
but
$ sort -k 1,1 --stable example | uniq
foo baz
foo bar
quux ping
I'm not sure that it's about availability. Most systems I've ever seen have both sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001 and its sort has the -u option.
Technically, a pipeline (|) runs each command in its own process, so sort | uniq starts one more process than sort -u and pushes all the sorted data through a pipe, which makes it slightly more resource intensive.
If you go to the source code for sort, which comes in the coreutils package, you can see that it actually just skips printing duplicates as it's printing its own sorted list and doesn't make use of the independent uniq code.
To see how it works follow the link to sort's source and see the functions below this comment:
/* If uniquified output is turned on, output only the first of
an identical series of lines. */
Although I believe sort -u should be faster, the performance gain is going to be minimal unless you're running sort | uniq on huge files, because uniq has to read through the entire sorted output again.
One difference is that 'uniq -c' can count (and print) the number of occurrences of each line. You lose this ability when you use 'sort -u' to remove the duplicates.
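A short sketch of that counting case. The helper name count_lines is invented for the example, and the sed step is only there because uniq -c pads the count with leading spaces whose width differs between implementations.

```shell
#!/bin/sh
# count_lines is a hypothetical helper: prints "count line" for every
# distinct input line.
count_lines() {
    # uniq only merges adjacent duplicates, so the input is sorted first.
    # sed strips the padding that uniq -c puts in front of the count.
    LC_ALL=C sort | uniq -c | sed 's/^ *//'
}

printf '%s\n' a b a a b c | count_lines
```

For this input the result is "3 a", "2 b" and "1 c". There is no equivalent of the count with sort -u alone.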
They should be functionally equivalent, and sort -u should be more efficient.
I'm guessing the examples you're looking at simply didn't consider (or didn't have) "sort -u" as an option.
Does uniq sort?
I do not think so. At least on Ubuntu 18.04 and CentOS 6 it does not; it just removes consecutive duplicates.
You can simply conduct a mini experiment.
Let the file sample.txt be:
a
a
a
b
b
b
a
a
a
b
b
b
cat sample.txt | uniq will output:
a
b
a
b
while cat sample.txt | sort -u will output:
a
b
sort | uniq may be functionally equivalent to sort -u.

Sort filenames without leading zeros

I would like to sort stereo image files named with the following pattern
img_i_j.ppm,
where i is the image counter and j is the id of the camera [0,1].
Currently, if I sort them using
ls -1 *.ppm | sort -n
the result looks like that:
img_0_0.ppm
img_0_1.ppm
img_10_0.ppm
img_10_1.ppm
img_1_0.ppm
img_11_0.ppm
img_11_1.ppm
img_1_1.ppm
img_12_0.ppm
But I need this output:
img_0_0.ppm
img_0_1.ppm
img_1_0.ppm
img_1_1.ppm
img_2_0.ppm
img_2_1.ppm
...
img_10_0.ppm
img_10_1.ppm
...
Is this achievable without adapting the filename?
As mentioned in the comments, use
sort -V
I initially posted it as a comment because the -V option is not available in every sort binary; where it is missing, you have to use sort -k -n ... instead.
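As a sketch of the key-based fallback for the img_i_j.ppm names in the question: split each name on the underscore and sort numerically on the counter, then on the camera id. The function name sort_frames is made up for this example, and it assumes every name has exactly the img_i_j.ppm shape.

```shell
#!/bin/sh
# sort_frames is a hypothetical helper for names shaped like img_i_j.ppm.
# -t_      splits each line into fields at "_"
# -k2,2n   sorts numerically on field 2, the image counter i
# -k3,3n   breaks ties on field 3; -n reads only its leading digits,
#          so "1.ppm" counts as camera id 1
sort_frames() {
    LC_ALL=C sort -t_ -k2,2n -k3,3n
}

printf '%s\n' img_10_1.ppm img_1_1.ppm img_0_0.ppm \
    img_2_0.ppm img_10_0.ppm img_1_0.ppm | sort_frames
```

This prints img_0_0.ppm, img_1_0.ppm, img_1_1.ppm, img_2_0.ppm, img_10_0.ppm, img_10_1.ppm. Where sort -V is available it gives the same order for these names with less typing.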
ls (now?) has the -v option, which does what you want. From man ls:
-v natural sort of (version) numbers within text
This is simpler than piping to sort, and follows advice not to parse ls.
If you actually intend to parse the output, I imagine that you can mess with LC_COLLATE in bash. Alternatively, in zsh, you can just use the glob *(n) instead.

Sorting floats with exponents with 'sort -g' bash command

I have a file with floats with exponents and I want to sort them. AFAIK 'sort -g' is what I need. But it seems like it sorts floats throwing away all the exponents. So the output looks like this (which is not what I wanted):
$ cat file.txt | sort -g
8.387280091e-05
8.391373668e-05
8.461754562e-07
8.547354437e-05
8.831553093e-06
8.936111118e-05
8.959458896e-07
This brings me to two questions:
Why doesn't 'sort -g' work as I expect it to?
How can I sort my file using bash commands?
The problem is that the locale settings for some countries use , instead of . as the decimal separator, and sort honors that system-wide setting. Check by typing locale in a terminal. There should be an entry
LC_NUMERIC=en_US.UTF-8
If the value is a locale that uses , as the decimal separator, change it to the above by editing the locale file
sudo gedit /etc/default/locale
That's it. You can also override the locale for a single command by doing
LC_ALL=C sort -g file.dat
LC_ALL=C is shorter to write in a terminal, but putting it in the locale file might not be preferable, as it could alter other system-wide behavior such as the time format.
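A minimal sketch of the per-command override, using a few of the values from the question. The helper name sort_floats is invented for the example, and it assumes a sort that supports -g (GNU sort does).

```shell
#!/bin/sh
# sort_floats is a hypothetical helper: general-numeric sort with the
# locale pinned to C, so "." is always read as the decimal point.
sort_floats() {
    LC_ALL=C sort -g "$@"
}

printf '%s\n' \
    8.387280091e-05 \
    8.461754562e-07 \
    8.831553093e-06 \
    8.959458896e-07 | sort_floats
```

The two e-07 values come out first, then the e-06 value, then the e-05 value, which is the ordering by magnitude that the question expected.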
Here's a neat trick:
$ sort -te -k2,2n -k1,1n test.txt
8.461754562e-07
8.959458896e-07
8.831553093e-06
8.387280091e-05
8.391373668e-05
8.547354437e-05
8.936111118e-05
The -te splits each number into two fields at the e that separates the mantissa from the exponent. The -k2,2n says to sort by the exponent first, then the -k1,1n says to sort by the mantissa next.
It only needs the -t, -k and -n options, so it works even where sort has no -g. It does assume that every value is positive and written with an exponent.
Your method is absolutely correct:
cat file.txt | sort -g
If the above code is not working, then try this:
sed 's/\./0000000000000/g' file.txt | sort -g | sed 's/0000000000000/\./g'
This converts each '.' to '0000000000000', sorts, and then substitutes '.' back. I chose '0000000000000' as the replacement so that it is unlikely to clash with a digit sequence in the input.
You can adjust the replacement string yourself.
