reverse unix sort on non-existent field

I have a largish file with lines like this: (^I represents a tab, $ end-of-line)
2^IElaeocarpus williamsianus^I48$
4^I$
6^I$
8^I$
10^I$
12^I$
14^IElaeocarpus hookerianus^I73$
16^IElaeocarpus kirtonii^I111$
20^I$
22^ITetratheca juncea^I66$
42^IMalagasy giant rat^I401$
and I want to sort the lines so that those with the highest number in the 3rd field (i.e. after the 2nd tab) come first, i.e.
42^IMalagasy giant rat^I401$
16^IElaeocarpus kirtonii^I111$
14^IElaeocarpus hookerianus^I73$
22^ITetratheca juncea^I66$
2^IElaeocarpus williamsianus^I48$
4^I$
6^I$
8^I$
10^I$
12^I$
20^I$
(I don't care about the order of the lines with no field 3). So I assumed something like the following would work
sort -r -t $'\t' -k 3,3n myfile
but it doesn't (GNU sort, OS X 10.9). I feel I'm being stupid. What's the correct incantation?

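You need to add the r modifier to the -k key itself rather than pass -r as a global command-line option: once a key carries its own ordering options (here n), the global ordering options no longer apply to that key.
So something along these lines should do the trick: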
sort -t $'\t' -k 3,3nr myfile
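A minimal sketch of the per-key modifier, using a few lines from the question's sample data. It assumes a POSIX shell; the variable names tab and sorted are only illustrative, and LC_ALL=C is set just to make the run reproducible.

```shell
# Build a literal tab portably instead of relying on bash's $'\t'.
tab=$(printf '\t')

# n and r are both attached to key 3, so the reversal applies to that key.
# Lines without a third field have an empty key, which counts as 0 and
# therefore ends up last in reverse numeric order.
sorted=$(printf '2\tElaeocarpus williamsianus\t48\n4\t\n14\tElaeocarpus hookerianus\t73\n42\tMalagasy giant rat\t401\n' |
  LC_ALL=C sort -t "$tab" -k 3,3nr)

printf '%s\n' "$sorted"
```

The first column of the result should read 42, 14, 2, 4 from top to bottom.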

It seems that what you want is not n, but g.
sort -t $'\t' test.txt -k 3.2gr
The dot in the key specifies the character position within the field at which the comparison starts.
As favoretti pointed out, the reversal applies to that column, so that is where the r modifier goes.

Related

bash sort alphanumeric buildnumber

I have a list of buildnumbers which I get from my buildserver, like this:
1.0.0.b1
1.0.0.b10
1.0.0.b11
1.0.0.b12
1.0.0.b13
1.0.0.b14
1.0.0.b15
1.0.0.b16
1.0.0.b17
1.0.0.b18
1.0.0.b19
1.0.0.b2
1.0.0.b20
1.0.0.b21
1.0.0.b22
1.0.0.b3
1.0.0.b4
1.0.0.b5
1.0.0.b6
1.0.0.b7
1.0.0.b8
1.0.0.b9
Now I need to sort this so that the highest build number is at the bottom, like this:
1.0.0.b1
1.0.0.b2
1.0.0.b3
1.0.0.b4
1.0.0.b5
1.0.0.b6
1.0.0.b7
1.0.0.b8
1.0.0.b9
1.0.0.b10
1.0.0.b11
1.0.0.b12
1.0.0.b13
1.0.0.b14
1.0.0.b15
1.0.0.b16
1.0.0.b17
1.0.0.b18
1.0.0.b19
1.0.0.b20
1.0.0.b21
1.0.0.b22
On Linux with GNU sort this is easy: just use sort -V.
But this also has to work on macOS, where I have no experience; from testing I know that -V does not work there.
I tried with
sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n
but no luck there.
I want to have it sorted by Version/buildnumber, e.g.
1.1.3.b5 is higher than 1.0.3.b66
What have I missed here? Can you please help me? Also, unfortunately, installing Homebrew coreutils is not an option.
thank you,
br Alex
I assume your real full list won't have all b versions. You'll need to split field 4 into two keys; one for the alpha part and one for the numeric part.
$: sort -t. -k1n -k2n -k3n -k4.1,4.1 -k4.2n vnums
1.0.0.a5
1.0.0.a10
1.0.0.a13
1.0.0.a19
1.0.0.b1
1.0.0.b6
1.0.0.b8
1.0.0.b9
1.0.0.b12
1.0.0.b14
1.0.0.b17
1.0.0.b20
1.0.0.b21
1.0.0.b22
1.0.0.c3
1.0.0.c7
1.0.0.c15
1.0.0.c16
1.0.0.d2
1.0.0.d4
1.0.0.d11
1.0.0.d18
1.0.3.b66
1.1.3.b5
Note the limiting of the alpha column of field 4 to a single character.
Perl one-liner
Assuming no number has more than 6 digits (or some other fixed width), sprintf "%06d", $& left-pads each number with zeros so that a plain string comparison orders them correctly:
perl -e 'sub v{"@_"=~s/\d+/sprintf"%06d",$&/ger}print sort{v($a)cmp v($b)} <>' inputfile
Treat the fourth key as composed of subfields:
sort -t. -k1n,1 -k2n,2 -k3n,3 -k4.1,4.1 -k4.2n
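A small sketch of the split-key approach on a handful of build numbers taken from the question. It assumes a POSIX shell and a single-letter prefix in field 4; the variable names builds and sorted are illustrative only.

```shell
# Sample mixing build numbers and versions mentioned in the question.
builds='1.0.0.b10
1.1.3.b5
1.0.0.b2
1.0.3.b66
1.0.0.b1'

# Fields 1-3 are compared numerically, each key limited to its own field.
# Field 4 is split: character 1 (the letter) as text, the rest as a number.
sorted=$(printf '%s\n' "$builds" |
  LC_ALL=C sort -t. -k1,1n -k2,2n -k3,3n -k4.1,4.1 -k4.2n)

printf '%s\n' "$sorted"
```

Ending each numeric key at its own field (for example -k2,2n rather than -k2n) keeps the comparison from reading something like 10.0 as a decimal number spanning two fields.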

How does `--key (-k)` work for command `sort`?

From the manual of the command sort
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
Versions:
sort: GNU coreutils 5.93
OS: MAC OSX 10.11.6
Bash: GNU bash 3.2.57(1)
Terminal: 2.6.1
It does not quite help me to understand how to use this option. I've seen patterns like -k1 -k2 and -k1,2 (see this post), -k1.2 and -k1.2n (see this post) and -k3 -k1 -k4 (see this post).
How does the flag --key (-k) work for the command sort?
I only have a vague intuition about what can be done with the option -k but if it is handy to consider an example, I would be happy for you to consider numerically (-n) sorting the following input by the numbers that directly follow the word "row". If two records have the same value after the word "row", then sorting could be done numerically on the value that follows the letter "G".
H3_row24_G500.txt
H3_row32_G1000.txt
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
H3_row68_G999.txt
H3_row68_G500.txt
The expected output is
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G500.txt
H3_row24_G999.txt
H3_row32_G1000.txt
H3_row68_G500.txt
H3_row68_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
The . specifies a starting position within a single field. You want to sort numerically on fields 2 (starting at character 4) and 3 (starting at character 2). The following should work:
sort -t_ -k2.4n -k3.2n tmp.txt
-t_ specifies the field separator
The first key is 2.4n
The second key, if the first keys are equal, is 3.2n
Technically, .txt is part of field 3, but when you ask for numeric sorting, the trailing non-digit characters are ignored.
(More precisely, -k2.4,2n -k3.2,3n stops each key at the end of its own field. The simpler form shown above still works because a numeric comparison only reads the leading number of each key and stops at the first character that cannot continue it, here the _ after the row number.)
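The stricter form with explicit stop positions can be tried on a few of the file names from the question. This is only a sketch for a POSIX shell; names and sorted are illustrative variable names.

```shell
names='H3_row24_G500.txt
H3_row9_G1000.txt
H3_row102_G500.txt
H3_row9_G999.txt
H3_row24_G999.txt'

# Key 1: field 2 from its 4th character ("row" skipped) to the end of field 2.
# Key 2: field 3 from its 2nd character ("G" skipped) to the end of field 3.
sorted=$(printf '%s\n' "$names" | LC_ALL=C sort -t_ -k2.4,2n -k3.2,3n)

printf '%s\n' "$sorted"
```

The rows come out as 9, 9, 24, 24, 102, with the G value breaking ties within equal rows.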
from the manpage
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number
and C a character position in the field; both are origin 1, and the stop position defaults
to the line's end. If neither -t nor -b is in effect, characters in a field are counted
from the beginning of the preceding whitespace. OPTS is one or more single-letter ordering
options [bdfgiMhnRrV], which override global ordering options for that key. If no key
is given, use the entire line as the key. Use --debug to diagnose incorrect key usage.
The implication is that sort splits lines into fields. The period separator is used to offset into the field. With _ as your separator, you'd use an offset of 4.
In this case, the field delimiter isn't whitespace and so you would need to specify it using the -t option.
sort uses a locale-based string comparison by default, and it looks like you want these sorted numerically. The -n switch does this.
sort -t _ -k 2.4 -n
This isn't really a programming question, but here goes:
If you're using GNU sort, your desired output can be achieved by sort -V:
$ echo 'H3_row24_G500.txt
H3_row32_G1000.txt
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
H3_row68_G999.txt
H3_row68_G500.txt' | sort -V
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G500.txt
H3_row24_G999.txt
H3_row32_G1000.txt
H3_row68_G500.txt
H3_row68_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
That's because -V compares numeric and general string segments separately and H, 3, _row are the same in all lines.

sorting file names ascending where names have a dash in bash

I have a list of files in a folder.
The names are:
1-a
100-a
2-b
20-b
3-x
and I want to sort them like
1-a
2-b
3-x
20-b
100-a
The files are always a number, followed by a dash, followed by anything.
I tried a ls with a col and sort and it works, but I wanted to know if there's a simpler solution.
Forgot to mention: this is bash running on Mac OS X.
Some ls implementations, GNU coreutils' ls is one of them, support the -v (natural sort of (version) numbers within text) option:
% ls -v
1-a 2-b 3-x 20-b 100-a
or:
% ls -v1
1-a
2-b
3-x
20-b
100-a
Use sort to define the fields.
sort -s -t- -k1,1n -k2 filenames.txt
The -t tells sort to treat - as the field separator in input items. -k1,1n instructs sort to first sort on the first field numerically; -k2 sorts using the remaining fields as the second key in case the first fields are equal. -s keeps the sort stable (although you could omit it since the entire input string is being used in one field or another).
(Note: I'm assuming the file names do not contain newlines, so that something like ls > filenames.txt is guaranteed to produce a file with one name per line. You could also use ls | sort ... in that case.)
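As a quick check of the two keys, here is a sketch that feeds the names from the question through the same sort call. It assumes a POSIX shell; the name 2-a is an extra entry added here only to show the second key breaking a tie.

```shell
names='1-a
100-a
2-b
20-b
3-x
2-a'

# -k1,1n: compare the part before the first dash as a number.
# -k2:    on equal numbers, compare the rest of the name as text.
sorted=$(printf '%s\n' "$names" | LC_ALL=C sort -s -t- -k1,1n -k2)

printf '%s\n' "$sorted"
```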

Sort and remove duplicates based on column

I have a text file:
$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10
I'd like to sort the file based on the first column and remove duplicates using sort, but things are not going as expected.
Approach 1
$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1
It is not sorting based on the first column.
Approach 2
$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1
It removes the 542,9,1,418,1 line but I'd like to keep one copy.
It seems that the first approach removes duplicate but not sorts correctly, whereas the second one sorts right but removes more than I want. How should I get the correct result?
The problem is that when you give sort a key, -u looks for unique occurrences of that key only, not of the whole line. Since the line 542,8,1,418,1 comes first, sort treats the next two lines starting with 542 as duplicates and filters them out.
Your best bet would be to either sort all columns:
sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text
or
use awk to filter duplicate lines and pipe it to sort.
awk '!_[$0]++' text | sort -t, -nk1,1
When sorting on a key, you must provide the end of the key as well; otherwise the key runs to the end of the line and so includes all following fields.
The following should work:
sort -t, -u -k1,1n text
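The difference between deduplicating by key and deduplicating by line can be sketched on the question's data. The second command is one possible alternative that does not appear in the answers above: it adds -k2 as a tie-breaker so that, for this data, only identical lines compare equal. The variable names are illustrative.

```shell
text='542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10'

# -u with only -k1,1n: two lines are "equal" when field 1 matches,
# so a single 542 line survives.
by_key=$(printf '%s\n' "$text" | LC_ALL=C sort -t, -u -k1,1n)

# Adding -k2 (field 2 through end of line) as a second key: for this data,
# only fully identical lines now compare equal and get dropped.
by_line=$(printf '%s\n' "$text" | LC_ALL=C sort -t, -u -k1,1n -k2)

printf 'by key:\n%s\nby line:\n%s\n' "$by_key" "$by_line"
```

The first result has three lines, the second has four and keeps both 542,8,... and 542,9,... once each.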

Sorting floats with exponents with 'sort -g' bash command

I have a file with floats with exponents and I want to sort them. AFAIK 'sort -g' is what I need. But it seems like it sorts floats throwing away all the exponents. So the output looks like this (which is not what I wanted):
$ cat file.txt | sort -g
8.387280091e-05
8.391373668e-05
8.461754562e-07
8.547354437e-05
8.831553093e-06
8.936111118e-05
8.959458896e-07
This brings me to two questions:
Why doesn't 'sort -g' work as I expect it to?
How can I sort my file using bash commands?
The problem is that in some countries the locale settings can mess this up by using , as the decimal separator instead of . at the system level. Check by typing locale in a terminal. There should be an entry
LC_NUMERIC=en_US.UTF-8
If the value is anything else, change it to the above by editing the locale file
sudo gedit /etc/default/locale
That's it. You can also override the locale temporarily, for a single command, by doing
LC_ALL=C sort -g file.dat
LC_ALL=C is shorter to write in terminal, but putting it in the locale file might not be preferable as it could alter some other system-wide behavior such as maybe time format.
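A short sketch of the per-command override on a few of the values from the question, assuming a POSIX shell. The variable names values and sorted are illustrative.

```shell
values='8.387280091e-05
8.461754562e-07
8.831553093e-06
8.959458896e-07'

# LC_ALL=C applies only to this one sort invocation and makes "." the
# decimal point, so -g can parse both mantissa and exponent.
sorted=$(printf '%s\n' "$values" | LC_ALL=C sort -g)

printf '%s\n' "$sorted"
```

Because the variable is set only for that command, the rest of the session keeps its normal locale.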
Here's a neat trick:
$ sort -te -k2,2n -k1,1n test.txt
8.461754562e-07
8.959458896e-07
8.831553093e-06
8.387280091e-05
8.391373668e-05
8.547354437e-05
8.936111118e-05
The -te splits each number into two fields at the e that separates the mantissa from the exponent. The -k2,2n says to sort by exponent first, then the -k1,1n says to sort by mantissa next.
Works with all versions of the sort command.
Your method is absolutely correct
cat file.txt | sort -g
If the above code is not working, then try this
sed 's/\./0000000000000/g' file.txt | sort -g | sed 's/0000000000000/\./g'
Convert '.' to '0000000000000', sort, and then substitute '.' back. I chose '0000000000000' as the replacement to avoid collisions with digit sequences already in the input.
You can pick a different placeholder yourself.
