"sort -c -k1" compares second field too?

"sort" correctly reports these two lines are out of order:
> echo "a b\na a" | sort -c
sort: -:2: disorder: a a
How do I tell sort to compare only the first field of each line? I tried:
> echo "a b\na a" | sort -c -k1
sort: -:2: disorder: a a
but it failed, as above.
Can I make sort compare the first field of each line only, or must I
use something like sed to trim the lines before comparing them?
EDIT: I'm using "sort (GNU coreutils) 7.2". I tried using a different field separator but it didn't help:
> echo "a b\na a" | sort -k1 -c -t" "
sort: -:2: disorder: a a
although I'm pretty sure space is the default separator anyway.

The following works as expected:
echo "a b\na a" | sort -s -c -k1,1
There were two problems with your sort invocation:
The argument to -k is a key definition that specifies a start and end position. If the end position is omitted, it defaults to the end of the line, not to the start field. -k1,1 specifies both, telling sort not to include the second field in the comparison.
sort is not stable by default, which means it does not guarantee to preserve the input order of lines that compare equal. Quoting the documentation:
Finally, as a last resort when all keys compare equal, sort compares
entire lines as if no ordering options other than --reverse (-r)
were specified. The --stable (-s) option disables this
"last-resort comparison" so that lines in which all fields compare
equal are left in their original relative order.
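The three cases discussed above can be sketched as follows. This assumes a GNU-compatible sort. printf stands in for echo so the newline is produced in any shell, LC_ALL=C is an added assumption that keeps the comparison byte-wise, and the helper name check is purely illustrative.

```shell
# Two lines with equal first fields whose second fields are out of order.
check() {
    # Prints "sorted" if sort -c accepts the input with the given options,
    # "disorder" otherwise. sort's own diagnostics are discarded.
    if printf 'a b\na a\n' | LC_ALL=C sort -c "$@" 2>/dev/null; then
        echo sorted
    else
        echo disorder
    fi
}

key_to_eol=$(check -k1)        # key spans field 1 to the end of the line
key_no_stable=$(check -k1,1)   # keys tie, so the last-resort whole-line comparison runs
key_stable=$(check -s -k1,1)   # only field 1 is compared

echo "-k1: $key_to_eol, -k1,1: $key_no_stable, -s -k1,1: $key_stable"
```

Only the last form accepts the input, which is why both the ,1 end position and -s are needed here.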

Related

How does `--key (-k)` work for command `sort`?

From the manual of the command sort
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
Versions:
sort: GNU coreutils 5.93
OS: MAC OSX 10.11.6
Bash: GNU bash 3.2.57(1)
Terminal: 2.6.1
It does not quite help me to understand how to use this option. I've seen patterns like -k1 -k2 and -k1,2 (see this post), -k1.2 and -k1.2n (see this post) and -k3 -k1 -k4 (see this post).
How does the flag --key (-k) work for the command sort?
I only have a vague intuition about what can be done with the option -k but if it is handy to consider an example, I would be happy for you to consider numerically (-n) sorting the following input by the numbers that directly follow the word "row". If two records have the same value after the word "row", then sorting could be done numerically on the value that follows the letter "G".
H3_row24_G500.txt
H3_row32_G1000.txt
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
H3_row68_G999.txt
H3_row68_G500.txt
The expected output is
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G500.txt
H3_row24_G999.txt
H3_row32_G1000.txt
H3_row68_G500.txt
H3_row68_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
The . specifies a starting position within a single field. You want to sort numerically on fields 2 (starting at character 4) and 3 (starting at character 2). The following should work:
sort -t_ -k2.4n -k3.2n tmp.txt
-t_ specifies the field separator
The first key is 2.4n
The second key, if the first keys are equal, is 3.2n
Technically, .txt is part of field 3, but when you ask for numeric sorting, the trailing non-digit characters are ignored.
(More correctly, -k2.4,2n -k3.2,3n prevents any additional fields from being included in each key. The simpler form shown above still works because a numeric comparison reads only the leading number of each key and ignores whatever follows it, so the extra fields never affect the result.)
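As a small sketch of the bounded form, the sample names from the question can be fed to sort directly. It assumes the input arrives on standard input rather than from tmp.txt, and LC_ALL=C is an added assumption that makes "." the decimal point regardless of the user's locale.

```shell
# Field 2 from its 4th character ("row" skipped) is the primary numeric key;
# field 3 from its 2nd character ("G" skipped) is the secondary numeric key.
sorted=$(printf '%s\n' \
    H3_row24_G500.txt H3_row32_G1000.txt H3_row9_G999.txt \
    H3_row9_G1000.txt H3_row24_G999.txt H3_row102_G500.txt \
    H3_row2400_G999.txt H3_row68_G999.txt H3_row68_G500.txt |
    LC_ALL=C sort -t_ -k2.4,2n -k3.2,3n)

printf '%s\n' "$sorted"
```

The result should match the expected output listed in the question.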
from the manpage
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number
and C a character position in the field; both are origin 1, and the stop position defaults
to the line's end. If neither -t nor -b is in effect, characters in a field are counted
from the beginning of the preceding whitespace. OPTS is one or more single-letter ordering
options [bdfgiMhnRrV], which override global ordering options for that key. If no key
is given, use the entire line as the key. Use --debug to diagnose incorrect key usage.
The implication is that sort splits lines into fields. The period separator is used to offset into the field. With _ as your separator, you'd use an offset of 4.
In this case, the field delimiter isn't whitespace and so you would need to specify it using the -t option.
sort compares lines as text in locale collation order by default, and it looks like you want these sorted numerically. The -n switch does this.
sort -t _ -k 2.4 -n
This isn't really a programming question, but here goes:
If you're using GNU sort, your desired output can be achieved by sort -V:
$ echo 'H3_row24_G500.txt
H3_row32_G1000.txt
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
H3_row68_G999.txt
H3_row68_G500.txt' | sort -V
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G500.txt
H3_row24_G999.txt
H3_row32_G1000.txt
H3_row68_G500.txt
H3_row68_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
That's because -V compares numeric and general string segments separately and H, 3, _row are the same in all lines.

GNU `sort` command fails to sort with stable and (general) numeric sorting turned on

I came across a rather strange situation when using GNU sort 8.4 and 8.24 with different sorting methods:
Specifying stable and numeric sorting returns the original list:
$ printf '"A"\n"C"\n"B"\n' | sort -sn -k1,1
"A"
"C"
"B"
$ printf '"B"\n"A"\n"C"\n' | sort -sn -k1,1
"B"
"A"
"C"
...whereas specifying only a single sorting method works fine:
$ printf '"B"\n"A"\n"C"\n' | sort -n -k1,1
"A"
"B"
"C"
$ printf '"B"\n"A"\n"C"\n' | sort -g -k1,1
"A"
"B"
"C"
$ printf '"B"\n"A"\n"C"\n' | sort -s -k1,1
"A"
"B"
"C"
Question: Is the stable sort truly incompatible with (general) numeric sorting, or am I missing something here?
In that case, I would have expected an error as shown below:
$ printf '"B"\n"A"\n"C"\n' | sort -gn -k1,1
sort: options '-gn' are incompatible
Thanks in advance, any insight as to why this occurs is greatly appreciated!
Numeric sort sorts by the longest numeric prefix of the sort field, ignoring leading whitespace. The numeric prefix is allowed to be empty: "An empty digit string shall be treated as zero".
Stable sort retains the original order for lines whose keys compare equal, so if you stable numeric sort lines not starting with numbers, the output will be identical to the input.
The quote above is from the POSIX standard; the full documentation for GNU sort can be found with info sort if documentation is correctly installed on your machine, or via the URL at the bottom of the sort manpage, from which I extracted this link to the -n option.
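A short sketch of that explanation, assuming GNU sort and LC_ALL=C (an added assumption to keep the whole-line fallback byte-wise). The mixed input in the last command is made up for illustration.

```shell
# Every key is non-numeric, so each one counts as 0 and all keys tie.
# With -s the tie leaves the input order untouched.
stable=$(printf '"B"\n"A"\n"C"\n' | LC_ALL=C sort -s -n -k1,1)

# Without -s the same tie is broken by the last-resort whole-line comparison.
fallback=$(printf '"B"\n"A"\n"C"\n' | LC_ALL=C sort -n -k1,1)

# Hypothetical mixed input: non-numeric keys (0) come first in input order,
# then the real numbers in ascending order.
mixed=$(printf '10 x\n"B"\n2 y\n"A"\n' | LC_ALL=C sort -s -n -k1,1)

printf '%s\n---\n%s\n---\n%s\n' "$stable" "$fallback" "$mixed"
```

So the unchanged output in the question is a correct stable sort of keys that are all equal, not a failure to sort.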
The sort utility man page does not document the behavior of the -n option when used on non-numeric input. Any attempt to explain the behavior would be speculation without checking the source. Even then, the answer may only apply to that particular implementation.

sorting file names ascending where names have a dash in bash

I have a list of files in a folder.
The names are:
1-a
100-a
2-b
20-b
3-x
and I want to sort them like
1-a
2-b
3-x
20-b
100-a
The files are always a number, followed by a dash, followed by anything.
I tried a ls with a col and sort and it works, but I wanted to know if there's a simpler solution.
Forgot to mention: This is bash running on a Mac OS X.
Some ls implementations, GNU coreutils' ls is one of them, support the -v (natural sort of (version) numbers within text) option:
% ls -v
1-a 2-b 3-x 20-b 100-a
or:
% ls -v1
1-a
2-b
3-x
20-b
100-a
Use sort to define the fields.
sort -s -t- -k1,1n -k2 filenames.txt
The -t tells sort to treat - as the field separator in input items. -k1,1n instructs sort to first sort on the first field numerically; -k2 sorts using the remaining fields as the second key in case the first fields are equal. -s keeps the sort stable (although you could omit it since the entire input string is being used in one field or another).
(Note: I'm assuming the file names do not contain newlines, so that something like ls > filenames.txt is guaranteed to produce a file with one name per line. You could also use ls | sort ... in that case.)
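A minimal sketch of that command on the names from the question, with the list piped in instead of read from filenames.txt. LC_ALL=C is an added assumption so the second key compares byte-wise.

```shell
# First key: the number before the dash, compared numerically.
# Second key: everything after the first dash, compared as text.
sorted=$(printf '1-a\n100-a\n2-b\n20-b\n3-x\n' |
    LC_ALL=C sort -s -t- -k1,1n -k2)

printf '%s\n' "$sorted"
```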

How to sort the output obtained with grep -c?

I use the following 'grep' command to get the count of the string alert in each of my files at the given path:
grep 'alert' -F /usr/local/snort/rules/* -c
How do I sort the resulting output in a desired order, say ascending order, descending order, ordered by name, etc.? An answer specific to these cases is sufficient.
You may freely suggest a command other than grep as well.
Pipe it into sort. Assuming your filenames have no colons, use the "-t" option to specify the colon as field separator. Use -n for numerical sorting.
Example:
grep 'alert' -F /usr/local/snort/rules/* -c | sort -t: -n -k2
should split lines into fields separated by ":", use the second field for sorting, and treat this as numbers (so 21 is actually later than 3).
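The orderings asked about can be sketched like this. The temporary directory and the three .rules files are hypothetical stand-ins for /usr/local/snort/rules, and the key is bounded as -k2,2 so that only the count is compared.

```shell
# Build a throwaway directory with three hypothetical rule files.
dir=$(mktemp -d)
printf 'alert\nalert\nalert\n' > "$dir/a.rules"
printf 'alert\n'               > "$dir/b.rules"
printf 'pass\n'                > "$dir/c.rules"

# grep -c prints "filename:count" when given several files.
ascending=$(cd "$dir" && grep -c -F 'alert' *.rules | LC_ALL=C sort -t: -k2,2n)
descending=$(cd "$dir" && grep -c -F 'alert' *.rules | LC_ALL=C sort -t: -k2,2nr)
by_name=$(cd "$dir" && grep -c -F 'alert' *.rules | LC_ALL=C sort -t: -k1,1)

rm -rf "$dir"
printf '%s\n---\n%s\n---\n%s\n' "$ascending" "$descending" "$by_name"
```

The r modifier on the key reverses only that key, which gives the descending order.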

Sorting data based on second column of a file

I have a file of 2 columns and n number of rows.
column1 contains names and column2 age.
I want to sort the content of this file in ascending order based on the age (in second column).
The result should display the name of the youngest person along with their age, then the second youngest person, and so on.
Any suggestions for a one-liner shell or bash script?
You can use the key option of the sort command, which takes a "field number", so if you wanted the second column:
sort -k2 -n yourfile
-n, --numeric-sort compare according to string numerical value
For example:
$ cat ages.txt
Bob 12
Jane 48
Mark 3
Tashi 54
$ sort -k2 -n ages.txt
Mark 3
Bob 12
Jane 48
Tashi 54
Solution:
sort -k 2 -n filename
more verbosely written as:
sort --key 2 --numeric-sort filename
Example:
$ cat filename
A 12
B 48
C 3
$ sort --key 2 --numeric-sort filename
C 3
A 12
B 48
Explanation:
-k # - this argument specifies the first column that will be used to sort. (note that column here is defined as a whitespace delimited field; the argument -k5 will sort starting with the fifth field in each line, not the fifth character in each line)
-n - this option specifies a "numeric sort", meaning that the column is compared as numbers instead of as text.
More:
Other common options include:
-r - this option reverses the sorting order. It can also be written as --reverse.
-i - This option ignores non-printable characters. It can also be written as --ignore-nonprinting.
-b - This option ignores leading blanks, which is handy because whitespace is what separates the fields. It can also be written as --ignore-leading-blanks.
-f - This option ignores letter case. "A"=="a". It can also be written as --ignore-case.
-t [new separator] - This option makes sort split fields on a separator other than whitespace. It can also be written as --field-separator.
There are other options, but these are the most common and helpful ones, that I use often.
For tab separated values the code below can be used
sort -t$'\t' -k2 -n
-r can be used for getting data in descending order.
-n for numerical sort
-k, --key=POS1[,POS2] where k is column in file
For descending order below is the code
sort -t$'\t' -k2 -rn
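The $'\t' quoting is a bash feature. As a hedged sketch for shells that may lack it, the tab can be produced with printf instead. The sample rows, including the name containing a space, are made up for illustration.

```shell
# Obtain a literal tab without relying on $'...' quoting.
tab=$(printf '\t')

# Hypothetical tab-separated data; one name contains a space.
data=$(printf 'Bob\t12\nJane\t48\nMary Ann\t7\nMark\t3\n')

ascending=$(printf '%s\n' "$data" | LC_ALL=C sort -t "$tab" -k2,2n)
descending=$(printf '%s\n' "$data" | LC_ALL=C sort -t "$tab" -k2,2nr)

printf '%s\n---\n%s\n' "$ascending" "$descending"
```

With the tab as separator, "Mary Ann" stays a single field, which would not be the case with the default whitespace splitting.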
Use sort.
sort ... -k 2,2 ...
Simply
$ sort -k2,2n <<<"$data"
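A brief sketch of how ties behave with the bounded key, assuming GNU sort. The extra row "Al 12" is made up to create a tie, and LC_ALL=C is an added assumption so the tie-break is byte-wise.

```shell
# Bob and Al share the same age, so the numeric key ties for them.
data=$(printf 'Bob 12\nJane 48\nAl 12\nMark 3\n')

# Without -s, the tie is broken by comparing whole lines ("Al" before "Bob").
by_age=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2n)

# With -s, tied lines keep their input order ("Bob" before "Al").
by_age_stable=$(printf '%s\n' "$data" | LC_ALL=C sort -s -k2,2n)

printf '%s\n---\n%s\n' "$by_age" "$by_age_stable"
```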

Resources