sort multiple columns numerically - shell

I always wondered how sort works when ordering multiple columns according to their numerical values. For example:
echo -e " 2 3 \n 1 2 \n 2 10" | sort -n
produces:
1 2
2 10
2 3
and so does sort -g. If I want to order numerically the second column as well, the only solution I came up with is:
echo -e " 2 3 \n 1 2 \n 2 10" | sort -k1n -k2n
which produces the desired output:
1 2
2 3
2 10
Could someone please explain this behavior and tell me whether a simpler solution exists?

The POSIX specification for sort says:
-n
Restrict the sort key to an initial numeric string, consisting of optional <blank> characters, optional minus-sign, and zero or more digits with an optional radix character and thousands separators (as defined in the current locale), which shall be sorted by arithmetic value. An empty digit string shall be treated as zero. Leading zeros and signs on zeros shall not affect ordering.
This is essentially the same as saying -k1n,1. If you want to sort by multiple columns numerically, you must say so:
sort -k1n,1 -k2n,2 …
Be cautious about omitting the 'field end' after the commas.
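As a minimal sketch of the advice above, assuming a sort that accepts the POSIX -k syntax, the following limits each key to a single field. The function name sort_two_numeric_columns is made up for illustration, and LC_ALL=C is only there to keep the result independent of the locale.

```shell
# Hypothetical helper: sort stdin numerically on column 1, then on column 2.
# The ",1" and ",2" end positions stop each key at the end of its own field.
sort_two_numeric_columns() {
    LC_ALL=C sort -k1,1n -k2,2n
}

printf '2 3\n1 2\n2 10\n' | sort_two_numeric_columns
```

With the sample input from the question this should print 1 2, then 2 3, then 2 10.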

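For this particular input, sorting on the second column alone gives the same output. Note, however, that -k2n ignores the first column, so it is not a general replacement for -k1n -k2n (for example, the lines "1 5" and "2 3" would come out as "2 3" followed by "1 5"):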
echo -e " 2 3 \n 1 2 \n 2 10" | sort -k2n
Output:
1 2
2 3
2 10

Related

BASH: Count occurrences of each element from list A in list B

Is there an efficient way to count the number of occurrences for each item of list A in list B? This question has been solved in different programming languages (e.g., C/C++, Java, Python) but I have not found the solution in BASH. My very naive idea is to use a nested for loop to solve it but I think there should be a better approach for this.
# input
listA=(1 2 3 4 5)
listB=(3 1 2 4 1 3 4 5 2 6 8 7 3 9 6 5 1 2)
# expected output
# 1: 3
# 2: 3
# 3: 3
# 4: 2
# 5: 2
Any comments/suggestions are appreciated!
You say in a comment that you are not comfortable with using for loops, so here is a solution without them:
$ join -2 2 <(printf '%s\n' "${listA[@]}" | sort) \
<(printf '%s\n' "${listB[@]}" | sort | uniq -c)
1 3
2 3
3 3
4 2
5 2
Explanation
<(...) are Bash's process substitutions. join is given pseudo-files that actually correspond to the output of the commands.
printf '%s\n' "${listA[@]}" | sort sorts the elements of listA and prints them one per line.
printf '%s\n' "${listB[@]}" | sort | uniq -c does the same with listB, but uses uniq -c to prefix each element with its number of occurrences.
join keeps the lines of this second output that match a line in the first output.
A solution in plain bash using an associative array, without using nested loops:
#!/bin/bash
listA=(1 2 3 4 5)
listB=(3 1 2 4 1 3 4 5 2 6 8 7 3 9 6 5 1 2)
declare -A freq # associative array to hold frequencies
for elem in "${listA[@]}"; do freq[$elem]=0; done
for elem in "${listB[@]}"; do [[ ${freq[$elem]} ]] && ((++freq[$elem])); done
for elem in "${listA[@]}"; do printf '%s: %d\n' "$elem" "${freq[$elem]}"; done
Notes:
Elements are not restricted to integers nor single-character elements;
script should work for any kind of element (including elements containing spaces, tabs, newlines... etc, except the null byte ('\0'), of course).
Its efficiency depends on how associative arrays are implemented internally in bash.
Associative arrays were introduced into bash with version 4.0.

Bash Ordering csv by column not as expected with numbers and spaces at the end of the string [duplicate]

I have a very simple text file of 3 fields, each is separated by a space, like following:
123 15 0
123 14 0
345 12 0
345 11 0
I then issued a sort command to sort by the first column: sort -k 1 myfile. But it does not sort by the first column alone. It sorts by the whole line, and I get the following result:
123 14 0
123 15 0
345 11 0
345 12 0
Is there anything wrong with my command or file?
You need to use:
sort -k 1,1 -s myfile
if you want to sort only on the first field. This syntax specifies the start and end field for sorting. sort -k 1 means to sort starting with the first field through to the end of the line. To ensure the lines are kept in the same order with respect to the input where the sort key is the same, you need to use a stable sort with the -s flag (GNU).
See this from the sort(1) man page:
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where
F is a field number and C a character position in the field; both are
origin 1, and the stop position defaults to the line's end.
and the info page:
The --stable (-s) option disables this last-resort comparison so that
lines in which all fields compare equal are left in their original relative
order.
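A small sketch of the difference, assuming a sort that supports -s (GNU sort does, as noted above). The function name sort_first_field_stable is hypothetical, and LC_ALL=C is used only to make the comparison locale-independent.

```shell
# Hypothetical helper: sort stdin on the first field only, keeping the
# input order of lines whose first field is equal (-s, stable sort).
sort_first_field_stable() {
    LC_ALL=C sort -s -k1,1
}

printf '345 12 0\n123 15 0\n345 11 0\n123 14 0\n' | sort_first_field_stable
```

Here the two 123 lines should come first and the two 345 lines second, each pair still in its input order (15 before 14, 12 before 11).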

Bash sort using general numeric value on alphanumeric string not returning rows sorted properly

I have a file containing three TAB-separated columns. The 1st column is a number, the second is a sequence of 8 characters followed by 1-3 digits, the 3rd is the same as the 2nd column. Here's a minimum reproducible example:
1 abceefgh10 abceefgh22
1 abceefgh10 abceefgh9
1 abceefgh11 abceefgh10
1 abceefgh13 abceefgh11
1 abceefgh14 abceefgh13
1 abceefgh15 abceefgh14
1 abceefgh17 abceefgh16
-1 abceefgh18 abceefgh17
1 abceefgh19 abceefgh18
-1 abceefgh1 abceefgh2
-1 abceefgh20 abceefgh12
1 abceefgh21 abceefgh19
1 abceefgh22 abceefgh20
-1 abceefgh23 abceefgh21
1 abceefgh24 abceefgh24
1 abceefgh2 abceefgh1
1 abceefgh3 abceefgh3
1 abceefgh5 abceefgh5
1 abceefgh6 abceefgh25
1 abceefgh6 abceefgh6
1 abceefgh7 abceefgh7
-1 abceefgh8 abceefgh3
1 abceefgh9 abceefgh8
This example is what I get when I try to sort the columns with sort -gk2.9.
To the best of my knowledge I should expect to see the second column sorted from 1 to 24, and with increasing numerical value (i.e. 1,2,3,4,... and not 1,10,2,20,..., which would result if using -n).
If I cut the 2nd column and sort it with the same command (cut -f 2 ${file} | sort -gk1.9), I actually get the sorting that I want. Am I getting something obvious wrong?
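Using the --debug option, you can see that the column selection does not work as expected. Without -t, the blank that precedes a field counts as part of that field, so character 9 of field 2 is the "h" rather than the first digit: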
1>abceefgh10>abceefgh9
^ no match for key
Specifying the separator, in accordance with Nahuel's comment, works (sort -t $'\t' --debug -gk2.9):
1>abceefgh10>abceefgh9
__
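The fix can be tried on a few TAB-separated lines. This is only a sketch: it assumes a sort that supports -g (as in the question), the function name sort_by_suffix_number is made up, and the ",2" end position is added here to keep the key inside field 2.

```shell
# Hypothetical helper: sort TAB-separated stdin by the number that starts
# at character 9 of field 2 (for example the "10" in "abceefgh10").
sort_by_suffix_number() {
    tab=$(printf '\t')    # portable way to get a literal TAB
    LC_ALL=C sort -t "$tab" -k2.9,2g
}

printf '1\tabceefgh10\tx\n1\tabceefgh9\tx\n-1\tabceefgh1\tx\n' | sort_by_suffix_number
```

The lines should come out in the order abceefgh1, abceefgh9, abceefgh10.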

grep: remove lines with the same number twice

I have a .txt file, and each line contains some numbers. I need to filter out the lines that contain the same number more than once, so the output should only have the lines in which all the numbers are different. I have to use the grep command!
Example:
File_input:
1 1 2 3 4 5
1 2 3 4 5 6
6 6 6 6 6 6
What I want
File_output:
1 2 3 4 5 6
The first and third lines contain repeated numbers, so they have to be filtered out.
This should work for your example:
grep -v "\([0-9]\).*\1" myfile
The idea is to capture any single digit with \([0-9]\) and then search for the same digit again (\1) later on the same line. You can easily extend this to any word made of digits.
With the given input you can use
sed -r '/([0-9]+).+\1/d' File_input
You will have problems with substrings: 1 matches 12 and 12 matches 1.
You can add word boundaries \b with
sed -r '/\b([0-9]+)\b.*\b\1\b/d' File_input
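Since the question asks for grep, the same word-boundary pattern can probably be used with grep -v. This is a sketch that assumes GNU grep, where \b and back-references are accepted with -E; the function name lines_without_repeats is hypothetical.

```shell
# Hypothetical helper: print only the lines of stdin in which no whole
# number appears twice. Assumes GNU grep (\b and \1 with -E).
lines_without_repeats() {
    grep -Ev '\b([0-9]+)\b.*\b\1\b'
}

printf '1 1 2 3 4 5\n1 2 3 4 5 6\n6 6 6 6 6 6\n1 12 3\n' | lines_without_repeats
```

The extra line 1 12 3 is there to check the substring case: it should be kept, because 1 and 12 are different numbers.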

How to replace all matches with an incrementing number in BASH?

I have a text file like this:
AAAAAA this is some content.
This is AAAAAA some more content AAAAAA. AAAAAA
This is yet AAAAAA some more [AAAAAA] content.
I need to replace all occurrence of AAAAAA with an incremented number, e.g., the output would look like this:
1 this is some content.
This is 2 some more content 3. 4
This is yet 5 some more [6] content.
How can I replace all of the matches with an incrementing number?
Here is one way of doing it:
$ awk '{for(x=1;x<=NF;x++)if($x~/AAAAAA/){sub(/AAAAAA/,++i)}}1' file
1 this is some content.
This is 2 some more content 3. 4
This is yet 5 some more [6] content.
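The field loop above appears to rely on each field holding at most one match. A possible variant, sketched here with a made-up function name number_matches, loops on the whole line instead and replaces one match per iteration:

```shell
# Hypothetical helper: replace every AAAAAA on stdin with the next number.
# The loop runs once per remaining match on the current line.
number_matches() {
    awk '{ while (/AAAAAA/) sub(/AAAAAA/, ++i) } 1'
}

printf 'AAAAAA a\nb AAAAAA c AAAAAA. AAAAAA\nd [AAAAAA]\n' | number_matches
```

This works because the replacement is a plain number, so it can never reintroduce the search string and the loop always terminates.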
A perl solution:
perl -pe 'BEGIN{$A=1;} s/AAAAAA/$A++/ge' test.dat
This might work for you (GNU sed):
sed -r ':a;/AAAAAA/{x;:b;s/9(_*)$/_\1/;tb;s/^(_*)$/0\1/;s/$/:0123456789/;s/([^_])(_*):.*\1(.).*/\3\2/;s/_/0/g;x;G;s/AAAAAA(.*)\n(.*)/\2\1/;ta}' file
This is a toy example, perl or awk would be a better fit for a solution.
The solution only acts on lines which contain the required string (AAAAAA).
The hold buffer is used as a place to keep the incremented integer.
In overview: when a required string is encountered, the integer in the hold space is incremented, appended to the current line, swapped for the required string and the process is then repeated until all occurrences of the string are accounted for.
Incrementing an integer simply swaps the last digit (other than trailing 9's) for the next integer in sequence i.e. 0 to 1, 1 to 2 ... 8 to 9. Where trailing 9's occur, each trailing 9 is replaced by a non-integer character e.g '_'. If the number being incremented consists entirely of trailing 9's a 0 is added to the front of the number so that it can be incremented to 1. Following the increment operation, the trailing 9's (now _'s) are replaced by '0's.
As an example say the integer 9 is to be incremented:
9 is replaced by _, a 0 is prepended (0_), the 0 is swapped for 1 (1_), and the _ is replaced by 0, resulting in the number 10.
See comments directed at @jaypal for further notes.
Maybe something like this
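#!/bin/bash
# Replace every AAAAAA with an incrementing number, one match at a time.
NR=1
while IFS= read -r line
do
    # Number each remaining match; double quotes let the shell expand $NR.
    while [[ $line == *AAAAAA* ]]
    do
        line=$(printf '%s\n' "$line" | sed "s/AAAAAA/$NR/")
        NR=$((NR + 1))
    done
    printf '%s\n' "$line"
done < filename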
Perl did the job for me
perl -pi -e 's/\b'DROP'\b/$&.'_'.++$A /ge' /folder/subfolder/subsubfolder/*
Input:
DROP
drop
$drop
$DROP
$DROP="DROP"
$DROP='DROP'
$DROP=$DROP
$DROP="DROP";
$DROP='DROP';
$DROP=$DROP;
$var="DROP_ACTION"
drops
DROPS
CODROP
'DROP'
"DROP"
/DROP/
Output:
DROP_1
drop
$drop
$DROP_2
$DROP_3="DROP_4"
$DROP_5='DROP_6'
$DROP_7=$DROP_8
$DROP_9="DROP_10";
$DROP_11='DROP_12';
$DROP_13=$DROP_14;
$var="DROP_ACTION"
drops
DROPS
CODROP
'DROP_15'
"DROP_16"
/DROP_17/
