Sort output in bash script by number of occurrences - bash

So I have text output that has HTTP status codes in one column and an IP address in the other. I want to sort this by the number of occurrences, so that
1 2 1 3 4 5 4 4
Looks like
4 4 4 1 1 2 3 5
This is for the second column of status codes; the IP addresses don't need to be sorted in any particular order.
Since 4 is the most common one, it should be first, then 1, and so forth.
However, all I can find is how to use uniq, for example, to count the occurrences, which removes duplicates and prefixes a count to each row.
The regular sort command does not seem to support this either, as far as I can tell.
Any help would be appreciated.

You can still use sort | uniq -c, then expand the counts back out by printing each value the counted number of times in a loop:
tr ' ' '\n' < file \
  | sort | uniq -c | sort -k1,1nr -k2n \
  | while read times status; do
      for i in $(seq 1 $times); do
        printf '%s ' $status
      done
    done
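If you'd rather avoid the shell loop, awk can do the same expansion by reading the input twice: the first pass counts each status code, the second prefixes every line with its count so sort can order by it; cut then strips the helper column again. A sketch, assuming the status codes sit one per line in a hypothetical file codes.txt:
awk 'NR==FNR {count[$1]++; next} {print count[$1], $1}' codes.txt codes.txt \
  | sort -k1,1nr -k2,2n \
  | cut -d' ' -f2-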

Related

How can I count and display only the words that are repeated more than once using unix commands?

I am trying to count and display only the words that are repeated more than once in a file. The basic idea is:
You are given a file with names and characters like commas, colons, slashes, etc.
Use the cut command to display only the first names in the file (other commands are also allowed).
Count and then display only the names repeated more than once.
I got to the point of counting and displaying all the names. However, I haven't found a way to display and to count only those names repeated more than once.
Here is a section of the file:
user1:x:80:200:Mia,Spurs:/home/user1:/bin/bash
user2:x:80:200:Martha,Dalton:/home/user2:/bin/bash
user3:x:80:200:Lucy,Carlson:/home/user3:/bin/bash
user4:x:80:200:Carl,Bingo:/home/user4:/bin/bash
Here is what I have been able to do:
Daniel@Daniel-MacBook-Pro Files % cut -d ":" -f 5-5 file1 | cut -d "," -f 1-1 | sort -n | uniq -c
1 Mia
3 Martha
1 Lucy
1 Carl
1 Jessi
1 Joke
1 Jim
2 Race
1 Sem
1 Shirly
1 Susan
1 Tim
You can filter out the rows with count 1 with grep.
cut -d ":" -f 5 file1 | cut -d "," -f 1 | sort | uniq -c | grep -v '^ *1 '

Unix sort groups by their associated maximum value?

Let's say I have this input file 49142202.txt:
A 5
B 6
C 3
A 4
B 2
C 1
Is it possible to sort the groups in column 1 by the value in column 2? The desired output is as follows:
B 6 <-- B group at the top, because 6 is larger than 5 and 3
B 2 <-- 2 less than 6
A 5 <-- A group in the middle, because 5 is smaller than 6 and larger than 3
A 4 <-- 4 less than 5
C 3 <-- C group at the bottom, because 3 is smaller than 6 and 5
C 1 <-- 1 less than 3
Here is my solution:
join -t$'\t' -1 2 -2 1 \
  <(cat 49142202.txt | sort -k2nr,2 | sort --stable -k1,1 -u | sort -k2nr,2 \
    | cut -f1 | nl | tr -d " " | sort -k2,2) \
  <(cat 49142202.txt | sort -k1,1 -k2nr,2) \
  | sort --stable -k2n,2 | cut -f1,3
The first input to join sorted by column 2 is this:
2 A
1 B
3 C
The second input to join sorted by column 1 is this:
A 5
A 4
B 6
B 2
C 3
C 1
The output of join is:
A 2 5
A 2 4
B 1 6
B 1 2
C 3 3
C 3 1
This output is then sorted by the nl line number in column 2, and cut keeps fields 1 and 3, i.e. the original input columns.
I know it can be done a lot more easily with, for example, pandas' groupby in Python, but is there a more elegant way of doing it while sticking to GNU coreutils such as sort, join, cut, tr and nl? Preferably I want to avoid a memory-inefficient awk solution, but please share those as well. Thanks!
As explained in the comments, my solution tries to reduce the number of pipes, unnecessary cat commands, and especially the number of sort operations in the pipeline, since sorting is a complex/time-consuming operation.
I reached the following solution, where f_grp_sort is the input file:
for elem in $(sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}')
do
    grep $elem <(sort -k2nr f_grp_sort)
done
OUTPUT:
B 6
B 2
A 5
A 4
C 3
C 1
Explanations:
sort -k2nr f_grp_sort will generate the following output:
B 6
A 5
A 4
C 3
B 2
C 1
and sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}' will generate the output:
B
A
C
the awk just prints, in that same order, one unique element from the first column of that sorted output.
Then the for elem in $(...); do grep $elem <(sort -k2nr f_grp_sort); done
will grep for the lines containing B, then A, then C, which produces the required output.
As an enhancement, you can use a temporary file to avoid running the sort -k2nr f_grp_sort operation twice:
$ sort -k2nr f_grp_sort > tmp_sorted_file && for elem in $(awk '!seen[$1]++{print $1}' tmp_sorted_file); do grep $elem tmp_sorted_file; done && rm tmp_sorted_file
So, this won't work for all cases, but if the values in your first column can be turned into bash variable names, we can use dynamically named arrays to do this instead of a bunch of joins. It should be pretty fast.
The first while block reads in the contents of the file, taking the first two space-separated strings and putting them into col1 and col2. We then create a series of arrays named like ARR_A and ARR_B, where A and B are the values from column 1 (but only if $col1 contains nothing but characters that can be used in bash variable names). Each array holds the column 2 values associated with that column 1 value.
I use your fancy sort chain to get the order in which the column 1 values should print, loop through them, and for each column 1 array sort the values and echo out column 1 and column 2.
The dynamic variable bits can be hard to follow, but for the right values in column 1 it will work. Again, if there are any characters in column 1 that can't be part of a bash variable name, this solution will not work.
file=./49142202.txt

while read col1 col2 extra
do
    if [[ "$col1" =~ ^[a-zA-Z0-9_]+$ ]]
    then
        eval 'ARR_'${col1}'+=("'${col2}'")'
    else
        echo "Bad character detected in Column 1: '$col1'"
        exit 1
    fi
done < "$file"

sort -k2nr,2 "$file" | sort --stable -k1,1 -u | sort -k2nr,2 | while read col1 extra
do
    for col2 in $(eval 'printf "%s\n" "${ARR_'${col1}'[@]}"' | sort -r)
    do
        echo $col1 $col2
    done
done
This was my test, a little more complex than your provided example:
$ cat 49142202.txt
A 4
B 6
C 3
A 5
B 2
C 1
C 0
$ ./run
B 6
B 2
A 5
A 4
C 3
C 1
C 0
Thanks a lot @JeffBreadner and @Allan! I came up with yet another solution, which is very similar to my first one but gives a bit more control, because it allows for easier nesting with for loops:
for x in $(sort -k2nr,2 $file | sort --stable -k1,1 -u | sort -k2nr,2 | cut -f1); do
    awk -v x=$x '$1==x' $file | sort -k2nr,2
done
Do you mind if I don't accept either of your answers until I have time to evaluate the time and memory performance of your solutions? Otherwise I would probably just go for the awk solution by @Allan.
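Such an awk solution could look something like the following decorate-sort-undecorate sketch (not necessarily the answer referred to above, just one possible shape of it): a first pass records each group's maximum, a second pass prefixes every line with that maximum, and sort then orders by maximum, group and value before cut strips the helper column again.
awk 'NR==FNR {if (!($1 in max) || $2+0 > max[$1]+0) max[$1] = $2; next}
     {print max[$1], $0}' 49142202.txt 49142202.txt \
  | sort -k1,1nr -k2,2 -k3,3nr \
  | cut -d' ' -f2-
It only keeps one maximum per group in memory, at the cost of reading the input file twice.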

Minimal two column numeric input data for `sort` example, with distinct permutations

What's the least number of rows of two-column numeric input needed to produce four unique sort outputs for the following four options?
1. -sn -k1
2. -sn -k2
3. -sn -k1 -k2
4. -sn -k2 -k1
Here's a 6-row example (with 4 unique outputs):
6 5
3 7
6 3
2 7
4 4
5 2
As a convenience, here is a function that takes the two columns of numbers as arguments and prints how many of those four outputs are unique (it requires the moreutils pee command):
# Usage: foo c1_1 c2_1 c1_2 c2_2 ...
foo() { echo "$@" | tr -s '[:space:]' '\n' | paste - - | \
        pee "sort -sn -k1 | md5sum" \
            "sort -sn -k2 | md5sum" \
            "sort -sn -k1 -k2 | md5sum" \
            "sort -sn -k2 -k1 | md5sum" | \
        sort -u | wc -l ; }
So to count the unique permutations of this input:
8 5
3 1
8 3
Run this:
foo 8 5 3 1 8 3
Output:
2
(Only two unique outputs. Not enough...)
Note: This question was inspired by the obscurity of the current version of the sort manual, specifically COLUMNS=65 man sort | grep -A 17 KEYDEF | sed 3,18d. The info sort page's treatment of KEYDEFs is much better.
KEYDEFs are more useful than they might first seem. The -u or --unique switch works nicely with the KEYDEFs, and in effect allows sort to delete unwanted redundant lines, and therefore can furnish a more concise substitute for certain sed or awk scripts and similar pipelines.
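As a small illustration, the two-sort idiom used in the previous question condenses "keep only the line with the largest column-2 value for each column-1 key" into a single pipeline, with -u doing the per-key deduplication that would otherwise need an awk script (data.txt stands in for any two-column file):
sort -k2,2nr data.txt | sort --stable -k1,1 -u
The first sort puts each group's largest value first; the second, stable sort then keeps only the first line it meets for every distinct -k1,1 key (GNU sort).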
I can do it in 3 by varying the whitespace:
1 1
2 1
1 2
Your foo function doesn't produce this kind of output, but since it was only a "convenience" and not a part of the question proper, I declare this answer correct and minimal!
Sneakier version:
2 1
11 1
2 2
(The last line contains a tab; the others don't.)
With the -s option, I can't exploit non-numeric comparisons, but then I can exploit the stability of the sort:
1 2
2 1
1 1
The 1 1 line goes above both of the others if both fields are compared numerically, regardless of which comparison is done first. The ordering of the two comparisons determines the ordering of the other two lines.
On the other hand, if one of the fields isn't used for comparison, the 1 1 line stays below one of the other lines (and which one that is depends on which field is used for comparison).
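For what it's worth, feeding those three rows to the asker's foo helper (with moreutils' pee installed) should confirm that all four orderings are distinct:
foo 1 2 2 1 1 1
Output:
4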

Bash sort by number AND word length AND alphabetically

I have these strings in my array:
3 rere 33.33%
2 ena 22.22%
1 something 11.11%
1 som 11.11%
1 ok 11.11%
1 evo 11.11%
Expected results are:
3 rere 33.33%
2 ena 22.22%
1 something 11.11%
1 evo 11.11%
1 som 11.11%
1 ok 11.11%
They are ordered by number descending.
And I want to order them also by the length of the word in the middle, but if the words are the same length, order them alphabetically.
These are not columns.
I wanted to split it into two arrays and sort them afterwards, but how would I join them back together?
Anyone got an idea?
You can't sort by length with sort. Let's try a Schwartzian transform:
awk '{print length($2), $0}' file | sort -k2,2nr -k1,1nr -k3,3 | cut -d" " -f2-
The awk command takes 1 something 11.11% and outputs 9 1 something 11.11%.
Then sort sorts first by the 2nd field numerically in descending order, then by the 1st field (the added length) numerically in descending order, then by the 3rd field lexically.
Then cut removes the first field.
The idea behind this is very similar to the Schwartzian transform used in choroba's answer: we add a sort field (in this case the length of the second column), use it to sort, then remove it again:
while read -r col1 word rest; do
    printf "%d\t%s %s %s\n" "${#word}" "$col1" "$word" "$rest"
done < infile | sort -k 2,2nr -k 1,1nr -k 3,3 | cut -f 2
This results in
3 rere 33.33%
2 ena 22.22%
1 something 11.11%
1 evo 11.11%
1 som 11.11%
1 ok 11.11%
After the while loop, the output looks like this:
4 3 rere 33.33%
3 2 ena 22.22%
9 1 something 11.11%
3 1 som 11.11%
2 1 ok 11.11%
3 1 evo 11.11%
There is a new column with the length of the string in the second column. It's tab separated for easier cutting afterwards.
For sort, we specify what to use for sorting with the -k arguments (sort doesn't care if the fields are tab or space separated): 2,2nr uses just the second field, numerically and in descending order; the same goes for 1,1nr, and 3,3 is just your standard lexical sort.
The output now looks like this:
4 3 rere 33.33%
3 2 ena 22.22%
9 1 something 11.11%
3 1 evo 11.11%
3 1 som 11.11%
2 1 ok 11.11%
Now we only have to get rid of the first column, for which we use cut and take advantage of the tab separation introduced with printf.
The Bash while loop is very slow, the Perl solution is likely orders of magnitude faster.
Perl to the rescue!
perl -l -0777 -aF'\n' -ne '
    print for map join(" ", @$_),
              sort { $b->[0] <=> $a->[0]
                     || length($a->[1]) <=> length($b->[1])
                     || $a->[1] cmp $b->[1] }
              map [ split ],
              @F;
' input-file
-n reads the input record by record
-0777 sets the whole file as one record
-l adds newlines to prints
-a splits the input
-F'\n' tells -a to split on newlines
each line is then split on whitespace by split; the rows are then sorted numerically (<=>) by the 0th column in descending order, then by the length of the 1st column, then alphabetically (cmp) by the 1st column

Sort integer by absolute value

I have a list of integers and I want to sort it with sort, but I want to sort on the absolute value of the integers. For example, 7 0 5 10 -2 should give 0 -2 5 7 10 (the integers are on separate lines in my file).
I don't think there is an option in sort to do that, but I can't find another command to sort lines. The -n option sorts in natural order and -g is not what I want.
I tried looking at awk but I don't know if it can help me.
Use
cat numbers.txt | sed -r 's/-([0-9]+)/\1-/g;' | sort -n | sed -r 's/([0-9]+)-/-\1/g;'
the first sed puts the minus behind the digits
sort sorts by number
the second sed puts the minus back in front of the digits
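For example, with the numbers from the question fed in one per line:
printf '%s\n' 7 0 5 10 -2 | sed -r 's/-([0-9]+)/\1-/g;' | sort -n | sed -r 's/([0-9]+)-/-\1/g;'
0
-2
5
7
10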
I can't find this documented anywhere, but when you run sort -Vd it sorts by absolute value. It's a combination of the "version sort" and "numerical sort" options. With 1 5 3 7 -2 -4 -9, version sort on its own does something like this:
1
3
5
7
-2
-4
-9
And numerical sort on its own sorts like this:
-9
-4
-2
1
3
5
7
And with both options, it sorts like this:
1
-2
3
-4
5
7
-9
I don't know if this is by design or by accident, and I've only tested it in GNU sort. I have found this trick to be very useful for certain code golfing situations.
A one-line Perl solution. It works more generally on floating-point values as well. For example:
$ cat numbers.txt
1 -100 5 -4 7 -9 12 25.3 1.8 -1 33.5
$ perl -lane 'print(join " ", sort {abs($a) <=> abs($b)} @F);' numbers.txt
1 -1 1.8 -4 5 7 -9 12 25.3 33.5 -100
If you want the order to be descending, just reverse the $a and $b variables.
If your file is named fname then the following should work:
paste <(sed 's/-//' fname) fname | sort -n | cut -f 2
The sed strips out the - to produce an absolute value; paste joins that absolute value on as the first column, which is what the output is then sorted by. The helper column is then cut away again.
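Since the question mentions awk, the same decorate-sort-undecorate idea also works with awk computing the absolute value; a sketch, again assuming one integer per line in fname:
awk '{v = ($1 < 0 ? -$1 : $1); print v "\t" $0}' fname | sort -n | cut -f 2-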
