Sort command strange behaviour - sorting

Input file: salary.txt
1 rob hr 10000
2 charls it 20000
4 kk Fin 30000
5 km it 30000
6 kl it 30000
7 mark hr 10000
8 kc it 30000
9 dc fin 40000
10 mn hr 40000
3 abi it 20000
Objective: find all records with the second-highest salary, where the 4th column is the salary (records are space-separated).
I ran two similar commands but the output is entirely different. What am I missing?
Command1 :
sort -nr -k4,4 salary.txt | awk '!a[$4]{a[$4]=$4;t++}t==2'
output:
8 kc it 30000
6 kl it 30000
5 km it 30000
4 kk Fin 30000
command2:
cat salary.txt | sort -nr -k4,4 | awk '!a[$4]{a[$4]=$4;t++}t==2' salary.txt
output:
2 charls it 20000
The only difference between the two commands is how salary.txt is read, so why is the output entirely different?

Because in the second form awk will read directly from salary.txt - which you are passing as the name of the input file - ignoring the output from sort that you are passing to stdin. Leave out the final salary.txt in command2 and you'll see that the output matches that of command1. In fact, sort behaves the same way and the forms:
cat salary.txt | sort
echo "string that will be ignored" | sort salary.txt
will both yield the exact same output.
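A minimal sketch of that behaviour, assuming a throwaway temp file (the file contents and variable names here are illustrative, not from the question). When a file operand is given, both sort and awk read the file and never consume the piped text:

```shell
# Illustrative demo: a file operand takes precedence over piped stdin.
tmp=$(mktemp)
printf '%s\n' b a c > "$tmp"

# sort reads the named file; the piped string is never consumed.
from_file=$(echo "string that will be ignored" | sort "$tmp")

# With no file operand, sort reads the pipe instead.
from_pipe=$(cat "$tmp" | sort)

# awk follows the same rule: the file operand wins over stdin.
awk_out=$(echo "string that will be ignored" | awk '{print $1}' "$tmp")

printf '%s\n' "$from_file"
rm -f "$tmp"
```

Both from_file and from_pipe hold the sorted lines of the file, and awk_out holds the file's lines in their original order.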

In your second command, awk is given a file name, so it does not read from stdin. If you change it to
cat salary.txt | sort -nr -k4,4 | awk '!a[$4]{a[$4]=$4;t++}t==2'
you get the same result

Related

awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk's length counts one more character for each word in my file. The right output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, despite the fact it is not supposed to.
Is there any solution? Or something I'm doing wrong?
I'm going to venture a guess: your awk expects Unix-style newlines (LF), but your dico.txt has Windows-style newlines (CR+LF). That would easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but dico.txt with windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8
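If the CR+LF guess is right, one possible fix is to strip the trailing carriage return before measuring. The sketch below is an assumption about how to do that, not something from the original answer; it builds a small CR+LF file with printf rather than relying on todos:

```shell
# Build a two-word sample with Windows (CR+LF) line endings.
tmp=$(mktemp)
printf 'ABAISSA\r\nABAISSAI\r\n' > "$tmp"

# Unmodified: the trailing CR is counted as part of each record.
raw=$(awk '{print length($0)}' "$tmp")

# Strip a trailing CR from each record before measuring.
fixed=$(awk '{sub(/\r$/, ""); print length($0)}' "$tmp")

printf 'raw:\n%s\nfixed:\n%s\n' "$raw" "$fixed"
rm -f "$tmp"
```

Converting the file once with tr -d '\r' < dico.txt > dico_unix.txt should have the same effect for the original pipeline (dico_unix.txt is a made-up name).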

Sorting tab delimited numbers by column with pure bash script.

I'm stuck on some homework. The requirements of the assignment are to accept an input file and perform some statistics on the values. The user may specify whether to calculate the statistics by row or by value. The shell script must be pure bash script, so I can't use awk, sed, perl, python etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
# CODE FOR ROWS
while read -r line; do
echo $(printf "%d\n" $line | sort -n) | tr ' ' \\t > sorted.txt
....
#I perform the stats calculations
# for row line by working with the temp file sorted.txt
done
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.
If you wanted to analyze by columns you'll need the cols value first (number of columns). head -n 1 gives you the first row, and NF counts the number of fields, giving us the number of columns.
cols=$(head -n 1 input.txt | awk '{print NF}')
Then you can use cut with the '\t' delimiter to grab every column from input.txt, and run it through sort -n, as you did in your original post.
$ for i in $(seq 1 "$cols"); do cut -f"$i" -d$'\t' input.txt; done | sort -n > output.txt
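Since the assignment rules out awk, here is a sketch of the same idea that uses only shell built-ins for the column handling (sort is kept, as in the question's own row code). The helper name column_values and the three-row sample are made up for illustration, and the splitting assumes purely numeric fields with no glob characters:

```shell
# column_values is a hypothetical helper: print column N of FILE, one value per line.
column_values() {
  n=$1
  file=$2
  while read -r line; do
    set -- $line          # split the row on whitespace (tabs or spaces)
    shift $((n - 1))      # drop the fields before column N
    printf '%s\n' "$1"
  done < "$file"
}

tmp=$(mktemp)
printf '1\t1\t1\n39\t43\t4\n6\t57\t8\n' > "$tmp"

# Count the columns from the first row without awk.
read -r first < "$tmp"
set -- $first
cols=$#

# Sort the values of column 2 numerically.
col2_sorted=$(column_values 2 "$tmp" | sort -n)
printf 'columns: %s\n%s\n' "$cols" "$col2_sorted"
rm -f "$tmp"
```

Calling column_values once per column keeps each column's values separate, which matters if the statistics are meant to be reported per column rather than over all values at once.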
For rows, you can use the shell built-in printf with the format modifier %d for integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ total += $1 } END { print total/NR }'
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814
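For a version of these statistics that stays within the pure-shell constraint, the sketch below reads an already sorted, one-value-per-line file and uses only shell arithmetic. The sample values are invented, the mean uses integer division (so it truncates), and the median picks the middle element, or the lower of the two middle elements for an even count; these are assumptions, not part of the original answer:

```shell
# Sample data, sorted numerically, one value per line.
tmp=$(mktemp)
printf '%s\n' 3 1 4 1 5 | sort -n > "$tmp"

# First pass: min, max, sum and count.
sum=0
count=0
while read -r v; do
  if [ "$count" -eq 0 ]; then min=$v; fi
  max=$v                      # input is sorted, so the last value is the max
  sum=$((sum + v))
  count=$((count + 1))
done < "$tmp"

mean=$((sum / count))         # integer division: the fraction is dropped

# Second pass: pick the middle element as the median.
mid=$(( (count + 1) / 2 ))
i=0
while read -r v; do
  i=$((i + 1))
  if [ "$i" -eq "$mid" ]; then median=$v; fi
done < "$tmp"

echo "min=$min max=$max sum=$sum mean=$mean median=$median"
rm -f "$tmp"
```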

Finding second highest salary using awk

I have a file as follows
1 rob hr 10000
2 charls it 20000
4 kk Fin 30000
5 km it 30000
6 kl it 30000
7 mark hr 10000
8 kc it 30000
9 dc fin 40000
10 mn hr 40000
3 abi it 20000
where the 4th column contains the salary of the individual named in column 2. I want to get all the records with the 2nd-highest salary (or, more generally, the nth-highest salary).
Sample output :
4 kk Fin 30000
5 km it 30000
8 kc it 30000
6 kl it 30000
I have tried this :
sort -k4,4 employee.txt | awk 'NR==1{a=$4;next}{if($4>a){print $0 ;exit} else next;}' | a=`awk '{ print $4}'` | awk -v b=$a '$4==b' < cat employee.txt
but this is not giving any output. Any smart suggestions, please?
awk to the rescue!
sort -k4nr file |
awk '!($4 in a){c++; a[$4]} c==2'
4 kk Fin 30000
5 km it 30000
6 kl it 30000
8 kc it 30000
For the highest salary, sort the same way and select c==1 instead:
sort -k4nr file | awk '!($4 in a){c++; a[$4]} c==1'
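The question also asks for the general nth-highest case. One way to sketch it is to pass the rank to awk with -v; the function name nth_highest is hypothetical, and the salary is assumed to be in column 4 as in the sample file:

```shell
# nth_highest is a hypothetical helper, not part of the original answer.
# Usage: nth_highest N FILE   (salary assumed to be in column 4)
nth_highest() {
  sort -k4,4nr "$2" |
    awk -v n="$1" '!($4 in seen) { seen[$4]; rank++ } rank == n'
}

tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1 rob hr 10000
2 charls it 20000
4 kk Fin 30000
5 km it 30000
6 kl it 30000
7 mark hr 10000
8 kc it 30000
9 dc fin 40000
10 mn hr 40000
3 abi it 20000
EOF

second=$(nth_highest 2 "$tmp")
first=$(nth_highest 1 "$tmp")
printf '%s\n' "$second"
rm -f "$tmp"
```

The rank counter only advances when a salary is seen for the first time, so every record sharing the nth distinct salary is printed.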

Minimal two column numeric input data for `sort` example, with distinct permutations

What's the least number of rows of two-column numeric input needed to produce four unique sort outputs for the following four options:
1. -sn -k1 2. -sn -k2 3. -sn -k1 -k2 4. -sn -k2 -k1 ?
Here's a 6 row example, (with 4 unique outputs):
6 5
3 7
6 3
2 7
4 4
5 2
As a convenience, a function to count those four outputs given 2 columns of numbers, (requires the moreutils pee command), which prints the number of unique outputs:
# Usage: foo c1_1 c2_1 c1_2 c2_2 ...
foo() { echo "$@" | tr -s '[:space:]' '\n' | paste - - | \
pee "sort -sn -k1 | md5sum" \
"sort -sn -k2 | md5sum" \
"sort -sn -k1 -k2 | md5sum" \
"sort -sn -k2 -k1 | md5sum" | \
sort -u | wc -l ; }
So to count the unique permutations of this input:
8 5
3 1
8 3
Run this:
foo 8 5 3 1 8 3
Output:
2
(Only two unique outputs. Not enough...)
Note: This question was inspired by the obscurity of the current version of the sort manual, specifically COLUMNS=65 man sort | grep -A 17 KEYDEF | sed 3,18d. The info sort page's treatment of KEYDEFs is much better.
KEYDEFs are more useful than they might first seem. The -u or --unique switch works nicely with the KEYDEFs, and in effect allows sort to delete unwanted redundant lines, and therefore can furnish a more concise substitute for certain sed or awk scripts and similar pipelines.
I can do it in 3 by varying the whitespace:
1 1
2 1
1 2
Your foo function doesn't produce this kind of output, but since it was only a "convenience" and not a part of the question proper, I declare this answer correct and minimal!
Sneakier version:
2 1
11 1
2 2
(The last line contains a tab; the others don't.)
With the -s option, I can't exploit non-numeric comparisons, but then I can exploit the stability of the sort:
1 2
2 1
1 1
The 1 1 line goes above both of the others if both fields are compared numerically, regardless of which comparison is done first. The ordering of the two comparisons determines the ordering of the other two lines.
On the other hand, if one of the fields isn't used for comparison, the 1 1 line stays below one of the other lines (and which one that is depends on which field is used for comparison).
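The claim for the three-row -s example can be checked without pee. This is a rough sketch that simply runs the four option sets over a temp file and counts the distinct results; the loop and variable names are illustrative:

```shell
# The three-row input from the stable-sort example above.
tmp=$(mktemp)
printf '%s\n' '1 2' '2 1' '1 1' > "$tmp"

# Run each key set, flatten each result to one line, count distinct results.
distinct=$(
  for keys in '-k1' '-k2' '-k1 -k2' '-k2 -k1'; do
    # $keys is deliberately unquoted so '-k1 -k2' splits into two options.
    sort -sn $keys "$tmp" | tr '\n' ','
    echo
  done | sort -u | wc -l
)

echo "distinct outputs: $distinct"
rm -f "$tmp"
```

Each ordering is flattened to a single comma-separated line so that sort -u can compare whole outputs rather than individual rows.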

Print out the value with the highest number of occurrences in a file

In a bash shell script, I want to go through a list of numbers and then print out the number that occurs most often. If there are several different numbers appearing an equal amount of times, I want to print the highest number. For example, in a file like this:
10
10
10
15
15
20
20
20
20
I want to print the value 20.
How can I achieve this?
If the numbers are in a file, one per line:
sort < myfile | uniq -c | sort -r | head -1
without the count:
A=$(sort < myfile | uniq -c | sort -r | head -1)
set $A
echo $2
You can use this command -
echo 10 10 10 15 15 20 20 20 20 | sed 's/ /\n/g' | sort | uniq -c | sort -V | tail -n 1 | awk '{print $2}'
It will print the number you want.
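Both pipelines above compare the uniq -c lines as text when counts tie, which can misorder values with different digit counts (9 against 10, for instance). A possible variant, offered as a sketch with a made-up function name and sample data, sorts on the count and then on the value, both numerically and descending:

```shell
# mode_highest is a hypothetical helper: print the most frequent number in
# FILE, preferring the largest number when several share the top count.
mode_highest() {
  sort -n "$1" | uniq -c | sort -k1,1nr -k2,2nr | head -n 1 | awk '{print $2}'
}

tmp=$(mktemp)
# 9 and 10 both occur three times; 10 should win the tie.
printf '%s\n' 9 9 9 10 10 10 5 > "$tmp"

result=$(mode_highest "$tmp")
echo "$result"   # prints 10
rm -f "$tmp"
```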

Resources