How to sort elements alphabetically in a list that have same numeric values - bash

First of all, I mustn't use awk or sed!
I counted the number of times a command was used in bash history. I've used the following code:
cat ${!#} | cut -f 1 -d " " | sort | uniq -c | sort -n | sort -r
Note that cat ${!#} is history.txt by default and I used cut to get rid of the numbered lines. Also I had to use sort -r at the end, otherwise I'd get the output ascending rather than descending (why is that?).
Ok my main question is sorting the commands with the same number of repetitions. Currently I'm getting this output:
seq 3
ps 2
echo 2
cut 2
uname 1
nl 1
cmp 1
But I need to get the following output:
seq 3
cut 2
echo 2
ps 2
cmp 1
diff 1
nl 1
uname 1
So yeah. Sort the commands by number of repetitions and if any of them are the same, sort them alphabetically.
I tried to tackle this with another sort, but haven't had much success thus far.
Also is it just a coincidence that commands in my output are reversed from the desired output? Or are they sorted descending by default?

Try this with GNU sort:
sort -k2nr -k1,1 file
Output:
seq 3
cut 2
echo 2
ps 2
cmp 1
nl 1
uname 1
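To see what the two keys do, here is a minimal sketch on made-up counts in the same "command count" layout. The function name sort_by_count_then_name is hypothetical, and LC_ALL=C is an assumption added only to make the alphabetical tie-break reproducible. The first key is written as -k2,2nr so that it is limited to field 2. By contrast, a trailing plain sort -r (as in the question) re-sorts whole lines in reverse order, which would explain why the tied commands came out in reverse alphabetical order.

```shell
# Hypothetical helper: sort "name count" lines by count (descending),
# then by name (ascending) among equal counts.
sort_by_count_then_name() {
    # -k2,2nr : field 2 only, numeric, reversed (largest first)
    # -k1,1   : field 1 only, used to break ties alphabetically
    LC_ALL=C sort -k2,2nr -k1,1
}

# Made-up sample input, deliberately unordered.
printf '%s\n' 'ps 2' 'uname 1' 'seq 3' 'echo 2' 'nl 1' 'cut 2' 'cmp 1' |
    sort_by_count_then_name
```

Because each key is restricted to a single field, the tie-break never looks at the count and the numeric key never looks at the name.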

Related

Get Average of Found Numbers in Each File to Two Decimal Places

I have a script that searches through all files in the directory and pulls the number next to the word <Overall>. I want to now get the average of the numbers from each file, and output the filename next to the average to two decimal places. I've gotten most of it to work except displaying the average. I should say I think it works, I'm not sure if it's pulling all of the instances in the file, and I'm definitely not sure if it's finding the average, it's hard to tell without the precision. I'm also sorting by the average at the end. I'm trying to use awk and bc to get the average, there's probably a better method.
What I have now:
path="/home/Downloads/scores/*"
(for i in $path
do
echo `basename $i .dat` `grep '<Overall>' < $i |
head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc`
done) | sort -g -k 2
The output I get is:
John 4
Lucy 4
Matt 5
Sara 5
But it shouldn't be an integer and it should be to two decimal places.
Additionally, the files I'm searching through look like this:
<Student>John
<Math>2
<English>3
<Overall>5
<Student>Richard
<Math>2
<English>2
<Overall>4
In general, your script does not extract all numbers from each file, but only the first digit of the first number. Consider the following file:
<Overall>123 ...
<Overall>4 <Overall>56 ...
<Overall>7.89 ...
<Overall> 0 ...
The command grep '<Overall>' | head -c 10 | tail -c 1 will only extract 1.
To extract all numbers preceded by <Overall> you can use grep -Eo '<Overall> *[0-9.]*' | grep -o '[0-9.]*' or (depending on your version) grep -Po '<Overall>\s*\K[0-9.]*'.
To compute the average of these numbers you can use your awk command or specialized tools like ... | average (from the package num-utils) or ... | datamash mean 1.
To print numbers with two decimal places (that is 1.00 instead of 1 and 2.35 instead of 2.34567) you can use printf.
#! /bin/bash
path=/home/Downloads/scores/
for i in "$path"/*; do
avg=$(grep -Eo '<Overall> *[0-9.]*' "$i" | grep -o '[0-9.]*' |
awk '{total += $1} END {print total/NR}')
printf '%s %.2f\n' "$(basename "$i" .dat)" "$avg"
done |
sort -g -k 2
Sorting works only if file names are free of whitespace (like space, tab, newline).
Note that you can swap out the two lines after avg=$( with any method mentioned above.
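The extraction and averaging steps can be tried in isolation on the four-line example file from above. This is only a sketch: the function name overall_avg is hypothetical, and the formatting is done inside awk (rather than by the shell's printf) purely to keep the example in one pipeline.

```shell
# Hypothetical helper: read text on stdin, pull out every number that
# follows <Overall>, and print the mean with two decimal places.
overall_avg() {
    grep -Eo '<Overall> *[0-9.]*' |   # keep only "<Overall>NUMBER" pieces
        grep -o '[0-9.]*' |           # keep only the number itself
        LC_ALL=C awk '{total += $1} END {if (NR) printf "%.2f\n", total/NR}'
}

# The sample lines used earlier: five numbers on four lines.
printf '%s\n' '<Overall>123 ...' '<Overall>4 <Overall>56 ...' \
              '<Overall>7.89 ...' '<Overall> 0 ...' |
    overall_avg
```

The five extracted values are 123, 4, 56, 7.89 and 0, so the printed mean is 38.18. The if (NR) guard is an addition that avoids a division by zero when a file has no <Overall> entries.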
You can use a sed command and retrieve the values to calculate their average with bc:
# Read the stdin, store the value in an array and perform a bc call
function avg() { mapfile -t l ; local IFS=+ ; bc <<< "scale=2; (${l[*]})/${#l[@]}" ; }
# Browse the .dat files, then display for each file the average
find . -iname "*.dat" |
while read f
do
f=${f##*/} # Remove the dirname
# Echoes the file basename and a tabulation (no newline)
echo -en "${f%.dat}\t"
# Retrieves all the "Overall" values and passes them to our avg function
sed -n -E -e 's/<Overall>([0-9]+)/\1/p' "$f" | avg
done
Output example:
score-2 1.33
score-3 1.33
score-4 1.66
score-5 .66
The pipeline head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc needs improvement.
head -c 10 | tail -c 1 leaves only the 10th character of the first Overall line from each file; better drop that.
Instead, use awk to "remove" the prefix <Overall> and extract the number; we can do this by using <Overall> for the input field separator.
Also use awk to format the result to two decimal places.
Since awk did the job, there's no more need for bc; drop it.
The above pipeline becomes awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'.
Don't forget to keep the closing ` after it.
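As a variation on the same idea, awk can also do the filtering itself, so the whole file can be fed in without grep. This is a sketch, not the pipeline from the question: the function name avg_overall is hypothetical, and the NF > 1 test (the line contains the separator) is an assumption about how to recognise <Overall> lines.

```shell
# Hypothetical helper: average the numbers after <Overall> on stdin.
avg_overall() {
    LC_ALL=C awk -F'<Overall>' '
        NF > 1 { total += $2; n++ }              # only lines containing <Overall>
        END    { if (n) printf "%.2f\n", total/n }
    '
}

# The sample file layout from the question.
printf '%s\n' '<Student>John' '<Math>2' '<English>3' '<Overall>5' \
              '<Student>Richard' '<Math>2' '<English>2' '<Overall>4' |
    avg_overall
```

For the two records shown in the question this prints 4.50. Counting matches in n instead of using NR matters here because NR would also count the <Student>, <Math> and <English> lines.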

Bash - Compare rows then print just original rows

I've got files which look like this, (there can be more columns or rows):
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1
And I want to compare these numbers:
1 1 1
1 1 2
1 2 1
2 1 1
1 1 1
And print only those rows which do not repeat, so I get this:
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Another simple approach is sort with uniq using a KEYDEF for fields 2-4 with sort and skipping field 1 with uniq, e.g.
$ sort file.txt -k 2,4 | uniq -f1
Example Use/Output
$ sort file.txt -k 2,4 | uniq -f1
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Keep a running record of the triples already seen and only print the first time they appear:
$ awk '!(($2,$3,$4) in seen) {print; seen[$2,$3,$4]}' file
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Try the following awk code too:
awk '!a[$2,$3,$4]++' Input_file
Explanation:
Create an array named a indexed by $2,$3,$4. The condition !a[$2,$3,$4]++ is true only the first time a given combination of $2,$3,$4 is seen (that is, while it is NOT yet present in array a), and it does two things:
The post-increment raises that index's value from 0 to 1, so the condition will NOT be true the next time the same $2,$3,$4 appear.
No action is specified (awk works in condition-then-action mode), so the default action applies, which is to print the current line. This goes on for all the lines in Input_file; the last line is not printed because its $2,$3,$4 are already present in array a.
I hope this helps.
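For anyone who finds the one-liner too dense, here is a longer spelling of the same logic. It is only a sketch: the function name first_by_cols_2_to_4 and the variable names key and seen are made up for illustration.

```shell
# Hypothetical helper: print a line only the first time its
# columns 2-4 are seen (long form of !a[$2,$3,$4]++).
first_by_cols_2_to_4() {
    awk '{
        key = $2 SUBSEP $3 SUBSEP $4   # the same key awk builds for a[$2,$3,$4]
        if (!(key in seen)) print      # first occurrence: print the line
        seen[key]++                    # remember the key for later lines
    }'
}

printf '%s\n' 'dif-1-2-3-4.com 1 1 1' 'dif-1-2-3-5.com 1 1 2' \
              'dif-1-2-4-5.com 1 2 1' 'dif-1-3-4-5.com 2 1 1' \
              'dif-2-3-4-5.com 1 1 1' |
    first_by_cols_2_to_4
```

On the sample from the question this prints the first four lines and drops dif-2-3-4-5.com, whose 1 1 1 was already seen.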
This works with POSIX and GNU awk:
$ awk '{s=""
for (i=2;i<=NF; i++)
s=s $i "|"}
s in seen { next }
++seen[s]' file
Which can be shortened to:
$ awk '{s=""; for (i=2;i<=NF; i++) s=s $i "|"} !seen[s]++' file
Also supports a variable number of columns.
If you want a sort uniq solution that also respects file order (i.e. the first of the set of duplicates is printed, not the later ones) you need to do a decorate, sort, undecorate approach.
You can:
use cat -n to decorate the file with line numbers;
sort -k3 -k1n to sort first on all the fields from field 3 through the end of the line, then numerically on the added line number;
use uniq -f2 (skipping the line number and the name) to keep only the first line in each group of dups;
finally use sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//' to remove the added line numbers:
cat -n file | sort -k3 -k1n | uniq -f2 | sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'
Awk is easier and faster in this case.
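A small run of the decorate, sort, undecorate idea, on made-up input where the duplicates are not in alphabetical order. This is a sketch under a few assumptions: the function name first_of_each_group is hypothetical, LC_ALL=C is added only for reproducible ordering, and uniq -f2 is used so that exactly the two leading fields (the line number from cat -n and the name) are skipped.

```shell
# Hypothetical helper: keep the earliest line of each group of rows
# whose columns after the name are identical.
first_of_each_group() {
    cat -n |                                           # decorate with line numbers
        LC_ALL=C sort -k3 -k1n |                       # group by data, earliest first
        uniq -f2 |                                     # ignore line number and name
        sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'   # undecorate
}

printf '%s\n' 'b.com 2 1 1' 'a.com 1 1 1' 'c.com 2 1 1' 'd.com 1 1 1' |
    first_of_each_group
```

This prints a.com 1 1 1 and then b.com 2 1 1: the survivors are the earliest line of each group, but the output is ordered by the compared columns rather than by original position.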

uniq -c unable to count unique lines

I am trying to count unique occurrences of numbers in the 3rd column of a text file, a very simple command:
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | uniq -c
which should say something like
1 10103
2 2093
3 109
but instead puts out nonsense, where the same number is counted multiple times, like
20 1
1 2
1 1
1 2
14 1
1 2
I've also tried
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | sed -e 's/ //g' -e 's/\t//g' | uniq -c
I've tried every combination I can think of from the uniq man page. How can I correctly count the unique occurrences of numbers with uniq?
uniq -c counts the contiguous repeats. To count them all you need to sort it first. However, with awk you don't need to.
$ awk '{count[$3]++} END{for(c in count) print count[c], c}' file
will do
awk-free version with cut, sort and uniq:
cut -f 3 bisulfite_seq_set0_v_set1.tsv | sort | uniq -c
uniq operates on adjacent matching lines, so the input has to be sorted first.
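The difference is easy to see on a tiny made-up input. In this sketch the two function names are hypothetical wrappers around the commands already discussed.

```shell
# Hypothetical wrappers, to compare both behaviours on the same input.
count_runs()   { uniq -c; }          # counts each run of adjacent equal lines
count_totals() { sort | uniq -c; }   # sorting first makes equal lines adjacent

# Unsorted input: runs of 1, 2, 1 and 2 again.
printf '%s\n' 1 1 2 1 2 2 | count_runs
printf '%s\n' 1 1 2 1 2 2 | count_totals
```

The first command reports four separate runs (two 1s, one 2, one 1, two 2s), much like the "nonsense" in the question, while the second reports one total per value (three 1s and three 2s).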

Unix - randomly select lines based on column values

I have a file with ~1000 lines that looks like this:
ABC C5A 1
CFD D5G 4
E1E FDF 3
CFF VBV 1
FGH F4R 2
K8K F9F 3
... etc
I would like to select 100 random lines, but with 10 of each third column value (so random 10 lines from all lines with value "1" in column 3, random 10 lines from all lines with value "2" in column 3, etc).
Is this possible using bash?
First grep all the lines with a certain number, shuffle them and pick the first 10 using shuf -n 10.
for i in {1..10}; do
grep " ${i}$" file | shuf -n 10
done > randomFile
If you don't have shuf, use sort -R to randomly sort them instead:
for i in {1..10}; do
grep " ${i}$" file | sort -R | head -10
done > randomFile
If you can use awk, you can do the same with a one-liner
sort -R file | awk '{if (count[$3] < 10) {count[$3]++; print $0}}'
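The one-liner can be wrapped so that the per-group limit is a parameter. This is a sketch assuming shuf is available; the function name sample_per_group and its single argument are made up for illustration.

```shell
# Hypothetical helper: shuffle stdin, then keep at most N lines
# for each distinct value in column 3.
sample_per_group() {
    # $1 = number of lines to keep per group
    shuf | awk -v n="$1" 'count[$3]++ < n'
}

# Made-up input: 3 groups (column 3 = 1, 2, 3) with 5 lines each.
for g in 1 2 3; do
    for i in 1 2 3 4 5; do
        echo "A$i B$i $g"
    done
done | sample_per_group 2
```

Shuffling first and counting afterwards means every line of a group has the same chance of being among the first n seen. Which lines are printed differs from run to run, so only the counts (here 2 per group, 6 in total) are predictable.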

generating frequency table from file

Given an input file containing one single number per line, how could I get a count of how many times an item occurred in that file?
cat input.txt
1
2
1
3
1
0
desired output (=>[1,3,1,1]):
cat output.txt
0 1
1 3
2 1
3 1
It would be great, if the solution could also be extended for floating numbers.
You mean you want a count of how many times an item appears in the input file? First sort it (using -n if the input is always numbers as in your example) then count the unique results.
sort -n input.txt | uniq -c
Another option:
awk '{n[$1]++} END {for (i in n) print i,n[i]}' input.txt | sort -n > output.txt
At least some of that can be done with
sort input.txt | uniq -c
But the output puts the count before the number, the reverse of the desired order. This will fix that problem.
sort input.txt | uniq -c | awk '{print $2, $1}'
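Put together on the sample input from the question, the steps look like this. It is a sketch: the function name freq_table is hypothetical, and sort -n is used so that the values come out in numeric order.

```shell
# Hypothetical helper: print "value count" pairs for the lines on stdin.
freq_table() {
    sort -n |                    # make equal values adjacent, in numeric order
        uniq -c |                # prefix each distinct value with its count
        awk '{print $2, $1}'     # swap to "value count"
}

# The six input lines from the question.
printf '%s\n' 1 2 1 3 1 0 | freq_table
```

This prints 0 1, 1 3, 2 1 and 3 1 on separate lines, matching the desired output.txt.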
Using maphimbu from the Debian stda package:
# use 'jot' to generate 100 random numbers between 1 and 5
# and 'maphimbu' to print sorted "histogram":
jot -r 100 1 5 | maphimbu -s 1
Output:
1 20
2 21
3 20
4 21
5 18
maphimbu also works with floating point:
jot -r 100.0 10 15 | numprocess /%10/ | maphimbu -s 1
Output:
1 21
1.1 17
1.2 14
1.3 18
1.4 11
1.5 19
In addition to the other answers, you can use awk to make a simple graph. (But, again, it's not a histogram.)
perl -lne '$h{$_}++; END{for $n (sort {$a <=> $b} keys %h) {print "$n\t$h{$n}"}}' input.txt
Loop over each line with -n
Each $_ number increments hash %h
Once the END of input.txt has been reached,
sort {$a <=> $b} the hash numerically
Print the number $n and the frequency $h{$n}
Similar code which works on floating point:
perl -lne '$h{int($_)}++; END{for $n (sort {$a <=> $b} keys %h) {print "$n\t$h{$n}"}}' float.txt
float.txt
1.732
2.236
1.442
3.162
1.260
0.707
output:
0 1
1 3
2 1
3 1
I had a similar problem as described, but across gigabytes of gzip'd log files. Because many of these solutions necessitated waiting until all the data was parsed, I opted to write rare to quickly parse and aggregate data based on a regexp.
In the case above, it's as simple as passing in the data to the histogram function:
rare histo input.txt
# OR
cat input.txt | rare histo
# Outputs:
1 3
0 1
2 1
3 1
But it can also handle more complex cases via regex/expressions, such as:
rare histo --match "(\d+)" --extract "{1}" input.txt
