How to get the number of different lines in bash [duplicate] - bash

I have a command (cmd1) that greps through a log file to filter out a set of numbers. The numbers are
in random order, so I use sort -gr to get a reverse sorted list of numbers. There may be duplicates within
this sorted list. I need to find the count for each unique number in that list.
For example, if the output of cmd1 is:
100
100
100
99
99
26
25
24
24
I need another command that I can pipe the above output to, so that I get:
100 3
99 2
26 1
25 1
24 2

How about:
$ echo "100 100 100 99 99 26 25 24 24" \
| tr " " "\n" \
| sort \
| uniq -c \
| sort -k2nr \
| awk '{printf("%s\t%s\n",$2,$1)}'
The result is :
100 3
99 2
26 1
25 1
24 2

uniq -c (GNU coreutils 8.23, at least) does exactly what you want, assuming sorted input.
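The sort matters because uniq -c only collapses adjacent duplicates; a quick illustration (output spacing is GNU uniq's):
$ printf '%s\n' 100 99 100 | uniq -c
      1 100
      1 99
      1 100
$ printf '%s\n' 100 99 100 | sort -nr | uniq -c
      2 100
      1 99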

If order is not important:
# echo "100 100 100 99 99 26 25 24 24" | awk '{for(i=1;i<=NF;i++)a[$i]++}END{for(o in a) printf "%s %s ",o,a[o]}'
26 1 100 3 99 2 24 2 25 1
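If you still want the reverse-numeric ordering, a variant of the same awk approach (not part of the original answer) prints one pair per line and sorts afterwards:
$ echo "100 100 100 99 99 26 25 24 24" \
| awk '{for(i=1;i<=NF;i++)a[$i]++}END{for(o in a) print o, a[o]}' \
| sort -nr
100 3
99 2
26 1
25 1
24 2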

Numerically sort the numbers in reverse, then count the duplicates, then swap the left and the right words. Align into columns.
printf '%d\n' 100 99 26 25 100 24 100 24 99 \
| sort -nr | uniq -c | awk '{printf "%-8s%s\n", $2, $1}'
100 3
99 2
26 1
25 1
24 2

In Bash, we can use an associative array to count instances of each input value. Assuming we have the command $cmd1, e.g.
#!/bin/bash
cmd1='printf %d\n 100 99 26 25 100 24 100 24 99'
Then we can count values in the array variable a using the ++ mathematical operator on the relevant array entries:
declare -A a
while read -r i
do
((++a["$i"]))
done < <($cmd1)
We can print the resulting values:
for i in "${!a[#]}"
do
echo "$i ${a[$i]}"
done
If the order of output is important, we might need an external sort of the keys:
for i in $(printf '%s\n' "${!a[@]}" | sort -nr)
do
echo "$i ${a[$i]}"
done
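With the illustrative cmd1 above, this ordered loop should print:
100 3
99 2
26 1
25 1
24 2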

In case you have input stored in my_file you can do:
sort -nr my_file | uniq -c | awk ' { t = $1; $1 = $2; $2 = t; print; } '
Otherwise, just pipe the input to be processed into the same command.
Explanation:
sort -nr sorts the input numerically (-n) in reverse order (-r)
uniq -c counts duplicates and shows the count next to each value
awk '{ t = $1; $1 = $2; $2 = t; print; }' swaps the two columns
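For the sample numbers, the intermediate output before the swap looks roughly like this (uniq -c left-pads the counts):
$ sort -nr my_file | uniq -c
      3 100
      2 99
      1 26
      1 25
      1 24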

Related

Sorting tab delimited numbers by column with pure bash script.

I'm stuck on some homework. The assignment requires accepting an input file and performing some statistics on its values. The user may specify whether to calculate the statistics by row or by column. The script must be pure bash, so I can't use awk, sed, perl, python etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
# CODE FOR ROWS
while read -r line
do
echo $(printf "%d\n" $line | sort -n) | tr ' ' \\t > sorted.txt
....
# I perform the stats calculations
# for each row by working with the temp file sorted.txt
done
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.
If you want to analyze by columns, you'll first need the number of columns (cols). head -n 1 gives you the first row, and awk's NF counts its fields, giving us the number of columns.
cols=$(head -n 1 test.txt | awk '{print NF}');
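Since the assignment rules out awk, a pure-bash way to get the column count is also possible (a sketch, assuming every row has the same number of fields):
read -r -a first_row < test.txt   # read the first line into an array
cols=${#first_row[@]}             # number of fields = number of columns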
Then you can use cut with the '\t' delimiter to grab every column from input.txt, and run it through sort -n, as you did in your original post.
$ for i in `seq 1 $cols`; do cut -f$i -d$'\t' input.txt; done | sort -n > output.txt
For rows, you can use the shell built-in printf with the format modifier %d for integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ total += $1 } END { print total/NR }'
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814
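If the pure-bash restriction also rules out awk and bc for the arithmetic, the sum and (integer) mean can be computed with shell arithmetic alone; a minimal sketch over the sorted output.txt:
sum=0 count=0
while read -r n; do
    (( sum += n, count++ ))       # accumulate total and number of values
done < output.txt
echo "Sum: $sum"
echo "Mean: $(( sum / count ))"   # integer division; bash has no floating point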

Count duplicated couple of lines

I have a configuration file with this format:
cod 11
loc1 23
pto1 33
loc2 55
pto2 66
cod 12
loc1 55
pto1 66
loc2 88
pto2 77
...
I want to count how many times a pair of numbers appears in a loc/pto sequence (independently of the loc/pto number). In the example, the pair 55/66 appears 2 times (once as loc1/pto1 and once as loc2/pto2).
I have googled around and tried some combinations of grep, uniq and awk, but I only managed to count single duplicated lines or numbers. I read the man pages of those commands without finding any clue relevant to my problem.
You could use the following:
$ sort file | uniq -f1 -dc
2 loc1 55
2 pto1 66
-f1 skips the first field when comparing lines
-dc prints only duplicated lines, each with its associated count
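Note that plain sort orders on whole lines, so this relies on equal second fields ending up adjacent; if the labels vary more, sorting on the second field first should be more robust (a variant, not part of the original answer):
$ sort -k2,2n file | uniq -f1 -dc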
Despite no visible effort on the part of the OP, this was an interesting question to work out.
awk '{for (i=1 ; i < 10 ; i++) if (NR == i) array[i]=$2} END {for (i=1 ; i < 10 ; i++) print array[i] "," array[i+1]}' file | sort | uniq -c
Output-
1 11,23
1 12,55
1 23,33
1 33,55
2 55,66
1 66,12
1 66,88
1 88,
The output tells you that 55 is followed by 66 twice. Other pairs only occur once.
Explanation-
I define an array in awk whose ith element is the ith number in the second column. The part after END concatenates the ith and (i+1)th elements. Then sort | uniq -c shows whether these pairs occur more than once.
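The hard-coded bound of 10 lines can be avoided by indexing the array with NR; a sketch of the same idea (it also pairs the last two values instead of emitting the trailing "88,"):
awk '{pairs[NR]=$2} END {for (i=1; i<NR; i++) print pairs[i] "," pairs[i+1]}' file | sort | uniq -c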
If you want to know how many times a duplicate number appeared in the file:
awk '{print $2}' <filename> | sort | uniq -dc
Output:
2 55
2 66
If you want to know how many times a number appeared in the file regardless of being duplicate or not:
awk '{print $2}' <filename> | sort | uniq -c
Output:
1 11
1 12
1 23
1 33
2 55
2 66
1 77
1 88
If you want to print the full line on duplicate match based on second column:
awk '{print $2}' <filename> | sort | uniq -d | grep -F -f - <filename>
Output:
loc2 55
pto2 66
loc1 55
pto1 66

Subtract corresponding lines

I have two files, file1.csv
3 1009
7 1012
2 1013
8 1014
and file2.csv
5 1009
3 1010
1 1013
In the shell, I want to subtract the count in the first column in the second file from that in the first file, based on the identifier in the second column. If an identifier is missing in the second column, the count is assumed to be 0.
The result would be
-2 1009
-3 1010
7 1012
1 1013
8 1014
The files are huge (several GB). The second columns are sorted.
How would I do this efficiently in the shell?
Assuming that both files are sorted on second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014
join will join sorted files.
-j2 will join on the second column.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 Same as -a1, but applied to file2.
-oauto is in this case the same as -o 0,1.1,2.1, which prints the join field and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
The output from join is three columns like:
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
This is piped to awk, which subtracts column three from column two and reformats the output.
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010
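Because for (i in a) iterates in no particular order, the output order above is arbitrary; if you want the result ordered by identifier, one small addition (not in the original answer) is to pipe through sort:
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2 | sort -k2,2n
-2 1009
-3 1010
7 1012
1 1013
8 1014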
This awk solution reads the first file into memory, so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2 # previous $2 = current $2 -> subtract
3 1010 2 # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1 # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed Morton
Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk
{ split(prev,p) }
$2 == p[2] {
print p[1] - $1, p[2]
prev = ""
next
}
p[2] != "" {
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
if (prev != "") {
split(prev,p)
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
}
$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
All those options are required:
-j2 says to join based on the second column of each file
-o auto says to make every row have the same format, beginning with the join key
-e 0 says that missing values should be substituted with zero
-a 1 and -a 2 include rows that are absent from one file or another
the filenames (I've used names based on the question number here)
Now we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc program:
sed -e 's/.*/c& -n[ ]np/'
This takes the three values on each line and rearranges them into a dc program that performs the subtraction. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
We can then pipe all these lines into dc, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
> | sed -e 's/.*/c& -n[ ]np/' \
> | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014
¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.
TL;DR
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.
Assuming this is a CSV with blank separation; if the separator is actually ",", use the awk argument -F ','.
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
{Discounts[$2]=$1; ids[$2]++}
END { for (id in ids) print Inits[ id] - Discounts[ id] " " id}
' file1.csv file2.csv
If memory is an issue (this could be done in one series of pipes, but a temporary file is used here for clarity):
awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
| sort -k2 \
> file.tmp
awk 'Last != $2 {
if (NR != 1) print Result " " Last
Last = $2; Result = $1
next
}
Last == $2 { Result += $1; next }
END { print Result " " Last }
' file.tmp
rm file.tmp

Print out the value with the highest number of occurrences in a file

In a bash shell script, I want to go through a list of numbers and then print out the number that occurs most often. If several different numbers appear an equal number of times, I want to print the highest of them. For example, in a file like this:
10
10
10
15
15
20
20
20
20
I want to print the value 20.
How can I achieve this?
If the numbers are in a file, one per line:
sort < myfile | uniq -c | sort -r | head -1
without the count:
A=$(sort < myfile | uniq -c | sort -r | head -1)
set $A
echo $2
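If ties must resolve to the highest number, as the question asks, one explicit way (a sketch, not part of the original answer) is to sort numerically by count and then by value, both ascending, and take the last line:
sort -n myfile | uniq -c | sort -k1,1n -k2,2n | tail -n 1 | awk '{print $2}'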
You can use this command -
echo 10 10 10 15 15 20 20 20 20 | sed 's/ /\n/g' | sort | uniq -c | sort -V | tail -n 1 | awk '{print $2}'
It will print the number you want.
