generating frequency table from file - bash

Given an input file containing one single number per line, how could I get a count of how many times an item occurred in that file?
cat input.txt
1
2
1
3
1
0
desired output (=>[1,3,1,1]):
cat output.txt
0 1
1 3
2 1
3 1
It would be great, if the solution could also be extended for floating numbers.

You mean you want a count of how many times an item appears in the input file? First sort it (using -n if the input is always numbers as in your example) then count the unique results.
sort -n input.txt | uniq -c

Another option:
awk '{n[$1]++} END {for (i in n) print i,n[i]}' input.txt | sort -n > output.txt

At least some of that can be done with
sort output.txt | uniq -c
But the order number count is reversed. This will fix that problem.
sort test.dat | uniq -c | awk '{print $2, $1}'

Using maphimbu from the Debian stda package:
# use 'jot' to generate 100 random numbers between 1 and 5
# and 'maphimbu' to print sorted "histogram":
jot -r 100 1 5 | maphimbu -s 1
Output:
1 20
2 21
3 20
4 21
5 18
maphimbu also works with floating point:
jot -r 100.0 10 15 | numprocess /%10/ | maphimbu -s 1
Output:
1 21
1.1 17
1.2 14
1.3 18
1.4 11
1.5 19

In addition to the other answers, you can use awk to make a simple graph. (But, again, it's not a histogram.)

perl -lne '$h{$_}++; END{for $n (sort keys %h) {print "$n\t$h{$n}"}}' input.txt
Loop over each line with -n
Each $_ number increments hash %h
Once the END of input.txt has been reached,
sort {$a <=> $b} the hash numerically
Print the number $n and the frequency $h{$n}
Similar code which works on floating point:
perl -lne '$h{int($_)}++; END{for $n (sort {$a <=> $b} keys %h) {print "$n\t$h{$n}"}}' float.txt
float.txt
1.732
2.236
1.442
3.162
1.260
0.707
output:
0 1
1 3
2 1
3 1

I had a similar problem as described, but across gigabytes of gzip'd log files. Because many of these solutions necessitated waiting until all the data was parsed, I opted to write rare to quickly parse and aggregate data based on a regexp.
In the case above, it's as simple as passing in the data to the histogram function:
rare histo input.txt
# OR
cat input.txt | rare histo
# Outputs:
1 3
0 1
2 1
3 1
But it can also handle more complex cases via regex/expressions, such as:
rare histo --match "(\d+)" --extract "{1}" input.txt

Related

Bash - Compare rows then print just original rows

I've got files which look like this, (there can be more columns or rows):
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1
And I want to compare these numbers:
1 1 1
1 1 2
1 2 1
2 1 1
1 1 1
And print only those rows which do not repeat, so I get this:
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Another simple approach is sort with uniq using a KEYDEF for fields 2-4 with sort and skipping field 1 with uniq, e.g.
$ sort file.txt -k 2,4 | uniq -f1
Example Use/Output
$ sort file.txt -k 2,4 | uniq -f1
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Keep a running record of the triples already seen and only print the first time they appear:
$ awk '!(($2,$3,$4) in seen) {print; seen[$2,$3,$4]}' file
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Try, the following awk code too:
awk '!a[$2,$3,$4]++' Input_file
Explanation:
Create an array named a and its indexes as $2,$3,$4. The condition here is !a, (which means any line's $2,$3,$4 are NOT present in array a), and then doing 2 things:
Increasing that specific index's value to 1 so that next time that condition will NOT be true for same $2,$3,$4 indexes in array a.
Not specifying an action, (so awk works in the mode of condition and then action), so the default action will be to print the current line. This will go on for all the lines in Input_file, and the last line will not be printed as its $2,$3,$4 are already present in array a.
I hope this helps.
This works with POSIX and gnu awk:
$ awk '{s=""
for (i=2;i<=NF; i++)
s=s $i "|"}
s in seen { next }
++seen[s]' file
Which can be shortened to:
$ awk '{s=""; for (i=2;i<=NF; i++) s=s $i "|"} !seen[s]++' file
Also supports a variable number of columns.
If you want a sort uniq solution that also respects file order (i.e. the first of the set of duplicates is printed, not the later ones) you need to do a decorate, sort, undecorate approach.
You can:
use cat -n to decorate the file with line numbers;
sort -k3 -k1n to sort first on all the fields starting at the 3 though the end of the line then numerically on the line number added;
add -u if your version of sort supports that or use uniq -f3 to only keep the first in the group of dups;
finally use sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*// to remove the added line numbers:
cat -n file | sort -k3 -k1n | uniq -f3 | sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'
Awk is easier and faster in this case.

How to sort elements alphabetically in a list that have same numeric values

First of all, I mustn't use awk or sed!
I counted the number of times a command was used in bash history. I've used the following code:
cat ${!#} | cut -f 1 -d " " | sort | uniq -c | sort -n | sort -r
Note that cat ${!#} is history.txt by deafult and I used cut to get rid of the numbered lines. Also I had to use sort -r at the end, otherwise I'd get the output ascending rather then descending (why is that?).
Ok my main question is sorting the commands with the same number of repetitions. Currently I'm getting this output:
seq 3
ps 2
echo 2
cut 2
uname 1
nl 1
cmp 1
But I need to get the following output:
seq 3
cut 2
echo 2
ps 2
cmp 1
diff 1
nl 1
uname 1
So yeah. Sort the commands by number of repetitions and if any of them are the same, sort them alphabetically.
I tried to tackle this with another sort, but haven't had much success thus far.
Also is it just a coincidence that commands in my output are reversed from the desired output? Or are they sorted descending by default?
Try this with GNU sort:
sort -k2nr -k1,1 file
Output:
seq 3
cut 2
echo 2
ps 2
cmp 1
nl 1
uname 1

Counting equal lines in two files

Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering is there a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of common part to the number of lines in each file would also be nice, though is not necessary.
UPD: Files do not have duplicate lines
To find lines in common with your 2 files, using awk :
awk 'a[$0]++' file1 file2
Will output 3 10 15
Now, just pipe this to wc to get the number of common lines :
awk 'a[$0]++' file1 file2 | wc -l
Will output 3.
Explanation:
Here, a works like a dictionary with default value of 0. When you write a[$0]++, you will add 1 to a[$0], but this instruction returns the previous value of a[$0] (see difference between a++ and ++a). So you will have 0 ( = false) the first time you encounter a certain string and 1 ( or more, still = true) the next times.
By default, awk 'condition' file is a syntax for outputting all the lines where condition is true.
Be also aware that the a[] array will expand every time you encounter a new key. At the end of your script, the size of the array will be the number of unique values you have throughout all your input files (in OP's example, it would be 9).
Note: this solution counts duplicates, i.e if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates :
awk '++a[$0] == 2' file1 file2 | wc -l
with your input example, this works too. but if the files are huge, I prefer the awk solutions by others:
grep -cFwf file2 file1
with your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The arguments 1,2 suppresses unique lines found in both files.
The output is the lines they have in common, on separate lines. wc -l counts the number of lines.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use comm command. Remember that you will have to first sort the files that you need to compare:
[gc#slave ~]$ sort a > sorted_1
[gc#slave ~]$ sort b > sorted_2
[gc#slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From man pages for comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of one cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work, but will only report duplicates in the second file that exist in the first file. (If the duplicates exist in the first file, only the those that also exist in file2 will be reported, so file order matters).
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1 : new version ensures intra-file duplicates are excluded from count, so only cross-file duplicates would show up in the final stats :
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
this is probably waaay overkill, but i wrote something similar to this to supplement uniq -c :
measuring the frequency of frequencies
it's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 over-lapping lines in this example. It avoids spending any time performing per row processing, since the current setup only shows the summarized stats.
If you need to actual duplicated rows, they're also available right there serving as the hash key for the 1st array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
add another file, and the results reflect the changes (I added <( jot - 5 1295 5 ) ):
3 9
2 115
1 482

How can I use awk to sort columns by the last value of a column?

I have a file like this (with hundreds of lines and columns)
1 2 3
4 5 6
7 88 9
and I would like to re-order columns basing on the last line values (or a specific line values)
1 3 2
4 6 5
7 9 88
How can I use awk (or other) to accomplish this task?
Thank you in advance for your help
EDIT: I would like to thank everybody and to apologize if I wasn't enough clear.
What I would like to do is:
take a line (for example the last one);
reorder the columns of the matrix using the sorted values of the chosen line to derermine the order.
So, the last line is 7 88 9, which sorted is 7 9 88, then the three columns have to be reordered in a way such that, in this case, the last two columns are swapped.
A four-column more generic example, based on the last line again:
Input:
1 2 3 4
4 5 6 7
7 88.0 9 -3
Output:
4 1 3 2
7 4 6 5
-3 7 9 88.0
Here's a quick, dirty and improvable solution: (edited because OP clarified that numbers are floating point).
$ cat test.dat
1 2 3
4 5 6
.07 .88 -.09
$ awk "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
3 1 2
6 4 5
-.09 .07 .88
To see what's going on there (or at least part of it):
$ echo "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
{print $3,$1,$2} test.dat
To make it work for an arbitrary line, replace tail -n1 with tail -n+$L|head -n1
This problem can be elegantly solved using GNU awk's array sorting feature. GNU awk allows you to control array traversal using PROCINFO. So two passes of the file are required, the first pass to split the last record into an array and the second pass to loop through the indices of the array in value order and output fields based on indices. The code below probably explains it better than I do.
awk 'BEGIN{PROCINFO["sorted_in"] = "#val_num_asc"};
NR == FNR {for (x in arr) delete arr[x]; split($0, arr)};
NR != FNR{sep=""; for (x in arr) {printf sep""$x; sep=" "} print ""}' file.txt file.txt
4 1 3 2
7 4 6 5
-3 7 9 88.0
Update:
Create a file called transpose.awk like this:
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str OFS a[i,j];
}
print str
}
}
Now here is the script that should do work for you:
awk -f transpose.awk file | sort -n -k $(awk 'NR==1{print NF}' file) | awk -f transpose.awk
1 3 2
4 6 5
7 9 88
I am using transpose.awk twice here. Once to transpose rows to columns then I am doing numeric sorting by last column and then again I am transposing rows to columns. It may not be most efficient solution but it is something that works as per the OP's requirements.
transposing awk script courtesy of: #ghostdog74 from An efficient way to transpose a file in Bash

Unix - randomly select lines based on column values

I have a file with ~1000 lines that looks like this:
ABC C5A 1
CFD D5G 4
E1E FDF 3
CFF VBV 1
FGH F4R 2
K8K F9F 3
... etc
I would like to select 100 random lines, but with 10 of each third column value (so random 10 lines from all lines with value "1" in column 3, random 10 lines from all lines with value "2" in column 3, etc).
Is this possible using bash?
First grep all the files with a certain number, shuffle them and pick the first 10 using shuf -n 10.
for i in {1..10}; do
grep " ${i}$" file | shuf -n 10
done > randomFile
If you don't have shuf, use sort -R to randomly sort them instead:
for i in {1..10}; do
grep " ${i}$" file | sort -R | head -10
done > randomFile
If you can use awk, you can do the same with a one-liner
sort -R file | awk '{if (count[$3] < 10) {count[$3]++; print $0}}'

Resources