I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?
Any common command-line language like awk, Perl, or Python will do.
To see a frequency count for column two (for example):
awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr
fileA.txt
z z a
a b c
w d e
fileB.txt
t r e
z d a
a g c
fileC.txt
z r a
v d c
a m c
Result:
3 d
2 r
1 z
1 m
1 g
1 b
Here is a way to do it in the shell:
FIELD=2
cut -f $FIELD * | sort | uniq -c | sort -nr
This is the sort of thing bash is great at.
The GNU site suggests this nice awk script, which prints both the words and their frequency.
Possible changes:
You can pipe through sort -nr (and reverse word and freq[word]) to see the result in descending order.
If you want a specific column, you can omit the for loop and simply write freq[$3]++, replacing 3 with the column number (see the example after the script).
Here goes:
# wordfreq.awk --- print list of word frequencies
{
$0 = tolower($0) # remove case distinctions
# remove punctuation
gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
for (i = 1; i <= NF; i++)
freq[$i]++
}
END {
for (word in freq)
printf "%s\t%d\n", word, freq[word]
}
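For example, combining both suggested changes to count just column 2 and print the counts in descending order (a minimal sketch, assuming tab-delimited input):
awk -F '\t' '{ freq[$2]++ } END { for (word in freq) printf "%d\t%s\n", freq[word], word }' * | sort -nr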
Perl
This code computes the occurrences of all columns, and prints a sorted report for each of them:
# columnvalues.pl
while (<>) {
@Fields = split /\s+/;
for $i ( 0 .. $#Fields ) {
$result[$i]{$Fields[$i]}++
};
}
for $j ( 0 .. $#result ) {
print "column $j:\n";
@values = keys %{$result[$j]};
@sorted = sort { $result[$j]{$b} <=> $result[$j]{$a} || $a cmp $b } @values;
for $k ( @sorted ) {
print " $k $result[$j]{$k}\n"
}
}
Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*
Explanation
In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure
In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence
Results based on the sample input files provided by @Dennis
column 0:
a 3
z 3
t 1
v 1
w 1
column 1:
d 3
r 2
b 1
g 1
m 1
z 1
column 2:
c 4
a 3
e 2
.csv input
If your input files are .csv, change /\s+/ to /,/
Obfuscation
In an ugliness contest, Perl is particularly well equipped.
This one-liner does the same:
perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files*
Ruby (1.9+)
#!/usr/bin/env ruby
Dir["*"].each do |file|
h=Hash.new(0)
open(file).each do |row|
row.chomp.split("\t").each do |w|
h[ w ] += 1
end
end
h.sort{|a,b| b[1]<=>a[1] }.each{|x,y| print "#{x}:#{y}\n" }
end
Here is a tricky one approaching linear time (but probably not faster!) by avoiding sort and uniq, except for the final sort. It is based on... tee and wc instead!
$ FIELD=2
$ values="$(cut -f $FIELD *)"
$ mkdir /tmp/counts
$ cd /tmp/counts
$ echo | tee -a $values
$ wc -l * | sort -nr
9 total
3 d
2 r
1 z
1 m
1 g
1 b
$
Pure-Bash version:
FIELD=1
declare -A results
while read -r -a line; do
    results[${line[$FIELD]:-(empty)}]=$((results[${line[$FIELD]:-(empty)}]+1))
done < file.txt
echo ${results[@]@A}
The key logic is to fill an associative array whose keys are the values found in the file and whose values are the number of occurrences:
$FIELD is the selected column number
${line[$FIELD]} is the column value from that line in the file
${...:-(empty)} is a special case for empty values (what happens if there are fewer columns than expected?)
To have the output sorted in the expected OP format, a little more work is needed:
sort -rn < <(
for k in "${!results[#]}"; do
echo "${results[$k]} $k";
done
)
Warning: it works well for tab-delimited and space-delimited files, but breaks for values that contain spaces.
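A possible workaround (a sketch, assuming tab-delimited input) is to restrict word splitting to tabs so that values containing spaces stay intact:
while IFS=$'\t' read -r -a line; do
    results[${line[$FIELD]:-(empty)}]=$((results[${line[$FIELD]:-(empty)}]+1))
done < file.txt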
Related
I have the following three-column data (first row is the header) in CSV format:
Value,Y,X
A,8,2
B,3,5
C,7,9
I want the following output, also in CSV format:
Value,Y*X
AB,40
BA,6
AC,72
CA,14
BC,27
CB,35
Is there a way to accomplish this in bash?
thank you
Here is the csv file copy-paste
VALUE,Y,X
A,13,7
C,0,0
D,3,25
E,2,44
F,0,6
H,1,1
I,5,3
K,45,3
L,1,31
M,2,3
N,3,3
P,113,87
Q,13,11
R,20,5
S,7,9
T,9,4
V,7,3
Y,1,0
I tried awk '{print $2*$3}' TEST.dat, but the problem is that it is not combinatorial, i.e. it does not multiply every column 2 value by every column 3 value.
Using awk:
awk -F, 'BEGIN { print "VALUE,Y*X"; i=0 } # Print header
FNR == 1 { next } # Skip existing header lines
FNR == NR { x[++i]=$3; values[i]=$1; next } # First pass through the file
{ # Second pass; multiply current row against every saved row
for (n = 1; n <= i; n++)
if (values[n] != $1) # Except itself
printf "%s%s,%d\n", $1, values[n], $2 * x[n]
}' input.csv input.csv
Process the file twice; first time saving the x values, and second time multiplying the current line's y against all the saved x's.
For fun, a version that uses sqlite, importing the CSV file and then doing a self-join:
sqlite3 -batch -header -csv <<EOF
.import input.csv data
SELECT a.value || b.value AS "VALUE", a.y * b.x AS "Y*X"
FROM data AS a
JOIN data AS b ON a.value <> b.value
ORDER BY a.rowid, b.rowid;
EOF
And pure bash:
#!/usr/bin/env bash
declare -a values yvalues xvalues
exec 3<input.csv
read -r -u 3 _ # Read and discard header
declare -i i=0
while IFS=, read -r -u 3 value y x; do
    i+=1
    values[i]=$value
    yvalues[i]=$y
    xvalues[i]=$x
done
echo "VALUE,Y*X"
for ((a=1; a<=i; a++)); do
for ((b=1; b<=i; b++)); do
if [[ $a -ne $b ]]; then
printf "%s%s,%d\n" "${values[a]}" "${values[b]}" \
"$(( yvalues[a] * xvalues[b] ))"
fi
done
done
I have an array array=(4,2,8,9,1,0), and I don't want to sort the array to find the highest number, because I need to get the index of the highest number as it is, so I can use it for further reference.
Expected output:
9 index value => 3
Can somebody help me to achieve this?
Slight variation with a loop using the ternary conditional operator and no assumptions about range of values:
arr=(4 2 8 9 1 0)
max=${arr[0]}
maxIdx=0
for ((i = 1; i < ${#arr[#]}; ++i)); do
maxIdx=$((arr[i] > max ? i : maxIdx))
max=$((arr[i] > max ? arr[i] : max))
done
printf '%s index => values %s\n' "$maxIdx" "$max"
The only assumption is that array indices are contiguous. If they aren't, it becomes a little more complex:
arr=([1]=4 [3]=2 [5]=8 [7]=9 [9]=1 [11]=0)
indices=("${!arr[@]}")
maxIdx=${indices[0]}
max=${arr[maxIdx]}
for i in "${indices[#]:1}"; do
((arr[i] <= max)) && continue
maxIdx=$i
max=${arr[i]}
done
printf '%s index => values %s\n' "$maxIdx" "$max"
This first gets the indices into a separate array and sets the initial maximum to the value corresponding to the first index; then, it iterates over the indices, skipping the first one (the :1 notation), checks if the current element is a new maximum, and if it is, stores the index and the maximum.
Without using sort, you can use a simple loop in the shell. Here is some sample bash code:
#!/usr/bin/env bash
array=(4 2 8 9 1 0)
for i in "${!array[#]}"; do
[[ -z $max ]] || (( ${array[i]} > $max )) && { max="${array[i]}"; maxind=$i; }
done
echo "max=$max, maxind=$maxind"
max=9, maxind=3
arr=(4 2 8 9 1 0)
paste <(printf "%s\n" "${arr[#]}") <(seq 0 $((${#arr[#]} - 1)) ) |
sort -k1,1 |
tail -n1 |
sed 's/\t/ index value => /'
Print each array element on a newline with printf
Print array indexes with seq
Join both streams using paste
Numerically sort the lines on the first field (i.e. the array value) using sort
Print the last line tail -n1
The array value and the index are separated by a tab. Substitute the tab with whatever output string you want using sed. One could use e.g. cut -f2 to get only the index, or use read a b < <( ... ) to read the numbers into variables, etc.
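For instance, reading both numbers into variables via process substitution (a sketch reusing the same pipeline):
read max maxIdx < <(
    paste <(printf "%s\n" "${arr[@]}") <(seq 0 $((${#arr[@]} - 1))) |
    sort -k1,1n |
    tail -n1
)
echo "max=$max, index=$maxIdx"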
Using Perl
$ export data=4,2,8,9,1,0
$ echo $data | perl -ne ' map{$i++; if($_>$x) {$x=$_;$id=$i} } split(","); print "max=$x", " index=",--${id},"\n" '
max=9 index=3
$
I have a list of names, which are out of order. How can I get them in the correct alphanumeric order, using a custom sort order for the alphabetical part?
My file numbers.txt:
alpha-1
beta-3
alpha-10
beta-5
alpha-5
beta-1
gamma-7
gamma-1
delta-10
delta-2
The main point is that my script should recognize that it should print alpha before beta, and beta before gamma, and gamma before delta.
That is, the words should be sorted based on the order of the letters in the Greek alphabet they represent.
Expected order:
alpha-1
alpha-5
alpha-10
beta-1
beta-3
beta-5
gamma-1
gamma-7
delta-2
delta-10
PS: I tried sort -n numbers.txt, but it doesn't produce the order I need.
You can use an auxiliary awk command as follows:
awk -F- -v keysInOrder="alpha,beta,gamma,delta" '
BEGIN {
split(keysInOrder, a, ",")
for (i = 1; i <= length(a); ++i) keysToOrdinal[a[i]] = i
}
{ print keysToOrdinal[$1] "-" $0 }
' numbers.txt | sort -t- -k1,1n -k3,3n | cut -d- -f2-
The awk command is used to:
map the custom keys onto numbers that reflect the desired sort order; note that the full list of keys must be passed via variable keysInOrder, in order.
prepend the numbers to the input as an auxiliary column, using separator - too; e.g., beta-3 becomes 2-beta-3, because beta is in position 2 in the ordered list of sort keys.
sort then sorts awk's output by the mapped numbers as well as the original number in the 2nd column, yielding the desired custom sort order.
cut then removes the aux. mapped numbers again.
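For example, with the numbers.txt above, the first few lines of awk's intermediate output (before sort and cut) are:
1-alpha-1
2-beta-3
1-alpha-10
2-beta-5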
Here's a Python solution. Don't try to do hard things with Bash, sed, awk. You can usually accomplish what you want, but it'll be more confusing, more error prone, and harder to maintain.
#!/usr/bin/env python3
# Read input lines
use_stdin = True
if use_stdin:
    import sys
    lines = sys.stdin.read().strip().split()
else:
    # for testing
    with open('numbers.txt') as f:
        lines = f.read().strip().split()
# Create a map from greek letters to integers for sorting
greek_letters = """alpha beta gamma delta epsilon zeta
eta theta iota kappa lambda mu
nu xi omicron pi rho sigma
tau upsilon phi chi psi omega"""
gl = greek_letters.strip().split()
gl_map = {letter:rank for rank, letter in enumerate(gl)}
# Split each line into (letter, number)
a = (x.split('-') for x in lines)
b = ((s, int(n)) for s,n in a)
# Using an order-preserving sort, sort by number, then letter
by_number = lambda x: x[1]
by_greek_letter = lambda x: gl_map.get(x[0])
c = sorted(sorted(b, key=by_number), key=by_greek_letter)
# Re-assemble and print
for s, n in c:
    print('-'.join((s, str(n))))
I would reach for Perl here. This script will work:
#!/usr/bin/env perl
use v5.14; # turn on modern features
# Greek alphabet
my @greek_letters = qw(alpha beta gamma delta epsilon zeta
eta theta iota kappa lambda mu
nu xi omicron pi rho sigma
tau upsilon phi chi psi omega);
# An inverted map from letter name to position number;
# $number{alpha} = 1, $number{beta} = 2, etc:
my %number;
@number{@greek_letters} = 1..@greek_letters;
# Read the lines to sort
chomp(my @lines = <>);
# split on hyphen into arrays of individual fields
my @rows = map { [ split /-/ ] } @lines;
# prepend the numeric position of each item's Greek letter
my @keyed = map { [ $number{$_->[0]}, @$_ ] } @rows;
# sort by Greek letter position (first field, index 0) and then
# by final number (third field, index 2)
my @sorted = sort { $a->[0] <=> $b->[0]
                 || $a->[2] <=> $b->[2] } @keyed;
# remove the extra field we added
splice(@$_, 0, 1) for @sorted;
# combine the fields back into strings and print them out
say join('-', @$_) for @sorted;
Save the Perl code into a file (say, greeksort.pl) and run perl greeksort.pl numbers.txt to get your sorted output.
Generic solution:
sort -t- -k 1,1 -k 2,2n numbers.txt
Note that this sorts the letters alphabetically, so delta lands before gamma. The script below patches that up for the custom Greek ordering; it is not the best solution.
The result is again stored in numbers.txt.
#!/bin/bash
sort -t- -k 1,1 -k 2,2n numbers.txt > new_test.txt
while IFS= read -r i
do
    if [[ $i == *"delta"* ]]
    then
        echo "$i" >> temp_file
    else
        echo "$i" >> new_numbers.txt
    fi
done < new_test.txt
cat temp_file >> new_numbers.txt
cat new_numbers.txt > numbers.txt
rm -f new_test.txt temp_file new_numbers.txt
If you have access to awk and sed, then try this, with changes added for the Greek ordering:
cat test.txt | awk -F "-" '{ printf "%s-%0100i\n" , $1, $2 }' | \
sed 's/^alpha-\(.*\)$/01-\1/' | \
sed 's/^beta-\(.*\)$/02-\1/' | \
sed 's/^gamma-\(.*\)$/03-\1/' | \
sed 's/^delta-\(.*\)$/04-\1/' | \
sort | \
sed 's/\(.*\)-\([0]*\)\(.*\)/\1-\3/' | \
sed 's/^01-\(.*\)$/alpha-\1/' | \
sed 's/^02-\(.*\)$/beta-\1/' | \
sed 's/^03-\(.*\)$/gamma-\1/' | \
sed 's/^04-\(.*\)$/delta-\1/'
Reading a text file into an array, extracting elements and sorting them is taking a very long time.
The text file is ffmpeg console output for R128 audio analysis. I need to get the highest M and S values. Example:
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.49998 M: -22.2 S: -29.9 I: -27.0 LUFS LRA: 9.8 LU FTPK: -12.4 dBFS TPK: -9.7 dBFS
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.69998 M: -22.5 S: -28.6 I: -25.9 LUFS LRA: 11.3 LU FTPK: -12.7 dBFS TPK: -9.7 dBFS
The text file can be hundreds or thousands of lines long, depending on the duration of the audio file being analysed.
I want to find the highest M (-22.2) and S (-28.6) values and assign them to the variables M and S.
This is what I am using currently:
ARRAY=()
while read LINE
do
ARRAY+=("$LINE")
done < $tempDir/text.txt
for LINE in "${ARRAY[#]}"
do
echo "$LINE" | sed -n ‘/B:/p' | sed 's/S:.*//' | sed -n -e 's/^.*M://p' | sed -n -e 's/-//p' >>/$tempDir/R128M.txt
done
for LINE in "${ARRAY[#]}"
do
echo "$LINE" | sed -n '/M:/p' | sed 's/I:.*//' | sed -n -e 's/^.*S://p' | sed -n -e 's/-//p' >>$tempDir/R128S.txt
done
cat $tempDir/R128M.txt
M=( $(sort $tempDir/R128M.txt) )
cat $tempDir/R128S.txt
S=( $(sort $tempDir/R128S.txt) )
Is there a faster way of doing this?
Rather than reading the whole file into memory, writing bits of it out to separate files, and reading those in again, just parse it and pick out the largest values:
$ awk '$7 > m || m == "" { m = $7 } $9 > s || s == "" { s = $9 } END { print m, s }' data
-22.2 -28.6
In your data, fields 7 and 9 contain the values of M and S. The awk script updates its m and s variables if it finds larger values in these fields, and prints the largest values found at the end. The m == "" and s == "" tests are needed to trigger initialization when no value has been read yet.
Another way with awk, which may look cleaner:
$ awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { print m, s }' data
To assign them to M and S in the shell:
$ declare $( awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { printf("M=%f S=%f\n", m, s) }' data )
$ echo $M $S
-22.200000 -28.600000
Adjust the printf() format to use %s instead of %f if you want the original strings instead of float values, or set the number of decimals you might want with, e.g., %.2f in place of %f.
First of all, a multi-process pipeline is redundant for extracting a single value, especially considering that you instantiate it anew for every line.
Next, you save all the values into a file and then sort that file, when all you need is the maximum value. You can easily track the maximum during the first (value extraction) loop, in O(N) time overall, instead of paying the I/O overhead and the O(N log N) cost of sorting. See ARITHMETIC EXPANSION and conditional expressions in the bash manual.
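A minimal bash sketch of that idea (assuming the M and S values always have exactly one decimal place, so stripping the dot yields integers that compare correctly; M and S sit in whitespace-separated fields 7 and 9 as in the sample lines above):
maxM=-999.9 maxS=-999.9
while read -r -a f; do
    m=${f[6]} s=${f[8]}   # fields 7 and 9, zero-indexed
    (( ${m/./} > ${maxM/./} )) && maxM=$m
    (( ${s/./} > ${maxS/./} )) && maxS=$s
done < "$tempDir/text.txt"
echo "M=$maxM S=$maxS"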
I work in telecoms and regularly need to expand number ranges.
For example, 6121234567X [note that there are 10 digits preceding the X] is shorthand for:
61212345670
61212345671
61212345672....... etc (a 10 number range)
and 612123456X [note that there are only 9 digits preceding the X] is shorthand for
61212345600
61212345601....... etc (a 100 number range)
So I need a grep command that...
reads how many characters in the line precede the X (to determine how many suffixes are needed)
writes the appropriate number of lines (10, 100, or 1000) with ascending suffixes
hopefully removes the original line
Below is a Python script that does it; the file name is expected as the first argument. Example usage: python script.py file.in > file.out
#!/usr/bin/env python
import sys
def generate(pattern):
    p = pattern.lower().find('x')
    ret = ""
    for i in range(10**(10-p+1)):
        ret += pattern[:p] + str(i).zfill(10-p+1) + " "
    return ret

if __name__ == "__main__":
    if len(sys.argv) <= 1:
        print("Filename needed!")
    else:
        with open(sys.argv[1]) as f:
            for ln in f:
                print(generate(ln.rstrip()))
You can do this in awk quite quickly:
awk -v val="$a" -v max=11 'BEGIN {
    gsub("X","",val)
    items=max - length(val)
    for (i=0; i<10^items; i++)
        print val*(10^items)+i
}'
This works as an example. To do the same reading from a file, you just need to work with $1 (the first field of the line) instead of val and move all the code from BEGIN into the main block, as sketched after the explanation below.
Explanation
-v val="$a" -v max=11 pass parameters: $a is the variable containing the string of the form 12345678X, and max contains the number of digits each expanded number will have (11 in your case).
BEGIN {} perform all these actions before (or without) reading any input file.
gsub("X","",val) remove X from val.
items=max - length(val) compute how many digit positions remain after removing the X.
for (i=0; i<=10^items; i++) print val*(10^items)+i loop from 0 to 10^remaining_size. This means from 0 to 10 or from 0 to 100... depending on the result of 10 - size without X.
Test
With 9 as maximum:
$ a=12345678X
$ awk -v val=$a -v max=9 'BEGIN {gsub("X","",val); items=max - length(val); for (i=0; i<10^items; i++) print val*(10^items)+i}'
123456780
123456781
123456782
123456783
123456784
123456785
123456786
123456787
123456788
123456789
echo 6121234567X | perl -nE 'm/(.*)X/;
say $1. $_ foreach (0..10**(11-length $1)-1)'
61212345670
61212345671
61212345672
61212345673
61212345674
61212345675
61212345676
61212345677
61212345678
61212345679
It's a little uglier to get the zero padded format:
echo 611234567X | perl -wne 'm/(.*)X/; $b=$1; $r=11 - length $b;
$fmt="%0" . $r . "s\n";
printf "$b$fmt", $_ foreach (0..10**$r-1) '