I have an output that looks like this: (number of occurrences of the word, and the word)
3 I
2 come
2 from
1 Slovenia
But I want it to look like this:
I 3
come 2
from 2
Slovenia 1
I got my output with:
cut -d' ' -f1 "file" | uniq -c | sort -nr
I tried different things with additional pipes:
cut -d' ' -f1 "file" | uniq -c | sort -nr | cut -d' ' -f8 ...?
which is a good start, because the words are now in first place, but then I no longer have access to the number of occurrences.
AWK and SED are not allowed!
EDIT:
Alright, let's say the file looks like this:
I ....
come ...
from ...
Slovenia ...
I ...
I ....
come ...
from ....
I is repeated 3 times, come twice, from twice, and Slovenia once. They are at the beginning of each line.
AWK and SED are not allowed!
Starting with this:
$ cat file
3 I
2 come
2 from
1 Slovenia
The order can be reversed with this:
$ while read count word; do echo "$word $count"; done <file
I 3
come 2
from 2
Slovenia 1
Complete pipeline
Let us start with:
$ cat file2
I ....
come ...
from ...
Slovenia ...
I ...
I ....
come ...
from ....
Using your pipeline (with two changes) combined with the while loop:
$ cut -d' ' -f1 "file2" | sort | uniq -c | sort -snr | while read count word; do echo "$word $count"; done
I 3
come 2
from 2
Slovenia 1
I made two changes to the pipeline. The first was to put a sort before uniq -c, because uniq -c only collapses adjacent identical lines and therefore needs sorted input to count every occurrence. The second was to add the -s option to the second sort so that words with the same count keep their alphabetical order.
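The pipeline above can also be wrapped in a small function. This is only a sketch: the name count_first_words is made up for illustration, LC_ALL=C is an assumption added to make the ordering predictable across locales, and -s is an extension offered by GNU and BSD sort rather than a POSIX option.

```shell
# count_first_words: read lines on stdin, count the first space-separated
# word of each line and print "word count", highest count first.
# Uses only cut, sort, uniq and the shell (no awk, no sed).
count_first_words() {
    cut -d' ' -f1 |
        LC_ALL=C sort |              # make identical words adjacent for uniq -c
        uniq -c |
        LC_ALL=C sort -k1,1nr -s |   # order by count only; -s keeps ties in sorted order
        while read -r count word; do
            printf '%s %s\n' "$word" "$count"
        done
}

printf '%s\n' 'I ...' 'come ...' 'from ...' 'Slovenia ...' \
              'I ...' 'I ...' 'come ...' 'from ...' | count_first_words
# I 3
# come 2
# from 2
# Slovenia 1
```

Reading from stdin keeps the function usable both with a file (count_first_words < file2) and at the end of another pipeline.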
You can just pipe an awk after your first try:
$ cat so.txt
3 I
2 come
2 from
1 Slovenia
$ cat so.txt | awk '{ print $2 " " $1}'
I 3
come 2
from 2
Slovenia 1
If perl is allowed:
$ cat testfile
I ....
come ...
from ...
Slovenia ...
I ...
I ....
come ...
from ....
$ perl -e 'my %list;
while(<>){
chomp; #strip \n from the end
s/^ *([^ ]*).*/$1/; #keep only 1st word
$list{$_}++; #increment count
}
foreach (keys %list){
print "$_ $list{$_}\n";
}' < testfile
come 2
Slovenia 1
I 3
from 2
Related
I want to count the number of occurrences of part of a filename when doing ls.
For example if my directory has the following files:
apple.cool_test1
banana.cool_test1
banana.cool_test2
cherry.cool_test1
cherry.cool_test2
cherry.cool_test3
I want the result like this:
1 apple
2 banana
3 cherry
So I tried "ls | sort | uniq -c", but how do I extract the first part of the filename? Can my record separator be "."?
Give this one-liner a try:
$ awk -F'.' '{a[$1]++}END{for(x in a)print a[x],x}' file
1 apple
2 banana
3 cherry
You can extract the first part with cut or awk:
$ printf '%s\n' * | cut -d'.' -f1 | uniq -c
1 apple
2 banana
3 cherry
$ printf '%s\n' * | awk -F'.' '{print $1}' | uniq -c
1 apple
2 banana
3 cherry
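The cut variant can be sketched as a function that takes a directory. The name count_prefixes and the temporary demo directory are made up for illustration, and the sketch assumes filenames contain no newline characters.

```shell
# count_prefixes DIR: print "count prefix", where prefix is the part of
# each filename in DIR before the first dot.
count_prefixes() {
    local dir=$1 path
    for path in "$dir"/*; do
        [ -e "$path" ] || continue       # empty directory: the pattern stays unexpanded
        printf '%s\n' "${path##*/}"      # strip the directory part
    done |
        cut -d. -f1 |
        LC_ALL=C sort |                  # explicit sort, so uniq -c sees equal names together
        uniq -c |
        while read -r count prefix; do   # read also drops the padding uniq adds
            printf '%s %s\n' "$count" "$prefix"
        done
}

demo=$(mktemp -d)
touch "$demo"/apple.cool_test1 "$demo"/banana.cool_test1 "$demo"/banana.cool_test2 \
      "$demo"/cherry.cool_test1 "$demo"/cherry.cool_test2 "$demo"/cherry.cool_test3
count_prefixes "$demo"
rm -rf "$demo"
```

The glob already expands in sorted order, so the explicit sort is a safeguard rather than a requirement here.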
I'm using awk to deal with a simple .dat file, which contains several lines of data and each line has 4 columns separated by a single space.
I want to find the minimum and maximum of the first column.
The data file looks like this:
9 30 8.58939 167.759
9 38 1.3709 164.318
10 30 6.69505 169.529
10 31 7.05698 169.425
11 30 6.03872 169.095
11 31 5.5398 167.902
12 30 3.66257 168.689
12 31 9.6747 167.049
4 30 10.7602 169.611
4 31 8.25869 169.637
5 30 7.08504 170.212
5 31 11.5508 168.409
6 31 5.57599 168.903
6 32 6.37579 168.283
7 30 11.8416 168.538
7 31 -2.70843 167.116
8 30 47.1137 126.085
8 31 4.73017 169.496
The commands I used are as follows.
min=`awk 'BEGIN{a=1000}{if ($1<a) a=$1 fi} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>a) a=$1 fi} END{print a}' mydata.dat`
However, the output is min=10 and max=9.
(The similar commands can return me the right minimum and maximum of the second column.)
Could someone tell me where I was wrong? Thank you!
Awk guesses the type.
String "10" is less than string "4" because character "1" comes before "4".
Force a type conversion, using addition of zero:
min=`awk 'BEGIN{a=1000}{if ($1<0+a) a=$1} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>0+a) a=$1} END{print a}' mydata.dat`
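The difference between the two kinds of comparison can be checked in isolation. The helper names is_less_string and is_less_number are made up for this sketch; appending "" forces a string, adding 0 forces a number.

```shell
# Print 1 if A sorts before B when both are treated as strings, else 0.
is_less_string() {
    awk -v a="$1" -v b="$2" 'BEGIN { r = ((a "") < (b "")); print r }'
}

# Print 1 if A is smaller than B when both are treated as numbers, else 0.
is_less_number() {
    awk -v a="$1" -v b="$2" 'BEGIN { r = ((a + 0) < (b + 0)); print r }'
}

is_less_string 10 4   # prints 1: "1" sorts before "4"
is_less_number 10 4   # prints 0: 10 is not smaller than 4
```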
A non-awk answer:
cut -d" " -f1 file |
sort -n |
tee >(echo "min=$(head -1)") \
> >(echo "max=$(tail -1)")
That tee command is perhaps a bit too clever. tee duplicates its stdin stream to the files named as arguments, and it also streams the same data to stdout. I'm using process substitutions to filter the streams.
The same effect can be used (with less flourish) to extract the first and last lines of a stream of data:
cut -d" " -f1 file | sort -n | sed -n '1s/^/min=/p; $s/^/max=/p'
or
cut -d" " -f1 file | sort -n | {
read line
echo "min=$line"
while read line; do max=$line; done
echo "max=$max"
}
Your problem was simply that in your script you had:
if ($1<a) a=$1 fi
That final fi is not part of awk syntax, so awk treats it as an (unset) variable. That makes a=$1 fi a string concatenation, which tells awk that a contains a string rather than a number, and so $1<a is compared as strings instead of numerically.
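The effect of the stray fi can be reproduced on a single value. In this sketch the function names with_fi and without_fi are made up; both feed the value 10 to awk and print the result of a < 4.

```shell
# a = $1 fi concatenates $1 with the unset variable fi, so a becomes the
# string "10" and the comparison with 4 is done as text.
with_fi() {
    echo 10 | awk '{ a = $1 fi; r = (a < 4); print r }'
}

# a = $1 keeps the numeric-looking field, so the comparison is numeric.
without_fi() {
    echo 10 | awk '{ a = $1; r = (a < 4); print r }'
}

with_fi      # prints 1
without_fi   # prints 0
```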
More importantly in general, never start with some guessed value for max/min, just use the first value read as the seed. Here's the correct way to write the script:
$ cat tst.awk
BEGIN { min = max = "NaN" }
{
min = (NR==1 || $1<min ? $1 : min)
max = (NR==1 || $1>max ? $1 : max)
}
END { print min, max }
$ awk -f tst.awk file
4 12
$ awk -f tst.awk /dev/null
NaN NaN
$ a=( $( awk -f tst.awk file ) )
$ echo "${a[0]}"
4
$ echo "${a[1]}"
12
If you don't like NaN pick whatever you'd prefer to print when the input file is empty.
Late, but here is a shorter command that seeds the minimum and maximum from the first row instead of assuming initial values:
awk '(NR==1){Min=$1;Max=$1};(NR>=2){if(Min>$1) Min=$1;if(Max<$1) Max=$1} END {printf "The Min is %d ,Max is %d\n",Min,Max}' FileName.dat
A very straightforward solution (if it's not compulsory to use awk):
Find Min --> sort -n -r numbers.txt | tail -n1
Find Max --> sort -n -r numbers.txt | head -n1
You can use a combination of sort, head, tail to get the desired output as shown above.
(PS: If you want to extract the first column or any other column, you can use the cut command, e.g. cut -d " " -f 1 sample.dat for the first column.)
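Combining cut with sort, head and tail gives both values in one call. This is a sketch under the assumption that columns are separated by single spaces, as in the question; the name column_min_max is made up.

```shell
# column_min_max COL: read space-separated rows on stdin and print
# "min max" for column COL. Returns 1 if there is no input.
column_min_max() {
    local col=$1 sorted
    sorted=$(cut -d' ' -f"$col" | LC_ALL=C sort -n)   # ascending numeric order
    [ -n "$sorted" ] || return 1
    printf '%s %s\n' \
        "$(printf '%s\n' "$sorted" | head -n 1)" \
        "$(printf '%s\n' "$sorted" | tail -n 1)"
}

printf '%s\n' '9 30 8.58939 167.759' '10 30 6.69505 169.529' \
              '4 30 10.7602 169.611' '12 31 9.6747 167.049' | column_min_max 1
# 4 12
```

Sorting once and reading both ends avoids running sort twice on the same data.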
# minimum
cat your_data_file.dat | sort -nk3,3 | head -1
# this prints the row with the minimum of column 3
# maximum
cat your_data_file.dat | sort -nk3,3 | tail -1
# this prints the row with the maximum of column 3
# to search column 2, use -nk2,2
# assign the value to a variable and use it
min_col=`cat your_data_file.dat | sort -nk3,3 | head -1 | awk '{print $3}'`
I would like to first sort on a specific column, which I do using sort -k2 <file>. Then, once the file is sorted by the values in the second column, I would like to add up the values from column 1 for each distinct value in column 2, drop the duplicate lines, and keep the sum in column 1.
Example:
2 AAAAAA
3 BBBBBB
1 AAAAAA
2 BBBBBB
1 CCCCCC
sort -k2 <file> does this:
2 AAAAAA
1 AAAAAA
3 BBBBBB
2 BBBBBB
1 CCCCCC
I know uniq -c removes duplicates and outputs how many times each line occurred; however, I don't want to know how many times it occurred, I just need the column 1 values to be added up and displayed. So that I would get:
3 AAAAAA
5 BBBBBB
1 CCCCCC
I came up with a solution using two for loops:
The first loop loops over all different strings in the file (test.txt), for each one we find all the numbers in the original file, and add them in the second loop. After adding all numbers we echo the total, and the string.
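# outer loop: one pass per distinct string in column 2
for chars in `sort -k2 test.txt | uniq -f 1 | cut -d' ' -f 2`;
do
total=0;
# inner loop: add up the numbers of all lines containing that string
for nr in `grep "$chars" test.txt | cut -d' ' -f 1`;
do
total=$(($total+$nr));
done;
echo $total $chars
done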
-c is your enemy: with it you explicitly ask for the count. Here is my suggestion:
sort -k2 <file> | uniq -f1 > file2
which gives me
cat file2
1 AAAAAA
2 BBBBBB
1 CCCCCC
If you want only column 2 in file, then use awk
sort -k2 <file>| uniq -f1 |awk '{print $2}' > file2
leading to
AAAAAA
BBBBBB
CCCCCC
Now I understand the question at last. If you want to sum column 1 per group, just use awk; uniq cannot compute a grouped sum.
awk '{array[$2]+=$1} END { for (i in array) {print array[i], i}}' file |sort -k2
which leads to your solution (even if I sorted afterwards):
3 AAAAAA
5 BBBBBB
1 CCCCCC
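For comparison, the same grouped sum can be sketched without awk by sorting on the key and keeping a running total in the shell. The function name sum_by_key is made up, and the sketch assumes integer values in column 1 and keys without spaces.

```shell
# sum_by_key: read "number key" lines on stdin and print "total key"
# once per distinct key, in sorted key order.
sum_by_key() {
    LC_ALL=C sort -k2,2 |
        {
            prev='' total=0 seen=0
            while read -r n key; do
                # a new key starts: emit the finished group first
                if [ "$seen" -eq 1 ] && [ "$key" != "$prev" ]; then
                    printf '%s %s\n' "$total" "$prev"
                    total=0
                fi
                total=$((total + n))
                prev=$key
                seen=1
            done
            # emit the last group, if there was any input at all
            if [ "$seen" -eq 1 ]; then
                printf '%s %s\n' "$total" "$prev"
            fi
        }
}

printf '%s\n' '2 AAAAAA' '3 BBBBBB' '1 AAAAAA' '2 BBBBBB' '1 CCCCCC' | sum_by_key
# 3 AAAAAA
# 5 BBBBBB
# 1 CCCCCC
```

Unlike the nested loops with grep, this reads the data once, at the cost of sorting it first.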
I am trying to count unique occurrences of numbers in the 3rd column of a text file, a very simple command:
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | uniq -c
which should say something like
1 10103
2 2093
3 109
but instead puts out nonsense, where the same number is counted multiple times, like
20 1
1 2
1 1
1 2
14 1
1 2
I've also tried
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | sed -e 's/ //g' -e 's/\t//g' | uniq -c
I've tried every combination I can think of from the uniq man page. How can I correctly count the unique occurrences of numbers with uniq?
uniq -c counts the contiguous repeats. To count them all you need to sort it first. However, with awk you don't need to.
$ awk '{count[$3]++} END{for(c in count) print count[c], c}' file
will do
An awk-free version with cut, sort and uniq:
cut -f 3 bisulfite_seq_set0_v_set1.tsv | sort | uniq -c
uniq operates on adjacent matching lines, so the input has to be sorted first.
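The adjacency rule is easy to see on a tiny input. In this sketch the helper names count_adjacent and count_all are made up; the while read loop only strips the padding that uniq -c puts in front of each count.

```shell
# count_adjacent: uniq -c on its own counts runs of identical neighbouring lines.
count_adjacent() {
    uniq -c | while read -r n value; do printf '%s %s\n' "$n" "$value"; done
}

# count_all: sorting first makes equal values neighbours, so each value is counted once.
count_all() {
    LC_ALL=C sort | count_adjacent
}

printf '%s\n' 1 1 2 1 | count_adjacent
# 2 1
# 1 2
# 1 1
printf '%s\n' 1 1 2 1 | count_all
# 3 1
# 1 2
```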