I would like to first sort on a specific column, which I do with sort -k2 <file>. Then, once the file is sorted by the values in the second column, I would like to sum the values in column 1 for each distinct column-2 value, removing the duplicate lines and keeping only the total in column 1.
Example:
2 AAAAAA
3 BBBBBB
1 AAAAAA
2 BBBBBB
1 CCCCCC
sort -k2 <file> does this:
2 AAAAAA
1 AAAAAA
3 BBBBBB
2 BBBBBB
1 CCCCCC
I know uniq -c removes duplicates and prepends how many times each line occurred, but I don't want the occurrence count; I just want the column-1 values summed and displayed, so that I would get:
3 AAAAAA
5 BBBBBB
1 CCCCCC
I came up with a solution using two for loops:
The outer loop iterates over the distinct strings in the file (test.txt); for each one, the inner loop finds all the matching numbers in the original file and adds them up. After adding all the numbers, we echo the total and the string.
for chars in $(sort -k2 test.txt | uniq -f 1 | cut -d' ' -f 2); do
    total=0
    for nr in $(grep "$chars" test.txt | cut -d' ' -f 1); do
        total=$((total + nr))
    done
    echo $total $chars
done
-c is your enemy here: you explicitly said you don't want the count. Here is my suggestion:
sort -k2 <file> | uniq -f1 > file2
which gives me
cat file2
1 AAAAAA
2 BBBBBB
1 CCCCCC
If you want only column 2 in the output file, then use awk:
sort -k2 <file>| uniq -f1 |awk '{print $2}' > file2
leading to
AAAAAA
BBBBBB
CCCCCC
Now I finally got it. If you want to sum column 1, then just use awk; of course you cannot do a grouped sum with uniq:
awk '{array[$2]+=$1} END { for (i in array) {print array[i], i}}' file |sort -k2
which gives your expected output (even though I sorted afterwards):
3 AAAAAA
5 BBBBBB
1 CCCCCC
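For readability, here is the same one-liner expanded with comments; it is the exact command above, just laid out:
awk '
    { array[$2] += $1 }        # accumulate column 1, keyed by the string in column 2
    END {
        for (i in array)       # awk arrays iterate in no particular order,
            print array[i], i  # hence the sort on column 2 afterwards
    }
' file | sort -k2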
I want:
$ cat file
ABCDEFG, XXX
ABCDEFG, YYY
ABCDEFG, ZZZ
AAAAAAA, XZY
BBBBBBB, XYZ
CCCCCCC, YXZ
DDDDDDD, YZX
CDEFGHI, ZYX
CDEFGHI, XZY
$ cat file | magic
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY
So: a pre-sorted file goes in; identify repeats in the first column; count the number of lines in each repeat group; and print that repeat count in front of every line of the group, keeping whatever is in column 2 (which can be anything and is not relevant to the uniqueness test).
Two problems:
1) get the effect of uniq -c, but without deleting the duplicates.
My really "hacky" sed -e solution after searching online was this:
cat file | cut -d',' -f1 | uniq -c | sed -E -e 's/([0-9][0-9]*) (.*)/echo $(yes \1 \2 | head -\1)/;e' | sed -E 's/ ([0-9])/;\1/g' | tr ';' '\n'
I was surprised to see things like head -\1 working, but well great. However, I feel like there should be a much simpler solution to the problem.
2) The above gets rid of the second column. I could just run my code first and then paste the result next to the second column of the original file, but the file is massive and I want things to be as efficient as possible.
Any suggestions?
One in awk. Pretty tired so not fully tested. I hope it works, good night:
$ awk -F, '
$1 != p {
    for (i = 1; i < c; i++)
        print c-1, a[i]
    c = 1
}
{
    a[c++] = $0
    p = $1
}
END {
    for (i = 1; i < c; i++)
        print c-1, a[i]
}' file
Output:
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY
Here's one way using awk that passes the file twice. On the first pass, use an associative array to store the counts of the first column. On the second pass, print the array value and the line itself:
awk -F, 'FNR==NR { a[$1]++; next } { print a[$1], $0 }' file{,}
Results:
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY
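For readers unfamiliar with the idiom, here is the same command spelled out with comments (a sketch of the one-liner above, not a different approach):
awk -F, '
    FNR == NR {            # true only while reading the first copy of the file
        a[$1]++            # first pass: count occurrences of column 1
        next               # skip the printing block during the first pass
    }
    { print a[$1], $0 }    # second pass: prefix every line with its count
' file file                # "file{,}" is brace expansion for "file file"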
I am trying to count unique occurrences of numbers in the 3rd column of a text file, a very simple command:
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | uniq -c
which should say something like
1 10103
2 2093
3 109
but instead puts out nonsense, where the same number is counted multiple times, like
20 1
1 2
1 1
1 2
14 1
1 2
I've also tried
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | sed -e 's/ //g' -e 's/\t//g' | uniq -c
I've tried every combination I can think of from the uniq man page. How can I correctly count the unique occurrences of numbers with uniq?
uniq -c counts contiguous repeats, so to count them all you need to sort the input first. With awk, however, you don't need to sort at all:
$ awk '{count[$3]++} END{for(c in count) print count[c], c}' file
will do
awk-free version with cut, sort and uniq:
cut -f 3 bisulfite_seq_set0_v_set1.tsv | sort | uniq -c
uniq operates on adjacent matching lines, so the input has to be sorted first.
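Applied to the command in the question, that simply means inserting a sort between the awk and uniq -c:
# sort first so identical values become adjacent, then count each run
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | sort | uniq -c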
I have an output that looks like this: (number of occurrences of the word, and the word)
3 I
2 come
2 from
1 Slovenia
But I want it to look like this:
I 3
come 2
from 2
Slovenia 1
I got my output with:
cut -d' ' -f1 "file" | uniq -c | sort -nr
I tried different things with more pipes:
cut -d' ' -f1 "file" | uniq -c | sort -nr | cut -d' ' -f8 ...?
which is a good start, because it puts the words first... but then I lose access to the number of occurrences.
AWK and SED are not allowed!
EDIT:
Alright, let's say the file looks like this:
I ....
come ...
from ...
Slovenia ...
I ...
I ....
come ...
from ....
I is repeated 3 times, come twice, from twice, Slovenia once. They are at the beginning of each line.
Starting with this:
$ cat file
3 I
2 come
2 from
1 Slovenia
The order can be reversed with this:
$ while read count word; do echo "$word $count"; done <file
I 3
come 2
from 2
Slovenia 1
Complete pipeline
Let us start with:
$ cat file2
I ....
come ...
from ...
Slovenia ...
I ...
I ....
come ...
from ....
Using your pipeline (with two changes) combined with the while loop:
$ cut -d' ' -f1 "file2" | sort | uniq -c | sort -snr | while read count word; do echo "$word $count"; done
I 3
come 2
from 2
Slovenia 1
The first change I made to the pipeline was to put a sort before uniq -c, because uniq -c assumes that its input is sorted (it only collapses adjacent duplicates). The second change is to add the -s option to the second sort, so that the alphabetical order of words with the same count is not lost.
You can just pipe an awk after your first try:
$ cat so.txt
3 I
2 come
2 from
1 Slovenia
$ cat so.txt | awk '{ print $2 " " $1}'
I 3
come 2
from 2
Slovenia 1
If perl is allowed:
$ cat testfile
I ....
come ...
from ...
Slovenia ...
I ...
I ....
come ...
from ....
$ perl -e 'my %list;
while (<>) {
    chomp;                 #strip \n from the end
    s/^ *([^ ]*).*/$1/;    #keep only 1st word
    $list{$_}++;           #increment count
}
foreach (keys %list) {
    print "$_ $list{$_}\n";
}' < testfile
come 2
Slovenia 1
I 3
from 2
Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering is there a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of common part to the number of lines in each file would also be nice, though is not necessary.
UPD: Files do not have duplicate lines
To find the lines common to your 2 files using awk:
awk 'a[$0]++' file1 file2
Will output the common lines (3, 10 and 5), one per line.
Now, just pipe this to wc to get the number of common lines:
awk 'a[$0]++' file1 file2 | wc -l
Will output 3.
Explanation:
Here, a works like a dictionary with a default value of 0. When you write a[$0]++, you add 1 to a[$0], but the expression returns the previous value of a[$0] (see the difference between a++ and ++a). So you get 0 (= false) the first time you encounter a given string, and 1 (or more, still = true) every time after that.
By default, awk 'condition' file is a syntax for outputting all the lines where condition is true.
Be also aware that the a[] array will expand every time you encounter a new key. At the end of your script, the size of the array will be the number of unique values you have throughout all your input files (in OP's example, it would be 9).
Note: this solution counts duplicates, i.e. if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates:
awk '++a[$0] == 2' file1 file2 | wc -l
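To make the post- versus pre-increment distinction concrete, here are the two forms from this answer side by side, with comments:
# a[$0]++ : post-increment, the condition sees the OLD count, so a line
#           prints every time it has already been seen at least once
awk 'a[$0]++' file1 file2

# ++a[$0] : pre-increment, the condition sees the NEW count, so "== 2"
#           prints a line exactly once, on its second occurrence
awk '++a[$0] == 2' file1 file2 | wc -l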
With your input example, this works too, but if the files are huge I'd prefer the awk solutions by others:
grep -cFwf file2 file1
with your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The options -1 and -2 suppress the lines unique to file1 and file2 respectively, leaving only the lines common to both.
The output is the lines they have in common, on separate lines. wc -l counts the number of lines.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use the comm command. Remember that you first have to sort the files that you want to compare:
[gc#slave ~]$ sort a > sorted_1
[gc#slave ~]$ sort b > sorted_2
[gc#slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From the man page for comm:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do it all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
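Since that one-liner is fairly dense, here is the same command laid out with comments (identical logic):
awk '
    { a[$0] += 1 }              # count every line across both files
    NR == FNR {                 # still reading file1 ...
        b = FILENAME            # ... so remember its name
        n = NR                  # ... and its line count
    }
    END {
        c = 0
        for (i in a)
            if (a[i] > 1) c++   # lines seen more than once are the common ones
        print b, c/n            # common lines as a fraction of file1
        print FILENAME, c/FNR   # FILENAME and FNR now refer to file2
    }
' file1 file2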
In your solution, you can get rid of one cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work, but it will only report duplicates in the second file that exist in the first file. (If the duplicates exist in the first file, only those that also exist in file2 will be reported, so file order matters).
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1: the new version ensures intra-file duplicates are excluded from the count, so only cross-file duplicates show up in the final stats:
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
This is probably waaay overkill, but I wrote something similar to this to supplement uniq -c:
measuring the frequency of frequencies
It's like uniq -c | uniq -c without wasting time on sorting. The summation and % parts are trivial from here, with 47 overlapping lines in this example. It avoids any per-row processing, since the current setup only shows the summarized stats.
If you need the actual duplicated rows, they're also available right there, serving as the hash keys of the 1st array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
add another file, and the results reflect the changes (I added <( jot - 5 1295 5 ) ):
3 9
2 115
1 482
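If the condensed mawk style above is hard to follow, the frequency-of-frequencies idea itself is small. A plainer sketch in ordinary awk (file1 and file2 are placeholders; this is the simple variant that, like the first script above, does not exclude intra-file duplicates):
awk '
    { seen[$0]++ }                    # level 1: how often does each line occur
    END {
        for (line in seen)
            freq[seen[line]]++        # level 2: how many lines share each count
        for (count in freq)
            print count, freq[count]  # e.g. "2 47": 47 distinct lines occurred twice
    }
' file1 file2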
Using p.txt:
$cat p.txt
R 3
R 4
S 1
S 2
R 1
T 1
R 3
The following command sorts based on the second column:
$cat p.txt | sort -k2
R 1
S 1
T 1
S 2
R 3
R 3
R 4
The following command removes repeated values in the second column:
$cat p.txt | sort -k2 | awk '!x[$2]++'
R 1
S 2
R 3
R 4
Now, substituting a comma for the space, we have the following file:
$cat p1.csv
R,3
R,4
S,1
S,2
R,1
T,1
R,3
The following command still sorts based on the second column:
$cat p1.csv | sort -t "," -k2
R,1
S,1
T,1
S,2
R,3
R,3
R,4
Below is NOT the correct output:
$cat p1.csv | sort -t "," -k2 | awk '!x[$2]++'
R,1
Correct output:
R,1
S,2
R,3
R,4
Any suggestions?
Well, you have already used sort, so you don't need the awk at all: sort has -u.
The cat is not needed either:
sort -t, -k2 -u p1.csv
should give you expected output.
Try awk -F, in your last command. So:
cat p1.csv | sort -t "," -k2 | awk -F, '!x[$2]++'
Since your fields are separated by commas, you need to tell awk that the field separator is no longer whitespace, but instead the comma. The -F option to awk does that.
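As for the !x[$2]++ part, it is the standard awk de-duplication idiom; here it is with comments (equivalent to the one-liner above):
sort -t "," -k2 p1.csv | awk -F, '
    !x[$2]++    # x[$2] is 0 (false) the first time a column-2 value appears, so the
                # pattern is true and the line is printed (the default action);
                # the post-increment then makes the pattern false for every repeat
'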
Well, you don't need awk at all; sort and uniq are enough for this:
sort -t "," -k2 p1.csv | uniq -s 2
uniq -s 2 tells uniq to skip the first 2 characters when comparing (i.e. everything up to and including the comma). Note that this works here because the first field is always a single character.
You need to provide the field separator to awk:
cat p1.csv | sort -t "," -k2 | awk -F, '!x[$2]++'