Simpler way to count the number of duplicated rows in a text file - bash

I have a text file that looks like this:
abc
bcd
abc
efg
bcd
abc
And the expected output is this:
3 abc
2 bcd
1 efg
I know there is an existed solution for this:
sort -k2 < inFile |
awk '!z[$1]++{a[$1]=$0;} END {for (i in a) print z[i], a[i]}' |
sort -rn -k1 > outFile
The code sorts, removes duplicates, and sorts again, and prints the expected output.
However, is there a simpler way to express the z[$1]++{a[$1]=$0} part? More "basic", I mean.

More basic:
$ sort inFile | uniq -c
3 abc
2 bcd
1 efg
More basic awk
When one is used to awk's idioms, the expression !z[$1]++{a[$1]=$0;} is clear and concise. For those used to programming in other languages, other forms might be more familiar, such as:
awk '{if (z[$1]++ == 0) a[$1]=$0;} END {for (i in a) print z[i], a[i]}'
Or,
awk '{if (z[$1] == 0) a[$1]=$0; z[$1]+=1} END {for (i in a) print z[i], a[i]}'

If your input file contains billions of lines and you want to avoid sort, then you can just do:
awk '{a[$0]++} END{for(x in a) print a[x],x}' file.txt

Related

AWK : To print data of a file in sorted order of result obtained from columns

I have an input file that looks somewhat like this:
PlayerId,Name,Score1,Score2
1,A,40,20
2,B,30,10
3,C,25,28
I want to write an awk command that checks for players with sum of scores greater than 50 and outputs the PlayerId,and PlayerName in sorted order of their total score.
When I try the following:
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k5
It does not work and seemingly sorts them on the basis of their ids.
1 A
3 C
Whereas the correct output I'm expecting is : ( since Player A has sum of scores=60, and C has sum of scores=53, and we want the output to be sorted in ascending order )
3 C
1 A
In addition to this,what confuses me a bit is when I try to sort it on the basis of score1, i.e. column 3 but intend to print only the corresponding ids and names, it dosen't work either.
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k3
And outputs :
1 A
3 C
But if the $3 with respect to what the data is being sorted is included in the print,
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50)print $1,$2,$3}' | sort -k3
It produces the correct output ( but includes the unwanted score1 parameter in display )
3 C 25
1 A 40
But what if one wants to only print the id and name fields ?
Actually I'm new to awk commands, and probably I'm not using the sort command correctly. It would be really helpful if someone could explain.
I think this is what you're trying to do:
$ awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' file |
sort -k1,1n | cut -d' ' -f2-
3 C
1 A
You have to print the sum so you can sort by it and then the cut removes it.
If you wanted the header output too then it'd be:
$ awk 'BEGIN{FS=","} {sum=$3+$4} (NR==1) || (sum>50){print (NR>1),sum,$1,$2}' file |
sort -k1,2n | cut -d' ' -f3-
PlayerId Name
3 C
1 A
if you outsource sorting, you need to have the auxiliary values and need to cut it out later, some complication is due to preserve the header.
$ awk -F, 'NR==1 {print s "\t" $1 FS $2; next}
(s=$3+$4)>50 {print s "\t" $1 FS $2 | "sort -n" }' file | cut -f2
PlayerId,Name
3,C
1,A

counting occurence of character

I have a file that looks like this
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
I need to count number of time A, B & D occur. Individually, I would do it like this
awk '{if($1~/A/) print $0 }' < test.txt | wc
awk '{if($1~/B/) print $0 }' < test.txt | wc
awk '{if($1~/D/) print $0 }' < test.txt | wc
How to join these lines so that I can count number of A,B,D just through one liner instead of 3 separate lines.
For specific line format (where the needed char is before _):
$ awk -F"_" '{ seen[substr($1, length($1))]++ }END{ for(k in seen) print k, seen[k] }' file
A 3
B 2
D 2
Counting occurrences is generally done by keeping track of a counter. So a single of the OP's awk lines;
awk '{if($1~/A/) print $0}' < test.txt | wc
can be rewritten as
awk '($1~/A/){c++}END{print c}' test.txt
for multiple cases, you can now do:
awk '($1~/A/){c["A"]++}
($1~/B/){c["B"]++}
($1~/D/){c["D"]++}
END{for(i in c) print i,c[i]}' test.txt
Now you can even clean this up a bit more:
awk '{c["A"]+=($1~/A/)}
{c["B"]+=($1~/B/)}
{c["D"]+=($1~/D/)}
END{for(i in c) print i,c[i]}' test.txt
which you can clean up further as:
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=($1~a[i])}
END{for(i in c) print i,c[i]}' test.txt
But these cases just count how many times a line occurs that contains the letter, not how many times the letter occurs.
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=gsub(a[i],"",$1)}
END{for(i in c) print i,c[i]}' test.txt
Perl to the rescue!
perl -lne '$seen{$1}++ if /([ABD])/; END { print "$_:$seen{$_}" for keys %seen }' < test.txt
-n reads the input line by line
-l removes newlines from input and adds them to output
a hash table %seen is used to keep the number of occurrences of each symbol. Each time it's matched it's captured and the corresponding field in the hash is incremented.
END is run when the file ends. It outputs all the keys of the hash, i.e. the matched characters, each followed by the number of occurrences.
datafile:
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
script.awk
BEGIN {
arr["A"]=0
arr["B"]=0
arr["D"]=0
}
/A/ { arr["A"]++ }
/B/ { arr["B"]++ }
/D/ { arr["D"]++ }
END {
printf "A: %s, B: %s, D: %s", arr["A"], arr["B"], arr["D"]
}
execution:
awk -f script.awk datafile
result:
A: 3, B: 2, D: 2

Count number of nonempty entries in each column of, e.g., comm output

The Unix command comm file1 file2 has a 3 column output with lines unique to file1 in the first column, lines unique to file2 in the second, and lines shared by both in the 3rd (assuming file1 and file2 are sorted). It ends up looking something like this:
$ echo -e "alpha\nbravo\ncharlie" > file1
$ echo -e "alpha\nbravo\ndelta" > file2
$ comm file1 file2
alpha
bravo
charlie
delta
If I want the number of nonempty lines in each column, is there a general way to parse the output of comm and count those?
I know that for comm in particular I could just run
for i in {12,23,31}; do comm -$i file1 file2 | wc -l; done
but I'm curious about solutions that take the comm output file as a starting point, for the sake of getting better at Unix command line. I added the awk tag because I have a hunch there's a good awk solution.
The other answer covers your question of using awk to do the job quite well, but it is also worth mentioning that the GNU version of comm has a --total option which will print the sum of each column in a similar manner.
You may use this awk:
comm file1 file2 |
awk -F '\t' -v OFS='\n' '{ if ($1=="") if ($2=="") c3++; else c2++; else c1++ }
END { print c3, c2, c1 }'
2
1
1
Note that output of comm is tab delimited with these cases:
1st and 2nd empty column in common lines
1st empty column in lines unique to file2
1st non-empty column in lines unique to file1
The question is interesting, but not as easy as one would imagine, especially if you do not have the --total option.
A couple of things about comm:
comm works on sorted files
if a line appears n times in file1 and m times n < m times in file2, comm will output n-m entries in column 2 and n entries in column 3.
$ comm <(echo -e "1\n2\n3") <(echo "2\n2\n3\n4")
1
2
2
3
4
comm uses <tab>-character as a default separator, processing its output becomes problematic if your input contains this character.
$ comm <(echo -e "1\t2\n3") <(echo "2\n3\n4")
1 2 << this is the weird line
2
3
4
Luckily it has an option to define the delimiter (--output-delimiter=STR)
comm only adds a delimiter if other non-empty fields are following
$ comm --output-delimiter=SEP <(echo -e "1\n2\n3") <(echo "2\n3\n4")
1 << NO SEP (1 field)
SEPSEP2 << TWO SEP (3 fields)
SEPSEP3 << TWO SEP (3 fields)
SEP4 << ONE SEP (2 fields)
How can we solve it now:
We should clearly not use an ASCII symbol as a delimiter, this is asking for problems when processing ASCII files, so what you can do is use a non-printable character as a delimiter. You could use for example <start-of-heading>-character with octal value \001 (it does not accept the <null>-character). This generally solves the issues you might have due to point (3)
$ comm --output-delimiter=$'\001' <(echo -e "1\t2\n3") <(echo "2\n3\n4")
this output can now be piped into an extremely simple awk
$ awk -F "\001" '{a[NF]++}END{print a[1],a[2],a[3] }'
the above works because of point (4).
So you can just do:
$ comm --output-delimiter=$'\001' file1 file2 \
| awk -F "\001" '{a[NF]++}END{print a[1],a[2],a[3] }'
But I don't have that --output-delimiter option: This calls for the pure awk solution. We keep track of 3 arrays. a for file1 b for file2 and c for the combination. (c keeps track of all the entries). We make sure to keep point (2) into account.
$ awk '(NR==FNR) { a[$0]++; c[$0]++ }
(NR!=FNR) { b[$0]++; c[$0]-- }
END { for(i in c) {
if (c[i] < 0) { countb+=-c[i]; countc+=a[i] }
else if (c[i] == 0) { countc+=a[i] }
else { counta+= c[i]; countc+=b[i] }
}
print counta, countb, countc
}' file1 file2
We could essentially get rid of the array b as it can be derived from a and c, but I wanted to make it a bit more clear how it works; the other version would be:
$ awk '(NR==FNR) { a[$0]++; c[$0]++; next } { c[$0]-- }
END { for(i in c) {
counta+=(c[i]>0 ? c[i] : 0)
countb-=(c[i]<0 ? c[i] : 0)
countc+=a[i] - (c[i]>0 ? c[i] : 0)
}
print counta, countb, countc
}' file1 file2
Using Perl
$ comm file1 file2 | perl -lne ' /^\t\t/ and $kv{2}++; /^\t\S+/ and $kv{1}++; /^\S+/ and $kv{3}++; END { print "col-$_:$kv{$_}" for(keys %kv) } '
col-3:1
col-1:1
col-2:2
$
or
$ comm file1 file2 | perl -lne ' /(^\t\t)|(^\t\S+)|(^.)/ and $x=$+[0]>2?3:$+[0]; $kv{$x}++; END { print "col-$_:$kv{$_}" for(keys %kv) } '
col-3:1
col-1:1
col-2:2
$
where
col-1 -> first file
col-3 -> second file
col-2 -> both file
obviously you can do all in awk without comm or requiring sorted inputs.
$ awk 'NR==FNR {a[$1]; next}
{if($1 in a) {c3++; delete a[$1]}
else c2++}
END {print length(a),c2,c3}' file1 file2
1 1 2
that's counts for file1 only, file2 only, and common.
Note, this requires that the records are unique in each file.

How I can keep only the non repeated lines in a file?

Want I want to do is simply keep the lines which are not repeated in a huge file like this:
..
a
b
b
c
d
d
..
The desired output is then:
..
a
c
..
Many thanks in advance.
uniq has arg -u
-u, --unique only print unique lines
Example:
$ printf 'a\nb\nb\nc\nd\nd\n' | uniq -u
a
c
If your data is not sorted, do sort at first
$ printf 'd\na\nb\nb\nc\nd\n' | sort | uniq -u
Preserve the order:
$ cat foo
d
c
b
b
a
d
$ grep -f <(sort foo | uniq -u) foo
c
a
greps the file for patterns obtained by aforementioned uniq. I can imagine, though, that if your file is really huge then it will take a long time.
The same without somewhat ugly Process substitution:
$ sort foo | uniq -u | grep -f- foo
c
a
This awk should work to list only lines that are not repeated in file:
awk 'seen[$0]++{dup[$0]} END {for (i in seen) if (!(i in dup)) print i}' file
a
c
Just remember that original order of lines may change due to hashing of arrays in awk.
EDIT: To preserve the original order:
awk '$0 in seen{dup[$0]; next}
{seen[$0]++; a[++n]=$0}
END {for (i=1; i<=n; i++) if (!(a[i] in dup)) print a[i]}' file
a
c
This is job that is tailor made for awk which doesn't require multiple processes, pipes and process substitution and will be more efficient for bigger files.
When your file is sorted, it's simple:
cat file.txt | uniq > file2.txt
mv file2.txt file.txt

Join lines based on pattern

I have the following file:
test
1
My
2
Hi
3
i need a way to use cat ,grep or awk to give the following output:
test1
My2
Hi3
How can i achieve this in a single command? something like
cat file.txt | grep ... | awk ...
Note that its always a string followed by a number in the original text file.
sed 'N;s/\n//' file.txt
This should give the desired output when the content is in file.txt
paste -d "" - - < filename
This takes consecutive lines and pastes them together delimited by the empty string.
awk '{printf("%s", $0);} !(NR%2){printf("\n");}' file.txt
EDIT: I just noticed that your question requires the use of cat and grep. Both of those programs are unnecessary to achieve your stated aims. If you have some reason for including them that you haven't mentioned, try this (uselessly inefficient) version of the line I wrote immediately above:
cat file.txt | grep '^' | awk '{printf("%s", $0);} !(NR%2){printf("\n");}'
It is possible that this command uses features not present in the original awk program. You may need to invoke the new awk program, nawk instead.
If your input file is always 1 number then 1 string, and you only want the strings, all you have to do is take every other line.
If you only want the odd lines, you can do awk 'NR % 2' file.txt
If you want the evens, this becomes awk 'NR % 2==0' data
Here is the answer:
cat file.txt | awk 'BEGIN { lno = 0 } { val=$0; if (lno % 2 == 1) {printf "%s\n", $0} else {printf "%s", $0}; ++lno}'

Resources