How to compare totals in Unix - bash

I have a file simple.txt with the contents below:
a b
c d
c d
I want to check which pair, 'a b' or 'c d', has the maximum occurrence. I have written this code, which gives me the individual occurrence count of each word:
cat simple.txt | tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c |
grep -E -i "\<a\>|\<b\>|\<c\>|\<d\>"
1 a
1 b
2 c
2 d
How can I total the results of this output? Or can I write different code?

If we can assume that each pair of letters is a complete line, one way to handle this would be to sort the lines, use the uniq utility to get a count of each unique line, and then reverse-sort numerically so the highest count comes first:
sort simple.txt | uniq -c | sort -rn
You may want to get rid of the empty lines using egrep:
egrep '\w' simple.txt | sort | uniq -c | sort -rn
Which should give you:
2 c d
1 a b
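If you only want the winning pair itself, a hedged follow-up is to keep just the first line of that output:
egrep '\w' simple.txt | sort | uniq -c | sort -rn | head -n 1
which here prints 2 c d.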

$ sort file |
uniq -c |
sort -nr > >(read -r count pair; echo "max count $count is for pair $pair")
Sort the lines, count them with uniq -c, sort numerically in descending order, read the first line, and print the result.
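The same idea can be written without the output-side process substitution, reading the top line directly (a sketch, bash assumed):
read -r count pair < <(sort file | uniq -c | sort -nr)
echo "max count $count is for pair $pair"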
Or all of the above in one awk script...
$ awk '{c[$0]++}
END{for (k in c) if (c[k] > max+0) {max = c[k]; pair = k}
print "max count is " max " for pair " pair}' file

With single GNU awk command:
awk 'BEGIN{ PROCINFO["sorted_in"] = "@val_num_desc" }
NF{ a[$0]++ }
END{ for (i in a) { print "The pair with max occurrence is:", i; break } }' file
The output:
The pair with max occurrence is: c d
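If the count itself is also wanted, a small hedged variation of the same GNU awk idea prints the stored count alongside the pair:
awk 'BEGIN{ PROCINFO["sorted_in"] = "@val_num_desc" }
NF{ a[$0]++ }
END{ for (i in a) { print "The pair with max occurrence is:", i, "(" a[i] " times)"; break } }' file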

To get the pair that occurs most frequently:
$ sort <simple.txt | uniq -c | sort -nr | awk '{print "The pair with max occurrence is",$2,$3; exit}'
The pair with max occurrence is c d
This can also be done entirely in awk, without any pipelines:
$ awk '{a[$0]++} END{for (x in a) if (a[x]>(max+0)) {max=a[x]; line=x}; print "The pair with max occurrence is",line}' simple.txt
The pair with max occurrence is c d
(The max+0 forces the otherwise uninitialized max to be treated as the number 0 on the first comparison.)

Related

BASH: choosing and counting distinct values based on two columns

Hey guys, so I have this dummy data:
115,IROM,1
125,FOLCOM,1
135,SE,1
111,ATLUZ,1
121,ATLUZ,2
121,ATLUZ,2
142,ATLUZ,2
142,ATLUZ,2
144,BLIZZARC,1
166,STEAD,3
166,STEAD,3
166,STEAD,3
168,BANDOI,1
179,FOX,1
199,C4,2
199,C4,2
Desired output:
IROM,1
FOLCOM,1
SE,1
ATLUZ,3
BLIZZARC,1
STEAD,1
BANDOI,1
FOX,1
C4,1
which comes from counting the distinct game IDs (the 115, 125, etc.). So, for example, the lines
111,ATLUZ,1
121,ATLUZ,2
121,ATLUZ,2
142,ATLUZ,2
142,ATLUZ,2
Will be
ATLUZ,3
Since it has 3 distinct game IDs.
I tried using
cut -d',' -f 2 game.csv|uniq -c
where I got the following output:
1 IROM
1 FOLCOM
1 SE
5 ATLUZ
1 BLIZZARC COMP
3 STEAD
1 BANDOI
1 FOX
2 C4
How do I fix this, using bash?
Before executing the cut command, do a uniq. This removes the redundant (adjacent duplicate) lines, and then you follow your command, i.e. apply cut to extract the 2nd field and uniq -c to count the occurrences:
uniq game.csv | cut -d',' -f 2 | uniq -c
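Note that plain uniq only removes adjacent duplicates; if identical rows might not sit next to each other, a hedged variant of the same idea is to deduplicate with sort -u first:
sort -u game.csv | cut -d',' -f2 | sort | uniq -c
For the sample data both forms agree, since the duplicate rows there happen to be adjacent.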
Could you please try the following too, in a single awk:
awk -F, '
!a[$1,$2,$3]++{        # first time this exact row (all three fields) is seen
b[$1,$2,$3]++          # remember it as a distinct row
}
!f[$2]++{              # first time this game name is seen
g[++count]=$2          # record the name in order of first appearance
}
END{
for(i in b){
split(i,array,",")     # SUBSEP is set to "," so the key splits cleanly
c[array[2]]++          # count distinct rows per game name
}
for(q=1;q<=count;q++){
print c[g[q]],g[q]     # print counts in the order the names first appeared
}
}' SUBSEP="," Input_file
It will give the output in the same order in which the 2nd field first occurs in Input_file, as follows:
1 IROM
1 FOLCOM
1 SE
3 ATLUZ
1 BLIZZARC
1 STEAD
1 BANDOI
1 FOX
1 C4
Using GNU datamash:
datamash -t, --sort --group 2 countunique 1 < input
Using awk:
awk -F, '!a[$1,$2]++{b[$2]++}END{for(i in b)print i FS b[i]}' input
Using sort, cut, uniq:
sort -u -t, -k2,2 -k1,1 input | cut -d, -f2 | uniq -c
Test run:
$ cat input
111,ATLUZ,1
121,ATLUZ,1
121,ATLUZ,2
142,ATLUZ,2
115,IROM,1
142,ATLUZ,2
$ datamash -t, --sort --group 2 countunique 1 < input
ATLUZ,3
IROM,1
As you can see, 121,ATLUZ,1 and 121,ATLUZ,2 are correctly considered to be just one game ID.
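The awk variant should agree on the same test input (a sketch of the expected run; awk's for-in traversal is unordered, so the two lines may come out in either order):
$ awk -F, '!a[$1,$2]++{b[$2]++}END{for(i in b)print i FS b[i]}' input
ATLUZ,3
IROM,1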
Less elegant, but you may use awk as well. If it is not guaranteed that the same ID+NAME combos always come consecutively, you have to count them all by reading the whole file before producing output:
awk -F, '{c[$1,$2]+=1}END{for (ck in c){split(ck,ca,SUBSEP); g[ca[2]]+=1}for(gk in g){print gk,g[gk]}}' game.csv
This first counts every [COL1,COL2] pair, then for each COL2 it counts how many distinct [COL1,COL2] pairs were seen.
This also does the trick. The only thing is that your output is not sorted.
awk 'BEGIN{ FS = OFS = "," }{ a[$2 FS $1] }END{ for ( i in a ){ split(i, b, "," ); c[b[1]]++ } for ( i in c ) print i, c[i] }' yourfile
Output:
BANDOI,1
C4,1
STEAD,1
BLIZZARC,1
FOLCOM,1
ATLUZ,3
SE,1
IROM,1
FOX,1

How can I keep only the non-repeated lines in a file?

What I want to do is simply keep the lines which are not repeated in a huge file like this:
..
a
b
b
c
d
d
..
The desired output is then:
..
a
c
..
Many thanks in advance.
uniq has the -u option:
-u, --unique only print unique lines
Example:
$ printf 'a\nb\nb\nc\nd\nd\n' | uniq -u
a
c
If your data is not sorted, sort it first:
$ printf 'd\na\nb\nb\nc\nd\n' | sort | uniq -u
Preserve the order:
$ cat foo
d
c
b
b
a
d
$ grep -f <(sort foo | uniq -u) foo
c
a
This greps the file for the patterns obtained by the aforementioned uniq. I can imagine, though, that if your file is really huge it will take a long time.
The same without the somewhat ugly process substitution:
$ sort foo | uniq -u | grep -f- foo
c
a
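If the lines may contain regex metacharacters, or partial matches are a concern, a hedged refinement is to treat the uniq output as fixed strings matched against whole lines:
$ sort foo | uniq -u | grep -x -F -f - foo
c
a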
This awk should work to list only the lines that are not repeated in the file:
awk 'seen[$0]++{dup[$0]} END {for (i in seen) if (!(i in dup)) print i}' file
a
c
Just remember that the original order of lines may change due to the hashing of arrays in awk.
EDIT: To preserve the original order:
awk '$0 in seen{dup[$0]; next}
{seen[$0]++; a[++n]=$0}
END {for (i=1; i<=n; i++) if (!(a[i] in dup)) print a[i]}' file
a
c
This is a job that is tailor-made for awk: it doesn't require multiple processes, pipes, or process substitution, and it will be more efficient for bigger files.
When your file is sorted, it's simple:
cat file.txt | uniq -u > file2.txt
mv file2.txt file.txt

Counting first letters that occur in a line and showing the summary in shell/Linux using grep

I have a log that looks like this:
I:5000:GAME
I:5000:GAME
I-:5000:GAME
I-:5000:GAME
E:5000:GAME
E:5000:GAME
E:5000:GAME
E:5000:GAME
E:5000:GAME
J:5000:GAME
J:5000:GAME
J:5000:GAME
L:5000:GAME
M:5000:GAME
K:5000:GAME
What I want to do is count the lines that start with E, I-, or J and sort the result in descending order.
SAMPLE OUTPUT
5 E
3 J
2 I-
This is what I am trying:
sort /home/prod-dev/progex_logs.txt | egrep '^E|^I|^J' | cut -f1 -d: | uniq -c
My file is progex_logs.txt, but it is not showing the answer that I want.
Try this:
sort data | cut -f1 -d: | uniq -c
This sorts the input data lexically, extracts just the first column, and then pipes the result to uniq -c, which collapses duplicate lines and calculates a count of how many lines were collapsed. Given your sample input, this generates:
5 E
2 I-
2 I
3 J
1 K
1 L
1 M
If you just want E, I-, and J, you can filter for just those using the egrep command that user2254435 posted, like this:
sort data | egrep '^I-|^E|^J' | cut -f1 -d: | uniq -c
Which would get you:
5 E
2 I-
3 J
So what does this do?
The first command:
sort data
Generates a lexically sorted version of the data. Given your sample
input, we get:
E:5000:GAME
E:5000:GAME
E:5000:GAME
E:5000:GAME
E:5000:GAME
I-:5000:GAME
I-:5000:GAME
I:5000:GAME
I:5000:GAME
J:5000:GAME
J:5000:GAME
J:5000:GAME
K:5000:GAME
L:5000:GAME
M:5000:GAME
We then pipe the output to the next command, cut -f1 -d:, using
the | operator, which lets us send stdout from one command to
stdin of another command. This command reads as input the output
from the sort command, and then extracts the first (-f1)
colon-delimited field (-d:). This gives us:
E
E
E
E
E
I-
I-
I
I
J
J
J
K
L
M
We then pipe the output to uniq -c, which collapses duplicate lines
and generates a count of how many lines were collapsed. So, given
input like:
E
E
E
E
E
Running uniq -c gives us:
5 E
For more information about all this, see the man pages for sort,
cut, and uniq.
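Since the question also asks for the output in descending order by count, a hedged addition is a final numeric reverse sort on the filtered pipeline:
sort data | egrep '^I-|^E|^J' | cut -f1 -d: | uniq -c | sort -rn
giving:
5 E
3 J
2 I-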
You can do this using awk
cat ip.txt | awk 'BEGIN{IC=0;JC=0;EC=0}{if(index($0,"I-")>0)IC++;else if(index($0,"E:")>0)EC++;else if(index($0,"J:")>0)JC++;}END{printf("I- %d\n",IC);printf("E: %d\n",EC);printf("J: %d\n",JC);}'
egrep '^I-|^E|^J' test.dat|wc -l
This does an extended grep using the given pattern. The result is passed on to the wc command to count the lines that egrep matched. Note that this gives a single total across all three prefixes, not a per-prefix count.
This is a common way to do it with awk
awk -F: '/^(E|I-|J)/ {a[$1]++} END {for (i in a) print i,a[i]}' file
I- 2
E 5
J 3
To get it sorted, you can do:
awk -F: '/^(E|I-|J)/ {a[$1]++} END {for (i in a) print i,a[i]}' file | sort -rk2
E 5
J 3
I- 2
How it works
awk -F: ' # setting field separator to :
/^(E|I-|J)/ { # run this section only if starting with E, I- or J
a[$1]++} # count occurrences of field #1 in array a
END { # end section
for (i in a) # looping through all element in array
print i,a[i]} # print each key and its count
' file | sort -rk2 # sort output by column #2 in reverse order
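One caveat: sort -rk2 compares the count as text, so once counts reach double digits the ordering can go wrong; a hedged tweak is to use a numeric sort key instead:
awk -F: '/^(E|I-|J)/ {a[$1]++} END {for (i in a) print i,a[i]}' file | sort -k2,2nr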

Cut | Sort | Uniq -d -c | but?

The given file is in the below format.
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
I need to take out the duplicates and count them (duplicates are categorized by fields 1, 2, 5, and 14). Then I will insert into a database the entire record of the first duplicate occurrence, tagging the duplicate count in another column. For this I need to cut the 4 mentioned fields, sort, find the dups using uniq -d, and get the counts with -c. Coming back after sorting out the dups and their counts, I need the output to be in the form below.
3,GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
Here three is the number of duplicates for f1,2,5,14, and the rest of the fields can come from any of the duplicate rows.
This way the dups should be removed from the original file and shown in the above format.
The remaining lines in the original file are unique ones, and they go through as they are...
What I have done is..
awk '{printf("%5d,%s\n", NR,$0)}' renewstatus_2012-04-19.txt > n_renewstatus_2012-04-19.txt
cut -d',' -f2,3,6,15 n_renewstatus_2012-04-19.txt |sort | uniq -d -c
but this needs to point back to the original file again to get the lines for the duplicate occurrences...
Let me not confuse things: this needs a different point of view, and my brain is clinging to my approach. I need a cigar...
Any thoughts?
sort has an option -k
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
uniq has an option -f
-f, --skip-fields=N
avoid comparing the first N fields
so combine sort and uniq with field numbers (work out NUM, NUM2 and NUM3 yourself and please test this command):
awk -F"," '{print $0,$1,$2,...}' file.txt | sort -k NUM,NUM2 | uniq -f NUM3 -c
Using awk's associative arrays is a handy way to find unique/duplicate rows:
awk '
BEGIN {FS = OFS = ","}
{
    key = $1 FS $2 FS $5 FS $14   # build the key from fields 1, 2, 5 and 14
    if (key in count)
        count[key]++              # seen before: just bump the count
    else {
        count[key] = 1            # first occurrence: remember the whole line
        line[key] = $0
    }
}
END {for (key in count) print count[key], line[key]}
' filename
SYNTAX :
awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14]=$0}{count[$1,$2,$5,$14]++}END{for(i in count){if(count[i] > 1)file="dupes";else file="uniq";print uniq[i],","count[i] > file}}' renewstatus_2012-04-19.txt
Calculation:
sym#localhost:~$ cut -f16 -d',' uniq | sort | uniq -d -c
124275 1 -----> SUM OF UNIQ ( 1 )ENTRIES
sym#localhost:~$ cut -f16 -d',' dupes | sort | uniq -d -c
3860 2
850 3
71 4
7 5
3 6
sym#localhost:~$ cut -f16 -d',' dupes | sort | uniq -u -c
1 7
10614 ------> SUM OF DUPLICATE ENTRIES MULTIPLIED BY THEIR COUNTS
sym#localhost:~$ wc -l renewstatus_2012-04-19.txt
134889 renewstatus_2012-04-19.txt ---> TOTAL LINE COUNT OF THE ORIGINAL FILE, WHICH MATCHES (124275+10614) = 134889 EXACTLY

Counting unique strings where there's a single string per line in bash

Given input file
z
b
a
f
g
a
b
...
I want to output the number of occurrences of each string, for example:
z 1
b 2
a 2
f 1
g 1
How can this be done in a bash script?
You can sort the input and pass to uniq -c:
$ sort input_file | uniq -c
2 a
2 b
1 f
1 g
1 z
If you want the numbers on the right, use awk to switch them:
$ sort input_file | uniq -c | awk '{print $2, $1}'
a 2
b 2
f 1
g 1
z 1
Alternatively, do the whole thing in awk:
$ awk '
{
++count[$1]
}
END {
for (word in count) {
print word, count[word]
}
}
' input_file
f 1
g 1
z 1
a 2
b 2
cat text | sort | uniq -c
should do the job
Try:
awk '{ freq[$1]++; } END{ for( c in freq ) { print c, freq[c] } }' test.txt
Where test.txt would be your input file.
Here's a bash-only version (requires bash version 4), using an associative array.
#! /bin/bash
declare -A count
while read -r val ; do
count[$val]=$(( ${count[$val]} + 1 ))
done < your_input_file # change this as needed
for key in "${!count[@]}" ; do
echo "$key ${count[$key]}"
done
This might work for you:
cat -n file |
sort -k2,2 |
uniq -cf1 |
sort -k2,2n |
sed 's/^ *\([^ ]*\).*\t\(.*\)/\2 \1/'
This outputs the number of occurrences of each string in the order in which they first appear.
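An awk alternative that also preserves the order of first appearance (a sketch, one string per line assumed, file name as above):
awk '!seen[$1]++ { order[++n] = $1 } { count[$1]++ }
END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }' file
For the sample input this prints z 1, b 2, a 2, f 1, g 1, matching the order shown in the question.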
You can use sort filename | uniq -c.
Have a look at the Wikipedia page on uniq.
