How to grep number of unique occurrences - bash

I understand that grep -c string can be used to count the occurrences of a given string. What I would like to do is count the number of unique occurrences when only part of the string is known or remains constant.
For example, suppose I had a file (in this case a log) with several lines containing a constant string and a repeating variable, like so:
string=value1
string=value1
string=value1
string=value2
string=value3
string=value2
Then I would like to be able to count each unique value, with output similar to the following (ideally with a single grep/awk command):
value1 = 3 occurrences
value2 = 2 occurrences
value3 = 1 occurrences
Does anyone have a solution using grep or awk that might work? Thanks in advance!

This worked perfectly... Thanks to everyone for your comments!
grep -oP "wwn=[^,]*" path/to/file | sort | uniq -c
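The wwn= pattern above comes from the asker's real log. As a rough sketch of the same pipeline applied to the sample lines from the question (assuming each value ends at a comma or at the end of the line), plain -o is enough and -P is not needed:

```shell
# Sample log lines from the question; real input would normally be a file.
log='string=value1
string=value1
string=value1
string=value2
string=value3
string=value2'

# -o prints only the matched text, sort groups equal matches together,
# and uniq -c prefixes each distinct match with its count.
counts=$(printf '%s\n' "$log" | grep -o 'string=[^,]*' | sort | uniq -c)
printf '%s\n' "$counts"
```

The count is printed before the match, and the amount of leading whitespace depends on the uniq implementation.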

In general, if you want to grep and also keep track of results, it is best to use awk since it performs such things in a clear manner with a very simple syntax.
So for your given file I would use:
$ awk -F= '/string=/ {count[$2]++} END {for (i in count) print i, count[i]}' file
value1 3
value2 2
value3 1
What is this doing?
-F=
sets the field separator to =, so that $1 is the part to the left of it and $2 is the part to the right.
/string=/ {count[$2]++}
when a line matches the pattern "string=", run the action. It uses an array count[] to keep track of how many times each value of the second field has appeared so far.
END {for (i in count) print i, count[i]}
at the end, loop through the results and print them.
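To get the exact layout requested in the question, one possible variation is to format the lines with printf. The order of a for (i in count) loop is unspecified, so this sketch pipes the result through sort; the sample data is taken from the question.

```shell
log='string=value1
string=value1
string=value1
string=value2
string=value3
string=value2'

report=$(printf '%s\n' "$log" |
  awk -F= '/string=/ { count[$2]++ }
           END { for (v in count) printf "%s = %d occurrences\n", v, count[v] }' |
  sort)
printf '%s\n' "$report"
# value1 = 3 occurrences
# value2 = 2 occurrences
# value3 = 1 occurrences
```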

Here's an awk script:
#!/usr/bin/awk -f
BEGIN {
file = ARGV[1]
while ((getline line < file) > 0) {
for (i = 2; i < ARGC; ++i) {
p = ARGV[i]
if (line ~ p) {
a[p] += !a[p, line]++
}
}
}
for (i = 2; i < ARGC; ++i) {
p = ARGV[i]
printf("%s = %d occurrences\n", p, a[p])
}
exit
}
Example:
awk -f script.awk somefile ab sh
Output:
ab = 7 occurrences
sh = 2 occurrences

Related

Computing the size of array in text file in bash

I have a text file that sometimes-not always- will have an array with a unique name like this
unique_array=(1,2,3,4,5,6)
I would like to find the size of the array (6 in the above example) when it exists, and skip it or return -1 if it doesn't exist.
grepping the file will tell me if the array exists but not how to find its size.
The array can fill multiple lines like
unique_array=(1,2,3,
4,5,6,
7,8,9,10)
Some of the elements in the array can be negative as in
unique_array=(1,2,-3,
4,5,6,
7,8,-9,10)
awk -v RS=\) -F, '/unique_array=\(/ {print /[0-9]/?NF:0}' file.txt
-v RS=\) - delimit records by ) instead of newlines
-F, - delimit fields by , instead of whitespace
/unique_array=\(/ - look for a record containing the unique identifier
/[0-9]/?NF:0 - if the record contains a digit, print the number of fields (i.e. commas+1), otherwise 0
There is a bad bug in the code above: commas preceding the array may be erroneously counted. A fix is to truncate the prefix:
awk -v RS=\) -F, 'sub(/.*unique_array=\(/,"") {print /[0-9]/?NF:0}' file.txt
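The question also asks for -1 when the array is missing. A minimal sketch of how the corrected command might be extended for that, assuming the array appears at most once per file (array_size and the found flag are illustrative names, not part of the original answer):

```shell
array_size() {  # hypothetical helper: array_size FILE
  awk -v RS=')' -F, '
    sub(/.*unique_array=\(/, "") {   # drop everything up to and including "("
      print (/[0-9]/ ? NF : 0)       # fields = commas + 1; 0 for an empty array
      found = 1
      exit                           # stop at the first array; END still runs
    }
    END { if (!found) print -1 }     # no record contained the identifier
  ' "$1"
}
```

Because sub() modifies $0, awk re-splits the record, so NF counts only the fields that follow the opening parenthesis.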
Your specifications are woefully incomplete, but guessing a bit as to what you are actually looking for, try this at least as a starting point.
awk '/^unique_array=\(/ { in_array = 1; n = 0; sub(/^unique_array=\(/, "") }
in_array && /\)/ { sub(/\).*/, ""); quit = 1 }
in_array { sub(/,[ \t\r]*$/, ""); n += split($0, arr, ",");
if (quit) { print n; in_array = quit = n = 0 } }' file
We keep a state variable in_array which tells us whether we are currently in a region which contains the array. This gets set to 1 when we see the beginning of the array, and back to 0 when we see the closing parenthesis. At this point, we remove the closing parenthesis and everything after it, and set a second variable quit to trigger the finishing logic in the next condition. The last condition performs two tasks; it adds the items from this line to the count in n, and then checks if quit is true; if it is, we are at the end of the array, and print the number of elements.
This will simply print nothing if the array was not found. You could embellish the script to set a different exit code or print -1 if you like, but these details seem like unnecessary complications for a simple script.
I think what you probably want is this, using GNU awk for multi-char RS and RT and word boundaries:
$ awk -v RS='\\<unique_array=[(][^)]*)' 'RT{exit} END{print (RT ? gsub(/,/,"",RT)+1 : -1)}' file
With your shown samples please try following awk.
awk -v RS= '
{
while(match($0,/\<unique_array=[(][^)]*\)/)){
line=substr($0,RSTART,RLENGTH)
gsub(/[[:space:]]*\n[[:space:]]*|(^|\n)unique_array=\(|(\)$|\)\n)/,"",line)
print gsub(/,/,"&",line)+1
$0=substr($0,RSTART+RLENGTH)
}
}
' Input_file
Using sed and declare -a. The test file is like this:
$ cat f
saa
dfsaf
sdgdsag unique_array=(1,2,3,
4,5,6,
7,8,9,10) sdfgadfg
sdgs
sdgs
sfsaf(sdg)
Testing:
$ declare -a "$(sed -n '/unique_array=(/,/)/s/,/ /gp' f | \
sed 's/.*\(unique_array\)/\1/;s/).*/)/;
s/`.*`//g')"
$ echo ${unique_array[@]}
1 2 3 4 5 6 7 8 9 10
And then you can do whatever you want with ${unique_array[@]}
With GNU grep or similar that support -z and -o options:
grep -zo 'unique_array=([^)]*)' file.txt | tr -dc =, | wc -c
-z - (effectively) treat file as a single line
-o - only output the match
tr -dc =, - strip everything except = and ,
wc -c - count the result
Note: both one- and zero-element arrays will be treated as being size 1. Will return 0 rather than -1 if not found.
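The two caveats in the note could be handled by a small wrapper. This is only a sketch, assuming GNU grep (for -z and -o) and an array that appears at most once; grep_array_size is a hypothetical name:

```shell
grep_array_size() {  # hypothetical helper: grep_array_size FILE
  # tr -d '\000' strips the NUL terminator that grep -z appends to each match.
  m=$(grep -zo 'unique_array=([^)]*)' "$1" | tr -d '\000')
  if [ -z "$m" ]; then
    echo -1                        # array not present
  elif [ "$m" = 'unique_array=()' ]; then
    echo 0                         # empty array
  else
    commas=$(printf '%s' "$m" | tr -dc ',' | wc -c)
    echo $((commas + 1))           # elements = commas + 1
  fi
}
```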
Here's an awk solution that works with gawk, mawk 1/2, and nawk:
TEST INPUT
saa
dfsaf
sdgdsag unique_array=(1,2,3,
4,5,6,
7,8,9,10) sdfgadfg
sdgs
sdgs
sfsaf(sdg)
CODE
{m,n,g}awk '
BEGIN { __ = "-1:_ERR_NOT_FOUND_"
RS = "^$" (_ = OFS = "")
FS = "(^|[ \t-\r]?)unique[_]array[=][(]"
___ = "[)].*$|[^0-9,.+-]"
} $!NF = NR < NF ? $(gsub(___,_)*_) : __'
OUTPUT
1,2,3,4,5,6,7,8,9,10

Replace values in text file for one batch with AWK and increment subsequent value from last one

I have the following in a text file called data.txt
&st=1000&type=rec&uniId=5800000000&acceptCode=1000&drainNel=supp&
&st=1100&type=rec&uniId=5800000000&acceptCode=1000&drainNel=supp&
&st=4100&type=rec&uniId=6500000000&acceptCode=4100&drainNel=ured&
&st=4200&type=rec&uniId=6500000000&acceptCode=4100&drainNel=iris&
&st=4300&type=rec&uniId=6500000000&acceptCode=4100&drainNel=iris&
&st=8300&type=rec&uniId=7700000000&acceptCode=8300&drainNel=teef&
1) Script will take an input argument in the form of a number, e.g: 979035210000000098
2) I want to replace every uniId=xxxxxxxxxx value with the long number passed as the argument to the script. IMPORTANT: lines that share the same uniId get the same replacement value. (In this case, the first two lines form one batch, the next three form another, and the last line is a batch of its own.) Each subsequent batch gets the previous batch's value incremented by 5,000,000,000.
Ignore all other fields and they should not be modified.
So essentially doing this:
./script.sh 979035210000000098
.. still confused? Well, the final result could be this:
&st=1000&type=rec&uniId=979035210000000098&acceptCode=1000&drainNel=supp&
&st=1100&type=rec&uniId=979035210000000098&acceptCode=1000&drainNel=supp&
&st=4100&type=rec&uniId=979035215000000098&acceptCode=4100&drainNel=ured&
&st=4200&type=rec&uniId=979035215000000098&acceptCode=4100&drainNel=iris&
&st=4300&type=rec&uniId=979035215000000098&acceptCode=4100&drainNel=iris&
&st=8300&type=rec&uniId=979035220000000098&acceptCode=8300&drainNel=teef&
This ^ should be REPLACED and applied to tempfile datanew.txt - not just print on screen.
An AWK script already exists that does the replacement for &st=xxx and &acceptCode=xxx. Perhaps it can be reused, but I have not been able to get it working as I expect:
# $./script.sh [STARTCOUNT] < data.txt > datanew.txt
# $ mv -f datanew.txt data.txt
awk -F '&' -v "cnt=${1:-10000}" -v 'OFS=&' \
'NR == 1 { ac = cnt; uni = $4; }
NR > 1 && $4 == uni { cnt += 100 }
$4 != uni { cnt += 5000000000; ac = cnt; uni = $4 }
{ $2 = "st=" cnt; $5 = "acceptCode=" ac; print }'
Using gnu awk you may use this:
awk -M -i inplace -v num=979035210000000098 'BEGIN{FS=OFS="&"}
!seen[$4]++{p = (NR>1 ? p+5000000000 : num)} {$4="uniId=" p} 1' file
&st=1000&type=rec&uniId=979035210000000098&acceptCode=1000&drainNel=supp&
&st=1100&type=rec&uniId=979035210000000098&acceptCode=1000&drainNel=supp&
&st=4100&type=rec&uniId=979035215000000098&acceptCode=4100&drainNel=ured&
&st=4200&type=rec&uniId=979035215000000098&acceptCode=4100&drainNel=iris&
&st=4300&type=rec&uniId=979035215000000098&acceptCode=4100&drainNel=iris&
&st=8300&type=rec&uniId=979035220000000098&acceptCode=8300&drainNel=teef&
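The -M (or --bignum) option forces arbitrary-precision arithmetic on numbers in GNU awk. It is needed here because the values are too large to be represented exactly as double-precision floats.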
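Where gawk with -M is not available, shell integer arithmetic may be an alternative. The following is a sketch only, assuming the shell uses 64-bit integers (the sample values fit), that every line contains uniId=, and that equal uniId values appear on consecutive lines; replace_uniid is a hypothetical name.

```shell
replace_uniid() {  # hypothetical helper: replace_uniid START < data.txt > datanew.txt
  next=$1
  prev=
  while IFS= read -r line; do
    before=${line%%uniId=*}        # text before "uniId="
    rest=${line#*uniId=}           # text after "uniId="
    id=${rest%%&*}                 # the old uniId value
    after=${rest#"$id"}            # everything following the old value
    if [ "$id" != "$prev" ]; then
      # A new batch starts; every batch after the first adds 5,000,000,000.
      [ -n "$prev" ] && next=$((next + 5000000000))
      prev=$id
    fi
    printf '%suniId=%s%s\n' "$before" "$next" "$after"
  done
}
```

Unlike the seen[] array in the gawk command, this compares each line only with the previous one, so a uniId that reappears later would start a new batch.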

end result of bash command with a dot (.)

I have a bash script that greps and sorts information from /etc/passwd here
export FT_LINE1=13
export FT_LINE2=23
cat /etc/passwd | grep -v "#" | awk 'NR%2==1' | cut -f1 -d":" | rev | sort -r | awk -v l1="$FT_LINE1" -v l2="$FT_LINE2" 'NR>=l1 && NR<=l2' | tr '\n' ',' | sed 's/, */, /g'
The result is this list
sstq_, sorebrek_brk_, soibten_, sirtsa_, sergtsop_, sec_, scodved_, rlaxcm_, rgmecived_, revreswodniw_, revressta_,
How can I replace the last comma with a dot (.)? I want it to look like this:
sstq_, sorebrek_brk_, soibten_, sirtsa_, sergtsop_, sec_, scodved_, rlaxcm_, rgmecived_, revreswodniw_, revressta_.
You can add:
| sed 's/,$/./'
(where $ means "end of line").
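One caveat: the preceding sed 's/, */, /g' in the question's pipeline may leave a space after the final comma, in which case ,$ would not match. A hedged variant that tolerates trailing spaces, shown on three of the names from the question:

```shell
joined=$(printf '%s\n' sstq_ sorebrek_brk_ soibten_ |
  tr '\n' ',' |
  sed 's/, */, /g; s/, *$/./')   # second command: final comma (plus any spaces) becomes a dot
printf '%s\n' "$joined"
# sstq_, sorebrek_brk_, soibten_.
```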
There are way too many pipes in your command; some of them can be removed.
As explained in the comment, cat <FILE> | grep is a bad habit. In general, cat <FILE> | cmd should be replaced by cmd <FILE> or cmd < FILE, depending on what type of arguments your command accepts.
On a file a few GB in size, you will already feel the difference.
That said, you can do the whole processing without a single pipe by using awk, for example (asort() is a GNU awk extension, so this needs gawk):
awk -v l1="$FT_LINE1" -v l2="$FT_LINE2" 'function reverse(s){p=""; for(i=length(s); i>0; i--){p=p substr(s,i,1);}return p;}BEGIN{cmp=0; FS=":"; ORS=","}!/#/{cmp++;if(cmp%2==1) a[cmp]=reverse($1);}END{asort(a);for(i=length(a);i>0;i--){if((length(a)-i+1)>=l1 && (length(a)-i)<=l2){if(i==1){ORS=".";}print a[i];}}}' /etc/passwd
Explanations:
# BEGIN rule(s)
BEGIN {
cmp = 0 #used to count the kept lines, since NR cannot be used directly
FS = ":" #field separator :
ORS = "," #output record separator ,
}
# Rule(s)
! /#/ { #for lines that do not contain this character
cmp++
if (cmp % 2 == 1) {
a[cmp] = reverse($1) #add to an array the reverse of the first field
}
}
# END rule(s)
END {
asort(a) #sort the array and process it in reverse order
for (i = length(a); i > 0; i--) {
# apply your range conditions
if (length(a) - i + 1 >= l1 && length(a) - i <= l2) {
if (i == 1) { #when we reach the last element to print, use a dot instead of the comma
ORS = "."
}
print a[i] #print the array element
}
}
}
# Functions, listed alphabetically
#if the reverse operation is necessary then you can use the following function that will reverse your strings.
function reverse(s)
{
p = ""
for (i = length(s); i > 0; i--) {
p = p substr(s, i, 1)
}
return p
}
If you don't need the reverse part, you can just remove it from the awk script.
In the end, not a single pipe is used.
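For awk implementations without asort(), a middle ground is to leave the sorting to sort and keep the rest in awk. The sketch below uses hypothetical names (list_logins, sep) and does the final dot in the last awk stage instead of a separate sed:

```shell
list_logins() {  # hypothetical helper: list_logins FILE FIRST LAST
  awk -F: '
    function reverse(s,   i, out) {
      for (i = length(s); i > 0; i--) out = out substr(s, i, 1)
      return out
    }
    # Skip lines containing "#", then keep every other remaining line.
    !/#/ && ++n % 2 == 1 { print reverse($1) }
  ' "$1" | sort -r |
  awk -v l1="$2" -v l2="$3" '
    NR >= l1 && NR <= l2 { printf "%s%s", sep, $0; sep = ", " }
    END { if (sep != "") print "." }   # close the list with a dot
  '
}
```

With the variables from the question it would be called as list_logins /etc/passwd "$FT_LINE1" "$FT_LINE2".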

Remove the last-occurring lines of patterns

I want to exclude/delete the last line matching the pattern {n}{n}{n}.log for each possible 3-digit number. Each line ends with a pattern such as "123.log".
Sample input file:
aaaa116.log
a112.log
aaa112.log
a113.log
aaaaa112.log
aaa113.log
aa112.log
aaa116.log
a113.log
aaaaa116.log
aaa113.log
aa114.log
Output file:
aaaa116.log
a112.log
aaa112.log
a113.log
aaaaa112.log
aaa113.log
aaa116.log
a113.log
How could this be performed by bash scripting?
It is fairly simple to remove the last matching line in awk without retaining order.
awk -F'[^0-9]+' '/[0-9]+\.log$/ {
t = $(NF - 1);
if (t in a)
print a[t];
a[t] = $0;
}'
To keep the output ordered is more complicated, and requires more memory.
awk -F'[^0-9]+' '/[0-9]+\.log$/ {
t = $(NF - 1);
a[++i] = $0;
b[$0] = t;
c[t] = i;
}
END {
for (n = 1; n <= i; n++)
if (n != c[b[a[n]]])
print a[n];
}'
To pass through non-matching lines in the first example a next statement can be added to the action, and a pattern of 1 can be appended. For the second example assignment into array a can be moved to its own action.
Probably awk would be the easiest tool for this. For example, this one-liner (the three-argument match() is a GNU awk extension)
tac file | awk 'match($0, /[0-9]{3}\.log$/,a) && a[0] in b; {b[a[0]]}' | tac
produces the requested output for the sample input. This does not require the entire file to be stored in memory.
Change the regular expression to suit your specific needs.
$ awk '{k=substr($0,length()-6)} NR==FNR{n[k]=NR;next} FNR!=n[k]' file file
aaaa116.log
a112.log
aaa112.log
a113.log
aaaaa112.log
aaa113.log
aaa116.log
a113.log
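The two-pass idea (read the file once to record where each key last occurs, then read it again and skip those lines) can also be sketched with the key extracted by match(), so it does not depend on a fixed suffix length. The names drop_last_per_key, key and last are illustrative, and passing non-matching lines through is an assumption about the desired behaviour.

```shell
drop_last_per_key() {  # hypothetical helper: drop_last_per_key FILE
  awk '
    # Key = trailing "NNN.log"; empty for lines that do not match.
    { key = (match($0, /[0-9][0-9][0-9]\.log$/) ? substr($0, RSTART) : "") }
    # Pass 1: remember the last line number seen for each key.
    NR == FNR { if (key != "") last[key] = FNR; next }
    # Pass 2: print everything except the remembered line of each key.
    key == "" || FNR != last[key]
  ' "$1" "$1"
}
```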

Storing Multidimensional array using awk

My input file is
a|b|c|d
w|r|g|h
I want to store the values in an array like
a[1,1] = a
a[1,2] = b
a[2,1] = w
Please suggest a way to achieve this in awk or bash.
I have two input files and need to do field-level validation.
Like this
awk -F'|' '{for(i=1;i<=NF;i++)a[NR,i]=$i}
END {print a[1,1],a[2,2]}' file
Output
a r
This parses the file into an awk array:
awk -F \| '{ for(i = 1; i <= NF; ++i) a[NR,i] = $i }' filename
You'll have to add code that uses the array for this to be of any use, of course. Since you didn't say what you wanted to do with the array once it is complete (after the pass over the file), this is all the answer I can give you.
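The question mentions two input files and field-level validation without giving details. One possible reading, offered purely as an assumption, is a cell-by-cell comparison: load the first file into a[row, field], then check the second file against it. The helper name compare_fields and the message format are made up for illustration.

```shell
compare_fields() {  # hypothetical helper: compare_fields FILE1 FILE2
  awk -F'|' '
    # First file: store every field under its (row, column) key.
    NR == FNR { for (i = 1; i <= NF; i++) a[FNR, i] = $i; next }
    # Second file: report each field that differs from the stored one.
    {
      for (i = 1; i <= NF; i++)
        if (a[FNR, i] != $i)
          printf "row %d field %d: %s != %s\n", FNR, i, a[FNR, i], $i
    }
  ' "$1" "$2"
}
```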
You're REALLY going to want to get/use gawk 4.* if you're using multi-dimensional arrays as that's the only awk that supports them. When you write:
a[1,2]
in any awk you are actually creating a pseudo-multi-dimensional array, which is a 1-dimensional array indexed by the string formed by the concatenation of
1 SUBSEP 2
where SUBSEP is a control char that's unlikely to appear in your input.
In GNU awk 4.* you can do:
a[1][2]
(note the different syntax) and that populates an actual multi-dimensional array.
Try this to see the difference:
$ cat tst.awk
BEGIN {
SUBSEP=":" # just to make it visible when printing
oneD[1,2] = "a"
oneD[1,3] = "b"
twoD[1][2] = "c"
twoD[1][3] = "d"
for (idx in oneD) {
print "oneD", idx, oneD[idx]
}
print ""
for (idx1 in twoD) {
print "twoD", idx1
for (idx2 in twoD[idx1]) { # you CANNOT do this with oneD
print "twoD", idx1, idx2, twoD[idx1][idx2]
}
}
}
$ awk -f tst.awk
oneD 1:2 a
oneD 1:3 b
twoD 1
twoD 1 2 c
twoD 1 3 d
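In an awk without true arrays of arrays, the SUBSEP-joined keys can still be used for most lookups. A small sketch using the input from the question: direct access by row and column, a membership test, and splitting a key back into its parts (the names part and cells are illustrative).

```shell
out=$(printf 'a|b|c|d\nw|r|g|h\n' | awk -F'|' '
  { for (i = 1; i <= NF; i++) a[NR, i] = $i }
  END {
    for (r = 1; r <= NR; r++) print r, a[r, 2]   # column 2 of every row
    has = ((3, 1) in a)                          # membership test, creates nothing
    print (has ? "row 3 exists" : "no row 3")
    for (key in a) {                             # each key is row SUBSEP column
      split(key, part, SUBSEP)
      cells[part[1]]++
    }
    print "cells in row 1:", cells[1]
  }')
printf '%s\n' "$out"
# 1 b
# 2 r
# no row 3
# cells in row 1: 4
```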

Resources