Count Occurrences of a text in a file with a shell command - bash

Question seems simple, but there is a twist here.
Consider a file with this data:
A,B
A,C
A,D
D,A
C,A
B,A
Here, I need a bash command which gives the count of occurrences taking
A,B
B,A
as a single count. Hence total count for this example should be 3 and not 6.

Essentially the same as the other answers but it figures out the order of the components for hashing:
$ awk -F, '!(($(($1<$2)+1),$(($2<=$1)+1)) in a){a[$(($1<$2)+1),$(($2<=$1)+1)];c++}END{print c}' file
3
Explained
$ awk -F, '
!( ( $(($1<$2)+1), $(($2<=$1)+1) ) in a ) {
  a[$(($1<$2)+1),$(($2<=$1)+1)]
  c++
}
END { print c }' file
$1<$2 is either 0 or 1, therefore ($1<$2)+1 is 1 or 2, and $(($1<$2)+1) is either $1 or $2. The same applies to the other component, $(($2<=$1)+1): it's either $2 or $1. So the script references a[$1,$2] or a[$2,$1]. Tested with:
A,A
A,A
That <= could be just < in the latter component, leading to a[$1,$1] if $1==$2.
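The same canonicalisation can be written with an explicit ternary, which some may find easier to read (my rephrasing of the same idea, not part of the original answer):
$ awk -F, '{ key = ($1 < $2) ? $1 SUBSEP $2 : $2 SUBSEP $1 }   # order the pair before using it as the key
           !(key in a) { a[key]; c++ }
           END { print c }' file
3
Since a[$1,$2] is shorthand for a[$1 SUBSEP $2], this builds exactly the same keys as the answer above.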

awk to the rescue!
$ awk -F, '!(($1,$2) in a){a[$1,$2];a[$2,$1];c++} END{print c}' file

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) that contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the number after ETA= is greater than 15; here, that means only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
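Note that with FS set to "[=, ]" the value after ETA is a string like 23:00, so the comparison above relies on string ordering, which happens to agree with numeric ordering for two-digit hours. A slightly more defensive variant (my tweak: +0 forces numeric conversion of the field's leading digits) would be:
awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA" && $(i+1)+0 > 15) print $1}' input_file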
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. It then checks whether the 2nd element of arr is greater than 15 and, if so, prints the 1st element of arr, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
  print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index and substr: index finds where ETA= starts, and substr gets the 2 characters after ETA= (the offset of 4 is used because ETA= is 4 characters long and index gives the start position). I then use +0 to convert the result to an integer and compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
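To see the index/substr arithmetic in isolation (a toy example of mine, not from the original answer):
$ awk 'BEGIN { s = "345 pro=rbs, ETA=23:00"
               print substr(s, index(s, "ETA=")+4, 2)+0 }'   # +4 skips past "ETA=", substr grabs 2 chars, +0 makes it a number
23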
Whenever the input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You probably want to separately split the line to obtain the first, space-separated column.
awk -F 'ETA=' '$2 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one). That text is not purely numeric, so awk compares it as a string against "15"; for two-digit hour values this happens to agree with the numeric order. So for example, on the first line, we are literally checking whether 12:00, team=xyz,user1=tom,dom=dby.com sorts after 15, which effectively checks if 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
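If you'd rather not lean on that string-ordering subtlety, appending +0 forces a genuine numeric comparison by converting the field's leading digits (a small variation of mine on the same idea):
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt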
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
  if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
  if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

Computing the size of array in text file in bash

I have a text file that sometimes (not always) will have an array with a unique name like this:
unique_array=(1,2,3,4,5,6)
I would like to find the size of the array (6 in the above example) when it exists, and skip it or return -1 when it doesn't.
Grepping the file will tell me if the array exists but not how to find its size.
The array can fill multiple lines like
unique_array=(1,2,3,
4,5,6,
7,8,9,10)
Some of the elements in the array can be negative as in
unique_array=(1,2,-3,
4,5,6,
7,8,-9,10)
awk -v RS=\) -F, '/unique_array=\(/ {print /[0-9]/?NF:0}' file.txt
-v RS=\) - delimit records by ) instead of newlines
-F, - delimit fields by , instead of whitespace
/unique_array=(/ - look for a record containing the unique identifier
/[0-9]/?NF:0 - if the record contains a digit, the number of fields (i.e. commas+1), otherwise 0
There is a bad bug in the code above: commas preceding the array may be erroneously counted. A fix is to truncate the prefix:
awk -v RS=\) -F, 'sub(/.*unique_array=\(/,"") {print /[0-9]/?NF:0}' file.txt
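If the "return -1 if it doesn't exist" requirement matters, one possible extension is to remember whether anything matched (my sketch, building on the fixed one-liner above):
awk -v RS=\) -F, 'sub(/.*unique_array=\(/,"") { found=1; print (/[0-9]/ ? NF : 0) }
                  END { if (!found) print -1 }' file.txt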
Your specifications are woefully incomplete, but guessing a bit as to what you are actually looking for, try this at least as a starting point.
awk '/^unique_array=\(/ { in_array = 1; sub(/^unique_array=\(/, "") }
     in_array && /\)/   { sub(/\).*/, ""); quit = 1 }
     in_array           { sub(/,$/, ""); n += split($0, arr, ",");
                          if (quit) { print n; in_array = quit = n = 0 } }' file
We keep a state variable in_array which tells us whether we are currently in a region which contains the array. This gets set to 1 when we see the beginning of the array, and back to 0 when we see the closing parenthesis. At this point, we remove the closing parenthesis and everything after it, and set a second variable quit to trigger the finishing logic in the next condition. The last condition performs two tasks; it adds the items from this line to the count in n, and then checks if quit is true; if it is, we are at the end of the array, and print the number of elements.
This will simply print nothing if the array was not found. You could embellish the script to set a different exit code or print -1 if you like, but these details seem like unnecessary complications for a simple script.
I think what you probably want is this, using GNU awk for multi-char RS and RT and word boundaries:
$ awk -v RS='\\<unique_array=[(][^)]*[)]' 'RT{exit} END{print (RT ? gsub(/,/,"",RT)+1 : -1)}' file
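Spelled out (my annotation of the same one-liner):
awk -v RS='\\<unique_array=[(][^)]*[)]' '
    RT  { exit }   # the record separator is the whole array text; once it matches, RT holds it and we can stop reading
    END { print (RT ? gsub(/,/, "", RT)+1 : -1) }   # commas + 1 = element count, or -1 if the separator never matched
' file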
With your shown samples, please try the following awk.
awk -v RS= '
{
  while (match($0, /\<unique_array=[(][^)]*\)/)) {
    line = substr($0, RSTART, RLENGTH)
    gsub(/[[:space:]]*\n[[:space:]]*|(^|\n)unique_array=\(|(\)$|\)\n)/, "", line)
    print gsub(/,/, "&", line)+1
    $0 = substr($0, RSTART+RLENGTH)
  }
}
' Input_file
Using sed and declare -a. The test file is like this:
$ cat f
saa
dfsaf
sdgdsag unique_array=(1,2,3,
4,5,6,
7,8,9,10) sdfgadfg
sdgs
sdgs
sfsaf(sdg)
Testing:
$ declare -a "$(sed -n '/unique_array=(/,/)/s/,/ /gp' f | \
sed 's/.*\(unique_array\)/\1/;s/).*/)/;
s/`.*`//g')"
$ echo ${unique_array[@]}
1 2 3 4 5 6 7 8 9 10
And then you can do whatever you want with ${unique_array[@]}
With GNU grep or similar that support -z and -o options:
grep -zo 'unique_array=([^)]*)' file.txt | tr -dc =, | wc -c
-z - (effectively) treat file as a single line
-o - only output the match
tr -dc =, - strip everything except = and ,
wc -c - count the result
Note: both one-element and zero-element arrays will be reported as size 1. It will also return 0 rather than -1 if the array is not found.
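If the -1 convention is required, a small shell wrapper around the same pipeline could map "not found" to -1 (a sketch of mine, assuming a GNU userland):
n=$(grep -zo 'unique_array=([^)]*)' file.txt | tr -dc =, | wc -c)   # element count, 0 if absent
[ "$n" -eq 0 ] && n=-1
echo "$n"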
Here's an awk solution that works with gawk, mawk 1/2, and nawk:
TEST INPUT
saa
dfsaf
sdgdsag unique_array=(1,2,3,
4,5,6,
7,8,9,10) sdfgadfg
sdgs
sdgs
sfsaf(sdg)
CODE
{m,n,g}awk '
BEGIN { __ = "-1:_ERR_NOT_FOUND_"
RS = "^$" (_ = OFS = "")
FS = "(^|[ \t-\r]?)unique[_]array[=][(]"
___ = "[)].*$|[^0-9,.+-]"
} $!NF = NR < NF ? $(gsub(___,_)*_) : __'
OUTPUT
1,2,3,4,5,6,7,8,9,10

How to select two specific lines with awk?

/!\ The question is basically solved, see my own answer below for more details and a subsidiary question /!\
I'm trying to add two lines based on a specific word, but all I could find is adding everything after some pattern: How to select lines between two marker patterns which may occur multiple times with awk/sed
Which is not what I'm looking for.
Consider the following output:
aji 1
bsk 2
cmq 3
doh 4
enr 5
fwp 6
gzx 7
What I'm trying to get is something like cmq + fwp, whose output should be:
9
I do know how to add values, but I'm missing the "select the line containing cmq, then select the line containing fwp" part.
So, is there a way awk could strictly select two specific lines independently (then add them)?
Edit:
As far as I know, matching words is awk '/cmq/', but I need to do that for, let's say, "fwp" too so I can add them.
$ awk '$1 ~ /^(cmq|fwp)$/{sum+=$2} END { print sum}' infile
Explanation:
awk '$1 ~ /^(cmq|fwp)$/{ # look for the match in the first field
        sum+=$2          # sum up the 2nd field ($2) value, where sum is a variable
}
END{                     # at the end
        print sum        # print the variable sum
}' infile
Test Results:
$ cat infile
aji 1
bsk 2
cmq 3
doh 4
enr 5
fwp 6
gzx 7
$ awk '$1 ~ /^(cmq|fwp)$/{sum+=$2} END { print sum}' infile
9
Now, for a more generic way this time, one which even works for subtracting:
awk '/cmq/{x=$2} /fwp/{y=$2} END {print x+y}'
Where:
awk ' # Invoking awk and its instructions
/cmq/{x=$2} # Select the line with "cmq", then store its value in x. Pattern and action must stay together
/fwp/{y=$2} # Select the line with "fwp", then store its value in y. Pattern and action must stay together
END # Ends pattern matching/creation
{print x+y} # Print the calculated result
' # Ending awk's instructions
Unfortunately, two variables are used (x and y).
So, I'm still interested in finding how to do it without any variable, or with only one at the very most.
I do have a single-variable way for summing:
awk '/cmq|fwp/ {x+=$2} END {print x}'
But doing this for subtracting:
awk '/cmq|fwp/ {x-=$2} END {print x}'
doesn't work.
As a subsidiary question, does anyone know how to achieve such subtraction without a variable, or with only one at most?
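For what it's worth, one single-variable possibility for the subtraction (a sketch; it just flips the sign depending on which line matched):
awk '/cmq/{x+=$2} /fwp/{x-=$2} END {print x}'
With the sample above this prints -3 (3 - 6); swap the two patterns' roles to get 3.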

Print a comma except on the last line in Awk

I have the following script
awk '{printf "%s", $1"-"$2", "}' $a >> positions;
where $a stores the name of the file. I am actually writing multiple column values into one row. However, I would like to print a comma only if I am not on the last line.
Single pass approach:
cat "$a" | # look, I can use this in a pipeline!
awk 'NR > 1 { printf(", ") } { printf("%s-%s", $1, $2) }'
Note that I've also simplified the string formatting.
Enjoy this one:
awk '{printf t $1"-"$2} {t=", "}' $a >> positions
Yeah, it looks a bit tricky at first sight, so I'll explain. First of all, let's change printf to print for clarity:
awk '{print t $1"-"$2} {t=", "}' file
and have a look what it does, for example, for file with this simple content:
1 A
2 B
3 C
4 D
so it will produce the following:
1-A
, 2-B
, 3-C
, 4-D
The trick is the preceding t variable, which is empty at the beginning. The variable is set ({t=", "}) only after it has already been printed ({print t ...}). So as awk keeps iterating, we get the desired sequence.
I would do it by finding the number of lines before running the script, e.g. with coreutils and bash:
awk -v nlines=$(wc -l < "$a") '{printf "%s", $1"-"$2} NR != nlines { printf ", " }' "$a" >> positions
If your file only has 2 columns, the following coreutils alternative also works. Example data:
paste <(seq 5) <(seq 5 -1 1) | tee testfile
Output:
1 5
2 4
3 3
4 2
5 1
Now, replacing tabs with newlines, paste easily assembles the data into the desired format:
<testfile tr '\t' '\n' | paste -sd-,
Output:
1-5,2-4,3-3,4-2,5-1
You might think that awk's ORS and OFS would be a reasonable way to handle this:
$ awk '{print $1,$2}' OFS="-" ORS=", " input.txt
But this results in a trailing ORS, because print appends ORS to every record it outputs, including the last one. You can work around this with a bit of hackery, but the resultant complexity eliminates the elegance of the one-liner.
So here's my take on this. Since you say you're "writing multiple column values", it's possible that mucking with ORS and OFS would cause problems. So we can achieve the desired output entirely with formatting.
$ cat input.txt
3 2
5 4
1 8
$ awk '{printf "%s%d-%d",t,$1,$2; t=", "} END{print ""}' input.txt
3-2, 5-4, 1-8
This is similar to Michael's and rook's single-pass approaches, but it uses a single printf and correctly uses the format string for formatting.
This will likely perform negligibly better than Michael's solution because an assignment should take less CPU than a test, and noticeably better than any of the multi-pass solutions because the file only needs to be read once.
Here's a better way, without resorting to coreutils:
awk 'FNR==NR { c++; next } { ORS = (FNR==c ? "\n" : ", "); print $1, $2 }' OFS="-" file file
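Reading the file twice: the first pass only counts the records, and the second uses that count to decide which separator follows each line (my annotated expansion of the same one-liner):
awk '
    FNR == NR { c++; next }               # first pass: count the lines
    { ORS = (FNR == c ? "\n" : ", ") }    # second pass: newline after the last line, ", " otherwise
    { print $1, $2 }                      # OFS="-" joins the two columns
' OFS="-" file file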
awk '{a[NR]=$1"-"$2} END{for(i=1;i<=NR;i++) printf "%s%s", a[i], (i<NR ? ", " : "")}' $a > positions

Grab nth occurrence in between two patterns using awk or sed

I have an issue where I want to parse through the output from a file, and I want to grab the nth occurrence of text in between two patterns, preferably using awk or sed.
category
1
s
t
done
category
2
n
d
done
category
3
r
d
done
category
4
t
h
done
Let's just say for this example I want to grab the third occurrence of text in between category and done, essentially the output would be
category
3
r
d
done
This might work for you (GNU sed):
sed -n '/category/{:a;N;/done/!ba;x;s/^/x/;/^x\{3\}$/{x;p;q};x}' file
Turn off automatic printing by using the -n option. Gather up lines between category and done. Store a counter in the hold space and when it reaches 3 print the collection in the pattern space and quit.
Or if you prefer awk:
awk '/^category/,/^done/{if(++m==1)n++;if(n==3)print;if(/^done/)m=0}' file
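The same logic with the conditions spelled out (my annotation):
awk '/^category/,/^done/ {
    if (++m == 1) n++     # first line of each block: bump the block counter
    if (n == 3) print     # inside the 3rd block: print every line
    if (/^done/) m = 0    # block finished: reset the per-block line counter
}' file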
Try doing this:
awk -v n=3 '/^category/{l++} (l==n){print}' file.txt
Or more cryptic:
awk -v n=3 '/^category/{l++} l==n' file.txt
If your file is big:
awk -v n=3 '/^category/{l++} l>n{exit} l==n' file.txt
If your file doesn't contain any null characters, here's one way using GNU sed. This will find the third occurrence of a pattern range; however, you can easily modify this to get any occurrence you'd like.
sed -n '/^category/ { x; s/^/\x0/; /^\x0\{3\}$/ { x; :a; p; /done/q; n; ba }; x }' file.txt
Results:
category
3
r
d
done
Explanation:
Turn off default printing with the -n switch. Match the word 'category' at the start of a line. Swap the pattern space with the hold space and prepend a null character to the pattern, using it as a counter. If the pattern space then contains three null characters (i.e. this is the third occurrence), swap the category line back out of the hold space. Now create a loop and print the contents of the pattern space until the last pattern is matched. When this last pattern is found, sed will quit. If it's not found, sed will continue to read the next line of input and continue in its loop.
awk -v tgt=3 '
/^category$/ { fnd=1; rec="" }
fnd {
    rec = rec $0 ORS
    if (/^done$/) {
        if (++cnt == tgt) {
            printf "%s", rec
            exit
        }
        fnd = 0
    }
}
' file
With GNU awk you can set the record separator to a regular expression:
<file awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
RT is the matched record separator. Note that the record relative to n will be off by one as the first record refers to what precedes the first RS.
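To visualise that numbering (an illustration of mine, run against the sample input, with newlines replaced by spaces):
$ <file awk '{ gsub(/\n/, " "); print NR ": [" $0 "]" }' RS='\\<category'
1: []
2: [ 1 s t done ]
3: [ 2 n d done ]
4: [ 3 r d done ]
5: [ 4 t h done ]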
Edit
As per Ed's comment, this will not work when the records have other data in between them, e.g.:
category
1
s
t
done
category
2
n
d
done
foo
category
3
r
d
done
bar
category
4
t
h
done
One way to get around this is to clean up the input with a second (or first) awk:
<file awk '/^category$/,/^done$/' |
awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
Edit 2
As Ed has noted in the comments, the above methods do not search for the ending pattern. One way to do this, which hasn't been covered by the other answers, is with getline (note that there are some caveats with awk getline):
<file awk '
/^category$/ {
    v = $0
    while (!/^done$/) {
        if (!getline)
            exit
        v = v ORS $0
    }
    if (++nr == n)
        print v
}' n=3
On one line:
<file awk '/^category$/ { v = $0; while(!/^done$/) { if(!getline) exit; v = v ORS $0 } if(++nr == n) print v }' n=3
