counting occurence of character - bash

I have a file that looks like this
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
I need to count number of time A, B & D occur. Individually, I would do it like this
awk '{if($1~/A/) print $0 }' < test.txt | wc
awk '{if($1~/B/) print $0 }' < test.txt | wc
awk '{if($1~/D/) print $0 }' < test.txt | wc
How to join these lines so that I can count number of A,B,D just through one liner instead of 3 separate lines.

For specific line format (where the needed char is before _):
$ awk -F"_" '{ seen[substr($1, length($1))]++ }END{ for(k in seen) print k, seen[k] }' file
A 3
B 2
D 2

Counting occurrences is generally done by keeping track of a counter. So a single of the OP's awk lines;
awk '{if($1~/A/) print $0}' < test.txt | wc
can be rewritten as
awk '($1~/A/){c++}END{print c}' test.txt
for multiple cases, you can now do:
awk '($1~/A/){c["A"]++}
($1~/B/){c["B"]++}
($1~/D/){c["D"]++}
END{for(i in c) print i,c[i]}' test.txt
Now you can even clean this up a bit more:
awk '{c["A"]+=($1~/A/)}
{c["B"]+=($1~/B/)}
{c["D"]+=($1~/D/)}
END{for(i in c) print i,c[i]}' test.txt
which you can clean up further as:
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=($1~a[i])}
END{for(i in c) print i,c[i]}' test.txt
But these cases just count how many times a line occurs that contains the letter, not how many times the letter occurs.
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=gsub(a[i],"",$1)}
END{for(i in c) print i,c[i]}' test.txt

Perl to the rescue!
perl -lne '$seen{$1}++ if /([ABD])/; END { print "$_:$seen{$_}" for keys %seen }' < test.txt
-n reads the input line by line
-l removes newlines from input and adds them to output
a hash table %seen is used to keep the number of occurrences of each symbol. Each time it's matched it's captured and the corresponding field in the hash is incremented.
END is run when the file ends. It outputs all the keys of the hash, i.e. the matched characters, each followed by the number of occurrences.

datafile:
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
script.awk
BEGIN {
arr["A"]=0
arr["B"]=0
arr["D"]=0
}
/A/ { arr["A"]++ }
/B/ { arr["B"]++ }
/D/ { arr["D"]++ }
END {
printf "A: %s, B: %s, D: %s", arr["A"], arr["B"], arr["D"]
}
execution:
awk -f script.awk datafile
result:
A: 3, B: 2, D: 2

Related

Use an array created using awk as a variable in another awk script

I am trying to use awk to extract data using a conditional statement containing an array created using another awk script.
The awk script I use for creating the array is as follows:
array=($(awk 'NR>1 { print $1 }' < file.tsv))
Then, to use this array in the other awk script
awk var="${array[#]}" 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
However, when I run this, the following error occurs.
awk: fatal: cannot open file 'foo' for reading (No such file or directory)
I've already looked at multiple posts on why this error occurs and on how to correctly implement a shell variable in awk, but none of these have worked so far. However, when removing the shell variable and running the script it does work.
awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
I really need that conditional statement but don't know what I am doing wrong with implementing the bash variable in awk and would appreciate some help.
Thx in advance.
That specific error messages is because you forgot -v in front of var= (it should be awk -v var=, not just awk var=) but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array, and shell and awk are 2 completely different tools each with their own syntax, semantics, scopes, etc.
Here's how to really do what you're trying to do:
array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
awk -v xyz="${array[*]}" '
BEGIN{ split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
... now use `var` as you were trying to ...
'
For example:
$ cat file.tsv
col1 col2
a b c d e
f g h i j
$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j
$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h
$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
$ awk -v xyz="${array[*]}" '
BEGIN {
split(xyz,tmp,RS)
for (i in tmp) {
var[tmp[i]]
}
for (idx in var) {
print "<" idx ">"
}
}
'
<f g h>
<a b>
It's easier and more efficient to process both files in a single awk:
edit: fixed issues in comment, thanks #EdMorton
awk '
FNR == NR {
if ( FNR > 1 )
var[$1]
next
}
FNR == 1 {
for (i = 1; i <= NF; i++)
heading[i] = $i
next
}
{
for (i = 2; i <= NF; i++)
if ( $i == "1" && heading[i] in var) {
outFile = heading[i] ".txt"
print ">kmer" (NR-1) "\n" $1 >> (outFile)
close(outFile)
}
}
' file.tsv input.txt
You might store string in variable, then use split function to turn that into array, consider following simple example, let file1.txt content be
A B C
D E F
G H I
and file2.txt content be
1
3
2
then
var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt
gives output
First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D
Explanation: I store output of 1st awk command, which is then used as 1st argument to split function in 2nd awk command. Disclaimer: this solutions assumes all files involved have delimiter compliant with default GNU AWK behavior, i.e. one-or-more whitespaces is always delimiter.
(tested in gawk 4.2.1)

awk FS vs FPAT puzzle and counting words but not blank fields

Suppose I have the file:
$ cat file
This, that;
this-that or this.
(Punctuation at the line end is not always there...)
Now I want to count words (with words being defined as one or more ascii case-insensitive letters.) In typical POSIX *nix you could do:
sed -nE 's/[^[:alpha:]]+/ /g; s/ $//p' file | tr ' ' "\n" | tr '[:upper:]' '[:lower:]' | sort | uniq -c
1 or
2 that
3 this
With grep you can shorten that a bit to only match what you define as a word:
grep -oE '[[:alpha:]]+' file | tr '[:upper:]' '[:lower:]' | sort | uniq -c
# same output
With GNU awk, you can use FPAT to replicate matching only what you want (ignore sorting...):
gawk -v FPAT="[[:alpha:]]+" '
{for (i=1;i<=NF;i++) {seen[tolower($i)]++}}
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
3 this
1 or
2 that
Now trying to replicate in POSIX awk I tried:
awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
2
3 this
1 or
2 that
Note the 2 with blank at top. This is from having blank fields from ; at the end of line 1 and . at the end of line 2. If you delete the punctuation at line's end, this issue goes away.
You can partially fix it (for all but the last line) by setting RS="" in the awk, but still get a blank field with the last (only) line.
I can also fix it this way:
awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) if ($i) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
Which seems a little less than straight forward.
Is there an idiomatic fix I am missing to make POSIX awk act similarly to GNU awk's FPAT solution here?
This should work in POSIX/BSD or any version of awk:
awk -F '[^[:alpha:]]+' '
{for (i=1; i<=NF; ++i) ($i != "") && ++count[tolower($i)]}
END {for (e in count) printf "%4s %s\n", count[e], e}' file
1 or
3 this
2 that
By using -F '[^[:alpha:]]+' we are splitting fields on any non-alpha character.
($i != "") condition will make sure to count only non-empty fields in seen.
With POSIX awk, I'd use match and the builtin RSTART and RLENGTH variables:
#!awk
{
s = $0
while (match(s, /[[:alpha:]]+/)) {
word = substr(s, RSTART, RLENGTH)
count[tolower(word)]++
s = substr(s, RSTART+RLENGTH)
}
}
END {
for (word in count) print count[word], word
}
$ awk -f countwords.awk file
1 or
3 this
2 that
Works with the default BSD awk on my Mac.
With your shown samples, please try following awk code. Written and tested in GNU awk in case you are ok to do this with RS approach.
awk -v RS='[[:alpha:]]+' '
RT{
val[tolower(RT)]++
}
END{
for(word in val){
print val[word], word
}
}
' Input_file
Explanation: Simple explanation would be, using RS variable of awk to make record separator as [[:alpha:]] then in main program creating array val whose index is RT variable and keep counting its occurrences with respect to same index in array val. In END block of this program traversing through array and printing indexes with its respective values.
Using RS instead:
$ gawk -v RS="[^[:alpha:]]+" ' # [^a-zA-Z] or something for some awks
$0 { # remove possible leading null string
a[tolower($0)]++
}
END {
for(i in a)
print i,a[i]
}' file
Output:
this 3
or 1
that 2
Tested successfully on gawk and Mac awk (version 20200816) and on mawk and busybox awk using [^a-zA-Z]
With GNU awk using patsplit() and a second array for counting, you can try this:
awk 'patsplit($0, a, /[[:alpha:]]+/) {for (i in a) b[ tolower(a[i]) ]++} END {for (j in b) print b[j], j}' file
3 this
1 or
2 that

How to grep for multiple word occurrences from multiple files and list them grouped as rows and columns

Hello: Need your help to count word occurrences from multiple files and output them as row and columns. I searched the site for a similar reference but could not locate, hence posting it here.
Setup:
I have 2 files with the following
[a.log]
id,status
1,new
2,old
3,old
4,old
5,old
[b.log]
id,status
1,new
2,old
3,new
4,old
5,new
Results required
The result i require using the command line only is (preferably):
file count(new) count(old)
a.log 1 4
b.log 3 2
Script
The script below provides me the count for a single word across multiple.
I am stuck trying to get results for multiple words. Please help.
grep -cw "old" *.log
You can get this output using gnu-awk that accepts comma separated word to be searched in a command line argument:
awk -v OFS='\t' -F, -v wrds='new,old' 'BEGIN{n=split(wrds, a, /,/); for(i=1; i<=n; i++) b[a[i]]=a[i]} FNR==1{next} $2 in b{freq[FILENAME][$2]++} END{printf "%s", "file" OFS; for(i=1; i<=n; i++) printf "count(%s)%s", a[i], (i==n?ORS:OFS); for(f in freq) {printf "%s", f OFS; for(i=1; i<=n; i++) printf "%s%s", freq[f][a[i]], (i==n?ORS:OFS)}}' a.log b.log | column -t
Output:
file count(new) count(old)
a.log 1 4
b.log 3 2
PS: column -t was only used for formatting the output in tabular format.
Readable awk:
awk -v OFS='\t' -F, -v wrds='new,old' 'BEGIN {
n = split(wrds, a, /,/) # split input words list by comma with int index
for(i=1; i<=n; i++) # store words in another array with key as words
b[a[i]]=a[i]
}
FNR==1 {
next # skip first row from all the files
}
$2 in b {
freq[FILENAME][$2]++ # store filename and word frequency in 2-dimesional array
}
END { # print formatted result
printf "%s", "file" OFS
for(i=1; i<=n; i++)
printf "count(%s)%s", a[i], (i==n?ORS:OFS)
for(f in freq) {
printf "%s", f OFS
for(i=1; i<=n; i++)
printf "%s%s", freq[f][a[i]], (i==n?ORS:OFS)
}
}' a.log b.log
I think you're looking for something like this, but it's not overly clear what your objectives are (if you're going for efficiency for example, this isn't overly efficient)...
for file in *.log; do
echo -n "${file}\t"
for word in "new" "old"; do
grep -cw $word $file;
echo -n "\t";
done
echo;
done
(for readability, I simplified the first line, but this doesn't work if there's spaces in the filenames -- the proper solution is to change the first line to read find . -iname "*.log" -maxdepth=1 | while read file; do)
for c in a b ; do egrep -o "new|old" $c.log | sort | uniq -c > $c.luc; done
Get rid of the headlines with grep, then sort and count.
join -1 2 -2 2 a.luc b.luc
> new 1 3
> old 4 2
Placing a new header is left as an exercise for the reader. Is there a flip command for unix/linux/bash to flip a table, or how would you say?
Handling empty cells is left as an exercise too, but possible with join.
without real multi-dim array support, this will count all values in field 2, not just "new/old". The header and number of columns are dynamic with number of distinct values as well.
$ awk -F, 'NR==1 {fs["file"]}
FNR>1 {c[FILENAME,$2]++; fs[FILENAME]; ks[$2];
c["file",$2]="count("$2")"}
END {for(f in fs)
{printf "%s", f;
for(k in ks) printf "%s", OFS c[f,k];
printf "\n"}}' file{1,2} | column -t
file count(new) count(old)
file1 1 4
file2 3 2
Awk solution:
awk 'BEGIN{
FS=","; OFS="\t"; print "file","count(new)","count(old)";
f1=ARGV[1]; f2=ARGV[2] # get filenames
}
FNR==1{ next } # skip the 1st header line
NR==FNR{ c1[$2]++; next } # accumulate occurrences of the 2nd field in 1st file
{ c2[$2]++ } # accumulate occurrences of the 2nd field in 2nd file
END{
print f1, c1["new"], c1["old"];
print f2, c2["new"], c2["old"]
}' a.log b.log
The output:
file count(new) count(old)
a.log 1 4
b.log 3 2

awk match substring in column from 2 files

I have the following two files (real data is tab-delimited instead of semicolon):
input.txt
Astring|2042;MAR0303;foo1;B
Dstring|2929;MAR0283;foo2;C
db.txt updated
TG9284;Astring|2042|morefoohere_foo_foo
TG9281;Cstring|2742|foofoofoofoofoo Dstring|2929|foofoofoo
So, column1 of input.txtis a substring of column2 of db.txt. Only two "fields" separated by | is important here.
I want to use awk to match these two columns and print the following (again in tab-delimited form):
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
This is my code:
awk -F'[\t]' 'NR==FNR{a[$1]=$1}$1 in a {print $0"\t"$1}' input.txt db.txt
EDIT
column2 of db.txt contains strings of column1 of input.txt, delimited by a space. There are many more strings in the real example than shown in the short excerpt.
You can use this awk:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{
split($2, b, "|"); a[b[1] "|" b[2]]=$1; next}
$1 in a {print $0, a[$1]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
EDIT:
As per your comment you can use:
awk 'BEGIN{FS=OFS="\t"} NR==FNR {
a[$2]=$1; next} {for (i in a) if (index(i, $1)) print $0, a[i]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
Going with the semicolons, you can replace with the tabs:
$ awk -F\; '
NR==FNR { # hash the db file
a[$2]=$1
next
}
{
for(i in a) # for each record in input file
if($1~i) { # see if $1 matches a key in a
print $0 ";" a[i] # output
# delete a[i] # delete entry from a for speed (if possible?)
break # on match, break from for loop for speed
}
}' db input # order order
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
For each record in input script matches the $1 against every entry in db, so it's slow. You can speed it up by adding a break to the if and deleteing matching entry from a (if your data allows it).

Iterate through list in bash and run multiple grep commands

I would like to iterate through a list and grep for the items, then use awk to pull out important information from each grep result. (This is the way I thought to do it, but awk and grep aren't necessary if there is a better way).
The input file contains a number of lines that looks similar to this:
chr1 12345 . A G 3e-12 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;
I have a number of locations that should match some part of the second column.
locList="123, 789"
And for each matching location I would like to get the information from columns 4 and 5 and write them to an output file with the corresponding location.
So the output for the above list should be:
123 A G
Something like this is what I'm thinking:
for i in locList; do
grep i inputFile.txt | awk '{print $2,$4,$5}'
done
Invoking grep/awk once per location will be highly inefficient. You want to invoke a single command that will do your parsing. For example, awk:
awk -v locList="12345 789" '
BEGIN {
# parse the location list, and create an array where
# the locations are the array indexes
n = split(locList, a)
for (i=1; i<=n; i++) locations[a[i]] = 1
}
$2 in locations {print $2, $4, $5}
' file
revised requirements
awk -v locList="123 789" '
BEGIN { n = split(locList, patterns) }
{
for (i=1; i<=n; i++) {
if ($2 ~ "^" patterns[i]) {
print $2, $4, $5
break
}
}
}
' file
The ~ operator is the regular expression matching operator.
That will output 12345 A G from your sample input. If you just want to output 123 A G then print patterns[i] instead of $2.
awk -v locList='123|789' '$2~"^("locList")" {print $2,$4,$5}' file
or if you prefer:
locList='123, 789'
awk -v locList="^(${locList//, /|})" '$2~locList {print $2,$4,$5}' file
or whatever other permutation you like. The point is you don't need a loop at all - just create a regexp from the list of numbers in locList and test that regexp once.
What I would do :
locList="123 789"
for i in $locList; do awk -vvar=$i '$2 ~ var{print $4, $5}' file; done

Resources