Using Awk and match() - bash

I have a sequencing file to analyze that has many lines like the following tab separated line:
chr12 3356475 . C A 76.508 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=3;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=9.52472;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=8.76405;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=111;QR=0;RO=0;RPP=9.52472;RPPR=0;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1/1:3:0:0:3:111:-10,-0.90309,0
I am trying to use awk to match particular regions to their DP value. This is how I'm trying it:
awk '$2 == 33564.. { match(DP=) }' file.txt | head
Neither the matching nor the wildcards seem to work.
Ideally this code would output 3 because that is what DP equals.

You can use either ; or tab as the field delimiter. With the sample line above, the number is then in $2 and the DP= field is in $15:
awk -F'[;\t]' '$2 ~ /33564../{sub(/DP=/,"",$15);print $15}' file.txt
The sub function removes DP= from $15, which leaves only the value.
Btw, if you also add = to the set of field delimiters, the value of DP will be in field 23:
awk -F'[;\t=]' '$2 ~ /33564../{print $23}' file.txt
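The field number depends on how many INFO keys come before DP, so it can be worth checking it against real data before hard-coding it. A minimal sketch, assuming the sample line from the question (truncated after DPB=3) with real tabs; the variable names line and found are only for illustration:

```shell
# Sample record from the question, truncated after DPB=3, with real tabs.
line=$(printf 'chr12\t3356475\t.\tC\tA\t76.508\t.\tAB=0;ABP=0;AC=2;AF=1;AN=2;AO=3;CIGAR=1X;DP=3;DPB=3')

# Split on tab or semicolon and report which field starts with "DP=".
found=$(printf '%s\n' "$line" | awk -F'[;\t]' '
$2 ~ /^33564..$/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^DP=/) print i, $i
}')
echo "$found"
```

For this sample the command prints 15 DP=3. DPB=3 is not reported because the test requires = directly after DP.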

Having worked with genomic data, I believe that the following will be more robust than the previously posted solution. The main difference is that the key-value pairs are treated as such, without any assumption about their ordering, etc. The minor difference is the caret ("^") in the regex:
awk -F'\t' '
$2 ~ /^33564../ {
    n = split($8, a, ";")
    for (i = 1; i <= n; i++) {
        split(a[i], b, "=")
        if (b[1] == "DP") { print $2, b[2] }
    }
}'
If this script is to be used more than once, then it would be better to abstract the lookup functionality, e.g. like so:
awk -F'\t' '
function lookup(key, string,    i, n, a, b) {
    n = split(string, a, ";")
    for (i = 1; i <= n; i++) {
        split(a[i], b, "=")
        if (b[1] == key) { return b[2] }
    }
}
$2 ~ /^33564../ {
    val = lookup("DP", $8)
    # compare with "" so that a value of 0 is still printed
    if (val != "") { print $2, val }
}'
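Since the question asked about match(): the built-in takes a string and a regular expression, and sets RSTART and RLENGTH to the position and length of the match. A minimal sketch of how it could be used here, assuming the INFO column is field 8 of a tab-separated line; the variable names line, dp and kv are only for illustration:

```shell
# Sample record from the question, truncated, with real tabs.
line=$(printf 'chr12\t3356475\t.\tC\tA\t76.508\t.\tAB=0;ABP=0;AC=2;AF=1;AN=2;AO=3;CIGAR=1X;DP=3;DPB=3\tGT:DP\t1/1:3')

dp=$(printf '%s\n' "$line" | awk -F'\t' '
$2 ~ /^33564..$/ && match($8, /(^|;)DP=[^;]*/) {
    kv = substr($8, RSTART, RLENGTH)   # ";DP=3" (or "DP=3" at the start)
    sub(/^;?DP=/, "", kv)              # keep only the value
    print kv
}')
echo "$dp"
```

The (^|;) prefix keeps the expression from matching inside a longer key that merely ends in DP.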

Related

How to match a unique pattern using awk?

I have a text file called 'file.txt' with the content like,
test:one
test_test:two
test_test_test:three
If the pattern is test, then the expected output should be one and similarly for the other two lines.
This is what I have tried.
pattern=test && awk '{split($0,i,":"); if (i[1] ~ /'"$pattern"'$/) print i[2]}'
This command gives the output like,
one
two
three
and pattern=test_test && awk '{split($0,i,":"); if (i[1] ~ /'"$pattern"'$/) print i[2]}'
two
three
How can I match the unique pattern being "test" for "test" and not for "test_test" and so on.
Don't use a regex for comparing the value, just use equality:
awk -F: -v pat='test' '$1 == pat {print $2}' file
one
awk -F: -v pat='test_test' '$1 == pat {print $2}' file
two
If you really want to use regex, then use it like this with anchors:
awk -F: -v pat='test' '$1 ~ "^" pat "$" {print $2}' file
one
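One reason to prefer equality: with ~ the value of pat is interpreted as a regular expression, so characters such as . keep their special meaning. A small sketch with made-up data (the keys a.b and axb are hypothetical) shows the difference:

```shell
# Hypothetical data: the first key contains a dot.
data=$(printf 'a.b:one\naxb:two')

# String equality: only the literal key a.b matches.
exact=$(printf '%s\n' "$data" | awk -F: -v pat='a.b' '$1 == pat {print $2}')

# Anchored regex: the dot matches any character, so axb matches as well.
regex=$(printf '%s\n' "$data" | awk -F: -v pat='a.b' '$1 ~ ("^" pat "$") {print $2}')

echo "equality:"; echo "$exact"
echo "regex:"; echo "$regex"
```

The equality test prints only one, while the anchored regex prints both one and two.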
If you want to use a regex, you can build it dynamically: pattern, optionally followed by repetitions of _ and pattern, up to the :
If that matches at the start of the line, print the second field.
awk -v pattern='test' -F: '
$0 ~ "^"pattern"(_"pattern")*:" {
    print $2
}
' file
Output
one
two
three
Or if only matching the part before the first underscore is also ok, then splitting field 1 on _ and printing field 2:
awk -v pattern='test' -F: '{
    split($1, a, "_")
    if (a[1] == pattern) print $2
}' file
Using GNU sed with word boundaries
$ sed -n '/\<test\>/s/[^:]*://p' input_file
one

Add string to columns in bash

I have a comma-delimited file to which I want to append a string in specific columns. I am trying to do something like this, but couldn't do it until now.
re1,1,a1e,a2e,AGT
re2,2,a1w,a2w,AGT
re3,3,a1t,a2t,ACGTCA
re12,4,b1e,b2e,ACGTACT
And I want to append 'some_string' to columns 3 and 4:
re1,1,some_stringa1e,some_stringa2e,AGT
re2,2,some_stringa1w,some_stringa2w,AGT
re3,3,some_stringa1t,some_stringa2t,ACGTCA
re12,4,some_stringb1e,some_stringb2e,ACGTACT
I was trying something similar to the suggested solution, but to no avail:
awk -v OFS=$'\,' '{ $3="some_string" $3; print}' $lookup_file
Also, I would like my string to be added to both columns. How would you do this with awk or bash?
Thanks a lot in advance
You can do that with (almost) what you have:
pax> echo 're1,1,a1e,a2e,AGT
re2,2,a1w,a2w,AGT
re3,3,a1t,a2t,ACGTCA
re12,4,b1e,b2e,ACGTACT' | awk 'BEGIN{FS=OFS=","}{$3 = "pre3:"$3; $4 = "pre4:"$4; print}'
re1,1,pre3:a1e,pre4:a2e,AGT
re2,2,pre3:a1w,pre4:a2w,AGT
re3,3,pre3:a1t,pre4:a2t,ACGTCA
re12,4,pre3:b1e,pre4:b2e,ACGTACT
The BEGIN block sets the input and output field separators, the two assignments massage fields 3 and 4, and the print outputs the modified line.
You need to set FS to comma, not just OFS. There's a shortcut for setting FS, it's the -F option.
awk -F, -v OFS=',' '{ $3="some_string" $3; $4 = "some_string" $4; print}' "$lookup_file"
In awk, placing strings next to each other concatenates them, so no operator is needed. A pattern of 1 is always true, and a pattern with no {action} defaults to print. You can use Bash's brace expansion to assign both FS and OFS after the script:
awk '{$3 = "three" $3; $4 = "four" $4} 1' {O,}FS=,
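If the set of columns may change, the same idea can be driven by a list. This is only a sketch: prefix and cols are hypothetical variable names, and cols is assumed to be a comma-separated list of field numbers.

```shell
# Two of the sample rows from the question.
input=$(printf 're1,1,a1e,a2e,AGT\nre12,4,b1e,b2e,ACGTACT')

# prefix and cols are hypothetical names; cols lists the fields to change.
result=$(printf '%s\n' "$input" | awk -F, -v OFS=, -v prefix='some_string' -v cols='3,4' '
BEGIN { n = split(cols, c, ",") }
{
    for (i = 1; i <= n; i++) $(c[i]) = prefix $(c[i])
    print
}')
echo "$result"
```

Note that awk processes backslash escapes in values passed with -v, so a prefix containing backslashes would need extra care.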

AWK: search substring in first file against second

I have the following files:
data.txt
Estring|0006|this_is_some_random_text|more_text
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here
allids.txt (here the columns are separated by semicolon; the real input is tab-delimited)
Estring|0006;MAR0593
Fstring|0002;MAR0592
Fstring|0028;MAR1195
Please note: in data.txt the important part is the first two "columns" (name|number).
Now I want to use awk to search the first part (name|number) of data.txt in allids.txt and output the second column (starting with MAR)
so my expected output would be (again tab-delimited):
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
I do not know now how to search that first conserved part within awk, the rest should then be:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$0], [$1] }' data.txt allids.txt
I would use a set of field delimiters, like this:
awk -F'[|\t;]' 'NR==FNR{a[$1"|"$2]=$0; next}
$1"|"$2 in a {print a[$1"|"$2]"\t"$NF}' data.txt allids.txt
In your real-data example you can remove the ;. It is in here just to be able to reproduce the example in the question.
Here is another awk that uses a different field separator for both files:
awk -F ';' 'NR==FNR{a[$1]=FS $2; next} {k=$1 FS $2}
k in a{$0=$0 a[k]} 1' allids.txt FS='|' data.txt
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
This command uses ; as FS for allids.txt and uses | as FS for data.txt.
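For the real, tab-delimited allids.txt, a variant could look like the following. It is a sketch under the assumption that the ID file has exactly two tab-separated columns; the temporary directory and the array names id and part are only for illustration.

```shell
tmp=$(mktemp -d)
printf 'Estring|0006|this_is_some_random_text|more_text\nFstring|0010|random_combination_of_characters\nFstring|0028|again_here\n' > "$tmp/data.txt"
printf 'Estring|0006\tMAR0593\nFstring|0002\tMAR0592\nFstring|0028\tMAR1195\n' > "$tmp/allids.txt"

joined=$(awk -F'\t' '
NR == FNR { id[$1] = $2; next }          # allids.txt: name|number -> MAR id
{
    split($0, part, "[|]")               # data.txt: rebuild name|number
    key = part[1] "|" part[2]
    if (key in id) $0 = $0 "\t" id[key]  # append the id when there is one
    print                                # unmatched lines pass through
}' "$tmp/allids.txt" "$tmp/data.txt")
rm -rf "$tmp"
printf '%s\n' "$joined"
```

Lines of data.txt without an entry in allids.txt are printed unchanged, which matches the expected output in the question.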

Iterate through list in bash and run multiple grep commands

I would like to iterate through a list and grep for the items, then use awk to pull out important information from each grep result. (This is the way I thought to do it, but awk and grep aren't necessary if there is a better way).
The input file contains a number of lines that looks similar to this:
chr1 12345 . A G 3e-12 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;
I have a number of locations that should match some part of the second column.
locList="123, 789"
And for each matching location I would like to get the information from columns 4 and 5 and write them to an output file with the corresponding location.
So the output for the above list should be:
123 A G
Something like this is what I'm thinking:
for i in locList; do
grep i inputFile.txt | awk '{print $2,$4,$5}'
done
Invoking grep/awk once per location will be highly inefficient. You want to invoke a single command that will do your parsing. For example, awk:
awk -v locList="12345 789" '
BEGIN {
    # parse the location list, and create an array where
    # the locations are the array indexes
    n = split(locList, a)
    for (i=1; i<=n; i++) locations[a[i]] = 1
}
$2 in locations {print $2, $4, $5}
' file
For the revised requirements, where each location only has to match the start of column 2:
awk -v locList="123 789" '
BEGIN { n = split(locList, patterns) }
{
    for (i=1; i<=n; i++) {
        if ($2 ~ "^" patterns[i]) {
            print $2, $4, $5
            break
        }
    }
}
' file
The ~ operator is the regular expression matching operator.
That will output 12345 A G from your sample input. If you just want to output 123 A G then print patterns[i] instead of $2.
awk -v locList='123|789' '$2~"^("locList")" {print $2,$4,$5}' file
or if you prefer:
locList='123, 789'
awk -v locList="^(${locList//, /|})" '$2~locList {print $2,$4,$5}' file
or whatever other permutation you like. The point is you don't need a loop at all - just create a regexp from the list of numbers in locList and test that regexp once.
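The ${locList//, /|} substitution is bash-specific. As a sketch of the same idea in plain POSIX shell, the list can be converted with sed instead; the three input rows below are made up for illustration:

```shell
locList='123, 789'
# Turn "123, 789" into the anchored regexp ^(123|789) without bash-only syntax.
regex="^($(printf '%s' "$locList" | sed 's/, */|/g'))"

# Hypothetical rows: only the first and third start with a listed location.
matches=$(printf 'chr1\t12345\t.\tA\tG\nchr1\t91234\t.\tT\tC\nchr2\t7890\t.\tC\tT\n' |
    awk -v re="$regex" '$2 ~ re {print $2, $4, $5}')
echo "$matches"
```

The row with 91234 is skipped because of the ^ anchor: it contains 123 but does not start with it.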
What I would do:
locList="123 789"
for i in $locList; do awk -v var="$i" '$2 ~ var{print $4, $5}' file; done

Bash script to grep through one file for a list names, then grep through a second file to match those names to get a lookup value

Somehow, being specific just doesn't translate well into a title.
Here is my goal, using BASH script in a cygwin environment:
Read text file $filename to get a list of schemas and table names
Take that list of schemas and table names and find a match in $lookup_file to get a value
Use that value to make a logic choice
I basically have each item working separately. I just can't figure out how to glue it all together.
For step one, it's
grep $search_string $filename | awk '{print $1, $5}' | sed -e 's~"~~g' -e 's~ ~\t~g'
Which gives a list of schema{tab}table
For step two, it's
grep -e '{}' $lookup_file | awk '{print $3}'
Where $lookup_file is schema{tab}table{tab}value
Step three is basically, based on the value returned, do "something"; file a report, email a warning, ignore it, etc.
I tried stringing part one and two together with xargs, but it treats the schema and the table name as filenames and throws errors.
What is the glue I'm missing? Or is there a better method?
awk -v s="$search_string" 'NR == FNR { if ($0 ~ s) { gsub(/"/, "", $5); a[$1, $5] = 1; }; next; } a[$1, $2] { print $3; }' "$filename" "$lookup_file"
Explained:
NR == FNR { if ($0 ~ s) { gsub(/"/, "", $5); a[$1, $5] = 1; }; next; } targets the first file, searching it for valid matches and saving the key values in array a.
a[$1, $2] { print $3; } targets the second file and prints the value in its third column if its first and second columns match a key in array a.
awk -v search="$search_string" '$0 ~ search { gsub(/"/, "", $5);
print $1"\t"$5; }' "$filename" |
while IFS=$'\t' read -r schema table
do
result=$(awk -F'\t' -v schema="$schema" -v table="$table" '$1 == schema && $2 == table { print $3; }' "$lookup_file");
# Do "something" with $result
done
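Step three, the logic choice, could then be a case statement on the looked-up value. The sketch below is purely illustrative: the file names pairs.tsv and lookup.tsv, the schema and table names, and the values warn and ignore are all hypothetical stand-ins for the real data.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-ins for the step-one output and for $lookup_file.
printf 'sales\torders\nhr\tstaff\n' > "$tmp/pairs.tsv"
printf 'sales\torders\twarn\nops\tjobs\twarn\nhr\tstaff\tignore\n' > "$tmp/lookup.tsv"

actions=$(awk -F'\t' '
NR == FNR { want[$1, $2] = 1; next }             # schema/table pairs to look up
($1, $2) in want { print $1 "." $2, $3 }         # emit name and lookup value
' "$tmp/pairs.tsv" "$tmp/lookup.tsv" |
while read -r name value
do
    case $value in
        warn)   echo "email warning for $name" ;;
        ignore) echo "skipping $name" ;;
        *)      echo "unknown value for $name" ;;
    esac
done)
rm -rf "$tmp"
echo "$actions"
```

Using ($1, $2) in want rather than want[$1, $2] as the pattern avoids creating empty array entries for pairs that are not wanted.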
