awk match substring in column from 2 files - bash

I have the following two files (real data is tab-delimited instead of semicolon):
input.txt
Astring|2042;MAR0303;foo1;B
Dstring|2929;MAR0283;foo2;C
db.txt updated
TG9284;Astring|2042|morefoohere_foo_foo
TG9281;Cstring|2742|foofoofoofoofoo Dstring|2929|foofoofoo
So, column1 of input.txtis a substring of column2 of db.txt. Only two "fields" separated by | is important here.
I want to use awk to match these two columns and print the following (again in tab-delimited form):
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
This is my code:
awk -F'[\t]' 'NR==FNR{a[$1]=$1}$1 in a {print $0"\t"$1}' input.txt db.txt
EDIT
column2 of db.txt contains strings of column1 of input.txt, delimited by a space. There are many more strings in the real example than shown in the short excerpt.

You can use this awk:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{
split($2, b, "|"); a[b[1] "|" b[2]]=$1; next}
$1 in a {print $0, a[$1]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
EDIT:
As per your comment you can use:
awk 'BEGIN{FS=OFS="\t"} NR==FNR {
a[$2]=$1; next} {for (i in a) if (index(i, $1)) print $0, a[i]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281

Going with the semicolons, you can replace with the tabs:
$ awk -F\; '
NR==FNR { # hash the db file
a[$2]=$1
next
}
{
for(i in a) # for each record in input file
if($1~i) { # see if $1 matches a key in a
print $0 ";" a[i] # output
# delete a[i] # delete entry from a for speed (if possible?)
break # on match, break from for loop for speed
}
}' db input # order order
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
For each record in input script matches the $1 against every entry in db, so it's slow. You can speed it up by adding a break to the if and deleteing matching entry from a (if your data allows it).

Related

awk for string comparison with multiple delimiters

I have a file with multiple delimiters, I m looking to compare the value after the first / when read from right left with another file.
code :-
awk -F'[/|]' NR==FNR{a[$3]; next} ($1 in a )' file1 file2 > output
cat file1
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
BXC/DEF/OTTA|fed|92374|
AVD/customer|FST|8736481|
FFS/T6TT/BOSTON|money|18922|
GTS/trust/YYYY|opt|62376|
XXY/IJSH/trust|opt|62376|
cat file2
customer
trust
expected output :-
AAB/BBC/customer|fed|12931|
/customer|fed|12931|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|
$ awk -F\| ' # just use one FS
NR==FNR {
a[$1]
next
}
{
n=split($1,t,/\//) # ... and use split to the 1st field
if(t[n] in a) # and compare the last split part
print
}' file2 file1
Output:
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|
If you use this [/|] you will have 2 delimiters and you will not know what the value after the last pipe was.
Reading your question, you want to compare the first value after the last slash without pipe chars.
If there has to be a / present in the string, you can set that as the field separator and check if there are at least 2 fields using NF > 1
Then take the last field using $NF, split on | and check if the first part is present in one of the values of file2 which are stored in array a
$cat file1
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
BXC/DEF/OTTA|fed|92374|
AVD/customer|FST|8736481|
FFS/T6TT/BOSTON|money|18922|
GTS/trust/YYYY|opt|62376|
XXY/IJSH/trust|opt|62376|
customer
Example code
awk -F/ '
NR==FNR {a[$1];next}
NF > 1 {
split($NF, t, "|")
if(t[1] in a) print
}
' file2 file1
Output
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|

counting occurence of character

I have a file that looks like this
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
I need to count number of time A, B & D occur. Individually, I would do it like this
awk '{if($1~/A/) print $0 }' < test.txt | wc
awk '{if($1~/B/) print $0 }' < test.txt | wc
awk '{if($1~/D/) print $0 }' < test.txt | wc
How to join these lines so that I can count number of A,B,D just through one liner instead of 3 separate lines.
For specific line format (where the needed char is before _):
$ awk -F"_" '{ seen[substr($1, length($1))]++ }END{ for(k in seen) print k, seen[k] }' file
A 3
B 2
D 2
Counting occurrences is generally done by keeping track of a counter. So a single of the OP's awk lines;
awk '{if($1~/A/) print $0}' < test.txt | wc
can be rewritten as
awk '($1~/A/){c++}END{print c}' test.txt
for multiple cases, you can now do:
awk '($1~/A/){c["A"]++}
($1~/B/){c["B"]++}
($1~/D/){c["D"]++}
END{for(i in c) print i,c[i]}' test.txt
Now you can even clean this up a bit more:
awk '{c["A"]+=($1~/A/)}
{c["B"]+=($1~/B/)}
{c["D"]+=($1~/D/)}
END{for(i in c) print i,c[i]}' test.txt
which you can clean up further as:
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=($1~a[i])}
END{for(i in c) print i,c[i]}' test.txt
But these cases just count how many times a line occurs that contains the letter, not how many times the letter occurs.
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=gsub(a[i],"",$1)}
END{for(i in c) print i,c[i]}' test.txt
Perl to the rescue!
perl -lne '$seen{$1}++ if /([ABD])/; END { print "$_:$seen{$_}" for keys %seen }' < test.txt
-n reads the input line by line
-l removes newlines from input and adds them to output
a hash table %seen is used to keep the number of occurrences of each symbol. Each time it's matched it's captured and the corresponding field in the hash is incremented.
END is run when the file ends. It outputs all the keys of the hash, i.e. the matched characters, each followed by the number of occurrences.
datafile:
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
script.awk
BEGIN {
arr["A"]=0
arr["B"]=0
arr["D"]=0
}
/A/ { arr["A"]++ }
/B/ { arr["B"]++ }
/D/ { arr["D"]++ }
END {
printf "A: %s, B: %s, D: %s", arr["A"], arr["B"], arr["D"]
}
execution:
awk -f script.awk datafile
result:
A: 3, B: 2, D: 2

AWK: search substring in first file against second

I have the following files:
data.txt
Estring|0006|this_is_some_random_text|more_text
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here
allids.txt (here the columns are separated by semicolon; the real input is tab-delimited)
Estring|0006;MAR0593
Fstring|0002;MAR0592
Fstring|0028;MAR1195
please note: data.txt: the important part is here the first two "columns" = name|number)
Now I want to use awk to search the first part (name|number) of data.txt in allids.txt and output the second column (starting with MAR)
so my expected output would be (again tab-delimited):
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
I do not know now how to search that first conserved part within awk, the rest should then be:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$0], [$1] }' data.txt allids.txt
I would use a set of field delimiters, like this:
awk -F'[|\t;]' 'NR==FNR{a[$1"|"$2]=$0; next}
$1"|"$2 in a {print a[$1"|"$2]"\t"$NF}' data.txt allids.txt
In your real-data example you can remove the ;. It is in here just to be able to reproduce the example in the question.
Here is another awk that uses a different field separator for both files:
awk -F ';' 'NR==FNR{a[$1]=FS $2; next} {k=$1 FS $2}
k in a{$0=$0 a[k]} 1' allids.txt FS='|' data.txt
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
This command uses ; as FS for allids.txt and uses | as FS for data.txt.

Iterate through list in bash and run multiple grep commands

I would like to iterate through a list and grep for the items, then use awk to pull out important information from each grep result. (This is the way I thought to do it, but awk and grep aren't necessary if there is a better way).
The input file contains a number of lines that looks similar to this:
chr1 12345 . A G 3e-12 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;
I have a number of locations that should match some part of the second column.
locList="123, 789"
And for each matching location I would like to get the information from columns 4 and 5 and write them to an output file with the corresponding location.
So the output for the above list should be:
123 A G
Something like this is what I'm thinking:
for i in locList; do
grep i inputFile.txt | awk '{print $2,$4,$5}'
done
Invoking grep/awk once per location will be highly inefficient. You want to invoke a single command that will do your parsing. For example, awk:
awk -v locList="12345 789" '
BEGIN {
# parse the location list, and create an array where
# the locations are the array indexes
n = split(locList, a)
for (i=1; i<=n; i++) locations[a[i]] = 1
}
$2 in locations {print $2, $4, $5}
' file
revised requirements
awk -v locList="123 789" '
BEGIN { n = split(locList, patterns) }
{
for (i=1; i<=n; i++) {
if ($2 ~ "^" patterns[i]) {
print $2, $4, $5
break
}
}
}
' file
The ~ operator is the regular expression matching operator.
That will output 12345 A G from your sample input. If you just want to output 123 A G then print patterns[i] instead of $2.
awk -v locList='123|789' '$2~"^("locList")" {print $2,$4,$5}' file
or if you prefer:
locList='123, 789'
awk -v locList="^(${locList//, /|})" '$2~locList {print $2,$4,$5}' file
or whatever other permutation you like. The point is you don't need a loop at all - just create a regexp from the list of numbers in locList and test that regexp once.
What I would do :
locList="123 789"
for i in $locList; do awk -vvar=$i '$2 ~ var{print $4, $5}' file; done

Compare multiple Columns and Append the result into another file

I have two files file1 and file2, Both the files have 5 columns.
I want to compare first 4 columns of file1 with file2.
If they are equal, need to compare the 5th column. If 5th column values are different, need to print the file1's 5th column as file2's 6th column.
I have used below awk to compare two columns in two different files, but how to compare multiple columns and append the particular column in another file if matches found?
awk -F, 'NR==FNR{_1[$1]++;next}!_1[$1]'
file1:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,topcat
132,item2,grade3,wing7,middlecat
134,item2,grade3,wing7,middlecat
177,item8,gradeA,wing11,lowcat
file2:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,lowcat
132,item3,grade3,wing7,middlecat
126,item2,grade3,wing7,maingroup
177,item8,gradeA,wing11,lowcat
Desired output:
123,item3,grade5,wing10,lowcat,topcat
Awk can simulate multidimensional arrays by sequencing the indices. Underneath the indices are concatenated using the built-in SUBSEP variable as a separator:
$ awk -F, -v OFS=, 'NR==FNR { a[$1,$2,$3,$4]=$5; next } a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }' file1.txt file2.txt
123,item3,grade5,wing10,lowcat,topcat
awk -F, -v OFS=,
Set both input and output separators to ,
NR==FNR { a[$1,$2,$3,$4]=$5; next }
Create an associative array from the first file relating the first four fields of each line to the
fifth. When using a comma-separated list of values as an index, awk actually concatenates them
using the value of the built-in SUBSEP variable as a separator. This is awk's way of
simulating multidimensional arrays with a single subscript. You can set SUBSEP to any value you like
but the default, which is a non-printing character unlikely to appear in the data, is usually
fine. (You can also just do the trick yourself, something like a[$1 "|" $2 "|" $3 "|" $4],
assuming you know that your data contains no vertical bars.)
a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }
Arriving here, we know we are looking at the second file. If the first four fields were found in the
first file, and the $5 from the first file is different than the $5 in the second, print the line
from the second file followed by the $5 from the first. (I am assuming here that no $5 from the first file will have a value that evaluates to false, such as 0 or empty.)
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $0; sub("(,[^,]*){"NF-4"}$","",key) }
NR==FNR { file1[key] = $5; next }
(key in file1) && ($5 != file1[key]) {
print $0, file1[key]
}
$ awk -f tst.awk file1 file2
123,item3,grade5,wing10,lowcat,topcat

Resources