How does a loop work in awk, and how do we get matched data from two files? - shell

I am trying to extract data from two files with a common column but I am unable to fetch the required data.
File1
A B C D E F G
Dec 3 abc 10 2B 21 OK
Dec 1 %xyZ 09 3F 09 NOK
Dec 5 mnp 89 R5 11 OK
File2
H I J K
abc 10 6.3 A9
xyz 00 0.2 2F
pqr 45 6.9 3c
I am able to print columns A B C D E F G, but I am unable to insert the File2 columns between the File1 columns.
My attempt:
awk 'FNR==1{next}
NR==FNR{a[$1]=$2; next}
{k=$3; sub(/^\%/,"",k)} k in a{print $1,$2,$3,$a[2,3,4],$4,$5,$6,$7; delete a[k]}
END{for(k in a) print k,a[k] > "unmatched"}' File2 File1 > matched
Required output:
matched:
A B I C J K D E F G
Dec 3 10 abc 6.3 A9 10 2B 21 OK
Dec 1 00 %xyZ 0.2 2F 09 3F 09 NOK
unmatched :
H I J K
pqr 45 6.9 3c
Could you please help me get this output? Thank you.

awk '
FNR == 1 { next }
FNR == NR {
    As[$3] = $0
    S3 = $3
    gsub(/%/, "", S3)
    ALs[tolower(S3)] = $3
    next
}
{
    Bs[tolower($1)] = $0
}
END {
    print "matched:"
    print "A B I C J K D E F G"
    for (B in Bs) {
        if (B in ALs) {
            split(As[ALs[B]] " " Bs[B], Fs)
            printf("%s %s %s %s %s %s %s %s %s %s\n", Fs[1], Fs[2], Fs[9], Fs[3], Fs[10], Fs[11], Fs[4], Fs[5], Fs[6], Fs[7])
        }
    }
    print "unmatched :"
    print "H I J K"
    for (B in Bs) if (!(B in ALs)) print Bs[B]
}
' File1 File2
I added a constraint not stated in the question: the reference column is matched case-insensitively (%xyZ vs xyz).
Both files are kept in memory (arrays) so they can be processed in the END block. The matching could instead be done while reading; I keep all output in END for clarity.
Your problem:
you mostly reference the wrong file in your code (k=$3 is used with a field reference that belongs to the other file, and $a[2,3,4] does not retrieve several stored columns at once, ...)
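As the note above says, the matching could be done while reading rather than in END. Here is a sketch of that single-pass variant; it recreates the sample files first so the snippet runs as-is:

```shell
# Recreate the sample inputs (headers included) so the sketch is runnable
cat > File1 <<'EOF'
A B C D E F G
Dec 3 abc 10 2B 21 OK
Dec 1 %xyZ 09 3F 09 NOK
Dec 5 mnp 89 R5 11 OK
EOF
cat > File2 <<'EOF'
H I J K
abc 10 6.3 A9
xyz 00 0.2 2F
pqr 45 6.9 3c
EOF

awk '
BEGIN { print "A B I C J K D E F G" }      # header for the matched report
FNR == 1 { next }                          # skip each input header line
NR == FNR { b[tolower($1)] = $0; next }    # File2: index rows case-insensitively on H
{
    k = $3
    sub(/^%/, "", k)                       # normalize the File1 reference: strip %,
    k = tolower(k)                         # then lowercase before matching
    if (k in b) {
        split(b[k], f)                     # f[1..4] = H I J K
        print $1, $2, f[2], $3, f[3], f[4], $4, $5, $6, $7
        seen[k]++
    }
}
END {
    print "H I J K" > "unmatched"
    for (k in b) if (!(k in seen)) print b[k] > "unmatched"
}' File2 File1 > matched

cat matched unmatched
```

This prints matched rows as soon as the second file is read, and only the unmatched File2 rows need to wait until END.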


Awk flag to remove unwanted data

Another awk question.
I have a large text file that is separated by numerical values
43 47
abc
efg
hig
21 122
hijk
lmnop
39 41
somemore
texthere
What I would like to do is print the text only if a condition is satisfied.
Here's what I have tried, with no luck:
awk '{a=$1; b=$2; if (a < 43 && a > 37 && b < 52 && b > 41) {f=1} elif (a > 43 && a < 37 && b > 52 && b < 41) {print; f=0} } f' file
I'd like to print all of the text if the statement is satisfied, and skip the text if the statement isn't satisfied.
Desired output from above:
43 47
abc
efg
hig
39 41
somemore
texthere
awk '
# on a line with 2 numbers:
NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
    # set a flag if the numbers fall in the given ranges
    f = (37 <= $1 && $1 <= 43 && 41 <= $2 && $2 <= 52)
}
f
' file
Self-explaining solution:
awk '
function inrange(x, a, b) { return a <= x && x <= b }
/^[0-9]+[\t ]+[0-9]/ {
    f = inrange($1, 37, 43) && inrange($2, 41, 52)
}
f
' file
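To watch the flag pattern at work, here is a runnable copy against a trimmed version of the sample data (the filename blocks.txt is made up for illustration):

```shell
cat > blocks.txt <<'EOF'
43 47
abc
21 122
hijk
39 41
somemore
EOF

# each numeric header line sets f; f alone then prints every line
# (header included) until the next header flips it off
awk '
function inrange(x, a, b) { return a <= x && x <= b }
/^[0-9]+[\t ]+[0-9]/ { f = inrange($1, 37, 43) && inrange($2, 41, 52) }
f
' blocks.txt
```

Only the `43 47` and `39 41` blocks, with their following text, come out; the `21 122` block is skipped.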

Awk - Control when my $# variables are expanded to merge two files with variable number of columns

My bash script calls an awk script that nicely merges two files:
mapfieldfile1=1
mapfieldfile2=2
awk -v FS="\t" 'BEGIN {OFS="\t"}
FNR==NR{hash1['"\$${mapfieldfile2}"']=$1 FS $3 FS $4 FS $5 FS $6;next}
('"\$${mapfieldfile1}"' in hash1){ print $0, hash1['"\$${mapfieldfile1}"']}' file2 file1
However, I want a more general version, where I don't have to hardcode the columns to print; I simply want to print everything except my id column. Replacing $1 FS $3 FS $4 FS $5 FS $6 with $0 "almost" does the work, except that it repeats the id column. I have been trying to dynamically build a string similar to $1 FS $3 FS $4 FS $5 FS $6, but I am getting the literal strings $1 $3 $4 $5 $6 in the merged file instead of their expanded values. There are also smaller side effects: I am adding a tab in the middle and losing some headers. Below are the code and example files.
I would like to find the solution to my merge and also understand what I am doing wrong and why my variables are not expanding.
I appreciate any help!
mapfieldfile1=1
mapfieldfile2=2
awk -v FS="\t" 'BEGIN {OFS="\t";strfields=""}
FNR==NR{for(i=1;i<=NF;i++) if(i!='"${mapfieldfile2}"') {strfields=strfields" "FS" $"i};
hash1['"\$${mapfieldfile2}"']=strfields;strfields="";next}
('"\$${mapfieldfile1}"' in hash1){print $0, hash1['"\$${mapfieldfile1}"']}' file2 file1
$cat file1
sampleid s1 s2 s3 s4
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
$cat file2
a0 sampleid a1 a2 a3 a4
a0 1 a a a a4
a0 2 b b b a4
a0 3 c c c a4
a0 5 e e e a4
$cat first_code_result.txt (good one!)
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
$cat second_code_result.txt
sampleid s1 s2 s3 s4 $1 $3 $4 $5 $6
1 1 1 1 1 $1 $3 $4 $5 $6
2 2 2 2 2 $1 $3 $4 $5 $6
3 3 3 3 3 $1 $3 $4 $5 $6
Try this (untested):
awk -v mf1="$mapfieldfile1" -v mf2="$mapfieldfile2" '
BEGIN {FS=OFS="\t"}
FNR==NR{sub(/\t[^\t]+/,""); hash1[$mf2]=$0; next}
($mf1 in hash1){ print $0, hash1[$mf1]}
' file2 file1
Don't let shell variables expand within awk scripts; use a regexp to remove fields from the record. I don't know why the script you haven't shown us is printing a literal $3, etc., but you must be including the field references inside a string. You'd have to post that script for help debugging it.
Check where mf1 vs mf2 should appear; I got confused reading your scripts.
EDIT - I had to tweak it as above I was deleting $2 before using it:
$ awk -v mf1="1" -v mf2="2" '
BEGIN {FS=OFS="\t"}
FNR==NR{key=$mf2; sub(/\t[^\t]+/,""); hash1[key]=$0; next}
($mf1 in hash1){ print $0, hash1[$mf1]}
' file2 file1
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
Note that the sub() above relies on the key field being $2 and FS being a tab. If you need a more general solution let us know.
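The "don't let shell variables expand within awk scripts" advice above is easiest to see side by side. A minimal sketch (the variable name key is made up for illustration): splicing a shell value into the awk source text only works while the quoting stays exactly right, whereas -v hands the value to awk as a real awk variable.

```shell
key=2

# Fragile: the shell pastes "2" into the awk program text; one misplaced
# quote and awk receives the literal characters "$key" instead.
printf 'a b c\n' | awk '{ print $'"$key"' }'

# Robust: pass the value in with -v and let awk expand $k itself.
printf 'a b c\n' | awk -v k="$key" '{ print $k }'
```

Both commands print `b`, but only the second survives being moved into scripts, quoted strings, or remote invocations unchanged.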
Here's a version that'll do what you want for any key field values and will work in any awk, it just requires the FS to be a tab or some other fixed string (i.e. not a regexp):
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    key = $mf2
    val = ""
    nf = 0
    for (i=1; i<=NF; i++) {
        if (i != mf2) {
            val = (nf++ ? val FS : "") $i
        }
    }
    hash1[key] = val
    next
}
$mf1 in hash1 { print $0, hash1[$mf1] }
$ awk -v mf1="1" -v mf2="2" -f tst.awk file2 file1
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
if your files are sorted already, the default output of join is what you want
$ join -t$'\t' -11 -22 file1 file2
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
or, after prettying with column
$ join -t$'\t' -11 -22 file1 file2 | column -t
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
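If the inputs are not already sorted on the join fields, join can silently drop or misorder lines, so sort them first. A minimal sketch with made-up two-column files f1 and f2 (headers omitted; a real run on the sample data would need the header lines handled separately):

```shell
tab=$(printf '\t')
printf '2\ts2\n1\ts1\n' > f1          # unsorted; key in column 1
printf 'a2\t2\na1\t1\n' > f2          # unsorted; key in column 2

sort -t"$tab" -k1,1 f1 > f1.sorted    # sort each file on its join field
sort -t"$tab" -k2,2 f2 > f2.sorted
join -t"$tab" -1 1 -2 2 f1.sorted f2.sorted
```

join emits the join field first, then the remaining fields of the first file, then those of the second, so the output here is `1<TAB>s1<TAB>a1` followed by `2<TAB>s2<TAB>a2`.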

awk Count number of occurrences

I made this awk command in a shell script to count the total occurrences of each $4/$5 pair.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat tc.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is #### (number) in shell. But I want to get rid of > ag.txt && cat ag.txt | wc -l and instead get output in shell like AG = ####.
This is input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this, in the shell or in a file, counting only these single pair occurrences and not other patterns:
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you probably don't need -F" " at all: a single-space FS is awk's default and already means "split on runs of whitespace", so the option is redundant here.
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line and increments a counter in an array (a[]) whose key is built from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
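Since the END loop emits pairs in no particular order (as noted above), piping through sort is the usual way to get a stable report. A small runnable sketch with trimmed, made-up sample data (the filename pairs.txt is for illustration):

```shell
printf '>seq1 284 284 A G x\n>seq1 266 266 C T x\n>seq1 194 194 A G x\n' > pairs.txt

# count the pairs, then order by count descending, breaking ties alphabetically
awk '{a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' pairs.txt |
sort -k2,2nr -k1,1
```

This prints `AG 2` first, then `CT 1`, regardless of awk's internal array order.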
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
    v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
    a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
This only counts combinations that already have array indices, which BEGIN creates with empty (null) values.
The parentheses in the increment condition are not required, and are included only for clarity.
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1

how to use awk to merge files with common fields and print in another file

I have read all the related questions, but I am still quite confused...
I have two files tab separated.
file1 (breaks added for readability):
a 15 bac
g 10 bac
h 11 bac
r 33 arq
t 12 euk
file2 (breaks added for readability):
0 15 h 3 5 2 gf a a g e g s s g g
p 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
g 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Output desired (breaks added for readability):
bac 15 h 3 5 2 gf a a g e g s s g g
arq 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
ND g 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Just that. I need to print the complete file2, but replacing the first column with the third column of file1 whenever $2 of file2 is the same as $2 of file1...
file1 is larger than file2, but it could still happen that $2 from file2 is not present in file1; in that case, print ND in the first column.
I'm sure it must be simple, but I have problems with awk managing two files. Please, if someone could help me...
Using this awk command:
awk 'FNR==NR{a[$2]=$3;next} {$1=(a[$2])?a[$2]:"ND"} 1' file1 file2
bac 15 h 3 5 2 gf a a g e g s s g g
arq 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
ND 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Explanation:
FNR==NR - Execute this block for first file in input i.e. file1
a[$2]=$3 - Populate an associative array a with key as $2 and value as $3 from file1
next - Read next line until EOF on first file
Now operating in file2
$1=(a[$2])?a[$2]:"ND" - Overwrite $1 with a[$2] if $2 is found in array a, otherwise by literal string "ND"
1 - print the output
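One subtlety in that condition: (a[$2]) tests the truthiness of the stored value, so a stored value of "0" or an empty string would wrongly fall through to "ND", and the lookup itself quietly creates empty entries in a. Testing membership with ($2 in a) avoids both. A sketch with made-up files t1/t2 whose mapped value is the string "0":

```shell
printf 'x 15 0\n' > t1     # third column (the mapped value) is "0", which is falsy
printf 'q 15 r\n' > t2

# truthiness test: prints "ND 15 r" even though key 15 exists in file1
awk 'FNR==NR {a[$2]=$3; next} {$1 = (a[$2]) ? a[$2] : "ND"} 1' t1 t2

# membership test: correctly prints "0 15 r"
awk 'FNR==NR {a[$2]=$3; next} {$1 = ($2 in a) ? a[$2] : "ND"} 1' t1 t2
```

For the sample data in the question both forms behave the same, but the `in` form is the safer habit.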
You could try a join + awk command as below:
join -t ' ' -a2 -1 2 -2 2 test1.txt test2.txt |
awk 'BEGIN { start = 5; end = 18 }
     { if (NF == 16) { temp = $1; $1 = "ND " $2; $2 = temp; print }
       else { printf("%s %s ", $3, $1); for (i=start; i<=end; i++) printf("%s ", $i); printf("\n") } }'

AWK -- How to do selective multiple column sorting?

In awk, how can I do this:
Input:
1 a f 1 12 v
2 b g 2 10 w
3 c h 3 19 x
4 d i 4 15 y
5 e j 5 11 z
Desired output, by sorting numerical value at $5:
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
Note that the sorting should only affect $4, $5, and $6 (based on the value of $5); the earlier part of the table remains intact.
This could be done in multiple steps with the help of paste:
$ gawk '{print $1, $2, $3}' in.txt > a.txt
$ gawk '{print $4, $5, $6}' in.txt | sort -k 2 -n > b.txt
$ paste -d' ' a.txt b.txt
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
Personally, I find using awk to safely sort arrays of columns rather tricky because often you will need to hold and sort on duplicate keys. If you need to selectively sort a group of columns, I would call paste for some assistance:
paste -d ' ' <(awk '{ print $1, $2, $3 }' file.txt) <(awk '{ print $4, $5, $6 | "sort -k 2" }' file.txt)
Results:
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
This can be done in pure awk, but as @steve said, it's not ideal. gawk has limited sort functions, and awk has no built-in sort at all. That said, here's a (rather hackish) solution using a compare function in gawk:
[ghoti@pc ~/tmp3]$ cat text
1 a f 1 12 v
2 b g 2 10 w
3 c h 3 19 x
4 d i 4 15 y
5 e j 5 11 z
[ghoti@pc ~/tmp3]$ cat doit.gawk
### Function to be called by asort().
function cmp(i1,v1,i2,v2) {
    split(v1,a1); split(v2,a2);
    if      (a1[2]>a2[2]) { return 1; }
    else if (a1[2]<a2[2]) { return -1; }
    else                  { return 0; }
}
### Left-hand side and right-hand side are sorted differently.
{
    lhs[NR]=sprintf("%s %s %s",$1,$2,$3);
    rhs[NR]=sprintf("%s %s %s",$4,$5,$6);
}
END {
    asort(rhs,sorted,"cmp");   ### This calls the function we defined, above.
    for (i=1;i<=NR;i++) {      ### Step through the arrays and reassemble.
        printf("%s %s\n",lhs[i],sorted[i]);
    }
}
[ghoti@pc ~/tmp3]$ gawk -f doit.gawk text
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
[ghoti@pc ~/tmp3]$
This keeps your entire input file in arrays, so that lines can be reassembled after the sort. If your input is millions of lines, this may be problematic.
Note that you might want to play with the printf and sprintf functions to set appropriate output field separators.
You can find documentation on using asort() with functions in the gawk man page; look for PROCINFO["sorted_in"].
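Alongside asort(), gawk 4.0+ lets you fix the traversal order of a plain for (i in a) loop by setting PROCINFO["sorted_in"] to a predefined ordering or the name of a comparison function. A minimal sketch (gawk only, with made-up data):

```shell
gawk 'BEGIN {
    a[3] = "c"; a[1] = "a"; a[2] = "b"
    PROCINFO["sorted_in"] = "@ind_num_asc"   # traverse by numeric index, ascending
    for (i in a) printf "%s", a[i]
    print ""
}'
```

This prints `abc`; without the PROCINFO setting, the traversal order would be unspecified.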
