awk compare fields from two different files - shell

Here is the awk script I have tried for comparing fields from two different files:
awk 'NR == FNR {if (NF >= 4) a[$1] b[$4]; next} {for (i in a) for (j in b) if (i >= $2 && i <=$3 && j>=$2 && j<=$3 ) {print $1, $2, $3, i, j; next}}' file1 file2
Input files:
File1:
24926 17 206 25189 5.23674 5.71882 4.04165 14.99721 c
50760 17 48 50874 3.49903 4.25043 7.66602 15.41548 c
104318 15 269 104643 2.94218 5.18301 5.97225 14.09744 c
126088 17 70 126224 3.12993 5.32649 6.14936 14.60578 c
174113 16 136 174305 4.32339 2.36452 8.60971 15.29762 c
196474 14 89 196626 2.24367 5.16966 7.33723 14.75056 c
......
......
File2:
GT_004279 1 280
GT_003663 19891 20217
GT_003416 22299 23004
GT_003151 24916 25391
GT_001715 39470 39714
GT_001585 40896 41380
....
....
The output which I got is:
GT_004279 1 280 2465483 2639576
GT_003663 19891 20217 2005645 2005798
GT_003416 22299 23004 2291204 2269898
GT_003151 24916 25391 2501183 25189
GT_001715 39470 39714 3964440 3950417
......
......
The desired output is the lines where the 1st and 4th field values from file1 both lie between the 2nd and 3rd field values from file2. For example, with the input lines given above, the output should be:
GT_003151 24916 25391 24926 25189
If I guess correctly, the problem is within the if condition. Could someone help rectify this problem?
Thanks

You need to make composite keys and iterate through them. (Your script stores $1 and $4 in two separate arrays, so the pairing between them is lost and every combination of i and j gets tested.) When you create such a composite key, its parts are joined by the SUBSEP variable, so you just split on that and do the check.
awk '
NR==FNR { flds[$1,$4]; next }
{
    for (key in flds) {
        split(key, fld, SUBSEP)
        if ($2 <= fld[1] && $3 >= fld[2])
            print $0, fld[1], fld[2]
    }
}' file1 file2
GT_003151 24916 25391 24926 25189
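As a side note, SUBSEP defaults to the non-printing character "\034", chosen so it is unlikely to appear in real data. A minimal sketch of how a composite key round-trips through split:

```shell
awk 'BEGIN {
  k["x","y"]                  # composite key, stored as "x" SUBSEP "y"
  for (i in k) {
    n = split(i, p, SUBSEP)   # recover the two parts
    print n, p[1], p[2]
  }
}'
# prints: 2 x y
```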

Related

merge 2 files based on 1,2,3 column matching in 2 files & print specific columns from 2 files. also retain non-pairing rows from both files in output [duplicate]

file1
rs12345 G C
rs78901 A T
file2
3 22745180 rs12345 G C,G
12 67182999 rs78901 A G,T
desired output
3 22745180 rs12345 G C
12 67182999 rs78901 A T
I tried
awk 'NR==FNR {h[$1] = $3; next} {print $1,$2,$3,h[$2]}' file1 file2
output generated
3 22745180 rs12345
I want to print the first 4 columns of file2 and the 3rd column of file1 as the 5th column in the output.
You may use this awk:
awk 'FNR == NR {map[$1,$2] = $3; next} ($3,$4) in map {$NF = map[$3,$4]} 1' f1 f2 | column -t
3 22745180 rs12345 G C
12 67182999 rs78901 A T
A more readable version:
awk '
FNR == NR {
    map[$1,$2] = $3
    next
}
($3,$4) in map {
    $NF = map[$3,$4]
}
1' file1 file2 | column -t
Used column -t for tabular output only.
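The key construct is ($3,$4) in map: it tests membership of a composite key without creating it (a bare map[$3,$4] reference would silently create an empty entry). A minimal sketch:

```shell
echo 'G C' | awk '{
  m["G","C"] = "hit"      # store under a composite key
  if (($1,$2) in m)       # membership test; does not create the key
    print m[$1,$2]
}'
# prints: hit
```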
In the case you presented (rows in both files are matched line by line), this will work:
paste file1 file2 | awk '{print $4,$5,$6,$7,$3}'

Print the output if the first column of three file match

I'm having 3 files with data as below:
File1-
a 10
b 20
c 30
File2-
a 11
b 22
c 45
d 33
File3-
a 23
b 33
c 46
I need to print the output like below if the first column of the three files matches:
a 10 11 23
b 20 22 33
c 30 45 46
I tried the code below but am not getting the required output:
#!/bin/bash
awk 'FNR==NR{a[$1]=$2;next} {print $0,$1 in a?a[$1]:""}' File1 File2 File3
With your shown samples, could you please try the following. Written and tested with GNU awk.
awk '
{
    arr[$1]=(arr[$1]?arr[$1] OFS:"")$2
    count[$1]++
}
END{
    for(key in arr){
        if(count[key]==(ARGC-1)){
            print key,arr[key]
        }
    }
}
' Input_file1 Input_file2 Input_file3
NOTE: this is intended as a new answer, distinct from the answers in the dupe link shared under the comments of the question.
With shown samples output will be as follows.
a 10 11 23
b 20 22 33
c 30 45 46
Explanation: a detailed explanation of the above.
awk '                                  ##Starting awk program from here.
{
    arr[$1]=(arr[$1]?arr[$1] OFS:"")$2 ##Creating arr with index of first field; keep appending the 2nd field to its value.
    count[$1]++                        ##Creating count array with index of 1st field and keep increasing it.
}
END{                                   ##Starting END block of this program from here.
    for(key in arr){                   ##Traversing through all items in arr.
        if(count[key]==(ARGC-1)){      ##If count for this key equals ARGC-1 (the number of input files), print the item with its value.
            print key,arr[key]
        }
    }
}
' file1 file2 file3                    ##Mentioning Input_file names here.
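The count[key]==(ARGC-1) test relies on ARGC counting the program name plus every file argument, so with three input files ARGC is 4. A quick check (the exit in BEGIN means the named files are never opened, so they need not exist):

```shell
awk 'BEGIN { print ARGC; exit }' file1 file2 file3
# prints: 4
```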
Using join:
join <(join File1 File2) File3
join works with only two files at a time, so we feed the result of "join File1 File2" into a second join command that compares it to File3.
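A self-contained sketch with inline data (join needs its inputs sorted on the join field, which these samples already are; the process substitution assumes bash):

```shell
# inner join merges the first two files on column 1,
# the outer join adds the third file's values
join <(join <(printf 'a 10\nb 20\n') <(printf 'a 11\nb 22\n')) \
     <(printf 'a 23\nb 33\n')
# a 10 11 23
# b 20 22 33
```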

Can someone walk through this awk code for merging multiple files?

I'm using awk to merge multiple (>3) files, and I want to keep the headers. I found a previous post that does exactly what I need, but I don't quite understand what's happening. I was hoping someone could walk me through it so I can learn from it! (I tried commenting on the original post but did not have enough reputation)
This code
awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f*
transforms the input files as desired. See example tables below.
Input files:
file1.txt:
id value1
a 10
b 30
c 50
file2.txt:
id value2
a 90
b 30
c 20
file3.txt:
id value3
a 0
b 1
c 25
desired output
merge.txt:
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
Again, here's the code
awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f* > merge.txt
I'm having trouble understanding the first part of the code {a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}, but understand the loop in the second part of the code.
I think the first part of the code establishes an array. The code runs through and checks for matching records on the first column, id, and if there is a match it appends the second column's value ($2), then prints the entire record ($0).
But...I don't understand the beginning syntax. When is it established that the first column id is the same across all three files and to only add the second column?
That code is buggy and unnecessarily complicated, use this instead:
$ awk 'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
pipe the output to column -t for alignment if you like:
$ awk 'NR==FNR{a[NR]=$0;next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3 | column -t
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
If you NEED to key off ids (e.g. because they differ across the files) then it'd be:
$ awk '
BEGIN { OFS="\t" }
!($1 in a) { ids[++numIds]=$1 }
{ a[$1][ARGIND]=$2 }
END {
    for (i=1;i<=numIds;i++) {
        id = ids[i]
        printf "%s%s", id, OFS
        for (j=1;j<=ARGIND;j++) {
            printf "%s%s", a[id][j], (j<ARGIND ? OFS : ORS)
        }
    }
}
' file1 file2 file3 | column -s$'\t' -t
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 25
x 20
That last script used GNU awk for multi-dimensional arrays and just had c changed to x in input file2 to test it.
Feel free to ask if you have questions but I THINK that code is pretty clear.
First the data:
file1 file2 file3
NR FNR $1 $2 NR FNR $1 $2 NR FNR $1 $2
================ ================ ================
1 1 id value1 5 1 id value2 9 1 id value3
2 2 a 10 6 2 a 90 10 2 a 0
3 3 b 30 7 3 b 30 11 3 b 1
4 4 c 50 8 4 c 20 12 4 c 25
The first part: a[FNR]=( (a[FNR]) ? a[FNR]FS$2 : $0 ) could be written as:
if(a[FNR]=="")             # actually: if(a[FNR]=="" || a[FNR]==0)
    a[FNR]=$0              # a[FNR] is "id value1" when NR==1
else
    a[FNR]=a[FNR] FS $2    # a[FNR]="id value1" FS "value2" when NR==5
Each file has 4 records, i.e. FNR==4 on the last record of each file; importantly, FNR keeps the value it had in the last file when the END block runs:
END {                      # after hashing all records in all files
    for(i=1;i<=FNR;i++)    # i=1, 2, 3, 4
        print a[i]         # print "id value1 value2 value3" etc.
}
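The NR/FNR values in the table above are easy to reproduce; a minimal sketch with two inline input files (the process substitution assumes bash):

```shell
# NR counts records across all inputs, FNR resets per file
awk '{ print NR, FNR }' <(printf 'x\ny\n') <(printf 'z\n')
# 1 1
# 2 2
# 3 1
```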
James has explained pretty well the awk logic in his answer.
In case you're looking for an alternative here is a paste based solution:
paste file1 file2 file3 | awk '{print $1, $2, $4, $6}' OFS='\t'
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
FNR is the record number within the current input file, i.e. the line number in file1, file2, etc. See http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/
The ? is the ternary operator and is saying, if there's already something in a[FNR] then append $2 of the current record to what's there, else it's null so store the whole record (i.e. $0).
Pseudo code that may help explain things:
if (a[FNR] != "")
    a[FNR] = a[FNR] FS $2
else
    a[FNR] = $0
You can see that the a, b, c from every record after the first file are dropped; they could be x, y, z and this program wouldn't care. It takes the second field and appends it to a[2], a[3], etc.
You can use awk with pr to do this:
$ pr -mts$'\t' f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3)
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
(Those are tabs in between the columns)
Or use paste the same way:
$ paste f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3)
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25

How to parse through a csv file by awk?

I have a CSV file with, say, 20 headers, and the corresponding values for those headers are in the next row for a particular record.
Example : Source file
Age,Name,Salary
25,Anand,32000
I want my output file to be in this format.
Example : Output file
Age
25
Name
Anand
Salary
32000
Which awk/grep/sed command should be used to do this?
I'd say
awk -F, 'NR == 1 { split($0, headers); next } { for(i = 1; i <= NF; ++i) { print headers[i]; print $i } }' filename
That is
NR == 1 {                      # in the first line
    split($0, headers)         # remember the headers
    next                       # do nothing else
}
{                              # after that:
    for(i = 1; i <= NF; ++i) { # for all fields:
        print headers[i]       # print the corresponding header
        print $i               # followed by the field
    }
}
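Note that the two-argument form split($0, headers) splits on the current FS, so with -F, each CSV column becomes one array element. A quick sketch:

```shell
echo 'Age,Name,Salary' | awk -F, '{
  n = split($0, h)    # splits on FS (a comma here)
  print n, h[2]
}'
# prints: 3 Name
```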
Addendum: Obligatory, crazy sed solution (not recommended for production use; written for fun, not profit):
sed 's/$/,/; 1 { h; d; }; G; :a s/\([^,]*\),\([^\n]*\n\)\([^,]*\),\(.*\)/\2\4\n\3\n\1/; ta; s/^\n\n//' filename
That works as follows:
s/$/,/       # Add a comma to all lines for more convenient processing
1 { h; d; }  # first line: just put it in the hold buffer
G            # all other lines: append the hold buffer (header fields) to
             # the pattern space
:a           # jump label for looping
             # isolate the first fields from the data and header lines,
             # move them to the end of the pattern space
s/\([^,]*\),\([^\n]*\n\)\([^,]*\),\(.*\)/\2\4\n\3\n\1/
ta           # do this until we got them all
s/^\n\n//    # then remove the two newlines that are left as an artifact
             # of the algorithm
Here is one awk solution:
awk -F, 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} {for (i=1;i<=NF;i++) print a[i] RS $i}' file
Age
25
Name
Anand
Salary
32000
The first for loop stores the headers in array a.
The second for loop prints each header from array a with its corresponding data.
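In print a[i] RS $i, the stored header and the field are concatenated with RS, which is a newline by default, so each pair comes out on two lines. A self-contained run:

```shell
printf 'Age,Name\n25,Anand\n' |
awk -F, 'NR==1 { for (i=1;i<=NF;i++) a[i]=$i; next }
         { for (i=1;i<=NF;i++) print a[i] RS $i }'
# Age
# 25
# Name
# Anand
```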
Using GNU awk 4.* for 2D arrays:
$ awk -F, '{a[NR][1];split($0,a[NR])} END{for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) print a[j][i]}' file
Age
25
Name
Anand
Salary
32000
In general to transpose rows and columns:
$ cat file
11 12 13
21 22 23
31 32 33
41 42 43
with GNU awk:
$ awk '{a[NR][1];split($0,a[NR])} END{for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) printf "%s%s", a[j][i], (j<NR?OFS:ORS)}' file
11 21 31 41
12 22 32 42
13 23 33 43
or with any awk:
$ awk '{for (i=1;i<=NF;i++) a[NR,i]=$i} END{for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) printf "%s%s", a[j,i], (j<NR?OFS:ORS)}' file
11 21 31 41
12 22 32 42
13 23 33 43
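For reference, the portable transpose uses a[NR,i] composite keys, which work in any POSIX awk (gawk's a[NR][i] true arrays of arrays are a GNU extension). Applied to a 2x2 input:

```shell
printf '11 12\n21 22\n' |
awk '{ for (i=1;i<=NF;i++) a[NR,i]=$i }   # store cell (row,col)
     END { for (i=1;i<=NF;i++)
             for (j=1;j<=NR;j++)
               printf "%s%s", a[j,i], (j<NR ? OFS : ORS) }'
# 11 21
# 12 22
```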

Display only lines in which 1 column is equal to another, and a second column is in a range in AWK and Bash

I have two files. The first file looks like this:
1 174392
1 230402
2 4933400
3 39322
4 42390021
5 80022392
6 3818110
and so on
the second file looks like this:
chr1 23987 137011
chr1 220320 439292
chr2 220320 439292
chr2 2389328 3293292
chr3 392329 398191
chr4 421212 3292393
and so on.
I want to return the whole line from FILE2, provided that the first column in FILE1 matches the first column in FILE2 as a string, AND the 2nd column in FILE1 is greater than column 2 in FILE2 but less than column 3 in FILE2.
So in the above example, the line
1 230402
in FILE1 and
chr1 220320 439292
in FILE2 would satisfy the conditions because 230402 is between 220320 and 439292 and 1 would be equal to chr1 after I make the strings match, therefore that line in FILE2 would be printed.
The code I wrote was this:
#!/bin/bash
F1="FILE1.txt"
while read COL1 COL2
do
    grep -w "chr$COL1" FILE2.tsv \
    | awk -v C2=$COL2 '{if (C2>$1 && C2<$2); print $0}'
done < "$F1"
I have tried many variations of this. I do not care if the code is entirely in awk, entirely in bash, or a mixture.
Can anyone help?
Thank you!
Here is one way using awk:
awk '
NR==FNR {
    $1 = "chr" $1
    seq[$1,$2]++
    next
}
{
    for(key in seq) {
        split(key, tmp, SUBSEP)
        if(tmp[1] == $1 && $2 <= tmp[2] && tmp[2] <= $3) {
            print $0
        }
    }
}' file1 file2
chr1 220320 439292
We read the first file into an array keyed on columns 1 and 2, prefixing "chr" to column 1 for easy comparison later on.
When we process file 2, we iterate over our array and split each key.
We compare the first piece of the key to column 1 and check whether the second piece is in the range of the second and third columns.
If the condition is satisfied, we print the line.
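The same logic can be exercised end to end with inline data (the process substitution assumes bash) to confirm that only the matching interval prints:

```shell
awk 'NR==FNR { $1 = "chr" $1; seq[$1,$2]++; next }
     { for (key in seq) {
         split(key, t, SUBSEP)
         # same chromosome, and the position falls in [start, end]
         if (t[1] == $1 && $2 <= t[2] && t[2] <= $3) print
       }
     }' <(printf '1 230402\n') <(printf 'chr1 220320 439292\nchr2 100 200\n')
# chr1 220320 439292
```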
awk 'BEGIN {i = 0}
FNR == NR { chr[i] = "chr" $1; test[i++] = $2 }
FNR < NR { for (c in chr) {
if ($1 == chr[c] && test[c] > $2 && test[c] < $3) { print }
}
}' FILE1.txt FILE2.tsv
FNR is the line number within the current file, NR is the line number within all the input. So the first block processes the first file, collecting all the lines into arrays. The second block processes any remaining files, searching through the array of chrN values looking for a match, and comparing the other two numbers to the number from the first file.
Thanks very much!
These answers work and are very helpful.
Also, at long last I realized I should have had:
awk -v C2=$COL2 '{if (C2>$2 && C2<$3) print $0}'
with the braces in a different place (and comparing against fields 2 and 3 of the grep output), and I would have been fine.
At any rate, thank you very much!
