Bash/Awk Compare two files, print value when it is between coordinates else print 0 - bash

I have two files. If the column "chromosome" matches between the two files and the position of File1 is between the Start_position and the End_position of File2, I would like to associate the two cell_frac values. If the Gene (chromosome + position) is not present in File2, I would like both cell_frac values to be equal to 0.
File1:
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene2 1 222222
Gene3 2 333333
Gene4 2 333337
File2:
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
3 444440 444450 0.01 0.01
Desired output:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Edit: Here is the beginning of the code I used for now (not correct output):
awk '
NR==FNR{ range[$1,$2,$3,$4,$5]; next }
FNR==1
{
for(x in range) {
split(x, check, SUBSEP);
if($2==check[1] && $3>=check[2] && $3<=check[3]) { print $1"\t"$2"\t"$3"\t"check[4]"\t"check[5]}
}
}
' File2 File1
However, I did not manage to associate a 0 (with "else") when the gene was not present, so I get the wrong number of lines. Can you give me more hints?
Thanks a lot.

One awk-only idea ...
NOTE: see my other answer for assumptions/understandings and my version of file1
awk ' # process file2
FNR==NR { c=$1 # save chromosome value
$1="" # clear field #1
file2[c]=$0 # use chromosome as array index; save line in array
next
}
# process file1
{ startpos=endpos=-9999999999 # default values in case
a1=a2="" # no match in file2
if ($2 in file2) { # if chromosome also in file2
split(file2[$2],arr) # split file2 data into array arr[]
startpos =arr[1]
endpos =arr[2]
a1 =arr[3]
a2 =arr[4]
}
# if not the header row and file1/position outside of file2/range then set a1=a2=0
if (FNR>1 && ($3 < startpos || $3 > endpos)) a1=a2=0
print $0,a1,a2
}
' file2 file1
This generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene3 2 333333 0.03 0.25
Gene2 1 222222 0.12 0.01
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
Changing the last line to ' file2 file1 | column -t generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene3 2 333333 0.03 0.25
Gene2 1 222222 0.12 0.01
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
Presorting file1 by Chromosome and Position by changing last line to ' file2 <(head -1 file1; tail -n +2 file1 | sort -k2,3) | column -t generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
One big issue (same as with my other answer) ... the actual code may become unwieldy when dealing with 519 total columns, especially if there's a need to intersperse a lot of columns; otherwise OP may be able to use some for loops to more easily print ranges of columns.
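As a self-contained sanity check, the approach above can be run end to end. The scratch directory and paths below are made up; note the defaults are deliberately assigned to startpos/endpos so that a chromosome missing from file2 (or stale values left over from a previous line) cannot slip through:

```shell
# Sketch of the lookup above; mktemp dir and file contents are illustrative only.
dir=$(mktemp -d)
cat >"$dir/file1" <<'EOF'
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene3 2 333333
Gene2 1 222222
Gene4 2 333337
Gene5 4 444567
EOF
cat >"$dir/file2" <<'EOF'
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
3 444440 444450 0.01 0.01
EOF

out=$(awk '
FNR==NR { c=$1; $1=""; file2[c]=$0; next }     # index file2 lines by chromosome
{ startpos=endpos=-9999999999; a1=a2=""        # reset per line: no match yet
  if ($2 in file2) {
      split(file2[$2], arr)
      startpos=arr[1]; endpos=arr[2]; a1=arr[3]; a2=arr[4]
  }
  if (FNR>1 && ($3 < startpos || $3 > endpos)) a1=a2=0
  print $0, a1, a2
}' "$dir/file2" "$dir/file1")

printf '%s\n' "$out"
rm -r "$dir"
```

Pipe the result through column -t if aligned output is wanted.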

A job for sql instead of awk, perhaps?
tr -s ' ' '|' <File1 >file1.csv
tr -s ' ' '|' <File2 >file2.csv
(
echo 'Hugo_Symbol|Chromosome|Position|cell_frac_A1|cell_frac_A2'
sqlite3 <<'EOD'
.import file1.csv t1
.import file2.csv t2
select distinct
t1.hugo_symbol,
t1.chromosome,
t1.position,
case
when t1.position between t2.start_position and t2.end_position
then t2.cell_frac_a1
else 0
end,
case
when t1.position between t2.start_position and t2.end_position
then t2.cell_frac_a2
else 0
end
from t1 join t2 on t1.chromosome=t2.chromosome;
EOD
rm file[12].csv
) | tr '|' '\t'

Assumptions and/or understandings based on sample data and OP comments ...
file1 is not sorted by chromosome
file2 is sorted by chromosome
common headers in both files are spelled the same (a mismatch such as file1:Chromosome vs file2:Chromosom would break the matching)
if a chromosome exists in file1 but does not exist in file2 then we keep the line from file1 and the columns from file2 are left blank
both files are relatively small (file1: 5MB, 900 lines; file2: few KB, 50 lines)
NOTE: the number of columns (file1: 500 columns; file2: 20 columns) could be problematic from the point of view of cumbersome coding ... more on that later ...
Sample inputs:
$ cat file1 # scrambled chromosome order; added chromosome=4 line
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene3 2 333333
Gene2 1 222222
Gene4 2 333337
Gene5 4 444567 # has no match in file2
$ cat file2
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
3 444440 444450 0.01 0.01
First issue is to sort file1 by Chromosome and Position and also keep the header line in place:
$ (head -1 file1; tail -n +2 file1 | sort -k2,3)
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene2 1 222222
Gene3 2 333333
Gene4 2 333337
Gene5 4 444567
We can now join the 2 files based on the Chromosome column:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2
Chromosome Hugo_Symbol Position Start_Position End_Position cell_frac_A1 cell_frac_A2
1 Gene1 111111 222220 222230 0.12 0.01
1 Gene2 222222 222220 222230 0.12 0.01
2 Gene3 333333 333330 333340 0.03 0.25
2 Gene4 333337 333330 333340 0.03 0.25
4 Gene5 444567
Where:
-1 2 -2 1 - join on Chromosome columns: -1 2 == file #1 column #2; -2 1 == file #2 column #1
-a 1 - keep columns from file #1 (sorted file1)
--nocheck-order - disable verifying input is sorted by join column; optional; may be needed if a locale thinks 1 should be sorted before Chromosome
NOTE: for the sample inputs/outputs we don't need a special output format so we can skip the -o option, but this may be needed depending on OP's output requirements for 519 total columns (but it may also become unwieldy)
From here OP can use bash or awk to do comparisons (is column #3 between columns #4/#5); one awk idea:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2 | awk 'FNR>1{if ($3<$4 || $3>$5) $6=$7=0} {print $2,$1,$3,$6,$7}'
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0 # Position outside of range
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0 # no match in file2; if there were other columns from file2 they would be empty
And to match OP's sample output (appears to be a fixed width requirement) we can pass this to column:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2 | awk 'FNR>1{if ($3<$4 || $3>$5) $6=$7=0} {print $2,$1,$3,$6,$7}' | column -t
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
NOTE: Keep in mind this may be untenable with OP's 519 total columns, especially if interspersed columns contain blanks/white-space (ie, column -t may not parse the input properly)
Issues (in addition to incorrect assumptions and previous NOTES):
for relatively small files the performance of the join | awk | column should be sufficient
for larger files all of this code can be rolled into a single awk solution though memory usage could be an issue on a small machine (eg, one awk idea would be to load file2 into memory via arrays so memory would need to be large enough to hold all of file2 ... probably not an issue unless file2 gets to be 100's/1000's of MBytes in size)
for 519 total columns the awk/print will get unwieldy, especially if there's a need to move/intersperse a lot of columns

Related

How to merge two tab-separated files and predefine formatting of missing values?

I am trying to merge two unsorted tab separated files by a column of partially overlapping identifiers (gene#) with the option of predefining missing values and keeping the order of the first table.
When using paste on my two example tables missing values end up as empty space.
cat file1
c3 100 300 gene4
c1 300 400 gene1
c13 600 700 gene2
cat file2
gene1 4.2 0.001
gene4 1.05 0.5
paste file1 file2
c3 100 300 gene4 gene1 4.2 0.001
c1 300 400 gene1 gene4 1.05 0.5
c13 600 700 gene2
As you see the result not surprisingly shows empty spaces in non matched lines. Is there a way to keep the order of file1 and fill lines like the third as follows:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 NA 1 1
I assume one way could be to build an awk conditional construct. It would be great if you could point me in the right direction.
With awk please try the following:
awk 'FNR==NR {a[$1]=$1; b[$1]=$2; c[$1]=$3; next}
{if (!a[$4]) {a[$4]="N/A"; b[$4]=1; c[$4]=1}
printf "%s %s %s %s\n", $0, a[$4], b[$4], c[$4]}
' file2 file1
which yields:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 N/A 1 1
[Explanations]
In the 1st line, FNR==NR { command; next} is an idiom to execute the command only when reading the 1st file in the argument list ("file2" in this case). Then it creates maps (aka associative arrays) to associate values in "file2" to genes
as:
gene1 => gene1 (with array a)
gene1 => 4.2 (with array b)
gene1 => 0.001 (with array c)
gene4 => gene4 (with array a)
gene4 => 1.05 (with array b)
gene4 => 0.5 (with array c)
It is not necessary that "file2" is sorted.
The following lines are executed only when reading the 2nd file ("file1") because these lines are skipped when reading the 1st file due to the next statement.
The line {if (!a[$4]) .. is a fallback to assign variables to default values when the associative array a[gene] is undefined (meaning the gene is not found in "file2").
The final line prints the contents of "file1" followed by the associated values via the gene.
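A minimal, self-contained illustration of the idiom (the file names and values here are invented, not taken from the question):

```shell
# FNR==NR is true only while reading the first file: build the map there,
# then do lookups while reading the second file.
dir=$(mktemp -d)
printf 'gene1 4.2\ngene4 1.05\n' > "$dir/lookup"
printf 'x gene4\ny gene9\n'      > "$dir/data"

out=$(awk 'FNR==NR { val[$1]=$2; next }
           { print $0, (($2 in val) ? val[$2] : "N/A") }' \
      "$dir/lookup" "$dir/data")
printf '%s\n' "$out"
rm -r "$dir"
```

Testing membership with `$2 in val` sidesteps a subtlety of `!a[$4]`: merely referencing `a[$4]` creates an empty entry for that key, which is harmless here but can surprise you elsewhere.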
You can use join:
join -e NA -o '1.1 1.2 1.3 1.4 1.5 2.1 2.2 2.3' -a 1 -1 5 -2 1 <(nl -w1 -s ' ' file1 | sort -k 5) <(sort -k 1 file2) | sed 's/NA\sNA$/1 1/' | sort -n | cut -d ' ' -f 2-
-e NA — replace all missing values with NA
-o ... — output format (field is specified using <file>.<field>)
-a 1 — Keep every line from the left file
-1 5, -2 1 — Fields used to join the files
file1, file2 — The files
nl -w1 -s ' ' file1 — file1 with numbered lines
<(sort -k X fileN) — File N ready to be joined on column X
s/NA\sNA$/1 1/ — Replace every NA NA on end of line with 1 1
| sort -n | cut -d ' ' -f 2- — sort numerically and remove the first column
The example above uses spaces on output. To use tabs, append | tr ' ' '\t':
join -e NA -o '1.1 1.2 1.3 1.4 2.1 2.2 2.3' -a 1 -1 4 -2 1 file1 file2 | sed 's/NA\sNA$/1 1/' | tr ' ' '\t'
The broken lines have a TAB as the last character. Fix this with
paste file1 file2 | sed 's/\t$/\tNA\t1\t1/g'
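A runnable sketch of that fix; the files below are hypothetical tab-separated reconstructions of the question's data, and the `\t` escapes in the sed expression assume GNU sed:

```shell
dir=$(mktemp -d)
printf 'c3\t100\t300\tgene4\nc1\t300\t400\tgene1\nc13\t600\t700\tgene2\n' > "$dir/file1"
printf 'gene1\t4.2\t0.001\ngene4\t1.05\t0.5\n' > "$dir/file2"

# paste pads the shorter file with the delimiter, so unmatched lines end in a
# lone TAB; the sed rewrites that trailing TAB into the default values.
out=$(paste "$dir/file1" "$dir/file2" | sed 's/\t$/\tNA\t1\t1/')
printf '%s\n' "$out"
rm -r "$dir"
```

Keep in mind paste pairs lines positionally; it only fills in defaults and does not match on the gene column, which is why the other answers use a lookup table or join.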

Subtract corresponding lines

I have two files, file1.csv
3 1009
7 1012
2 1013
8 1014
and file2.csv
5 1009
3 1010
1 1013
In the shell, I want to subtract the count in the first column in the second file from that in the first file, based on the identifier in the second column. If an identifier is missing in the second column, the count is assumed to be 0.
The result would be
-2 1009
-3 1010
7 1012
1 1013
8 1014
The files are huge (several GB). The second columns are sorted.
How would I do this efficiently in the shell?
Assuming that both files are sorted on second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014
join will join sorted files.
-j2 will join on the second column of each file.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 Same as -a1 but applied for file2.
-oauto is in this case the same as -o 0,1.1,2.1: print the join field first (taken from whichever file has it), then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
The output from join is three columns like:
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
Which is piped to awk, to subtract column three from column 2, and then reformatting.
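Put together as a self-contained script (scratch files recreated from the question; `-o auto` assumes GNU join):

```shell
dir=$(mktemp -d)
printf '3 1009\n7 1012\n2 1013\n8 1014\n' > "$dir/file1"
printf '5 1009\n3 1010\n1 1013\n'         > "$dir/file2"

# join on column 2, keep unpairable lines from both files, fill gaps with 0,
# then subtract and reorder the columns.
out=$(LC_ALL=C join -j2 -a 1 -a 2 -o auto -e 0 "$dir/file1" "$dir/file2" \
      | awk '{print $2 - $3, $1}')
printf '%s\n' "$out"
rm -r "$dir"
```

`LC_ALL=C` pins the collation order so join's sortedness check cannot be upset by the locale.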
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010
It reads the first file in memory so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2 # previous $2 == current $2 -> subtract
3 1010 2 # no matching $2 and $3 == 2 -> print -$1
7 1012 1
2 1013 1 # previous $2 != current $2 and previous $3 == 1 -> print previous $1
1 1013 2
8 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed Morton
Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk
{ split(prev,p) }
$2 == p[2] {
print p[1] - $1, p[2]
prev = ""
next
}
p[2] != "" {
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
split(prev,p)
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
All those options are required:
-j2 says to join based on the second column of each file
-o auto says to make every row have the same format, beginning with the join key
-e 0 says that missing values should be substituted with zero
-a 1 and -a 2 include rows that are absent from one file or another
the filenames (I've used names based on the question number here)
Now we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc program:
sed -e 's/.*/c& -n[ ]np/'
This takes the three values on each line and rearranges them into a dc command for the subtraction, then executes it. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
We can then pipe all these lines into dc, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
> | sed -e 's/.*/c& -n[ ]np/' \
> | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014
¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.
TL;DR
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.
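The sed stage can be checked in isolation; for one line of join output it emits the dc program described above (piping that program into dc then prints -2 1009):

```shell
# One join output line -> the dc program that subtracts and reformats it.
out=$(printf '1009 3 5\n' | sed -e 's/.*/c& -n[ ]np/')
printf '%s\n' "$out"
```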
Assuming the files are blank-separated; if the separator is "," use the argument -F ','
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
{Discounts[$2]=$1; ids[$2]++}
END { for (id in ids) print Inits[ id] - Discounts[ id] " " id}
' file1.csv file2.csv
for memory issue (could be in 1 serie of pipe but prefer to use a temporary file)
awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
| sort -k2 \
> file.tmp
awk 'Last != $2 {
if (NR != 1) print Result " " Last
Last = $2; Result = $1
next
}
{ Result += $1 }
END { print Result " " Last }
' file.tmp
rm file.tmp
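Exercised end to end on the question's data (a sketch; the per-group rule is condensed because the next in the first block makes its condition redundant, and LC_ALL=C pins the sort order):

```shell
dir=$(mktemp -d)
printf '3 1009\n7 1012\n2 1013\n8 1014\n' > "$dir/file1"
printf '5 1009\n3 1010\n1 1013\n'         > "$dir/file2"

# negate file2's counts, merge with file1, group by id
awk 'FNR==NR{print;next}{print -1*$1 " " $2}' "$dir/file1" "$dir/file2" \
  | LC_ALL=C sort -k2 > "$dir/file.tmp"

# sum counts per id; print each group's total when the id changes
out=$(awk 'Last != $2 { if (NR != 1) print Result " " Last
                        Last = $2; Result = $1; next }
           { Result += $1 }
           END { print Result " " Last }' "$dir/file.tmp")
printf '%s\n' "$out"
rm -r "$dir"
```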

awk condition always TRUE in a loop [duplicate]

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
Good morning,
I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following:
I have these two files:
head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095
and
head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
I would like to use a bash script to remove all lines from the input file in which the BP column lies within an interval specified in the interval file and in which the CHR column matches the first column of the interval file.
Here is the code I've been working with (although a simpler solution would be welcomed):
while read interval; do
chr=$(echo $interval | awk '{print $1}')
START=$(echo $interval | awk '{print $2}')
STOP=$(echo $interval | awk '{print $3}')
awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
mv tmp <input file>
done <
My problem is that no lines are removed from the input file. Even if the command
awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l
returns >4000 lines, so the lines clearly are in the input file.
Thank you very much for your help.
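For what it's worth, the condition is always true because $chr, $START and $STOP sit inside single quotes: the shell never expands them, so awk evaluates $chr with an uninitialized variable chr (numerically 0), i.e. $0, and $2!=$0 holds for any line with more than one field. Passing the values in with -v fixes that; a sketch for a single interval, with invented data:

```shell
dir=$(mktemp -d)
cat > "$dir/input" <<'EOF'
SNP CHR BP A1 A2 OR P
rs1 1 112700000 A T 0.85 0.01
rs2 1 752566 A G 1.14 0.0093
rs3 2 112700000 T C 0.87 0.01
EOF

chr=1 START=112667912 STOP=114334946
# keep the header plus every line outside the interval on the matching chromosome
out=$(awk -v chr="$chr" -v start="$START" -v stop="$STOP" \
      'FNR==1 || $2!=chr || $3<start || $3>stop' "$dir/input")
printf '%s\n' "$out"
rm -r "$dir"
```

Looping this over a large interval file rereads the input once per interval; loading all intervals in a single awk pass (FNR==NR, as in other answers on this page) scales better.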
You can try with perl instead of awk. The reason is that in perl you can create a hash of arrays to save the data of the interval file, and extract it more easily when processing your input, like:
perl -lane '
$. == 1 && next;
@F == 3 && do {
push @{$h{$F[0]}}, [@F[1..2]];
next;
};
@F == 7 && do {
$ok = 1;
if (exists $h{$F[1]}) {
for (@{$h{$F[1]}}) {
if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
$ok = 0;
last;
}
}
}
printf qq|%s\n|, $_ if $ok;
};
' interval input
$. skips the header of the interval file. @F checks the number of columns, and the push creates the hash of arrays.
Your test data is not well suited for testing because no line would be filtered out, so I changed it to:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
So you can run it and get as result:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide each by a constant, then subtract one result from the other and output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-442/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself, does anyone know how to do this or could explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0

The simplest way to join 2 files using bash and both of their keys appear in the result

I have 2 input files
file1
A 0.01
B 0.09
D 0.05
F 0.08
file2
A 0.03
C 0.01
D 0.04
E 0.09
The output I want is
A 0.01 0.03
B 0.09 NULL
C NULL 0.01
D 0.05 0.04
E NULL 0.09
F 0.08 NULL
The best that I can do is
join -t' ' -a 1 -a 2 -1 1 -2 1 -o 1.1,1.2,2.2 file1 file2
which doesn't give me what I want
You can write:
join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file1 file2
where I've made these changes:
In the output format, I changed 1.1 ("first column of file #1") to 0 ("join field"), so that values from file #2 can show up in the first field when necessary. (Specifically, so that C and E will.)
I added the -e option to specify a value (NULL) for missing/empty fields.
I used $'\t', which Bash converts to a tab, instead of typing an actual tab. I find this easier to use than a tab in the middle of the command. But if you disagree, and the actual tab is working for you, then by all means, you can keep using it. :-)
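Reproduced end to end (a sketch; tab-separated scratch files are built with printf, and $(printf '\t') stands in for Bash's $'\t' so the command also works in plain sh):

```shell
dir=$(mktemp -d)
printf 'A\t0.01\nB\t0.09\nD\t0.05\nF\t0.08\n' > "$dir/file1"
printf 'A\t0.03\nC\t0.01\nD\t0.04\nE\t0.09\n' > "$dir/file2"

tab=$(printf '\t')
# full outer join on the key, fill missing values with NULL
out=$(LC_ALL=C join -t "$tab" -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 \
      "$dir/file1" "$dir/file2")
printf '%s\n' "$out"
rm -r "$dir"
```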
