Awk does not recognize fields as integer values - bash

I'm trying to filter one file based on two columns of another one.
The problem is that awk does not distinguish, for example, the interval 70083 83083 from position 7323573 (please see below).
The aim is to retrieve, for each line of file 1, the value in column 5 of file 2.
File 1 has a single position in column 3 (e.g. 51476), and file 2 has an interval given by columns 3 and 4.
In the end I need file 1 with the respective values of column 5 (see output).
file 1
rs187298206 chr1 51476 0.0072 0.201426626822702
rs116400033 chr1 51479 0.2055 1.18445621536109
rs62637813 chr1 52058 0.0587 0.551216300225955
rs190291950 chr1 52144 -4e-04 0.036575951491895
rs150021059 chr1 52238 0.3325 1.70427928591544
rs140052487 chr1 54353 0.003 0.12778378962414
rs146477069 chr1 54421 0.1419 0.924336309646664
rs141149254 chr1 54490 0.1767 1.06786868821145
rs2462492 chr1 54676 0.0819 0.664355314594874
rs143174675 chr1 54753 0.026 0.356836206987615
rs3091274 chr1 55164 0.3548 1.80091078751368
rs10399749 chr1 55299 0.0309 0.389748348495465
rs182462964 chr1 55313 2e-04 0.0877969207975495
rs3107975 chr1 55326 0.0237 0.344080010917931
rs142800240 chr1 7323573 -6e-04 0.0361473609720785
file 2
51083_1 chr1 51083 56000 -0.177152387075888 0.172569306719619
57083_1 chr1 57083 60083 -0.0524335467819781 0.130497858911419
60083_1 chr1 70083 83083 -0.0332555672564894 0.124932838766226
525083_1 chr1 525083 528083 0.291406335374442 0.0577249392691202
528083_1 chr1 528083 531083 0.291406335374442 0.0577249392691202
531083_1 chr1 531083 534083 0.291406335374442 0.0577249392691202
534083_1 chr1 534083 537083 0.291406335374442 0.0577249392691202
534083_1 chr1 534083 537083 0.441406335374442 0.0577249392691202
What I get with this script:
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
if (key > $3 && key < $4)
print score[key], $5
}
' file1 file2 > output
output
rs140052487 chr1 54353 0.003 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
rs142800240 chr1 7323573 -6e-04 -0.0332555672564894 <- this should not appear

awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
if (key+0 > $3 && key+0 < $4)
print score[key], $5
}
' fst.txt tajima.txt > output
gives me
[/tmp]$ cat output
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs140052487 chr1 54353 0.003 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
To force interpretation as a number, add 0 to the value (from the awk man page).

I can reproduce your problem on Mac OS X 10.11.3 with the system's BSD awk.
The problem is string versus numeric comparison: array subscripts in awk are always strings, so key is compared with $3 and $4 as a string rather than as a number.
I've brute-forced it into treating the comparison numerically with:
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
{
if (key+0 > $3+0 && key+0 < $4+0)
{
#print "==", key, $3, $4
#if (key > $3) print key, ">", $3
#if (key < $4) print key, "<", $4
print score[key], $5
}
}
}
' file1 file2
You can see the '+0' to force awk to treat things as numbers. (The analogue to force awk to treat a value as a string is, for example, key "", which concatenates an empty string to the (string) value of key.)
With your sample data, I then get the output:
rs140052487 chr1 54353 0.003 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
Part of the debugging output, which gave the game away, was:
== 54676 51083 56000
54676 > 51083
54676 < 56000
rs2462492 chr1 54676 0.0819 -0.177152387075888
== 7323573 70083 83083
7323573 > 70083
7323573 < 83083
rs142800240 chr1 7323573 -6e-04 -0.0332555672564894
For the 5-digit strings, the string comparison happened to give the same result as a numeric comparison; for the 7-digit key it did not. I should also point out that the $3+0 and $4+0 parts are probably not essential. I had those in place when I got the debugging output shown, but the tests only started to work when I added 0 to the key, so I probably don't need to add 0 to $3 or $4.
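As a minimal sketch of the difference (the variable names s and n are only for illustration, and the key is assigned as a string constant to stand in for a subscript coming out of for (key in score)), the same value can be tested both ways against the 70083 83083 interval:

```shell
# "7323573" is a string, like any array subscript; $1 and $2 are numeric-looking fields.
out="$(echo '70083 83083' | awk '{
    key = "7323573"
    # string comparison: "73..." sorts after "70..." and before "83..."
    s = (key > $1 && key < $2) ? "inside" : "outside"
    # adding 0 turns key into a number, so the comparison becomes numeric
    n = (key+0 > $1 && key+0 < $2) ? "inside" : "outside"
    print "string comparison:", s
    print "numeric comparison:", n
}')"
printf '%s\n' "$out"
```

The string comparison reports the position as inside the interval, which is the spurious line from the question; the numeric one reports it as outside.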

Related

If value of a column equals value of same column in previous line plus one, give the same code

I have some data that looks like this:
chr1 3861154 N 20
chr1 3861155 N 20
chr1 3861156 N 20
chr1 3949989 N 22
chr1 3949990 N 22
chr1 3949991 N 22
What I need to do is to give a code based on column 2. If the value equals the value of previous line plus one, then they come from the same series and I need to give them the same code in a new column. That code could be the value of the first line of that series. The desired output for this example would be:
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861154
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949989
I was thinking of using awk, but of course that's not a requirement.
Any ideas on how I could make this work?
Edit to add the code I'm working in:
awk 'BEGIN {var = $2} {if ($2 == var+1) print $0"\t"var; else print $0"\t"$2; var = $2 }' test
I think the idea is there, but it's not quite right yet. The result I'm getting is:
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861155
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949990
Thanks!
$ cat tst.awk
(NR == 1) || ($2 != (prev+1)) {
val = $2
}
{
print $0, val
prev = $2
}
$ awk -f tst.awk file
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861154
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949989
The big mistake in your script was this part:
BEGIN {var = $2}
because:
$2 is the 2nd field of the current line of input.
BEGIN is executed before any input lines have been read.
So the value of $2 in the BEGIN section is zero-or-null just like any other unset variable.
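A quick way to see this, as a small sketch with a made-up one-line input, is to print $2 from both the BEGIN block and the main block:

```shell
# In BEGIN no record has been read yet, so $2 is empty; in the main block it is set.
out="$(printf 'chr1 3861154 N 20\n' | awk '
    BEGIN { print "in BEGIN: [" $2 "]" }
          { print "in main:  [" $2 "]" }')"
printf '%s\n' "$out"
```

The original script also resets var to $2 on every line, so var holds the previous line's value rather than the first value of the series, which is why the third line of each series got the wrong code.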

Awk print file name within for loop

I am running a for-loop over several files located in a directory. Each command within the for-loop appends the former .txt with a new column. Currently, the 3rd line in the for loop creates a column with the filepath, but I want just the filename - I don't need the file extension either. I've played around with splitting and piping back into awk, but no luck.
After adjusting the awk command to get just the filename, I want to then make a master .txt file that contains all looped values. Essentially I think I would need to append a .txt file with the output from each loop. Right now that's what I'm trying to do with the pipe in the third line of the for loop, but it just creates an empty .txt file.
for file in ~/Desktop/test/*bam
do
bedtools multicov -bams "$file" -bed bed_for_multicov.bed > "${file%%_RRemoved.bam}_counts.txt"
awk '{print $0,a}' a="$(samtools view -c "$file")" ${file%%_RRemoved.bam}_counts.txt > ${file%%_RRemoved.bam}_CPMcounts.txt
awk -v var="$file" '{print $0, var}' ${file%%_RRemoved.bam}_CPMcounts.txt > ${file%%_RRemoved.bam}_CPMcounts2.txt | >> CPMcountsMaster.txt
done
Current filename1_CPMcounts2.txt output
chr1 11088 11488 peak_1 192 4409922 path/to/filename1.bam
chr1 20674 21215 peak_2 217 4409922 path/to/filename1.bam
chr1 28550 28862 peak_3 170 4409922 path/to/filename1.bam
chr1 29582 30300 peak_4 437 4409922 path/to/filename1.bam
chr1 30635 31720 peak_5 696 4409922 path/to/filename1.bam
chr1 32373 35541 peak_6 2877 4409922 path/to/filename1.bam
Current filename2_CPMcounts2.txt output
chr1 11088 11488 peak_1 293 5888360 path/to/filename2.bam
chr1 20674 21215 peak_2 439 5888360 path/to/filename2.bam
chr1 28550 28862 peak_3 392 5888360 path/to/filename2.bam
chr1 29582 30300 peak_4 901 5888360 path/to/filename2.bam
Desired filename1_CPMCounts2.txt output
chr1 11088 11488 peak_1 192 4409922 filename1
chr1 20674 21215 peak_2 217 4409922 filename1
chr1 28550 28862 peak_3 170 4409922 filename1
chr1 29582 30300 peak_4 437 4409922 filename1
chr1 30635 31720 peak_5 696 4409922 filename1
chr1 32373 35541 peak_6 2877 4409922 filename1
Desired Final CPMcountsMaster.txt
chr1 11088 11488 peak_1 192 4409922 filename1
chr1 20674 21215 peak_2 217 4409922 filename1
chr1 28550 28862 peak_3 170 4409922 filename1
chr1 29582 30300 peak_4 437 4409922 filename1
chr1 30635 31720 peak_5 696 4409922 filename1
chr1 32373 35541 peak_6 2877 4409922 filename1
chr1 11088 11488 peak_1 293 5888360 filename2
chr1 20674 21215 peak_2 439 5888360 filename2
chr1 28550 28862 peak_3 392 5888360 filename2
chr1 29582 30300 peak_4 901 5888360 filename2
The following works, adapted from J Leffler's comment - thanks!
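for file in ~/Desktop/test/*bam
do
bedtools multicov -bams "$file" -bed bed_for_multicov.bed > "${file%%_RRemoved.bam}_counts.txt"
# append the total read count (the step from the question that creates _CPMcounts.txt)
awk '{print $0,a}' a="$(samtools view -c "$file")" "${file%%_RRemoved.bam}_counts.txt" > "${file%%_RRemoved.bam}_CPMcounts.txt"
# append the bare sample name: basename strips the directory and the _RRemoved.bam suffix
awk '{print $0,a}' a="$(basename "$file" _RRemoved.bam)" "${file%%_RRemoved.bam}_CPMcounts.txt" > "${file%%_RRemoved.bam}_CPMcounts2.txt"
awk '{print $0,a}' a="$(basename "$file" _RRemoved.bam)" "${file%%_RRemoved.bam}_CPMcounts.txt" >> CPMcountsMaster.txt
done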
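For reference, here is a small sketch of the two pieces involved, using a hypothetical path and a throwaway temp directory rather than real BAM files. It shows two ways to get the bare sample name, and uses tee so that a single awk run writes the per-sample file and appends to the master file. In the question's `> file | >> master` form, awk's output had already been redirected to the file, so nothing reached the pipe and the master file stayed empty.

```shell
file="/path/to/filename1_RRemoved.bam"       # hypothetical path; the file need not exist

name_a="$(basename "$file" _RRemoved.bam)"   # external command: strips directory and suffix
name_b="${file##*/}"                         # shell only: drop everything up to the last /
name_b="${name_b%_RRemoved.bam}"             # then drop the suffix
echo "$name_a $name_b"

tmp="$(mktemp -d)"
printf 'chr1 11088 11488 peak_1 192 4409922\n' > "$tmp/in.txt"
# tee writes the per-sample file; its stdout is appended to the master file
awk -v name="$name_a" '{ print $0, name }' "$tmp/in.txt" |
    tee "$tmp/per_sample.txt" >> "$tmp/master.txt"
per_sample="$(cat "$tmp/per_sample.txt")"
master="$(cat "$tmp/master.txt")"
rm -r "$tmp"
printf '%s\n' "$master"
```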

How to merge two files into a unique file based on 2 columns of each file

I have two tab-delimited files as follows:
File1:
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
File2:
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
I want to merge these two files into a single file: where the values in columns 3 and 4 of file1 equal columns 1 and 2 of file2, keep all columns of file2 plus column 2 of file1.
output like this:
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
Thank you so much,
Vahid.
I tried this awk command:
awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' file1.tsv file2.tsv
But it does not give me the output I am looking for; the output is a combination of both files, like this:
chr1 10468 10470 2 100 cg00000292 0.780482425 0.78
chr1 10483 10496 4 264 0.924
chr3 10524 10524 1 47 cg00008493 0.982994402 0.936
chr1 10541 10541 1 64 0.781
chr3 10562 10588 5 510 0.941
chr1 10608 10619 3 243 0.951
chr7 10630 10794 42 5292 cg00006414 0.000000456 0.952
chr1 10810 10815 3 135 0.756
The basic idea here is to read the first file and, using each line's third and fourth columns as a key, save the second column in an array. Then for each line in the second file, if its first two columns were seen in the first file, print that line and the saved second column of the first file.
$ awk 'BEGIN{ FS=OFS="\t" }
NR==FNR { seen[$3,$4]=$2; next }
($1,$2) in seen { print $0, seen[$1,$2] }' file1.tsv file2.tsv
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
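The two-column key works because awk joins the subscripts of seen[$3,$4] into one string with SUBSEP, and ($1,$2) in seen tests for that string without creating an entry. A small sketch with a hard-coded entry (the data is cut down from the question, and the END block is only there to show what the stored key looks like):

```shell
out="$(printf 'chr1\t10468\t10470\nchr1\t10483\t10496\n' |
awk 'BEGIN { FS = OFS = "\t"; seen["chr1", "10468"] = "0.780482425" }
     ($1, $2) in seen { print $0, seen[$1, $2] }    # non-matching lines are skipped
     END { for (k in seen) { n = split(k, part, SUBSEP); print n, part[1], part[2] } }')"
printf '%s\n' "$out"
```

The attempt in the question instead assigned the looked-up value to $6 and printed every line, which is why column 6 was overwritten and the unmatched lines still appeared.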
# I want to merge these two files in a unique file
# if both values in columns 3 and 4 of file1
# are equal to columns 1 and 2 of file2
# and to keep all columns of file2 plus column 2 of file1.
join -t$'\t' -11 -21 -o2.2,2.3,2.4,2.5,2.6,2.7,2.8,1.3 <(
<file1 awk -vFS=$'\t' -vOFS=$'\t' '{ print $3 $4,$0 }' |
sort -t$'\t' -k1,1
) <(
<file2 awk -vFS=$'\t' -vOFS=$'\t' '{ print $1 $2,$0 }' |
sort -t$'\t' -k1,1
)
First preprocess each file, prepending the fields you want to join on as a single key column.
Then sort both files on that key and join them.
Finally, use -o to specify the output format for join.
Tested on repl against:
# recreate input files
tr -s ' ' <<EOF | tr ' ' '\t' >file1
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
EOF
tr -s ' ' <<EOF | tr ' ' '\t' >file2
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
EOF

Merge tab delimited files one upon other without header

Hi I have two tab delimited text files
file.1.txt
Chr Start End Ref Alt
chr1 4204560 4204560 T C
chr1 9471179 9471181 ATA -
chr1 9471230 9471230 A C
chr1 9471247 9471247 T C
chr1 9471254 9471254 C A
chr1 9471261 9471262 AA -
chr1 9471262 9471262 A C
AND file.2.txt
Chr Start End Ref Alt
chr1 9471268 9471268 - ACT
chr1 9471274 9471274 A C
chr1 9471275 9471275 A C
chr1 9471284 9471284 T C
chr1 9471297 9471297 T C
chr1 9471302 9471302 T C
chr1 9471312 9471312 A C
I want to combine these two files so that the second file's header row is excluded and the files are stacked one on top of the other:
Chr Start End Ref Alt
chr1 4204560 4204560 T C
chr1 9471179 9471181 ATA -
chr1 9471230 9471230 A C
chr1 9471247 9471247 T C
chr1 9471254 9471254 C A
chr1 9471261 9471262 AA -
chr1 9471262 9471262 A C
chr1 9471268 9471268 - ACT
chr1 9471274 9471274 A C
chr1 9471275 9471275 A C
chr1 9471284 9471284 T C
chr1 9471297 9471297 T C
chr1 9471302 9471302 T C
chr1 9471312 9471312 A C
How to do this using awk command or shell script?
P.S. Number of columns in actual files are 168.
The following awk may also help here.
awk 'FNR==NR{print;next} FNR!=NR && FNR>1{print}' file1.txt file2.txt
Or, more concisely, without the redundant FNR!=NR test:
awk 'FNR==NR{print;next} FNR>1{print}' file1.txt file2.txt
You can use the following awk:
awk 'FNR > 1 || NR == 1' file1 file2
Chr Start End Ref Alt
chr1 4204560 4204560 T C
chr1 9471179 9471181 ATA -
chr1 9471230 9471230 A C
chr1 9471247 9471247 T C
chr1 9471254 9471254 C A
chr1 9471261 9471262 AA -
chr1 9471262 9471262 A C
chr1 9471268 9471268 - ACT
chr1 9471274 9471274 A C
chr1 9471275 9471275 A C
chr1 9471284 9471284 T C
chr1 9471297 9471297 T C
chr1 9471302 9471302 T C
chr1 9471312 9471312 A C
Or else using just cat and tail:
cat file1; tail -n +2 file2
Another awk:
$ awk 'p<FNR;{p=FNR}' file1 file2
outputs:
Chr Start End Ref Alt
chr1 4204560 4204560 T C
chr1 9471179 9471181 ATA -
chr1 9471230 9471230 A C
chr1 9471247 9471247 T C
chr1 9471254 9471254 C A
chr1 9471261 9471262 AA -
chr1 9471262 9471262 A C
chr1 9471268 9471268 - ACT
chr1 9471274 9471274 A C
chr1 9471275 9471275 A C
chr1 9471284 9471284 T C
chr1 9471297 9471297 T C
chr1 9471302 9471302 T C
chr1 9471312 9471312 A C
i.e. print a line only if the previous FNR (stored in p) is less than the current one.
Any reason not to just do:
cat file1; tail -n +2 file2
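The FNR > 1 || NR == 1 form also extends to more than two files, since FNR restarts at 1 for each file while NR keeps counting. A small sketch with three made-up two-column files in a temp directory:

```shell
tmp="$(mktemp -d)"
printf 'Chr\tStart\nchr1\t1\n' > "$tmp/a.txt"
printf 'Chr\tStart\nchr1\t2\n' > "$tmp/b.txt"
printf 'Chr\tStart\nchr1\t3\n' > "$tmp/c.txt"

# keep the very first line overall (NR == 1) and every non-header line (FNR > 1)
merged="$(awk 'FNR > 1 || NR == 1' "$tmp/a.txt" "$tmp/b.txt" "$tmp/c.txt")"
rm -r "$tmp"
printf '%s\n' "$merged"
```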

awk trying to use output as input and erroring

In the awk below I am trying to combine all lines with a matching $4 into a single line, with $5 truncated at the -, and average all values in $7. Why is awk complaining about the output not being found (that is, /home/cmccabe/Desktop/NGS/API/2-12-2015/bedtools/30x/${pref}_genes.txt)? Thank you :).
input (`/home/cmccabe/Desktop/NGS/API/2-12-2015/bedtools/30x/*30reads_perbase.txt')
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 4 14
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 2 27
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 3 27
desired output
chr1:955543-955763 4 AGRN 15
chr1:976035-976270 3 AGRN 27
awk
for f in /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/*30reads_perbase.txt ; do bname=`basename "$f"`; pref=${bname%%.txt}; awk '{k=$4 FS $5; a[k]+=$7; c[k]++}
END{for(k in a)
split(k,ks,FS);
print ks[1],c[k],ks[2],a[k]/c[k]}' "$f" > /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/"${pref}"_genes.txt; done
current output
chr1:976035-976270 3 AGRN 27.3333
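Two changes are needed. The body of the for loop in END must be wrapped in braces, otherwise only split belongs to the loop and print runs once, after the loop has finished. Then use the functions substr and match when printing, to keep only the part of $5 before the -:
awk '{k=$4 FS $5; a[k]+=$7; c[k]++}
END{for(k in a){split(k,ks,FS); print ks[1],c[k],substr(ks[2],1,match(ks[2],"-")-1),a[k]/c[k]}}' "$f"
chr1:955543-955763 4 AGRN 15.25
chr1:976035-976270 3 AGRN 27.3333
(The order of the lines produced by for (k in a) is not guaranteed.)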
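A runnable check against the sample input from the question, as a sketch: the array names sum and cnt and the temp file are only for illustration, and the result is piped through sort because for (k in ...) visits keys in no particular order. The loop body is in braces so that split and print both run once per key, and substr starts at 1, since a start of 0 is handled differently by different awk implementations.

```shell
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 4 14
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 2 27
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 3 27
EOF

out="$(awk '
    { k = $4 FS $5; sum[k] += $7; cnt[k]++ }      # accumulate per ($4, $5) pair
    END {
        for (k in sum) {
            split(k, ks, FS)
            gene = substr(ks[2], 1, match(ks[2], "-") - 1)   # text before the first "-"
            print ks[1], cnt[k], gene, sum[k] / cnt[k]
        }
    }' "$tmp" | sort)"
rm "$tmp"
printf '%s\n' "$out"
```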
