awk trying to use output as input and erroring - bash

In the awk below I am trying to combine all lines with a matching $4 into a single line with $5 (up to the -), and average all values in $7. Why is awk complaining about the output not being found (that is, /home/cmccabe/Desktop/NGS/API/2-12-2015/bedtools/30x/${pref}_genes.txt)? Thank you :).
input (/home/cmccabe/Desktop/NGS/API/2-12-2015/bedtools/30x/*30reads_perbase.txt)
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 4 14
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 2 27
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 3 27
desired output
chr1:955543-955763 4 AGRN 15
chr1:976035-976270 3 AGRN 27
awk
for f in /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/*30reads_perbase.txt ; do bname=`basename "$f"`; pref=${bname%%.txt}; awk '{k=$4 FS $5; a[k]+=$7; c[k]++}
END{for(k in a)
split(k,ks,FS);
print ks[1],c[k],ks[2],a[k]/c[k]}' "$f" > /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/"${pref}"_genes.txt; done
current output
chr1:976035-976270 3 AGRN 27.3333

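Use the functions substr and match to trim $5 at the first - when printing. The body of the for loop also needs braces; without them only split runs inside the loop, and print executes just once after the loop ends. Note that substr positions start at 1, not 0:
awk '{k=$4 FS $5; a[k]+=$7; c[k]++} END{for(k in a){split(k,ks,FS); print ks[1],c[k],substr(ks[2],1,match(ks[2],"-")-1),a[k]/c[k]}}' "$f"
chr1:955543-955763 4 AGRN 15.25
chr1:976035-976270 3 AGRN 27.3333
(The order of the output lines is not guaranteed, because for (k in a) visits keys in an unspecified order.)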
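As a self-contained sketch, the same aggregation can be run on the sample rows from the question. The variable names (input, result, sum, cnt) and the final sort are assumptions added here for illustration; sort is only there to make the output order predictable.

```shell
# Sample rows copied from the question; a shell variable stands in for the real file.
input='chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 4 14
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 2 27
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 3 27'

result=$(printf '%s\n' "$input" | awk '
    { k = $4 FS $5; sum[k] += $7; cnt[k]++ }   # group on region plus gene label
    END {
        for (k in sum) {                       # braces: both statements run per key
            split(k, ks, FS)
            # keep the part of the label before the first "-"
            gene = substr(ks[2], 1, match(ks[2], "-") - 1)
            print ks[1], cnt[k], gene, sum[k] / cnt[k]
        }
    }' | sort)

printf '%s\n' "$result"
# chr1:955543-955763 4 AGRN 15.25
# chr1:976035-976270 3 AGRN 27.3333
```

If whole-number averages are wanted, as in the desired output, int(sum[k] / cnt[k]) or printf with %d would be one way to get them.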

Related

If value of a column equals value of same column in previous line plus one, give the same code

I have some data that looks like this:
chr1 3861154 N 20
chr1 3861155 N 20
chr1 3861156 N 20
chr1 3949989 N 22
chr1 3949990 N 22
chr1 3949991 N 22
What I need to do is to give a code based on column 2. If the value equals the value of previous line plus one, then they come from the same series and I need to give them the same code in a new column. That code could be the value of the first line of that series. The desired output for this example would be:
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861154
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949989
I was thinking of using awk, but of course that's not a requirement.
Any ideas on how I could make this work?
Edit to add the code I'm working in:
awk 'BEGIN {var = $2} {if ($2 == var+1) print $0"\t"var; else print $0"\t"$2; var = $2 }' test
I think the idea is there, but it's not quite right yet. The result I'm getting is:
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861155
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949990
Thanks!
$ cat tst.awk
(NR == 1) || ($2 != (prev+1)) {
    val = $2
}
{
    print $0, val
    prev = $2
}
$ awk -f tst.awk file
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861154
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949989
The big mistake in your script was this part:
BEGIN {var = $2}
because:
$2 is the 2nd field of the current line of input.
BEGIN is executed before any input lines have been read.
So the value of $2 in the BEGIN section is zero-or-null just like any other unset variable.
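A quick way to see this for yourself is to print $2 from both places. This is only a small illustration; the variable name begin_demo is made up here, and the input line is taken from the question.

```shell
begin_demo=$(printf 'chr1 3861154 N 20\n' | awk '
    BEGIN { printf "BEGIN: [%s]\n", $2 }   # no record read yet, so $2 is empty
          { printf "main:  [%s]\n", $2 }   # now $2 is the second field of the line
')
printf '%s\n' "$begin_demo"
# BEGIN: []
# main:  [3861154]
```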

How to merge two files into a unique file based on 2 columns of each file

I have two tab-delimited files as follows:
File1:
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
File2:
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
I want to merge these two files in a unique file if both values in columns 3 and 4 of file1 are equal to columns 1 and 2 of file2 and to keep all columns of file2 plus column 2 of file1.
output like this:
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
Thank you so much,
Vahid.
I tried this awk command:
awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' file1.tsv file2.tsv
But it does not give me the output I am looking for; the output is a combination of both files, like this:
chr1 10468 10470 2 100 cg00000292 0.780482425 0.78
chr1 10483 10496 4 264 0.924
chr3 10524 10524 1 47 cg00008493 0.982994402 0.936
chr1 10541 10541 1 64 0.781
chr3 10562 10588 5 510 0.941
chr1 10608 10619 3 243 0.951
chr7 10630 10794 42 5292 cg00006414 0.000000456 0.952
chr1 10810 10815 3 135 0.756
The basic idea here is to read the first file and, using each line's third and fourth columns as a key, save the second column in an array. Then for each line in the second file, if its first two columns were seen in the first file, print that line and the saved second column of the first file.
$ awk 'BEGIN{ FS=OFS="\t" }
NR==FNR { seen[$3,$4]=$2; next }
($1,$2) in seen { print $0, seen[$1,$2] }' file1.tsv file2.tsv
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
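For anyone unfamiliar with the seen[$3,$4] notation: awk stores a multi-part subscript as a single string with the parts joined by SUBSEP, and (a,b) in arr tests for that combined key without creating it. A minimal sketch, with made-up variable names (subsep_demo, parts) and values borrowed from the sample data:

```shell
subsep_demo=$(awk 'BEGIN {
    seen["chr1", "10468"] = "0.780482425"     # stored under "chr1" SUBSEP "10468"
    for (k in seen) {
        n = split(k, parts, SUBSEP)           # recover the two parts of the key
        print n, parts[1], parts[2]
    }
    if (("chr1", "10468") in seen)    print "match"
    # SUBSEP keeps the parts apart, so chr11 + 0468 is a different key
    if (!(("chr11", "0468") in seen)) print "no false match"
}')
printf '%s\n' "$subsep_demo"
# 2 chr1 10468
# match
# no false match
```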
# I want to merge these two files in a unique file
# if both values in columns 3 and 4 of file1
# are equal to columns 1 and 2 of file2
# and to keep all columns of file2 plus column 2 of file1.
# a ":" between the two key columns avoids ambiguous keys (chr1+10468 vs chr11+0468)
join -t$'\t' -1 1 -2 1 -o 2.2,2.3,2.4,2.5,2.6,2.7,2.8,1.3 <(
<file1 awk -vFS=$'\t' -vOFS=$'\t' '{ print $3 ":" $4,$0 }' |
sort -t$'\t' -k1,1
) <(
<file2 awk -vFS=$'\t' -vOFS=$'\t' '{ print $1 ":" $2,$0 }' |
sort -t$'\t' -k1,1
)
First, preprocess each file so that the columns you want to join on are prepended as a single key field.
Then sort both files on that key and join them.
Finally, use -o to specify which fields join should output.
Tested on repl against:
# recreate input files
tr -s ' ' <<EOF | tr ' ' '\t' >file1
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
EOF
tr -s ' ' <<EOF | tr ' ' '\t' >file2
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
EOF

Awk does not recognize fields as integer values

I'm trying to filter one file based on two columns of another one.
The problem is that awk is not differentiating, for example, this interval 70083 83083, from position 7323573 (please see below).
The aim is to retrieve, for each line of file 1, the value in column 5 of file 2.
File 1 has a single position in column 3 (e.g. 51476), and file 2 has an interval given by columns 3 and 4.
In the end I need file 1 with the respective values of column 5 (see output).
file 1
rs187298206 chr1 51476 0.0072 0.201426626822702
rs116400033 chr1 51479 0.2055 1.18445621536109
rs62637813 chr1 52058 0.0587 0.551216300225955
rs190291950 chr1 52144 -4e-04 0.036575951491895
rs150021059 chr1 52238 0.3325 1.70427928591544
rs140052487 chr1 54353 0.003 0.12778378962414
rs146477069 chr1 54421 0.1419 0.924336309646664
rs141149254 chr1 54490 0.1767 1.06786868821145
rs2462492 chr1 54676 0.0819 0.664355314594874
rs143174675 chr1 54753 0.026 0.356836206987615
rs3091274 chr1 55164 0.3548 1.80091078751368
rs10399749 chr1 55299 0.0309 0.389748348495465
rs182462964 chr1 55313 2e-04 0.0877969207975495
rs3107975 chr1 55326 0.0237 0.344080010917931
rs142800240 chr1 7323573 -6e-04 0.0361473609720785
file 2
51083_1 chr1 51083 56000 -0.177152387075888 0.172569306719619
57083_1 chr1 57083 60083 -0.0524335467819781 0.130497858911419
60083_1 chr1 70083 83083 -0.0332555672564894 0.124932838766226
525083_1 chr1 525083 528083 0.291406335374442 0.0577249392691202
528083_1 chr1 528083 531083 0.291406335374442 0.0577249392691202
531083_1 chr1 531083 534083 0.291406335374442 0.0577249392691202
534083_1 chr1 534083 537083 0.291406335374442 0.0577249392691202
534083_1 chr1 534083 537083 0.441406335374442 0.0577249392691202
What I get with this script:
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
if (key > $3 && key < $4)
print score[key], $5
}
' file1 file2 > output
output
rs140052487 chr1 54353 0.003 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
rs142800240 chr1 7323573 -6e-04 -0.0332555672564894 <- this should not appear
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
for (key in score)
if (key+0 > $3 && key+0 < $4)
print score[key], $5
}
' fst.txt tajima.txt > output
gives me
[/tmp]$ cat output
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs140052487 chr1 54353 0.003 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
To force awk to interpret a value as a number, add 0 to it (this is from the man page for awk).
I can reproduce your problem on Mac OS X 10.11.3 with the system's BSD awk.
The problem is to do with string vs number comparison; awk appears to be treating the key as a string and is doing a string comparison rather than a numerical comparison.
I've brute-forced it into treating the comparison numerically with:
awk '
NR == FNR {score[$3] = $1 FS $2 FS $3 FS $4; next}
{
    for (key in score)
    {
        if (key+0 > $3+0 && key+0 < $4+0)
        {
            #print "==", key, $3, $4
            #if (key > $3) print key, ">", $3
            #if (key < $4) print key, "<", $4
            print score[key], $5
        }
    }
}
' file1 file2
You can see the '+0' to force awk to treat things as numbers. (The analogue to force awk to treat a value as a string is, for example, key "", which concatenates an empty string to the (string) value of key.)
With your sample data, I then get the output:
rs140052487 chr1 54353 0.003 -0.177152387075888
rs150021059 chr1 52238 0.3325 -0.177152387075888
rs3107975 chr1 55326 0.0237 -0.177152387075888
rs3091274 chr1 55164 0.3548 -0.177152387075888
rs187298206 chr1 51476 0.0072 -0.177152387075888
rs116400033 chr1 51479 0.2055 -0.177152387075888
rs10399749 chr1 55299 0.0309 -0.177152387075888
rs146477069 chr1 54421 0.1419 -0.177152387075888
rs190291950 chr1 52144 -4e-04 -0.177152387075888
rs182462964 chr1 55313 2e-04 -0.177152387075888
rs141149254 chr1 54490 0.1767 -0.177152387075888
rs62637813 chr1 52058 0.0587 -0.177152387075888
rs143174675 chr1 54753 0.026 -0.177152387075888
rs2462492 chr1 54676 0.0819 -0.177152387075888
Part of the debugging output, which gave the game away, was:
== 54676 51083 56000
54676 > 51083
54676 < 56000
rs2462492 chr1 54676 0.0819 -0.177152387075888
== 7323573 70083 83083
7323573 > 70083
7323573 < 83083
rs142800240 chr1 7323573 -6e-04 -0.0332555672564894
For the 5-digit strings, the comparison happened to work the same as a numeric comparison. For the other, it did not. I should also point out that the $3+0 and $4+0 parts are probably not essential. I had those when I got the debugging output shown — but the tests only started to work when I added 0 to the key. I probably don't need to add the 0 to $3 or $4, therefore.
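The effect can be reproduced in isolation. Array subscripts in awk are always strings, so a key coming out of for (key in ...) is compared as a string unless it is converted first. The sketch below uses the problem position and interval from the question; the names cmp_demo, pos, as_string and as_number are made up for the illustration.

```shell
cmp_demo=$(printf '70083 83083\n' | awk '{
    pos["7323573"] = 1                             # subscripts are stored as strings
    for (key in pos) {
        as_string = (key > $1 && key < $2)         # compared character by character
        as_number = (key + 0 > $1 && key + 0 < $2) # key + 0 makes the test numeric
        print "string:", as_string, "numeric:", as_number
    }
}')
printf '%s\n' "$cmp_demo"
# string: 1 numeric: 0
```

As strings, "7323573" sorts after "70083" (3 is greater than 0 at the second character) and before "83083" (7 is less than 8 at the first character), which is why the position wrongly appears to fall inside the interval.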

How to remove lines based on duplicated value in a specific field?

For example, I have this chromosome file:
Chr1 0 145 Region1
Chr1 450 500 Region2
Chr1 499 549 Region2
...
I'd like to remove the third line because Region2 appeared on line 2. I would greatly appreciate any suggestion. Thank you!
Assuming you've got a tab delimiter, this should work using awk:
awk -F'\t' '!x[$4]++' file.txt
If it's not tab, just change '\t' to whatever the delimiter is; by default awk splits fields on whitespace.
Here's an example showing the results:
input:
~$ cat file.txt
Chr1 0 145 Region1
Chr1 450 500 Region2
Chr1 499 549 Region2
awk:
awk -F'\t' '!x[$4]++' file.txt
Chr1 0 145 Region1
Chr1 450 500 Region2
This works by printing a line only when the value in its 4th field has not been encountered before. It's a pretty standard deduping one-liner, just modified to care about a specific field and not the whole line.
It uses the 4th field as a key in an associative array and post-increments the stored count, so the expression returns 0 the first time a value is seen and a larger number for each subsequent duplicate. The ! negates this, so the line is printed when the post-increment returns 0 and skipped otherwise, which is the case for every later duplicate.
For example, adding a few more lines to the file:
~$ cat file.txt
Chr1 0 145 Region1
Chr1 450 500 Region2
Chr1 499 549 Region2
Chr1 499 555 Region2
Chr1 499 555 Region3
Chr1 499 556 Region3
And then changing our print to show the output we're testing:
~$ awk -F'\t' '{print x[$4]++}' file.txt
0
0
1
2
0
1
It should be much more obvious what is happening here.
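If the terse form is still hard to read, the !x[$4]++ one-liner can be written long-hand. This is just an equivalent sketch; the names dedup_demo and seen are chosen for the example, and the input is the three-line sample with tabs between fields.

```shell
dedup_demo=$(printf 'Chr1\t0\t145\tRegion1\nChr1\t450\t500\tRegion2\nChr1\t499\t549\tRegion2\n' |
awk -F'\t' '{
    if (seen[$4] == 0)   # first time this value of field 4 appears
        print
    seen[$4]++           # count it so that later duplicates are skipped
}')
printf '%s\n' "$dedup_demo"
```

This prints the Region1 line and the first Region2 line, the same as the one-liner.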

printing selected rows from a file using awk

I have a text file with data in the following format.
1 0 0
2 512 6
3 992 12
4 1536 18
5 2016 24
6 2560 29
7 3040 35
8 3552 41
9 4064 47
10 4576 53
11 5088 59
12 5600 65
13 6080 71
14 6592 77
15 7104 83
I want to print all the lines where $1 > 1000.
awk 'BEGIN {$1 > 1000} {print " " $1 " "$2 " "$3}' graph_data_tmp.txt
This doesn't seem to give the output that I am expecting. What am I doing wrong?
You can do this :
awk '$1>1000 {print $0}' graph_data_tmp.txt
print $0 prints the entire line.
If you instead want to print every line after the 1000th, replace $1 with NR, which holds the number of the current record (line):
awk 'NR>1000 {print $0}' graph_data_tmp.txt
All you need is:
awk '$1>1000' file
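The short form works because a pattern with no action defaults to printing the line. The condition has to sit in the pattern position rather than in BEGIN, since BEGIN runs before any input has been read. A small sketch on made-up input (the four lines and the name filter_demo are assumptions for the example):

```shell
filter_demo=$(printf '999 a\n1000 b\n1001 c\n2048 d\n' | awk '$1 > 1000')
printf '%s\n' "$filter_demo"
# 1001 c
# 2048 d
```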

Resources