Binning Together Allele Frequencies From VCF Sequencing Data - bash

I have a sequencing datafile containing base pair locations from the genome, that looks like the following example:
chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5
I would like to select groups of rows by the bp position found in column 2, and then take the average of the column 5 values of the rows in the matching regions.
So, using the example above, let's say I am looking for the average of the 5th column for all rows spanning chr1 810-820 and chr2 310-330. The first five rows should be identified, and their 5th-column values averaged, which gives 0.42.
I tried creating an array of ranges and then using awk to look up these locations, but have been unsuccessful. Thanks in advance.

import pandas as pd
from io import StringIO
s = """chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5"""
sio = StringIO(s)
df = pd.read_csv(sio, sep=" ", header=None)
df.columns = ["a", "b", "c", "d", "e"]
# The query expression is intuitive (note that these bounds are exclusive)
r = df.query("(a=='chr1' and 810<b<820) or (a=='chr2' and 310<b<330)")
print(r["e"].mean())
pandas may be a better fit for this kind of tabular data processing, and it's Python.

Here's some python code to do what you are asking for. It assumes that your data lives in a text file called 'data.txt'
#!/usr/bin/env python3
with open('data.txt') as fh:
    data = fh.readlines()

def avg(keys):
    key_sum = 0
    key_count = 0
    for item in data:
        fields = item.split()
        if not fields:
            continue  # skip blank lines
        krange = keys.get(fields[0], None)
        if krange:
            r = int(fields[1])
            if krange[0] <= r <= krange[1]:
                key_sum += float(fields[-1])
                key_count += 1
    # round() hides floating-point noise in the last digits
    print(round(key_sum / key_count, 2))

keys = {}  # Create dict to store keys and ranges of interest
keys['chr1'] = (810, 820)
keys['chr2'] = (310, 330)
avg(keys)
Sample Output:
0.42

Here's an awk script answer. For input, I created a 2nd file which I called ranges:
chr1 810 820
chr2 310 330
The script itself looks like:
#!/usr/bin/awk -f
FNR==NR { low_r[$1] = $2; high_r[$1] = $3; next }
{ l = low_r[ $1 ]; h = high_r[$1]; if( l=="" ) next }
$2 >= l && $2 <= h { total+=$5; cnt++ }
END {
if( cnt > 0 ) print (total/cnt)
else print "no matched data"
}
The breakdown is as follows:
FNR==NR - absorb the ranges file, building the low_r and high_r arrays keyed on the first column of that file.
Then, for every row in the data, look up the row's first column in the low_r and high_r arrays. If there's no match, skip any further processing of that row.
Check an inclusive range using the low and high values, incrementing total and cnt for rows that fall inside it.
At the END, print the simple average if there were any matches.
When the script (called script.awk) is made executable it can be run like:
$ ./script.awk ranges data
0.42
where I've called the data file data.
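For comparison, here is a minimal Python sketch of the same ranges-file idea. Unlike the awk arrays above, which hold one range per chromosome, it keeps a list of ranges for each chromosome. The function name mean_in_ranges and the "chrom low high" layout of the range lines are assumptions for illustration, not something taken from the answers above.

```python
from collections import defaultdict


def mean_in_ranges(data_lines, range_lines):
    """Average column 5 over rows whose position falls in any listed range."""
    # Assumed layout of each range line: "chrom low high" (inclusive bounds).
    ranges = defaultdict(list)
    for line in range_lines:
        chrom, low, high = line.split()
        ranges[chrom].append((int(low), int(high)))

    total, count = 0.0, 0
    for line in data_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        chrom, pos, freq = fields[0], int(fields[1]), float(fields[4])
        if any(low <= pos <= high for low, high in ranges.get(chrom, ())):
            total += freq
            count += 1
    return total / count if count else None


data = """chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5""".splitlines()
ranges = ["chr1 810 820", "chr2 310 330"]
print(round(mean_in_ranges(data, ranges), 2))  # prints 0.42
```

The function returns None when nothing matches, which plays the role of the "no matched data" branch in the awk script.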

Related

Replace numeric genotype code with DNA letter

How can I replace the numeric genotype code with a DNA letter?
I have a modified VCF file that looks like this:
POS REF ALT A2.bam C10.bam
448 T C 0/0:0,255,255 0/0:0,255,255
2402 C T 1/1:209,23,0 xxx:255,0,255
n...
I want to replace 0/0 with the REF letter and 1/1 with the ALT letter, and delete the rest of the string after the genotype.
It should look like this:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
I have been trying to do it with sed, but it didn't work, and I don't know how to approach it.
Would you please try:
awk '{
    if (NR > 1) {
        for (i=4; i<=5; i++) {
            split($i, a, ":")
            $i = a[1]
            if ($i == "0/0") $i = $2
            if ($i == "1/1") $i = $3
        }
    }
    print
}' file.txt
Output:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
The for loop processes the 4th and 5th columns (A2.bam and C10.bam).
First it chops off the substring after ":".
If the remaining value is equal to "0/0", then replace it with the 2nd column (REF).
In case of "1/1", use the 3rd column (ALT).
Hope this helps.
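If you would rather do this in Python, the same steps might be sketched as below. The function name decode_genotypes and its first_sample_col parameter are made up for this illustration; the sketch assumes whitespace-separated columns with REF and ALT in columns 2 and 3, as in the sample, and it leaves any other genotype string (for example 0/1 or xxx) unchanged apart from dropping the part after ":".

```python
def decode_genotypes(lines, first_sample_col=3):
    """Replace 0/0 with REF and 1/1 with ALT in each sample column."""
    out = []
    for n, line in enumerate(lines):
        fields = line.split()
        if n > 0 and len(fields) > first_sample_col:  # header row is kept as is
            ref, alt = fields[1], fields[2]
            for i in range(first_sample_col, len(fields)):
                call = fields[i].split(":", 1)[0]  # drop everything after ':'
                fields[i] = {"0/0": ref, "1/1": alt}.get(call, call)
        out.append(" ".join(fields))
    return out


rows = [
    "POS REF ALT A2.bam C10.bam",
    "448 T C 0/0:0,255,255 0/0:0,255,255",
    "2402 C T 1/1:209,23,0 xxx:255,0,255",
]
print("\n".join(decode_genotypes(rows)))
```

Because the loop runs over every column from first_sample_col onward, it is not tied to exactly two sample columns the way the i=4..5 loop is.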

Calculate Percentile(s) in Bash

I am trying to calculate a range of percentiles (5th-99th) in Bash for a text file that contains 5 values, one per line.
Input
34.5
32.2
33.7
30.4
31.8
Attempted Code
awk '{s[NR-1]=$1} END{print s[int(0.05-0.99)]}' input
Expected Output
99th 34.5
97th 34.4
95th 34.3
90th 34.2
80th 33.9
70th 33.4
60th 32.8
50th 32.2
40th 32.0
30th 31.9
20th 31.5
10th 31.0
5th 30.7
To calculate percentiles from 5 values, one needs to create a mapping between percentiles and values, and then interpolate between the mapped points. This is known as a 'piecewise linear function' (a.k.a. pwlf).
F(100) = 34.5
F(75) = 33.7
F(50) = 32.2
F(25) = 31.8
F(0) = 30.4
Mapping any other x in the range 0..100 requires linear interpolation between F(L) and F(H), where L is the highest mapped percentile <= x and H is the next mapped percentile above it (L+25 here).
awk '
#! /bin/env awk
# PWLF interpolation function: takes a value x and two arrays for X & Y
function pwlf(x, px, py) {
    # Shortcut to calculate the index of the highest X that is <= x
    p_l = 1+int(x/25)
    p_h = p_l+1
    x_l = px[p_l]
    x_h = px[p_h]
    y_l = py[p_l]
    y_h = py[p_h]
    #print "X=", x, p_l, p_h, x_l, x_h, y_l, y_h
    return y_l+(y_h-y_l)*(x-x_l)/(x_h-x_l)
}
# Read the input into the yy array (it is re-indexed by asort below)
{ yy[n*25] = $1 ; n++ }
# Print the table
END {
    # Sort values of yy (asort is a gawk extension)
    ny = asort(yy) ;
    # Create xx array 0, 25, ..., 100
    for (i=1 ; i<=ny ; i++) xx[i]=25*(i-1)
    # Prepare list of requested results
    ns = split("99 97 95 90 80 70 60 50 40 30 20 10 5", pv)
    for (i=1 ; i<=ns ; i++) printf "%dth %.1f\n", pv[i], pwlf(pv[i], xx, yy) ;
}
' input
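Technically this is a bash script, but based on the comments on the question, it is better to place the whole thing into script.awk and execute it as a one-liner. The solution has a '#!' line to invoke the awk script; for that to work in a standalone file, it has to be the very first line and pass -f to awk (for example #!/usr/bin/awk -f).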
/path/to/script.awk < input
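The same piecewise linear mapping can be sketched in Python, which may be easier to adapt if the number of input values changes. This is only an illustration of the approach described above: the function name percentile is made up, and the sorted values are assumed to sit at evenly spaced percentiles from 0 to 100 (every 25 for five values).

```python
def percentile(values, p):
    """Piecewise linear percentile with knots evenly spaced over 0..100."""
    ys = sorted(values)
    if len(ys) == 1:
        return ys[0]
    step = 100.0 / (len(ys) - 1)          # 25 for five values
    lo = min(int(p / step), len(ys) - 2)  # clamp so p = 100 stays in range
    frac = (p - lo * step) / step         # position between knot lo and lo + 1
    return ys[lo] + (ys[lo + 1] - ys[lo]) * frac


values = [34.5, 32.2, 33.7, 30.4, 31.8]
for p in (99, 97, 95, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5):
    print("%dth %.1f" % (p, percentile(values, p)))
```

The min() clamp is a small departure from the awk shortcut: it keeps the upper knot inside the list when p is exactly 100.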

replacing specific value (from another file) using awk

I have a following file.
File1
a b 1
c d 2
e f 3
File2
x l
y m
z n
I want to replace 1 with x and save the result in file3, then replace 1 with y and save the result in file4.
The files should then look like this:
File3
a b x
c d 2
e f 3
File4
a b y
c d 2
e f 3
Once x, y and z are done, 2 should be replaced by l, m and n in the same way.
I started with this, but it inserts instead of replacing:
awk -v r=1 -v c=3 -v val=x -F, '
BEGIN{OFS=" "}; NR != r; NR == r {$c = val; print}
' file1 >file3
Here's a GNU awk script (GNU awk because it uses multidimensional arrays and relies on array ordering) that will do what you want:
#!/usr/bin/awk -f
BEGIN { fcnt=3 }
FNR==NR { for(i=1;i<=NF;i++) f2[i][NR]=$i; next }
{
    fout[FNR][1] = $0
    ff = $NF
    if(ff in f2) {
        for( r in f2[ff]) {
            $NF = f2[ff][r]
            fout[FNR][fcnt++] = $0
        }
    }
}
END {
    for(f=fcnt-1;f>=3;f--) {
        for( row in fout ) {
            if( fout[row][f] != "" ) out = fout[row][f]
            else out = fout[row][1]
            print out > "file" f
        }
    }
}
I made at least one major assumption about your input data:
The field number in file2 corresponds exactly to the value that needs to be replaced in file1. For example, x is field 1 in file2, and 1 is what needs replacing in the output files.
Here's the breakdown:
Set fcnt=3 in the BEGIN block.
FNR==NR - store the contents of File2 in the f2 array by (field number, line number).
Store the original file1 line in fout as (line number,1) - where 1 is a special, available array position ( because fcnt starts at 3 ).
Save off $NF as ff because it's going to be reset
Whenever ff is a field number in the first subscript of the f2 array, then reset $NF to the value from file2 and then assign the result to fout at (line number, file number) as $0 ( recomputed ).
In the END, loop over the fcnt in reverse order, and either set out to a replaced line value or an original line value in row order, then print out to the desired filename.
It could be run like gawk -f script.awk file2 file1 ( notice the file order ). I get the following output:
$ cat file[3-8]
a b x
c d 2
e f 3
a b y
c d 2
e f 3
a b z
c d 2
e f 3
a b 1
c d l
e f 3
a b 1
c d m
e f 3
a b 1
c d n
e f 3
This could be made more efficient for memory by only performing the lookup in the END block, but I wanted to take advantage of the $0 recompute instead of needing calls to split in the END.
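As a cross-check of the logic, here is a rough Python sketch that builds the same sequence of outputs in memory instead of writing files. It makes the same major assumption as above (the value N in the last column of file1 selects column N of file2). The name make_variants is made up for this illustration.

```python
def make_variants(file1_lines, file2_lines):
    """Return one list of output lines per replacement, in file3, file4, ... order."""
    rows = [line.split() for line in file1_lines]
    table = [line.split() for line in file2_lines]
    width = min(map(len, table), default=0)  # usable columns in file2
    variants = []
    for r, row in enumerate(rows):
        key = row[-1]
        # Assumption: the value N in file1's last column selects column N of file2.
        if not key.isdigit() or not 1 <= int(key) <= width:
            continue
        col = int(key) - 1
        for entry in table:
            changed = [list(x) for x in rows]  # fresh copy of file1 per output
            changed[r][-1] = entry[col]
            variants.append([" ".join(x) for x in changed])
    return variants


file1 = ["a b 1", "c d 2", "e f 3"]
file2 = ["x l", "y m", "z n"]
for number, lines in enumerate(make_variants(file1, file2), start=3):
    print("file%d:" % number, lines)
```

Each inner list corresponds to one output file, so writing them out is just a matter of opening "file" plus the number and joining the lines with newlines.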

awk skipping records. getline command

This is a task related to data compression using the Fibonacci binary representation.
What I have is this text file:
result.txt
a 20
b 18
c 18
d 15
e 7
This file is the result of scanning a text file and counting the appearances of each character in it using awk.
Now I need to give each character the length of its Fibonacci binary representation.
Since I'm new to Ubuntu and the terminal, I first wrote a program in Java that receives a number and prints all the Fibonacci codeword lengths up to that number, and it works.
That is exactly what I'm trying to do here. The problem is that it doesn't work...
The number of Fibonacci codewords of each length also follows the Fibonacci sequence.
These are the rules:
f(1)=1 - there is 1 codeword of length 1.
f(2)=1 - there is 1 codeword of length 2.
f(3)=2 - there are 2 codewords of length 3.
f(4)=3 - there are 3 codewords of length 4.
and so on...
(I'm adding one more bit to each codeword, so the first two lengths will be 2 and 3)
This is the code I've written; its name is scr5:
{
    a=1;
    b=1;
    len=2
    print $1 , $2, len;
    getline;
    print $1 ,$2, len+1;
    getline;
    len=4;
    for(i=1; i< num; i++){
        c= a+b;
        g=c;
        while (c >= 1){
            print $1 ,$2, len ;
            if (getline<=0){
                print "EOF"
                exit;
            }
            c--;
            i++;
        }
        a=b;
        b=c;
        len++;
    }
}
Now I run this in the terminal:
n=5
awk -v num=$n -f scr5 a
and there are two problems:
1. It skips the third letter, c.
2. On the fourth letter, d, it prints the length of the first letter, 2, instead of length 3.
I guess that there is a problem with the getline command.
Thank you very much!
Search Google for getline and awk and you'll mostly find reasons to avoid getline completely! Often it's a sign you're not really doing things the "awk" way. Find an awk tutorial and work through the basics and I'm sure you'll see quickly why your attempt using getlines is not getting you off in the right direction.
In the script below, the BEGIN block is run once at the beginning before any input is read, and then the next block is automatically run once for each line of input --- without any need for getline.
Good luck!
$ cat fib.awk
BEGIN { prior_count = 0; count = 1; len = 1; remaining = count; }
{
    if (remaining == 0) {
        temp = count;
        count += prior_count;
        prior_count = temp;
        remaining = count;
        ++len;
    }
    print $1, $2, len;
    --remaining;
}
$ cat fib.txt
a 20
b 18
c 18
d 15
e 7
f 0
g 0
h 0
i 0
j 0
k 0
l 0
m 0
$ awk -f fib.awk fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
The above solution in compressed form:
mawk 'BEGIN{ ___= __= _^=____=+_ } !_ { __+=(\
____=___+_*(_=___+=____))^!_ } $++NF = (_--<_)+__' fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
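For anyone who wants to check the length sequence outside awk, the counter logic of fib.awk can be sketched in Python as a small generator. The name codeword_lengths is made up for this illustration; it mirrors the prior_count / count / remaining bookkeeping shown above, so lengths start at 1 (add 1 to each value if you want the extra bit mentioned in the question).

```python
def codeword_lengths(n_symbols):
    """Yield one codeword length per symbol: 1, 1, 2, 3, 5, ... symbols per length."""
    prior, count, length = 0, 1, 1
    remaining = count
    for _ in range(n_symbols):
        if remaining == 0:
            # Move to the next length; its symbol count is the next Fibonacci number.
            prior, count = count, count + prior
            remaining = count
            length += 1
        yield length
        remaining -= 1


symbols = "abcdefghijklm"
print(list(zip(symbols, codeword_lengths(len(symbols)))))
```

As in the awk version, no explicit "read the next record" step is needed: the loop advances one symbol per iteration and only the counters decide when the length increases.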

Change column according to previous line with conditions

I have files with the format:
ATOM 3736 CB THR A 486 -6.552 153.891 -7.922 1.00115.15 C
ATOM 3737 OG1 THR A 486 -6.756 154.842 -6.866 1.00114.94 O
ATOM 3738 CG2 THR A 486 -7.867 153.727 -8.636 1.00115.11 C
ATOM 3739 OXT THR A 486 -4.978 151.257 -9.140 1.00115.13 O
HETATM10351 C1 NAG B 203 33.671 87.279 39.456 0.50 90.22 C
HETATM10483 C1 NAG Z 702 28.025 104.269 -27.569 0.50 92.75 C
ATOM 3736 CB THR X 486 -6.552 86.240 7.922 1.00115.15 C
ATOM 3737 OG1 THR X 486 -6.756 85.289 6.866 1.00114.94 O
ATOM 3738 CG2 THR X 486 -7.867 86.404 8.636 1.00115.11 C
ATOM 3739 OXT THR X 486 -4.978 88.874 9.140 1.00115.13 O
HETATM10351 C1 NAG Y 203 33.671 152.852 -39.456 0.50 90.22 C
HETATM10639 C2 FUC C 402 -48.168 162.221 -22.404 0.50103.03 C
For each block of lines starting with HETATM*, I would like to change column 5 to match that of the previous ATOM block. It means that for the first HETATM* block both B and Z will change to A, whereas for the second HETATM* block both Y and C will change to X.
A second question, I do not really need to do it, it is just out of curiosity, how would I split the file after each line starting with HETATM* but only if the next line is ATOM?
Try this:
awk '{
    if( $1 == "ATOM" ) {
        col5=$5;
    }
    else if( match($1,/HETATM[0-9]*/)) {
        $5=col5;
    }
    print
}' < infile
awk '$1=="ATOM"{c=$5}/^HETATM/{ $5=c };1' file
To preserve space, use field separator
awk -F" " '/^ATOM/{c=$5}/^HETATM/{ $5=c };1' file
Here is my solution, which solves the first problem (replacing the fifth field) while preserving white spaces:
$1=="ATOM" {
    fifthField=$5
    # Block to determine which index position field #5 is
    fifthField_index = 1
    for (i = 0; i < 4; i++) {
        # Skip until white space
        for (; substr($0, fifthField_index, 1) != " "; fifthField_index++) { }
        # Skip white spaces
        for (; substr($0, fifthField_index, 1) == " "; fifthField_index++) { }
    }
    print;next
}
/^HETATM/ {
    before_fifthField = substr($0, 1, fifthField_index - 1)
    after_fifthField = substr($0, fifthField_index + 1, length($0))
    print before_fifthField fifthField after_fifthField
    next
}
1
It is not the most elegant solution, but it works. This solution assumes that the fifth field is a single character.
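A related whitespace-preserving idea is to replace the character at a fixed column instead of working with fields. The Python sketch below assumes the real file follows the fixed-width PDB layout, where the chain ID is in column 22 (0-based index 21); the listing in the question may not show the original column alignment, so treat that position as an assumption to check against your file. The names propagate_chain and record are made up for this illustration, and record only exists to build correctly aligned sample lines.

```python
def record(kind, serial, atom, res, chain, resseq):
    """Hypothetical helper: build the first 26 columns of a PDB-style record."""
    return "%-6s%5d %-4s %3s %s%4d" % (kind, serial, atom, res, chain, resseq)


def propagate_chain(lines, chain_col=21):
    """Copy the chain ID of the latest ATOM record onto following HETATM records."""
    # Assumption: fixed-width records with the chain ID at 0-based index chain_col.
    current = None
    out = []
    for line in lines:
        if line.startswith("ATOM") and len(line) > chain_col:
            current = line[chain_col]
        elif (line.startswith("HETATM") and current is not None
              and len(line) > chain_col):
            # Swap exactly one character so every other column stays in place.
            line = line[:chain_col] + current + line[chain_col + 1:]
        out.append(line)
    return out


lines = [
    record("ATOM", 3739, " OXT", "THR", "A", 486),
    record("HETATM", 10351, " C1", "NAG", "B", 203),
    record("HETATM", 10483, " C1", "NAG", "Z", 702),
    record("ATOM", 3736, " CB", "THR", "X", 486),
    record("HETATM", 10351, " C1", "NAG", "Y", 203),
]
for line in propagate_chain(lines):
    print(line)
```

Because it does not split on whitespace, this approach is not affected by records such as HETATM10351, where the record name and serial number run together.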
