I have files with the format:
ATOM 3736 CB THR A 486 -6.552 153.891 -7.922 1.00115.15 C
ATOM 3737 OG1 THR A 486 -6.756 154.842 -6.866 1.00114.94 O
ATOM 3738 CG2 THR A 486 -7.867 153.727 -8.636 1.00115.11 C
ATOM 3739 OXT THR A 486 -4.978 151.257 -9.140 1.00115.13 O
HETATM10351 C1 NAG B 203 33.671 87.279 39.456 0.50 90.22 C
HETATM10483 C1 NAG Z 702 28.025 104.269 -27.569 0.50 92.75 C
ATOM 3736 CB THR X 486 -6.552 86.240 7.922 1.00115.15 C
ATOM 3737 OG1 THR X 486 -6.756 85.289 6.866 1.00114.94 O
ATOM 3738 CG2 THR X 486 -7.867 86.404 8.636 1.00115.11 C
ATOM 3739 OXT THR X 486 -4.978 88.874 9.140 1.00115.13 O
HETATM10351 C1 NAG Y 203 33.671 152.852 -39.456 0.50 90.22 C
HETATM10639 C2 FUC C 402 -48.168 162.221 -22.404 0.50103.03 C
For each block of lines starting with HETATM*, I would like to change column 5 (the chain ID) to match that of the previous ATOM block: for the first HETATM* block both B and Z should change to A, and for the second HETATM* block both Y and C should change to X.
A second question, purely out of curiosity (I do not really need it): how would I split the file after each line starting with HETATM*, but only if the next line starts with ATOM?
Try this:
awk '{
    if ($1 == "ATOM") {
        chain = $5                            # remember the chain ID of the current ATOM block
    }
    else if (match($1, /^HETATM[0-9]*/)) {
        # the record name and serial number are fused here (e.g. HETATM10351),
        # so the chain ID is field 4 rather than field 5
        $4 = chain
    }
    print
}' < infile
A shorter version; note that on the HETATM lines the chain ID is field 4, not field 5, because the record name and the serial number are fused:
awk '$1=="ATOM"{c=$5} /^HETATM/{$4=c} 1' file
Be aware that assigning to a field makes awk rebuild the record with OFS (a single space by default), so the fixed-width spacing of the original lines is lost; -F" " does not change this, since a single space is already the default field separator. To keep the layout, edit the line by character position instead, as in the sketch below and in the next answer.
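For example, a minimal sketch that overwrites the chain character in place rather than reassigning a field, assuming the files follow the standard fixed-width PDB layout where the chain ID is the single character in column 22 of both ATOM and HETATM records:
awk '
/^ATOM/   { chain = substr($0, 22, 1) }                     # remember the chain ID of the ATOM block
/^HETATM/ { $0 = substr($0, 1, 21) chain substr($0, 23) }   # splice it into column 22 of the HETATM line
1' infile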
Here is my solution, which solves the first problem (replacing the fifth field) while preserving white spaces:
$1 == "ATOM" {
    fifthField = $5
    # Block to determine at which character position field #5 starts
    fifthField_index = 1
    for (i = 0; i < 4; i++) {
        # Skip until white space
        for (; substr($0, fifthField_index, 1) != " "; fifthField_index++) { }
        # Skip white spaces
        for (; substr($0, fifthField_index, 1) == " "; fifthField_index++) { }
    }
    print; next
}
/^HETATM/ {
    before_fifthField = substr($0, 1, fifthField_index - 1)
    after_fifthField  = substr($0, fifthField_index + 1, length($0))
    print before_fifthField fifthField after_fifthField
    next
}
1
It is not the most elegant solution, but it works. This solution assumes that the fifth field is a single character.
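If the script above is saved as, say, fix_chain.awk (any file name works), it can be run with:
awk -f fix_chain.awk input.pdb > output.pdb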
Related
How can I replace the numeric genotype code with a DNA letter?
I have a modified VCF file that looks like this:
POS REF ALT A2.bam C10.bam
448 T C 0/0:0,255,255 0/0:0,255,255
2402 C T 1/1:209,23,0 xxx:255,0,255
n...
I want to replace 0/0 with the REF letter and 1/1 with the ALT letter, and delete the string after it (the colon and everything that follows).
It should look like this:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
I have been trying to do it with sed, but it didn't work and I don't know how to approach it.
Would you please try:
awk '{
    if (NR > 1) {
        for (i=4; i<=5; i++) {
            split($i, a, ":")
            $i = a[1]
            if ($i == "0/0") $i = $2
            if ($i == "1/1") $i = $3
        }
    }
    print
}' file.txt
Output:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
The for loop processes the 4th and 5th columns (A2.bam and C10.bam).
First it chops off the substring after ":".
If the remaining value is equal to "0/0", then replace it with the 2nd column (REF).
In case of "1/1", use the 3rd column (ALT).
Hope this helps.
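Should the file ever contain more than two sample columns, the same loop generalizes directly; a minimal sketch, assuming every column from the 4th to the last is a genotype column:
awk 'NR > 1 {
    for (i = 4; i <= NF; i++) {      # visit every sample column
        split($i, a, ":")            # keep only the part before ":"
        $i = a[1]
        if ($i == "0/0") $i = $2     # homozygous reference -> REF letter
        if ($i == "1/1") $i = $3     # homozygous alternate -> ALT letter
    }
}
{ print }' file.txt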
Say I have this input:
num gene Label start end
n1 g1a1 L1 28 40
n1 g1a1 L2 9 42
n1 g1a1 L2 28 90
n1 g1a1 VE 64 209
n1 g1a1 VE 83 377
n1 g1a1 VR 91 377
n1 g1a1 V 378 1516
n1 g1a1 V 475 1613
n1 g1a1 V 1378 2105
n1 g2a1 VE 10209 10590
n1 g2a1 VE 11311 11590
n1 g2a1 VR 11301 11590
n2 g1a2 VE 83 377
n2 g1a2 VR 91 377
n3 g3a1 VR 105200 105801
The expected output:
num gene Label start end
n1 g1a1 L1 28 40
n1 g1a1 L2 28 90
n1 g1a1 VE 83 377
n1 g1a1 VR 91 377
n1 g1a1 V 378 1516
n1 g2a1 VE 11311 11590
n1 g2a1 VR 11301 11590
n2 g1a2 VE 83 377
n2 g1a2 VR 91 377
n3 g3a1 VR 105200 105801
I want to compare two numeric fields ($4 and $5) between row n and row(s) n+p, according to fields $3 and $2.
All start and end positions of labels ($3) are compared to the VR start or end positions by gene ($2) and by number ($1), except label L1.
So for example for n1:
if VR(end) = 377 then:
VE(end) = 377
V(start) = VR(end) + 1
L2(end) = VR(start) - 1
Here is the schema of all labels gathered.
To begin, I tried to write these awk command lines,
using an array with labels as keys to easily retrieve the start and end positions of the corresponding label:
awk '{ f[$3]=$0 ; for (i=1 ; i <= NF ; i++) { print i "\t" $1 "\t" $2 "\t" $3 } }' data.txt
in the current row get elements from the next row:
awk ' NR>0 {print prev "\t" $3 "\t" $4 "\t" $5} {prev = $0}' input
I know how to extract information from columns, but I hardly know where to start writing an awk command line for my comparison problem.
Any help or advice will be highly appreciated
Thanks in advance
You need to do something along these lines:
/[0-9]/ {                                 # data lines only (the header has no digits)
    key = $2 $3
    genes[$2] = 1
    if ($3 == "VR") {                     # remember the reference VR interval per gene
        vr_start[key] = $4
        vr_end[key] = $5
        vr[key] = $0
    } else {                              # store every other label with a per-key counter
        c = ++count[key]
        start[key c] = $4
        end[key c] = $5
        line[key c] = $0
    }
}
END {
    for (g in genes) {
        gstart = vr_start[g "VR"]
        gend = vr_end[g "VR"]
        c = count[g "L1"]
        for (i = 1; i <= c; i++) {        # L1 lines are kept unconditionally
            print line[g "L1" i]
        }
        c = count[g "L2"]
        for (i = 1; i <= c; i++) {        # keep L2 lines that end just before VR starts
            if (end[g "L2" i] == gstart - 1) {
                print line[g "L2" i]
            }
        }
        c = count[g "VE"]
        for (i = 1; i <= c; i++) {        # keep VE lines that share VR's end
            if (end[g "VE" i] == gend) {
                print line[g "VE" i]
            }
        }
        print vr[g "VR"]
        c = count[g "V"]
        for (i = 1; i <= c; i++) {        # keep V lines that start just after VR ends
            if (start[g "V" i] == gend + 1) {
                print line[g "V" i]
            }
        }
    }
}
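If this is saved as, say, filter_labels.awk (a name picked here only for illustration), it can be run as:
awk -f filter_labels.awk data.txt
Note that the header line is not printed and that for (g in genes) visits the genes in an unspecified order, so a header and a sort may still be needed to match the expected listing exactly.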
I have the following files:
File1
a b 1
c d 2
e f 3
File2
x l
y m
z n
I want to replace 1 with x and save the result as file3; next, replace 1 with y and save as file4.
The resulting files would look like:
File3
a b x
c d 2
e f 3
File4
a b y
c d 2
e f 3
Once x, y and z are done, replace 2 with l, m and n in the same way.
I started with the following, but it inserts a field rather than replacing one:
awk -v r=1 -v c=3 -v val=x -F, '
BEGIN{OFS=" "}; NR != r; NR == r {$c = val; print}
' file1 >file3
Here's a gnu awk script ( because it uses multidimensional arrays, array ordering ) that will do what you want:
#!/usr/bin/gawk -f
BEGIN { fcnt=3 }
FNR==NR { for(i=1;i<=NF;i++) f2[i][NR]=$i; next }    # store file2 as (field number, line number)
{
    fout[FNR][1] = $0                                # keep the original file1 line in slot 1
    ff = $NF
    if(ff in f2) {
        for( r in f2[ff] ) {
            $NF = f2[ff][r]                          # substitute the last field
            fout[FNR][fcnt++] = $0                   # save the rebuilt line under the next output file number
        }
    }
}
END {
    for(f=fcnt-1; f>=3; f--) {
        for( row in fout ) {
            if( fout[row][f] != "" ) out = fout[row][f]
            else out = fout[row][1]
            print out > "file" f
        }
    }
}
I made at least one major assumption about your input data:
The field number in file2 corresponds exactly to the value that needs to be replaced in file1. For example, x is field 1 in file2, and 1 is what needs replacing in the output files.
Here's the breakdown:
Set fcnt=3 in the BEGIN block.
FNR==NR - store the contents of File2 in the f2 array by (field number, line number).
Store the original f1 line in fout as (line number,1) - where 1 is a special, available array position ( because fcnt starts at 3 ).
Save off $NF as ff because it's going to be reset
Whenever ff is a field number in the first subscript of the f2 array, then reset $NF to the value from file2 and then assign the result to fout at (line number, file number) as $0 ( recomputed ).
In the END, loop over the fcnt in reverse order, and either set out to a replaced line value or an original line value in row order, then print out to the desired filename.
It could be run like gawk -f script.awk file2 file1 ( notice the file order ). I get the following output:
$ cat file[3-8]
a b x
c d 2
e f 3
a b y
c d 2
e f 3
a b z
c d 2
e f 3
a b 1
c d l
e f 3
a b 1
c d m
e f 3
a b 1
c d n
e f 3
This could be made more efficient for memory by only performing the lookup in the END block, but I wanted to take advantage of the $0 recompute instead of needing calls to split in the END.
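A rough sketch of that more memory-friendly variant, under the same assumptions about the input as above (the substitution is deferred to the END block, at the cost of split() calls there):
#!/usr/bin/gawk -f
FNR==NR { for (i = 1; i <= NF; i++) f2[i][FNR] = $i; next }   # file2 as (field number, row)
{ line[FNR] = $0; nlines = FNR }                              # keep file1 verbatim
END {
    fcnt = 3                                   # first output file is file3
    for (ff = 1; ff in f2; ff++) {             # each file2 column is one replacement target
        for (r = 1; r in f2[ff]; r++) {        # each file2 row produces one output file
            fname = "file" fcnt++
            for (row = 1; row <= nlines; row++) {
                n = split(line[row], a, " ")
                if (a[n] == ff) a[n] = f2[ff][r]     # replace the last field when it matches
                out = a[1]
                for (i = 2; i <= n; i++) out = out " " a[i]
                print out > fname
            }
            close(fname)
        }
    }
}
It would be run the same way, gawk -f script.awk file2 file1.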
I have a sequencing datafile containing base pair locations from the genome, that looks like the following example:
chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5
I would like to compare certain groups defined by the location of the bp found in column 2. I then want the average of the numbers in column 5 of the matching regions.
So, using the example above, let's say I am looking for the average of the 5th column for all samples spanning chr1 810-820 and chr2 310-330. The first five rows should be identified and their 5th-column numbers averaged, which equals 0.42.
I tried creating an array of ranges and then using awk to call these locations, but have been unsuccessful. Thanks in advance.
import pandas as pd
from StringIO import StringIO
s = """chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5"""
sio = StringIO(s)
df = pd.read_table(sio, sep=" ", header=None)
df.columns=["a", "b", "c", "d", "e"]
# The query expression is intuitive
r = df.query("(a=='chr1' & 810<b<820) | (a=='chr2' & 310<b<330)")
print r["e"].mean()
pandas might be better for such tabular data processing, and it's python.
Here's some python code to do what you are asking for. It assumes that your data lives in a text file called 'data.txt'
#!/usr/bin/env python
data = open('data.txt').readlines()

def avg(keys):
    key_sum = 0
    key_count = 0
    for item in data:
        fields = item.split()
        krange = keys.get(fields[0], None)
        if krange:
            r = int(fields[1])
            if krange[0] <= r and r <= krange[1]:
                key_sum += float(fields[-1])
                key_count += 1
    print key_sum/key_count

keys = {} # Create dict to store keys and ranges of interest
keys['chr1'] = (810, 820)
keys['chr2'] = (310, 330)
avg(keys)
Sample Output:
0.42
Here's an awk script answer. For input, I created a 2nd file which I called ranges:
chr1 810 820
chr2 310 330
The script itself looks like:
#!/usr/bin/awk -f
FNR==NR { low_r[$1] = $2; high_r[$1] = $3; next }
{ l = low_r[ $1 ]; h = high_r[$1]; if( l=="" ) next }
$2 >= l && $2 <= h { total+=$5; cnt++ }
END {
    if( cnt > 0 ) print (total/cnt)
    else print "no matched data"
}
Where the breakdown is like:
FNR==NR - absorb the ranges file, making a low_r and high_r array keyed off of the first column in that file.
Then for every row in the data, lookup matches in the low_r and high_r array. If there's no match, then skip any other processing
Check an inclusive range based on low and high testing, incrementing total and cnt for matched ranges.
At the END, print the simple average when there were matches
When the script (called script.awk) is made executable it can be run like:
$ ./script.awk ranges data
0.42
where I've called the data file data.
I have files with the following format:
ATOM 3736 CB THR A 486 -6.552 153.891 -7.922 1.00115.15 C
ATOM 3737 OG1 THR A 486 -6.756 154.842 -6.866 1.00114.94 O
ATOM 3738 CG2 THR A 486 -7.867 153.727 -8.636 1.00115.11 C
ATOM 3739 OXT THR A 486 -4.978 151.257 -9.140 1.00115.13 O
HETATM10351 C1 NAG A 203 33.671 87.279 39.456 0.50 90.22 C
HETATM10483 C1 NAG A 702 28.025 104.269 -27.569 0.50 92.75 C
ATOM 3736 CB THR B 486 -6.552 86.240 7.922 1.00115.15 C
ATOM 3737 OG1 THR B 486 -6.756 85.289 6.866 1.00114.94 O
ATOM 3738 CG2 THR B 486 -7.867 86.404 8.636 1.00115.11 C
ATOM 3739 OXT THR B 486 -4.978 88.874 9.140 1.00115.13 O
HETATM10351 C1 NAG B 203 33.671 152.852 -39.456 0.50 90.22 C
HETATM10639 C2 FUC B 402 -48.168 162.221 -22.404 0.50103.03 C
I would like to split the file after each line starting with HETATM* but only if the next line starts with ATOM. I would like the new files to be called $basename_$column, where $basename is the base name of the input file and $column is the character at position 22-23 (either A or B, in the example). I am not able to figure out how to check both consecutive lines to determine the splitting point.
Here's an awk version
awk 'NR==1{n=$5}/HETATM/{f=1}f && /^ATOM/{n=$5;f=0}{print > "file"n".txt"}' file
Use FILENAME instead of the literal string "file" to build the output names from the name of the input file.
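A slightly longer sketch of the same idea that builds the requested $basename_$column names; it takes the chain letter from field 5 of the ATOM lines, as the one-liner above does, and strips any directory and extension from FILENAME:
awk '
FNR == 1 {                                   # new input file: derive its base name once
    base = FILENAME
    sub(/.*\//, "", base)                    # drop any leading directory
    sub(/\.[^.]*$/, "", base)                # drop the extension
    chain = $5                               # chain ID of the first ATOM block
    het = 0
}
het && /^ATOM/ { chain = $5 }                # first ATOM line after a HETATM starts a new chunk
{ het = /^HETATM/ }                          # remember whether this line was a HETATM
{ print > (base "_" chain) }
' input.pdb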
Here's a simple Python solution with no error checking. Should work in Python 2 or 3; change the first line to match your environment. Don't take this as an example of good coding style.
Edited for unique file names.
#!/usr/bin/env python2.4
import os.path
import sys

fname = sys.argv[1]
bname = os.path.basename(fname)
fin = open(fname)
fout = None
flag = False   # becomes True once a HETATM line has been seen in the current chunk
ct = 0
for line in fin:
    if line[:6] == 'HETATM':
        flag = True
    if (not fout) or (flag and line[:4] == 'ATOM'):
        if fout:
            fout.close()
        ct += 1
        fout = open(bname + '_' + line[21:22] + str(ct), 'w')
        flag = False
    fout.write(line)
fout.close()
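Assuming the script is saved as, say, split_pdb.py (a name chosen here just for illustration), it is invoked with the PDB file as its only argument:
python split_pdb.py 1abc.pdb
For a fixed-width PDB input named 1abc.pdb, this writes chunks with names like 1abc.pdb_A1 and 1abc.pdb_B2.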