Obtain fraction of gene covered from bam file using bed file - bam

I am trying to calculate the gene coverage of specific genes in a BAM file. I have a list of genes with their start and end positions in a BED file. I would essentially like to know the number of overlaps for each gene and how well it is covered.
My bed file looks like this:
track name="tb_ncbiRefSeqCurated" description="table browser query on ncbiRefSeqCurated" visibility=3 url=
chr10 100042192 100081869 NM_001308.3 CPN1
chr10 100150093 100186029 NM_006459.4 ERLIN1
chr10 100188299 100229596 NM_001278.5 CHUK
chr10 100232297 100267638 NM_001303405.2 CWF19L1
chr10 100523728 100529923 NM_001284368.1 NDUFB8
chr10 100523739 100529871 NM_001284367.2 NDUFB8
chr10 100735395 100747437 NM_001374303.1 PAX2
chr10 100735395 100829944 NM_001304569.2 PAX2
I have tried to convert the BAM file to a BED file and then used bedmap to obtain this, but I don't get the fraction of coverage. This is the code I used:
bam2bed <../../bam_files/NA12878-200304612B-v15-450-Rep10_35x.bam | bedmap --delim '\t' --echo --count --bases-uniq --bases-uniq-f --echo-ref-size --mean ../../../reference_genes/5_selectedCols_mendel_genes - > test_bed
I get an output that looks like this:
chr10 100150093 100186029 NM_006459.4 ERLIN1 0 0 0.000000 35936 NAN
chr10 100188299 100229596 NM_001278.5 CHUK 0 0 0.000000 41297 NAN
chr10 100232297 100267638 NM_001303405.2 CWF19L1 0 0 0.000000 35341 NAN
chr10 100523728 100529923 NM_001284368.1 NDUFB8 0 0 0.000000 6195 NAN
chr10 100523739 100529871 NM_001284367.2 NDUFB8 0 0 0.000000 6132 NAN
chr10 100735395 100747437 NM_001374303.1 PAX2 0 0 0.000000 12042 NAN
chr10 100735395 100829944 NM_001304569.2 PAX2 0 0 0.000000 94549 NAN
chr10 100745581 100829944 NM_003989.5 PAX2 0 0 0.000000 84363 NAN
The counts and the fraction of bases are not calculated.
Could you help me find a better solution? I am sure this has been done before, but I just can't find it.
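To make the requested numbers concrete, here is a minimal awk sketch of the arithmetic only: for each gene it counts overlapping reads, the unique bases they cover, and the covered fraction. The file names, gene names, and coordinates are hypothetical toy data, the reads are assumed to be sorted by start within each chromosome, and every read is held in memory, so this is not suitable as-is for a 35x BAM.

```shell
tmpdir=$(mktemp -d)
# Hypothetical gene intervals (BED, 0-based half-open) with a track header line.
cat > "$tmpdir/genes.bed" <<'EOF'
track name="example"
chr10 100 200 NM_A GENE_A
chr10 300 400 NM_B GENE_B
EOF
# Hypothetical read intervals, assumed sorted by chromosome and start.
cat > "$tmpdir/reads.bed" <<'EOF'
chr10 90 120
chr10 110 150
chr10 180 260
chr11 100 200
EOF

coverage=$(awk '
    FNR == NR {                              # first file: reads
        n = ++cnt[$1]
        s[$1, n] = $2 + 0; e[$1, n] = $3 + 0
        next
    }
    /^track/ { next }                        # skip the BED track header
    {
        gs = $2 + 0; ge = $3 + 0
        hits = 0; covered = 0; last = gs     # last = end of the bases counted so far
        for (i = 1; i <= cnt[$1]; i++) {
            a = s[$1, i]; b = e[$1, i]
            if (b <= gs || a >= ge) continue # read does not overlap the gene
            hits++
            if (a < last) a = last           # clip to gene start and to counted bases
            if (b > ge) b = ge               # clip to gene end
            if (b > a) { covered += b - a; last = b }
        }
        frac = (ge > gs) ? covered / (ge - gs) : 0
        # columns: gene, overlapping reads, unique covered bases, covered fraction
        printf "%s %d %d %.4f\n", $5, hits, covered, frac
    }
' "$tmpdir/reads.bed" "$tmpdir/genes.bed")
printf '%s\n' "$coverage"
rm -r "$tmpdir"
```

In the toy data GENE_A is overlapped by three reads covering 70 of its 100 bases, and GENE_B by none.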


bash/sed: producing 2D bar plots from multi-column data

I am dealing with the analysis of multi-column data organized in the following manner:
#Acceptor DonorH Donor Frames Frac AvgDist AvgAng
lig_608#O1 GLU_166#H GLU_166#N 13731 0.6865 2.8609 160.4598
lig_608#O2 HIE_163#HE2 HIE_163#NE2 8320 0.4160 2.8412 150.3665
lig_608#N2 ASN_142#HD21 ASN_142#ND2 1575 0.0788 2.9141 157.3493
lig_608#N2 THR_25#HG1 THR_25#OG1 218 0.0109 2.8567 156.0376
lig_608#O1 GLN_189#HE22 GLN_189#NE2 72 0.0036 2.8427 157.3778
lig_608#N2 ASN_142#HD22 ASN_142#ND2 43 0.0022 2.9093 165.3063
lig_608#N2 SER_46#HG SER_46#OG 32 0.0016 2.8710 159.8673
lig_608#F1 HIE_41#HE2 HIE_41#NE2 31 0.0015 2.8904 153.0763
lig_608#O2 SER_144#HG SER_144#OG 20 0.0010 2.8147 144.6951
lig_608#N2 THR_24#HG1 THR_24#OG1 16 0.0008 2.8590 165.3937
lig_608#O2 GLY_143#H GLY_143#N 15 0.0008 2.8729 149.1930
lig_608#F1 GLN_189#HE22 GLN_189#NE2 15 0.0008 2.9192 146.2273
lig_608#O2 SER_144#H SER_144#N 10 0.0005 2.9259 148.8008
lig_608#N2 THR_26#H THR_26#N 8 0.0004 2.9491 149.1861
lig_608#O2 GLU_166#H GLU_166#N 4 0.0002 2.8839 150.1238
lig_608#N2 GLN_189#HE21 GLN_189#NE2 3 0.0001 2.9567 153.7993
lig_608#N2 ASN_119#HD21 ASN_119#ND2 2 0.0001 2.8564 147.7916
lig_608#O2 CYS_145#H CYS_145#N 2 0.0001 2.8867 151.6423
lig_608#O1 GLN_189#HE21 GLN_189#NE2 2 0.0001 2.8888 148.3678
lig_608#N2 GLY_143#H GLY_143#N 2 0.0001 2.9658 149.2518
lig_608#F1 GLN_189#HE21 GLN_189#NE2 1 0.0001 2.8675 139.9754
lig_608#F1 GLN_189#H GLN_189#N 1 0.0001 2.8987 168.1758
lig_608#N2 HIE_41#HE2 HIE_41#NE2 1 0.0001 2.9411 147.0443
From this I need to take the third column (Donor) and the fifth column (Frac) and plot a 2D histogram of the data, keeping only rows whose fifth-column value is bigger than 0.01. So in the demonstrated example, only the following data should be considered:
#Donor #Frac
GLU_166#N 0.6865
HIE_163#NE2 0.4160
ASN_142#ND2 0.0788
THR_25#OG1 0.0109
and the 2D histogram should plot # Donor on X and #Frac on Y (in %)
Previously I had to add the following lines to the reduced two-column data file so that gracebat would recognize it as a bar plot:
# title "No tittle"
# xaxis label "Donor"
# yaxis label "Frac"
#s0 line type 0
#TYPE bar
# here is the data in 2 column format
Is it possible to automate this post-processing so that the bar plot is produced on the fly? Alternatively, I would be grateful for a sed solution that edits the data file on the fly, reducing it to two columns and inserting at the beginning the # lines required for bar-graph plotting, using:
sed -i 's/old-text/new-text/g' datafile
sed isn't meant for this kind of task, you should use awk:
awk '
BEGIN {
print "# title \"No title\""
print "# xaxis label \"Donor\""
print "# yaxis label \"Frac\""
print "#s0 line type 0"
print "#TYPE bar"
}
NR > 1 && $5 > 0.01 { print $3, $5 }
' file.txt
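The question asks for Frac on the Y axis in percent, which the filter above does not convert. A minimal sketch of that variant, assuming the same column layout; the file name hbonds.dat and the two-decimal output format are illustrative choices, not part of the original answer:

```shell
# Hypothetical sample input (first rows of the table from the question).
tmpdir=$(mktemp -d)
cat > "$tmpdir/hbonds.dat" <<'EOF'
#Acceptor DonorH Donor Frames Frac AvgDist AvgAng
lig_608#O1 GLU_166#H GLU_166#N 13731 0.6865 2.8609 160.4598
lig_608#O2 HIE_163#HE2 HIE_163#NE2 8320 0.4160 2.8412 150.3665
lig_608#N2 ASN_142#HD21 ASN_142#ND2 1575 0.0788 2.9141 157.3493
lig_608#N2 THR_25#HG1 THR_25#OG1 218 0.0109 2.8567 156.0376
lig_608#O1 GLN_189#HE22 GLN_189#NE2 72 0.0036 2.8427 157.3778
EOF

# Skip comment lines, keep rows with Frac > 0.01,
# and print the donor with Frac scaled to a percentage.
percent_rows=$(awk '!/^#/ && $5 > 0.01 { printf "%s %.2f\n", $3, $5 * 100 }' "$tmpdir/hbonds.dat")
printf '%s\n' "$percent_rows"
rm -r "$tmpdir"
```

Skipping lines that start with # instead of testing NR > 1 also tolerates extra comment lines in the input.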
You could also do this with an on-the-fly generated Gnuplot script, e.g.:
cat <<EOS | gnuplot > output.png
set term pngcairo size 1280,960
set xtics noenhanced
set xlabel "Donor"
set ylabel "Frac"
set key off
set style fill solid 0.5
set boxwidth 0.9
plot "<awk 'NR == 1 || \$5 > 0.01' infile.tsv" using 0:5:xtic(3) with boxes
EOS
This writes the bar chart to output.png.

Replacing the value of specific field in a table-like string stored as bash variable

I am looking for a way to replace (with 0) a specific value (1043252782) in a "table-like" string stored as a bash variable. The output of echo "$var" looks like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 1043252782
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
After the replacement echo "$var" should look like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
Is there a way to do this without saving the content of $var to a file and directly manipulating it within the bash (shell script)?
Maybe with awk? I can select the value in the 10th field of the second record with awk and pattern matching ("7 Seek_Error_Rate ....") like this:
echo "$var" | awk '/^ 7/{print $10}'
Maybe there is some way to do the replacement with awk (or another CLI tool) and store the result back into $var? Also, the value changes over time, but the structure remains the same (some record at the 10th field).
You can change a specific string directly in the shell:
var=${var/1043252782/0}
To replace final number of second line, you could use awk or sed:
var=$(awk 'NR==2 { sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '2s/[0-9][0-9]*$/0/' <<<"$var")
If you don't know which line it will be, you can match a known string:
var=$(awk '/Seek_Error_Rate/{ sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '/Seek_Error_Rate/s/[0-9][0-9]*$/0/' <<<"$var")
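As a quick check of the idea above, here is a self-contained sketch on three sample rows. It compares field 2 exactly instead of searching the whole line, and pipes through printf instead of a here-string so it also runs in shells without <<<; both are illustrative choices, not part of the original answers.

```shell
# Sample rows from the question, with the leading padding kept.
var=$(printf '%s\n' \
    '  5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0' \
    '  7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 1043252782' \
    ' 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0')

# Match on the attribute name in field 2, then zero only the trailing number.
# sub() edits $0 in place, so the original spacing is preserved.
var=$(printf '%s\n' "$var" | awk '$2 == "Seek_Error_Rate" { sub(/[0-9]+$/, "0") } 1')
printf '%s\n' "$var"
```

Because the match is on the attribute name, this keeps working when the raw value changes over time.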
You can use a here-string to feed the variable as input to awk.
Use sub() to perform a regular expression replacement.
var=$(awk '{sub(/1043252782$/, "0")}1' <<<"$var")
Using sed
$ var=$(sed '/1043252782$/s//0/' <<< "$var")
$ echo "$var"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
If you don't want to ruin the formatting of tabs and spaces (mawk or gawk):
awk NF=NF FS=' 1043252782$' OFS=' 0' <<< "$var"
Output:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
or doing the whole file in one single shot :
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS='^$' ORS=
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS= -- (This might work too but I'm not too well versed in any side effects for blank RS)

Bash/Awk Compare two files, print value when it is between coordinates else print 0

I have two files. If the column "chromosome" matches between the two files and the position of File1 is between the Start_position and the End_position of File2, I would like to associate the two cell_frac values. If the Gene (chromosome + position) is not present in File2, I would like both cell_frac values to be equal to 0.
File1:
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene2 1 222222
Gene3 2 333333
Gene4 2 333337
File2:
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
3 444440 444450 0.01 0.01
Desired output:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Edit: Here is the beginning of the code I used for now (not correct output):
awk '
NR==FNR{ range[$1,$2,$3,$4,$5]; next }
FNR==1
{
for(x in range) {
split(x, check, SUBSEP);
if($2==check[1] && $3>=check[2] && $3<=check[3]) { print $1"\t"$2"\t"$3"\t"check[4]"\t"check[5]}
}
}
' File2 File1
However, I did not manage to associate a 0 (with "else") when the gene was not present, and I get the wrong number of lines. Can you give me more hints?
Thanks a lot.
One awk-only idea ...
NOTE: see my other answer for assumptions/understandings and my version of file1
awk '                                   # process file2
FNR==NR { c=$1                          # save chromosome value
          $1=""                         # clear field #1
          file2[c]=$0                   # use chromosome as array index; save line in array
          next
        }
                                        # process file1
        { startpos=endpos=-9999999999   # default values in case
          a1=a2=""                      # no match in file2
          if ($2 in file2) {            # if chromosome also in file2
             split(file2[$2],arr)       # split file2 data into array arr[]
             startpos =arr[1]
             endpos   =arr[2]
             a1       =arr[3]
             a2       =arr[4]
          }
          # if not the header row and file1/position outside of file2/range then set a1=a2=0
          if (FNR>1 && ($3 < startpos || $3 > endpos)) a1=a2=0
          print $0,a1,a2
        }
' file2 file1
This generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene1 1 111111 0 0
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
Changing the last line to ' file2 file1 | column -t generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene1 1 111111 0 0
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
Presorting file1 by Chromosome and Position by changing last line to ' file2 <(head -1 file1; tail -n +2 file1 | sort -k2,3) | column -t generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
One big issue (same as with my other answer) ... the actual code may become unwieldy when dealing with 519 total columns, especially if there's a need to intersperse a lot of columns; otherwise OP may be able to use some for loops to more easily print ranges of columns.
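The awk answer above keeps one range per chromosome, so a later file2 line for the same chromosome overwrites the earlier one. A minimal sketch of a variant that stores every range, assuming whitespace-separated files with a single header line each; the extra range on chromosome 1 is made-up sample data:

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1" <<'EOF'
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene2 1 222222
Gene3 2 333333
Gene5 4 444567
EOF
# Hypothetical file2 with two ranges on chromosome 1.
cat > "$tmpdir/file2" <<'EOF'
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 111000 111050 0.50 0.60
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
EOF

result=$(awk '
    FNR == NR {                          # first file: file2
        if (FNR == 1) { hdr = $4 OFS $5; next }
        n = ++count[$1]                  # ranges seen so far for this chromosome
        lo[$1, n] = $2 + 0; hi[$1, n] = $3 + 0
        val[$1, n] = $4 OFS $5
        next
    }
    FNR == 1 { print $0, hdr; next }     # file1 header
    {
        out = "0" OFS "0"                # default: no range contains the position
        for (i = 1; i <= count[$2]; i++)
            if ($3 + 0 >= lo[$2, i] && $3 + 0 <= hi[$2, i]) { out = val[$2, i]; break }
        print $0, out
    }
' "$tmpdir/file2" "$tmpdir/file1")
printf '%s\n' "$result"
rm -r "$tmpdir"
```

The first matching range wins; output order follows file1, so no presorting is needed.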
A job for sql instead of awk, perhaps?
tr -s ' ' '|' <File1 >file1.csv
tr -s ' ' '|' <File2 >file2.csv
(
echo 'Hugo_Symbol|Chromosome|Position|cell_frac_A1|cell_frac_A2'
sqlite3 <<'EOD'
.import file1.csv t1
.import file2.csv t2
select distinct
t1.hugo_symbol,
t1.chromosome,
t1.position,
case
when t1.position between t2.start_position and t2.end_position
then t2.cell_frac_a1
else 0
end,
case
when t1.position between t2.start_position and t2.end_position
then t2.cell_frac_a2
else 0
end
from t1 join t2 on t1.chromosome=t2.chromosome;
EOD
rm file[12].csv
) | tr '|' '\t'
Assumptions and/or understandings based on sample data and OP comments ...
file1 is not sorted by chromosome
file2 is sorted by chromosome
common headers in both files are spelled the same (eg, file1:Chromosome vs file2:Chromosom)
if a chromosome exists in file1 but does not exist in file2 then we keep the line from file1 and the columns from file2 are left blank
both files are relatively small (file1: 5MB, 900 lines; file2: few KB, 50 lines)
NOTE: the number of columns (file1: 500 columns; file2: 20 columns) could be problematic from the point of view of cumbersome coding ... more on that later ...
Sample inputs:
$ cat file1 # scrambled chromosome order; added chromosome=4 line
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene3 2 333333
Gene2 1 222222
Gene4 2 333337
Gene5 4 444567 # has no match in file2
$ cat file2
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
3 444440 444450 0.01 0.01
First issue is to sort file1 by Chromosome and Position and also keep the header line in place:
$ (head -1 file1; tail -n +2 file1 | sort -k2,3)
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene2 1 222222
Gene3 2 333333
Gene4 2 333337
Gene5 4 444567
We can now join the 2 files based on the Chromosome column:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2
Chromosome Hugo_Symbol Position Start_Position End_Position cell_frac_A1 cell_frac_A2
1 Gene1 111111 222220 222230 0.12 0.01
1 Gene2 222222 222220 222230 0.12 0.01
2 Gene3 333333 333330 333340 0.03 0.25
2 Gene4 333337 333330 333340 0.03 0.25
4 Gene5 444567
Where:
-1 2 -2 1 - join on Chromosome columns: -1 2 == file #1 column #2; -2 1 == file #2 column #1
-a 1 - also print unpairable lines from file #1 (sorted file1), ie, lines with no matching Chromosome in file2
--nocheck-order - disable verifying input is sorted by join column; optional; may be needed if a locale thinks 1 should be sorted before Chromosome
NOTE: for the sample inputs/outputs we don't need a special output format so we can skip the -o option, but this may be needed depending on OP's output requirements for 519 total columns (but it may also become unwieldy)
From here OP can use bash or awk to do comparisons (is column #3 between columns #4/#5); one awk idea:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2 | awk 'FNR>1{if ($3<$4 || $3>$5) $6=$7=0} {print $2,$1,$3,$6,$7}'
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0 # Position outside of range
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0 # no match in file2; if there were other columns from file2 they would be empty
And to match OP's sample output (appears to be a fixed width requirement) we can pass this to column:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2 | awk 'FNR>1{if ($3<$4 || $3>$5) $6=$7=0} {print $2,$1,$3,$6,$7}' | column -t
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
NOTE: Keep in mind this may be untenable with OP's 519 total columns, especially if interspersed columns contain blanks/white-space (ie, column -t may not parse the input properly)
Issues (in addition to incorrect assumptions and previous NOTES):
for relatively small files the performance of the join | awk | column should be sufficient
for larger files all of this code can be rolled into a single awk solution though memory usage could be an issue on a small machine (eg, one awk idea would be to load file2 into memory via arrays so memory would need to be large enough to hold all of file2 ... probably not an issue unless file2 gets to be 100's/1000's of MBytes in size)
for 519 total columns the awk/print will get unwieldy especially if there's a need to move/intersperse a lot of columns

Sort a file to put 10, 11, 12... before 1, 2, 3... and X,Y

I have a list of chromosome data with the columns (chromosome, start, and end) like this:
chr1 6252071 6253740
chr1 6965107 6966070
chr1 6966038 6967016
chr1 7066595 7068694
chr1 7100956 7102296
chr1 7153422 7154635
chr1 7155112 7156181
....
chr2
....
chr10
....
chrX
....
chrY
....
etc.
I am trying to use bash to sort the chromosome sections to this order:
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY
in the first column, and then in numerical order by start position in the second column, but no variation of sort seems to do the job. Any ideas? Thanks.
Split your file into two streams with separate filtering, then recombine them:
cat <(grep '^chr1[[:digit:]][[:space:]]' <inputfile | sort) \
<(grep -v '^chr1[[:digit:]][[:space:]]' <inputfile | sort) \
>outputfile
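The two-stream approach above sorts each stream as plain text, so start positions are not ordered numerically. A minimal sketch of a single-pipeline variant that also sorts by numeric start, assuming whitespace-separated input and only the chromosome names listed in the question (names such as chr20 would sort between chr2 and chr3); the sample rows and the file name regions.bed are made up:

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/regions.bed" <<'EOF'
chrX 500 600
chr2 300 400
chr10 7066595 7068694
chr1 7100956 7102296
chr10 6252071 6253740
chr1 6965107 6966070
chrM 10 20
EOF

# Prefix each line with a rank: 0 for chr10..chr19, 1 for everything else.
# Sort by rank, then chromosome name, then numeric start; finally drop the rank.
sorted=$(awk '{ print ($1 ~ /^chr1[0-9]$/ ? 0 : 1), $0 }' "$tmpdir/regions.bed" |
    LC_ALL=C sort -k1,1n -k2,2 -k3,3n |
    cut -d' ' -f2-)
printf '%s\n' "$sorted"
rm -r "$tmpdir"
```

LC_ALL=C forces plain byte order for the name comparison, which places chrM, chrX and chrY after the numbered chromosomes.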
perl -E '
open $f, "<", shift;
say join "",
map {$_->[0]}
sort {length($b->[1]) <=> length($a->[1]) or $a->[1] cmp $b->[1]}
map {[$_, (split)[0]]}
<$f>
' file
It first opens the file.
Then it uses a Schwartzian transform; read the statement from the bottom up:
read the lines: <$f>
transform the lines to a list of pairs: the original line, and the first word:
map {[$_, (split)[0]]}
sort, first by length (longest to shortest), then lexically (A to Z)
transform the list of pairs to a list of lines (the first element of the pair)
map {$_->[0]}
join (the lines still have their newlines, so join on the empty string)

KSH remove digits from number precision

I am trying to compare 2 log files in a ksh script that look like
log1:
10100 951 5 20150318 20150430
10101 11950 0 20150323 20150630
10102 285933 1 20150128 20150430
10041 57007 3.53 20150128 20150430
log2
10100 951 5.0000 20150318 20150430
10101 11950 0.0000 20150323 20150630
10102 285933 1.0000 20150128 20150430
10041 57007 3.5300 20150128 20150430
Log1 on column 3 has maximum 2 digits after the . (eg: 3.53)
Log2 on column 3 always has 4 digits after the . (eg: 0.0000 or 3.5300)
How could I add some digits after . for the first log or remove the digits in log2 in order to be able to compare them line for line?
My script is written in ksh.
You should format the value with printf:
cat log1 | while read col1 col2 col3 col4 col5; do
printf "%d %d %.4f %d %d\n" ${col1} ${col2} ${col3} ${col4} ${col5}
done > log1.converted
The code above is easy to read, but it makes an unnecessary call to cat (a "useless use of cat", UUOC).
A better way to write this is:
while read col1 col2 col3 col4 col5; do
printf "%d %d %.4f %d %d\n" ${col1} ${col2} ${col3} ${col4} ${col5}
done < log1 > log1.converted
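An alternative sketch that does the same normalization with awk and then compares the files directly. It assumes both logs are single-space separated, as in the samples; the temporary paths and the status variable are illustrative:

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/log1" <<'EOF'
10100 951 5 20150318 20150430
10041 57007 3.53 20150128 20150430
EOF
cat > "$tmpdir/log2" <<'EOF'
10100 951 5.0000 20150318 20150430
10041 57007 3.5300 20150128 20150430
EOF

# Rewrite column 3 of log1 with four decimals, then compare against log2.
# Assigning to $3 makes awk rebuild the line with single spaces between fields.
awk '{ $3 = sprintf("%.4f", $3); print }' "$tmpdir/log1" > "$tmpdir/log1.converted"
if cmp -s "$tmpdir/log1.converted" "$tmpdir/log2"; then
    status="identical"
else
    status="different"
fi
echo "$status"
rm -r "$tmpdir"
```

Replacing cmp -s with diff would list the differing lines instead of only reporting whether any exist.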
