BWA can't find the local index file - bioinformatics

I'm trying to use a .fasta file as a reference genome in bwa so that I can map my reads to it. However, I am getting this error:
[E::bwa_idx_load_from_disk] fail to locate the index files
Any help? Here is my code:
#!/bin/bash
source /opt/asn/etc/asn-bash-profiles-special/modules.sh
module load fastqc/0.10.1
module load fastx/0.0.13
source /opt/asn/etc/asn-bash-profiles-special/modules.sh
module load pear/0.9.10
source /opt/asn/etc/asn-bash-profiles-special/modules.sh
module load fastqc/0.10.1
module load fastx/0.0.13
module load bwa/0.7.12
module load samtools/1.2
source /opt/asn/etc/asn-bash-profiles-special/modules.sh
module load trimmomatic/0.35
r=20
####mapping
#Indexing reference library for BWA mapping:
bwa index -a is ~/gz_files/sample_things/fungiref.fa fungiref
bwa mem fungiref sample${r}_clipped_paired.assembled.fastq > sample${r}.sam
#sort and convert to bam
samtools view -bS sample${r}.sam | samtools sort - sample${r}_sorted
#counts and stats
samtools index sample${r}_sorted.bam
samtools idxstats sample${r}_sorted.bam > ${r}_counts.txt

The usage for bwa index is
bwa index [-p prefix] [-a algoType] <in.db.fasta>
Your usage doesn’t match this; it’s a shame that bwa silently accepts this instead of immediately throwing an error. The trailing fungiref is not treated as an index name. If you want to choose the index prefix (and with it the index location), you have to pass it with -p, as the usage line shows.
At any rate, since you did not use -p, the index filename is derived from the FASTA reference file. As a consequence you need to adjust your index filename in the subsequent command:
bwa mem ~/gz_files/sample_things/fungiref.fa sample${r}_clipped_paired.assembled.fastq > sample${r}.sam
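Before running bwa mem, it can help to confirm that the index files really sit where the prefix points. The following is a minimal sketch in Python, assuming the classic bwa 0.7.x index layout (files ending in .amb, .ann, .bwt, .pac and .sa next to the prefix); the helper name missing_bwa_index_files is hypothetical and not part of bwa.

```python
from pathlib import Path

# Suffixes that bwa index (0.7.x) is assumed to write next to the prefix.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(prefix):
    """Return the expected index files that do not exist for this prefix.

    `prefix` is what you would pass to `bwa mem` as the index: by default
    the path of the FASTA file, or the value given to `bwa index -p`.
    """
    base = str(Path(prefix).expanduser())
    return [base + suffix for suffix in BWA_INDEX_SUFFIXES
            if not Path(base + suffix).exists()]


if __name__ == "__main__":
    missing = missing_bwa_index_files("~/gz_files/sample_things/fungiref.fa")
    if missing:
        print("index files not found:")
        for name in missing:
            print("  " + name)
    else:
        print("index looks complete")
```

An empty list means all five files were found for that prefix; a non-empty list is the situation in which bwa reports that it fails to locate the index files.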


Snakemake on cluster error: 'Wildcards' object has no attribute 'output'

I'm running into an error of 'Wildcards' object has no attribute 'output', similar to this earlier question 'Wildcards' object has no attribute 'output', when I submit Snakemake to my cluster. I'm wondering if you have any suggestions for how to make this compatible with the cluster?
While my rule annotate_snps works when I test it locally, I get the following error on the cluster:
input: results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_gatk.vcf.gz
output: results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_gatk_rename.vcf.gz, results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_gatk_tmp.vcf.gz, results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_gatk_ann.vcf.gz
log: results/CI226380_S4/vars/CI226380_S4_bwa_H37Rv_annotate_snps.log
jobid: 1139
wildcards: samp=CI226380_S4, mapper=bwa, ref=H37Rv
WorkflowError in line 173 of /oak/stanford/scg/lab_jandr/walter/tb/mtb/workflow/Snakefile:
'Wildcards' object has no attribute 'output'
My rule is defined as:
rule annotate_snps:
    input:
        vcf='results/{samp}/vars/{samp}_{mapper}_{ref}_gatk.vcf.gz'
    log:
        'results/{samp}/vars/{samp}_{mapper}_{ref}_annotate_snps.log'
    output:
        rename_vcf=temp('results/{samp}/vars/{samp}_{mapper}_{ref}_gatk_rename.vcf.gz'),
        tmp_vcf=temp('results/{samp}/vars/{samp}_{mapper}_{ref}_gatk_tmp.vcf.gz'),
        ann_vcf='results/{samp}/vars/{samp}_{mapper}_{ref}_gatk_ann.vcf.gz'
    params:
        bed=config['bed_path'],
        vcf_header=config['vcf_header']
    shell:
        '''
        # Rename Chromosome to be consistent with snpEff/Ensembl genomes.
        zcat {input.vcf}| sed 's/NC_000962.3/Chromosome/g' | bgzip > {output.rename_vcf}
        tabix {output.rename_vcf}
        # Run snpEff
        java -jar -Xmx8g {config[snpeff]} eff {config[snpeff_db]} {output.rename_vcf} -dataDir {config[snpeff_datapath]} -noStats -no-downstream -no-upstream -canon > {output.tmp_vcf}
        # Also use bed file to annotate vcf
        bcftools annotate -a {params.bed} -h {params.vcf_header} -c CHROM,FROM,TO,FORMAT/PPE {output.tmp_vcf} > {output.ann_vcf}
        '''
Thank you so much in advance!
The rule definition itself looks consistent. The only unusual part is that the shell command reads values directly from config, e.g. config[snpeff].
One thing to check is whether the config is the same on your local machine and on the cluster. If it is not, some value might be confusing Snakemake, e.g. if somehow config[snpeff] == "wildcards.output" (or something similar).

Cannot extract kernel source xz from c7 elrepo kernel-lt-4.4.236-1.el7.elrepo.nosrc.rpm

Please let me know what I need to do to extract linux-4.4.236.tar.xz from the rpm
My goal is to extract the kernel source and repackage it for use in our build process. We use the standard pattern for this, but something funny is happening with some elrepo packages (kernel-lt-4.4.236-1.el7.elrepo.nosrc.rpm specifically).
List the contents of the package
rpm -qlp kernel-lt-4.4.236-1.el7.elrepo.nosrc.rpm
warning: kernel-lt-4.4.236-1.el7.elrepo.nosrc.rpm: Header V4 DSA/SHA1 Signature, key ID baadae52: NOKEY
config-4.4.236-x86_64
cpupower.config
cpupower.service
kernel-lt-4.4.spec
linux-4.4.236.tar.xz
List the contents of a cpio archive
We see linux-4.4.236.tar.xz in the package listing, so we use the rpm2cpio method and check the contents of the cpio archive. The problem is that the archive's table of contents lacks linux-4.4.236.tar.xz:
rpm2cpio kernel-lt-4.4.236-1.el7.elrepo.nosrc.rpm |cpio -t
config-4.4.236-x86_64
cpupower.config
cpupower.service
kernel-lt-4.4.spec
Extract contents from the archive
When we extract, we see all the items from the table and not linux-4.4.236.tar.xz
rpm2cpio kernel-lt-4.4.236-1.el7.elrepo.nosrc.rpm |cpio -idv
config-4.4.236-x86_64
cpupower.config
cpupower.service
kernel-lt-4.4.spec
514 blocks
This is a nosrc ("no source") RPM and, according to this blog post, this class of package does not contain the source tarball.

Extracting unmapped reads where both mates are unmapped using samtools?

I'm trying to determine the best way to extract unmapped reads in which both mates in a pair did not map. Currently, it seems that my code is simply extracting all unmapped reads, regardless of their mate. I'm not sure how to go about this, as I'm already using the -f option to extract unmapped reads. Would I just do another iteration of samtools view?
samtools view -@ 4 -buh -f4 sample${r}_pe.remove.sam > sample${r}_pe.unmapped.bam
To extract only the reads where read 1 is unmapped AND read 2 is unmapped (= both mates are unmapped):
samtools view -b -f12 input.sam > output.both_mates_unmapped.bam
Here, the options are:
-b - output BAM,
-f12 - keep only reads that have both flag bits set: 4 (read unmapped) + 8 (mate unmapped).
SEE ALSO:
Decoding SAM flags: https://broadinstitute.github.io/picard/explain-flags.html
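The arithmetic behind -f12 can be sketched in a few lines of Python. This only illustrates the bit test on the SAM FLAG field, not how samtools is implemented; the function name both_mates_unmapped is made up for the example.

```python
READ_UNMAPPED = 0x4  # bit 4: this read is unmapped
MATE_UNMAPPED = 0x8  # bit 8: the mate of this read is unmapped


def both_mates_unmapped(flag):
    """Mimic `samtools view -f 12`: every bit in 12 must be set in the flag."""
    required = READ_UNMAPPED | MATE_UNMAPPED  # 4 + 8 = 12
    return (flag & required) == required


# 77 = 1 (paired) + 4 + 8 + 64 (first in pair): kept by -f 12
# 69 = 1 (paired) + 4 + 64: read unmapped but mate mapped, so dropped by -f 12
for flag in (77, 69):
    print(flag, both_mates_unmapped(flag))
```

With -f4 alone, flag 69 would pass as well, which is why the original command returned every unmapped read regardless of its mate.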

snakemake running nanopolish and making it wait until previous rule is done

Hi, I can run the different steps of nanopolish with Snakemake. But when I run the workflow, it gives an error that the index file created in the bwa rule isn't available yet. After giving this error, it creates the very file that the error was about. If I run Snakemake again without removing any files, it works because the file is there. How can I tell Snakemake to wait with the next step until the first one is done? I have searched for ways to solve this problem and all I could find was priority and ruleorder; I have used both, but it still doesn't work. Here is the script that I use.
ruleorder: bwa > nanopolish

rule bwa:
    input:
        "nanopolish/assembly.fasta"
    output:
        "nanopolish/draft.fa"
    conda:
        "envs/nanopolish.yaml"
    priority:
        50
    shell:
        "bwa index {input} - > {output}"

rule nanopolish:
    input:
        "nanopolish/assembly.fasta",
        "zipped/zipped.gz"
    output:
        "nanopolish/reads.sorted.bam"
    conda:
        "envs/nanopolish.yaml"
    shell:
        "bwa mem -x ont2d {input} | samtools sort -o {output} -T reads.tmp"
You should take another look at the docs to properly understand the idea behind Snakemake:
Rules describe how to create output files from input files
A rule is not executed until all of its input exists, so all you have to do is add the output of the bwa rule to the input of the nanopolish rule:
rule nanopolish:
    input:
        "nanopolish/assembly.fasta",
        "nanopolish/draft.fa", # <-- output of bwa
        "zipped/zipped.gz"
Ruleorder and priority are not relevant solutions for your problem.
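To see why declaring the file as an input is what enforces the order, here is a toy model in Python. It is not Snakemake's scheduler (Snakemake works backwards from the requested targets); it only illustrates the principle that a rule can run once all of its inputs exist. The function runnable_order and the rule dictionaries are made up for this sketch.

```python
def runnable_order(rules, existing):
    """Return rule names in an order where each rule's inputs exist first.

    rules    : dict mapping rule name -> (list of inputs, list of outputs)
    existing : iterable of files that are already present
    """
    available = set(existing)
    pending = dict(rules)
    order = []
    while pending:
        ready = [name for name, (inputs, _) in pending.items()
                 if set(inputs) <= available]
        if not ready:
            raise ValueError("missing inputs for: " + ", ".join(sorted(pending)))
        for name in ready:
            order.append(name)
            available.update(pending.pop(name)[1])  # outputs now exist
    return order


present = ["nanopolish/assembly.fasta", "zipped/zipped.gz"]

# As in the question: nanopolish does not declare draft.fa as an input,
# so nothing forces it to wait for the bwa rule.
undeclared = {
    "nanopolish": (["nanopolish/assembly.fasta", "zipped/zipped.gz"],
                   ["nanopolish/reads.sorted.bam"]),
    "bwa": (["nanopolish/assembly.fasta"], ["nanopolish/draft.fa"]),
}

# As in the answer: draft.fa is an input, so bwa has to finish first.
declared = dict(undeclared)
declared["nanopolish"] = (
    ["nanopolish/assembly.fasta", "nanopolish/draft.fa", "zipped/zipped.gz"],
    ["nanopolish/reads.sorted.bam"],
)

print(runnable_order(undeclared, present))
print(runnable_order(declared, present))
```

In the first case both rules are ready immediately, so the order between them is arbitrary; in the second, nanopolish only becomes ready after bwa has produced its output.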

Postprocess drmemory error stacks with new symbols after process exits

After running a set of tests with drmemory overnight, I am trying to resolve the error stacks by providing pdb symbols. The pdbs come from a large samba-mapped repository, and using _NT_SYMBOL_PATH at runtime slowed things down too much.
Does anyone know of a tool that post-processes results.txt and pulls new symbols (via _NT_SYMBOL_PATH or otherwise) as required to produce more detailed stacks? If not, any hints for adapting asan_symbolize.py to do this?
https://llvm.org/svn/llvm-project/compiler-rt/trunk/lib/asan/scripts/asan_symbolize.py
What I came up with so far using dbghelp.dll is below. Works but could be better.
https://github.com/patraulea/postpdb
OK, this query does not pertain to the use of windbg, nor is it really about _NT_SYMBOL_PATH.
Dr. Memory is a memory diagnostic tool akin to valgrind. It is based on the DynamoRIO instrumentation framework and is usable on raw, unmodified binaries.
On Windows you can invoke it like drmemory.exe calc.exe from a command prompt (cmd.exe).
As soon as the binary finishes execution, a log file named results.txt is written to a default location.
If you had set up _NT_SYMBOL_PATH, drmemory honors it and resolves symbol information from pre-pulled symbol files (viz. *.pdb). It does not seem to download files from the MS symbol server: it seems to ignore the SRV* cache and use only the downstream symbol folder.
So if the pdb file is missing or isn't downloaded yet,
results.txt will contain a stack trace like
# 6 USER32.dll!gapfnScSendMessage +0x1ce (0x75fdc4e7 <USER32.dll+0x1c4e7>)
# 7 USER32.dll!gapfnScSendMessage +0x2ce (0x75fdc5e7 <USER32.dll+0x1c5e7>)
while if the symbol file was available it would show
# 6 USER32.dll!InternalCallWinProc
# 7 USER32.dll!UserCallWinProcCheckWow
So basically, as I commented, you need to fetch the symbol files for the executable in question.
You can use symchk on a running process, too, and create a manifest file.
You can then use symchk on a machine that is connected to the internet
to download the symbols, copy them to a local folder on the non-internet machine,
and point _NT_SYMBOL_PATH to this folder.
>tlist | grep calc.exe
1772 calc.exe Calculator
>symchk /om calcsyms.txt /ip 1772
SYMCHK: GdiPlus.dll FAILED - MicrosoftWindowsGdiPlus-1.1.7601.17514-gdiplus.pdb mismatched or not found
SYMCHK: FAILED files = 1
SYMCHK: PASSED + IGNORED files = 27
>head -n 4 calcsyms.txt
calc.pdb,971D2945E998438C847643A9DB39C88E2,1
calc.exe,4ce7979dc0000,1
ntdll.pdb,120028FA453F4CD5A6A404EC37396A582,1
ntdll.dll,4ce7b96e13c000,1
>tail -n 4 calcsyms.txt
CLBCatQ.pdb,00A720C79BAC402295B6EBDC147257182,1
clbcatq.dll,4a5bd9b183000,1
oleacc.pdb,67620D076A2E43C5A18ECD5AF77AADBE2,1
oleacc.dll,4a5bdac83c000,1
So, assuming you have fetched the symbols, it would be easier to rerun the tests with locally cached copies of the symbol files.
If you have fetched the symbols but cannot rerun the tests, and have to work solely with the output in results.txt, you have some text processing work to do (sed, grep, awk, or a custom parser).
The drmemory suite comes with symquery.exe in the bin folder, and it can be used to resolve the symbols from results.txt.
In the example above you can notice the offset relative to the module base, like
0x1c4e7 in the line # 6 USER32.dll!gapfnScSendMessage +0x1ce (0x75fdc4e7 <USER32.dll+0x1c4e7>)
So for each line in results.txt you have to parse out the offset and invoke symquery on the module like below:
:\>symquery.exe -f -e c:\Windows\System32\user32.dll -a +0x1c4e7
InternalCallWinProc+0x23
??:0
:\>symquery.exe -f -e c:\Windows\System32\user32.dll -a +0x1c5e7
UserCallWinProcCheckWow+0xb3
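The parsing step described above could be sketched as a small Python helper. This is only an illustration under assumptions: the regular expression is written against the two frame lines quoted earlier and may not cover every line format in results.txt, the function names are hypothetical, and the module path still has to be looked up separately. The symquery.exe flags are the ones used in the commands above.

```python
import re

# Matches frames such as:
#   # 6 USER32.dll!gapfnScSendMessage +0x1ce (0x75fdc4e7 <USER32.dll+0x1c4e7>)
# Group 1: frame number, group 2: module name, group 3: offset from module base.
FRAME_RE = re.compile(r"^#\s*(\d+)\s.*<([^<>+]+)\+(0x[0-9a-fA-F]+)>\)")


def parse_frame(line):
    """Return (frame_number, module, offset) or None if the line has no offset."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3)


def symquery_command(module_path, offset):
    """Build the argument list for the symquery.exe call shown above."""
    return ["symquery.exe", "-f", "-e", module_path, "-a", "+" + offset]


if __name__ == "__main__":
    line = "# 6 USER32.dll!gapfnScSendMessage +0x1ce (0x75fdc4e7 <USER32.dll+0x1c4e7>)"
    frame = parse_frame(line)
    if frame is not None:
        number, module, offset = frame
        # The full path of the module is an assumption here.
        print(symquery_command("c:\\Windows\\System32\\user32.dll", offset))
```

The resulting argument list could be passed to subprocess.run on a machine where symquery.exe and the pdb files are available, and its output substituted back into the corresponding frame.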
A simple text processing example on results.txt, with trimmed output:
:\>grep "^#" results.txt | sed s/".*<"//g
# 0 system call NtUserBuildPropList parameter #2
USER32.dll+0x649d9>)
snip
COMCTL32.dll+0x2f443>)
Notice the comctl32.dll: there is a default comctl32.dll in system32 and several others in winsxs, so you have to consult the other files, like global.log, to view the dll load path.
symquery.exe -f -e c:\Windows\winsxs\x86_microsoft.windows.common-controls_6595b64144ccf1df_6.0.7601.17514_none_41e6975e2bd6f2b2\comctl32.dll -a +0x2f443
CallOriginalWndProc+0x1a
??:0
symquery.exe -f -e c:\Windows\system32\comctl32.dll -a +0x2f443
DrawInsert+0x120 <----- wrong symbol due to wrong module (late binding / forwarded xxx yyy reasons)
