Use wildcard on params - bioinformatics

Use wildcard on params - bioinformatics

I try to use one tool and I need to use a wildcard present on input.
This is an example:
aDict = {"120":"121" } #tumor : normal
rule all:
input: expand("{case}.mutect2.vcf",case=aDict.keys())
def get_files_somatic(wildcards):
case = wildcards.case
control = aDict[wildcards.case]
return [case + ".sorted.bam", control + ".sorted.bam"]
rule gatk_Mutect2:
input:
get_files_somatic,
output:
"{case}.mutect2.vcf"
params:
genome="ref/hg19.fa",
target= "chr12",
name_tumor='{case}'
log:
"logs/{case}.mutect2.log"
threads: 8
shell:
" gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} -I {input[1]} -normal {wildcards.control}"
" -L {params.target} -O {output}"
I Have this error:
'Wildcards' object has no attribute 'control'
So I have a function with case and control. I'm not able to extract code.

The wildcards are derived from the output file/pattern. That is why you only have the wildcard called case. You have to derive the control from that. Try replacing your shell statement with this:
run:
control = aDict[wildcards.case]
shell(
"gatk-launch Mutect2 -R {params.genome} -I {input[0]} "
"-tumor {params.name_tumor} -I {input[1]} -normal {control} "
"-L {input.target2} -O {output}"
)

You could define control in params. Also {input.target2} in shell command would result in error. May be it's supposed to be params.target?
rule gatk_Mutect2:
input:
get_files_somatic,
output:
"{case}.mutect2.vcf"
params:
genome="ref/hg19.fa",
target= "chr12",
name_tumor='{case}',
control = lambda wildcards: aDict[wildcards.case]
shell:
"""
gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} \\
-I {input[1]} -normal {params.control} -L {params.target} -O {output}
"""

Related

Snakemake Ambiguity between 2 rules with similar output file type

I am trying to map 3 samples: SRR14724459, SRR14724473, and a combination of both SRR14724459_SRR14724473.
I have 2 rules that share a similar file type output (.bam), and even tho I am naming their wildcards different, I still get an ambiguity error:
Building DAG of jobs...
AmbiguousRuleException:
Rules map_hybrid and map are ambiguous for the file /gpfs/scratch/hpadre/snakemake_outputs/mapped_dir/SRR14724459_SRR14724473__0.9.bam.
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473__0.9/SRR14724459_SRR14724473__0.9_R2_trimmed.fastq
From my Snakefile, here are my variables:
all_samples: ['SRR14724459', 'SRR14724473']
sample1: ['SRR14724459']
sample2: ['SRR14724473']
titration:[0.9]
This is my rule all:
rule all:
expand(MAPPED_DIR + "/{sample}.bam", sample=all_samples),
expand(MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam", zip, sample1=list_a_titrations, sample2=list_b_titrations, titration=tit_list)
This is my map rule:
rule map:
input:
r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
output:
MAPPED_DIR + "/{sample}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
wildcard_constraints:
word='[^0-9]*'
shell:
"""
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
"""
This is my map_hybrid:
rule map_hybrid:
input:
r1 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R1_trimmed_{sample2}_R1_trimmed_{titration}.fastq",
r2 = TRIMMED_DIR + "/{sample1}_{sample2}_{titration}/{sample1}_R2_trimmed_{sample2}_R2_trimmed_{titration}.fastq"
output:
MAPPED_DIR + "/{sample1}_{sample2}_{titration}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample1}_{sample2}_{titration}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample1}_{sample2}_{titration}_bwa_benchmark.txt"
shell:
"""
set +e
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
exitcode=$?
if [ $exitcode -eq 1 ]
then
exit 1
else
exit 0
fi
"""
The expected input files SHOULD BE as so:
Expected input files:
map_hybrid: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R1_trimmed_SRR14724473_R1_trimmed_0.9.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_SRR14724473_0.9/SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq
map: /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724459_R2_trimmed.fastq
and also
/home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R1_trimmed.fastq /home/hpadre/ngs_artifacts_proj/output_directories/trimmed_dir/SRR14724473_R2_trimmed.fastq

Your rules map and map_hybrid can both produce your desired files, for snakemake they are ambigious rules.
The names of the wildcards is irrelevant, what is relevant is whether the wildcards in both rules can match the same output filepath.
That is the case here.
While rule map_hybrid can produce the output file SRR14724459_R2_trimmed_SRR14724473_R2_trimmed_0.9.fastq where the wildcard matches are
sample1=SRR14724459
sample2=SRR14724473
the rule map can also produce this output with the wildcard match
sample=SRR14724459_R2_trimmed_SRR14724473
To prevent ambiguity you can use the wildcard_constraint, so that the {sample} wildcard only matches strings starting with SRR followed by numbers:
sample='SRR\d+'
Integrated into your rule map:
rule map:
input:
r1 = TRIMMED_DIR + "/{sample}/{sample}_R1_trimmed.fastq",
r2 = TRIMMED_DIR + "/{sample}/{sample}_R2_trimmed.fastq"
output:
MAPPED_DIR + "/{sample}.bam"
threads: 28
params:
genome = HUMAN_GENOME_DIR
log:
LOG_DIR + "/map/{sample}_map.log"
benchmark:
BENCHMARK_DIR + "/map/{sample}_bwa_benchmark.txt"
wildcard_constraints:
word='[^0-9]*',
sample='SRR\d+'
shell:
"""
bwa mem -t {threads} {params.genome} {input.r1} {input.r2} 2> {log} | samtools view -hSbo > {output}
"""
it should resolve the ambiguity.

GNU parallel environmental variables separated from input

I have a problem which I think should be very easy to fix, but I for some reason I cant solve it..
I am using GNU parallel inside of Snakemake to perform some variant calling.
My input file (contigs.txt) looks something like this:
GL000207.1
GL000226.1
GL000229.1
GL000231.1
GL000210.1
GL000239.1
GL000235.1
The command which ultimatly gets executes looks something like this:
"name=$(bash get_name.sh normal_recal.bam) \n"
"cat {input.contigs} | "
"env_parallel --env name --jobs {threads} "
"'({input.GATK} Mutect2 -I {input.tumor} -I {input.normal} "
"-R {input.reference} -normal $name "
"--native-pair-hmm-threads 4 "
"-L {{}} --germline-resource {input.gnomAD} "
"-O {input.path}/{wildcards.sample_id}/{{}}.somatic.vcf "
"--f1r2-tar-gz {input.path}/{wildcards.sample_id}/{{}}.f1r2.tar.gz) &> {log.err}.{{}}.err' \n"
If I execute this script in my interactive terminal every thing works as expected. However when I am trying to execute it inside snakemake, the program just ends without an error. I think I have tracked down what the problem is using this simple example:
cat contigs | env_parallel --env name --jobs 4 'echo $name'
This prints something like this:
sample_tumor GL000207.1
sample_tumor GL000226.1
sample_tumor GL000229.1
sample_tumor GL000231.1
I think that the problem is, that the input variable also gets passed to the $name variable, which causes the program to crash.
I was wondering how I can achieve, that only the the actual $name (i.e. here sample_tumor) gets passed. So the output of the above example would be
sample_tumor
sample_tumor
sample_tumor
sample_tumor
Cheers!

It turns out my guess regarding the error was wrong. I think I miss used env_parallel.
This code solved my problem.
"name=$(bash get_name.sh normal_recal.bam) \n"
"export name \n"
"cat {input.contigs} | "
"parallel --jobs {threads} "
"'({input.GATK} Mutect2 -I {input.tumor} -I {input.normal} "
"-R {input.reference} -normal $name "
"--native-pair-hmm-threads 4 "
"-L {{}} --germline-resource {input.gnomAD} "
"-O {input.path}/{wildcards.sample_id}/{{}}.somatic.vcf "
"--f1r2-tar-gz {input.path}/{wildcards.sample_id}/{{}}.f1r2.tar.gz) &> {log.err}.{{}}.err'"

Using bash functions in snakemake

I am trying to download some files with snakemake. The files (http://snpeff.sourceforge.net/SnpSift.html#dbNSFP) I would like to download are on a google site/drive and my usual wget approach does not work. I found a bash function that does the job (https://www.zachpfeffer.com/single-post/wget-a-Google-Drive-file):
function gdrive_download () { CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=$1" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p') wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$1" -O $2 rm -rf /tmp/cookies.txt }
gdrive_download 120aPYqveqPx6jtssMEnLoqY0kCgVdR2fgMpb8FhFNHo test.txt
I have tested this function with my ids in a plain bash script and was able to download all the files. To add a bit to the complexity, I must use a workplace template, and incorporate the function into it.
rule dl:
params:
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_{genome}/{afile}'
output:
'data/{genome}/{afile}'
params:
id1 = '0B7Ms5xMSFMYlOTV5RllpRjNHU2s',
f1 = 'dbNSFP.txt.gz'
shell:
"""CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id={{params.id1}}" -O- | sed -rn "s/.*confirm=([0-9A-Za-z_]+).*/\1\n/p") && wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id={{params.id1}}" -O {{params.f1}} && rm -rf /tmp/cookies.txt"""
#'wget -c {params.url} -O {output}'
rule checksum:
input:
i = 'data/{genome}/{afile}'
output:
o = temp('tmp/{genome}/{afile}.md5')
shell:
'md5sum {input} > {output}'
rule file_size:
input:
i = 'data/{genome}/{afile}'
output:
o = temp('tmp/{genome}/{afile}.size')
shell:
'du -csh --apparent-size {input} > {output}'
rule file_info:
"""md5 checksum and file size"""
input:
md5 = 'tmp/{genome}/{afile}.md5',
s = 'tmp/{genome}/{afile}.size'
output:
o = temp('tmp/{genome}/info/{afile}.csv')
run:
with open(input.md5) as f:
md5, fp = f.readline().strip().split()
with open(input.s) as f:
size = f.readline().split()[0]
with open(output.o, 'w') as fout:
print('filepath,size,md5', file=fout)
print(f"{fp},{size},{md5}", file=fout)
rule manifest:
input:
expand('tmp/{genome}/info/{suffix}.csv', genome=('GRCh37','GRCh38'), suffix=('dbNSFP.txt.gz', 'dbNSFP.txt.gz.tbi'))
#expand('tmp/{genome}/info/SnpSift{suffix}.csv', genome=('GRCh37','GRCh38'), suffix=('dbNSFP.txt.gz', 'dbNSFP.txt.gz.tbi'))
output:
o = 'MANIFEST.csv'
run:
pd.concat([pd.read_csv(afile) for afile in input]).to_csv(output.o, index=False)
There are four downloadable files for which I have ids (I only show one in params), however I don't know how to call the bash functions as written by ZPfeffer for all the ids I have with snakemake. Additionally, when I run this script, there are several errors, the most pressing being
sed: -e expression #1, char 31: unterminated `s' command
I am far from a snakemake expert, any assistance on how to modify my script to a) call the functions with 4 different ids, b) remove the sed error, and c) verify whether this is the correct url format (currently url = 'https://docs.google.com/uc?export/{afile}) will be greatly appreciated.

You would want to use raw string literal so that snakemake doesn't escape special characters, such as backslash in sed command. For example (notice r in front of shell command):
rule foo:
shell:
r"sed d\s\"
You could use --printshellcmds or -p to see how exactly shell: commands get resolved by snakemake.

Here is how I "solved" it:
import pandas as pd
rule dl:
output:
'data/{genome}/{afile}'
shell:
"sh download_snpsift.sh"
rule checksum:
input:
i = 'data/{genome}/{afile}'
output:
o = temp('tmp/{genome}/{afile}.md5')
shell:
'md5sum {input} > {output}'
rule file_size:
input:
i = 'data/{genome}/{afile}'
output:
o = temp('tmp/{genome}/{afile}.size')
shell:
'du -csh --apparent-size {input} > {output}'
rule file_info:
"""md5 checksum and file size"""
input:
md5 = 'tmp/{genome}/{afile}.md5',
s = 'tmp/{genome}/{afile}.size'
output:
o = temp('tmp/{genome}/info/{afile}.csv')
run:
with open(input.md5) as f:
md5, fp = f.readline().strip().split()
with open(input.s) as f:
size = f.readline().split()[0]
with open(output.o, 'w') as fout:
print('filepath,size,md5', file=fout)
print(f"{fp},{size},{md5}", file=fout)
rule manifest:
input:
expand('tmp/{genome}/info/{suffix}.csv', genome=('GRCh37','GRCh38'), suffix=('dbNSFP.txt.gz', 'dbNSFP.txt.gz.tbi'))
output:
o = 'MANIFEST.csv'
run:
pd.concat([pd.read_csv(afile) for afile in input]).to_csv(output.o, index=False)
And here is the bash script.
function gdrive_download () {
CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=$1" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$1" -O $2
rm -rf /tmp/cookies.txt
}
gdrive_download 0B7Ms5xMSFMYlSTY5dDJjcHVRZ3M data/GRCh37/dbNSFP.txt.gz
gdrive_download 0B7Ms5xMSFMYlOTV5RllpRjNHU2s data/GRCh37/dbNSFP.txt.gz.tbi
gdrive_download 0B7Ms5xMSFMYlbTZodjlGUDZnTGc data/GRCh38/dbNSFP.txt.gz
gdrive_download 0B7Ms5xMSFMYlNVBJdFA5cFZRYkE data/GRCh38/dbNSFP.txt.gz.tbi

Multiple Processes - Python

I am looking to run multiple instances of a command line script at the same time. I am new to this concept of "multi-threading" so am at bit of a loss as to why I am seeing the things that I am seeing.
I have tried to execute the sub-processing in two different ways:
1 - Using multiple calls of Popen without a communicate until the end:
command = 'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum1 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum1{}'.format(i), str(i))
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
command = 'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum2 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum2{}'.format(i), str(i))
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
command = 'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum3 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum3{}'.format(i), str(i))
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
(stdoutdata, stderrdata) = process.communicate()
this starts up each of the command line item but only completes the last entry leaving the other 2 hanging.
2 - Attempting to implement an example from Python threading multiple bash subprocesses? but nothing happens except for a printout of the commands (program hangs with no command line arguments running as observed in windows task manager:
import threading
import Queue
import commands
import time
workspace = r'F:\Processing\SM'
image = 't08r_e'
image_name = (image.split('.'))[0]
i = 0
process_image_tif = workspace + '\\{}{}.tif'.format((image.split('r'))[0], str(i))
# thread class to run a command
class ExampleThread(threading.Thread):
def __init__(self, cmd, queue):
threading.Thread.__init__(self)
self.cmd = cmd
self.queue = queue
def run(self):
# execute the command, queue the result
(status, output) = commands.getstatusoutput(self.cmd)
self.queue.put((self.cmd, output, status))
# queue where results are placed
result_queue = Queue.Queue()
# define the commands to be run in parallel, run them
cmds = ['raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum1 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum1{}'.format(i), str(i)),
'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum2 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum2{}'.format(i), str(i)),
'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum3 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum3{}'.format(i), str(i)),
]
for cmd in cmds:
thread = ExampleThread(cmd, result_queue)
thread.start()
# print results as we get them
while threading.active_count() > 1 or not result_queue.empty():
while not result_queue.empty():
(cmd, output, status) = result_queue.get()
print(cmd)
print(output)
How can I run all of these commands at the same time achieving a result at the end? I am running in windows, pyhton 2.7.

My first try didn't work because of the repeated definitions of stdout and sterror. Removing these definitions causes expected behavior.

snakemake how to encode pair analisys

I want to use gatk recalibration using pair sample ( tumor and normal). I need to parse the data using pandas. That is what I wroted.
expand("mapped_reads/merged_samples/{sample[1][tumor]}/{sample[1][tumor]}_{sample[1][normal]}.bam", sample=read_table(config["conditions"], ",").iterrows())
this is the condition file:
432,433
434,435
I wrote this rule:
rule gatk_RealignerTargetCreator:
input:
"mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
"mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
output:
"mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
params:
genome=config['reference']['genome_fasta'],
mills= config['mills'],
ph1_indels= config['know_phy'],
log:
"mapped_reads/merged_samples/logs/{tumor}_{normal}.realign_info.log"
threads: 8
shell:
"gatk -T RealignerTargetCreator -R {params.genome} {params.custom} "
"-nt {threads} "
"-I {wildcard.tumor} -I {wildcard.normal} -known {params.ph1_indels} "
"-o {output} >& {log}"
I have this error:
InputFunctionException in line 17 of /home/maurizio/Desktop/TEST_exome/rules/samfiles.rules:
KeyError: '432/432_433'
Wildcards:
sample=432/432_433
this is the samfiles.rules:
rule samtools_merge_bam:
"""
Merge bam files for multiple units into one for the given sample.
If the sample has only one unit, files will be copied.
"""
input:
lambda wildcards: expand("mapped_reads/bam/{unit}_sorted.bam",unit=config["samples"][wildcards.sample])
output:
"mapped_reads/merged_samples/{sample}.bam"
benchmark:
"benchmarks/samtools/merge/{sample}.txt"
run:
if len(input) > 1:
shell("/illumina/software/PROG2/samtools-1.3.1/samtools merge {output} {input}")
else:
shell("cp {input} {output} && touch -h {output}")

I can only guess because you don't show all relevant rule, but I would say the error occurs because the rule samtools_merge_bam also applies to some later bam file where you have the pattern {tumor}/{tumor}_{normal}...
As a solution, you have to resolve this ambiguity (see the snakemake tutorial). For example, you can constrain the wildcard of samtools_merge_bam to not contain any slashes.
wildcard_constraints:
sample="[^/]+"
You can put the constraint either globally or inside your samtools_merge_bam rule.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Use wildcard on params - bioinformatics

Related

Snakemake Ambiguity between 2 rules with similar output file type

GNU parallel environmental variables separated from input

Using bash functions in snakemake

Multiple Processes - Python

snakemake how to encode pair analisys

Categories

Resources