Can Snakemake run cluster jobs on a user-defined subset of data? - cluster-computing

I was wondering whether it is possible to create cluster jobs with Snakemake that each run a batch of tasks.
Assuming the following input:
image_00.tif
image_01.tif
...
image_99.tif
I would like to create the following output:
meas_image_00.txt
meas_image_02.txt
...
meas_image_99.txt
But instead of creating 100 jobs, I would like to submit 5 jobs, each processing 20 input files and creating 20 output files.
I thought about the group directive, but I am not sure whether it really does what I want.
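If I read the documentation correctly, grouped jobs can be bundled into single cluster submissions with --group-components. A minimal sketch of what I imagine (hedged; measure_tool is just a placeholder for the real measurement command):

rule measure:
    input:
        "{image}.tif"
    output:
        "meas_{image}.txt"
    group:
        "batch"
    shell:
        "measure_tool {input} > {output}"  # placeholder command

# e.g. bundle 20 single-image jobs of group 'batch' into one cluster job:
# snakemake --cluster qsub --group-components batch=20 --jobs 5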
The only other way I can imagine is to additionally create a batch file as output and ignore the fact that there will be per-image outputs. The expected files would then be something like this:
batch_1.summary
batch_2.summary
batch_3.summary
batch_4.summary
batch_5.summary
But then I would lose control over whether the meas_image*.txt files are created at all.
Does anybody have advice on how to solve this?
Edit:
I played around with an example using rules generated in a loop. For now I just concentrated on creating output in batches, but something goes wrong. This is my Snakefile:
data = list(range(0, 100))
samples = ["image_" + str(i).zfill(2) for i in data]
batch_size = 5

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def chunklist():
    chunk_dict = {}
    for i, chunk in enumerate(chunks(samples, batch_size)):
        chunk_dict[i] = chunk
    return chunk_dict

cd = chunklist()

rule all:
    input: 'rule_3.out'

for i, chunk in cd.items():
    rule:  # batch rule
        name: "my_rule_{}".format(i)
        output: expand("{sample}.txt", sample=chunk)
        group: 'batch_group'
        run:
            print(output)
            print(chunk)
            for file in chunk:
                print("Creating {} of {}".format(file, chunk))
                touch('{}.txt'.format(file))

for i in cd:
    rule:  # batch rule
        name: "batch_rule_{}".format(i)
        input: expand('{sample}.txt', sample=cd[i])
        group: 'batch_group'
        output: touch('{}.group'.format(i))

rule rule3:
    input:
        files=expand('{sample}.txt', sample=samples),
        batches=expand('{i}.group', i=range(len(samples) // batch_size - 1))  # should be ceil in general
    output:
        'rule_3.out'
    shell: 'cat {input.files} > {output}'
And this is the problem I get:
rule my_rule_10:
output: image_50.txt, image_51.txt, image_52.txt, image_53.txt, image_54.txt
jobid: 12
reason: Missing output files: image_51.txt, image_54.txt, image_53.txt, image_52.txt, image_50.txt
resources: tmpdir=/tmp
image_50.txt image_51.txt image_52.txt image_53.txt image_54.txt
['image_95', 'image_96', 'image_97', 'image_98', 'image_99']
Creating image_95 of ['image_95', 'image_96', 'image_97', 'image_98', 'image_99']
Creating image_96 of ['image_95', 'image_96', 'image_97', 'image_98', 'image_99']
Creating image_97 of ['image_95', 'image_96', 'image_97', 'image_98', 'image_99']
Creating image_98 of ['image_95', 'image_96', 'image_97', 'image_98', 'image_99']
Creating image_99 of ['image_95', 'image_96', 'image_97', 'image_98', 'image_99']
It's strange to me, as output is generated from chunk but then differs from the variable chunk...
I have no clue what is going wrong. This example should be executable for everybody. I would be happy to get help.
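A plausible explanation for the mismatch (hedged, but consistent with the output above): the run: body executes only after the for loop has finished, and Python closures look chunk up by name at call time, so every rule sees the final chunk (image_95 ... image_99), whereas the output directive was evaluated eagerly at rule-definition time. Binding the batch to each rule, e.g. via params, sidesteps this; note also that touch() inside a run: block only flags a path for Snakemake rather than creating the file, so a shell touch is used instead:

for i, chunk in cd.items():
    rule:  # batch rule
        name: "my_rule_{}".format(i)
        output: expand("{sample}.txt", sample=chunk)
        params:
            chunk = chunk  # evaluated now, while the loop variable still holds this batch
        group: 'batch_group'
        run:
            for file in params.chunk:
                shell("touch {f}.txt".format(f=file))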

Related

snakemake 6.0.5: Input a list of folders and multiple files from each folder (to merge Lanes)

Good day. I have some directories, each containing some .fastq files for different lanes:
CND1/ UD_LOO3_R1.fastq.gz UD_LOO4_R1.fastq.gz
CND2/ XD_L001_R1.fastq.gz XD_L004_R1.fastq.gz
Inside each directory, I want to create a merged fastq file named sample_R1.fastq.gz; for instance, CND1/UD_R1.fastq.gz, CND2/XD_R1.fastq.gz, and so on. To this end, I created the following Snakemake workflow.
from collections import defaultdict

dirs, samp, lane = glob_wildcards("{dir}/{sample}_L{lane}_R1.fastq.gz")
dirs, fls = glob_wildcards("{dir}/{files}_R1.fastq.gz")

D = defaultdict(list)
for x, y in zip(dirs, fls):
    D[x].append(y + '_R1.fastq.gz')

rule ML:
    message:
        "Merge all Lanes for Fragment R1"
    input:
        expand("{dir}/{files}", zip, dir=D.keys(), files=D.values())
    output:
        expand("{dir}/{s}_R1.fastq.gz", zip, dir=dirs, s=set(samp))
    shell:
        "echo {input} && echo {output} "
        #"zcat {input} >> {output}"
In the code above, dict D contains directories as keys and lists of fastq files as values.
{ 'CND2': ['XD_L001_R1.fastq.gz', 'XD_L004_R1.fastq.gz'], 'CND1': ['UD_LOO3_R1.fastq.gz', 'UD_LOO4_R1.fastq.gz'] }
On a dry run, Snakemake complains about missing files as follows:
Missing input files for rule ML:
CND2/['XD_L001_R1.fastq.gz', 'XD_L004_R1.fastq.gz']
CND1/['UD_LOO3_R1.fastq.gz', 'UD_LOO4_R1.fastq.gz']
I want to understand the correct way to provide both a directory and a list of files together as input. Any help is greatly appreciated.
Thanks.
The error clearly explains the issue. As a result of the expand function, your input is two files:
CND2/['XD_L001_R1.fastq.gz', 'XD_L004_R1.fastq.gz']
CND1/['UD_LOO3_R1.fastq.gz', 'UD_LOO4_R1.fastq.gz']
If you need the input to be four files, like this:
CND1/UD_LOO3_R1.fastq.gz
CND1/UD_LOO4_R1.fastq.gz
CND2/XD_L001_R1.fastq.gz
CND2/XD_L004_R1.fastq.gz
you need to flatten your dictionary:
inputs = [(dir, file) for dir, files in D.items() for file in files]

rule ML:
    input:
        expand("{dir}/{files}", zip, dir=[row[0] for row in inputs], files=[row[1] for row in inputs])
or alternatively:
inputs = [(dir, file) for dir, files in D.items() for file in files]

rule ML:
    input:
        expand("{filename}", filename=[f"{dir}/{file}" for dir, file in inputs])
Overall, you are overcomplicating the problem. The same can be done without this ugly juggling of lists of tuples. glob_wildcards("{filename_base}_R1.fastq.gz") should give you a more convenient representation.
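A rough sketch of that simpler route (hedged and untested; it assumes the layout above and that the merged name is everything before the _L<lane> part of the basename):

from collections import defaultdict

bases, = glob_wildcards("{base}_R1.fastq.gz")  # e.g. 'CND1/UD_LOO3', 'CND2/XD_L001'

merged = defaultdict(list)
for b in bases:
    # 'CND1/UD_LOO3' -> key 'CND1/UD'
    merged[b.rsplit("_L", 1)[0]].append(f"{b}_R1.fastq.gz")

rule all:
    input:
        [f"{m}_R1.fastq.gz" for m in merged]

rule ML:
    input:
        lambda wc: sorted(merged[f"{wc.dir}/{wc.sample}"])
    output:
        "{dir}/{sample}_R1.fastq.gz"
    shell:
        "zcat {input} | gzip > {output}"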

Error with Condor: "$INT() macro: 50+ $((0/41)) does not evaluate to an integer!"

I want to run several jobs with Condor. My executable takes as an argument b such that b1=50+ $(($(Process)/41)), where $(( )) stands for the quotient of $(Process) divided by 41. b is defined in quotient.sh. Here is my submit file:
# Unix submit description file
include : PATH/quotient.sh
executable = PATH/script_test.sh
arguments = $(b) $(Process)
log = fit_it_data_$INT(b)_$(Process).log
output = outfile_fit_$INT(b)_$(Process).txt
error = errors_fit_$INT(b)_$(Process).txt
transfer_input_files = PATH
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
queue 81
However, I am getting the error Submitting job(s) ERROR at Queue statement on Line 13: $INT() macro: 50+ $((0/41)) does not evaluate to an integer!. I don't understand why it complains that it does not evaluate to an integer, since b should be equal to 50 here...
Any idea how to fix this issue?
b1=50+ $(($(Process)/41))
I think you have an extra "$" in there. Try this:
b1=50+ ($(Process)/41)
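For completeness, a hedged guess at the corrected definition in quotient.sh (note also that the submit file references $(b) and $INT(b), so the macro presumably needs to be named b rather than b1):

# quotient.sh -- plain parentheses only; $INT(b) in the submit file then
# evaluates the arithmetic at queue time, e.g. 50 + (0/41) = 50 for Process 0
b = 50 + ($(Process) / 41)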

Snakemake - parameter file treated as a wildcard

I have written a pipeline in Snakemake. It's an ATAC-seq pipeline (a bioinformatics pipeline to analyze genomics data from a specific experiment). Basically, up to the alignment-merging step I use the {sample_id} wildcard, then switch to the {sample} wildcard (merging two or more sample_ids into one sample).
Working DAG here (for simplicity only one sample is shown; the orange and blue {sample_id}s are merged into one green {sample}).
The all rule looks as follows:
configfile: "config.yaml"

SAMPLES_DICT = dict()
with open(config['SAMPLE_SHEET'], "r+") as fil:
    next(fil)
    for lin in fil.readlines():
        row = lin.strip("\n").split("\t")
        sample_id = row[0]
        sample_name = row[1]
        if sample_name in SAMPLES_DICT.keys():
            SAMPLES_DICT[sample_name].append(sample_id)
        else:
            SAMPLES_DICT[sample_name] = [sample_id]

SAMPLES = list(SAMPLES_DICT.keys())
SAMPLE_IDS = [sample_id for sample in SAMPLES_DICT.values() for sample_id in sample]

rule all:
    input:
        # FASTQC output for RAW reads
        expand(os.path.join(config['FASTQC'], '{sample_id}_R{read}_fastqc.zip'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),
        # Trimming
        expand(os.path.join(config['TRIMMED'],
                            '{sample_id}_R{read}_val_{read}.fq.gz'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),
        # Alignment
        expand(os.path.join(config['ALIGNMENT'], '{sample_id}_sorted.bam'),
               sample_id = SAMPLE_IDS),
        # Merging
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_merged.bam'),
               sample = SAMPLES),
        # Marking Duplicates
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_md.bam'),
               sample = SAMPLES),
        # Filtering
        expand(os.path.join(config['FILTERED'], '{sample}.bam'),
               sample = SAMPLES),
        expand(os.path.join(config['FILTERED'], '{sample}.bam.bai'),
               sample = SAMPLES),
        # multiqc report
        "multiqc_report.html"
    message:
        '\n#################### ATAC-seq pipeline #####################\n'
        'Running all necessary rules to produce complete output.\n'
        '############################################################'
I know it's messy and I should leave only the necessary bits, but here my understanding of Snakemake fails, because I don't know what I have to keep and what I should delete.
This is working, to my knowledge, exactly as I want.
However, I added a rule:
rule hmmratac:
    input:
        bam = os.path.join(config['FILTERED'], '{sample}.bam'),
        index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
    output:
        model = os.path.join(config['HMMRATAC'], '{sample}.model'),
        gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
        summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
        states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
        logs = os.path.join(config['HMMRATAC'], '{sample}.log'),
        sample_name = '{sample}'
    log:
        os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
    params:
        genomes = config['GENOMES'],
        blacklisted = config['BLACKLIST']
    resources:
        mem_mb = 32000
    message:
        '\n######################### Peak calling ########################\n'
        'Peak calling for {output.sample_name}\n.'
        '############################################################'
    shell:
        'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
        '--bam {input.bam} --index {input.index} '
        '--genome {params.genome} --blacklist {params.blacklisted} '
        '--output {output.sample_name} --bedgraph true &> {log}'
And in the rule all, after filtering and before multiqc, I added:
        # Peak calling
        expand(os.path.join(config['HMMRATAC'], '{sample}.model'),
               sample = SAMPLES),
Relevant config.yaml fragments:
# Path to blacklisted regions
BLACKLIST: "/mnt/data/.../hg38.blacklist.bed"
# Path to chromosome sizes
GENOMES: "/mnt/data/.../hg38_sizes.genome"
# Path to filtered alignment
FILTERED: "alignment/filtered"
# Path to peaks
HMMRATAC: "peaks/hmmratac"
This is the error* I get (it goes on for every input and output of the rule). *Technically it's a warning, but it halts execution of Snakemake, so I am calling it an error.
File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
WARNING:snakemake.logging:File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
It isn't actually ...; I just didn't feel safe providing the absolute path here.
For a couple of days I have struggled with this error. I have looked through the documentation and listened to the introduction. I understand that the above description is far from perfect (it is huge because I don't even know how to pare it down to a minimal reproducible example), but I am desperate and hope you can be patient with me.
Any suggestions as to how to google this, or where to look for the error, would be much appreciated.
Technically it's a warning but it halts execution of snakemake so I am calling it an error.
It would be useful to post the logs from Snakemake to see whether it terminated with an error and, if so, which one.
However, in addition to Eric C.'s suggestion to use wildcards.sample instead of {sample} as a file name, I think that this is quite suspicious:
alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam
/mnt/ is usually at the root of the file system, and you are prepending a relative path (alignment/filtered) to it. Are you sure this is correct?
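Building on that, a hedged sketch of a repaired rule: sample_name is moved out of output (output entries must be file paths, and a bare '{sample}' entry lets the rule match almost any target path, plausibly how the blacklist path got pulled in), the message uses {wildcards.sample}, and the params key is renamed so that {params.genome} in the shell command actually resolves. The prefix param is an assumption about how HMMRATAC names its outputs:

rule hmmratac:
    input:
        bam = os.path.join(config['FILTERED'], '{sample}.bam'),
        index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
    output:
        model = os.path.join(config['HMMRATAC'], '{sample}.model'),
        gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
        summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
        states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph')
    log:
        os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
    params:
        genome = config['GENOMES'],       # key renamed to match {params.genome}
        blacklisted = config['BLACKLIST'],
        # assumed: HMMRATAC uses this as a prefix for its output files
        prefix = lambda wildcards: os.path.join(config['HMMRATAC'], wildcards.sample)
    resources:
        mem_mb = 32000
    message:
        'Peak calling for {wildcards.sample}'
    shell:
        'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
        '--bam {input.bam} --index {input.index} '
        '--genome {params.genome} --blacklist {params.blacklisted} '
        '--output {params.prefix} --bedgraph true &> {log}'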

Snakemake, how to change output filename when using wildcards

I think I have a simple problem but I don't know how to solve it.
My input folder contains files like this:
AAAAA_S1_R1_001.fastq
AAAAA_S1_R2_001.fastq
BBBBB_S2_R1_001.fastq
BBBBB_S2_R2_001.fastq
My Snakemake code:
import glob

samples = [os.path.basename(x) for x in sorted(glob.glob("input/*.fastq"))]
name = []
for x in samples:
    if "_R1_" in x:
        name.append(x.split("_R1_")[0])
NAME = name

rule all:
    input:
        expand("output/{sp}_mapped.bam", sp=NAME),

rule bwa:
    input:
        R1 = "input/{sample}_R1_001.fastq",
        R2 = "input/{sample}_R2_001.fastq"
    output:
        mapped = "output/{sample}_mapped.bam"
    params:
        ref = "refs/AF086833.fa"
    run:
        shell("bwa mem {params.ref} {input.R1} {input.R2} | samtools sort > {output.mapped}")
The output file names are:
AAAAA_S1_mapped.bam
BBBBB_S2_mapped.bam
I want the output file to be:
AAAAA_mapped.bam
BBBBB_mapped.bam
How can I either change the output name or rename the files, before or after the bwa rule?
Try this:
import pathlib

indir = pathlib.Path("input")
paths = indir.glob("*_S?_R?_001.fastq")
samples = set([x.stem.split("_")[0] for x in paths])

rule all:
    input:
        expand("output/{sample}_mapped.bam", sample=samples)

def find_fastqs(wildcards):
    fastqs = [str(x) for x in indir.glob(f"{wildcards.sample}_*.fastq")]
    return sorted(fastqs)

rule bwa:
    input:
        fastqs = find_fastqs
    output:
        mapped = "output/{sample}_mapped.bam"
    params:
        ref = "refs/AF086833.fa"
    shell:
        "bwa mem {params.ref} {input.fastqs} | samtools sort > {output.mapped}"
This uses an input function to find the correct input files for rule bwa. There might be a more elegant solution, but I can't see it right now. I think this should work, though.
(Edited to reflect OP's edit.)
Unfortunately, I've also had this problem, with filenames following this logic: {batch}/{seq_run}_{index}_{flowcell}_{lane}_{read_orientation}.fastq.gz.
I think that the core problem is that none of the individual wildcards are unique. Also, not all values for all wildcards can be combined; seq_run1 was run on lane1, not lane2. Therefore, expand() does not work.
After multiple attempts in Snakemake (see below), my solution was to standardize the input with mv / sed / rename; a sketch of that renaming step follows. Removing {batch}, {flowcell} and {lane} made it possible to use {sample}, a unique combination of {seq_run} and {index}.
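A minimal sketch of that standardization step in Python (hedged: the exact pattern and destination layout are assumptions based on the filename logic above):

import re
from pathlib import Path

# assumed layout: {batch}/{seq_run}_{index}_{flowcell}_{lane}_{read}.fastq.gz
pattern = re.compile(
    r"(?P<seq_run>[^_]+)_(?P<index>[^_]+)_[^_]+_[^_]+_(?P<read>R[12])\.fastq\.gz$"
)

dest = Path("standardized")
dest.mkdir(exist_ok=True)

for path in Path(".").glob("*/*.fastq.gz"):
    m = pattern.match(path.name)
    if m:
        # keep only the unique {seq_run}_{index} combination plus read orientation
        path.rename(dest / f"{m['seq_run']}_{m['index']}_{m['read']}.fastq.gz")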
What did not work (but could be worth trying for others in the same situation):
Adding the zip argument to expand()
Renaming the output using the following syntax:
output: "_".join(re.split("[/_]", "{full_filename}")[1:3]) + ".fastq.gz"

mapping reads using snakemake

I am trying to run hisat2 mapping using Snakemake.
Basically, I'm using a config.yaml file like this:
reads:
    set1: /path/to/set1/samplelist.tab
hisat2:
    database: genome
    genome: genome.fa
    nodes: 2
    memory: 8G
    arguments: --dta
executables:
    hisat2: /Tools/hisat2-2.1.0/hisat2
    samtools: /Tools/samtools-1.3/samtools
Then the Snakefile:
configfile: "config.yaml"
workdir: "/path/to/working_dir/"

# Hisat2
rule hisat2:
    input:
        reads = lambda wildcards: config["reads"][wildcards.sample]
    output:
        bam = "{sample}/{sample}.bam"
    params:
        idx = config["hisat2"]["database"],
        executable = config["executables"]["hisat2"],
        nodes = config["hisat2"]["nodes"],
        memory = config["hisat2"]["memory"],
        executable2 = config["executables"]["samtools"]
    run:
        shell("{params.executable} --dta -p {params.nodes} -x {params.idx} {input.reads} |"
              "{params.executable2} view -Sbh -o {output.bam} -")

# all
rule all:
    input:
        lambda wildcards: [sample + "/" + sample + ".bam"
                           for sample in config["reads"].keys()]
My samplelist.tab is like this:
id reads1 reads2
set1a set1a_R1.fastq.gz set1a_R2.fastq.gz
set1b set1b_R1.fastq.gz set1b_R2.fastq.gz
Any hints on how to make this work? I apologize for the messy script; I just started using Snakemake.
You will have to do something like this:
import pandas as pd

reads = pd.read_csv(config["reads"]['set1'], sep='\t', index_col=0)

def get_fastq(wildcards):
    return list(reads.loc[wildcards.sample].values)

rule hisat2:
    input:
        get_fastq
    ...
First you will need to load the sample list and store it (I did it as a pandas dataframe). Then you can look up which files belong to a given sample name.
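Putting it together, a fuller sketch of the idea (hedged: it assumes paired-end reads, that the paths in samplelist.tab resolve from the working directory, and that one BAM per sample id is wanted):

import pandas as pd

configfile: "config.yaml"

# sample sheet: one row per sample id, columns reads1 and reads2
reads = pd.read_csv(config["reads"]["set1"], sep="\t", index_col=0)

def get_fastq(wildcards):
    # both mates for this sample id, e.g. ['set1a_R1.fastq.gz', 'set1a_R2.fastq.gz']
    return list(reads.loc[wildcards.sample].values)

rule all:
    input:
        expand("{sample}/{sample}.bam", sample=reads.index)

rule hisat2:
    input:
        reads = get_fastq
    output:
        bam = "{sample}/{sample}.bam"
    params:
        idx = config["hisat2"]["database"],
        hisat2 = config["executables"]["hisat2"],
        samtools = config["executables"]["samtools"],
        nodes = config["hisat2"]["nodes"]
    shell:
        "{params.hisat2} --dta -p {params.nodes} -x {params.idx} "
        "-1 {input.reads[0]} -2 {input.reads[1]} | "
        "{params.samtools} view -Sbh -o {output.bam} -"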
Edit:
Rewriting the code to look like this is much more readable (in my opinion):

rule hisat2:
    input:
        ["{sample}_R1.fastq.gz",
         "{sample}_R2.fastq.gz"]
    ...
