snakemake program not running all functions? - shell

configfile: "config.yaml"
DATA = config['DATA_DIR']
BIN = config['BIN']
JASPAR = config['DATA_DIR']
RESULTS = config['RESULTS']
# JASPAR = "{0}/JASPAR2020".format(DATA)
JASPARS, ASSEMBLIES, BATCHES, TFS, BEDS = glob_wildcards(os.path.join(DATA, "{jaspar}", "{assembly}", "TFs", "{batch}", "{tf}", "{bed}.bed"))
rule all:
input:
expand (os.path.join(RESULTS, "{jaspar}", "{assembly}", "LOLA_dbs", "JASPAR2020_LOLA_{batch}.RDS"), jaspar = JASPARS, assembly = ASSEMBLIES, batch = BATCHES)
rule createdb:
input:
files = expand(os.path.join(RESULTS, "{jaspar}", "{assembly}", "data", "{batch}", "{tf}", "regions", "{bed}.bed"), zip, jaspar = JASPARS, assembly = ASSEMBLIES, batch = BATCHES, tf = TFS, bed = BEDS)
output:
os.path.join(RESULTS, "{jaspar}", "{assembly}", "LOLA_dbs", "JASPAR2020_LOLA_{batch}.RDS")
shell:
"""
R --vanilla --slave --silent -f {BIN}/create_lola_db.R \
--args {RESULTS}/{wildcards.jaspar}/{wildcards.assembly}/data/{wildcards.batch} \
{output};
"""
Why my snakemake program is not considering "createdb" rule. It is only considering "all". Can someone please help me with this?

"snakemake only runs the first rule in a Snakefile, by default." - SOURCE
If that first rule's input isn't clearly linked to createdb, it's not going to know to run createdb. Because the only rule it wants to run would have to depend on createdb.
I suspect your problem is the input to your rule createdb shouldn't have expand() used for all the wildcards already handled by rule all, just bed?. See here.
Also other avenues to consider for debugging is investigating what you unpacked glob_wildcards() to and what files is.

Related

snakemake 6.0.5: Input a list of folders and multiple files from each folder (to merge Lanes)

Good day. I have some directories (shown in bold below) each having some .fastq files for different lanes.
CND1/ UD_LOO3_R1.fastq.gz UD_LOO4_R1.fastq.gz
CND2/ XD_L001_R1.fastq.gz XD_L004_R1.fastq.gz
Inside each directory, i want to create a merged fastq file that will be named as : sample_R1.fastq.gz. For instance, CND1/UD_R1.fastq.gz and CND2/XD_R1.fastq.gz etc and so on. To this end, i created the following snakemake workflow.
from collections import defaultdict
dirs,samp,lane = glob_wildcards("dir}/{sample}_L{lane}_R1.fastq.gz")
dirs, fls = glob_wildcards("{dir}/{files}_R1.fastq.gz")
D = defaultdict(list)
for x,y in zip(dirs,fls):
D[x].append(y+'_R1.fastq.gz')
rule ML:
message:
"Merge all Lanes for Fragment R1"
input:
expand( "{dir}/{files}",zip,dir=D.keys(),files=D.values() )
output:
expand( "{dir}/{s}_R1.fastq.gz",zip,dir=dirs,s=set(samp) )
shell:
"echo {input} && echo {output} "
#"zcat {input} >> {output}"
In the code above, dict D contains directories as keys and list of fastq as values.
{ 'CND2': ['XD_L001_R1.fastq.gz', 'XD_L004_R1.fastq.gz'], 'CND1': ['UD_LOO3_R1.fastq.gz', 'UD_LOO4_R1.fastq.gz'] }
Doing a dry-run, snakemake complains me of missing files as follows
Missing input files for rule ML:
CND2/['XD_L001_R1.fastq.gz', 'XD_L004_R1.fastq.gz']
CND1/['UD_LOO3_R1.fastq.gz', 'UD_LOO4_R1.fastq.gz']
I want to understand what is the correct way to provide both a directory and a list of files as input together. Any help is greatly appreciated.
Thanks.
The error clearly explains the issue. As a result of the expand function, your input is two files:
CND2/['XD_L001_R1.fastq.gz', 'XD_L004_R1.fastq.gz']
CND1/['UD_LOO3_R1.fastq.gz', 'UD_LOO4_R1.fastq.gz']
If you need the input to be four files like that:
CND1/UD_LOO3_R1.fastq.gz
CND1/UD_LOO4_R1.fastq.gz
CND2/XD_L001_R1.fastq.gz
CND2/XD_L004_R1.fastq.gz
you need to flatten your dictionary:
inputs = [(dir, file) for dir, files in D.items() for file in files]
rule ML:
input:
expand( "{dir}/{files}", zip, dir=[row[0] for row in inputs], files=[row[1] for row in inputs])
or alternatively:
inputs = [(dir, file) for dir, files in D.items() for file in files]
rule ML:
input:
expand( "{filename}", filename=[f"{dir}/{file}" for dir, file in inputs])
Overall you are overcomplicating the problem. The same can be done without this ugly juggling the lists of tuples. glob_wildcards("{filename_base}_R1.fastq.gz") should return you more convenient representation.

Snakemake - parameter file treated as a wildcard

I have written a pipeline in Snakemake. It's an ATAC-seq pipeline (bioinformatics pipeline to analyze genomics data from a specific experiment). Basically, until merging alignment step I use {sample_id} wildcard, to later switch to {sample} wildcard (merging two or more sample_ids into one sample).
working DAG here (for simplicity only one sample shown; orange and blue {sample_id}s are merged into one green {sample}
Tha all rule looks as follows:
configfile: "config.yaml"
SAMPLES_DICT = dict()
with open(config['SAMPLE_SHEET'], "r+") as fil:
next(fil)
for lin in fil.readlines():
row = lin.strip("\n").split("\t")
sample_id = row[0]
sample_name = row[1]
if sample_name in SAMPLES_DICT.keys():
SAMPLES_DICT[sample_name].append(sample_id)
else:
SAMPLES_DICT[sample_name] = [sample_id]
SAMPLES = list(SAMPLES_DICT.keys())
SAMPLE_IDS = [sample_id for sample in SAMPLES_DICT.values() for sample_id in sample]
rule all:
input:
# FASTQC output for RAW reads
expand(os.path.join(config['FASTQC'], '{sample_id}_R{read}_fastqc.zip'),
sample_id = SAMPLE_IDS,
read = ['1', '2']),
# Trimming
expand(os.path.join(config['TRIMMED'],
'{sample_id}_R{read}_val_{read}.fq.gz'),
sample_id = SAMPLE_IDS,
read = ['1', '2']),
# Alignment
expand(os.path.join(config['ALIGNMENT'], '{sample_id}_sorted.bam'),
sample_id = SAMPLE_IDS),
# Merging
expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_merged.bam'),
sample = SAMPLES),
# Marking Duplicates
expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_md.bam'),
sample = SAMPLES),
# Filtering
expand(os.path.join(config['FILTERED'],
'{sample}.bam'),
sample = SAMPLES),
expand(os.path.join(config['FILTERED'],
'{sample}.bam.bai'),
sample = SAMPLES),
# multiqc report
"multiqc_report.html"
message:
'\n#################### ATAC-seq pipeline #####################\n'
'Running all necessary rules to produce complete output.\n'
'############################################################'
I know it's too messy, I should only leave the necessary bits, but here my understanding of snakemake fails cause I don't know what I have to keep and what I should delete.
This is working, to my knowledge exactly as I want.
However, I added a rule:
rule hmmratac:
input:
bam = os.path.join(config['FILTERED'], '{sample}.bam'),
index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
output:
model = os.path.join(config['HMMRATAC'], '{sample}.model'),
gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
logs = os.path.join(config['HMMRATAC'], '{sample}.log'),
sample_name = '{sample}'
log:
os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
params:
genomes = config['GENOMES'],
blacklisted = config['BLACKLIST']
resources:
mem_mb = 32000
message:
'\n######################### Peak calling ########################\n'
'Peak calling for {output.sample_name}\n.'
'############################################################'
shell:
'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
'--bam {input.bam} --index {input.index} '
'--genome {params.genome} --blacklist {params.blacklisted} '
'--output {output.sample_name} --bedgraph true &> {log}'
And into the rule all, after filtering, before multiqc, I added:
# Peak calling
expand(os.path.join(config['HMMRATAC'], '{sample}.model'),
sample = SAMPLES),
Relevant config.yaml fragments:
# Path to blacklisted regions
BLACKLIST: "/mnt/data/.../hg38.blacklist.bed"
# Path to chromosome sizes
GENOMES: "/mnt/data/.../hg38_sizes.genome"
# Path to filtered alignment
FILTERED: "alignment/filtered"
# Path to peaks
HMMRATAC: "peaks/hmmratac"
This is the error* I get (It goes on for every input and output of the rule). *Technically it's a warning but it halts execution of snakemake so I am calling it an error.
File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
WARNING:snakemake.logging:File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
It isn't actually ... - I just didn't feel safe providing an absolute path here.
For a couple of days, I have struggled with this error. Looked through the documentation, listened to the introduction. I understand that the above description is far from perfect (it is huge bc I don't even know how to work it down to provide minimal reproducible example...) but I am desperate and hope you can be patient with me.
Any suggestions as to how to google it, where to look for an error would be much appreciated.
Technically it's a warning but it halts execution of snakemake so I am calling it an error.
It would be useful to post the logs from snakemake to see if snakemake terminated with an error and if so what error.
However, in addition to Eric C.'s suggestion to use wildcards.sample instead of {sample} as file name, I think that this is quite suspicious:
alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam
/mnt/ is usually at the root of the file system and you are prepending to it a relative path (alignment/filtered). Are you sure it is correct?

Yocto How to overwrite a file of linux rootfs depending on an IMAGE-recipe?

I'm trying to add a simple line in fstab within
the final rootfs that Yocto builds.
My first approach was to add my own fstab in my layer meta-mylayer/recipes-core/base-files/base-files/fstab and the proper meta-mylayer/recipes-core/base-files/base-files/base-files_%.bbappend which only have the following line:
FILESEXTRAPATHS_prepend := "${THISDIR}/${PN}:"
And it works, but as the title of my question says, i want to modify fstab based on the recipe-image i want to build i.e. dev-image & prod-image.
After some investigation i think i have 2 options
Modify fstab within the recipe image, extending the do_install task...
dev-image.bb
--------------
DESCRIPTION = "Development Image"
[...]
inherit core-image
do_install_append () {
echo "======= Modifying fstab ========"
cat >> ${D}${sysconfdir}/fstab <<EOF
# The line i want to Add
EOF
}
[...]
--------------
Problem is that i'm actually not seeing my modified line in my final /etc/fstab and bitbake is not showing any build error or warning about this, actually, i'm not even able to see the echo-trace i put.
My second attempt was to handle these modifications with packages and depending on the recipe-image i will be able to add the package for *-dev or *-prod. This idea was taken from Oleksandr Poznyak in this answer in summary he suggest the following:
1) Create *.bbappend recipe base-files_%s.bbappend in your layer. It
appends to poky "base-files" recipe.
2) Create your own "python do_package_prepend" function where you should
make your recipe produce two different packages
3) Add them to DEPENDS in your image recipe
And based on his example i made my own recipe:
base-files_%.bbappend
-------------------------
FILESEXTRAPATHS_prepend := "${THISDIR}/${PN}:"
SRC_URI += "file://fstab-dev \
file://fstab-prod \
"
PACKAGES += " ${PN}-dev ${PN}-prod"
CONFFILES_${PN}-dev = "${CONFFILES_${PN}}"
CONFFILES_${PN}-prod = "${CONFFILES_${PN}}"
pkg_preinst_${PN}-dev = "${pkg_preinst_${PN}}"
pkg_preinst_${PN}-prod = "${pkg_preinst_${PN}}"
RREPLACES_${PN}-dev = "${PN}"
RPROVIDES_${PN}-dev = "${PN}"
RCONFLICTS_${PN}-dev = "${PN}"
RREPLACES_${PN}-prod = "${PN}"
RPROVIDES_${PN}-prod = "${PN}"
RCONFLICTS_${PN}-prod = "${PN}"
python populate_packages_prepend() {
import shutil
packages = ("${PN}-dev", "${PN}-prod")
for package in packages:
# copy ${PN} content to packages
shutil.copytree("${PKGD}", "${PKGDEST}/%s" % package, symlinks=True)
# replace fstab
if package == "${PN}-dev":
shutil.copy("${WORKDIR}/fstab-dev", "${PKGDEST}/${PN}-dev/etc/fstab")
else:
shutil.copy("${WORKDIR}/fstab-prod", "${PKGDEST}/${PN}-prod/etc/fstab")
}
-------------------------
And in my recipe-image(dev-image.bb) i added base-files-dev packet
dev-image.bb
--------------
DESCRIPTION = "Development Image"
[...]
inherit core-image
IMAGE_INSTALL = " \
${MY_PACKETS} \
base-files-dev \
"
[...]
--------------
Problem with this, is that i'm not familiarized with phyton indentation so probably i'm messing things up, the error log shows as follows.
DEBUG: Executing python function populate_packages
ERROR: Error executing a python function in exec_python_func() autogenerated:
The stack trace of python calls that resulted in this exception/failure was:
File: 'exec_python_func() autogenerated', lineno: 2, function: <module>
0001:
*** 0002:populate_packages(d)
0003:
File: '/home/build/share/build_2/../sources/poky/meta/classes/package.bbclass', lineno: 1138, function: populate_packages
1134:
1135: workdir = d.getVar('WORKDIR')
1136: outdir = d.getVar('DEPLOY_DIR')
1137: dvar = d.getVar('PKGD')
*** 1138: packages = d.getVar('PACKAGES').split()
1139: pn = d.getVar('PN')
1140:
1141: bb.utils.mkdirhier(outdir)
1142: os.chdir(dvar)
File: '/usr/lib/python3.6/shutil.py', lineno: 315, function: copytree
0311: destination path as arguments. By default, copy2() is used, but any
0312: function that supports the same signature (like copy()) can be used.
0313:
0314: """
*** 0315: names = os.listdir(src)
0316: if ignore is not None:
0317: ignored_names = ignore(src, names)
0318: else:
0319: ignored_names = set()
Exception: FileNotFoundError: [Errno 2] No such file or directory: '${PKGD}'
DEBUG: Python function populate_packages finished
DEBUG: Python function do_package finished
I will really appreciate any clue or sort of direction, i'm not an Yocto expert so maybe the options that i suggest are not the most elegant and probably there is a better way to do it, so be free to give me any recommendation.
Thank you very much.
UPDATE:
As always, i was not the only one trying this, the way that i make it work was thanks this answer the only inconvenience with this is that you need to rm what you want to install through a .bbappend but for now is fine for me.
I also tried to do the same with bbclasses, which for me, it is a more elegant wayto do it, but i failed... i got the following error
ERROR: base-files-dev-3.0.14-r89 do_packagedata: The recipe base-files-dev is trying to install files into a shared area when those files already exist. Those files and their manifest location are:
I tried to rm fstab within the .bbappend but the same error is showed
Maybe somebody will share what i'm doing wrong...
If you don't find this post valuable please remove...
Your recipe which base on Oleksandr doesn't work due to dropped support for variables expansion in newer Poky.
https://www.yoctoproject.org/docs/latest/mega-manual/mega-manual.html#migration-2.1-variable-expansion-in-python-functions
Error explicit says:
Exception: FileNotFoundError: [Errno 2] No such file or directory: '${PKGD}'
It didn't expand the variable.
P.S.
This is not a proper answer to Your question but SO blocks comments.

Snakemake, how to change output filename when using wildcards

I think I have a simple problem but I don't how to solve it.
My input folder contains files like this:
AAAAA_S1_R1_001.fastq
AAAAA_S1_R2_001.fastq
BBBBB_S2_R1_001.fastq
BBBBB_S2_R2_001.fastq
My snakemake code:
import glob
samples = [os.path.basename(x) for x in sorted(glob.glob("input/*.fastq"))]
name = []
for x in samples:
if "_R1_" in x:
name.append(x.split("_R1_")[0])
NAME = name
rule all:
input:
expand("output/{sp}_mapped.bam", sp=NAME),
rule bwa:
input:
R1 = "input/{sample}_R1_001.fastq",
R2 = "input/{sample}_R2_001.fastq"
output:
mapped = "output/{sample}_mapped.bam"
params:
ref = "refs/AF086833.fa"
run:
shell("bwa mem {params.ref} {input.R1} {input.R2} | samtools sort > {output.mapped}")
The output file names are:
AAAAA_S1_mapped.bam
BBBBB_S2_mapped.bam
I want the output file to be:
AAAAA_mapped.bam
BBBBB_mapped.bam
How can I or change the outputname or rename the files before or after the bwa rule.
Try this:
import pathlib
indir = pathlib.Path("input")
paths = indir.glob("*_S?_R?_001.fastq")
samples = set([x.stem.split("_")[0] for x in paths])
rule all:
input:
expand("output/{sample}_mapped.bam", sample=samples)
def find_fastqs(wildcards):
fastqs = [str(x) for x in indir.glob(f"{wildcards.sample}_*.fastq")]
return sorted(fastqs)
rule bwa:
input:
fastqs = find_fastqs
output:
mapped = "output/{sample}_mapped.bam"
params:
ref = "refs/AF086833.fa"
shell:
"bwa mem {params.ref} {input.fastqs} | samtools sort > {output.mapped}"
Uses an input function to find the correct samples for rule bwa. There might be a more elegant solution, but I can't see it right now. I think this should work, though.
(Edited to reflect OP's edit.)
Unfortunately, I've also had this problem with filenames with the following logic: {batch}/{seq_run}_{index}_{flowcell}_{lane}_{read_orientation}.fastq.gz.
I think that the core problem is that none of the individual wildcards are unique. Also, not all values for all wildcards can be combined; seq_run1 was run on lane1, not lane2. Therefore, expand() does not work.
After multiple attempts in Snakemake (see below), my solution was to standardize input with mv / sed / rename. Removing {batch}, {flowcell} and {lane} made it possible to use {sample}, a unique combination of {seq_run} and {index}.
What did not work (but it could be worth to try for others in the same situation):
Adding the zip argument to expand()
Renaming output using the following syntax:
output: "_".join(re.split("[/_]", "{full_filename}")[1,2]+".fastq.gz"

Including/excluding entire groups from targets

My project has a group with a few hundred files (organized into a couple dozen subgroups two levels deep). The files in that group are themselves being changed reasonably often. I want those files to be included in some targets, but not others.
In Xcode 3.x, after each change to the group, I would just Get Info on the group itself, go to the Targets tab, and (re-)select the targets I wanted. (This was, in fact, the answer to a nearly-identical question from 2010, Xcode — groups and targets.)
In Xcode 5, the equivalent File Inspector panel doesn't have a Target Membership section if you have a group selected (and, even if selecting a group were the same as selecting all of its files, the Target Membership checkboxes are disabled if you select more than one file).
So, is this functionality still there, but hidden somewhere I haven't been able to find it?
If not, obviously there are other ways I can do what I want—script Xcode, parse the .pbxproj file, or abstract the group into a subproject or an entirely separate project that builds a static lib, etc. But I'd love to be able to work with Xcode here, the way I did in 3.x, instead of have to fight against it.
Actually, scripting Xcode doesn't seem to work. Any attempt to get the build files of a build phase fails with a generic -10000 error. For example:
tell application "Xcode"
set theproject to project "SampleProject"
set thetarget to target "SampleTarget" of theproject
set thephase to build phase "Compile Sources" of thetarget
build files of phase
end tell
… fails on the last line with:
error "Xcode got an error: AppleEvent handler failed." number -10000
Here's the hack I ended up using—I'd obviously still appreciate a better solution.
#!/usr/bin/env python3
import os
import plistlib
import sys
pbxproj = os.path.join(sys.argv[1], 'project.pbxproj')
groupname = sys.argv[2]
extensions = 'm mm c cc cpp'.split()
with open(pbxproj, 'rb') as f:
p = plistlib.load(f)
objs = p['objects']
groupid, group = next((k, v) for k, v in objs.items()
if v.get('path') == groupname)
def descendants(id):
obj = objs[id]
if obj['isa'] == 'PBXFileReference':
yield (id, obj)
for child in obj.get('children', []):
yield from descendants(child)
mdict = {id: obj for id, obj in descendants(group_id)
if os.path.splitext(obj['path'])[-1] in extensions}
proj_id, proj = next((k, v) for k, v in objs.items()
if v['isa'] == 'PBXProject')
for target_id in proj['targets']:
target = objs[target_id]
phase_ids = target['buildPhases']
phases = [(phase_id, objs[phase_id]) for phase_id in phase_ids]
phase_id, phase = next((phase_id, phase)
for phase_id, phase in phases
if phase['isa'] == 'PBXSourcesBuildPhase')
fileref_ids = [i
for i, buildfile_id in enumerate(phase['files'])
if objs[buildfile_id]['fileRef'] in mdict]
fileref_ids.sort(reverse=True)
for i in fileref_ids:
del phase['files'][i]
with open(pbxproj + '.new', 'wb') as f:
plistlib.dump(p, f)
os.rename(pbxproj, pbxproj + '.bak')
os.rename(pbxproj + '.new', pbxproj)

Resources