fastqc using Snakemake - bioinformatics

I have a list of samples going through Snakemake. When I arrive at my fastqc step, I suddenly have two files per sample (an R1 and R2 file). Consider the following rule:
rule fastqc:
    input:
        os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
        os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')
    output:
        os.path.join(fastq_dir, '{sample}_R1_fastq.html'),
        os.path.join(fastq_dir, '{sample}_R2_fastq.html')
    conda:
        "../envs/fastqc.yaml"
    shell:
        '''
        #!/bin/bash
        fastqc {input} --outdir={fastqc_dir}
        '''
This does not work. I also tried the following:
rule fastqc:
    input:
        expand([os.path.join(fastq_dir, '{sample}_R{read}_001.fastq.gz')], read=['1', '2'])
    output:
        expand([os.path.join(fastq_dir, '{sample}_R{read}_fastq.html')], read=['1', '2'])
    conda:
        "../envs/fastqc.yaml"
    shell:
        '''
        #!/bin/bash
        fastqc {input} --outdir={fastqc_dir}
        '''
Which also does not work, I get:
No values given for wildcard 'sample'.
Then I tried:
rule fastqc:
    input:
        expand([os.path.join(fastq_dir, '{sample}_R{read}_001.fastq.gz')], read=['1', '2'], sample=samples['samples'])
    output:
        expand([os.path.join(fastqc_dir, '{sample}_R{read}_fastqc.html')], read=['1', '2'], sample=samples['samples'])
    conda:
        "../envs/fastqc.yaml"
    shell:
        '''
        #!/bin/bash
        fastqc {input} --outdir={fastqc_dir}
        '''
But this feeds all fastq files there are into one shell script, it seems.
How should I properly "loop" over multiple inputs for 1 sample?
Highest regards.
Edit:
My rule all looks like this, probably I should also change that, right (see the last 2 lines for fastqc)?
# Rule all is a pseudo-rule that tells snakemake what final files to generate.
rule all:
    input:
        expand([os.path.join(analyzed_dir, '{sample}.genes.results'),
                os.path.join(rseqc_dir, '{sample}.bam_stat.txt'),
                os.path.join(rseqc_dir, '{sample}.clipping_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.deletion_profile.txt'),
                os.path.join(rseqc_dir, '{sample}.infer_experiment.txt'),
                os.path.join(rseqc_dir, '{sample}.geneBodyCoverage.txt'),
                os.path.join(rseqc_dir, '{sample}.inner_distance.txt'),
                os.path.join(rseqc_dir, '{sample}.insertion_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.junction.xls'),
                os.path.join(rseqc_dir, '{sample}.junctionSaturation_plot.r'),
                os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.read_distribution.txt'),
                os.path.join(rseqc_dir, '{sample}.pos.DupRate.xls'),
                os.path.join(rseqc_dir, '{sample}.seq.DupRate.xls'),
                os.path.join(rseqc_dir, '{sample}.GC.xls'),
                os.path.join(rseqc_dir, '{sample}.NVC.xls'),
                os.path.join(rseqc_dir, '{sample}.qual.r'),
                os.path.join(rseqc_dir, '{sample}.RNA_fragment_size.txt'),
                os.path.join(rseqc_dir, '{sample}.STAR.genome.sorted.summary.txt'),
                os.path.join(fastq_dir, '{sample}_R1_fastq.html'),
                os.path.join(fastq_dir, '{sample}_R2_fastq.html')],
               sample=samples['samples'])

Yes, this one I figured out by "myself". The magic is in the "rule all" part.
This combination of rules works:
reads = ['1', '2']
# Rule all is a pseudo-rule that tells snakemake what final files to generate.
rule all:
    input:
        expand([os.path.join(analyzed_dir, '{sample}.genes.results'),
                os.path.join(rseqc_dir, '{sample}.bam_stat.txt'),
                os.path.join(rseqc_dir, '{sample}.clipping_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.deletion_profile.txt'),
                os.path.join(rseqc_dir, '{sample}.infer_experiment.txt'),
                os.path.join(rseqc_dir, '{sample}.geneBodyCoverage.txt'),
                os.path.join(rseqc_dir, '{sample}.inner_distance.txt'),
                os.path.join(rseqc_dir, '{sample}.insertion_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.junction.xls'),
                os.path.join(rseqc_dir, '{sample}.junctionSaturation_plot.r'),
                os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls'),
                os.path.join(rseqc_dir, '{sample}.read_distribution.txt'),
                os.path.join(rseqc_dir, '{sample}.pos.DupRate.xls'),
                os.path.join(rseqc_dir, '{sample}.seq.DupRate.xls'),
                os.path.join(rseqc_dir, '{sample}.GC.xls'),
                os.path.join(rseqc_dir, '{sample}.NVC.xls'),
                os.path.join(rseqc_dir, '{sample}.qual.r'),
                os.path.join(rseqc_dir, '{sample}.RNA_fragment_size.txt'),
                os.path.join(rseqc_dir, '{sample}.STAR.genome.sorted.summary.txt'),
                os.path.join(fastqc_dir, '{sample}_R{read}_001_fastqc.html')],
               sample=samples['samples'], read=reads)
Notice the simple addition of {read} to the otherwise unchanged fastqc part, and the definition of "reads" at the top (samples is a standard samples list).
Then I use this fastqc rule:
rule fastqc:
    input:
        os.path.join(fastq_dir, '{sample}_R{read}_001.fastq.gz')
    output:
        os.path.join(fastqc_dir, '{sample}_R{read}_001_fastqc.html')
    conda:
        "../envs/fastqc.yaml"
    shell:
        '''
        #!/bin/bash
        fastqc {input} --outdir={fastqc_dir}
        '''
It has the same line as the "rule all" (as usual). This works, thanx for the upvotes, Freek out.
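To see why adding {read} to "rule all" is enough, it may help to look at what expand produces. The sketch below does not import Snakemake; it imitates expand with itertools.product. The helper name expand_like, the directory "fastqc" and the sample names are made up for illustration.

```python
import os
from itertools import product

def expand_like(pattern, **wildcards):
    """Rough stand-in for Snakemake's expand(): fill the pattern with
    every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[n] for n in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]

fastqc_dir = "fastqc"              # hypothetical output directory
samples = ["sampleA", "sampleB"]   # hypothetical sample names
reads = ["1", "2"]

# One target per (sample, read) pair, as requested in "rule all".
targets = expand_like(
    os.path.join(fastqc_dir, "{sample}_R{read}_001_fastqc.html"),
    sample=samples, read=reads)

for t in targets:
    print(t)
```

Each of the four paths is then matched separately against the output pattern of rule fastqc, so Snakemake can schedule one job per sample and read, each with a single input file.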

Related

if condition in a function - Makefile

I'm working with a makefile, and very simple things like if-conditions are not straightforward. It gives me an error that is not readable.
Any idea what's wrong with my following small function?
prepare-test-example:
	ifeq ($(ENGINE),'aurora-postgresql')
	@cat examples/example.yaml > /tmp/stack_test.yaml
	else
	@cat examples/example.yaml examples/example_test.yaml > /tmp/stack_test.yaml
	endif
The call:
make test ENGINE=aurora-postgresql
/Library/Developer/CommandLineTools/usr/bin/make prepare-test-example ENGINE=aurora-postgresql
ifeq (aurora-postgresql,'aurora-postgresql')
/bin/sh: -c: line 0: syntax error near unexpected token `aurora-postgresql,'aurora-postgresql''
/bin/sh: -c: line 0: `ifeq (aurora-postgresql,'aurora-postgresql')'
make[1]: *** [prepare-test-example] Error 2

You have indented the ifeq so it looks to make like something it should pass to the shell.
Try either
ifeq ($(ENGINE),'aurora-postgresql')
files := examples/example.yaml
else
files := examples/example.yaml examples/example_test.yaml
endif
prepare-test-example:
	@cat $(files) > /tmp/stack_test.yaml
or
prepare-test-example:
	@if [ "$(ENGINE)" = "'aurora-postgresql'" ]; then \
	    cat examples/example.yaml \
	; else \
	    cat examples/example.yaml examples/example_test.yaml \
	; fi > /tmp/stack_test.yaml
For fun, I refactored out the redirection in the latter (pure shell script) example.
Perhaps you meant ifeq ('$(ENGINE)','aurora-postgresql'), which would make more sense and would allow the above code to be simplified somewhat.

Could not read .config parameters in Makefile

I have a simple question about how a makefile can read a variable that is set in a .config file.
For example, I have CONFIG_a=y, CONFIG_b and CONFIG_c as three variables in the .config of my Linux configuration. I have a Makefile which should define a variable depending on which CONFIG_X is set:
if CONFIG_A
DFLAGS=-DABC
if CONFIG_B
DFLAGS=-DBCD
if CONFIG_C
DFLAGS=-DCDE
How could I achieve this in a makefile?
I tried:
ifeq ($(TARGET_A),y)
  DFLAGS=-DABC
else
  ifeq ($(TARGET_B),y)
    DFLAGS=-DBCD
  else
    DFLAGS=-DCDE
  endif
endif
This did not help: -DCDE is set every time.

You seem to be using GNU Make.
Let's assume that exactly one of CONFIG_a, CONFIG_b, CONFIG_c must be set
in foo.config, by a line of the form:
CONFIG_(a|b|c)\s*=\s*y
where \s* means 0 or more spaces.
Then you can parse the setting and conditionally set DFLAGS as shown:-
Makefile
CONFIG := $(filter CONFIG%=y,$(shell tr -d ' ' < foo.config))
ifneq ($(words $(CONFIG)),1)
$(error Exactly one CONFIG_? must be set)
endif
ifeq ($(CONFIG),CONFIG_a=y)
DFLAGS:=-DABC
endif
ifeq ($(CONFIG),CONFIG_b=y)
DFLAGS=-DBCD
endif
ifeq ($(CONFIG),CONFIG_c=y)
DFLAGS=-DCDE
endif
$(info CONFIG=$(CONFIG))
.PHONY: all
all:
	@echo DFLAGS=$(DFLAGS)
With:
foo.config (1)
CONFIG_a = y
CONFIG_b
CONFIG_c
you'll get:
$ make
CONFIG=CONFIG_a=y
DFLAGS=-DABC
With:
foo.config (2)
CONFIG_a = n
CONFIG_b=y
CONFIG_c
you'll get:
$ make
CONFIG=CONFIG_b=y
DFLAGS=-DBCD
With:
foo.config (3)
CONFIG_a = y
CONFIG_b
CONFIG_c=y
you'll get:
$ make
Makefile:3: *** Exactly one CONFIG_? must be set. Stop.
And with:
foo.config (4)
CONFIG_a = n
CONFIG_b
CONFIG_c
again:
$ make
Makefile:3: *** Exactly one CONFIG_? must be set. Stop.
To understand how it works see:-
man tr
8.13 The shell Function in the GNU Make manual
8.2 Functions for String Substitution and Analysis in the GNU Make manual.
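
For readers more at home in Python than in Make, the parsing step of the Makefile above can be modeled roughly as follows. This is only a sketch of the logic; the function name select_config and the DFLAGS lookup table are illustrative and not part of the Makefile.

```python
def select_config(text):
    """Rough model of $(filter CONFIG%=y,$(shell tr -d ' ' < foo.config))."""
    # tr -d ' ' deletes every space; $(shell ...) then turns newlines into
    # word separators, which str.split() reproduces here.
    words = text.replace(" ", "").split()
    # $(filter CONFIG%=y,...) keeps words that start with CONFIG and end with =y.
    chosen = [w for w in words if w.startswith("CONFIG") and w.endswith("=y")]
    # Mirrors the ifneq ($(words $(CONFIG)),1) check.
    if len(chosen) != 1:
        raise ValueError("Exactly one CONFIG_? must be set")
    return chosen[0]

# Illustrative table standing in for the three ifeq blocks.
DFLAGS = {"CONFIG_a=y": "-DABC", "CONFIG_b=y": "-DBCD", "CONFIG_c=y": "-DCDE"}

# Same content as foo.config (2) above.
config = select_config("CONFIG_a = n\nCONFIG_b=y\nCONFIG_c\n")
print("CONFIG=" + config)
print("DFLAGS=" + DFLAGS[config])
```

Feeding it the content of foo.config (3) or (4) raises the error instead, matching the two failing runs shown above.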

function call as target in implicit rules

In my Makefile, I have a recipe that generates multiple files. Something like this:
foo-% bar-%: foobar-%
	grep "foo" $^ > foo-$*
	grep "bar" $^ > bar-$*
which works as expected:
$ make foo-lol
grep "foo" foobar-lol > foo-lol
grep "bar" foobar-lol > bar-lol
In my real case, the target and prerequisite file paths are a lot more complicated (see the end) and are used elsewhere, so I placed the logic in a function like so:
target_names = foo-$(1) bar-$(1)
and tried to use it on my recipe
$(call target_names,%): foobar-%
	grep "foo" $^ > foo-$*
	grep "bar" $^ > bar-$*
Which to my surprise, also worked:
$ make foo-lol
grep "foo" foobar-lol > foo-lol
grep "bar" foobar-lol > bar-lol
I don't understand why it works. How does Make find this recipe? I always thought that Make would go through the list of targets in all recipes. But if that's so, what is the logic behind it when instead of target filenames there is a function call? How does Make find the right input for the function call that returns the wanted target?
As I mentioned, my real example is considerably more complex, with 8 targets and 2 prerequisites:
target_names = \
    $(foreach tissue, $(TISSUE), \
        $(foreach chain, $(CHAIN), \
            $(foreach direction, R1 R2, \
                $(foreach read, $(filter %_$(direction)_001.fastq, $(1)), \
                    $(patsubst data/%_$(direction)_001.fastq, \
                        results/%-$(tissue)-$(chain)-$(direction).fastq, \
                        $(read))))))

The function call is processed before the rule is added to the list of rules. You can verify this by running make -Rpn.
$(call target_names,%)
-> substitutes foo-$(1) bar-$(1), where $1 == %
-> returns foo-% bar-%
-> foo-% bar-%: foobar-% is added to the list of rules
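
The two phases can be imitated in a few lines of Python. This is a deliberately simplified model, not how Make is implemented: call only does the textual $(1) replacement, and stem only handles patterns with a single %. Both helper names are made up for this sketch.

```python
def call(template, *args):
    """Very rough model of $(call var,arg1,...): replace $(1), $(2), ..."""
    for i, arg in enumerate(args, start=1):
        template = template.replace(f"$({i})", arg)
    return template

def stem(pattern, target):
    """Return the stem if target matches a pattern with one %, else None."""
    prefix, suffix = pattern.split("%", 1)
    if (target.startswith(prefix) and target.endswith(suffix)
            and len(target) > len(prefix) + len(suffix)):
        return target[len(prefix):len(target) - len(suffix)]
    return None

target_names = "foo-$(1) bar-$(1)"

# Phase 1: while the makefile is read, the call is expanded to plain patterns.
patterns = call(target_names, "%").split()
print(patterns)

# Phase 2: only later is a requested target matched against those patterns.
print(stem(patterns[0], "foo-lol"))
```

By the time `make foo-lol` looks for a rule, the function call no longer exists; only the patterns foo-% and bar-% are left to match against.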

Makefile. Multidimensional list?

I need to write a pattern rule for the following case:
There are 2 folders: A and B
Running the command python gen.py --a=A/file1.foo --b=file2.bar --c=file3.bar generates B/file1.foo
file1, file2 and file3 are different strings
Is there a way to group those filenames in some kind of a multidimensional array, so that all files are written exactly once (I'll use python syntax):
files = [["a1.foo", "a2.bar", "a3.bar"],
#...200 other groups...
["b1.foo", "b2.bar", "b3.bar"]]
and then the rule looks like this:
$(files): B/{reference 1 elem}: A/{1 elem} {2 elem} {3 elem}
python gen.py --a=A/{1 elem} --b={2 elem} --c={3 elem}
Any ideas how to achieve it?

You can use standard make syntax for that:
all :
targets :=
define add_target
B/${1}: A/${1} ${2} ${3}
targets += B/${1}
endef
# Build dependencies.
$(eval $(call add_target,a1.foo,a2.bar,a3.bar))
# ...
$(eval $(call add_target,b1.foo,b2.bar,b3.bar))
# One generic rule for all ${targets}
${targets} : % :
	@echo Making $@ from $^
all : ${targets}
.PHONY: all
Note that these $(eval $(call add_target,...)) calls are whitespace-sensitive; do not insert spaces in there.
If you would like make to create the directory for outputs automatically do:
${targets} : % : | B
B :
	mkdir $@

Sometimes a little repetition isn't so bad really
targets := B/a1.foo B/b1.foo
.PHONY: all
all: $(targets)
$(targets): B/%: A/%
	python gen.py --a=$< --b=$(word 2,$^) --c=$(word 3,$^)
B/a1.foo: a2.bar a3.bar
B/b1.foo: b2.bar b3.bar
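
Both answers encode the same mapping from one group of three names to a target, its prerequisites and a command. Since the question used Python syntax for the list, here is that mapping spelled out in Python. It is only meant to make the structure explicit; the function name rule_for is illustrative, and the two groups are the ones from the question.

```python
files = [["a1.foo", "a2.bar", "a3.bar"],
         ["b1.foo", "b2.bar", "b3.bar"]]

def rule_for(group):
    """Return (target, prerequisites, command) for one group of three names."""
    first, second, third = group
    target = f"B/{first}"
    prereqs = [f"A/{first}", second, third]
    command = f"python gen.py --a=A/{first} --b={second} --c={third}"
    return target, prereqs, command

for group in files:
    target, prereqs, command = rule_for(group)
    print(f"{target}: {' '.join(prereqs)}")
    print(f"    {command}")
```

In the second answer, $< plays the role of A/{first}, and $(word 2,$^) and $(word 3,$^) pick out the second and third names.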

In Kernel makefile $(call cmd, tags) what is the cmd here refers to?

In the kernel Makefile I found code like the following:
ctags CTAGS CSCOPE: $(HEADERS) $(SOURCES)
	$(ETAGS) $(ETAGSFALGS) $(HEADERS) $(SOURCES)
	$(call cmd, ctags)
Also, where can I find the macro or function?

Using MadScientist's method on kernel v4.1:
make -p | grep -B1 -E '^cmd '
we find:
# makefile (from `scripts/Kbuild.include', line 211)
cmd = @$(echo-cmd) $(cmd_$(1))
scripts/Kbuild.include is included on the top level Makefile. It also contains:
echo-cmd = $(if $($(quiet)cmd_$(1)),\
echo ' $(call escsq,$($(quiet)cmd_$(1)))$(echo-why)';)
quiet: set at the top level makefile, depending on the value of V.
Will be either:
quiet_ to print CC file.c
empty to print the command on V=
silent_ to not print anything on make -s
escsq is defined as:
squote := '
escsq = $(subst $(squote),'\$(squote)',$1)
It escapes single quotes so that echo '$(call escsq,Letter 'a'.)' will print properly in sh.
echo-why: defined further down at Kbuild.include.
It is used for make V=2, and says why a target is being remade.
The setup of make tags is done in the Makefile:
quiet_cmd_tags = GEN $@
cmd_tags = $(CONFIG_SHELL) $(srctree)/scripts/tags.sh $@
tags TAGS cscope gtags: FORCE
	$(call cmd,tags)
Which shows the typical usage pattern for calling commands on kbuild:
quiet_cmd_XXX = NAME $@
cmd_XXX = actual-command $@
target: prerequisites
	$(call cmd,XXX)
A comment on the Makefile explains how all of this is done to make the make output prettier:
# Beautify output
# ---------------------------------------------------------------------------
#
# Normally, we echo the whole command before executing it. By making
# that echo $($(quiet)$(cmd)), we now have the possibility to set
# $(quiet) to choose other forms of output instead, e.g.
#
# quiet_cmd_cc_o_c = Compiling $(RELDIR)/$@
# cmd_cc_o_c = $(CC) $(c_flags) -c -o $@ $<
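
The lookup done by echo-cmd and the quoting done by escsq can be modeled in Python as a rough sketch. It leaves out echo-why, assumes $@ has already been expanded to "tags", and uses a made-up variable table; the function names simply mirror the Make variables they imitate.

```python
import shlex

def escsq(text):
    """Model of kbuild's escsq: replace each ' with '\\'' for use inside '...'."""
    return text.replace("'", "'\\''")

def echo_cmd(variables, quiet, name):
    """Model of echo-cmd: build the echo line if $(quiet)cmd_<name> is set."""
    shown = variables.get(f"{quiet}cmd_{name}", "")
    return f"echo '  {escsq(shown)}';" if shown else ""

# Hypothetical variable table; the values are illustrative only.
variables = {
    "quiet_cmd_tags": "GEN     tags",
    "cmd_tags": "sh scripts/tags.sh tags",
}

print(echo_cmd(variables, "quiet_", "tags"))         # short form
print(echo_cmd(variables, "", "tags"))               # full command (quiet is empty)
print(repr(echo_cmd(variables, "silent_", "tags")))  # no silent_cmd_tags: nothing

# The escaped text survives a round trip through POSIX shell quoting rules.
quoted = "echo '" + escsq("Letter 'a'.") + "'"
print(shlex.split(quoted))
```

The silent_ case works because no silent_cmd_XXX variable is ever defined, so the $(if ...) in echo-cmd expands to nothing.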

If you run make -p it will print the entire database of all variables, rules, etc. with line numbers where they were last defined.
