WildcardError in Snakefile - bioinformatics

I've been trying to run the following bioinformatics Snakefile:
configfile: "config.yaml"
WORK_TRIM = config["WORK_TRIM"]
WORK_KALL = config["WORK_KALL"]
rule all:
input:
expand(WORK_KALL + "quant_result_{condition}", condition=config["conditions"])
rule kallisto_quant:
input:
fq1 = WORK_TRIM + "{sample}_1_trim.fastq.gz",
fq2 = WORK_TRIM + "{sample}_2_trim.fastq.gz",
idx = WORK_KALL + "Homo_sapiens.GRCh38.cdna.all.fa.index"
output:
WORK_KALL + "quant_result_{condition}"
shell:
"kallisto quant -i {input.idx} -o {output} {input.fq1} {input.fq2}"
However, I keep obtaining an error like this:
WildcardError in line 13 of /home/user/directory/Snakefile:
Wildcards in input files cannot be determined from output files:
'sample'
Just to explain briefly, kallisto quant will produce 3 outputs: abundance.h5, abundance.tsv and run_info.json. Each of those files needs to go into its own newly created condition directory. I'm not getting what exactly is going wrong. I'd appreciate any help on this.

If you think about it, you are not giving Snakemake enough information.
Say "condition" is either "control" or "treated", with samples "C" and "T" respectively. You need to tell Snakemake about the association control: C, treated: T. You could do this using input functions or lambda functions. For example:
cond2samp = {'control': 'C', 'treated': 'T'}

rule all:
    input:
        expand("quant_result_{condition}", condition=cond2samp.keys())

rule kallisto_quant:
    input:
        fq1 = lambda wc: "%s_1_trim.fastq.gz" % cond2samp[wc.condition],
        fq2 = lambda wc: "%s_2_trim.fastq.gz" % cond2samp[wc.condition],
        idx = "Homo_sapiens.GRCh38.cdna.all.fa.index"
    output:
        "quant_result_{condition}"
    shell:
        "kallisto quant -i {input.idx} -o {output} {input.fq1} {input.fq2}"

Related

Accessing Snakemake Config Samples

I have a rule that needs to take 2 samples and combine them.
This is what my samples look like in my config file:
samples:
  group1:
    sra1:
      sample: "SRR14724462"
      cell_line: "NA24385"
      exome_bedfile: "/bedfiles/truseq.sorted.bed"
    sra2:
      sample: "SRR14724472"
      cell_line: "NA24385"
      exome_bedfile: "/bedfiles/idt.sorted.bed"
  group2:
    sra1:
      sample: "SRR14724463"
      cell_line: "NA12878"
      exome_bedfile: "/bedfiles/truseq.sorted.bed"
    sra2:
      sample: "SRR14724473"
      cell_line: "NA12878"
      exome_bedfile: "/bedfiles/idt.sorted.bed"
Essentially I want to combine the sra1 samples of group1 and group2 together, and likewise the sra2 samples, into these combinations:
SRR14724462 and SRR14724463
SRR14724472 and SRR14724473
This is my rule and rule all:
rule combine:
    output:
        r1 = TRIMMED_DIR + "/{sample1}_{sample2}_R1.fastq",
        r2 = TRIMMED_DIR + "/{sample1}_{sample2}_R2.fastq"
    params:
        trimmed_dir = TRIMMED_DIR,
        a = "{sample1}",
        b = "{sample2}"
    shell:
        """
        cd {params.trimmed_dir}
        /combine.sh {params.a}_R1_trimmed.fastq {params.a}_R2_trimmed.fastq {params.b}_R1_trimmed.fastq {params.b}_R2_trimmed.fastq
        """

rule all:
    input:
        expand(TRIMMED_DIR + "/{sample1}_{sample2}_R1.fastq", sample1=list_a, sample2=list_b),
        expand(TRIMMED_DIR + "/{sample1}_{sample2}_R2.fastq", sample1=list_a, sample2=list_b)
This works EXCEPT it does these combinations:
SRR14724462 and SRR14724463
SRR14724462 and SRR14724473
SRR14724472 and SRR14724463
SRR14724472 and SRR14724473
I only want these combinations:
SRR14724462 and SRR14724463
SRR14724472 and SRR14724473
Note: not shown is how I got list_a and list_b, but essentially they are:
list_a = ['SRR14724462', 'SRR14724472']
list_b = ['SRR14724463', 'SRR14724473']
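A hedged sketch of one way to get only the paired combinations: expand() accepts zip as a combinator, which pairs list_a and list_b element by element instead of taking the full product. Assuming list_a and list_b are ordered so that matching samples share an index:

list_a = ['SRR14724462', 'SRR14724472']
list_b = ['SRR14724463', 'SRR14724473']

rule all:
    input:
        expand(TRIMMED_DIR + "/{sample1}_{sample2}_R1.fastq", zip, sample1=list_a, sample2=list_b),
        expand(TRIMMED_DIR + "/{sample1}_{sample2}_R2.fastq", zip, sample1=list_a, sample2=list_b)

This requests only SRR14724462_SRR14724463_* and SRR14724472_SRR14724473_*.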

Syntax to generate a Syntax in SPSS

I’m trying to construct a syntax that generates a syntax in SPSS, but I’m having some issues…
I have an Excel file with metadata and I would like to use it to generate a syntax that extracts information from it (this way, if I have a huge database, I just need to keep the Excel file updated – add/delete variables, etc. – and then run one syntax to build the new syntax with the needed information).
I also noticed the produced syntax is always around 15 MB, which is a lot (applied to more than 500 lines)!
I don’t use Python because the syntax has to run on different computers and/or configurations.
Any ideas? Can anyone please help me?
Thank you in advance.
Example:
(test.xlsx – sheet 1)
Var Code Label List Var_label (concatenate Var+Label)
V1 3 Sex 1 V1 “Sex”
V2 1 Work 2 V2 “Work”
V3 3 Country 3 V3 “Country”
V4 1 Married 2 V4 “Married”
V5 1 Kids 2 V5 “Kids”
V6 2 Satisf1 4 V6 “Satisf1”
V7 2 Satisf2 4 V7 “Satisf2”
(information from other file)
List = 1
1 “Male”
2 “Female”
List = 2
1 “Yes”
2 “No”
List = 3
1 “Europe”
2 “America”
3 “Asia”
4 “Africa”
5 “Oceania”
List = 4
1 “Very unsatisfied”
10 “Very satisfied”
I want to make a Syntax that generates a new syntax to apply “VARIABLE LABELS” and “VALUE LABELS”. So, I thought about something like this:
GET DATA
/TYPE=XLSX
/FILE="test.xlsx"
/SHEET=name 'sheet 1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=95.0.
EXECUTE.
STRING vlb (A15) labels (A150) value (A12) lab (A1500) point (A2) separate (A50) space (A2) list1 (A100) list2 (A100).
SELECT IF (Code=1).
COMPUTE vlb = "VARIABLE LABELS".
COMPUTE labels = CONCAT (RTRIM(Var_label)," ").
COMPUTE point = ".".
COMPUTE value = "VALUE LABELS".
COMPUTE lab = CONCAT (RTRIM(Var)," ").
COMPUTE list1 = '1 " Yes "'.
COMPUTE list2 = '2 "No".'.
COMPUTE space = " ".
COMPUTE separate="************************************************.".
WRITE OUTFILE = "list_01.sps" / vlb.
WRITE OUTFILE = "list_01.sps" /labels.
WRITE OUTFILE = "list_01.sps" /point.
WRITE OUTFILE = "list_01.sps" /value.
WRITE OUTFILE = "list_01.sps" /lab.
WRITE OUTFILE = "list_01.sps" /list1.
WRITE OUTFILE = "list_01.sps" /list2.
WRITE OUTFILE = "list_01.sps" /space.
WRITE OUTFILE = "list_01.sps" /separate.
WRITE OUTFILE = "list_01.sps" /space.
If there is only one variable with a given list (e.g. V1), it works OK. However, if more than one variable shares the same list, it reproduces the code as many times as there are variables (e.g. V2, V4 and V5).
What I get (e.g. V2, V4 and V5) after running the code above:
VARIABLE LABELS
V2 "Work"
.
VALUE LABELS
V2
1 " Yes "
2 " No "
************************************************.
VARIABLE LABELS
V4 "Married"
.
VALUE LABELS
V4
1 " Yes "
2 " No "
************************************************.
VARIABLE LABELS
V5 "Kids"
.
VALUE LABELS
V5
1 " Yes "
2 " No "
************************************************.
What I would like to have:
VARIABLE LABELS
V2 "Work"
V4 "Married"
V5 "Kids"
.
VALUE LABELS
V2 V4 V5
1 " Yes "
2 " No "
I think there are probably ways to automate the whole process better, including the use of your second data source. But for the scope of this question I will suggest a way to get what you asked for specifically.
The key is to build the command with special conditions for first and last lines:
string cmd1 cmd2 (a200).
sort cases by code.
match files /file=* /first=first /last=last /by code. /* marking first and last lines.
do if first.
compute cmd1="VARIABLE LABELS".
compute cmd2="VALUE LABELS".
end if.
if not first cmd1=concat(rtrim(cmd1), " /"). /* "/" only appears from the second varname.
compute cmd1=concat(rtrim(cmd1), " ", Var_label).
compute cmd2=concat(rtrim(cmd2), " ", Var).
do if last.
compute cmd1=concat(rtrim(cmd1), " .").
compute cmd2=concat(rtrim(cmd2), " ", ' 1 " Yes " 2 "No". ').
end if.
exe.
The commands are now ready, but we don't want to get them mixed up so we'll stack them one under the other, and only then write them out:
add files /file=* /rename cmd1=cmd /file=* /rename cmd2=cmd.
exe.
WRITE OUTFILE = "var definitions.sps" / cmd .
exe.
EDIT:
Note that the code above assumes you've already run a select cases if code = ... and that there is a single code in all the remaining lines.
Note also I added an exe. command at the end - without running that the new syntax will appear empty.

Snakemake rules

I want to use Snakemake to build a bioinformatics pipeline. I googled it and read the documentation and other material, but I still don't know how to get it to work.
Here are some of my raw data files.
Rawdata/010_0_bua_1.fq.gz, Rawdata/010_0_bua_2.fq.gz
Rawdata/11_15_ap_1.fq.gz, Rawdata/11_15_ap_2.fq.gz
... (they are all paired files).
Here is my align.snakemake
from os.path import join

STAR_INDEX = "/app/ref/ensembl/human/idx/"
SAMPLE_DIR = "Rawdata"
SAMPLES, = glob_wildcards(SAMPLE_DIR + "/{sample}_1.fq.gz")
R1 = '{sample}_1.fq.gz'
R2 = '{sample}_2.fq.gz'

rule alignment:
    input:
        r1 = join(SAMPLE_DIR, R1),
        r2 = join(SAMPLE_DIR, R2)
    params:
        STAR_INDEX = STAR_INDEX
    output:
        "Align/{sample}.bam"
    message:
        "--- mapping STAR---"
    shell:"""
    mkdir -p Align/{wildcards.sample}
    STAR --genomeDir {params.STAR_INDEX} --readFilesCommand zcat --readFilesIn {input.r1} {input.r2} --outFileNamePrefix Align/{wildcards.sample}/log
    """
This is it. I run this file with "snakemake -np -s align.snakemake" and get this error:
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
I am sorry to ask this question when many people are using it just fine. Any help would be really appreciated. Sorry for my English.
P.S. I read the official document and tutorial but still have no idea.
Oh, I figured it out. Here is the answer to my own question, in case someone else needs it.
from os.path import join

STAR_INDEX = "/app/ref/ensembl/human/idx/"
SAMPLE_DIR = "Rawdata"
SAMPLES, = glob_wildcards(SAMPLE_DIR + "/{sample}_1.fq.gz")
R1 = '{sample}_1.fq.gz'
R2 = '{sample}_2.fq.gz'

rule all:
    input:
        expand("Align/{sample}/Aligned.toTranscriptome.out.bam", sample=SAMPLES)

rule alignment:
    input:
        r1 = join(SAMPLE_DIR, R1),
        r2 = join(SAMPLE_DIR, R2)
    params:
        STAR_INDEX = STAR_INDEX
    output:
        "Align/{sample}/Aligned.toTranscriptome.out.bam"
    threads:
        8
    message:
        "--- Mapping STAR---"
    shell:"""
    mkdir -p Align/{wildcards.sample}
    STAR --genomeDir {params.STAR_INDEX} --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN {threads} --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD \#HD VN:1.4 SO:unsorted --readFilesCommand zcat --readFilesIn {input.r1} {input.r2} --outFileNamePrefix Align/{wildcards.sample}/log
    """

Makefile: $subst in dependency list

I have a Makefile which looks roughly like this:
FIGURES = A1_B1_C1.eps A2_B2_C2.eps A3_B3_C3.eps
NUMBERS = 1 2 3
all : $(FIGURES)
%.eps : $(foreach num, $(NUMBERS), $(subst B, $(num), %).out)
	# my_program($+, $@);
%.out :
The point is that the file names of my figures contain certain information (A, B, C) and that each figure is created by my_program from several (in the example 3) files.
While the filename of each figure has the format Ax_Bx_Cx.eps, the names of the data files to create the figures from look like this:
Ax_1x_Cx.out
Ax_2x_Cx.out
Ax_3x_Cx.out
So for each figure, I need a dynamically created dependency list with several file names. In other words, my desired output for the example above would be:
# my_program(A1_11_C1.out A1_21_C1.out A1_31_C1.out, A1_B1_C1.eps);
# my_program(A2_12_C2.out A2_22_C2.out A2_32_C2.out, A2_B2_C2.eps);
# my_program(A3_13_C3.out A3_23_C3.out A3_33_C3.out, A3_B3_C3.eps);
Unfortunately, the subst command seems to be ignored, for the output looks like this:
# my_program(A1_B1_C1.out A1_B1_C1.out A1_B1_C1.out, A1_B1_C1.eps);
# my_program(A2_B2_C2.out A2_B2_C2.out A2_B2_C2.out, A2_B2_C2.eps);
# my_program(A3_B3_C3.out A3_B3_C3.out A3_B3_C3.out, A3_B3_C3.eps);
I had a look at this possible duplicate but figured that the answer cannot help me, since I am using % and not $@, which should be OK in the prerequisites.
Clearly I am getting something wrong here. Any help is greatly appreciated.
The $(subst ...) is not really ignored: functions in a prerequisite list are expanded when the rule is read, while the pattern is still the literal %. Since % contains no B, every iteration just yields %.out, and the stem is later pasted in unchanged, which is why each prerequisite equals the target name with .out. To do fancy prerequisite manipulations you need at least make 3.82, which supports the Secondary Expansion feature:
FIGURES = A1_B1_C1.eps A2_B2_C2.eps A3_B3_C3.eps
NUMBERS = 1 2 3
all : $(FIGURES)
.SECONDEXPANSION:
$(FIGURES) : %.eps : $$(foreach num,$$(NUMBERS),$$(subst B,$$(num),$$*).out)
#echo "my_program($+, $#)"
%.out :
touch $#
Output:
$ make
touch A1_11_C1.out
touch A1_21_C1.out
touch A1_31_C1.out
my_program(A1_11_C1.out A1_21_C1.out A1_31_C1.out, A1_B1_C1.eps)
touch A2_12_C2.out
touch A2_22_C2.out
touch A2_32_C2.out
my_program(A2_12_C2.out A2_22_C2.out A2_32_C2.out, A2_B2_C2.eps)
touch A3_13_C3.out
touch A3_23_C3.out
touch A3_33_C3.out
my_program(A3_13_C3.out A3_23_C3.out A3_33_C3.out, A3_B3_C3.eps)
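Note the doubled dollar signs in the prerequisite list: with .SECONDEXPANSION the prerequisites are expanded a second time when the target is considered, so $$(foreach ...), $$(subst ...) and $$* are escaped to survive the first expansion and only take effect in the second, when $$* already holds the matched stem (e.g. A1_B1_C1).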

Running R Code from Command Line (Windows)

I have some R code inside a file called analyse.r. I would like to be able to, from the command line (CMD), run the code in that file without having to pass through the R terminal and I would also like to be able to pass parameters and use those parameters in my code, something like the following pseudocode:
C:\>(execute r script) analyse.r C:\file.txt
and this would execute the script and pass "C:\file.txt" as a parameter to the script and then it could use it to do some further processing on it.
How do I accomplish this?
You want Rscript.exe.
You can control the output from within the script -- see sink() and its documentation.
You can access command-arguments via commandArgs().
You can control command-line arguments more finely via the getopt and optparse packages.
If everything else fails, consider reading the manuals or contributed documentation
Identify where R is installed. On Windows 7 the path could be:
C:\Program Files\R\R-3.2.2\bin\x64
Then call the R code from that directory:
C:\Program Files\R\R-3.2.2\bin\x64>Rscript Rcode.r
There are two ways to run an R script from the command line (Windows or Linux shell).
1) R CMD way
R CMD BATCH followed by R script name. The output from this can also be piped to other files as needed.
This way, however, is a bit old, and using Rscript is becoming more popular.
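For example, assuming the script from the question is saved as analyse.r, a minimal sketch of the R CMD form, which writes the console output to analyse.Rout:

R CMD BATCH analyse.r analyse.Rout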
2) Rscript way
(This is supported on all platforms. The following example, however, was tested only on Linux.)
This example involves passing the path of a CSV file, the function name, and the attribute (row or column) index of the CSV file on which the function should work.
Contents of test.csv file
x1,x2
1,2
3,4
5,6
7,8
Compose an R file “a.R” whose contents are
#!/usr/bin/env Rscript
cols <- function(y){
  cat("This function will print sum of the column whose index is passed from commandline\n")
  cat("processing...column sums\n")
  su <- sum(data[,y])
  cat(su)
  cat("\n")
}

rows <- function(y){
  cat("This function will print sum of the row whose index is passed from commandline\n")
  cat("processing...row sums\n")
  su <- sum(data[y,])
  cat(su)
  cat("\n")
}

# calling a function based on its name from commandline ... y is the row or column index
FUN <- function(run_func, y){
  switch(run_func,
         rows = rows(as.numeric(y)),
         cols = cols(as.numeric(y)),
         stop("Enter something that switches me!")
  )
}
args <- commandArgs(TRUE)
cat("you passed the following at the command line\n")
cat(args);cat("\n")
filename<-args[1]
func_name<-args[2]
attr_index<-args[3]
data<-read.csv(filename,header=T)
cat("Matrix is:\n")
print(data)
cat("Dimensions of the matrix are\n")
cat(dim(data))
cat("\n")
FUN(func_name,attr_index)
Running the following on the Linux shell
Rscript a.R /home/impadmin/test.csv cols 1
gives
you passed the following at the command line
/home/impadmin/test.csv cols 1
Matrix is:
x1 x2
1 1 2
2 3 4
3 5 6
4 7 8
Dimensions of the matrix are
4 2
This function will print sum of the column whose index is passed from commandline
processing...column sums
16
Running the following on the Linux shell
Rscript a.R /home/impadmin/test.csv rows 2
gives
you passed the following at the command line
/home/impadmin/test.csv rows 2
Matrix is:
x1 x2
1 1 2
2 3 4
3 5 6
4 7 8
Dimensions of the matrix are
4 2
This function will print sum of the row whose index is passed from commandline
processing...row sums
7
We can also make the R script executable as follows (on Linux)
chmod a+x a.R
and run the second example again as
./a.R /home/impadmin/test.csv rows 2
This should also work in the Windows command prompt.
Save the following in a text file:
f1 <- function(x, y){
  print(x)
  print(y)
}
args = commandArgs(trailingOnly=TRUE)
f1(args[1], args[2])
Now run the following command in the Windows cmd:
Rscript.exe path_to_file "hello" "world"
This will print the following
[1] "hello"
[1] "world"
