RGI output filename does not match snakefile output filename - bioinformatics

The problem is:
A program called "RGI" automatically appends .txt as a suffix to the output file name. So if my sample ID is 7, the actual RGI output file will be 7.txt, which differs from the output file (7) defined in the Snakefile rule, and Snakemake reports an error like "Job Missing files after 20 seconds". RGI still appends .txt even if the name you pass already ends in that suffix (the actual output file then looks like 7.txt.txt).
How can I solve the problem?
The following is a part of my code:
rule rgi:
    output:
        cardTxt = "{sampleId}/annotation/rgi/{sampleId}"
    input:
        faa = rules.prokka.output.faa,
        cardDb = config['rgi']['cardDb']
    shell:
        """
        rgi load -i {input.cardDb}
        rgi main -i {input.faa} -t protein -o {output.cardTxt} --include_loose --clean
        """

Strip the .txt suffix from the output filename before passing it to rgi. I do this here using bash string manipulation, but you can do it in other ways:
rule rgi:
    input:
        faa = rules.prokka.output.faa,
        cardDb = config['rgi']['cardDb'],
    output:
        cardTxt = "{sampleId}/annotation/rgi/{sampleId}.txt",
    shell:
        """
        # Store the path in a shell variable first; ${{var%.txt}} needs a variable name
        out={output.cardTxt}
        card=${{out%.txt}}
        rgi load -i {input.cardDb}
        rgi main -i {input.faa} -t protein -o $card --include_loose --clean
        """
(I assume you want .txt to be part of the output filename. I.e. you are ok with 7.txt)
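Since a Snakefile is Python, the suffix can also be stripped on the Python side instead of in bash. The following is only a minimal sketch: strip_suffix is a hypothetical helper name, not something provided by Snakemake or RGI.

```python
def strip_suffix(path, suffix=".txt"):
    """Return path without a trailing suffix; leave it unchanged otherwise."""
    if suffix and path.endswith(suffix):
        return path[: -len(suffix)]
    return path


# The declared Snakemake output keeps the .txt that RGI appends ...
declared_output = "7/annotation/rgi/7.txt"
# ... while the value handed to `rgi main -o` must not contain it.
rgi_prefix = strip_suffix(declared_output)
print(rgi_prefix)
```

In the rule, this could be wired in with something like params: prefix = lambda wildcards, output: strip_suffix(output.cardTxt), and -o {params.prefix} in the shell command.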

Related

baseDir issue with nextflow

This might be a very basic question, but I have just started with Nextflow and I am struggling with the simplest example.
I will first explain what I have done and then the problem.
Aim: I want to build a workflow for my bioinformatics analyses like the one here (https://www.nextflow.io/example4.html)
Background: I have installed all the packages that are needed, and they all work from the console without any error.
My run: I have used the same script as in the example, only replacing the directory names. Here is how I have arranged the directories
location of script
~/raman/nflow/script.nf
location of Fastq files
~/raman/nflow/Data/T4_1.fq.gz
~/raman/nflow/Data/T4_2.fq.gz
Location of transcriptomic file
~/raman/nflow/Genome/trans.fa
The script
#!/usr/bin/env nextflow
/*
 * The following pipeline parameters specify the reference genomes
 * and read pairs and can be provided as command line options
 */
params.reads = "$baseDir/Data/T4_{1,2}.fq.gz"
params.transcriptome = "$baseDir/HumanGenome/SalmonIndex/gencode.v42.transcripts.fa"
params.outdir = "results"
workflow {
    read_pairs_ch = channel.fromFilePairs( params.reads, checkIfExists: true )
    INDEX(params.transcriptome)
    FASTQC(read_pairs_ch)
    QUANT(INDEX.out, read_pairs_ch)
}
process INDEX {
    tag "$transcriptome.simpleName"
    input:
    path transcriptome
    output:
    path 'index'
    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
process FASTQC {
    tag "FASTQC on $sample_id"
    publishDir params.outdir
    input:
    tuple val(sample_id), path(reads)
    output:
    path "fastqc_${sample_id}_logs"
    script:
    """
    fastqc "$sample_id" "$reads"
    """
}
process QUANT {
    tag "$pair_id"
    publishDir params.outdir
    input:
    path index
    tuple val(pair_id), path(reads)
    output:
    path pair_id
    script:
    """
    salmon quant --threads $task.cpus --libType=U -i $index -1 ${reads[0]} -2 ${reads[1]} -o $pair_id
    """
}
Output:
(base) ntr@ser:~/raman/nflow$ nextflow script.nf
N E X T F L O W ~ version 22.10.1
Launching `script.nf` [modest_meninsky] DSL2 - revision: 032a643b56
executor > local (2)
executor > local (2)
[- ] process > INDEX (gencode) -
[28/02cde5] process > FASTQC (FASTQC on T4) [100%] 1 of 1, failed: 1 ✘
[- ] process > QUANT -
Error executing process > 'FASTQC (FASTQC on T4)'
Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Command executed:
fastqc "T4" "T4_1.fq.gz T4_2.fq.gz"
Command exit status:
0
Command output:
(empty)
Command error:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
Work dir:
/home/ruby/raman/nflow/work/28/02cde5184f4accf9a05bc2ded29c50
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
I believe the issue is with my understanding of baseDir. I am assuming that baseDir is the directory that contains my script.nf file. I am not sure what is going wrong or how I can fix it.
Could anyone please help or guide me?
Thank you
Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Nextflow complains when it can't find the declared output files. This can occur even if the command completes successfully, i.e. with exit status 0. The problem here is that fastqc simply skips files that don't exist or can't be read (e.g. permissions problems), but it does produce these warnings:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
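The solution is to pass fastqc only files that exist. Here, "$sample_id" is not a file at all, and the fromFilePairs factory method emits a list of files as the second element of each tuple, so quoting "$reads" turns the space-separated pair of filenames into a single, non-existent filename. Pass the reads unquoted. The declared output directory must also be created and handed to fastqc with -o, otherwise Nextflow will still report it as missing:
script:
"""
mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs ${reads}
"""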

Use wildcard on params

I am trying to use a tool, and in its command I need a wildcard that is only available in the input function.
This is an example:
aDict = {"120": "121"}  # tumor : normal
rule all:
    input: expand("{case}.mutect2.vcf", case=aDict.keys())
def get_files_somatic(wildcards):
    case = wildcards.case
    control = aDict[wildcards.case]
    return [case + ".sorted.bam", control + ".sorted.bam"]
rule gatk_Mutect2:
    input:
        get_files_somatic,
    output:
        "{case}.mutect2.vcf"
    params:
        genome="ref/hg19.fa",
        target= "chr12",
        name_tumor='{case}'
    log:
        "logs/{case}.mutect2.log"
    threads: 8
    shell:
        " gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} -I {input[1]} -normal {wildcards.control}"
        " -L {params.target} -O {output}"
I have this error:
'Wildcards' object has no attribute 'control'
So I have a function that knows both case and control, but I'm not able to get the control into the shell command.
The wildcards are derived from the output file/pattern. That is why you only have the wildcard called case. You have to derive the control from that. Try replacing your shell statement with this:
    run:
        control = aDict[wildcards.case]
        shell(
            "gatk-launch Mutect2 -R {params.genome} -I {input[0]} "
            "-tumor {params.name_tumor} -I {input[1]} -normal {control} "
            "-L {input.target2} -O {output}"
        )
You could define control in params. Also, {input.target2} in the shell command would result in an error. Maybe it's supposed to be params.target?
rule gatk_Mutect2:
    input:
        get_files_somatic,
    output:
        "{case}.mutect2.vcf"
    params:
        genome="ref/hg19.fa",
        target= "chr12",
        name_tumor='{case}',
        control = lambda wildcards: aDict[wildcards.case]
    shell:
        """
        gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} \\
        -I {input[1]} -normal {params.control} -L {params.target} -O {output}
        """

How can I write multiple lines in expect program for the spawn command?

I have written this little script for getting multiple files from my remote server to my host computer:
#! /usr/bin/expect -f
spawn scp \
user@remote:/home/user/{A.txt,B.txt} \
/home/user_local/Documents
expect "password: "
send "somesecretpwd\r"
interact
This works fine, but when I put the files on separate lines like this:
user@remote:/home/user/{A.txt,\
B.txt} \
I get the following error(s):
scp: /home/user/{A.txt,: No such file or directory
scp: B.txt}: No such file or directory
I tried this:
user@remote:"/home/user/{A.txt,\
B.txt}" \
and got:
bash: -c: line 0: unexpected EOF while looking for matching `"'
bash: -c: line 1: syntax error: unexpected end of file
cp: cannot stat 'B.txt}"': No such file or directory
or this:
"user@remote:/home/user/{A.txt,\
B.txt}" \
which gives the same error as at the beginning.
How can I write the files on multiple lines so that the program still works correctly? I need this for better readability of the chosen files.
Edit:
Only changed the local user name to user_local
In Tcl (and therefore Expect), \<NEWLINE><SPACEs> is converted into a single <SPACE>, so you cannot split a string that contains no spaces across multiple lines.
% puts "abc\
def"
abc def
% puts {abc\
def}
abc def
%
Assuming the filenames are really longer (not much point otherwise) you could use a couple of variables like this:
#! /usr/bin/expect -f
set A A.txt
set B B.txt
spawn scp \
user@remote:/home/user/{$A,$B} \
/home/user/Documents
expect "password: "
send "somesecretpwd\r"
interact
For anyone who wants to solve a similar problem using only expect:
You can write a list of files and then join all of them into one string.
Here is the code:
#! /usr/bin/expect -f
# a list of files (newlines separate list elements, so no backslashes are needed)
set files {
A.txt
B.txt
C.txt
}
# will return the concatenated string with all files
# in this example it would be: A.txt,B.txt,C.txt
set concat [join $files ,]
# self made version of concat
# set concat [lindex $files 0] # get the first file
# set last_idx [expr {[llength $files]-1}] # calc the last index from the list
# set rest_files [lrange $files 1 $last_idx] # get other files
# foreach file $rest_files {
# set concat $concat,$file # append a comma and the next file to the concat variable
# }
# # puts "$concat" # only for testing the output
spawn scp \
user@remote:/home/doublepmcl/{$concat} \
/home/user_local/Documents
expect "password: "
send "somesecretpwd\r"
interact
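The same list-then-join idea, sketched in Python to show which string ends up being handed to scp. The function name scp_source and the host and directory values are illustrative assumptions, not part of the scripts above.

```python
def scp_source(user_host, directory, files):
    """Build a single scp source argument that uses brace expansion on the remote side."""
    joined = ",".join(files)  # same role as Tcl's: join $files ,
    return user_host + ":" + directory + "/{" + joined + "}"


# One file per line keeps the list readable, and the result still has no spaces.
files = [
    "A.txt",
    "B.txt",
    "C.txt",
]
source = scp_source("user@remote", "/home/user", files)
print(source)
```

The point is the same as in the Tcl version: readability comes from the list literal, while the argument itself is assembled without any whitespace.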

snakemake how to encode pair analysis

I want to run GATK recalibration on paired samples (tumor and normal). I need to parse the data using pandas. This is what I wrote:
expand("mapped_reads/merged_samples/{sample[1][tumor]}/{sample[1][tumor]}_{sample[1][normal]}.bam", sample=read_table(config["conditions"], ",").iterrows())
This is the conditions file:
432,433
434,435
I wrote this rule:
rule gatk_RealignerTargetCreator:
    input:
        "mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
        "mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
    output:
        "mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
    params:
        genome=config['reference']['genome_fasta'],
        mills= config['mills'],
        ph1_indels= config['know_phy'],
    log:
        "mapped_reads/merged_samples/logs/{tumor}_{normal}.realign_info.log"
    threads: 8
    shell:
        "gatk -T RealignerTargetCreator -R {params.genome} {params.custom} "
        "-nt {threads} "
        "-I {wildcard.tumor} -I {wildcard.normal} -known {params.ph1_indels} "
        "-o {output} >& {log}"
I have this error:
InputFunctionException in line 17 of /home/maurizio/Desktop/TEST_exome/rules/samfiles.rules:
KeyError: '432/432_433'
Wildcards:
sample=432/432_433
this is the samfiles.rules:
rule samtools_merge_bam:
    """
    Merge bam files for multiple units into one for the given sample.
    If the sample has only one unit, files will be copied.
    """
    input:
        lambda wildcards: expand("mapped_reads/bam/{unit}_sorted.bam",unit=config["samples"][wildcards.sample])
    output:
        "mapped_reads/merged_samples/{sample}.bam"
    benchmark:
        "benchmarks/samtools/merge/{sample}.txt"
    run:
        if len(input) > 1:
            shell("/illumina/software/PROG2/samtools-1.3.1/samtools merge {output} {input}")
        else:
            shell("cp {input} {output} && touch -h {output}")
I can only guess because you don't show all the relevant rules, but I would say the error occurs because the rule samtools_merge_bam also applies to some later bam file where you have the pattern {tumor}/{tumor}_{normal}...
As a solution, you have to resolve this ambiguity (see the snakemake tutorial). For example, you can constrain the wildcard of samtools_merge_bam to not contain any slashes.
wildcard_constraints:
    sample="[^/]+"
You can put the constraint either globally or inside your samtools_merge_bam rule.
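The ambiguity can be reproduced with plain regular expressions. The sketch below assumes that an unconstrained Snakemake wildcard behaves like the regex .+ and that the constrained one behaves like [^/]+; the patterns are hand-written approximations of the samtools_merge_bam output, not what Snakemake generates internally.

```python
import re

# The file requested via expand() for the pair 432,433.
path = "mapped_reads/merged_samples/432/432_433.bam"

# Assumption: an unconstrained wildcard matches like ".+" (slashes included).
unconstrained = re.compile(r"mapped_reads/merged_samples/(?P<sample>.+)\.bam")
# With the constraint, the wildcard may not contain a slash.
constrained = re.compile(r"mapped_reads/merged_samples/(?P<sample>[^/]+)\.bam")

loose = unconstrained.fullmatch(path)
strict = constrained.fullmatch(path)

# The unconstrained pattern swallows the subdirectory into {sample}.
print(loose.group("sample"))
# The constrained pattern does not match this path at all.
print(strict)
```

The value captured by the unconstrained pattern is exactly the key reported in the KeyError, which is why the lookup in config["samples"] fails.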

Using R within a bash terminal command

I have a set of *.txt files in a specific directory. I have written an R script called SampleStatus.r which contains a single function that reads and processes the data and writes the results to an output file.
The function is like:
format_windpro(import_file="in.txt", export_file="out.txt")
I would like to use bash commands to read and compute every file in one command using my R file.
Use Rscript. Example code:
for f in "${INPUT_DIR}"/*.txt; do
    base=$(basename "$f")
    Rscript SampleStatus.r "$f" "${OUTPUT_DIR}/$base"
done
In your SampleStatus.r, handle the command line arguments like this:
#!/usr/bin/env Rscript
# ...
argv <- commandArgs(trailingOnly = TRUE)
# error checking...
import_file <- argv[1]
export_file <- argv[2]
# your function call
format_windpro(import_file, export_file)
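If the loop should be driven from Python instead of bash, the per-file commands can be built as follows. This is a sketch only: rscript_commands is a hypothetical helper, the in/ and out/ paths are made up, and nothing is executed here.

```python
from pathlib import Path


def rscript_commands(input_files, output_dir, script="SampleStatus.r"):
    """Return one Rscript argument list per input file, keeping the file name."""
    commands = []
    for f in input_files:
        src = Path(f)
        dst = Path(output_dir) / src.name  # same role as $(basename "$f")
        commands.append(["Rscript", script, src.as_posix(), dst.as_posix()])
    return commands


# Illustrative input files; in practice they could come from Path(...).glob("*.txt").
cmds = rscript_commands(["in/a.txt", "in/b.txt"], "out")
for cmd in cmds:
    print(cmd)
```

Each list can then be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with unusual file names.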
