baseDir issue with nextflow - bioinformatics

This might be a very basic question for you guys; however, I have just started with nextflow and I am struggling with the simplest example.
I will first explain what I have done and then the problem.
Aim: I aim to make a workflow for my bioinformatics analyses like the one here (https://www.nextflow.io/example4.html)
Background: I have installed all the packages that were needed and they all work from the console without any error.
My run: I have used the same script as in example only by replacing the directory names. Here is how I have arranged the directories
location of script
~/raman/nflow/script.nf
location of Fastq files
~/raman/nflow/Data/T4_1.fq.gz
~/raman/nflow/Data/T4_2.fq.gz
Location of transcriptomic file
~/raman/nflow/Genome/trans.fa
The script
#!/usr/bin/env nextflow
/*
* The following pipeline parameters specify the reference genomes
* and read pairs and can be provided as command line options
*/
params.reads = "$baseDir/Data/T4_{1,2}.fq.gz"
params.transcriptome = "$baseDir/HumanGenome/SalmonIndex/gencode.v42.transcripts.fa"
params.outdir = "results"
workflow {
read_pairs_ch = channel.fromFilePairs( params.reads, checkIfExists: true )
INDEX(params.transcriptome)
FASTQC(read_pairs_ch)
QUANT(INDEX.out, read_pairs_ch)
}
process INDEX {
tag "$transcriptome.simpleName"
input:
path transcriptome
output:
path 'index'
script:
"""
salmon index --threads $task.cpus -t $transcriptome -i index
"""
}
process FASTQC {
tag "FASTQC on $sample_id"
publishDir params.outdir
input:
tuple val(sample_id), path(reads)
output:
path "fastqc_${sample_id}_logs"
script:
"""
fastqc "$sample_id" "$reads"
"""
}
process QUANT {
tag "$pair_id"
publishDir params.outdir
input:
path index
tuple val(pair_id), path(reads)
output:
path pair_id
script:
"""
salmon quant --threads $task.cpus --libType=U -i $index -1 ${reads[0]} -2 ${reads[1]} -o $pair_id
"""
}
Output:
(base) ntr@ser:~/raman/nflow$ nextflow script.nf
N E X T F L O W ~ version 22.10.1
Launching `script.nf` [modest_meninsky] DSL2 - revision: 032a643b56
executor > local (2)
executor > local (2)
[- ] process > INDEX (gencode) -
[28/02cde5] process > FASTQC (FASTQC on T4) [100%] 1 of 1, failed: 1 ✘
[- ] process > QUANT -
Error executing process > 'FASTQC (FASTQC on T4)'
Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Command executed:
fastqc "T4" "T4_1.fq.gz T4_2.fq.gz"
Command exit status:
0
Command output:
(empty)
Command error:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
Work dir:
/home/ruby/raman/nflow/work/28/02cde5184f4accf9a05bc2ded29c50
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
I believe I have an issue with my understanding of baseDir. I am assuming that baseDir is the directory where my file script.nf is located. I am not sure what is going wrong or how I can fix it.
Could anyone please help or guide me?
Thank you

Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Nextflow complains when it can't find the declared output files. This can occur even if the command completes successfully, i.e. with exit status 0. The problem here is that fastqc simply skips files that don't exist or can't be read (e.g. permissions problems), but it does produce these warnings:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
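The solution is to make sure every argument passed to fastqc is a file that exists. Note that the fromFilePairs factory method emits a list of files as the second element of each tuple, so wrapping it in double quotes turns the space-separated pair into a single, non-existent filename; passing "$sample_id" as an argument is the same problem, since T4 is not a file. Also keep in mind that the process declares fastqc_${sample_id}_logs as its output, so the script still has to create that directory and direct the reports into it (for example with fastqc's -o option). For the input arguments, all you need is: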
script:
"""
fastqc ${reads}
"""
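To make the quoting problem concrete, here is a small Python sketch (an illustration only, not part of Nextflow or fastqc). It uses shlex.split to mimic how a POSIX shell would split the two rendered command lines into arguments; the file names are the ones from the question.

```python
import shlex

# Command as rendered from the original script block: fastqc "$sample_id" "$reads"
quoted = 'fastqc "T4" "T4_1.fq.gz T4_2.fq.gz"'
# Command as rendered from the suggested script block: fastqc ${reads}
unquoted = 'fastqc T4_1.fq.gz T4_2.fq.gz'

# Drop the program name and keep only the arguments fastqc would receive.
quoted_args = shlex.split(quoted)[1:]
unquoted_args = shlex.split(unquoted)[1:]

print(quoted_args)    # the read pair arrives as ONE argument containing a space
print(unquoted_args)  # each read file is its own argument
```

In the quoted form fastqc is asked to open a file literally named "T4_1.fq.gz T4_2.fq.gz", which matches the "Skipping ..." warnings in the error output.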

Related

Syntax conflict for "{" using Nextflow

I am new to Nextflow. I attempted to run a loop in a Nextflow script block to remove the extension from sequence file names, and I am running into a syntax error.
params.rename = "sequences/*.fastq.gz"
workflow {
rename_ch = Channel.fromPath(params.rename)
RENAME(rename_ch)
RENAME.out.view()
}
process RENAME {
input:
path read
output:
stdout
script:
"""
for file in $baseDir/sequences/*.fastq.gz;
do
mv -- '$file' '${file%%.fastq.gz}'
done
"""
}
Error:
- cause: Unexpected input: '{' @ line 25, column 16.
process RENAME {
^
Tried to use other methods such as basename, but to no avail.
Inside a script block, you just need to escape the Bash dollar-variables and use double quotes so that they can expand. For example:
params.rename = "sequences/*.fastq.gz"
workflow {
    RENAME()
}
process RENAME {
    debug true
    """
    for fastq in ${baseDir}/sequences/*.fastq.gz;
    do
        echo mv -- "\$fastq" "\${fastq%%.fastq.gz}"
    done
    """
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [crazy_brown] DSL2 - revision: 71ada7b0d5
executor > local (1)
[71/4321e6] process > RENAME [100%] 1 of 1 ✔
mv -- /path/to/sequences/A.fastq.gz /path/to/sequences/A
mv -- /path/to/sequences/B.fastq.gz /path/to/sequences/B
mv -- /path/to/sequences/C.fastq.gz /path/to/sequences/C
Also, if you find escaping the Bash variables tedious, you may want to consider using a shell block instead.
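The underlying idea is that the script text passes through two layers that both treat `$` specially: first the workflow language's string interpolation, then Bash. As a loose analogy only (Nextflow uses Groovy interpolation with `\$` as the escape, whereas Python's string.Template uses `$$`), the following sketch shows the same failure and the same fix. The baseDir value is a made-up placeholder.

```python
from string import Template

values = {"baseDir": "/path/to"}  # hypothetical value, for illustration only

# Unescaped: the templating layer tries to interpret ${file%%.fastq.gz} itself
# and rejects it, much like the "Unexpected input" error in the question.
unescaped = "mv -- '${file%%.fastq.gz}'"
try:
    Template(unescaped).substitute(values)
    error = None
except ValueError as exc:
    error = exc

# Escaped: $$ tells the templating layer to emit a literal $, so the
# Bash expansions reach the shell untouched while ${baseDir} is filled in.
escaped = 'for fastq in ${baseDir}/sequences/*.fastq.gz; do echo mv -- "$$fastq" "$${fastq%%.fastq.gz}"; done'
rendered = Template(escaped).substitute(values)

print(error)
print(rendered)
```

The rendered string is what the shell would eventually see: the outer layer has consumed its own variable and left the Bash ones intact.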

How to return output of shell script into Jenkinsfile [duplicate]

I have something like this on a Jenkinsfile (Groovy) and I want to record the stdout and the exit code in a variable in order to use the information later.
sh "ls -l"
How can I do this, especially as it seems that you cannot really run any kind of groovy code inside the Jenkinsfile?
The latest version of the pipeline sh step allows you to do the following:
// Git committer email
GIT_COMMIT_EMAIL = sh (
script: 'git --no-pager show -s --format=\'%ae\'',
returnStdout: true
).trim()
echo "Git committer email: ${GIT_COMMIT_EMAIL}"
Another feature is the returnStatus option.
// Test commit message for flags
BUILD_FULL = sh (
script: "git log -1 --pretty=%B | grep '\\[jenkins-full]'",
returnStatus: true
) == 0
echo "Build full flag: ${BUILD_FULL}"
These options were added based on this issue.
See official documentation for the sh command.
For declarative pipelines (see comments), you need to wrap code into script step:
script {
GIT_COMMIT_EMAIL = sh (
script: 'git --no-pager show -s --format=\'%ae\'',
returnStdout: true
).trim()
echo "Git committer email: ${GIT_COMMIT_EMAIL}"
}
Current Pipeline version natively supports returnStdout and returnStatus, which make it possible to get output or status from sh/bat steps.
An example:
def ret = sh(script: 'uname', returnStdout: true)
println ret
See the official documentation.
The quick answer is this:
sh "ls -l > commandResult"
result = readFile('commandResult').trim()
I think there exists a feature request to be able to get the result of the sh step, but as far as I know, currently there is no other option.
EDIT: JENKINS-26133
EDIT2: Not quite sure since what version, but sh/bat steps now can return the std output, simply:
def output = sh returnStdout: true, script: 'ls -l'
If you want to get the stdout AND know whether the command succeeded or not, just use returnStdout and wrap it in an exception handler:
scripted pipeline
try {
// Fails with non-zero exit if dir1 does not exist
def dir1 = sh(script:'ls -la dir1', returnStdout:true).trim()
} catch (Exception ex) {
println("Unable to read dir1: ${ex}")
}
output:
[Pipeline] sh
[Test-Pipeline] Running shell script
+ ls -la dir1
ls: cannot access dir1: No such file or directory
[Pipeline] echo
Unable to read dir1: hudson.AbortException: script returned exit code 2
Unfortunately hudson.AbortException is missing any useful method to obtain that exit status, so if the actual value is required you'd need to parse it out of the message (ugh!)
Contrary to the Javadoc https://javadoc.jenkins-ci.org/hudson/AbortException.html the build is not failed when this exception is caught. It fails when it's not caught!
Update:
If you also want the STDERR output from the shell command, Jenkins unfortunately fails to properly support that common use-case. A 2017 ticket JENKINS-44930 is stuck in a state of opinionated ping-pong whilst making no progress towards a solution - please consider adding your upvote to it.
As to a solution now, there could be a couple of possible approaches:
a) Redirect STDERR to STDOUT 2>&1
- but it's then up to you to parse that out of the main output though, and you won't get the output if the command failed - because you're in the exception handler.
b) redirect STDERR to a temporary file (the name of which you prepare earlier) 2>filename (but remember to clean up the file afterwards) - ie. main code becomes:
def stderrfile = 'stderr.out'
try {
def dir1 = sh(script:"ls -la dir1 2>${stderrfile}", returnStdout:true).trim()
} catch (Exception ex) {
def errmsg = readFile(stderrfile)
println("Unable to read dir1: ${ex} - ${errmsg}")
}
c) Go the other way, set returnStatus=true instead, dispense with the exception handler and always capture output to a file, ie:
def outfile = 'stdout.out'
def status = sh(script:"ls -la dir1 >${outfile} 2>&1", returnStatus:true)
def output = readFile(outfile).trim()
if (status == 0) {
// output is directory listing from stdout
} else {
// output is error message from stderr
}
Caveat: the above code is Unix/Linux-specific - Windows requires completely different shell commands.
This is a sample case, which I believe will make sense:
node('master'){
stage('stage1'){
def commit = sh (returnStdout: true, script: '''echo hi
echo bye | grep -o "e"
date
echo lol''').split()
echo "${commit[-1]} "
}
}
For those who need to use the output in subsequent shell commands, rather than groovy, something like this example could be done:
stage('Show Files') {
environment {
MY_FILES = sh(script: 'cd mydir && ls -l', returnStdout: true)
}
steps {
sh '''
echo "$MY_FILES"
'''
}
}
I found the examples on code maven to be quite useful.
All the above methods will work, but to use the variable as an environment variable inside your code you need to export it first.
script{
sh " 'shell command here' > command"
command_var = readFile('command').trim()
sh "export command_var=$command_var"
}
Replace the shell command with the command of your choice. Now, if you are using Python code, you can call os.getenv("command_var"), which will return the output of the shell command executed previously.
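One caveat is worth checking before relying on this: a variable exported in one shell process is not visible to a later, separate shell process. As far as I know each sh step starts its own shell, in which case the export above would not carry over, and setting the value through the pipeline's environment (for example with the withEnv step) would be the safer route. The sketch below demonstrates only the general shell behaviour, using Python's subprocess on a POSIX system; it does not run Jenkins.

```python
import os
import subprocess

env = dict(os.environ)
env.pop("command_var", None)  # make sure the variable is not already set

# First shell: export the variable, then the shell exits.
subprocess.run("export command_var=hello", shell=True, env=env, check=True)

# Second, separate shell: the export from the first shell is gone.
second = subprocess.run('echo "[$command_var]"', shell=True, env=env,
                        capture_output=True, text=True, check=True)
separate_shells = second.stdout.strip()

# Passing the value through the child's environment does work.
env_with_var = dict(env, command_var="hello")
third = subprocess.run('echo "[$command_var]"', shell=True, env=env_with_var,
                       capture_output=True, text=True, check=True)
via_environment = third.stdout.strip()

print(separate_shells, via_environment)
```

The same reasoning applies to a Python program started from a later step: os.getenv("command_var") only sees the value if it is present in the environment that step was launched with.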
How to read a shell variable in Groovy, or how to assign a shell return value to a Groovy variable.
Requirement: open a text file, read the lines using shell, store the values in Groovy and get the parameters for each line.
Here "," is the delimiter.
Ex: releaseModule.txt
./APP_TSBASE/app/team/i-home/deployments/ip-cc.war/cs_workflowReport.jar,configurable-wf-report,94,23crb1,artifact
./APP_TSBASE/app/team/i-home/deployments/ip.war/cs_workflowReport.jar,configurable-temppweb-report,394,rvu3crb1,artifact
========================
Here I want to get the module name (2nd field, configurable-wf-report), the build number (3rd field, 94) and the commit id (4th field, 23crb1).
def module = sh(script: """awk -F',' '{ print \$2 "," \$3 "," \$4 }' releaseModules.txt | sort -u """, returnStdout: true).trim()
echo module
List lines = module.split( '\n' ).findAll { !it.startsWith( ',' ) }
def buildid
def Modname
lines.each {
List det1 = it.split(',')
buildid=det1[1].trim()
Modname = det1[0].trim()
tag= det1[2].trim()
echo Modname
echo buildid
echo tag
}
If you don't have a single sh command but a block of sh commands, returnStdout may not give you what you need.
I had a similar issue and applied a workaround which is not a clean way of doing this, but it eventually worked and served the purpose.
Solution -
In the shell block, echo the value and write it to a file.
Outside the shell block and inside the script block, read this file, trim it and assign it to any local/params/environment variable.
example -
steps {
script {
// I am using '>' because I want to create a new file every time to get the newest value of PATH
sh '''
echo $PATH>path.txt
'''
path = readFile(file: 'path.txt')
path = path.trim() //local groovy variable assignment
//One can assign these values to env and params as below -
env.PATH = path //if you want to assign it to env var
params.PATH = path //if you want to assign it to params var
}
}
The easiest way is this:
my_var=`echo 2`
echo $my_var
output
: 2
Note that this is not a single quote but a backquote ( ` ).

Nextflow: Missing output file(s) expected by process

I'm currently making a start on using Nextflow to develop a bioinformatics pipeline. Below, I've created a params.files variable which contains my FASTQ files, and then input this into fasta_files channel.
The process trimming and its scripts takes this channel as the input, and then ideally, I would output all the $sample".trimmed.fq.gz into the output channel, trimmed_channel. However, when I run this script, I get the following error:
Missing output file(s) `trimmed_files` expected by process `trimming` (1)
The nextflow script I'm trying to run is:
#! /usr/bin/env nextflow
params.files = files("$baseDir/FASTQ/*.fastq.gz")
println "fastq files for trimming:$params.files"
fasta_files = Channel.fromPath(params.files)
println "files in the fasta channel: $fasta_files"
process trimming {
input:
file fasta_file from fasta_files
output:
path trimmed_files into trimmed_channel
// the shell script to be run:
"""
#!/usr/bin/env bash
mkdir trimming_report
cd /home/usr/Nextflow
#Finding and renaming my FASTQ files
for file in FASTQ/*.fastq.gz; do
[ -f "\$file" ] || continue
name=\$(echo "\$file" | awk -F'[/]' '{ print \$2 }') #renaming fastq files.
sample=\$(echo "\$name" | awk -F'[.]' '{ print \$1 }') #renaming fastq files.
echo "Found" "\$name" "from:" "\$sample"
if [ ! -e FASTQ/"\$sample"_trimmed.fq.gz ]; then
trim_galore -j 8 "\$file" -o FASTQ #trim the files
mv "\$file"_trimming_report.txt trimming_report #moves to the directory trimming report
else
echo ""\$sample".trimmed.fq.gz exists skipping trim galore"
fi
done
trimmed_files="FASTQ/*_trimmed.fq.gz"
echo \$trimmed_files
"""
}
The script in the process works fine. However, I'm wondering if I'm misunderstanding or missing something obvious. If I've forgotten to include something, please let me know. Any help is appreciated!
Nextflow does not export the variable trimmed_files to its own scope unless you tell it to do so using the env output qualifier, however doing it that way would not be very idiomatic.
Since you know the pattern of your output files ("FASTQ/*_trimmed.fq.gz"), simply pass that pattern as output:
path "FASTQ/*_trimmed.fq.gz" into trimmed_channel
Some things you do, but probably want to avoid:
Changing directory inside your NF process. Don't do this; it entirely breaks the whole concept of Nextflow's work folder setup.
Writing a bash loop inside an NF process. If you set up your channels correctly, each spawned task should only have to handle one input.
Pallie has already provided some sound advice and, of course, the right answer, which is: environment variables must be declared using the env qualifier.
However, given your script definition, I think there might be some misunderstanding about how best to skip the execution of previously generated results. The cache directive is enabled by default and when the pipeline is launched with the -resume option, additional attempts to execute a process using the same set of inputs, will cause the process execution to be skipped and will produce the stored data as the actual results.
This example uses the Nextflow DSL 2 for my convenience, but is not strictly required:
nextflow.enable.dsl=2
params.fastq_files = "${baseDir}/FASTQ/*.fastq.gz"
params.publish_dir = "./results"
process trim_galore {
tag { "${sample}:${fastq_file}" }
publishDir "${params.publish_dir}/TrimGalore", saveAs: { fn ->
fn.endsWith('.txt') ? "trimming_reports/${fn}" : fn
}
cpus 8
input:
tuple val(sample), path(fastq_file)
output:
tuple val(sample), path('*_trimmed.fq.gz'), emit: trimmed_fastq_files
path "${fastq_file}_trimming_report.txt", emit: trimming_report
"""
trim_galore \\
-j ${task.cpus} \\
"${fastq_file}"
"""
}
workflow {
Channel.fromPath( params.fastq_files )
| map { tuple( it.getSimpleName(), it ) }
| set { sample_fastq_files }
results = trim_galore( sample_fastq_files )
results.trimmed_fastq_files.view()
}
Run using:
nextflow run script.nf \
-ansi-log false \
--fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'
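For readers unfamiliar with the map step in the workflow block above, this Python sketch mimics what it produces: one (sample, file) tuple per FASTQ file, so that each task handles exactly one file. The simple_name helper is a stand-in for my understanding of getSimpleName (the file name up to the first dot), and the file list is hypothetical.

```python
from pathlib import PurePosixPath

def simple_name(path: str) -> str:
    """Return the file name up to the first dot, e.g. 'A.fastq.gz' -> 'A'."""
    return PurePosixPath(path).name.split(".", 1)[0]

# Hypothetical files, standing in for whatever the glob pattern matches.
fastq_files = [
    "/home/usr/Nextflow/FASTQ/A.fastq.gz",
    "/home/usr/Nextflow/FASTQ/B.fastq.gz",
]

# One (sample, file) tuple per input file, like the channel fed to trim_galore.
sample_fastq_files = [(simple_name(f), f) for f in fastq_files]

for sample, fastq_file in sample_fastq_files:
    print(sample, fastq_file)
```

Because every tuple becomes its own task, no loop is needed inside the process script, and -resume can skip the samples that were already trimmed.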

Bash array in Declarative Jenkinsfile

How do I use shell arrays in a Jenkinsfile?
My Jenkins job has a String parameter PROJECTS that is a comma-separated list of projects to build. I have a Build step in which I run some shell script to split that parameter into an array, and then pass that array to a build script:
...
stage("Build") {
steps {
sh"""
projects_list=(${env.PROJECTS//,/ })
./build_script ${projects_list[@]}
"""
}
}
...
however, the Jenkins build keeps failing due to this:
WorkflowScript: 132: unexpected token: @ @ line 132, column 104.
build_script ${projects_list[@]}
^
1 error
Please see the below code which gives desired result:
Please note: I am using the bat command and calling shell scripts via Cygwin, as I am using a Windows machine.
...
def PROJECTS = "ABC,XYZ"
stage("Build") {
steps {
bat 'cygwin.bat -c \"projects_list=(${PROJECTS//,/ }); ./buildscript.sh ${projects_list[@]} \"'
}
}
...
cygwin.bat
IF [%1] == [-c] (
C:\Cygwin\bin\bash.exe -l -i %*
) ELSE (
start C:\Cygwin\bin\mintty.exe --exec C:\Cygwin\bin\bash.exe -l -i
)
With sh: the syntax would be the same; just use sh rather than bat and call the command without cygwin.bat -c

snakemake how to encode pair analysis

I want to use gatk recalibration on paired samples (tumor and normal). I need to parse the data using pandas. This is what I wrote:
expand("mapped_reads/merged_samples/{sample[1][tumor]}/{sample[1][tumor]}_{sample[1][normal]}.bam", sample=read_table(config["conditions"], ",").iterrows())
this is the condition file:
432,433
434,435
I wrote this rule:
rule gatk_RealignerTargetCreator:
    input:
        "mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
        "mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
    output:
        "mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
    params:
        genome=config['reference']['genome_fasta'],
        mills= config['mills'],
        ph1_indels= config['know_phy'],
    log:
        "mapped_reads/merged_samples/logs/{tumor}_{normal}.realign_info.log"
    threads: 8
    shell:
        "gatk -T RealignerTargetCreator -R {params.genome} {params.custom} "
        "-nt {threads} "
        "-I {wildcard.tumor} -I {wildcard.normal} -known {params.ph1_indels} "
        "-o {output} >& {log}"
I have this error:
InputFunctionException in line 17 of /home/maurizio/Desktop/TEST_exome/rules/samfiles.rules:
KeyError: '432/432_433'
Wildcards:
sample=432/432_433
this is the samfiles.rules:
rule samtools_merge_bam:
    """
    Merge bam files for multiple units into one for the given sample.
    If the sample has only one unit, files will be copied.
    """
    input:
        lambda wildcards: expand("mapped_reads/bam/{unit}_sorted.bam",unit=config["samples"][wildcards.sample])
    output:
        "mapped_reads/merged_samples/{sample}.bam"
    benchmark:
        "benchmarks/samtools/merge/{sample}.txt"
    run:
        if len(input) > 1:
            shell("/illumina/software/PROG2/samtools-1.3.1/samtools merge {output} {input}")
        else:
            shell("cp {input} {output} && touch -h {output}")
I can only guess because you don't show all the relevant rules, but I would say the error occurs because the rule samtools_merge_bam also applies to some later bam file where you have the pattern {tumor}/{tumor}_{normal}...
As a solution, you have to resolve this ambiguity (see the snakemake tutorial). For example, you can constrain the wildcard of samtools_merge_bam to not contain any slashes.
wildcard_constraints:
    sample="[^/]+"
You can put the constraint either globally or inside your samtools_merge_bam rule.
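To illustrate the ambiguity, here is a small Python sketch that treats the output pattern of samtools_merge_bam as a regular expression, assuming (as I understand Snakemake's default) that an unconstrained wildcard matches ".+". The output_regex helper is hypothetical and only mirrors that one output pattern; the target path is the one implied by the KeyError in the question.

```python
import re

def output_regex(sample_pattern: str) -> re.Pattern:
    """Regex for the output 'mapped_reads/merged_samples/{sample}.bam'."""
    return re.compile(
        r"mapped_reads/merged_samples/(?P<sample>" + sample_pattern + r")\.bam"
    )

# The paired file requested by the expand() call in the question.
target = "mapped_reads/merged_samples/432/432_433.bam"

# Default wildcard: anything goes, including slashes, so the merge rule matches.
unconstrained = output_regex(".+").fullmatch(target)

# Constrained wildcard: no slashes allowed, so the merge rule no longer matches.
constrained = output_regex("[^/]+").fullmatch(target)

# A plain per-sample file is still matched under the constraint.
plain_sample = output_regex("[^/]+").fullmatch("mapped_reads/merged_samples/432.bam")

print(unconstrained.group("sample"), constrained, plain_sample.group("sample"))
```

The unconstrained match yields sample = "432/432_433", which is exactly the key that config["samples"] fails to find.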
