Snakemake cluster config in combination with DRMAA

I have a question related to DRMAA and the cluster config file in Snakemake.
Currently I have a pipeline and I submit jobs to the cluster using DRMAA with the following command:
snakemake --drmaa " -q short.q -pe smp 8 -l membycore=4G" --jobs 100 -p file1/out file2/out file3/out
The problem is that some of the rules/jobs require fewer or more resources. I thought that if I used the JSON cluster file I would be able to submit the jobs with different resources. My JSON file looks like this:
{
    "__default__":
    {
        "-q":"short.q",
        "-pe":"smp 1",
        "-l":"membycore=4G"
    },
    "job1":
    {
        "-q":"short.q",
        "-pe":"smp 8",
        "-l":"membycore=4G"
    },
    "job2":
    {
        "-q":"short.q",
        "-pe":"smp 8",
        "-l":"membycore=4G"
    }
}
When I run the following command my jobs (job1 and job2) are submitted with default options and not with the custom ones:
snakemake --jobs 100 --cluster-config cluster.json --drmaa -p file1/out file2/out file3/out
What am I doing wrong? Is it that I cannot combine the drmaa option with the cluster-config file?

The cluster config file simply allows you to define variables that are later used in --cluster/--cluster-sync/--drmaa, depending on the placeholders you define there. There's no DRMAA-specific magic involved here. Have a look at the corresponding section in the documentation again.
Maybe an example makes things clearer:
Cluster config:
{
    "__default__":
    {
        "time" : "02:00:00",
        "mem" : "1G"
    },
    # more rule specific definitions here...
}
Example snakemake arguments to make use of the above:
--drmaa " -pe OpenMP {threads} -l mem_free={cluster.mem} -l h_rt={cluster.time}"
or
--cluster-sync "qsub -sync y -pe OpenMP {threads} -l mem_free={cluster.mem} -l h_rt={cluster.time}"
cluster.time and cluster.mem will be replaced accordingly per rule.
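Applied to the cluster.json from the question, a sketch of what this could look like (the key names below are only illustrative; a key such as "-pe" cannot be referenced as {cluster.-pe}, so use placeholder-friendly names, and rule-specific entries only need the keys that differ from __default__):
{
    "__default__":
    {
        "queue": "short.q",
        "pe": "smp 1",
        "mem": "membycore=4G"
    },
    "job1":
    {
        "pe": "smp 8"
    },
    "job2":
    {
        "pe": "smp 8"
    }
}
snakemake --jobs 100 --cluster-config cluster.json --drmaa " -q {cluster.queue} -pe {cluster.pe} -l {cluster.mem}" -p file1/out file2/out file3/out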
Andreas

Related

Nextflow: Missing output file(s) expected by process

I'm currently making a start on using Nextflow to develop a bioinformatics pipeline. Below, I've created a params.files variable which contains my FASTQ files, and then fed this into the fasta_files channel.
The process trimming and its script take this channel as the input, and then ideally I would output all the $sample".trimmed.fq.gz files into the output channel, trimmed_channel. However, when I run this script, I get the following error:
Missing output file(s) `trimmed_files` expected by process `trimming` (1)
The nextflow script I'm trying to run is:
#! /usr/bin/env nextflow

params.files = files("$baseDir/FASTQ/*.fastq.gz")
println "fastq files for trimming:$params.files"
fasta_files = Channel.fromPath(params.files)
println "files in the fasta channel: $fasta_files"

process trimming {
    input:
    file fasta_file from fasta_files

    output:
    path trimmed_files into trimmed_channel

    // the shell script to be run:
    """
    #!/usr/bin/env bash
    mkdir trimming_report
    cd /home/usr/Nextflow

    #Finding and renaming my FASTQ files
    for file in FASTQ/*.fastq.gz; do
        [ -f "\$file" ] || continue
        name=\$(echo "\$file" | awk -F'[/]' '{ print \$2 }') #renaming fastq files.
        sample=\$(echo "\$name" | awk -F'[.]' '{ print \$1 }') #renaming fastq files.
        echo "Found" "\$name" "from:" "\$sample"
        if [ ! -e FASTQ/"\$sample"_trimmed.fq.gz ]; then
            trim_galore -j 8 "\$file" -o FASTQ #trim the files
            mv "\$file"_trimming_report.txt trimming_report #moves to the directory trimming report
        else
            echo ""\$sample".trimmed.fq.gz exists skipping trim galore"
        fi
    done

    trimmed_files="FASTQ/*_trimmed.fq.gz"
    echo \$trimmed_files
    """
}
The script inside the process works fine. However, I'm wondering if I'm misunderstanding or missing something obvious. If I've forgotten to include something, please let me know; any help is appreciated!
Nextflow does not export the variable trimmed_files into its own scope unless you tell it to do so using the env output qualifier; however, doing it that way would not be very idiomatic.
Since you know the pattern of your output files ("FASTQ/*_trimmed.fq.gz"), simply pass that pattern as output:
path "FASTQ/*_trimmed.fq.gz" into trimmed_channel
Some things you do, but probably want to avoid (see the sketch below):
- Changing directory inside your NF process: don't do this, it breaks the whole concept of Nextflow's work-folder setup.
- Writing a bash loop inside an NF process: if you set up your channels correctly, there should only be one task per spawned process.
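Putting that together, a minimal sketch of the process without the cd and the loop (assuming trim_galore writes its output into the task work directory, which is its default) might look like:
process trimming {
    cpus 8

    input:
    file fasta_file from fasta_files

    output:
    path "*_trimmed.fq.gz" into trimmed_channel

    """
    trim_galore -j ${task.cpus} ${fasta_file}
    """
}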
Pallie has already provided some sound advice and, of course, the right answer, which is: environment variables must be declared using the env qualifier.
However, given your script definition, I think there might be some misunderstanding about how best to skip the execution of previously generated results. The cache directive is enabled by default, and when the pipeline is launched with the -resume option, additional attempts to execute a process with the same set of inputs will cause the process execution to be skipped and the stored data to be produced as the actual results.
This example uses Nextflow DSL 2 for my convenience, but DSL 2 is not strictly required:
nextflow.enable.dsl=2

params.fastq_files = "${baseDir}/FASTQ/*.fastq.gz"
params.publish_dir = "./results"

process trim_galore {
    tag { "${sample}:${fastq_file}" }

    publishDir "${params.publish_dir}/TrimGalore", saveAs: { fn ->
        fn.endsWith('.txt') ? "trimming_reports/${fn}" : fn
    }

    cpus 8

    input:
    tuple val(sample), path(fastq_file)

    output:
    tuple val(sample), path('*_trimmed.fq.gz'), emit: trimmed_fastq_files
    path "${fastq_file}_trimming_report.txt", emit: trimming_report

    """
    trim_galore \\
        -j ${task.cpus} \\
        "${fastq_file}"
    """
}

workflow {
    Channel.fromPath( params.fastq_files )
        | map { tuple( it.getSimpleName(), it ) }
        | set { sample_fastq_files }

    results = trim_galore( sample_fastq_files )
    results.trimmed_fastq_files.view()
}
Run using:
nextflow run script.nf \
-ansi-log false \
--fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'

How to pass flag values to subcommands in golang urfave cli

I am using urfave at https://github.com/urfave/cli
to create a CLI with two subcommands.
I am able to create a CLI with a subcommand,
but I really have no idea how to define the flags.
What's the difference between a global flag and a local flag?
Each command can optionally specify subcommands. A subcommand is itself of type Command, which allows commands to be nested and composed together.
To achieve something like:
cli-tool command1 command2 --command2flag
you could have a commands structure like:
app := &cli.App{
    // ...
    Commands: []*cli.Command{
        {
            Name: "command1",
            // Usage, Action, ...
            Subcommands: []*cli.Command{
                {
                    Name: "command2",
                    Flags: []cli.Flag{
                        &cli.StringFlag{
                            Name: "command2flag",
                            // ...
                        },
                    },
                },
            },
        },
    },
    // ...
}
You can see here that command2 is nested in command1's subcommands. And the flags for command2 will only apply to command2. This is an example of a local flag.
Global flags apply to every command and subcommand. This could be useful for some kind of config that the CLI tool needs for all commands, e.g. the server address to talk to.
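As a rough sketch of the difference (using the urfave/cli v2 API, since Commands above is []*cli.Command; the flag and command names are only illustrative), a global flag lives on the App itself and can be read from any command's or subcommand's context:
package main

import (
    "fmt"
    "log"
    "os"

    "github.com/urfave/cli/v2"
)

func main() {
    app := &cli.App{
        // Global flag: defined on the App, visible to every command and subcommand.
        Flags: []cli.Flag{
            &cli.StringFlag{Name: "server", Usage: "server address used by all commands"},
        },
        Commands: []*cli.Command{
            {
                Name: "command1",
                Subcommands: []*cli.Command{
                    {
                        Name: "command2",
                        // Local flag: only parsed after "command1 command2".
                        Flags: []cli.Flag{
                            &cli.StringFlag{Name: "command2flag"},
                        },
                        Action: func(c *cli.Context) error {
                            // Flag lookups fall back to parent contexts, so the global flag is visible here.
                            fmt.Println("server:", c.String("server"))
                            fmt.Println("command2flag:", c.String("command2flag"))
                            return nil
                        },
                    },
                },
            },
        },
    }
    if err := app.Run(os.Args); err != nil {
        log.Fatal(err)
    }
}
It would then be invoked as, for example: cli-tool --server host:port command1 command2 --command2flag value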

Snakemake: how to encode a paired analysis

I want to run GATK recalibration on paired samples (tumor and normal). I need to parse the data using pandas. This is what I wrote:
expand("mapped_reads/merged_samples/{sample[1][tumor]}/{sample[1][tumor]}_{sample[1][normal]}.bam", sample=read_table(config["conditions"], ",").iterrows())
This is the conditions file:
432,433
434,435
I wrote this rule:
rule gatk_RealignerTargetCreator:
    input:
        "mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
        "mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
    output:
        "mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
    params:
        genome=config['reference']['genome_fasta'],
        mills=config['mills'],
        ph1_indels=config['know_phy'],
    log:
        "mapped_reads/merged_samples/logs/{tumor}_{normal}.realign_info.log"
    threads: 8
    shell:
        "gatk -T RealignerTargetCreator -R {params.genome} {params.custom} "
        "-nt {threads} "
        "-I {wildcard.tumor} -I {wildcard.normal} -known {params.ph1_indels} "
        "-o {output} >& {log}"
I have this error:
InputFunctionException in line 17 of /home/maurizio/Desktop/TEST_exome/rules/samfiles.rules:
KeyError: '432/432_433'
Wildcards:
sample=432/432_433
This is the samfiles.rules file:
rule samtools_merge_bam:
    """
    Merge bam files for multiple units into one for the given sample.
    If the sample has only one unit, files will be copied.
    """
    input:
        lambda wildcards: expand("mapped_reads/bam/{unit}_sorted.bam", unit=config["samples"][wildcards.sample])
    output:
        "mapped_reads/merged_samples/{sample}.bam"
    benchmark:
        "benchmarks/samtools/merge/{sample}.txt"
    run:
        if len(input) > 1:
            shell("/illumina/software/PROG2/samtools-1.3.1/samtools merge {output} {input}")
        else:
            shell("cp {input} {output} && touch -h {output}")
I can only guess because you don't show all the relevant rules, but I would say the error occurs because the rule samtools_merge_bam also applies to some later bam file where you have the pattern {tumor}/{tumor}_{normal}...
As a solution, you have to resolve this ambiguity (see the snakemake tutorial). For example, you can constrain the wildcard of samtools_merge_bam to not contain any slashes.
wildcard_constraints:
    sample="[^/]+"
You can put the constraint either globally or inside your samtools_merge_bam rule.
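Inside the rule it would look roughly like this (only the relevant lines are shown):
rule samtools_merge_bam:
    wildcard_constraints:
        sample="[^/]+"
    input:
        lambda wildcards: expand("mapped_reads/bam/{unit}_sorted.bam", unit=config["samples"][wildcards.sample])
    # ... rest of the rule unchanged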

Bash Scripting : How to loop over X number of files, take input and write to a file in the same line

So I have a program written in C that takes in some parameters: calling it allcell
some sample parameters: -m 1800 -n 9
the files being analyzed: cfdT100-0.trj, cfdT100-1.trj, cfdT100-2.trj, cfdT100-3.trj, ... cfdT100-19.trj
file being fed: template.file
out file: result.file
$ allcell -m 1800 -n 9 cfdT100-[0-19].trj < template.file > result.file
But when I look at htop, I see that only cfdT100-0.trj, cfdT100-1.trj and cfdT100-9.trj are being read. How do I make the shell read all the files from 0-19?
Additionally, when I write a script file to automate this, how should I enclose the line? Will this work:
"$($ allcell -m 1800 -n 9 cfdT100-[0-19].trj < template.file > result.file)"
I believe you want to change your pattern to the brace expansion cfdT100-{0..19}.trj instead; [0-19] is a glob character class that matches a single character (0, 1, or 9), which is exactly why only those three files were picked up.
neech#nicolaw.uk:~ $ echo cfdT100-{0..19}.trj
cfdT100-0.trj cfdT100-1.trj cfdT100-2.trj cfdT100-3.trj cfdT100-4.trj cfdT100-5.trj cfdT100-6.trj cfdT100-7.trj cfdT100-8.trj cfdT100-9.trj cfdT100-10.trj cfdT100-11.trj cfdT100-12.trj cfdT100-13.trj cfdT100-14.trj cfdT100-15.trj cfdT100-16.trj cfdT100-17.trj cfdT100-18.trj cfdT100-19.trj
For the scripted version, drop the stray $ inside the command substitution; in fact you don't need the command substitution or the outer quotes at all, just run the command directly with the corrected brace expansion (see the sketch below).
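A minimal script wrapper might look like this (assuming allcell, template.file and the .trj files live in the working directory):
#!/usr/bin/env bash
# Brace expansion produces cfdT100-0.trj ... cfdT100-19.trj before allcell runs.
allcell -m 1800 -n 9 cfdT100-{0..19}.trj < template.file > result.file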
Use a recursive function for an infinite loop:
a()
{
    echo "apple"
    a
}
a
This will create an infinite loop.

Merge two JSON files in bash (no jq)

I have two JSON files:
env.json
{
    "environment":"INT"
}
roles.json
{
    "run_list":[
        "recipe[splunk-dj]",
        "recipe[tideway]",
        "recipe[AlertsSearch::newrelic]",
        "recipe[AlertsSearch]"
    ]
}
The expected output should be something like this:
{
    "environment":"INT",
    "run_list":[
        "recipe[splunk-dj]",
        "recipe[tideway]",
        "recipe[AlertsSearch::newrelic]",
        "recipe[AlertsSearch]"
    ]
}
I need to merge these two JSON files (and others like them) into one single JSON file using only the built-in bash commands available to me.
I only have sed, cat, echo, tail, and wc at my disposal.
Tell whoever put the constraint "bash only" on the project that bash is not sufficient for processing JSON, and get jq.
$ jq --slurp 'add' env.json roles.json
I couldn't use jq either, as I was limited by the client's web host jailing the command-line user with a restricted set of binaries, as most discount/reseller web hosting companies do. Luckily they usually have PHP available, so you can use a one-liner like this, which is roughly what I would place in my install/setup bash script, for example:
php -r '$json1 = "./env.json";$json2 = "./roles.json";$data = array_merge(json_decode(file_get_contents($json1), true),json_decode(file_get_contents($json2),true));echo json_encode($data, JSON_PRETTY_PRINT);'
For clarity, php -r accepts line feeds as well, so this also works:
php -r '
$json1 = "./env.json";
$json2 = "./roles.json";
$data = array_merge(json_decode(file_get_contents($json1), true), json_decode(file_get_contents($json2), true));
echo json_encode($data, JSON_PRETTY_PRINT);'
Output
{
    "environment": "INT",
    "run_list": [
        "recipe[splunk-dj]",
        "recipe[tideway]",
        "recipe[AlertsSearch::newrelic]",
        "recipe[AlertsSearch]"
    ]
}
A little bit hacky, but hopefully it will do.
# $1 = env.json, $2 = roles.json
# Take everything except the last line (the closing "}") of the first file...
env_lines=`wc -l < $1`
env_output=`head -n $(($env_lines - 1)) $1`
# ...and everything except the first line (the opening "{") of the second file...
roles_lines=`wc -l < $2`
roles_output=`tail -n $(($roles_lines - 1)) $2`
# ...then glue the two halves together with a comma in between.
echo "$env_output" "," "$roles_output"
