snakemake 6.0.5: Input a list of folders and multiple files from each folder (to merge Lanes) - bioinformatics

Good day. I have some directories (CND1 and CND2 below), each containing some .fastq files for different lanes.
CND1/ UD_LOO3_R1.fastq.gz UD_LOO4_R1.fastq.gz
CND2/ XD_L001_R1.fastq.gz XD_L004_R1.fastq.gz
Inside each directory, I want to create a merged fastq file named sample_R1.fastq.gz, for instance CND1/UD_R1.fastq.gz and CND2/XD_R1.fastq.gz. To this end, I created the following Snakemake workflow.
from collections import defaultdict
dirs,samp,lane = glob_wildcards("{dir}/{sample}_L{lane}_R1.fastq.gz")
dirs, fls = glob_wildcards("{dir}/{files}_R1.fastq.gz")
D = defaultdict(list)
for x,y in zip(dirs,fls):
    D[x].append(y+'_R1.fastq.gz')
rule ML:
    message:
        "Merge all Lanes for Fragment R1"
    input:
        expand( "{dir}/{files}",zip,dir=D.keys(),files=D.values() )
    output:
        expand( "{dir}/{s}_R1.fastq.gz",zip,dir=dirs,s=set(samp) )
    shell:
        "echo {input} && echo {output} "
        #"zcat {input} >> {output}"
In the code above, dict D contains directories as keys and lists of fastq files as values.
{ 'CND2': ['XD_L001_R1.fastq.gz', 'XD_L004_R1.fastq.gz'], 'CND1': ['UD_LOO3_R1.fastq.gz', 'UD_LOO4_R1.fastq.gz'] }
On a dry run, Snakemake complains about missing input files:
Missing input files for rule ML:
CND2/['XD_L001_R1.fastq.gz', 'XD_L004_R1.fastq.gz']
CND1/['UD_LOO3_R1.fastq.gz', 'UD_LOO4_R1.fastq.gz']
I want to understand what is the correct way to provide both a directory and a list of files as input together. Any help is greatly appreciated.
Thanks.

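The error clearly explains the issue. D.values() yields whole lists, and expand substitutes the string representation of each list into the pattern, so your input ends up being two (nonexistent) files: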
CND2/['XD_L001_R1.fastq.gz', 'XD_L004_R1.fastq.gz']
CND1/['UD_LOO3_R1.fastq.gz', 'UD_LOO4_R1.fastq.gz']
If you need the input to be four files like this:
CND1/UD_LOO3_R1.fastq.gz
CND1/UD_LOO4_R1.fastq.gz
CND2/XD_L001_R1.fastq.gz
CND2/XD_L004_R1.fastq.gz
you need to flatten your dictionary:
inputs = [(dir, file) for dir, files in D.items() for file in files]
rule ML:
    input:
        expand( "{dir}/{files}", zip, dir=[row[0] for row in inputs], files=[row[1] for row in inputs])
or alternatively:
inputs = [(dir, file) for dir, files in D.items() for file in files]
rule ML:
    input:
        expand( "{filename}", filename=[f"{dir}/{file}" for dir, file in inputs])
Overall, you are overcomplicating the problem. The same can be done without this ugly juggling of lists of tuples: glob_wildcards("{filename_base}_R1.fastq.gz") should return a more convenient representation.
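To make the difference concrete, here is a minimal plain-Python sketch (no Snakemake import) using the file names from the question. It shows what formatting a whole list into the pattern produces, what the flattened version produces, and, as an extra assumption of mine, one way to group lane files per merged target. The regular expression lane_re and the name merged are hypothetical helpers, not part of the original workflow.

```python
import re
from collections import defaultdict

# Directory -> lane files, copied from the question.
D = {
    "CND1": ["UD_LOO3_R1.fastq.gz", "UD_LOO4_R1.fastq.gz"],
    "CND2": ["XD_L001_R1.fastq.gz", "XD_L004_R1.fastq.gz"],
}

# Formatting a whole list into the pattern gives one bogus path per directory,
# which is what the "Missing input files" message shows.
broken = [f"{d}/{files}" for d, files in D.items()]

# Flattening first gives one real path per lane file.
flat = [f"{d}/{f}" for d, files in D.items() for f in files]

# Hypothetical grouping: merged target -> lane files that feed it.
# Assumes names look like <sample>_L<lane>_R1.fastq.gz.
lane_re = re.compile(r"^(?P<sample>.+)_L[^_]+_R1\.fastq\.gz$")
merged = defaultdict(list)
for d, files in D.items():
    for f in files:
        m = lane_re.match(f)
        if m:
            merged[f"{d}/{m.group('sample')}_R1.fastq.gz"].append(f"{d}/{f}")

print(broken)
print(flat)
print(dict(merged))
```

A mapping like merged could then back an input function that looks up the lane files for each requested output, instead of listing every file in a single rule.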

Related

How do you pass output from one Nextflow Channel to another and run an .Rmd file?

I have a Nextflow pipeline that has two channels.
The first channel runs and outputs 6 .tsv files to a folder called 'results'.
The second channel is supposed to use all of these 6 .tsv files and create a .pdf report using knitr in R in a process called 'createReport'.
My workflow code looks like this:
workflow {
    inputFileChannel = Channel.fromPath(params.pathOfInputFile, type: 'file') // | collect | createReport // creating channel to pass in input file
    findNumOfProteins(inputFileChannel) // passing in the channel to the process
    findAminoAcidFrequency(inputFileChannel)
    getProteinDescriptions(inputFileChannel)
    getNumberOfLines(inputFileChannel)
    getNumberOfLinesWithoutSpaces(inputFileChannel)
    getLengthFreq(inputFileChannel)
    outputFileChannel = Channel.fromPath("$params.outdir.main/*.tsv", type: 'file').buffer(size:6)
    createReport(outputFileChannel)
}
My 'createReport' process currently looks like this:
process createReport {
    module 'R/4.2.2'
    publishDir params.outdir.output, mode: 'copy'
    output:
        path 'report.pdf'
    script:
        """
        R -e "rmarkdown::render('./createReport.Rmd')"
        """
}
And my 'createReport.Rmd' looks like this (tested in RStudio, where it gives the correct .pdf output):
---
title: "R Markdown Practice"
author: "-"
date: "2022-12-08"
output: pdf_document
---
{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
dataSet <- list.files(path="/Users/-/Desktop/code/nextflow_practice/results/", pattern="*.tsv")
print(dataSet)
for (data in dataSet) {
  print(paste("Showing the table for:", data))
  targetData <- read.table(file=paste("/Users/-/Desktop/code/nextflow_practice/results/", data, sep=""),
                           head=TRUE,
                           nrows=5,
                           sep="\t")
  print(targetData)
  if (data == "length_data.tsv") {
    data_to_graph <- read_tsv(paste("/Users/-/Desktop/code/nextflow_practice/results/", data, sep=""), show_col_types = FALSE)
    plot(x = data_to_graph$LENGTH,y = data_to_graph$FREQ, xlab = "x-axis", ylab = "y-axis", main = "P")
  }
  writeLines("-----------------------------------------------------------------")
}
What would be the correct way to write the createReport process and the workflow sections so as to be able to pass the 6 .tsv outputs from the first channel into the second channel to create the report?
Sorry I am very new to Nextflow and the documentation doesn't help me as much as I would like it to!
Your outputFileChannel looks like it is trying to access files in the publishDir. The problem with accessing files in this directory (i.e. 'results') is that:
Files are copied into the specified directory in an asynchronous
manner, thus they may not be immediately available in the published
directory at the end of the process execution. For this reason files
published by a process must not be accessed by other downstream
processes.
Assuming your inputFileChannel is intended to be a value channel, you could use the following. This requires the outputs of the six processes to be declared in their output blocks (using the path qualifier). We can then simply mix and collect these files, and pass the Rmd file and the list of TSV files to the createReport process. Note that if you move your Rmd into the base directory of your pipeline project (i.e. the same directory as your main.nf script), you can distribute it with your workflow. Providing the Rmd as a process input also ensures it is staged into the process working directory when the job is run. For example:
workflow {
    inputFile = file( params.pathOfInputFile )
    findNumOfProteins( inputFile )
    findAminoAcidFrequency( inputFile )
    getProteinDescriptions( inputFile )
    getNumberOfLines( inputFile )
    getNumberOfLinesWithoutSpaces( inputFile )
    getLengthFreq( inputFile )
    Channel.empty() \
        | mix( findNumOfProteins.out ) \
        | mix( findAminoAcidFrequency.out ) \
        | mix( getProteinDescriptions.out ) \
        | mix( getNumberOfLines.out ) \
        | mix( getNumberOfLinesWithoutSpaces.out ) \
        | mix( getLengthFreq.out ) \
        | collect \
        | set { outputs }
    rmd = file("${baseDir}/createReport.Rmd")
    createReport( outputs, rmd )
}
process createReport {
    module 'R/4.2.2'
    publishDir "${params.outdir}/report", mode: 'copy'
    input:
        path 'input_dir/*'
        path rmd
    output:
        path 'report.pdf'
    """
    Rscript -e "rmarkdown::render('${rmd}')"
    """
}
Note that the createReport process above will stage the input TSV files under a folder called 'input_dir' in the process working directory. You could change this if you want to, but I think this keeps the working directory neat and tidy. Just be sure to modify your Rmd script to point to this folder. For example, you might choose to use something like:
dataSet <- list.files(path="./input_dir", pattern="*.tsv")
Or perhaps even:
dataSet <- list.files(pattern="*.tsv", recursive=TRUE)

Snakemake, how to change output filename when using wildcards

I think I have a simple problem but I don't know how to solve it.
My input folder contains files like this:
AAAAA_S1_R1_001.fastq
AAAAA_S1_R2_001.fastq
BBBBB_S2_R1_001.fastq
BBBBB_S2_R2_001.fastq
My snakemake code:
import glob
import os

samples = [os.path.basename(x) for x in sorted(glob.glob("input/*.fastq"))]
name = []
for x in samples:
    if "_R1_" in x:
        name.append(x.split("_R1_")[0])
NAME = name
rule all:
    input:
        expand("output/{sp}_mapped.bam", sp=NAME),
rule bwa:
    input:
        R1 = "input/{sample}_R1_001.fastq",
        R2 = "input/{sample}_R2_001.fastq"
    output:
        mapped = "output/{sample}_mapped.bam"
    params:
        ref = "refs/AF086833.fa"
    run:
        shell("bwa mem {params.ref} {input.R1} {input.R2} | samtools sort > {output.mapped}")
The output file names are:
AAAAA_S1_mapped.bam
BBBBB_S2_mapped.bam
I want the output file to be:
AAAAA_mapped.bam
BBBBB_mapped.bam
How can I change the output name, or rename the files before or after the bwa rule?
Try this:
import pathlib
indir = pathlib.Path("input")
paths = indir.glob("*_S?_R?_001.fastq")
samples = set([x.stem.split("_")[0] for x in paths])
rule all:
    input:
        expand("output/{sample}_mapped.bam", sample=samples)
def find_fastqs(wildcards):
    fastqs = [str(x) for x in indir.glob(f"{wildcards.sample}_*.fastq")]
    return sorted(fastqs)
rule bwa:
    input:
        fastqs = find_fastqs
    output:
        mapped = "output/{sample}_mapped.bam"
    params:
        ref = "refs/AF086833.fa"
    shell:
        "bwa mem {params.ref} {input.fastqs} | samtools sort > {output.mapped}"
This uses an input function to find the FASTQ files that belong to each sample for rule bwa. There might be a more elegant solution, but I can't see it right now. I think this should work, though.
(Edited to reflect OP's edit.)
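The lookup logic can be tried outside Snakemake. The sketch below is plain Python under a few assumptions: it creates empty placeholder files with the names from the question in a temporary directory, and the helpers sample_names and find_fastqs take the directory and sample name as ordinary arguments instead of a wildcards object. Note that the original glob "*_S?_R?_001.fastq" only matches single-character sample numbers; the sketch uses "_S*" instead, which is my choice.

```python
import pathlib
import tempfile

def sample_names(indir):
    # Everything before the first underscore is treated as the sample name.
    return sorted({p.name.split("_")[0] for p in indir.glob("*_S*_R?_001.fastq")})

def find_fastqs(indir, sample):
    # Sorted so that R1 comes before R2 on the bwa command line.
    return sorted(str(p) for p in indir.glob(f"{sample}_S*_R?_001.fastq"))

with tempfile.TemporaryDirectory() as tmp:
    indir = pathlib.Path(tmp)
    # Placeholder files named as in the question.
    for name in [
        "AAAAA_S1_R1_001.fastq",
        "AAAAA_S1_R2_001.fastq",
        "BBBBB_S2_R1_001.fastq",
        "BBBBB_S2_R2_001.fastq",
    ]:
        (indir / name).touch()

    samples = sample_names(indir)
    fastqs = {
        s: [pathlib.Path(f).name for f in find_fastqs(indir, s)] for s in samples
    }

print(samples)
print(fastqs)
```

Inside a Snakefile the same lookup would be driven by wildcards.sample, as in the answer above.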
Unfortunately, I've also had this problem with filenames with the following logic: {batch}/{seq_run}_{index}_{flowcell}_{lane}_{read_orientation}.fastq.gz.
I think that the core problem is that none of the individual wildcards are unique. Also, not all values for all wildcards can be combined; seq_run1 was run on lane1, not lane2. Therefore, expand() does not work.
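A small plain-Python illustration of that point, with made-up run and lane names: by default expand() builds every combination of the wildcard values (like itertools.product), whereas passing zip pairs the values positionally. The lists seq_runs and lanes are hypothetical and stand in for values parsed from real file names.

```python
from itertools import product

# Hypothetical values: run1 was sequenced on lane1, run2 on lane2.
seq_runs = ["run1", "run2"]
lanes = ["lane1", "lane2"]

# Default expand() behaviour: every combination, including files that never existed.
all_combinations = [f"{r}_{l}.fastq.gz" for r, l in product(seq_runs, lanes)]

# expand(..., zip, ...) behaviour: values are paired by position.
paired = [f"{r}_{l}.fastq.gz" for r, l in zip(seq_runs, lanes)]

print(all_combinations)
print(paired)
```

Positional pairing only helps when the lists are already aligned one entry per real file, which is hard to guarantee when no single wildcard is unique.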
After multiple attempts in Snakemake (see below), my solution was to standardize input with mv / sed / rename. Removing {batch}, {flowcell} and {lane} made it possible to use {sample}, a unique combination of {seq_run} and {index}.
What did not work (but may be worth trying for others in the same situation):
Adding the zip argument to expand()
Renaming output using the following syntax:
output: "_".join(re.split("[/_]", "{full_filename}")[1,2]+".fastq.gz"
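For what it's worth, the expression above is not valid Python as written: a list cannot be indexed with [1,2] and the parentheses do not balance. In addition, "{full_filename}" is still a literal string when the Snakefile is parsed, so there is nothing to split yet. The splitting idea itself does work on concrete file names, for example to build a rename table up front. The sketch below assumes a hypothetical file name that follows the pattern above and an index without underscores; standardized_name and rename_map are illustrative names only.

```python
import re

# Hypothetical file name following
# {batch}/{seq_run}_{index}_{flowcell}_{lane}_{read_orientation}.fastq.gz
path = "batch1/run1_ATCACG_HXXXX_L001_R1.fastq.gz"

def standardized_name(path):
    # Split on "/" and "_": [batch, seq_run, index, flowcell, lane, "R1.fastq.gz"]
    parts = re.split(r"[/_]", path)
    # sample = seq_run + index; a slice [1:3] replaces the invalid [1,2]
    sample = "_".join(parts[1:3])
    # Keep the read orientation so R1 and R2 do not collide (my addition).
    return f"{sample}_{parts[-1]}"

rename_map = {p: standardized_name(p) for p in [path]}
print(rename_map)
```

A table like rename_map could drive the mv / rename step described above, or back an input function that maps each standardized name to its original file.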

Copy files under list of source directories to target directories using gradle

I need to copy the files under each directory in a source list to the corresponding directory in a destination list. Say I have source directories 'A', 'B', 'C' and target directories 'X', 'Y', 'Z': the files under A should be copied to X, those under B to Y, and those under C to Z. I have created a Gradle task for this purpose, but I get an error.
task copyDirs( ) {
    def targetDirList = ['/target1', '/target2', '/target3'].toArray()
    def sourceDirList = ['/source1', '/source2', '/source3'].toArray()
    [sourceDirList,targetDirList].transpose().each {
        copy{
            from it[0].toString()
            into it[1].toString()
        }
    }
}
And below is the exception I get when I try to execute it
No signature of method: org.gradle.api.internal.file.copy.CopySpecWrapper_Decorated.getAt() is applicable for argument types: (java.lang.Integer) values: [0]
Possible solutions: getAt(java.lang.String), putAt(java.lang.String, java.lang.Object), wait(), grep(), getClass(), wait(long)
That's because the it inside the copy closure refers to that closure's own implicit parameter (the copy spec), not to the element you're iterating over. Just name your element:
[sourceDirList,targetDirList].transpose().each { d ->
    copy{
        from d[0].toString()
        into d[1].toString()
    }
}

Gradle print directories in a folder - X level

I have the following folder structure under build folder (which you get during a Gradle build):
CDROM/disk1
CDROM/disk1/disk1file1a.txt
CDROM/disk1/disk1file1b.txt
CDROM/disk2/disk2file2a.txt
CDROM/disk2/disk2file2b.txt
CDROM/disk2/disk2folder2x
CDROM/disk2/disk2folder2y
CDROM/disk3
CDROM/disk3/disk3
CDROM/disk3/disk33
CDROM/disk3/disk33/disk3
CDROM/folder1
CDROM/file1.txt
How can I tell Gradle to show me the following:
Print only the top level / direct child folders (only) in folder "CDROM"
i.e. it should print only disk1, disk2, disk3 and folder1
Print only the top level / direct child folders (only) which has a pattern of disk[0-9] i.e. diskX where X is a number.
i.e. it should print only disk1, disk2 and disk3
The following will do it, but I think there should be a more efficient way to achieve the same, one where I can define patterns and don't have to use the "if" statements I have used below.
FileTree dirs = fileTree (dir: "$buildDir/CDROM", include: "disk*/**")
dirs.visit { FileVisitDetails fd ->
    if (fd.directory && fd.name.startsWith('disk')){
        println "------ $buildDir/CDROM_Installers/${fd.name} ---------------"
    }
}
If by top level you mean only the direct children of CDROM, then this should be as easy as:
new File("${buildDir}/CDROM").eachDir{ if(it.name ==~/disk.*/) println it}
If you want more control on depth and other things, then you can try variations of following code:
new File("${buildDir}/CDROM").traverse( [maxDepth: 2, filter: ~/.*disk\d/,
        type: groovy.io.FileType.DIRECTORIES]){
    println it // or do whatever
}
See the Groovy documentation for File.traverse for more details.

Ruby program which sorts images into different directories by their names?

I would like to make a Ruby program which sorts the images in the current directory into different subfolders, for example:
tree001.jpg, ... tree131.jpg -> to folder "tree"
apple01, ... apple20.jpg -> to folder "apple"
plum1.jpg, plum2.jpg, ... plum33.jpg -> to folder "plum"
and so on. The program should automagically recognize which files belong together by their names. I have no clue how to achieve this. So far I have written a small program that collects the files with Dir into an array and sorts it alphabetically, to help find the appropriate classes from the file names. Does anybody have a good idea?
Check out Find:
http://www.ruby-doc.org/stdlib-2.0/libdoc/find/rdoc/Find.html
Or Dir.glob:
http://ruby-doc.org/core-2.0/Dir.html#method-c-glob
For instance:
Dir.glob("*.jpg")
will return an array that you can iterate with each.
I'd go about it something like this:
require 'fileutils'

files = %w[
  tree001.jpg tree03.jpg tree9.jpg
  apple1.jpg apple002.jpg
  plum3.jpg plum300.jpg
].shuffle
# => ["tree001.jpg", "apple1.jpg", "tree9.jpg", "plum300.jpg", "apple002.jpg", "plum3.jpg", "tree03.jpg"]
grouped_files = files.group_by{ |fn| fn[/^[a-z]+/i] }
# => {"tree"=>["tree001.jpg", "tree9.jpg", "tree03.jpg"], "apple"=>["apple1.jpg", "apple002.jpg"], "plum"=>["plum300.jpg", "plum3.jpg"]}
grouped_files.each do |grp, files|
  Dir.mkdir(grp) unless Dir.exist?(grp)
  files.each { |f| FileUtils.mv(f, "#{grp}/#{f}") }
end
I can't test that because I don't have all the files, nor am I willing to generate them.
The important thing is group_by. It groups the similarly named files, which makes it easy to walk through them.
For your case, you'll want to replace the assignment to files with Dir.glob(...) or Dir.entries(...) to get your list of files.
If you want to separate the file path from the file name, look at File.split or File.dirname and File.basename:
File.split('/path/to/foo')
=> ["/path/to", "foo"]
File.dirname('/path/to/foo')
=> "/path/to"
File.basename('/path/to/foo')
=> "foo"
Assuming every file name starts with non-digit characters followed by at least one digit character, and the initial non-digit characters define the directory you want the file moved to:
require 'fileutils'
Dir.glob("*").select{|f| File.file? f}.each do |file| # For each regular file
  dir = file.match(/[^\d]*/).to_s # Determine destination directory
  FileUtils.mkdir_p(dir) # Make directory if necessary
  FileUtils.mv(file, dir) # Move file
end
The directories are created if necessary. You can run it again after adding files. For example, if you added the file tree1.txt later and re-ran this, it would be moved to tree/ where tree001.jpg through tree131.jpg already are.
Update: In the comments, you added the requirement that you only want to do this for files which form groups of at least 10. Here's one way to do that:
require 'fileutils'
MIN_GROUP_SIZE = 10
reg_files = Dir.glob("*").select{|f| File.file? f}
reg_files.group_by{|f| f.match(/[^\d]*/).to_s}.each do |dir, files|
  next if files.size < MIN_GROUP_SIZE
  FileUtils.mkdir_p(dir)
  files.each do |file|
    FileUtils.mv(file, dir)
  end
end
