Accepting slightly different inputs to snakemake rule (.fq vs .fq.gz) - bioinformatics

I am new to snakemake and would like to be able to take either a pair of .fq files or a pair of .fq.gz files and run them through trim_galore to get a pair of trimmed .fq.gz output files. Without giving all of my Snakefile, I have the below ugly solution where I just copied the rule and changed the inputs. What would be a better solution?
# Trim Galore paired-end trimming rule for unzipped fastqs:
rule trim_galore_unzipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq'),
        r2=join(config['fq_in_path'], '{sample}2.fq'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml'
    shell:
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'

# Trim Galore paired-end trimming rule for gzipped fastqs:
rule trim_galore_zipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq.gz'),
        r2=join(config['fq_in_path'], '{sample}2.fq.gz'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml'
    shell:
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'

Using an input function is likely the best solution. The approach:

- Snakemake passes the wildcards to the input function.
- Using the known YAML config values, build the candidate file names for that sample.
- Use Python to check which files (file suffixes, technically) actually exist.
- Build a list of the valid files.
- Return the list and unpack it in the rule.

Notes:

- Input and output should use the same wildcards; if they don't, it causes issues.
- Make sure the input function cannot return a null string: Snakemake interprets that as a "lack of input" requirement, which is not what you want.
- If you adopt these suggestions, update the rule name; I forgot to.
Snakefile:
configfile: "config.yaml"

from os.path import join
from os.path import exists

rule all:
    input:
        expand("{trim_out_path}/{sample}.{readDirection}.fq.gz",
               trim_out_path=config["trim_out_path"],
               sample=config["sampleList"],
               readDirection=['1', '2'])

def trim_galore_input_determination(wildcards):
    potential_file_path_list = []
    # Cycle through both suffix possibilities:
    for fastqSuffix in [".fq", ".fq.gz"]:
        # Cycle through both read directions:
        for readDirection in ['.1', '.2']:
            # Build the candidate path for each suffix
            potential_file_path = config["fq_in_path"] + "/" + wildcards.sample + readDirection + fastqSuffix
            # Check if this file actually exists
            if exists(potential_file_path):
                # If the file is legit, add it to the list of acceptable files
                potential_file_path_list.append(potential_file_path)
    # Guard against returning an empty list
    if len(potential_file_path_list):
        return potential_file_path_list
    else:
        return ["trim_galore_input_determination_FAILURE" + wildcards.sample]

rule trim_galore_unzipped_PE:
    input:
        unpack(trim_galore_input_determination)
    output:
        expand("{trim_out_path}/{{sample}}.{readDirection}.fq.gz",
               trim_out_path=config["trim_out_path"],
               readDirection=['1', '2'])
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml'
    shell:
        'trim_galore --gzip -o {params.out_path} --paired {input}'
config.yaml:
fq_in_path: input/fq
trim_out_path: output
sampleList: ["mySample1", "mySample2"]
$tree:
|-- [tboyarsk 1540 Sep 6 15:17] Snakefile
|-- [tboyarsk 82 Sep 6 15:17] config.yaml
|-- [tboyarsk 512 Sep 6 8:55] input
| |-- [tboyarsk 512 Sep 6 8:33] fq
| | |-- [tboyarsk 0 Sep 6 7:50] mySample1.1.fq
| | |-- [tboyarsk 0 Sep 6 8:24] mySample1.2.fq
| | |-- [tboyarsk 0 Sep 6 7:50] mySample2.1.fq
| | `-- [tboyarsk 0 Sep 6 8:24] mySample2.2.fq
| `-- [tboyarsk 512 Sep 6 8:55] fqgz
| |-- [tboyarsk 0 Sep 6 7:50] mySample1.1.fq.gz
| |-- [tboyarsk 0 Sep 6 8:32] mySample1.2.fq.gz
| |-- [tboyarsk 0 Sep 6 8:33] mySample2.1.fq.gz
| `-- [tboyarsk 0 Sep 6 8:32] mySample2.2.fq.gz
`-- [tboyarsk 512 Sep 6 7:55] output
$ snakemake --dry-run (input: fq)
rule trim_galore_unzipped_PE:
input: input/fq/mySample1.1.fq, input/fq/mySample1.2.fq
output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
jobid: 1
wildcards: sample=mySample1
rule trim_galore_unzipped_PE:
input: input/fq/mySample2.1.fq, input/fq/mySample2.2.fq
output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
jobid: 2
wildcards: sample=mySample2
localrule all:
input: output/mySample1.1.fq.gz, output/mySample2.1.fq.gz, output/mySample1.2.fq.gz, output/mySample2.2.fq.gz
jobid: 0
Job counts:
count jobs
1 all
2 trim_galore_unzipped_PE
3
$ snakemake --dry-run (input: fqgz)
rule trim_galore_unzipped_PE:
input: input/fqgz/mySample1.1.fq.gz, input/fqgz/mySample1.2.fq.gz
output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
jobid: 1
wildcards: sample=mySample1
rule trim_galore_unzipped_PE:
input: input/fqgz/mySample2.1.fq.gz, input/fqgz/mySample2.2.fq.gz
output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
jobid: 2
wildcards: sample=mySample2
localrule all:
input: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz, output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
jobid: 0
Job counts:
count jobs
1 all
2 trim_galore_unzipped_PE
3
There are ways to make this more generic, but since you declare and use the YAML config to build most of the file name, I will avoid discussing them in this answer. Just know it's possible and somewhat encouraged.
The "--paired {input}" will expand to provide both files. Because of the for-loop, the 1 will always come before the 2.
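For anyone who wants to poke at the input function interactively, the toy layout from the tree listing above can be recreated with a few shell commands (paths taken from the config.yaml shown; the snakemake invocation is left as a trailing comment):

```shell
# Recreate the toy project layout used in the dry runs above
mkdir -p input/fq input/fqgz output
for s in mySample1 mySample2; do
    for r in 1 2; do
        touch "input/fq/$s.$r.fq"        # unzipped candidates
        touch "input/fqgz/$s.$r.fq.gz"   # gzipped candidates
    done
done
# then point fq_in_path at input/fq or input/fqgz and run:  snakemake -n
```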

Related

How to iterate through multiple directories with multiple ifs in bash?

Unfortunately I'm quite new at bash. I want to write a script that starts in a main directory and checks all subdirectories one by one for the presence of certain files; if those files are present, it performs an operation on them. For now, I have written a simplified version to test the first part (checking for the files in each directory). This code runs without any errors that I can tell, but it does not echo anything to say that it has found the files which I know are there.
#!/bin/bash
runlist=(1 2 3 4 5 6 7 8 9)
for f in *; do
    if [[ -d {$f} ]]; then
        # if f is a directory then cd into it
        cd "{$f}"
        for b in $runlist; do
            if [[ -e "{$b}.png" ]]; then
                echo "Found {$b}"
                # if the file exists then say so
            fi
        done
        cd -
    fi
done
Welcome to Stack Overflow.
The following will do the trick (a combination of find, an array, and if/then/else):
# list of files we are looking for
runlist=(1 2 4 8 16 32 64 128)

# find each of the above anywhere below the current directory
# (add -maxdepth 2 to the find command if, as in your example, you
# only want to look one directory level down)
for b in "${runlist[@]}"; do
    echo
    PATH_TO_FOUND_FILE=$(find . -name "$b.png")
    if [ -z "$PATH_TO_FOUND_FILE" ]
    then
        echo "nothing found" >> /dev/null
    else
        # You wanted a positive confirmation, so
        echo "found $b.png"
        # Now do something with the found file. Let's say ls -l; change that to whatever
        ls -l $PATH_TO_FOUND_FILE
    fi
done
Here is an example run:
mamuns-mac:stack foo$ ls -lR
total 8
drwxr-xr-x 4 foo 1951595366 128 Apr 11 18:03 dir1
drwxr-xr-x 3 foo 1951595366 96 Apr 11 18:03 dir2
-rwxr--r-- 1 foo 1951595366 652 Apr 11 18:15 find_file_and_do_something.sh
./dir1:
total 0
-rw-r--r-- 1 foo 1951595366 0 Apr 11 17:58 1.png
-rw-r--r-- 1 foo 1951595366 0 Apr 11 17:58 8.png
./dir2:
total 0
-rw-r--r-- 1 foo 1951595366 0 Apr 11 18:03 64.png
mamuns-mac:stack foo$ ./find_file_and_do_something.sh
found 1.png
-rw-r--r-- 1 foo 1951595366 0 Apr 11 17:58 ./dir1/1.png
found 8.png
-rw-r--r-- 1 foo 1951595366 0 Apr 11 17:58 ./dir1/8.png
found 64.png
-rw-r--r-- 1 foo 1951595366 0 Apr 11 18:03 ./dir2/64.png
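For what it's worth, the reason the original script stays silent is the brace placement: {$f} is not a variable expansion in bash (the correct form is ${f} or plain $f), so the -d and -e tests never match, and $runlist without [@] expands to only the first array element. A minimal corrected sketch of that loop (the demo fixture at the top is made up for illustration):

```shell
#!/bin/bash
# demo fixture: one subdirectory holding 3.png
mkdir -p demo_dir && touch demo_dir/3.png

runlist=(1 2 3 4 5 6 7 8 9)
for f in */; do                       # trailing slash: the glob matches directories only
    for b in "${runlist[@]}"; do      # [@] expands to every array element, not just the first
        if [ -e "$f$b.png" ]; then
            echo "Found $b in $f"     # the file exists, so say so
        fi
    done
done
```

Looping over */ also sidesteps the cd/cd - dance entirely.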

Csh - Fetching fields via awk inside xargs

I'm struggling to understand this behavior:
Script behavior: read a file (containing dates); print a list of files in a multi-level directory tree together with their sizes; print the file size only (future step: sum the overall file sizes).
Starting script:
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | head"
2000-03:
1000 /folder/2000-03balbasldas
2000-04:
12300 /folder/2000-04asdwqdas
[and so on]
But when I try to filter via awk on the first field, I still get the whole line
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | awk '{print $1}'"
2000-03:
1000 /folder/2000-03balbasldas
2000-04:
12300 /folder/2000-04asdwqdas
I've already approached it via divide-et-impera, and the following command works just fine:
du -d 2 "/folder/" | grep '2000-03' | awk '{print $1}'
1000
I'm afraid that I'm missing something very trivial, but I haven't found anything so far.
Any idea? Thanks!
Input: directory containing folders named YYYY-MM-random_data and a file containing strings:
ls -l
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-03-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-04-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-05-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablablb
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablablc
[...]
cat dates
2000-03
2000-04
2000-05
[...]
Expected output: sum of the disk space occupied by all the files contained in the folder whose name include the string in the file dates
2000-03: 1000
2000-04: 2123
2000-05: 1222112
[...]
But in particular, I'm interested in why awk is not able to fetch the column $1 I asked it to.
Ok it seems I found the answer myself after a lot of research :D
I'll post it here, hoping that it will help somebody else out.
https://unix.stackexchange.com/questions/282503/right-syntax-for-awk-usage-in-combination-with-other-command-inside-xargs-sh-c
The trick was to escape the $ sign.
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | awk '{print \$1}'"
Using GNU Parallel it looks like this:
parallel --tag "eval du -s folder/{}* | perl -ne '"'$s+=$_ ; END {print "$s\n"}'"'" :::: dates
--tag prepends the line with the date.
{} is replaced with the date.
eval du -s folder/{}* finds all the dirs starting with the date and gives the total du from those dirs.
perl -ne '$s+=$_ ; END {print "$s\n"}' sums up the output from du
Finally, there is a bit of quoting trickery to get everything quoted correctly.
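As a side note, the backslash escape can also be avoided by flipping the quoting: single-quote the sh -c body so the outer shell leaves it alone, and double-quote the awk program inside it, escaping its $1 from the inner sh. A sketch with a toy fixture standing in for the real /folder/ tree (the directory names here are made up; printf rather than echo keeps the date and size on one line):

```shell
# toy fixture: two dated folders and a dates file
mkdir -p folder/2000-03-blablabla folder/2000-04-blablabla
printf '2000-03\n2000-04\n' > dates

# single quotes outside, double quotes around the awk program inside
cat dates | xargs -I {} sh -c 'printf "{}: "; du -d 1 folder/ | grep {} | awk "{print \$1}"'
```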

How to copy a directory in a way that all files replaced by zero size files with the same filename

For example, I have a directory called aaa, and there are files named 11.txt, 22.exe, 33.mp3 under the aaa directory.
Now I want to make a mirror of this directory, but I only want the structure of the directory, not the contents; that is, copy aaa to bbb. Files under bbb should be 11.txt.zero, 22.exe.zero, 33.mp3.zero. The .zero extension indicates that each copied file should have zero file size.
The directory generally contains subfolders.
How can I achieve this using Windows CMD and Linux bash?
Slightly better:
Being in the directory you want to copy:
Make the directories:
find -type d -exec mkdir -p ../new_dir/{} \;
Touch the files:
find -type f -exec touch ../new_dir/{}.zero \;
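A quick worked run of those two commands (GNU find assumed, since {} is substituted even when embedded in a longer argument; the aaa/bbb names are borrowed from the question in place of new_dir):

```shell
# small source tree to mirror
mkdir -p aaa/q1
echo data > aaa/q1/test1.txt
echo data > aaa/top.txt

cd aaa
find . -type d -exec mkdir -p ../bbb/{} \;      # recreate the directory structure
find . -type f -exec touch ../bbb/{}.zero \;    # zero-size copies with a .zero suffix
cd ..
# bbb now holds empty bbb/top.txt.zero and bbb/q1/test1.txt.zero
```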
Original approach
I would do it in two steps:
Copy all files with rsync -r:
rsync -r origin_dir/ target_dir/
Loop through the files in target_dir/, empty them and rename to .zero:
find target_dir/ -type f -exec sh -c 'f={}; > $f; mv $f $f.zero' \;
This is of course not very optimal if the files in the original dir are quite big, because we first copy them and then empty them.
Test
$ tree aa
aa
├── a
├── a1
│   └── d
├── b
├── c
└── d
1 directory, 5 files
$ ll aa/*
-rw-r--r-- 1 me me 6 Jun 12 11:16 aa/a
-rw-r--r-- 1 me me 6 Jun 12 11:16 aa/b
-rw-r--r-- 1 me me 6 Jun 12 11:16 aa/c
-rw-r--r-- 1 me me 6 Jun 12 11:16 aa/d
aa/a1:
total 0
-rw-r--r-- 1 me me 0 Jun 12 11:16 d
$ rsync -r aa/ bb/
$ tree bb
bb
├── a
├── a1
│   └── d
├── b
├── c
└── d
1 directory, 5 files
$ find bb/ -type f -exec sh -c 'f={}; > $f; mv $f $f.zero' \;
$ tree bb
bb
├── a1
│   └── d.zero
├── a.zero
├── b.zero
├── c.zero
└── d.zero
1 directory, 5 files
$ ll bb/*
-rw-r--r-- 1 me me 0 Jun 12 11:23 bb/a.zero
-rw-r--r-- 1 me me 0 Jun 12 11:23 bb/b.zero
-rw-r--r-- 1 me me 0 Jun 12 11:23 bb/c.zero
-rw-r--r-- 1 me me 0 Jun 12 11:23 bb/d.zero
bb/a1:
total 0
-rw-r--r-- 1 me me 0 Jun 12 11:23 d.zero
@echo off
setlocal enableextensions disabledelayedexpansion
set "source=c:\somewhere\aaaa"
set "target=x:\anotherplace\bbbb"
robocopy "%source%\." "%target%\." /s /e /create
for /r "%target%" %%a in (*) do ren "%%~fa" "%%~nxa.zero"
Here the robocopy command will handle the folder recursion and zero file generation. Once this is done, all the files in the target structure are renamed.
First, copy the directory structure:
find aaa -type d -print | sed 's/^aaa/bbb/' | xargs mkdir -p
Second, touch the files:
find aaa -type f -print | sed 's/^aaa/bbb/' | sed 's/$/\.zero/' | xargs touch
And the result:
[trak_000.COGITATOR] ➤ ls -lR aaa bbb
aaa:
total 0
drwxrwxr-x 1 trak_000 UsersGrp 0 Jun 12 12:20 q1
aaa/q1:
total 1
-rw-rw-r-- 1 trak_000 UsersGrp 194 Jun 12 12:20 test1.txt
bbb:
total 0
drwxrwxr-x 1 trak_000 UsersGrp 0 Jun 12 12:24 q1
bbb/q1:
total 0
-rw-rw-r-- 1 trak_000 UsersGrp 0 Jun 12 12:24 test1.txt.zero
Use the tree command to get a recursive listing of the source folder, then create the zero-size structure with mkdir for the subdirectories and touch for the files.
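One caveat about the sed/xargs pipelines above: they split on whitespace, so file names containing spaces get mangled. A NUL-delimited variant that survives such names (GNU find, sed -z, and xargs -0 assumed; the fixture names are invented for the demo):

```shell
mkdir -p "aaa/sub dir"
echo hi > "aaa/sub dir/my file.txt"    # deliberately awkward names

# mirror the directories, NUL-delimited so spaces survive
find aaa -type d -print0 | sed -z 's/^aaa/bbb/' | xargs -0 mkdir -p
# touch the empty .zero files
find aaa -type f -print0 | sed -z -e 's/^aaa/bbb/' -e 's/$/.zero/' | xargs -0 touch
```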

Split a .txt file based on content

I have a huge *.txt file as follows:
~~~~~~~~ small file content 1
~~~~~~~~ small file content 2
...
~~~~~~~~ small file content n
How do I split this into n files, preferably via bash?
Use csplit
$ csplit --help
Usage: csplit [OPTION]... FILE PATTERN...
Output pieces of FILE separated by PATTERN(s) to files `xx00', `xx01', ...,
and output byte counts of each piece to standard output.
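For the input shown in the question, where every small file begins with a run of ~ characters, a csplit invocation might look like this (GNU csplit assumed; -z elides the empty leading piece and '{*}' repeats the pattern through the whole file):

```shell
# sample input shaped like the question's
printf '%s\n' \
    '~~~~~~~~ small file content 1' 'body 1' \
    '~~~~~~~~ small file content 2' 'body 2' \
    '~~~~~~~~ small file content 3' 'body 3' > huge.txt

# split before each ~~~~~~~~ header; pieces land in xx00, xx01, xx02
csplit -z huge.txt '/^~~~~~~~~/' '{*}'
```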
With awk:
awk 'BEGIN {c=1} NR % 10000 == 0 { c++ } { print $0 > ("splitfile_" c) }' LARGEFILE
will do. It sets up a counter that is incremented every 10,000 lines, and writes each line to the corresponding splitfile_N file.
HTH
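To sanity-check that counter logic at a smaller scale, here is the same program with the modulus dropped from 10000 to 3 on a 7-line file. Note the first chunk holds only 2 lines, because the counter bumps on line 3 before line 3 is printed:

```shell
seq 7 > LARGEFILE
awk 'BEGIN {c=1} NR % 3 == 0 { c++ } { print $0 > ("splitfile_" c) }' LARGEFILE
# splitfile_1: lines 1-2, splitfile_2: lines 3-5, splitfile_3: lines 6-7
```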
If each line of your HUGE text file contains content that you would like to split into its own file, then this should work:
One-liner:
awk '{print >("SMALL_BATCH_OF_FILES_" NR)}' BIG_FILE
Test:
[jaypal:~/Temp] cat BIG_FILE
~~~~~~~~ small file content 1
~~~~~~~~ small file content 2
~~~~~~~~ small file content 3
~~~~~~~~ small file content 4
~~~~~~~~ small file content n-1
~~~~~~~~ small file content n
[jaypal:~/Temp] awk '{print >("SMALL_BATCH_OF_FILES_" NR)}' BIG_FILE
[jaypal:~/Temp] ls -lrt SMALL_BATCH_OF_FILES_*
-rw-r--r-- 1 jaypalsingh staff 30 17 Dec 14:19 SMALL_BATCH_OF_FILES_6
-rw-r--r-- 1 jaypalsingh staff 32 17 Dec 14:19 SMALL_BATCH_OF_FILES_5
-rw-r--r-- 1 jaypalsingh staff 30 17 Dec 14:19 SMALL_BATCH_OF_FILES_4
-rw-r--r-- 1 jaypalsingh staff 30 17 Dec 14:19 SMALL_BATCH_OF_FILES_3
-rw-r--r-- 1 jaypalsingh staff 30 17 Dec 14:19 SMALL_BATCH_OF_FILES_2
-rw-r--r-- 1 jaypalsingh staff 30 17 Dec 14:19 SMALL_BATCH_OF_FILES_1
[jaypal:~/Temp] cat SMALL_BATCH_OF_FILES_1
~~~~~~~~ small file content 1
[jaypal:~/Temp] cat SMALL_BATCH_OF_FILES_2
~~~~~~~~ small file content 2
[jaypal:~/Temp] cat SMALL_BATCH_OF_FILES_6
~~~~~~~~ small file content n

Divide files depending on X and put into individual folder

How do I create a script in a Windows command prompt (or a shell script in bash) to divide files into groups of a given number, say X, and put each group into its own folder?
Example: I have 10 files and X is 4 (I can set it inside the script). After running the script, it will create 3 folders: the 1st folder contains 4 files, the 2nd contains 4 files, and the last contains the remaining 2 files.
Regarding how the files are divided: it can go either by date or by filename.
Example: suppose the 10 files are a.txt, aa.txt, b.txt, cd.txt, ef.txt, g.txt, h.txt, iii.txt, j.txt and zzz.txt. After running the script, the 1st folder contains a.txt, aa.txt, b.txt, cd.txt; the 2nd folder contains ef.txt, g.txt, h.txt, iii.txt; and the last folder contains the remaining files, j.txt and zzz.txt.
Based on your description, an awk one-liner can achieve your goal.
Check the example below; you can replace the "4" in the xargs -n parameter with your X:
kent$ l
total 0
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 01.txt
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 02.txt
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 03.txt
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 04.txt
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 05.txt
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 06.txt
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 07.txt
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 08.txt
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 09.txt
-rw-r--r-- 1 kent kent 0 2011-09-27 11:04 10.txt
kent$ ls|xargs -n4|awk ' {i++;system("mkdir dir"i);system("mv "$0" -t dir"i)}'
kent$ tree
.
|-- dir1
| |-- 01.txt
| |-- 02.txt
| |-- 03.txt
| `-- 04.txt
|-- dir2
| |-- 05.txt
| |-- 06.txt
| |-- 07.txt
| `-- 08.txt
`-- dir3
|-- 09.txt
`-- 10.txt
#!/usr/bin/env bash
dir="${1-.}"
x="${2-4}"
let n=0
let sub=0
while IFS= read -r file ; do
    if [ $(bc <<< "$n % $x") -eq 0 ] ; then
        let sub+=1
        mkdir -p "subdir$sub"
        n=0
    fi
    mv "$file" "subdir$sub"
    let n+=1
done < <(find "$dir" -maxdepth 1 -type f)
Works even when files have spaces and other special characters in their names.
