Snakemake slowing down exponentially due to a large number of files being processed - performance

I am currently writing a pipeline that generates positive RNA sequences, shuffles them, and afterwards analyses both the positive sequences and the shuffled (negative) sequences. For example, I want to generate 100 positive sequences and shuffle EACH of those sequences 1000 times with three different algorithms. For this purpose, I use two wildcards (sample_index and pred_index) ranging from 0 to 99 and 0 to 999, respectively. As a last step, all files are analysed by another three different tools.
Now to my problem: building the DAG takes literally hours, followed by an even slower execution of the actual pipeline. When it starts, it executes a batch of 32 jobs (because I allocated 32 cores to Snakemake) and then takes 10 to 15 minutes to execute the next batch (due to some file checks, I guess). The complete execution of the pipeline would take approximately 2 months.
Below is a simplified example of my Snakefile. Is there any way to optimise this so that Snakemake and its overhead are no longer the bottleneck?
ITER_POS = 100
ITER_PRED = 1000
SAMPLE_INDEX = range(0, ITER_POS)
PRED_INDEX = range(0, ITER_PRED)
SHUFFLE_TOOLS = ["1", "2", "3"]
PRED_TOOLS = ["A", "B", "C"]
rule all:
    input:
        # Expand for negative sample analysis
        expand("predictions_{pred_tool}/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt",
               pred_tool = PRED_TOOLS,
               shuffle_tool = SHUFFLE_TOOLS,
               sample_index = SAMPLE_INDEX,
               pred_index = PRED_INDEX),
        # Expand for positive sample analysis
        expand("predictions_{pred_tool}/pos_sample_{sample_index}.txt",
               pred_tool = PRED_TOOLS,
               sample_index = SAMPLE_INDEX)

# GENERATION
rule generatePosSample:
    output: "samples/pos_sample_{sample_index}.clu"
    shell:  "sequence_generation.py > {output}"

# SHUFFLING
rule shufflePosSamples1:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_1_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples2:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_2_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples3:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_3_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

# ANALYSIS
rule analysePosSamplesA:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_A/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"

rule analysePosSamplesB:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_B/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_B.py {input} > {output}"

rule analysePosSamplesC:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_C/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_C.py {input} > {output}"

rule analyseNegSamplesA:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_A/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"

rule analyseNegSamplesB:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_B/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_B.py {input} > {output}"

rule analyseNegSamplesC:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_C/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_C.py {input} > {output}"

Even though I did not have a particularly large number of files to process, and I did not experience slow execution, I did see a significant slowdown in the DAG computation step.
Hence, I want to share my solution:
Whenever an input refers to the output of another rule, use Snakemake's built-in rule dependencies and rule referencing instead of repeating the file path:
### Bad example
rule bad_example_rule:
    input:
        "output_from_previous_rule.txt"
    output:
        "output.txt"
    shell:
        "touch {output[0]}"
### Solution
rule solution_example_rule:
    input:
        rules.previous_rule_name.output[0]
    output:
        "output.txt"
    shell:
        "touch {output[0]}"
I do not know why, but for me it sped up the DAG build process by at least 100x.
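For the Snakefile in the question, that change would look roughly like this (a sketch for two of the rules; the remaining shuffle and analysis rules follow the same pattern, using the rule names defined above):
rule shufflePosSamples1:
    input:  rules.generatePosSample.output[0]
    output: "samples/neg_sample_1_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule analysePosSamplesA:
    input:  rules.generatePosSample.output[0]
    output: "predictions_A/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"
Note that rules.generatePosSample.output[0] can only refer to a rule defined earlier in the Snakefile, so keep the generation rule above the rules that consume its output.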

Related

combine permutations of .wav files

I am trying to combine permutations of some .wav files.
There are 6 variations of each of 4 instruments. Each generated track should have one of each instrument. If my math is right, there should be 24 unique permutations.
The files are named like:
beat_1.wav, beat_2.wav ...
bass_1.wav, bass_2.wav ...
chord_1.wav, chord_2.wav ...
melody_1.wav, melody_2.wav ...
I've tried to combine them with
sox -m {beat,bass,chord,melody}_{1..6}.wav out_{1..24}.wav
but regardless of what range of values I use for the out_n.wav files, sox immediately gives this error:
sox FAIL formats: can't open input file `out_23.wav': No such file or directory
The number in out_23.wav is always one lower than whatever range I specify.
I'm open to using tools other than sox and bash, provided I can generate all the tracks in one command/program (I don't want to do it by hand in Audacity, for example).
If you replace sox with echo, you will see that the command you constructed is not really permuting the way you want:
$ echo sox -m {beat,bass,chord,melody}_{1..6}.wav out_{1..24}.wav
sox -m beat_1.wav beat_2.wav beat_3.wav beat_4.wav beat_5.wav beat_6.wav bass_1.wav bass_2.wav bass_3.wav bass_4.wav bass_5.wav bass_6.wav chord_1.wav chord_2.wav chord_3.wav chord_4.wav chord_5.wav chord_6.wav melody_1.wav melody_2.wav melody_3.wav melody_4.wav melody_5.wav melody_6.wav out_1.wav out_2.wav out_3.wav out_4.wav out_5.wav out_6.wav out_7.wav out_8.wav out_9.wav out_10.wav out_11.wav out_12.wav out_13.wav out_14.wav out_15.wav out_16.wav out_17.wav out_18.wav out_19.wav out_20.wav out_21.wav out_22.wav out_23.wav out_24.wav
So we see that there are 24 input files as required, but the command also supplies 24 output names on the same line. According to the sox documentation, every file except the last is treated as an input, so out_1.wav ... out_23.wav are also treated as inputs rather than outputs. That is the logic problem.
If you want to step through all 24 files one at a time, I recommend a for loop, e.g.
i=0
for f in {beat,bass,chord,melody}_{1..6}.wav
do
((i++))
echo "Input: " $f "Output: out_${i}.wav"
done
Which outputs:
Input: beat_1.wav Output: out_1.wav
Input: beat_2.wav Output: out_2.wav
Input: beat_3.wav Output: out_3.wav
Input: beat_4.wav Output: out_4.wav
Input: beat_5.wav Output: out_5.wav
Input: beat_6.wav Output: out_6.wav
Input: bass_1.wav Output: out_7.wav
Input: bass_2.wav Output: out_8.wav
Input: bass_3.wav Output: out_9.wav
Input: bass_4.wav Output: out_10.wav
Input: bass_5.wav Output: out_11.wav
Input: bass_6.wav Output: out_12.wav
Input: chord_1.wav Output: out_13.wav
Input: chord_2.wav Output: out_14.wav
Input: chord_3.wav Output: out_15.wav
Input: chord_4.wav Output: out_16.wav
Input: chord_5.wav Output: out_17.wav
Input: chord_6.wav Output: out_18.wav
Input: melody_1.wav Output: out_19.wav
Input: melody_2.wav Output: out_20.wav
Input: melody_3.wav Output: out_21.wav
Input: melody_4.wav Output: out_22.wav
Input: melody_5.wav Output: out_23.wav
Input: melody_6.wav Output: out_24.wav
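If the actual goal is one mixed track per combination of one variation of each instrument, a nested loop can drive sox directly. This is only a sketch (untested), and note that one of each of four instruments with six variations each gives 6x6x6x6 combinations rather than 24:
#!/bin/bash
# Mix one variation of each instrument per output file.
# sox -m treats every filename except the last as an input to mix.
i=0
for b in beat_{1..6}.wav; do
    for s in bass_{1..6}.wav; do
        for c in chord_{1..6}.wav; do
            for m in melody_{1..6}.wav; do
                ((i++))
                sox -m "$b" "$s" "$c" "$m" "out_${i}.wav"
            done
        done
    done
done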

Script to create a four character string based on known numerical relationships

Consider a three-line input file referring to four unique numbers (1,2,3,4), where each line gives the position of one number relative to another.
So for example in the following input set, 4 is next to 2, 2 is next to 3, and 1 is next to 4.
42
23
14
Given that, how would a script assemble all four numbers in such a way that each number's known relationship is maintained?
In other words, there are two answers, 1423 and 3241, but how do I arrive at them programmatically?
Not very sensible or efficient, but fun (for me, at least) :-)
This will echo all the permutations using GNU Parallel:
parallel echo {1}{2}{3}{4} ::: {1..4} ::: {1..4} ::: {1..4} ::: {1..4}
And add some grepping on the end:
parallel echo {1}{2}{3}{4} ::: {1..4} ::: {1..4} ::: {1..4} ::: {1..4} | grep -E "42|24" | grep -E "23|32" | grep -E "14|41"
Output
1423
3241
Brute forcing the luck:
for (( ; ; ))
do
    res=($(echo "42
23
14" | shuf))
    if ((${res[0]}%10 == ${res[1]}/10 && ${res[1]}%10 == ${res[2]}/10))
    then
        echo "success: ${res[@]}"
        break
    fi
    echo "fail: ${res[@]}"
done
fail: 42 14 23
fail: 42 23 14
fail: 42 14 23
success: 14 42 23
For 3 numbers, this approach is acceptable.
shuf shuffles the input lines, and the array res is filled with the numbers.
Then we take two consecutive numbers and test whether the last digit of the first matches the first digit of the next, and likewise for the 2nd and 3rd numbers.
If so, we break with a success message. For debugging, a failure message is better than a silent endless loop.
For longer chains of numbers, it would be better to test the permutations systematically, with a function that checks two consecutive numbers and is called from a loop over the indices.
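Sticking with the GNU Parallel idea, a systematic version could look like this (a sketch; it assumes the pairs are stored one per line in a hypothetical file pairs.txt):
parallel echo {1}{2}{3}{4} ::: {1..4} ::: {1..4} ::: {1..4} ::: {1..4} |
    grep -v '\(.\).*\1' |
    while read -r p; do
        ok=1
        while read -r pair; do
            a=${pair:0:1} b=${pair:1:1}
            # every pair must occur as adjacent digits, in either order
            [[ $p == *"$a$b"* || $p == *"$b$a"* ]] || ok=0
        done < pairs.txt
        ((ok)) && echo "$p"
    done
The grep -v '\(.\).*\1' drops strings with a repeated digit, leaving the 24 true permutations; for the example pairs it should print 1423 and 3241.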

Joining lines, modulo the number of records

Say my stream is x*N lines long, where x is the number of records and N is the number of columns per record, and the data is laid out column-wise. For example, x=2, N=3:
1
2
Alice
Bob
London
New York
How can I join every line, modulo the number of records, back into columns:
1 Alice London
2 Bob New York
If I use paste with N -s (i.e. paste - - -), I get the transposed output. I could use split with the -l option equal to N, then recombine the pieces afterwards with paste, but I'd like to do it within the stream without spitting out temporary files all over the place.
Is there an "easy" solution (i.e., rather than invoking something like awk)? I'm thinking there may be some magic join solution, but I can't see it...
EDIT Another example, when x=5 and N=3:
1
2
3
4
5
a
b
c
d
e
alpha
beta
gamma
delta
epsilon
Expected output:
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
You are looking for pr to "columnate" the stream:
pr -T -s$'\t' -3 <<'END_STREAM'
1
2
Alice
Bob
London
New York
END_STREAM
1 Alice London
2 Bob New York
pr is in coreutils.
Most systems should include a tool called pr, intended to print files. It's part of POSIX.1 so it's almost certainly on any system you'll use.
$ pr -3 -t < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
Or if you prefer,
$ pr -3 -t -s, < inp1
1,a,alpha
2,b,beta
3,c,gamma
4,d,delta
5,e,epsilon
or
$ pr -3 -t -w 20 < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilo
Check the link above for standard usage information, or man pr for specific options in your operating system.
In order to reliably process the input you need to either know the number of columns in the output file or the number of lines in the output file. If you just know the number of columns, you'd need to read the input file twice.
Hackish coreutils solution
# If you don't know the number of output lines in advance but do know the
# number of output columns, you can calculate the number of lines with wc -l.
# Split the file by the number of output lines
split -l"${olines}" file FOO # FOO is a prefix. Choose a better one
paste FOO*
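For example, the number of output lines can be derived from the column count (a sketch, assuming the total line count divides evenly by the number of output columns c):
c=3                                   # number of output columns
olines=$(( $(wc -l < file) / c ))     # lines per column = number of output lines
split -l "$olines" file FOO
paste FOO*
rm FOO*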
AWK solutions
If you know the number of output columns in advance you can use this awk script:
convert.awk:
BEGIN {
    # Split the file into one big record where fields are separated
    # by newlines
    RS=""
    FS="\n"
}
FNR==NR {
    # We are reading the file twice (see invocation below).
    # On the first pass we only store the number of fields (lines)
    # in the variable n, because we need it when processing the file,
    # then skip to the next record so nothing is printed yet.
    n=NF
    next
}
{
    # n / c is the number of output lines
    # For every output line ...
    for(i=0;i<n/c;i++) {
        # ... print the columns belonging to it
        for(ii=1+i;ii<=NF;ii+=n/c) {
            printf "%s ", $ii
        }
        print "" # Adds a newline
    }
}
and call it like this:
awk -vc=3 -f convert.awk file file # Twice the same file
If you know the number of output lines in advance you can use the following awk script:
convert.awk:
BEGIN {
    # Split the file into one big record where fields are separated
    # by newlines
    RS=""
    FS="\n"
}
{
    # x is the number of output lines and has been passed to the
    # script. For each output line ...
    for(i=0;i<x;i++){
        # ... print the columns belonging to it
        for(ii=i+1;ii<=NF;ii+=x){
            printf "%s ",$ii
        }
        print "" # Adds a newline
    }
}
And call it like this:
awk -vx=2 -f convert.awk file

Having SUM issues with a bash script

I'm trying to write a script to pull the numbers out of 4 files that store temperature readings from 4 industrial freezers. This is a hobby script; it generates the general readouts I wanted, but when I try to generate a SUM of the temperature readings, I get the following printout in the file. My goal is to print only the final SUM, not the individual numbers in a vertical format.
Any help would be greatly appreciated; here's my code:
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
Here is what I am getting in return:
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM, but I don't need the individual numbers listed as well, just the SUM total.
Why not stick with AWK completely? Code:
$ cat > summer.awk
{
    while(match($0,/[0-9]+\.[0-9]+/))      # while there are matches in the record
    {
        sum+=substr($0, RSTART, RLENGTH)   # extract the match and add it to the sum
        $0=substr($0, RSTART + RLENGTH)    # restart after the previous match
        count++                            # count matches
    }
}
END {
    print sum"/"count"="sum/count          # print stuff
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk data.txt
70.01/4=17.5025
It might work in the winter too.
The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have empty input; or if awk writes material before grep has finished reading, it suddenly becomes very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
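Putting those points together, a safer variant of the original command might look like this (a sketch: it prints only the total, sends it to a separate file rather than appending to the input, and uses a slightly looser pattern that also matches the two-digit temperatures in the sample data):
grep -o "[0-9][0-9]*\.[0-9][0-9]" "/location/$value-1.txt" |
    awk '{ sum += $1 } END { print sum }' > "/location/$value-1.sum"
The .sum filename is just an example; the point is to avoid reading from and appending to the same file in one pipeline.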

How to extract 100 numbers at a time from a text file?

I am absolutely new to bash scripting, but I need to perform some tasks with it. I have a file with just one column of numbers (6250000 of them). I need to extract 100 at a time, put them into a new file and submit each batch of 100 to another program. I think this should be some kind of loop going through my file 100 numbers at a time and submitting them to the program.
Let's say my numbers in the file would look like this.
1.6435
-1.2903
1.1782
-0.7192
-0.4098
-1.7354
-0.4194
0.2427
0.2852
I need to feed each of those 62500 output files to a program which has a parameter file. I was doing something like this:
lossopt()
{
cat<<END>temp.par
Parameters for LOSSOPT
***********************
START OF PARAMETERS:
lossin.out \Input file with distribution
1 \column number
lossopt.out \Output file
-3.0 3.0 0.01 \xmin, xmax, xinc
-3.0 1
0.0 0.0
0.0 0.0
3.0 0.12
END
}
for i in {1..62500}
do
    sed -n 1,100p ./rearnum.out > ./lossin.out
    echo temp.par | ./lossopt >> lossopt.out
    rm lossin.out
    cut -d " " -f 101- rearnum.out > rearnum.out
done
rearnum.out is my big initial file.
If you need to split it into files containing 100 lines each, I'd use split -l 100 <source>, which will create a lot of files named like xaa, xab, xac, ..., each of which contains at most 100 lines of the source file (the last file may contain fewer). If you want the names to start with something other than x, you can give the prefix those names should use as the last argument to split, as in split -l 100 <source> OUT, which will give files like OUTaa, OUTab, ...
Then you can loop over those files and process them however you like. If you need to run a script with them you could do something like
for file in OUT*; do
<other_script> "$file"
done
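Applied to the workflow in the question, that could look roughly like the following sketch (the chunk_ prefix is arbitrary; -a 4 gives split enough suffix characters for 62500 chunks; temp.par, lossin.out, lossopt and rearnum.out are the names from the question):
split -l 100 -a 4 rearnum.out chunk_
for file in chunk_*; do
    cp "$file" lossin.out              # lossopt reads lossin.out, per temp.par
    echo temp.par | ./lossopt >> lossopt.out
done
rm chunk_* lossin.out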
You can still use a read loop and redirection:
#!/bin/bash
fnbase=${1:-file}
increment=${2:-100}
declare -i count=0
declare -i fcount=1
fname="$(printf "%s_%08d" "$fnbase" $((fcount)))"
while read -r line; do
((count == 0)) && :> "$fname"
((count++))
echo "$line" >> "$fname"
((count % increment == 0)) && {
count=0
((fcount++))
fname="$(printf "%s_%08d" "$fnbase" $((fcount)))"
}
done
exit 0
use/output
$ bash script.sh yourprefix <yourfile
Which will take yourfile with many thousands of lines and write every 100 lines out to yourprefix_00000001 -> yourprefix_99999999 (the default is file_00000001, etc.). Each new file is truncated to 0 lines before writing begins.
Again you can specify on the command line the number of lines to write to each file. E.g.:
$ bash script.sh yourprefix 20 <yourfile
Which will write 20 lines per file to yourprefix_00000001 -> yourprefix_99999999
Even though it may seem clumsy to a bash professional, I will take the risk and post my own answer to my question.
cat<<END>temp.par
Parameters for LOSSOPT
***********************
START OF PARAMETERS:
lossin.out \Input file with distribution
1 \column number
lossopt.out \Output file
-3.0 3.0 0.01 \xmin, xmax, xinc
-3.0 1
0.0 0.0
0.0 0.0
3.0 0.12
END
for i in {1..62500}
do
    sed -n 1,100p ./rearnum.out >> ./lossin.out
    echo temp.par | ./lossopt >> sdis.out
    rm lossin.out
    tail -n +101 rearnum.out > temp
    tail -n +1 temp > rearnum.out
    rm temp
done
This script successively "eats" the big initial file and feeds the pieces to the external program. After it takes one portion of 100 numbers, it deletes that portion from the big file. The process repeats until the big file is empty. It is not an elegant solution, but it worked for me.
