Snakemake: unknown output/input files after splitting by chromosome

To speed up a certain snakemake step I would like to:
1. split my bam file per chromosome using
   bamtools split -in sample.bam -reference
   which results in files named sample.REF_{chromosome}.bam
2. perform variant calling on each, resulting in e.g. sample.REF_{chromosome}.vcf
3. recombine the obtained vcf files using vcf-concat (VCFtools):
   vcf-concat file1.vcf file2.vcf file3.vcf > sample.vcf
The problem is that I don't know a priori which chromosomes may be in my bam file. So I cannot accurately specify the output of bamtools split. Furthermore, I'm not sure how to make the input of vcf-concat take all the vcf files.
I thought of using a samples.fofn and do something like
rule split_bam:
    input:
        bam = "alignment/{sample}.bam"
    params:
        pattern = "alignment/{sample}.REF_"
    output:
        "alignment/{sample}.splitbams.fofn"
    log:
        "logs/bamtools_split/{sample}.log"
    shell:
        "bamtools split -in {input.bam} -reference && \
        ls {params.pattern}*.bam | sed 's/.bam/.vcf/' > {output}"
And use the same fofn for concatenating the obtained vcf files. But this feels like a very awkward hack and I'd appreciate your suggestions.
EDIT 20180409
As suggested by @jeeyem I tried the dynamic() function, but I can't figure it out.
My complete snakefile is on GitHub, the dynamic part is at lines 99-133.
The error I get is:
InputFunctionException in line 44 of /home/wdecoster/DR34/SV-nanopore.smk:
KeyError: 'anon___snakemake_dynamic'
Wildcards:
sample=anon___snakemake_dynamic
(with anon an anonymized {sample} identifier)
Running with --debug-dag gives (last parts before erroring):
candidate job cat_vcfs
wildcards: sample=anon
candidate job nanosv
wildcards: sample=anon___snakemake_dynamic, chromosome=_
candidate job samtools_index
wildcards: aligner=split_ngmlr, sample=anon___snakemake_dynamic.REF__
candidate job split_bam
wildcards: sample=anon___snakemake_dynamic, chromosome=_
InputFunctionException in line 44 of /home/wdecoster/DR34/SV-nanopore.smk:
KeyError: 'anon___snakemake_dynamic'
Wildcards:
sample=anon___snakemake_dynamic
This suggests that the wildcard is being misinterpreted?
Cheers,
Wouter

You can look up the chromosome names in the bam header, or in the corresponding .fai file of the reference used. This can be done at the beginning of your Snakefile. Then, you can use expand("alignment/{{sample}}.REF_{chromosome}.bam", chromosome=chromosomes) to define the output files of that rule. No need to use dynamic.
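For illustration, a minimal sketch of that approach, assuming the reference index sits at reference/genome.fa.fai and the per-chromosome VCFs go into a vcf/ directory (both paths are placeholders), and that every chromosome listed in the .fai ends up with its own split bam:

chromosomes = [line.split("\t")[0] for line in open("reference/genome.fa.fai")]

rule split_bam:
    input:
        bam = "alignment/{sample}.bam"
    output:
        expand("alignment/{{sample}}.REF_{chromosome}.bam", chromosome=chromosomes)
    shell:
        "bamtools split -in {input.bam} -reference"

rule cat_vcfs:
    input:
        expand("vcf/{{sample}}.REF_{chromosome}.vcf", chromosome=chromosomes)
    output:
        "vcf/{sample}.vcf"
    shell:
        "vcf-concat {input} > {output}"

The per-chromosome variant-calling rule can then use plain {sample} and {chromosome} wildcards to turn each split bam into its vcf.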

Related

Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
    [[ ! -f ${FILENAME}.${i}.xyz ]] && break
    cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
    mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
    let "n = 2 + (${j} * ${LINES_PER_CONF})"
    let "m = ${j} + 1"
    ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
    sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: number of atoms
Line 2: title
Lines 3 through (number of atoms + 2): molecular coordinates
Line (number of atoms + 3): same as line 1
Line (number of atoms + 4): title 2
... and so on (lines 1 through (number of atoms + 2) belong to conformer 1, the next block to conformer 2, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make each conformer's energy the title for that conformer in the combined file (i.e. the energy must become the title line of its specific conformer).
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files to renamed names in another directory does require a loop, or one of the less-than-highly portable pattern-based renaming utilities.
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
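For instance, a rough sketch of that idea, assuming the same layout the shell loop relies on (title lines at 2, 2 + LINES_PER_CONF, 2 + 2*LINES_PER_CONF, ..., and the energy in column 2 of $ACTFILE), writing to a temporary file instead of editing in place:

awk -v L="$LINES_PER_CONF" '
    NR == FNR { energy[NR] = $2; next }                            # first file: the ACT file, energies keyed by conformer number
    (FNR - 2) % L == 0 { print energy[(FNR - 2) / L + 1]; next }   # title lines sit at 2, 2+L, 2+2L, ...
    { print }                                                      # all other lines pass through unchanged
' "$ACTFILE" "$COMBINED_FILE" > "${COMBINED_FILE}.tmp" &&
mv "${COMBINED_FILE}.tmp" "$COMBINED_FILE"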
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before catenating and moving them away. Then you don't have to rely on a fixed LINES_PER_FILE and such. Awk has the FNR variable which is the record in the current file; condition/action pairs can tell when processing has moved to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
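As a small illustration of that per-file pattern (GNU Awk only; the per-file computation here is just a placeholder count):

gawk 'BEGINFILE { n = 0 }                           # reset the accumulator for each new file
      { n++ }                                       # accumulate something per record
      ENDFILE { print FILENAME ":", n, "lines" }    # report when the file is done
' ${FILENAME}.[0-9][0-9][0-9].xyz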
If you really want to materialize all the file names without globbing, you can always use jot (it's like seq, but with more integer digits in its default mode before switching to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91   # extract fixed-interval samples without modulo (%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926

Randomly shuffling lines in multiple text files but keeping them as separate files using a command or bash script

I have several text files in a directory, all of them unrelated. Words don't repeat within a file. Each line has 1 to 3 words on it, such as:
apple
potato soup
vitamin D
banana
guinea pig
life is good
I know how to randomize each file:
sort -R file.txt > file-modified.txt
That's great, but I want to do this for 500+ files in a directory, and doing them one by one would take ages. There must be something better.
I would like to do something like:
sort -R *.txt -o KEEP-SAME-NAME-AS-ORIGINAL-FILE-ADD-SUFFIX-TO-ALL.txt
Maybe this is possible with a script that goes through each file in the directory until finished.
Very important: every file should only randomize the words within itself and not mix them with the other files.
Thank you.
Something like this one-liner:
for file in !(*-modified).txt; do shuf "$file" > "${file%.txt}-modified.txt"; done
Just loop over the files and shuffle each one in turn.
The !(*-modified).txt pattern uses bash's extended pattern matching to not match .txt files that already have -modified at the end of the name so you don't shuffle a pre-existing already shuffled output file and end up with file-modified-modified.txt. Might require a shopt -s extglob first, though that's usually turned on already in an interactive shell session.

Find & Replace Multiple Sequence Headers in Multiple FASTA Files

Here's my problem (using Mac OS X):
I have about 35 FASTA files with 30 sequences in each one. Each FASTA file represents a gene, and they all contain the same individuals with the same sequence headers in each file. The headers are formatted as "####_G_species," with the numbers being non-sequential. I need to go through every file and change 4 specific headers, while also keeping the output as 35 discrete files with the same names as their corresponding input files, preferably depositing the outputs into a separate subdirectory.
For example: Every file contains a "6934_Sergia_sp," and I need to change
every instance of that name in all of the 35 files to "6934_R_robusta." I need to do the same with "8324_Sergestes_sp," changing every instance in every file to "8324_P_vigilax." Rinse and repeat 2 more times with different headers. After changing the headers, I need to have 35 discrete output files with the same names as their corresponding input files.
What I've found so far that seems to show the most promise is from the following link:
https://askubuntu.com/questions/84007/find-and-replace-text-within-multiple-files
using the following script:
find /home/user/directory -name \*.c -exec sed -i "s/cybernetnews/cybernet/g" {} \;
Changing the information to fit my needs, I get a script like this:
find Path/to/my/directory -name \*.fas -exec sed -i 's/6934_Sergia_sp/6934_R_robusta/g' {} \;
Running the script like that, I get an "undefined label" error. After researching,
https://www.mkyong.com/mac/sed-command-hits-undefined-label-error-on-mac-os-x/
I found that I should add '.fas' after -i giving:
find Path/to/my/directory -name \*.fas -exec sed -i '.fas' 's/6934_Sergia_sp/6934_R_robusta/g' {} \;
because on Macs you need to specify an extension for the output files. Running the script like that, I get very nearly what I'm looking for with each input file being duplicated, the correct header in each being correctly substituted for the new name, and the outputs being placed in the same directory. However, this only substitutes one header at a time, and the output files have a .fas.fas extension.
Moving forward, I would have to rename the output files to remove the second " .fas " in the extension, and rewrite and rerun the script 3 more times to get everything changed how I want it, which wouldn't be the end of the world, but definitely wouldn't be ideal.
Is it possible to set up a script so that I can run all 4 substitutions at the same time, while also exporting the outputs to a new subdirectory?
Your approach is good, but I would prefer a more verbose solution where I don't have to fight so much with the quotes. Something like:
for fasta in $(find Path/to/my/directory -name "*.fas")
do
    new_fasta=$(basename "$fasta" .fas).new.fas
    sed 's/6934_Sergia_sp/6934_R_robusta/g; s/Another_substitution/Another_result/' "$fasta" > "$new_fasta"
done
Here, you feed in the list of FASTA files to loop over, compute a new FASTA name (and location, if needed), and finally run sed over the input, leaving the output in a new file. Note that you can give sed more than one substitution, separated by semicolons.
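To get all four header changes in one pass and the results into a separate subdirectory, the same idea can be extended like this (the last two substitutions are placeholders, since only two of the four header pairs are given in the question):

mkdir -p Path/to/my/directory/renamed
for fasta in Path/to/my/directory/*.fas
do
    sed -e 's/6934_Sergia_sp/6934_R_robusta/g' \
        -e 's/8324_Sergestes_sp/8324_P_vigilax/g' \
        -e 's/OLD_HEADER_3/NEW_HEADER_3/g' \
        -e 's/OLD_HEADER_4/NEW_HEADER_4/g' \
        "$fasta" > "Path/to/my/directory/renamed/$(basename "$fasta")"
done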
BTW, as @Ed Morton said, for the next question, please include a concise description of the problem along with sample input and expected output.

bash - reading multiple input files and creating matching output files by name and sequence

I do not know much bash scripting, but I know the task I would like to do would be greatly simplified by it. I would like to test a program against expected output using many test input files.
For example, I have files named "input1.txt, input2.txt, input3.txt..." and expected output in files "output1.txt, output2.txt, output3.txt...". I would like to run my program with each of the input files and output a corresponding "test1.txt, test2.txt, test3.txt...". Then I would do a "cmp output1.txt test1.txt" for each file.
So I think it would start, roughly, like this:
for i in input*
do
    ./myprog.py < "$i" > someoutputthing
done
One question I have is: how would I match the numbers in the filename? Thanks for your help.
If the input file name pattern is inputX.txt, you only need to remove "input" from the beginning. You do not have to remove the extension, as you want to use the same one for the output:
output=output${i#input}
See Parameter Expansion in man bash.
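Putting it together, a minimal sketch (the test/output naming follows the question; adjust the program invocation as needed):

for i in input*.txt; do
    n=${i#input}                      # e.g. input7.txt -> 7.txt
    ./myprog.py < "$i" > "test$n"     # writes test7.txt
    cmp "output$n" "test$n" && echo "test$n matches"
done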

Split text file into multiple files

I have a large text file containing 1000 abstracts, with an empty line between each abstract. I want to split this file into 1000 text files.
My file looks like
16503654 Three-dimensional structure of neuropeptide k bound to dodecylphosphocholine micelles. Neuropeptide K (NPK), an N-terminally extended form of neurokinin A (NKA), represents the most potent and longest lasting vasodepressor and cardiomodulatory tachykinin reported thus far.
16504520 Computer-aided analysis of the interactions of glutamine synthetase with its inhibitors. Mechanism of inhibition of glutamine synthetase (EC 6.3.1.2; GS) by phosphinothricin and its analogues was studied in some detail using molecular modeling methods.
You can use split and set "NUMBER lines per output file" to 2. Each file would have one text line and one empty line.
split -l 2 file
Something like this:
awk 'NF{print > $1;close($1);}' file
This will create 1000 files, with each filename being the abstract number. The awk code writes each record to a file whose name is taken from the 1st field ($1). This is done only if the number of fields is greater than 0 (NF).
You could always use the csplit command. This is a file splitter but based on a regex.
Something along the lines of:
csplit -ks -f /tmp/files INPUTFILENAMEGOESHERE '/^$/' '{*}'
(with GNU csplit, the trailing '{*}' repeats the split at every empty line). It is untested and may need a little tweaking though.
See the csplit man page for more details.
