I often find when adding rules to my workflow that I need to split large jobs up into batches. This means that my input/output files will branch out across temporary sets of batches for some rules before consolidating again into one input file for a later rule. For example:
rule all:
    input:
        expand("final_output/{sample}.counts", sample=config["samples"]) ##this final output relates to blast rule in that it will feature a column defining transcript type

...

rule batch_prep:
    input: "transcriptome.fasta"
    output: expand("blast_input_{X}.fasta", X=[1,2,3,4,5])
    script: "scripts/split_transcriptome.sh"

rule blast:
    input: "blast_input_{X}.fasta",
    output: "output_blast.txt"
    script: "scripts/blastx.sh"

...

rule rsem:
    input:
        "transcriptome.fasta",
        "{sample}.fastq"
    output:
        "final_output/{sample}.counts"
    script:
        "scripts/rsem.sh"
In this simplified workflow, snakemake -n would show a separate rsem job for each sample (as expected, from wildcards set in rule all). However, blast would give a WildcardError stating that
Wildcards in input files cannot be determined from output files:
'X'
This makes sense, but I can't figure out a way for the Snakefile to submit separate jobs for each of the 5 batches above using the one blast template rule. I can't make separate rules for each batch, as the number of batches will vary with the size of the dataset. It seems it would be useful if I could define wildcards local to a rule. Does such a thing exist, or is there a better way to solve this issue?
I hope I understood your problem correctly, if not, feel free to correct me:
So, you want to call the rule blast for every "blast_input_{X}.fasta"?
Then, the batch wildcard would need to be carried over into the output.
rule blast:
    input: "blast_input_{X}.fasta",
    output: "output_blast_{X}.txt"
    script: "scripts/blastx.sh"
If you then later want to merge the batches again in another rule, just use expand in the input of that rule.
input: expand("output_blast_{X}.txt", X=your_batches)
output: "merged_blast_output.txt"
I'm writing a help output for a Bash script. Currently it looks like this:
dl [m|r]… (<file>|<URL> [m|r|<index>]…)…
The meaning that I'm trying to convey (and elsewhere describe with words) is that (after a potential "m" and/or "r") there can be an endless list of sets of arguments. The first argument in each set is always a file or URL and the further arguments can each be "m", "r" or a number. After that, it starts over with a file or URL and so on.
In my special case, I could just write this:
dl [m|r]… (<file>|<URL>) (<file>|<URL>|m|r|<index>)…
This works, because listing a URL and then another URL with nothing in between is allowed, as well as listing an arbitrarily long chain of "m"s (it's just useless to do so) and pretty much any other combination.
But what if that wasn't the case? What if I had for example a command like this:
change (<from> <to>)…
…which would be used e.g. like this:
change from1 to1 from2 to2 from3 to3
Would the bracket syntax be correct here? I just guessed it based on the grouping of (a|b), but I wasn't able to find any documentation that uses this for multiple, non-exclusive arguments that belong together. Is there even a standard for this?
I'm starting out with make and I am a bit puzzled about how patterns work. I have multiple different targets, each with a name-matching prerequisite. I would like to have a variable storing all the "stems" of the targets and prerequisites at the top, and then just add the prefix/suffix and a common recipe for all of them. So far I have tried:
names = stem1 stem2 stem3

all: $(names:%=dir/prefix_%.txt) $(names:%=dir/another_%.txt)

$(names:%=dir/prefix_%.txt): $(names:%=sourcedir/yetanother_%.xlsx)
	echo $#
	echo prerequisite_with_the_same_stem_as_current_target
Even though this makes all the targets one by one, the prerequisites for each target are all listed, not just the one whose stem matches the current target. The reason I need them to match is that I then supply the current target and its single prerequisite to a script, which then makes the target. How do I pattern-match each prerequisite with its one target?
The misconception that you have is about how make handles lists. If you have a variable:
names = stem1 stem2 stem3
then make handles this as a list, but it instantiates the whole list contents at once every time you name this variable. It does not perform a one-by-one operation on the list contents, because that would be close to uncontrollable, depending on the situation. Instead it resorts to simple text replacement, so your line
all: $(names:%=dir/prefix_%.txt) $(names:%=dir/another_%.txt)
is parsed and variable-expanded quite simply into the string:
all: dir/prefix_stem1.txt dir/prefix_stem2.txt dir/prefix_stem3.txt ...etc...
The iterative list handling happens only within $(names:%=dir/prefix_%.txt) and so on; the line itself, after variable replacement, is just text that is fed to the second parsing step.
Along the same line your rule:
$(names:%=dir/prefix_%.txt): $(names:%=sourcedir/yetanother_%.xlsx)
expands to
dir/prefix_stem1.txt dir/prefix_stem2.txt dir/prefix_stem3.txt: sourcedir/yetanother_stem1.xlsx sourcedir/yetanother_stem2.xlsx sourcedir/yetanother_stem3.xlsx
which is a short-hand notation for the three rules:
dir/prefix_stem1.txt: sourcedir/yetanother_stem1.xlsx sourcedir/yetanother_stem2.xlsx sourcedir/yetanother_stem3.xlsx
dir/prefix_stem2.txt: sourcedir/yetanother_stem1.xlsx sourcedir/yetanother_stem2.xlsx sourcedir/yetanother_stem3.xlsx
dir/prefix_stem3.txt: sourcedir/yetanother_stem1.xlsx sourcedir/yetanother_stem2.xlsx sourcedir/yetanother_stem3.xlsx
and nothing else. Obviously you told make that each target depends on all of the prerequisites.
With a little tweaking and Static Pattern Rules you can achieve your goal, though:
MY_TARGETS := $(names:%=dir/prefix_%.txt) # create full target names
$(MY_TARGETS) : dir/prefix_%.txt : sourcedir/yetanother_%.xlsx
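Putting the pieces together, a minimal sketch might look like the following; scripts/build.sh is a hypothetical stand-in for whatever command turns the single matching prerequisite into the target, and the recipe line must begin with a tab:

names := stem1 stem2 stem3
MY_TARGETS := $(names:%=dir/prefix_%.txt)

all: $(MY_TARGETS)

# static pattern rule: each target is paired only with the prerequisite
# whose stem matches it
$(MY_TARGETS): dir/prefix_%.txt: sourcedir/yetanother_%.xlsx
	scripts/build.sh $< $@

Within the recipe, $< is the single matching prerequisite and $@ is the current target, which is exactly the pair to hand to the script.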
I study genetic data from 288 fish samples (Fish_one, Fish_two ...)
I have four files per fish, each with a different suffix.
e.g. for sample_name Fish_one:
file 1 = "Fish_one.1.fq.gz"
file 2 = "Fish_one.2.fq.gz"
file 3 = "Fish_one.rem.1.fq.gz"
file 4 = "Fish_one.rem.2.fq.gz"
I would like to apply the following concatenation instructions to all my samples, perhaps using a text file containing a list of all the sample_names that could be fed to a loop.
cp sample_name.1.fq.gz sample_name.fq.gz
cat sample_name.2.fq.gz >> sample_name.fq.gz
cat sample_name.rem.1.fq.gz >> sample_name.fq.gz
cat sample_name.rem.2.fq.gz >> sample_name.fq.gz
In the end, I would have only one file per sample, ideally in a different folder.
I would be very grateful to receive a bit of help on this one, even though I'm sure the answer is quite simple for a non-novice!
Many thanks,
Noé
I would like to apply the following concatenation instructions to all my samples, perhaps using a text file containing a list of all the sample_names that could be fed to a loop.
In the first place, the name of the cat command is mnemonic for "concatenate". It accepts multiple command-line arguments naming sources to concatenate together to the standard output, which is exactly what you want to do. It is poor form to use a cp and three cats where a single cat would do.
In the second place, although you certainly could use a file of name stems to drive the operation you describe, it's likely that you don't need to go to the trouble to create or maintain such a file. Globbing will probably do the job satisfactorily. As long as there aren't any name stems that need to be excluded, I'd probably go with something like this:
for f in *.rem.1.fq.gz; do
    stem=${f%.rem.1.fq.gz}
    cat "$stem".{1,2,rem.1,rem.2}.fq.gz > "${other_dir}/${stem}.fq.gz"
done
That recognizes the groups present in the current working directory by the members whose names end with .rem.1.fq.gz. It extracts the common name stem from that member's name, then concatenates the four members to the correspondingly-named output file in the directory identified by ${other_dir}. It relies on brace expansion to form the arguments to cat, so as to minimize code and (IMO) improve clarity.
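If you would still rather drive the loop from a list of sample names, as the question suggests, a sketch along the same lines might be the following; samples.txt (one sample name per line) and the destination directory name are assumptions:

other_dir=merged_samples      # destination directory (assumed name)
mkdir -p "$other_dir"
while IFS= read -r stem; do
    cat "$stem".{1,2,rem.1,rem.2}.fq.gz > "${other_dir}/${stem}.fq.gz"
done < samples.txt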
I have a set of input files (say 10) with specific names. I run a word-count job on all of them at once (the input path is a folder). I expect 10 output files with the same names as the input files, i.e. the counts for file1 should be stored in a separate output file named "file1", and so on for all files.
There are two approaches you can take to achieve multiple outputs:
Use the MultipleOutputs class. See the API documentation (https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html) and, for an implementation walkthrough, http://appsintheopen.com/posts/44-map-reduce-multiple-outputs
Another option is LazyOutputFormat; it is used in conjunction with MultipleOutputs. For more about its implementation, see https://ssmolen.wordpress.com/2014/07/09/hadoop-mapreduce-write-output-to-multiple-directories-depending-on-the-reduce-key/
I feel that using LazyOutputFormat in conjunction with the MultipleOutputs class is the better approach.
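As a rough sketch of how the two are wired together on the driver side (the class name, the named output "counts", and the Text/IntWritable types are assumptions for a plain word-count job):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultiOutputJobConfig {
    public static void configure(Job job) {
        // LazyOutputFormat suppresses the empty default part-r-* files,
        // so only the files written through MultipleOutputs appear
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        // register a named output the reducer can write to via MultipleOutputs.write(...)
        MultipleOutputs.addNamedOutput(job, "counts", TextOutputFormat.class,
                Text.class, IntWritable.class);
    }
}

configure(job) would be called from the usual word-count driver before job.waitForCompletion(true).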
Set the number of reduce tasks to be equal to the number of input files. This will create the given number of output files, as well.
Add a file prefix to each map output key (word). E.g., when you meet the word "cat" in a file named "file0.txt", you can emit the key "0_cat", or "file0_cat", or anything else that is unique for "file0.txt". Use the context to get the current filename each time.
Override the default Partitioner to make sure that all the map output keys with the prefix "0_", or "file0_", go to the first partition, all the keys with the prefix "1_", or "file1_", go to the second, etc.
In the reducer, remove the "x_" or "filex_" prefix from the output key and use it as the name of the output file (using MultipleOutputs). Otherwise, if you don't want MultipleOutputs, you can easily map output files to input files by checking your Partitioner code (e.g., part-00000 will be partition 0's output).
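A minimal sketch of the custom Partitioner described above, assuming the mapper emits keys of the form "0_cat", "1_dog", and so on; the class name is made up:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FilePrefixPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // everything before the first "_" identifies the input file,
        // so all of one file's words land in the same reducer
        String filePrefix = key.toString().split("_", 2)[0];
        return Integer.parseInt(filePrefix) % numPartitions;
    }
}

It would be registered with job.setPartitionerClass(FilePrefixPartitioner.class), with the number of reduce tasks set to the number of input files as in the first step.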
This is something that has puzzled me for some time and I have yet to find an answer.
I am in a situation where I am applying a standardized data cleaning process to (supposedly) similarly structured files, one file for each year. I have a statement such as the following:
replace field="Plant" if field=="Plant & Machinery"
This line was written based on the data file for year 1. I then generalize the code to loop through the years of data. The problem arises if, in year 3, the analogous value in that variable was coded as "Plant and MachInery ": the line above would not make the intended change because of the difference in the text string, yet it would not raise an error alerting me that the change was not made.
What I am after is some sort of confirmation that more than zero observations actually satisfied the condition each time the code is executed in the loop, and an error otherwise. Any combination of trimming, removing spaces, and standardizing the text case is not a workable option here. At the same time, I don't want to add a count if plus assert statement before every conditional replace, as that becomes quite bulky.
Aside from going to the raw files to ensure the variable values are standardized, is there any way to do this validation "on the fly" as I have tried to describe? Maybe just write a custom program that combines a count if, assert and replace?
The idea has surfaced occasionally that replace should return the number of observations changed, but there are good reasons why not, notably that it is not an r-class or e-class command anyway, and it is quite important not to change the way it works, because that could break innumerable programs and do-files.
So, I think the essence of any answer is that you have to set up your own monitoring process counting how many values have (or would be) changed.
One pattern, when working on a current variable, is:
gen was = .
foreach ... {
    ...
    replace was = current
    replace current = ...
    qui count if was != current
    <use the result>
}
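Instantiating that pattern for the replace in the question might look roughly like this; the per-year filenames and the error code are made up for illustration:

foreach yr of numlist 1/3 {
    use data_year`yr', clear              // hypothetical per-year files
    generate was = field                  // shadow copy of the variable
    replace field = "Plant" if field == "Plant & Machinery"
    quietly count if was != field         // how many values actually changed
    if r(N) == 0 {
        display as error "year `yr': replace changed no observations"
        exit 498
    }
    // saving or further cleaning would follow here
}

Wrapping the count-and-check in a small program, as the question suggests, would avoid repeating the block before every conditional replace.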