Is Snakemake the right tool for handling output-mediated workflows? - bioinformatics

I'm new to Snakemake (I started in the last week or so) and am trying it so that I have to handle fewer of the small details of my workflows; previously I coded up my own specific workflow in Python.
I wrote a small workflow in which one of the steps takes Illumina PE reads and runs Kraken against them. I then parse the Kraken output to detect the most common species (within an allowed set) if a species value wasn't provided (running with snakemake -s test.snake --config R1_reads= R2_reads= species='').
I have 2 questions.
What is the recommended approach given the dynamic output/input?
Currently my strategy is to create a temp file which contains the detected species and then cat {input.species} into other shell commands. This doesn't seem elegant, but looking through the docs I couldn't find an adequate alternative. I noticed PersistentDicts would let me pass variables between run: commands, but I'm unsure whether I can use that to load variables into a shell: section. I also noticed that wrappers could allow me to handle it; however, from the point where I first need that variable onwards, I'd be wrapping the remainder of my workflow.
Is snakemake the right tool if I want to use the species afterwards to run a set of scripts specific to the species (with multiple species specific workflows)?
Right now my impression is that the way to solve this is to have one workflow file per species, plus a run step with a switch that calls the associated workflow depending on the species.
Appreciate any insight on these questions.
-Kim

You can mark output as dynamic (e.g. expecting one file per species). Then, Snakemake will determine the downstream DAG of jobs after those files have been generated. See http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files
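The species-detection step from the question can be sketched in plain Python, independent of how the result is later handed to Snakemake. This is only a minimal sketch: the function name `most_common_species` is hypothetical, and it assumes the species labels have already been extracted from the Kraken output (the parsing itself is not shown, since it depends on which Kraken output format is used).

```python
from collections import Counter


def most_common_species(labels, allowed):
    """Return the most frequent label that is in `allowed`, or None.

    `labels` is any iterable of species names, one per classified read
    (assumed to be pre-extracted from the Kraken output).
    """
    counts = Counter(label for label in labels if label in allowed)
    if not counts:
        return None  # nothing from the allowed set was detected
    return counts.most_common(1)[0][0]
```

Writing the returned name to a small file, as the question already does, keeps the result visible to Snakemake as an ordinary output.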

Related

Is it possible to develop a specific flow manager in Ansible

I'm trying to develop my own flow manager, and even though I'm not fully familiar with Ansible, it looks like it can do the job.
I'd like to evaluate part of the concept with you and understand whether it is doable in Ansible or not. So rather than asking for a solution, I'm asking for suggestions about architecture.
Here are the requirements:
Flow executes on one machine.
Flow should be divided into an arbitrary number of steps (depending on project requirements) that can be executed sequentially or in parallel. E.g.
- step_0
- step_1
- step_2
  step_3
- step_4
  step_5
Here step_0 should be executed first, and once it is done step_1 should be launched. Once step_1 is done, steps 2 and 3 should start in parallel, and when both of them are done, steps 4 and 5 should run, again in parallel.
Every step should be a logical wrapper around an arbitrary number of commands. E.g. step_0 can execute a script that makes the directory skeleton, followed by commands for setting ENV VARs, followed by commands for linking. Then step_1 starts a new logical unit, etc.
For every step I would like to have common generic callbacks before and after step execution. Callback requirements (again eg. for step_0):
pre_exe callback:
- create flag files:
  step_0.START
  step_0.RUNNING
- create log file step_0.log and redirect output content of step_0 to step_0.log
post_exe callback:
- delete step_0.RUNNING
- create flag file step_0.DONE
- grep step_0.log for failing_signature (one or more strings - fail, error etc)
- grep step_0.log for passing_signature (few strings - pass, script_finished_successfully etc)
- based on results of grepping create flag files step_0.PASS (in case !FSIG & PSIG) or step_0.FAIL (in any other case)
- if step_0.FAIL is created terminate flow execution
Generally it would be good to have PSIG and FSIG configurable at step level, but I can imagine it with hard-coded strings for all steps.
I would be happy if somebody can confirm if it is doable in Ansible or not, and if it is, to suggest high level architecture, so that I can focus my attention.
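To make the requirements above concrete, here is a minimal sketch of the pre_exe/post_exe logic in plain Python rather than as an Ansible playbook. The names `run_step` and `run_flow`, the stage format (a list of stages, each a list of `(name, commands)` tuples), and the case-insensitive substring matching of signatures are all assumptions made for illustration, not part of the original question.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_step(name, commands, workdir,
             psig=("pass", "script_finished_successfully"),
             fsig=("fail", "error")):
    """Run one step with the generic pre/post callbacks; return True on PASS."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    def flag(suffix):
        return workdir / f"{name}.{suffix}"

    # pre_exe: create flag files and redirect all step output to the log
    flag("START").touch()
    flag("RUNNING").touch()
    with flag("log").open("w") as log:
        for cmd in commands:
            subprocess.run(cmd, shell=True, stdout=log, stderr=subprocess.STDOUT)

    # post_exe: update flags, then search the log for the signatures
    flag("RUNNING").unlink()
    flag("DONE").touch()
    text = flag("log").read_text().lower()
    failed = any(sig in text for sig in fsig)
    passed = any(sig in text for sig in psig)
    ok = passed and not failed  # PASS only when !FSIG & PSIG
    flag("PASS" if ok else "FAIL").touch()
    return ok


def run_flow(stages, workdir):
    """Run stages in order; steps inside one stage run in parallel."""
    for stage in stages:
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(
                lambda step: run_step(step[0], step[1], workdir), stage))
        if not all(results):
            return False  # a FAIL flag was created: terminate the flow
    return True
```

In this sketch a failing step does not interrupt its parallel siblings; the flow stops only before the next stage starts.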

Is there a different way to create variables that don't terminate after the program ends?

Right now, I am creating files to make unterminating variables. But I'm curious if there's a simpler way to create variables that don't terminate.
I find Redis invaluable for persisting data like this. It is a quick and lightweight installation and allows you to store many types of data:
- strings, including complete JSONs and binary data like JPEG/PNG/TIFF images, also with TTL (Time-to-Live) so data can be expired when no longer needed
- numbers, including atomic integers and floats
- lists/queues/stacks
- hashes (like Python dictionaries)
- sets, and sorted (ordered) sets
- streams, bitfields, geospatial data and the more esoteric HyperLogLogs
PUB/SUB is also possible, where one or more machines/processes publish items and multiple consumers, who have subscribed to that topic, receive the published items.
It can also perform very fast operations on your data for you, like set intersections and unions, getting lengths of lists, moving items between lists, atomically adding/subtracting from numbers and so on.
You can also use it to pass data between processes, sub-processes, shell scripts, parent and child, child and parent (!) scripts and so on.
In addition to all that, it is networked, so you can set variables on one computer and read/alter them from another - very simply. For example, you can PUSH jobs to a queue, potentially from multiple machines, and run workers on multiple machines that wait for jobs on the queue, process them and return results to another list.
There is a discussion of the things you can store here.
Example: Store a string, then retrieve it:
redis-cli SET name fred
name=$(redis-cli GET name)
Example: Increment views of page 2 by 10, and then retrieve from different machine on network:
redis-cli INCRBY views:page:2 10
views=$(redis-cli -h 192.168.0.10 GET views:page:2)
Example: Push a value onto a list:
redis-cli LPUSH shoppingList bananas
Example: Blocking wait for next item in list (BRPOP requires a timeout, and 0 means wait indefinitely) - use RPOP for non-blocking:
item=$(redis-cli BRPOP shoppingList 0)
Also, there are bindings for Python, C/C++, Java, Ruby, PHP etc. So you can "inject" dummy/test data into, or extract debug data from a running Python program using the redis-cli tool even on a different computer.
Use environment variables to store your data.
ABC="abc"; export ABC
The other question is how to make environment variables persist after a reboot.
Depending on your shell, a different file is used to persist the variables.
If you are using bash, run this command with the variable's last value before rebooting:
echo 'export ABC="hello"' >> $HOME/.bashrc
I think this is a good time to be using an SQL database. It's more scalable and functional than having a file full of "persistent variables".
It may require a little more setup, and I admit it isn't "simpler" per se, but it will probably be worth it in the long run. You will be able to do more with your variables, and that may make your future scripts simpler.
I recommend going to YouTube and finding a simple tutorial on how to set up a local MySQL or MSSQL database. There is a guy, Mike Dane, who makes really beginner-friendly tutorials. Try searching "GiraffeAcademy SQL Beginner" and see if that helps you.
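As a rough illustration of the database approach, the sketch below uses SQLite through Python's standard `sqlite3` module instead of MySQL or MSSQL, simply because it needs no server setup. The class name `PersistentVars`, the table name `vars`, and the default file name `vars.db` are hypothetical; values are stored as text in this sketch.

```python
import sqlite3


class PersistentVars:
    """Tiny key-value store kept in an SQLite file (illustrative helper)."""

    def __init__(self, path="vars.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS vars (name TEXT PRIMARY KEY, value TEXT)"
        )
        self.conn.commit()

    def set(self, name, value):
        # Values are converted to text; callers convert back as needed.
        self.conn.execute(
            "INSERT OR REPLACE INTO vars (name, value) VALUES (?, ?)",
            (name, str(value)),
        )
        self.conn.commit()

    def get(self, name, default=None):
        row = self.conn.execute(
            "SELECT value FROM vars WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else default

    def close(self):
        self.conn.close()
```

A later run of the program can open the same file and read back whatever an earlier run stored, which is the "variable that outlives the program" behaviour asked about.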

Does SQL*Loader have any functionality that allows for customizing the log file?

I have been asked to create a system for allowing third party companies to dump data into several of our tables. These third parties provide csv files on a periodic basis, and after doing some research it seemed like Oracle themselves had a standard tool for doing so, "sqlldr". I've since gotten it working to an acceptable degree, and we have a job scheduled to run that script once a day.
But one of the third parties supplies really dirty data, of the sort where I can't expect it to always load every row/record (looking like up to about 8% will fail). My boss asked me to forward "all output" from the first few tests to him, and like a moron I also sent the log file.
He has asked that this "report" be modified to include those exceptions that aren't unique constraints along with the line in the input file that caused the exception.
This means that I need data from the log file, but also from the (I believe) reject file in a single document. Rather than write a convoluted shell script to combine those two, does SQL*Loader itself allow any customization that might achieve the same thing? I've read through the Oracle documentation and haven't found anything that suggests this, but I've also learned not to trust it entirely either.
Is this possible? Ideally, the solution would allow me to add values to the reject file that don't exist in the original input file, but I'm also interested in any customization of the log file or reject file.
No.
I was going to stop there, but you can define the name of the log file, which might help with the issue. Most automation with SQL*Loader involves wrapping it within shell scripts; aka "roll your own."
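A "roll your own" wrapper for the report described in the question could look roughly like the following Python sketch. It rests on several assumptions that should be checked against real log files: that each rejection appears in the log as a line starting with "Record N: Rejected" followed by an "ORA-nnnnn: message" line, that record N corresponds to line N of the input file (one record per line, no skipped header), and that unique-constraint violations are reported as ORA-00001. The function name `rejection_report` is hypothetical.

```python
import re

RECORD_RE = re.compile(r"^Record (\d+): Rejected")
ORA_RE = re.compile(r"^(ORA-\d+): (.*)$")


def rejection_report(log_lines, input_lines, ignore=("ORA-00001",)):
    """Pair each rejected record with its error and its input line.

    Returns a list of (record_number, ora_code, message, input_line),
    skipping error codes listed in `ignore`.
    """
    report = []
    record_no = None
    for raw in log_lines:
        line = raw.strip()
        match = RECORD_RE.match(line)
        if match:
            record_no = int(match.group(1))
            continue
        match = ORA_RE.match(line)
        if match and record_no is not None:
            code, message = match.groups()
            if code not in ignore:
                source = input_lines[record_no - 1].rstrip("\n")
                report.append((record_no, code, message, source))
            record_no = None  # only the first ORA line per record is used
    return report
```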

Passing variable between python scripts through shell script

I can't think of a way of doing what I am trying to do and hoping for a little advice. I am working with data on a computing cluster, and would like to process individual files on separate computing nodes. The workflow I have right now is something like the following:
**file1.py**
Get files, parameters, other info from user
Then Call: file2.sh
**file2.sh**
Submit file3.py to computing node
**file3.py**
Process input file with parameters given
What I am trying to do is call file2.sh and pass it each input data file one at a time so that there are multiple instances of file3.py running, one per file. Is there a good way to do this?
I suppose that the root of the problem is that if I were to iterate through a list of input files in file1.py, I don't know how to then pass that information to file2.sh and then on to file3.py.
From this description, I'd say the straightforward way is to call file2.sh directly from Python.
status, result = subprocess.getstatusoutput("file2.sh " + arg_string)  # requires "import subprocess"; the commands module exists only in Python 2
Is that enough of a start to get you moving? Are the nodes conversant enough for one to launch a command directly on another? If not, you may want to consider looking up "interprocess communication" on Linux. If they're not even on the same Internet node, you'll likely need REST commands (POST and GET operations), at which point the overhead grows.
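Iterating over the input files in file1.py and calling file2.sh once per file could be sketched as below. The function name `submit_all`, the `./file2.sh` path, and the argument order (input file first, then the shared parameters) are assumptions; file2.sh would read them as "$1", "$2", and so on, and forward them to file3.py in the same way.

```python
import subprocess


def submit_all(input_files, params, script="./file2.sh", run=subprocess.run):
    """Call the submit script once per input file.

    Each call passes the file path followed by the shared parameters as
    separate arguments, so no shell quoting is needed. `run` is injectable
    so the loop can be tried out without a real cluster.
    """
    results = []
    for path in input_files:
        cmd = [script, path, *params]
        results.append(run(cmd, check=True))  # raises if the script fails
    return results
```

Passing the command as a list avoids building one long string by hand, which is where missing spaces and quoting problems usually creep in.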

Using Hadoop to "bucket" data out with a single run

Is it possible to use one Hadoop job run to output data to different directories based on keys?
My use case is server access logs. Say I have them all together, but I want to split them out based on some common URL patterns.
For example,
Anything that starts with /foo/ should go to /year/month/day/hour/foo/file
Anything that starts with /bar/ should go to /year/month/day/hour/bar/file
Anything that doesn't match should go to /year/month/day/hour/other/file
There are two problems here (from my understanding of Map Reduce): first, I'd prefer to just iterate over my data one time, instead of running one "grep" job per URL type I'd like to match. How would I split up the output, though? If I key the first with "foo", second with "bar", and rest with "other" then don't they all still go to the same reducers? How do I tell Hadoop to output them into different files?
The second problem is related (maybe the same?): I need to break the output up by the timestamp in the access log line.
I should note that I'm not looking for code to solve this, but rather the proper terminology and a high-level solution to look into. If I have to do it with multiple runs, that's alright, but I can't run one "grep" for each possible hour (to make a file for that hour). Is there another way?
You need to partition the data just as you describe. Then you need to have multiple output files. See here (Generating Multiple Output files with Hadoop 0.20+).
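The partitioning idea can be illustrated outside Hadoop with a small Python sketch that derives the output directory from a request's URL path and timestamp. The function name `output_dir` and the `PREFIXES` tuple are hypothetical, and the timestamp is assumed to be already parsed into a `datetime`. In an actual job, a key like this would be computed per record in a single pass and handed to a multiple-output mechanism such as Hadoop's `MultipleOutputs` class.

```python
from datetime import datetime

# URL prefixes that get their own bucket; anything else goes to "other".
PREFIXES = ("foo", "bar")


def output_dir(url_path, when):
    """Return the /year/month/day/hour/bucket directory for one log line."""
    bucket = next(
        (prefix for prefix in PREFIXES if url_path.startswith(f"/{prefix}/")),
        "other",
    )
    return "/" + when.strftime("%Y/%m/%d/%H") + "/" + bucket
```

Because the hour and the URL bucket are both part of the same key, one pass over the data is enough for both of the problems raised in the question.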
