snakemake: Run some tasks on cluster and some locally

I'm using snakemake to orchestrate an analysis pipeline. Some tasks are very small (e.g. creating some symlinks) while others take hours. Is there a way to use --cluster for some of them but execute the others just locally?

The localrules directive does just this. When the example below is run with --cluster, bar jobs are submitted to the cluster while foo jobs are run locally.
localrules: foo

rule foo:
    ...

rule bar:
    ...
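For context, a typical invocation might look like the following (the qsub options and job limit are placeholders, not part of the original answer). With this, bar jobs are submitted through qsub while foo jobs run directly on the submit host:
snakemake --cluster "qsub -cwd -V" --jobs 50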

Related

single Rundeck job with 2 scripts running on 2 different nodes

I have a Rundeck job with 2 scripts in its workflow and can't figure out the ruleset to run each script individually within the job.
My 2 nodes:
Server1 and Server2
My 2 scripts are simple; as examples, they check whether the services are running on the servers.
Script1: gsv -name service1
Script2: gsv -name service2
Only Server1 has service1 and should only execute Script1; vice versa for Server2 and Script2.
Right now when the job runs, it runs Script1 and Script2 on both servers, and I can't get the workflow strategy to run each script only on its matching node. I would like to keep this as one job that verifies the services on both servers, eventually scaling to more nodes.
You can use a Job Reference step to call multiple existing jobs from a parent job.
Targeting nodes at the individual step level is not possible, so define 2 jobs, each running its script on its specific node, and use a parent job to trigger both of them.

Ansible Tower/AWX bug? Job task runs serially instead of parallel

I have a very generic playbook with no hard-coded info whatsoever. Everything in the playbook is a variable, filled in by supplying extra vars, even the host names for connections. There are no inventory files in use, since the hosts it runs against are usually random.
On a command line in Linux, I can run my Ansible playbook multiple times with different variables passed, and they will all run at the same time.
ansible-playbook cluster_check_build.yml -e '{"host": "host1"...}'
ansible-playbook cluster_check_build.yml -e '{"host": "host2"...}'
In Tower, however, if I create a job template using the same playbook, things run serially. I call that job template multiple times through the API, passing the data as JSON; each call supplies new extra_vars, so each job runs against a different host. The jobs run serially rather than in parallel as they do from the command line.
I have 400+ hosts that need the same playbook run against them at random times. The playbook can take an hour or so to complete, and at times it needs to run against 20 or 30 random hosts at once. Time is crucial, so serial job processing is a non-starter.
Is it possible to run the same job template against different hosts in parallel? If the answer is no, what are my options? Hopefully not creating 400+ job templates; that seems to defeat the purpose of a very generic playbook.
I feel like an absolute fool. In the bottom right of my job template is a tiny check box that says "ENABLE CONCURRENT JOBS" <-- this was the fix.
Yes, you can run templates/playbooks against multiple hosts in parallel in Tower/AWX.
These are the situations where your template will run serially:
"forks" set to 1 in your template
serial: 1 set within your playbook
Your Tower/AWX instance is set up with only 1 fork
Your instance is set up with more than 1 fork, but other jobs are consuming them at the same time
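As a rough sketch of the API launch calls described in the question (the URL, credentials, template ID and host value are all placeholders, and the template needs "Prompt on launch" enabled for extra variables), each call starts an independent job, and with "ENABLE CONCURRENT JOBS" checked they run in parallel:
curl -s -X POST https://tower.example.com/api/v2/job_templates/42/launch/ \
  -u admin:password \
  -H "Content-Type: application/json" \
  -d '{"extra_vars": {"host": "host1"}}'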

How do I run a multi-step cron job, but still make it able to execute a single step manually?

I have a data pipeline in Go with steps A, B and C. Currently those are three binaries. They share the same database but write to different tables. When developing locally, I have been just running ./a && ./b && ./c. I'm looking to deploy this pipeline to our Kubernetes cluster.
I want A -> B -> C to run once a day, but sometimes (for debugging etc.) I may just want to manually run A or B or C in isolation.
Is there a simple way of achieving this in Kubernetes?
I haven't found many resources on this, so maybe that demonstrates an issue with my application's design?
Create a docker image that holds all three binaries and a wrapper script to run all three.
Then deploy a Kubernetes CronJob that runs all three sequentially (using the wrapper script as entrypoint/command), with the appropriate schedule.
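A minimal sketch of such a wrapper, assuming the binary names ./a, ./b and ./c from the question (set -e aborts on the first failure, so the CronJob pod is marked failed if any step fails):
#!/bin/sh
# run the pipeline steps in order; stop at the first failure
set -e
./a
./b
./c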
For debugging you can then just run the same image manually:
kubectl -n XXX run debug -it --rm --image=<image> -- /bin/sh
$ ./b
...
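If you also want an ad-hoc run of the whole pipeline outside its schedule, kubectl can create a one-off Job from the CronJob (the names here are placeholders):
kubectl -n XXX create job pipeline-manual --from=cronjob/pipeline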

Running bash scripts in parallel

I would like to run commands in parallel, so that if one of them fails, the whole job exits as a failure. More specifically, I have 5 aws sync commands that currently run sequentially. I would like them to run in parallel, and if any one of them fails, the whole job should fail. How can I do that?
GNU Parallel is a really handy and powerful tool that works with anything you can run from bash:
http://www.gnu.org/software/parallel/
https://www.youtube.com/watch?v=OpaiGYxkSuQ
# run lines from a file, 8 at a time
cat commands.txt | parallel --eta -j 8 "{}"
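To get the fail-fast behaviour asked about in the question, GNU Parallel's --halt option can kill the remaining jobs and exit non-zero as soon as one command fails (the directory and bucket names below are placeholders):
# abort everything and exit non-zero as soon as any sync fails
parallel --halt now,fail=1 -j 5 \
  "aws s3 sync ./{} s3://my-bucket/{}" ::: dir1 dir2 dir3 dir4 dir5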

Re-use Amazon Elastic MapReduce instance

I have tried a simple Map/Reduce task using Amazon Elastic MapReduce and it took just 3 minutes to complete. Is it possible to re-use the same instance to run another task?
Even though I used the instance for only 3 minutes, Amazon will charge for a full hour, so I want to use the remaining 57 minutes to run several other tasks.
The answer is yes.
Here's how you do it using the command line client:
When you create the job flow, pass the --alive flag; this tells EMR to keep the cluster around after your job has run.
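With the command line client that looks roughly like this (the cluster name is a placeholder):
elastic-mapreduce --create --alive --name "my-persistent-jobflow"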
Then you can submit more tasks to the cluster:
elastic-mapreduce --jobflow <job-id> --stream --input <s3dir> --output <s3dir> --mapper <script1> --reducer <script2>
To terminate the cluster later, simply run:
elastic-mapreduce --jobflow <jobid> --terminate
Try running elastic-mapreduce --help to see all the commands you can run.
If you don't have the command line client, get it here.
Using:
elastic-mapreduce --jobflow job-id \
--jar s3n://some-path/x.jar \
--step-name "New step name" \
--args ...
you can also add non-streaming steps to your cluster. (Just so you don't have to try it yourself ;-) )
http://aws.amazon.com/elasticmapreduce/faqs/#dev-6
Q: Can I run a persistent job flow? Yes. Amazon Elastic MapReduce job flows that are started with the --alive flag will continue until explicitly terminated. This allows customers to add steps to a job flow on demand. You may want to use this to debug your job flow logic without having to repeatedly wait for job flow startup. You may also use a persistent job flow to run a long-running data warehouse cluster. This can be combined with data warehouse and analytics packages that run on top of Hadoop, such as Hive and Pig.
