How to suppress verbose list of remaining jobs if Snakemake job failed? - bioinformatics

Since about version 7.0.0 of Snakemake, when a Snakemake job fails, a rather verbose list of remaining outputs is printed to the terminal.
This can be useful, but when you have a large workflow with thousands of jobs, it makes it really hard to see which job failed and why.
I would hence like to suppress this "Remaining jobs" output. How do I do this?
I had a look at the Snakemake docs but couldn't find anything.
This is the section I would like to get rid of, ideally via a CLI flag or config option.
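One possible workaround, assuming no dedicated flag exists (I could not find one in the docs either): filter the section out of Snakemake's log stream in the shell. The header pattern and blank-line delimiter below are guesses based on the output described above; adjust them to match your actual log.

# Assumes the unwanted section starts with a line beginning "Remaining"
# and ends at the next blank line; adjust both patterns as needed.
snakemake --cores 8 2>&1 | sed '/^Remaining/,/^$/d'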

Related

Producing comprehensible logs for a shell script run through xargs

I am currently reindexing Elasticsearch indices in a shell script using curl to do REST API calls.
To improve performance, I am running this script with xargs over 10 processes.
However, all of the processes write to a single log, which makes it incomprehensible: the output is written asynchronously, so the log ends up unordered.
The client would like to track reindexing progress by looking at the logs (i.e. they want to see that 50/100 indices are done), and we would like to keep the curl output for debugging purposes.
What are some ways I can make comprehensible logs?
You could try specifying a separate log file for each process, then concatenating the logs at the end.
But if you really want them all to write to the same logfile, then I'm afraid the only way to do that safely is with a lock, which is not easy with bash/xargs.
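A minimal sketch of the separate-log-file approach, assuming the per-index work lives in a script called reindex.sh and the index names are listed one per line in indices.txt (both names are hypothetical):

# Each worker writes to its own log, named after the index it handles,
# so no two processes ever share an output file. A .done marker records
# completion, which gives the client a cheap progress count.
mkdir -p logs
xargs -P 10 -I{} sh -c './reindex.sh "$1" > "logs/$1.log" 2>&1 && touch "logs/$1.done"' _ {} < indices.txt
# Progress check, run from another shell while the jobs are going:
echo "$(ls logs/*.done 2>/dev/null | wc -l)/$(wc -l < indices.txt) indices done"
# Concatenate in a deterministic order once everything has finished.
cat logs/*.log > reindex-full.log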

How to monitor and control background processes in shell script

I need to write a shell (bash) script that will be executing several Hive queries.
Each of the queries will produce a directory with a lot of files.
After all queries are finished I need to process all these files in a specific order.
I want to run the Hive queries in parallel as background processes, as each one might take a couple of hours.
I would also like to parallelize the processing of the resulting files, but there are some catches that I don't know how to handle. For example, I can start processing the results of the first and second queries as soon as they are finished, but for the third, I need to hold off until the first two processors are done. Similarly for the fourth and fifth.
I won't have any problems writing such a program in Java, but how to do it in a shell script beats me.
If someone can give me a hint on how can I monitor execution of these components in the shell script, I would appreciate it greatly.
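A minimal sketch of this dependency pattern in plain bash, assuming run_query.sh wraps one Hive query and process_results.sh handles one result directory (both hypothetical names):

# Launch the long-running queries in parallel; $! captures each PID.
./run_query.sh q1 & q1=$!
./run_query.sh q2 & q2=$!
./run_query.sh q3 & q3=$!
# Process results 1 and 2 as soon as their queries finish.
wait "$q1"; ./process_results.sh out1 & p1=$!
wait "$q2"; ./process_results.sh out2 & p2=$!
# The third processor needs the first two processors done,
# plus its own query.
wait "$p1" "$p2" "$q3"
./process_results.sh out3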

Correct use Bash wait command with unknown processes number

I'm writing a bash script that essentially fires off a Python script that takes roughly 10 hours to complete, followed by an R script that checks the outputs of the Python script for anything I need to be concerned about. Here is what I have:
ProdRun="python scripts/run_prod.py"
echo "Commencing Production Run"
$ProdRun #Runs python script
wait
DupCompare="R CMD BATCH --no-save ../dupCompareTD.R" #Runs R script
$DupCompare
Now my issue is that the Python script can generate a whole heap of different processes on our Linux server depending on its input, with lots of different PIDs, and we have heaps of other workers using the same server to fire off scripts. As far as I can tell from reading, the wait command waits either for all processes to finish or for a specific PID, but when I cannot tell what or how many PIDs will be assigned, how exactly do I use it?
EDIT: Thank you to all who helped; here is what caused my dilemma, for anyone searching for this. I broke the ProdRun Python script up into the individual scripts it was calling, but still had the issue. I then found that one of those scripts was calling another, smaller script with an "&" at the end, which made it ignore any attempt to wait on it from inside the Python script itself. Simply removing the "&" and invoking that script via os.system() allowed all the code to run sequentially.
It sounds like you are trying to implement a job scheduler, possibly with some complex dependencies between different tasks. I recommend using a dedicated job scheduler instead. It lets you specify how those jobs should run while also benefiting from features like monitoring and handling of exceptional cases and errors.
Examples are the open-source Rundeck (https://github.com/rundeck/rundeck) or the commercial Control-M (http://www.bmcsoftware.uk/it-solutions/control-m.html).
Make your Python program wait on the children it spawns; that's the proper way to fix this scenario. Then you don't need a wait after the Python command at all: a foreground command has already finished by the time the next line runs, so that wait is a no-op.
(Also, don't put your commands in variables.)
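For reference, a bare wait only covers background jobs started by the current shell, which is why the stray "&" inside the Python code defeated it. A minimal sketch (the second script name is hypothetical):

# wait with no arguments blocks until every background job launched by
# THIS shell has exited; it cannot see processes detached inside Python.
python scripts/run_prod.py &
python scripts/other_prep.py &   # hypothetical second background job
wait                             # returns once both backgrounded jobs exit
R CMD BATCH --no-save ../dupCompareTD.R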

How can I cause make -j to produce nice output?

I have a large project that is built using make. Because of the size of the project and the way the dependencies are organized, there's a real benefit to building in parallel using make -j. However, the output (that is, the logs and error messages) that make -j produces is all mixed up, because all of the parallel tasks write to stdout at the same time.
How can I tell make to organize the output nicely? Ideally, I'd like it to buffer the logs from each task separately, and then output them in order as they complete. Is there any standard way of doing this?
You can use the -O or --output-sync command line options:
-O[type], --output-sync[=type]
When running multiple jobs in parallel with -j, ensure the output of each job is collected together rather than interspersed with output from other jobs. If type is not specified or is target, the output from the entire recipe for each target is grouped together. If type is line, the output from each command line within a recipe is grouped together. If type is recurse, output from an entire recursive make is grouped together. If type is none, output synchronization is disabled.
The online manual has more information.
(Note that you need GNU Make 4.0 for this to work.)
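For example, to build with eight parallel jobs while grouping the log output per target (the job count here is illustrative):

# Long form: one contiguous chunk of output per finished target.
make -j8 --output-sync=target
# Short form, grouping per recursive sub-make instead:
make -j8 -Orecurse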

Locking output file for shell script invoked multiple times in parallel

I have close to a million files over which I want to run a shell script and append the result to a single file.
For example suppose I just want to run wc on the files.
So that it runs fast, I can parallelize it with xargs. But I do not want the scripts to step over each other when writing their output. It is probably better to write to a few separate files rather than one, and then cat them together later. But I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is that always ensured by default?
Is there any utility that will recursively cat two files in parallel?
I can write a script to do that, but I would have to deal with the temporary files and clean-up, so I was wondering if there is a utility that already does this.
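For reference, the few-temporary-files idea from the question could look like the sketch below, assuming GNU split and xargs; the batch count of 32 is arbitrary, and file names are assumed to contain no spaces or newlines.

# Distribute the input file names round-robin into 32 batch lists.
find . -type f | split -n r/32 - batch.
# One writer per batch: outputs never interleave, because each xargs
# instance writes only to its own results file.
for b in batch.*; do
    xargs -a "$b" wc > "out.$b" &
done
wait
cat out.batch.* > results.txt   # merge once all writers are done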
GNU parallel claims that it:
makes sure output from the commands is the same output as you would get had you run the commands sequentially
If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data.
Use the -k option to maintain the order of the output.
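A minimal example for the wc use case from the question (GNU parallel buffers each job's output and emits it whole; -k additionally preserves input order):

# Lines from different jobs never interleave in results.txt.
find . -type f | parallel -k wc > results.txt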
Update: (non-Perl solution)
Another alternative would be prll, which is implemented with shell functions and some C extensions. It is less feature-rich than GNU parallel but should do the job for basic use cases.
The feature listing claims:
Does internal buffering and locking to prevent mangling/interleaving of output from separate jobs.
so it should meet your needs, as long as the order of the output is not important.
However, note the following statement on the same page:
prll generates a lot of status information on STDERR, which makes it harder to use the STDERR output of the job directly as input for another program.
Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.
