How to monitor and control background processes in a shell script

I need to write a shell (bash) script that will be executing several Hive queries.
Each of the queries will produce a directory with a lot of files.
After all queries are finished I need to process all these files in a specific order.
I want to run the Hive queries in parallel as background processes, as each one might take a couple of hours.
I would also like to parallelize the processing of the resulting files, but there are some complications that I don't know how to handle. For example, I can start processing the results of the first and second queries as soon as they are finished, but for the third I need to wait until the first two processors are done. Similarly for the fourth and fifth.
I won't have any problems writing such a program in Java, but how to do it in a shell script beats me.
If someone can give me a hint on how I can monitor the execution of these components in the shell script, I would appreciate it greatly.
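One way to express those dependencies is with plain background jobs and the shell's wait builtin, as in the sketch below (the hive -f query*.hql invocations and the process_result* commands are placeholders for illustration, not anything from the question):

# each of the first two queries is followed immediately by its own processing
# step; the two pipelines run in parallel and we remember the PID of each
( hive -f query1.hql && process_result1 ) & p1=$!
( hive -f query2.hql && process_result2 ) & p2=$!

# the third query can run in parallel with everything above
hive -f query3.hql & q3=$!

# processing of the third result needs processors 1 and 2 and query 3 to be done
wait "$p1" "$p2" "$q3"
process_result3 &

# ...the fourth and fifth queries would follow the same pattern...

wait    # with no arguments: block until every remaining background job has finished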

Related

How to have multiple processes in zsh or bash appending to the same file concurrently?

I am concurrently running some tests from the shell (currently zsh, but it could be bash as well) that output their results by appending to the same file (results.txt).
Each result will be a few lines and some hundred bytes long (not longer than 1000).
I want to be able to read the output of each test in whole in the results file, without any interleaving from any test that might have finished at the same time.
I see a couple of obvious options from a theoretical point of view:
Atomic write (append)
Use a mutex to acquire the results file
The problem is that I have no idea how to do either of the two in a shell.
More specifically, I don't know if appends are atomic by nature and I don't know how to use a mutex in a shell context.
Any help appreciated.
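For the mutex option, one common tool is flock(1) from util-linux, which takes an advisory lock on a file descriptor. A minimal sketch, assuming each test first writes its output to a private temporary file (run_test and the lock-file name are placeholders):

tmp=$(mktemp)                       # each test writes its result to a private temp file first
run_test > "$tmp"                   # placeholder for the actual test command

(
    flock -x 9                      # block until we hold an exclusive lock on fd 9
    cat "$tmp" >> results.txt       # the whole result is appended without interleaving
) 9>results.txt.lock                # fd 9 refers to a small dedicated lock file

rm -f "$tmp"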

Parallelism for Entire Kedro Pipeline

I am working on a project where we are processing very large images. The pipeline has several nodes, where each produces output necessary for the next node to run. My understanding is that the ParallelRunner is running the nodes in parallel. It is waiting for each process to finish the 1st node before moving onto the 2nd, etc. My problem is that the inputs take varying amounts of time to complete, so many processes are stuck waiting for others to finish a node when it is not necessary, as each parallel process has no dependency on another, only on its own previously computed results.
Is there a way to run the entire pipeline in parallel on different cores? I do not want each parallel process to wait for the other processes to finish a node. I have the idea that I could accomplish this by creating multiple copies of my kedro project and modify their data catalogs to process different parts of the dataset and then run these in parallel using the subprocess module, but this seems inefficient.
EDIT:
My understanding is that the ParallelRunner is running the nodes
in parallel. It is waiting for each process to finish the 1st node
before moving onto the 2nd, etc.
Not sure if I understand this correctly, but as soon as a process finishes, it will immediately move on to the next node that is ready to be executed. It shouldn't wait on anything.
===
There is an alternative along the same lines as your idea about multiple projects. However, you don't need to create multiple copies of the project to achieve the same result. You can parameterise a run with a certain set of inputs and write a wrapper script (bash, python, etc.) to invoke as many kedro run commands as you want. For example, if you want a dedicated Kedro run, which will then be in its own process, for each file in the data/01_raw directory, you could do:
for input in data/01_raw/*
do
    file=$(basename "$input")
    kedro run --params=input:"$file"
done
The trick to make this work is to implement a before_pipeline_run hook that dynamically adds a catalog entry with the value of the input parameter. I have a demo repository demonstrating this technique: https://github.com/limdauto/demo-kedro-parameterised-runs -- let me know if this addresses your problem.
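If those runs should also execute concurrently rather than one after another, a variation of the loop above (just a sketch, reusing the same data/01_raw layout and input parameter) could background each kedro run and wait for all of them:

pids=()
for input in data/01_raw/*
do
    file=$(basename "$input")
    kedro run --params=input:"$file" &   # launch each run in its own process
    pids+=("$!")
done
wait "${pids[@]}"                        # block until every run has finished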

Correct use Bash wait command with unknown processes number

I'm writing a bash script that essentially fires off a python script that takes roughly 10 hours to complete, followed by an R script that checks the outputs of the python script for anything I need to be concerned about. Here is what I have:
ProdRun="python scripts/run_prod.py"
echo "Commencing Production Run"
$ProdRun #Runs python script
wait
DupCompare="R CMD BATCH --no-save ../dupCompareTD.R" #Runs R script
$DupCompare
Now my issue is that the python script can often generate a whole heap of different processes on our linux server depending on its input, with lots of different PIDs, AND we have heaps of workers using the same server firing off scripts. As far as I can tell from reading, the 'wait' command must wait for all processes to finish or for a specific PID to finish, but when I cannot tell what or how many PIDs will be assigned or processes run, how exactly do I use it?
EDIT: Thank you to all that helped; here is what caused my dilemma, for anyone searching for this. I broke the ProdRun python script up into the individual scripts it was itself calling, but still had the issue. I then found that one of these scripts was calling another, smaller script with a "&" at the end, which made it ignore any command to wait on it inside the python script itself. Simply removing the "&" and invoking that script via "os.system()" allowed all the code to run sequentially.
It sounds like you are trying to implement a job scheduler, possibly with some complex dependencies between different tasks. I recommend using a job scheduler instead. It allows you to specify how those jobs should run while also benefiting from features like monitoring, handling of exceptional cases, error handling, and so on.
Examples are: the open source rundeck https://github.com/rundeck/rundeck or the commercial one http://www.bmcsoftware.uk/it-solutions/control-m.html
Make your Python program wait on the children it spawns. That's the proper way to fix this scenario. Then you don't have to wait for Python after it finishes (sic).
(Also, don't put your commands in variables.)
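Putting the answers together: a bare wait with no arguments only waits for jobs that the current shell itself put in the background; it cannot see processes spawned inside the python script. A sketch of the pattern the answers suggest, using the same commands as in the question:

echo "Commencing Production Run"

# Run the python step in the foreground; the shell blocks until it exits,
# so no wait is needed here. Anything the script launches itself must be
# waited on inside run_prod.py (i.e. no stray "&" left behind).
python scripts/run_prod.py

# Only reached once the python step has returned.
R CMD BATCH --no-save ../dupCompareTD.R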

Reading file in parallel from multiple processes

I'm running multiple processes in parallel and each of these processes read the same file in parallel. It looks like some of the processes see a corrupted version of the file if I increase the number of processes to > 15 or so. What is the recommended way of handling such a scenario?
More details:
The file being read in parallel is actually a perl script. The multiple jobs are python processes, and each of them launches this perl script independently with different input parameters. When the number of jobs is increased, some of these jobs give errors saying that the perl script has invalid syntax (which is not true). Hence, I suspect that some of these jobs read in corrupted versions of the perl script.
I'm running all of this on a 32-core machine.
If any process is also writing to the file, then you need to enforce some synchronization, for example with a global named mutex.
If there is no asynchronous writing going on, I would not expect to see corruption during the reads. Are you opening the files with "r" access? If you're still encountering trouble, it might be worth experimenting with reducing the read buffer size, or calling out to a native win32 API for the file access.
Good luck!
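If the perl script is ever being rewritten while jobs are running, one simple way to sidestep the race (not from the answer above, just a sketch with placeholder paths) is to give each job a private snapshot of the script before executing it:

# each job works from its own copy, so a concurrent rewrite of the
# original file cannot affect the copy being executed
tmp=$(mktemp)                       # placeholder temp name
cp /path/to/script.pl "$tmp"        # placeholder path to the shared perl script
perl "$tmp" "$@"                    # pass this job's own input parameters through
rm -f "$tmp"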

Locking output file for shell script invoked multiple times in parallel

I have close to a million files over which I want to run a shell script and append the result to a single file.
For example suppose I just want to run wc on the files.
So that it runs fast I can parallelize it with xargs. But I do not want the scripts to step over each other when writing the output. It is probably better to write to a few separate files rather than one, and then cat them later. But I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is it always ensured by default?
Is there any utility that will recursively cat two files in parallel?
I can write a script to do that, but I would have to deal with the temporaries and clean them up. So I was wondering if there is a utility that does that.
GNU parallel claims that it:
makes sure output from the commands is the same output as you would get had you run the commands sequentially
If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data.
Use the -k option to maintain the order of the output.
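A sketch of how that might look for the wc example from the question (the output file name is only illustrative):

# run wc on every file, one job per CPU core by default; -k keeps the output
# in the same order as the input list, and -print0/-0 handle odd file names
find . -type f -print0 | parallel -0 -k wc >> results.txt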
Update: (non-Perl solution)
Another alternative would be prll, which is implemented with shell functions and some C extensions. It is less feature-rich than GNU parallel but should do the job for basic use cases.
The feature listing claims:
Does internal buffering and locking to prevent mangling/interleaving of output from separate jobs.
so it should meet your needs as long as the order of the output is not important.
However, note the following statement on its page:
prll generates a lot of status information on STDERR, which makes it harder to use the STDERR output of the job directly as input for another program.
Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.

Resources