Passing multiple arguments to external programs in a Pipeline - shell

I'm trying to build a pipeline for NGS data.
I made a small example pipeline for passing commands to the shell. The example pipeline has two scripts, called from the shell, that just concatenate (sumtool.py) and multiply (multool.py) values in many dataframes (10 in this case). My wrapper (wrapper.py) handles the input and passes the commands that run the scripts in order. Here is the relevant part of the code from the wrapper:
from functools import wraps
from subprocess import Popen

def run_cmd(orig_func):
    @wraps(orig_func)
    def wrapper(*args, **kwargs):
        cmdls = orig_func(*args, **kwargs)
        cmdc = ' '.join(str(arg) for arg in cmdls)
        cmd = cmdc.replace(',', '')
        # Blocks until this single command has finished.
        Popen(cmd, shell=True).wait()
    return wrapper

@run_cmd
def runsumtool(*args):
    return args

# getcsv() and dirlist are defined elsewhere in wrapper.py.
for file in getcsv():
    runsumtool('python3', 'sumtool.py', '--infile={}'.format(file), '--outfile={}'.format(dirlist[1]))
This works alright, but I want to launch the first script on all the dataframes at once, wait for all of those runs to finish, and then launch the second script on every dataframe at once. Because Popen().wait() blocks after each command, the commands run one after another and the whole pipeline takes far longer than it needs to.
I tried to use luigi as a solution, but I could not get it to run external programs or to pass multiple inputs and outputs. Any tips on that are appreciated.
Another solution I'm imagining is passing each sample as its own command and launching them all at once, but I'm not sure how to write that in Python (or any other language, really). This would also solve the I/O problem with luigi.
Thanks.
Note1: This is a small example pipeline I built. My main purpose is to call programs like bwa and picard in a pipeline, which I cannot import.
Note2: I'm already using Popen from subprocess. It is called inside wrapper() in the code above.
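One way to get the "start everything for one script, wait, then start everything for the next script" behaviour described above is to create all the Popen objects first and only then call wait() on each of them. The following is a minimal sketch under stated assumptions, not the asker's wrapper: run_stage and tool_cmd are hypothetical names, the input file names are made up, and tool_cmd uses sys.executable with an inline script as a stand-in for the real sumtool.py and multool.py command lines.

```python
import subprocess
import sys


def run_stage(commands):
    """Start every command of one stage at once, then wait for all of them.

    Returns the exit codes in the same order as `commands`.
    """
    # Popen returns immediately, so all processes run concurrently.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # Only now block, once per process, until the whole stage is done.
    return [proc.wait() for proc in procs]


def tool_cmd(infile):
    # Hypothetical stand-in for something like
    # ['python3', 'sumtool.py', '--infile=' + infile, '--outfile=...'].
    # Arguments are passed as a list, so no shell quoting is needed.
    script = "import sys; print('processed', sys.argv[1])"
    return [sys.executable, "-c", script, infile]


infiles = ["df{}.csv".format(i) for i in range(10)]  # assumed input names

stage1_codes = run_stage([tool_cmd(f) for f in infiles])

# The second stage starts only after every stage-1 process has exited,
# and only if all of them succeeded.
if all(code == 0 for code in stage1_codes):
    stage2_codes = run_stage([tool_cmd(f) for f in infiles])
else:
    stage2_codes = []
```

With many samples or heavy tools such as bwa, starting everything at once can oversubscribe the machine, so capping the number of concurrent processes is a common refinement.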

Related

Parallelism for Entire Kedro Pipeline

I am working on a project where we are processing very large images. The pipeline has several nodes, each of which produces output necessary for the next node to run. My understanding is that the ParallelRunner is running the nodes in parallel. It is waiting for each process to finish the 1st node before moving onto the 2nd, etc. My problem is that the inputs take varying amounts of time to complete, so many processes are stuck waiting for others to finish a node when that is not necessary: each parallel process has no dependency on another, only on its own previously computed results.
Is there a way to run the entire pipeline in parallel on different cores? I do not want each parallel process to wait for the other processes to finish a node. I have the idea that I could accomplish this by creating multiple copies of my kedro project, modifying their data catalogs to process different parts of the dataset, and then running these in parallel using the subprocess module, but this seems inefficient.
EDIT:
"My understanding is that the ParallelRunner is running the nodes in parallel. It is waiting for each process to finish the 1st node before moving onto the 2nd, etc."
Not sure if I understand this correctly but as soon as a process finishes, it will move on immediately to the next node ready to be executed. It shouldn't wait on anything.
===
There is an alternative along the same lines as your idea about multiple projects. However, you don't need to create multiple copies of the project to achieve the same result. You can parameterise a run with a certain set of inputs and write a wrapper script (bash, python, etc.) to invoke kedro run as many times as you want. For example, if you want a dedicated Kedro run, which will then be in its own process, for each file in the data/01_raw directory, you could do:
for input in data/01_raw/*
do
    file=$(basename "$input")
    kedro run --params=input:"$file"
done
The trick to make this work is to implement a before_pipeline_run hook that dynamically adds a catalog entry with the value of the input parameter. I have a demo repository here to demonstrate this technique: https://github.com/limdauto/demo-kedro-parameterised-runs -- let me know if this addresses your problem.
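The same wrapper can be written in Python so that the runs actually overlap (the shell loop above starts them one after another). This is a rough sketch, assuming the kedro command is on PATH and the before_pipeline_run hook described above is in place. The names run_all, kedro_cmd, build_cmd and max_workers are illustrative and not part of Kedro.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def kedro_cmd(filename):
    # Same invocation as in the shell loop: kedro run --params=input:<file>
    return ["kedro", "run", "--params=input:{}".format(filename)]


def run_all(filenames, build_cmd=kedro_cmd, max_workers=4):
    """Run one external process per file, at most `max_workers` at a time.

    Returns a dict mapping each filename to its process exit code.
    """
    def run_one(name):
        # subprocess.run blocks this worker thread until the process exits.
        return name, subprocess.run(build_cmd(name)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, filenames))
```

Calling run_all(sorted(os.listdir("data/01_raw"))) would then start one Kedro run per raw file. Each run is a separate OS process, so a slow input does not hold up the others.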

How to monitor and control background processes in shell script

I need to write a shell (bash) script that will be executing several Hive queries.
Each of the queries will produce a directory with a lot of files.
After all queries are finished I need to process all these files in a specific order.
I want to run the Hive queries in parallel as background processes, as each one might take a couple of hours.
I would also like to parallelize the processing of the resulting files, but there are some caveats that I don't know how to handle. For example, I can start processing the results of the first and second queries as soon as they are finished, but for the third I need to hold until the first two processors are done. The same applies to the fourth and fifth.
I won't have any problems writing such a program in Java, but how to do it in shell - beats me.
If someone can give me a hint on how can I monitor execution of these components in the shell script, I would appreciate it greatly.
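The dependency pattern described here (process results 1 and 2 as soon as their queries finish, but hold result 3 until both of those processors are done) can be sketched with futures. The sketch below is in Python rather than bash, and run_query and process_result are hypothetical placeholders for the real Hive invocation and file processing.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(n):
    # Placeholder for launching Hive query n and blocking until it finishes,
    # for example with subprocess.run([...]).
    return "out_dir_{}".format(n)


def process_result(n, log):
    # Placeholder for processing the files produced by query n.
    log.append(n)
    return n


def query_then_process(n, query_future, log, after=()):
    query_future.result()        # this query must be finished
    for dependency in after:     # ...and so must the processors we depend on
        dependency.result()
    return process_result(n, log)


log = []
# The pool must have at least as many workers as there are tasks that may
# block on each other, otherwise waiting tasks could starve the rest.
with ThreadPoolExecutor(max_workers=8) as pool:
    # All three queries start immediately and run in parallel.
    queries = {n: pool.submit(run_query, n) for n in (1, 2, 3)}
    p1 = pool.submit(query_then_process, 1, queries[1], log)
    p2 = pool.submit(query_then_process, 2, queries[2], log)
    # Processor 3 waits for its own query and for processors 1 and 2.
    p3 = pool.submit(query_then_process, 3, queries[3], log, after=(p1, p2))
    p3.result()
```

Calling result() on a future also re-raises any exception from that task, which gives a simple way to notice a failed query or processor.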

Correct use Bash wait command with unknown processes number

I'm writing a bash script that essentially fires off a python script that takes roughly 10 hours to complete, followed by an R script that checks the outputs of the python script for anything I need to be concerned about. Here is what I have:
ProdRun="python scripts/run_prod.py"
echo "Commencing Production Run"
$ProdRun #Runs python script
wait
DupCompare="R CMD BATCH --no-save ../dupCompareTD.R" #Runs R script
$DupCompare
Now my issue is that the python script can often spawn a whole heap of different processes on our Linux server depending on its input, with lots of different PIDs, and we have heaps of workers using the same server firing off scripts. As far as I can tell from reading, the 'wait' command must wait either for all processes to finish or for a specific PID to finish, but when I cannot tell what or how many PIDs will be assigned or how many processes will run, how exactly do I use it?
EDIT: Thank you to all who helped. Here is what caused my dilemma, for anyone searching for this. I broke up the ProdRun python script into the individual scripts that it was itself calling, but still had the issue. I then found that one of these scripts was also calling another, smaller script with a "&" at the end of the command, which made it ignore any attempt to wait on it inside the python script itself. Simply removing this and inserting a line of "os.system()" allowed all the code to run sequentially.
It sounds like you are trying to implement a job scheduler, possibly with some complex dependencies between different tasks. I recommend using an existing job scheduler instead. It lets you specify how those jobs run while also benefiting from features like monitoring and handling of exceptional cases and errors.
Examples are: the open source rundeck https://github.com/rundeck/rundeck or the commercial one http://www.bmcsoftware.uk/it-solutions/control-m.html
Make your Python program wait on the children it spawns. That's the proper way to fix this scenario. Then you don't have to wait for Python after it finishes (sic).
(Also, don't put your commands in variables.)
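The advice to make the Python program wait on its own children can look roughly like this. It is only a sketch: spawn_children and run_and_wait are hypothetical names, and the child command is a short sleep standing in for whatever run_prod.py really launches. Once the parent blocks until every child has exited, the bash script can simply run the R step on the next line, because bash already waits for a foreground command.

```python
import subprocess
import sys


def spawn_children(count):
    # Stand-in workers; each one just sleeps briefly and exits.
    child = [sys.executable, "-c", "import time; time.sleep(0.1)"]
    return [subprocess.Popen(child) for _ in range(count)]


def run_and_wait(count):
    children = spawn_children(count)
    try:
        pass  # the parent's own work would go here
    finally:
        # Do not return (and let the parent exit) while children are running.
        for proc in children:
            proc.wait()
    return [proc.returncode for proc in children]


exit_codes = run_and_wait(3)
```

Keeping the Popen handles in a list is what makes this possible; a child started through a shell command ending in "&" leaves the parent with nothing to wait on.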

How to execute multiple proc in tcl scripting

I have 4 procs in my Tcl script. Each proc contains a while loop that waits for a task to finish and then processes the result files. I now want to run these 4 procs in parallel instead of one by one. Does anyone have an idea?
Background:
Previously I would open 4 terminals in KDE/GNOME to execute the different tasks, so the 4 tasks were actually running together.
Tcl threads can do the job just fine: http://www.tcl.tk/man/tcl8.6/ThreadCmd/thread.htm
Of course you may just leave everything as it is and run your scripts in the background within one terminal, if that's what you are looking for, e.g.
script1.tcl &
script2.tcl &
Threading is the better option for this scenario, and it gives you better control over your subprocesses. See the following link for a simple example: https://www.activestate.com/blog/2016/09/threads-done-right-tcl

Locking output file for shell script invoked multiple times in parallel

I have close to a million files over which I want to run a shell script and append the result to a single file.
For example suppose I just want to run wc on the files.
So that it runs fast I can parallelize it with xargs. But I do not want the scripts to step on each other when writing the output. It is probably better to write to a few separate files rather than one and then cat them later, but I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is that always ensured by default?
Is there any utility that will recursively cat two files in parallel?
I can write a script to do that, but then I have to deal with the temporaries and the cleanup, so I was wondering if there is a utility that does it.
GNU parallel claims that it "makes sure output from the commands is the same output as you would get had you run the commands sequentially".
If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data.
Use the -k option to maintain the order of the output.
Update: (non-Perl solution)
Another alternative would be prll, which is implemented with shell functions with some C extensions. It is less feature-rich compared to GNU parallel but should do the job for basic use cases.
The feature listing claims: "Does internal buffering and locking to prevent mangling/interleaving of output from separate jobs."
So it should meet your needs as long as the order of output is not important.
However, note the following statement on this page:
"prll generates a lot of status information on STDERR which makes it harder to use the STDERR output of the job directly as input for another program."
Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.
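The buffering idea behind both tools can be illustrated with a small sketch (in Python, and not the implementation of either tool): workers only compute and return a result per file, and a single writer in the parent appends to the output file, so lines cannot interleave. count_lines and write_counts are hypothetical names, and counting lines stands in for wc -l.

```python
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    # Worker: compute a result but never touch the shared output file.
    with open(path, "rb") as handle:
        return path, sum(1 for _ in handle)


def write_counts(paths, out_path, max_workers=8):
    """Count lines of each file in parallel; write results from one place."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool, \
            open(out_path, "a") as out:
        # map() yields results in input order, similar in spirit to -k.
        for path, count in pool.map(count_lines, paths):
            out.write("{} {}\n".format(count, path))
```

Threads are enough here because the work is dominated by file I/O. For CPU-heavy per-file work, a process pool would be the usual substitute, with the same single-writer structure.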

Resources