Parallelism for Entire Kedro Pipeline - kedro

I am working on a project where we are processing very large images. The pipeline has several nodes, each of which produces output that the next node needs in order to run. My understanding is that the ParallelRunner runs the nodes in parallel, and that it waits for every process to finish the 1st node before moving on to the 2nd, and so on. My problem is that the inputs take varying amounts of time to complete, so many processes are stuck waiting for others to finish a node. This is unnecessary, because each parallel process depends only on its own previously computed results, not on any other process.
Is there a way to run the entire pipeline in parallel on different cores? I do not want each parallel process to wait for the other processes to finish a node. My idea is to create multiple copies of my kedro project, modify their data catalogs so that each processes a different part of the dataset, and then run them in parallel using the subprocess module, but this seems inefficient.

EDIT:
My understanding is that the ParallelRunner is running the nodes
in parallel. It is waiting for each process to finish the 1st node
before moving onto the 2nd, etc.
Not sure if I understand this correctly but as soon as a process finishes, it will move on immediately to the next node ready to be executed. It shouldn't wait on anything.
===
There is an alternative along the same lines as your idea about multiple projects. However, you don't need to create multiple copies of the project to achieve the same result. You can parameterise a run with a certain set of inputs and write a wrapper script (bash, python, etc.) to invoke kedro run as many times as you want. For example, if you want a dedicated Kedro run, which will then be in its own process, for each file in the data/01_raw directory, you could do:
for input in data/01_raw/*
do
    file=$(basename "$input")
    kedro run --params=input:"$file"
done
The trick to make this work is to implement a before_pipeline_run hook that dynamically adds a catalog entry with the value of the input parameter. I have a demo repository here to demonstrate this technique: https://github.com/limdauto/demo-kedro-parameterised-runs -- let me know if this addresses your problem.
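A minimal sketch of a Python wrapper along the same lines, assuming kedro is on the PATH and that the runs are independent of one another. The shell loop above starts each run only after the previous one has exited; this version keeps up to max_workers runs going at once. The names build_command and run_all, and the default of 4 workers, are illustrative and not part of Kedro.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(file_name):
    # Same invocation as the shell loop: one `kedro run` per input file.
    return ["kedro", "run", f"--params=input:{file_name}"]


def run_all(file_names, max_workers=4, make_command=build_command):
    """Run one child process per file name, at most max_workers at a time.

    Returns a dict mapping each file name to the exit code of its run.
    """
    def run_one(name):
        # Each thread only waits on its own child process, so a slow
        # input never holds up the others.
        return name, subprocess.run(make_command(name)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, file_names))


# Hypothetical usage from the project root:
#   from pathlib import Path
#   names = sorted(p.name for p in Path("data/01_raw").iterdir())
#   exit_codes = run_all(names)
```

Threads are enough here because the heavy work happens in the child processes; the wrapper itself only waits for them. A non-zero exit code in the returned dict marks a run that failed.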

Related

Does Kedro support Checkpointing/Caching of Results?

Let's say we have multiple long running pipeline nodes.
It seems quite straightforward to checkpoint or cache the intermediate results, so that when nodes after a checkpoint are changed or added, only those nodes must be executed again.
Does Kedro provide functionality to make sure that when I run the pipeline, only those steps that have changed are executed?
Also the reverse: is there a way to make sure that all steps that have changed are executed?
Let's say a pipeline producing some intermediate result has changed. Will it be executed when I execute a pipeline that depends on the output of the first?
TL;DR: Does Kedro have makefile-like tracking of what needs to be done and what not?
I think my question is similar to issue #341, but I do not require support of cyclic graphs.
You might want to have a look at the IncrementalDataSet in the partitioned dataset documentation, specifically the section on incremental loads with the incremental dataset. It has a notion of "checkpointing", although checkpointing is a manual step there and is not automated as it would be in a makefile.

How to monitor and control background processes in shell script

I need to write a shell (bash) script that will be executing several Hive queries.
Each of the queries will produce a directory with a lot of files.
After all queries are finished I need to process all these files in a specific order.
I want to run the Hive queries in parallel as background processes, as each one might take a couple of hours.
I would also like to parallelize the processing of the resulting files, but there are some complications that I don't know how to handle. For example, I can start processing the results of the first and second queries as soon as they are finished, but for the third I need to hold until the first two processors are done. Similarly for the fourth and fifth.
I wouldn't have any problem writing such a program in Java, but how to do it in shell beats me.
If someone can give me a hint on how I can monitor the execution of these components in a shell script, I would appreciate it greatly.

Force MPI program to run using only one process per node even if job script runs using several procs/node?

I have an MPI program (the Stream benchmark) which I need to run using 1 process per node, even if I set, for example, 16 processes per node in my slurm job script. I want a single process on each node to have the entire memory of that node to itself while that MPI program runs.
I don't want to make any changes to my slurm job script. My stream code is invoked from my other MPI code, so I have one executable file that includes both parts. I only want the part that includes the stream code to run with one process per node.
Should I use some function like memset? If so, how?
Any ideas?

Passing variable between python scripts through shell script

I can't think of a way of doing what I am trying to do and am hoping for a little advice. I am working with data on a computing cluster, and would like to process individual files on separate computing nodes. The workflow I have right now is something like the following:
**file1.py**
Get files, parameters, other info from user
Then Call: file2.sh
**file2.sh**
Submit file3.py to computing node
**file3.py**
Process input file with parameters given
What I am trying to do is call file2.sh and pass it each input data file one at a time so that there are multiple instances of file3.py running, one per file. Is there a good way to do this?
I suppose that the root of the problem is that if I were to iterate through a list of input files in file1.py, I don't know how to then pass that information to file2.sh and then on to file3.py.
From this description, I'd say the straightforward way is to call file2.sh directly from Python.
status, result = subprocess.getstatusoutput("file2.sh " + arg_string)
Is that enough of a start to get you moving? Are the nodes conversant enough for one to launch a command directly on another? If not, you may want to consider looking up "interprocess communication" on Linux. If they're not even on the same Internet node, you'll likely need REST commands (POST and GET operations), at which point the overhead grows.
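As a rough sketch of that idea for file1.py, assuming Python 3 and assuming file2.sh reads the input file as its first argument and any shared parameters after it (the question does not say how file2.sh takes its arguments). The function name submit_each and the launcher parameter are made up for illustration.

```python
import subprocess


def submit_each(input_files, params, launcher=("./file2.sh",)):
    """Call the launcher once per input file.

    Each call receives the input file followed by the shared parameters,
    so file2.sh can forward them to file3.py. Returns a dict mapping each
    input file to the launcher's exit code.
    """
    exit_codes = {}
    for input_file in input_files:
        # Passing a list (no shell=True) avoids quoting problems with
        # file names that contain spaces.
        command = [*launcher, str(input_file), *map(str, params)]
        exit_codes[input_file] = subprocess.run(command).returncode
    return exit_codes
```

The calls are made one after another, which is fine if file2.sh only submits the job to a node and returns straight away. If file2.sh blocks until file3.py finishes, the calls would need to be started concurrently instead.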

Parallel processing in condor

I have a java program that will process 800 images.
I decided to use Condor as a platform for distributed computing, with the aim of dividing those images across the available nodes, having them processed, and getting the combined results back.
Say I have 4 nodes. I want to divide the processing to be 200 images on each node and combine the end result back to me.
I have tried executing it normally by submitting it as java program and stating the requirements = Machine == .. (stating all nodes). But it doesn't seem to work.
How can I divide the processing and execute it in parallel?
HTCondor can definitely help you but you might need to do a little bit of work yourself :-)
There are two possible approaches that come to mind: job arrays and DAG applications.
Job arrays: as you can see from example 5 on the HTCondor Quick Start Guide, you can use the queue command to submit more than 1 job. For instance, queue 800 at the bottom of your job file would submit 800 jobs to your HTCondor pool.
What people do in this case is organize the data to process using a filename convention and exploit that convention in the job file. For instance you could rename your images as img_0.jpg, img_1.jpg, ... img_799.jpg (possibly using symlinks rather than renaming the actual files) and then use a job file along these lines:
Executable = /path/to/my/script
Arguments = /path/to/data/dir/img_$(Process).jpg
Queue 800
When the 800 jobs run, $(Process) gets automatically assigned the value of the corresponding process ID (i.e. an integer going from 0 to 799), which means that each job will pick up the correct image to process.
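The symlink step mentioned above could be sketched like this in Python, assuming the images are .jpg files in a single directory and that their sorted order is an acceptable numbering. The function name link_by_index and both directory arguments are hypothetical.

```python
from pathlib import Path


def link_by_index(source_dir, link_dir, pattern="*.jpg"):
    """Create img_0.jpg, img_1.jpg, ... in link_dir as symlinks to the
    images in source_dir, numbered in sorted filename order.

    Returns the list of created links.
    """
    source_dir, link_dir = Path(source_dir), Path(link_dir)
    link_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for index, image in enumerate(sorted(source_dir.glob(pattern))):
        link = link_dir / f"img_{index}{image.suffix}"
        # Replace a stale link left over from an earlier run.
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(image.resolve())
        links.append(link)
    return links
```

The number of links returned is the value to put after Queue in the job file, and link_dir is the directory to use in the Arguments line.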
DAG: Another approach is to organize your processing in a simple DAG. In this case you could have a pre-processing script (SCRIPT PRE entry in your DAG file) organizing your input data (possibly creating symlinks named appropriately). The real job would be just like the example above.

Resources