Parent node workspace of Matrix jobs in Jenkins

When using the Repo plugin in Jenkins for multi-configuration jobs, the source code is checked out once in the matrix parent workspace and then again on every node where the multi-configuration job spawns.
Is there a way to avoid the source code checkout on the parent node? It seems to be a duplication and an unnecessary waste of time when the same code is going to be checked out again in the matrix.

There is no way to avoid it; that is how multi-configuration jobs work. Your options are:
- Do not use a multi-configuration job. You can replace it with multiple jobs, or with a single job that takes one or more parameters, and trigger all of those jobs.
- Keep using the multi-configuration job but minimize the downsides of the parent checkout by having enough disk space for it and/or making sure the checkout is as fast as possible. There are ways to speed up a checkout, but how exactly depends on which version control system you use.

Related

Parallelism for Entire Kedro Pipeline

I am working on a project where we are processing very large images. The pipeline has several nodes, each of which produces output necessary for the next node to run. My understanding is that the ParallelRunner runs the nodes in parallel, waiting for each process to finish the 1st node before moving on to the 2nd, etc. My problem is that the inputs take varying amounts of time to complete, so many processes are stuck waiting for others to finish a node when it is not necessary: each parallel process has no dependency on the others, only on its own previously computed results.
Is there a way to run the entire pipeline in parallel on different cores? I do not want each parallel process to wait for the other processes to finish a node. One idea is to create multiple copies of my kedro project, modify their data catalogs to process different parts of the dataset, and then run these in parallel using the subprocess module, but this seems inefficient.
EDIT:
My understanding is that the ParallelRunner is running the nodes
in parallel. It is waiting for each process to finish the 1st node
before moving onto the 2nd, etc.
Not sure if I understand this correctly, but as soon as a process finishes, it will immediately move on to the next node that is ready to be executed. It shouldn't wait on anything.
===
There is an alternative along the same lines as your idea about multiple projects. However, you don't need to create multiple copies of the project to achieve the same result. You can parameterise a run with a certain set of inputs and write a wrapper script (bash, Python, etc.) to invoke as many kedro run commands as you want. For example, if you want a dedicated Kedro run, each on its own process, for every file in the data/01_raw directory, you could do:
for input in data/01_raw/*
do
    file=$(basename "$input")
    kedro run --params=input:"$file"
done
The trick to making this work is to implement a before_pipeline_run hook that dynamically adds a catalog entry with the value of the input parameter. I have a demo repository demonstrating this technique: https://github.com/limdauto/demo-kedro-parameterised-runs -- let me know if this addresses your problem.
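The hook described above could look roughly like the following. This is a minimal sketch, not the demo repository's actual code: the `DynamicCatalogHook` class name and the `raw_image` dataset name are illustrative, and a small stub stands in for Kedro's `DataCatalog` so the example is self-contained (a real implementation would decorate the method with Kedro's `@hook_impl` and add a proper dataset object).

```python
class DynamicCatalogHook:
    """Sketch of a before_pipeline_run hook, assuming Kedro's hook
    signature (run_params, pipeline, catalog)."""

    def before_pipeline_run(self, run_params, pipeline, catalog):
        # `kedro run --params=input:<file>` surfaces the value under
        # run_params["extra_params"] (an assumption to verify for your version).
        filename = run_params.get("extra_params", {}).get("input")
        if filename:
            # Register a catalog entry pointing at that one raw file;
            # "raw_image" is a hypothetical dataset name.
            catalog.add("raw_image", f"data/01_raw/{filename}")


class StubCatalog:
    """Minimal stand-in for kedro.io.DataCatalog, used only in this sketch."""

    def __init__(self):
        self.entries = {}

    def add(self, name, dataset):
        self.entries[name] = dataset


# Simulate what the wrapper script's `kedro run --params=input:img_0.jpg` would do:
catalog = StubCatalog()
DynamicCatalogHook().before_pipeline_run(
    {"extra_params": {"input": "img_0.jpg"}}, None, catalog
)
print(catalog.entries["raw_image"])  # data/01_raw/img_0.jpg
```

Each invocation of `kedro run` in the wrapper loop then gets its own process and its own dynamically registered input, which is what gives you whole-pipeline parallelism across files.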

Does Kedro support Checkpointing/Caching of Results?

Let's say we have multiple long-running pipeline nodes.
It seems quite straightforward to checkpoint or cache the intermediate results, so that when nodes after a checkpoint are changed or added, only those nodes must be executed again.
Does Kedro provide functionality to make sure that, when I run the pipeline, only those steps are executed that have changed?
Also the reverse: is there a way to make sure that all steps that have changed are executed? Say a pipeline producing some intermediate result has changed; will it be executed when I run a pipeline depending on its output?
TL;DR: Does Kedro have Makefile-like tracking of what needs to be done and what does not?
I think my question is similar to issue #341, but I do not require support of cyclic graphs.
You might want to have a look at the IncrementalDataSet, alongside the partitioned dataset documentation -- specifically the section on incremental loads with the incremental dataset. It has a notion of "checkpointing", although checkpointing there is a manual step, not automated as in make.
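As a rough illustration, a catalog entry for an incremental load might look like the fragment below. The entry name, path, and underlying dataset type are hypothetical, and the exact type string depends on your Kedro version:

```yaml
# Hypothetical catalog.yml entry: partitions already recorded in the
# checkpoint are skipped on the next load.
intermediate_results:
  type: IncrementalDataSet
  path: data/02_intermediate/results
  dataset: pandas.CSVDataSet
```

A downstream step then has to explicitly confirm the dataset before the checkpoint advances -- that confirmation is the manual step mentioned above that make-style tools would do automatically.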

Parallel processing in condor

I have a Java program that will process 800 images.
I decided to use Condor as a platform for distributed computing, with the aim of dividing those images among the available nodes -> getting them processed -> combining the results back.
Say I have 4 nodes. I want to divide the processing so that 200 images run on each node, then combine the end results.
I have tried executing it normally by submitting it as a Java program and stating the requirements = Machine == .. (listing all nodes), but it doesn't seem to work.
How can I divide the processing and execute it in parallel?
HTCondor can definitely help you, but you might need to do a little bit of work yourself :-)
There are two possible approaches that come to mind: job arrays and DAG applications.
Job arrays: as you can see from example 5 in the HTCondor Quick Start Guide, you can use the queue command to submit more than one job. For instance, queue 800 at the bottom of your job file would submit 800 jobs to your HTCondor pool.
What people do in this case is organize the data to process using a filename convention and exploit that convention in the job file. For instance you could rename your images as img_0.jpg, img_1.jpg, ... img_799.jpg (possibly using symlinks rather than renaming the actual files) and then use a job file along these lines:
Executable = /path/to/my/script
Arguments = /path/to/data/dir/img_$(Process)
Queue 800
When the 800 jobs run, $(Process) is automatically assigned the value of the corresponding process ID (i.e. an integer going from 0 to 799), which means that your code will pick up the correct image to process.
DAG: Another approach is to organize your processing in a simple DAG. In this case you could have a pre-processing script (SCRIPT PRE entry in your DAG file) organizing your input data (possibly creating symlinks named appropriately). The real job would be just like the example above.
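A minimal DAG file for this scheme might look like the fragment below. The file names are illustrative: process.sub would be the queue-800 submit file sketched above, and prepare.sh the script creating the img_<N>.jpg symlinks before any job starts.

```
# my_processing.dag -- names are illustrative
# JOB binds a DAG node name to a submit file;
# SCRIPT PRE runs before that node's jobs are submitted.
JOB PROCESS process.sub
SCRIPT PRE PROCESS prepare.sh
```

You would then submit the whole thing with condor_submit_dag my_processing.dag; a SCRIPT POST entry could similarly combine the 800 results afterwards.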

Hadoop Distributed Cache - modify file

I have a file in the distributed cache. The driver class, based on the output of a job, updates this file and starts a new job. The new job needs these updates.
The way I currently do it is to replace the old Distributed Cache file with a new one (the updated one).
Is there a way of broadcasting the diffs (between the old file and the new one) to all the task trackers which need the file?
Or is it the case that, after a job (the first one, in my case) is finished, all the directories/files specific to that job are deleted and consequently it doesn't even make sense to think in this direction ?
I don't think the distributed cache was built with such a scenario in mind; it simply puts files locally.
In your case, I would suggest putting the file in HDFS and making all interested parties take it from there.
As an optimization, you can give this file a high replication factor, so it will be local to most of the tasks.
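In concrete terms, the suggestion might look like the commands below. The paths and the replication factor are illustrative, and the commands of course require a running Hadoop cluster:

```
# Put the shared file in HDFS so every task can read it
hdfs dfs -put state.txt /shared/state.txt
# Raise its replication factor so a copy ends up local to most task nodes
hdfs dfs -setrep -w 10 /shared/state.txt
```

Between jobs, the driver would simply overwrite it in place (hdfs dfs -put -f state.txt /shared/state.txt) before launching the next job.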

Optionally build multiple jobs in a row

We currently have created several jobs for our components. These components all depend on each other like the following:
A -> B -> C
Currently it is possible to run these jobs separately, independent of each other. If someone runs C, the build uses the A and B artifacts from a previous build.
Now it should be possible to optionally build these jobs in a row. My first thought was some kind of BuildAll job which starts the other jobs in the right order, but it does not seem possible to start other jobs in a build step.
Solving this with the Build other projects option is not a solution either, because that would always trigger the other builds whenever someone starts A, for example.
So, does anyone have an idea how to solve this? Is something like this possible? Perhaps I missed an option/plugin that lets me use other jobs as build steps?
I would look at using the Parameterized Trigger plugin:
https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin
It allows you to trigger another job as a build step, with parameters if you need them. This would allow you to create a BuildAll job that calls A, then B, then C in sequence.
Have you considered:
https://wiki.jenkins-ci.org/display/JENKINS/Join+Plugin
This can help you with the "Build-All" step if you want to go down that path.
However, one part that I do not understand: if A -> B -> C, how are any of them optional? If you can clarify, I might be able to help you better.
