Does Kedro support Checkpointing/Caching of Results? - kedro

Let's say we have multiple long-running pipeline nodes.
It seems quite straightforward to checkpoint or cache the intermediate results, so that when nodes after a checkpoint are changed or added, only those nodes must be executed again.
Does Kedro provide functionality to make sure that, when I run the pipeline, only those steps are executed that have changed?
Also the reverse: is there a way to make sure that all steps that have changed are executed?
Let's say a pipeline producing some intermediate result changed; will it be executed when I run a pipeline that depends on the output of the first?
TL;DR: Does Kedro have Makefile-like tracking of what needs to be done and what does not?
I think my question is similar to issue #341, but I do not require support for cyclic graphs.

You might want to have a look at the IncrementalDataSet, alongside the partitioned dataset documentation, specifically the section on incremental loads with the incremental dataset. It has a notion of "checkpointing", although the checkpointing is a manual step and not automated like a Makefile.
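For illustration, here is a rough Python sketch of how that could look (the directory and dataset names are made up, and the exact import location of IncrementalDataSet depends on your Kedro version; in a project you would usually declare this in catalog.yml and confirm it via a node's confirms argument instead):

from kedro.io import IncrementalDataSet

# Partitions live as individual files under this (illustrative) directory.
raw_events = IncrementalDataSet(
    path="data/01_raw/events/",
    dataset="pandas.CSVDataSet",
)

# load() returns only the partitions added since the last confirmed checkpoint,
# as a dict of partition id -> loaded data.
new_partitions = raw_events.load()
for partition_id, df in sorted(new_partitions.items()):
    ...  # process the new data

# Advance the checkpoint so these partitions are skipped on the next run.
raw_events.confirm()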

Related

Parallelism for Entire Kedro Pipeline

I am working on a project where we are processing very large images. The pipeline has several nodes, where each produces output necessary for the next node to run. My understanding is that the ParallelRunner is running the nodes in parallel. It is waiting for each process to finish the 1st node before moving on to the 2nd, etc. My problem is that the inputs take varying amounts of time to complete, so many processes are stuck waiting for others to finish a node when it is not necessary, as each process in parallel has no dependency on another, only on its own previously computed results.
Is there a way to run the entire pipeline in parallel on different cores? I do not want each parallel process to wait for the other processes to finish a node. I have the idea that I could accomplish this by creating multiple copies of my Kedro project, modifying their data catalogs to process different parts of the dataset, and then running these in parallel using the subprocess module, but this seems inefficient.
EDIT:
My understanding is that the ParallelRunner is running the nodes
in parallel. It is waiting for each process to finish the 1st node
before moving onto the 2nd, etc.
Not sure if I understand this correctly, but as soon as a process finishes, it will move on immediately to the next node that is ready to be executed. It shouldn't wait on anything.
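To make that concrete, here is a minimal, self-contained sketch with toy data (assuming a Kedro 0.17/0.18-era API) where two independent branches run under the ParallelRunner and each branch advances as soon as its own previous node is done:

from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import ParallelRunner


def double(x):
    return 2 * x


if __name__ == "__main__":
    # Two independent branches; nothing in branch "a" waits on branch "b".
    pipeline = Pipeline([
        node(double, "a", "a2"),
        node(double, "a2", "a4"),
        node(double, "b", "b2"),
        node(double, "b2", "b4"),
    ])

    catalog = DataCatalog({
        "a": MemoryDataSet(data=1),
        "b": MemoryDataSet(data=10),
    })

    # Roughly equivalent to `kedro run --runner=ParallelRunner` inside a project.
    results = ParallelRunner().run(pipeline, catalog)
    print(results)  # free outputs, e.g. {"a4": 4, "b4": 40}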
===
There is an alternative along the same lines as your idea about multiple projects. However, you don't need to create multiple copies of the project to achieve the same result. You can parameterise a run with a certain set of inputs and write a wrapper script (bash, python, etc.) to invoke as many kedro run commands as you want. For example, if you want a dedicated Kedro run, each on its own process, for every file in the data/01_raw directory, you could do:
for input in data/01_raw/*
do
    file=$(basename "$input")
    kedro run --params=input:"$file"
done
The trick to make this work is to implement a before_pipeline_run hook that dynamically adds a catalog entry with the value of the input parameter. I have a demo repository here to demonstrate this technique: https://github.com/limdauto/demo-kedro-parameterised-runs -- let me know if this addresses your problem.
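A minimal sketch of the hook idea (names such as "raw_input" and the CSV path are illustrative, not taken from the linked repo; assumes a Kedro version that ships the before_pipeline_run hook spec and the pandas CSVDataSet):

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.framework.hooks import hook_impl


class DynamicCatalogHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # --params=input:myfile.csv shows up under run_params["extra_params"]
        input_file = (run_params.get("extra_params") or {}).get("input")
        if input_file:
            # Register a catalog entry pointing at the file chosen for this run.
            catalog.add(
                "raw_input",
                CSVDataSet(filepath=f"data/01_raw/{input_file}"),
            )

The hook class still needs to be registered with the project (via the HOOKS setting or .kedro.yml, depending on the Kedro version) for Kedro to pick it up.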

Is there a way to parallelize spark.read.load(string*) when reading many files?

I noticed that in spark-shell (Spark 2.4.4), when I do a simple spark.read.format(xyz).load("a","b","c",...), it looks like Spark uses a single IPC client (or "thread") to load the files a, b, c, ... sequentially (they are paths on HDFS).
Is this expected?
The reason I am asking is that, in my case, I am trying to load 50K files, and the sequential load takes a long time.
Thanks
PS: I am trying to find it in the source code, but I'm not sure if this is the relevant place:
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L180
Might not be an exact "answer" to my original question, but I found out the reason in my particular case: the NameNode's audit log showed some runaway jobs pegging the NameNode, which greatly slowed down the RPC calls. After killing those bad jobs, Spark's load speed improved greatly.
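For completeness, here is a hedged PySpark sketch (the paths are illustrative, not from this thread) showing how many paths can be passed to a single load() call, and the Spark 2.x settings that control whether file listing happens on the driver or in a distributed job; whether this helps depends on the cluster, and in the case above the real culprit was the NameNode load:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bulk-load")
    # When the number of paths exceeds this threshold, Spark lists the files
    # with a distributed job instead of on the driver (Spark 2.x setting).
    .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
    .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "1000")
    .getOrCreate()
)

# Illustrative path pattern for the 50K files.
paths = [f"hdfs:///data/part-{i:05d}" for i in range(50000)]

# load() accepts a list of paths, so all files go into one DataFrame.
df = spark.read.format("parquet").load(paths)
print(df.count())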

Running pipelines with data parallellization

I've been running the Kedro tutorials (the Hello World and the Spaceflights one) and I'm wondering if it's easily possible to do data parallelization using Kedro.
Imagine a situation where I have a node that needs to be executed on millions of files.
I've seen that there's the option kedro run -p, but this only does task parallelization (as stated here: https://kedro.readthedocs.io/en/latest/03_tutorial/04_create_pipelines.html).
Thanks for any feedback.
Kedro has a number of built-in DataSet classes. For IO parallelization, there is SparkDataSet, which delegates the IO parallelization to PySpark: https://kedro.readthedocs.io/en/latest/04_user_guide/09_pyspark.html#creating-a-sparkdataset
Another dataset is DaskDataSet, but this is still WIP in this PR: https://github.com/quantumblacklabs/kedro/pull/97 (if you want to use Dask, you could have a look at that PR and create your own custom dataset).
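If it helps, here is a rough sketch of what a SparkDataSet looks like in code (the import path depends on the Kedro version, and the file path below is illustrative; in practice you would declare it as a catalog entry):

from kedro.extras.datasets.spark import SparkDataSet

raw_files = SparkDataSet(
    filepath="data/01_raw/measurements/",  # a directory containing many files
    file_format="parquet",
)

# load() returns a pyspark.sql.DataFrame, so Spark parallelizes the IO and
# any downstream transformations across its executors.
df = raw_files.load()
print(df.count())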

Is snakemake the right tool to use for handling output-mediated workflows?

I'm new to snakemake (started trying it out in the last week or so) in order to handle fewer of the small details of workflows; previously I coded up my own specific workflows in Python.
I generated a small workflow which, among other steps, would take Illumina PE reads and run Kraken against them. I'd then parse the Kraken output to detect the most common species (within a set of allowable ones) if a species value wasn't provided (running with snakemake -s test.snake --config R1_reads= R2_reads= species='').
I have 2 questions.
What is the recommended approach given the dynamic output/input?
Currently my strategy for this is to create a temp file which contains the detected species and then cat {input.species} it into other shell commands. This doesn't seem elegant, but looking through the docs I couldn't quite find an adequate alternative. I noticed PersistentDicts would let me pass variables between run: commands, but I'm unsure if I can use that to load variables into a shell: section. I also noticed that wrappers could allow me to handle it, however from the point I need that variable on, I'd be wrapping the remainder of my workflow.
Is snakemake the right tool if I want to use the species afterwards to run a set of scripts specific to that species (with multiple species-specific workflows)?
Right now my impression of how to solve this problem is to have multiple workflow files for the species, and have a run with a switch which calls the associated species workflow depending on the species.
Appreciate any insight on these questions.
-Kim
You can mark output as dynamic (e.g. expecting one file per species). Then, Snakemake will determine the downstream DAG of jobs after those files have been generated. See http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files
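As a rough illustration of that pattern (a sketch only: the helper scripts and rule names are hypothetical, and newer Snakemake versions replace dynamic() with checkpoints):

rule all:
    input:
        dynamic("results/{species}.done")

# Run Kraken and write one file per detected species; the set of species is
# not known until this rule has run (run_kraken_and_split.sh is hypothetical).
rule detect_species:
    input:
        r1=config["R1_reads"],
        r2=config["R2_reads"]
    output:
        dynamic("species/{species}.txt")
    shell:
        "run_kraken_and_split.sh {input.r1} {input.r2} species/"

# After the species files exist, Snakemake re-evaluates the DAG and schedules
# one job per species (workflows/<species>.sh is a hypothetical per-species script).
rule per_species_analysis:
    input:
        "species/{species}.txt"
    output:
        "results/{species}.done"
    shell:
        "bash workflows/{wildcards.species}.sh {input} > {output}"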

Parent node workspace of Matrix jobs in Jenkins

When using the repo plugin in Jenkins for multi-configuration jobs, the source code is checked out once in the matrix parent workspace and then again on all the nodes where the multi-configuration job spawns.
Is there a way to avoid the source code checkout on the parent node? It seems to be a duplication and an unnecessary waste of time when the same code is going to be checked out again in the matrix.
There is no way to avoid it; that's the way the multi-configuration job works. Your options are:
Do not use a multi-configuration job. You can replace it with multiple jobs, or with a single job that takes a parameter or parameters, and trigger all those jobs.
Keep using the multi-configuration job but minimize the downsides of the parent checkout by having enough disk space for it and/or making sure the checkout is as fast as possible. There are ways to speed up a checkout, but how exactly to do it depends on which version control system you use.
