Read from memory for a full pipeline, read from files on a retry or partial pipeline - Kedro

How can I set the pipeline up to run from memory or from file? I think the features are there, but I am not sure how to write the pipeline to do this.
My use case is:
normal pipeline, from step 1 to step 10
run from step 2 to step 10
Imagine that at step 1 I write a dataframe to a CSV, and step 2 needs to read from it. If I am running from step 1, I would want to pass that dataframe in memory (to save the read time). But if I start running from step 2, I will need to read from the CSV.
What is the best practice to do so with Kedro?
https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/02_pipelines.html#pipeline-with-circular-dependencies

I can think of 2 ways, depending on your use case.
a) You could use separate environments for this. When running the full pipeline, you use some environment, say regular, where you don't have a catalog entry for the dataset in question (so it is turned into a MemoryDataSet), while in a separate dev environment you have an entry in your catalog.yml to save it as a CSV. It does mean you'd have to run the dev environment from node 1 at least once in order to generate the CSV used for subsequent runs.
kedro run --env regular
kedro run --env dev
kedro run -e dev --from-nodes node2
Relevant docs: https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/02_configuration.html#additional-configuration-environments
b) Another way to do it, if you always want the first node to write to CSV, is to have node1 return 2 outputs (the same data), one saved as pandas.CSVDataSet and one as MemoryDataSet, and to define different pipelines: complete, where the second node reads from memory, and partial, where node1 is absent and node2 reads from the CSV dataset (sketched below).
kedro run --pipeline complete
kedro run --pipeline partial
https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/02_pipelines.html#running-a-pipeline-by-name
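For concreteness, here is a rough sketch of option (b), assuming a recent Kedro project; the node and dataset names (node1, node2, raw_data, data_csv, data_mem) and the registration function are illustrative, not taken from the question. Only data_csv would get a pandas.CSVDataSet entry in catalog.yml; data_mem is deliberately left out of the catalog so it stays a MemoryDataSet.
```python
# Illustrative sketch only -- adapt names and pipeline registration to your Kedro version.
from kedro.pipeline import Pipeline, node


def node1(raw_df):
    # Produce the dataframe once and return it twice, so it can be bound to
    # two catalog entries: one CSV-backed (data_csv), one in-memory (data_mem).
    df = raw_df  # ... real processing goes here
    return df, df


def node2(df):
    # Downstream processing that consumes the dataframe.
    return df


def create_pipelines():
    complete = Pipeline([
        node(node1, "raw_data", ["data_csv", "data_mem"], name="node1"),
        node(node2, "data_mem", "step2_output", name="node2"),  # reads from memory
    ])
    partial = Pipeline([
        node(node2, "data_csv", "step2_output", name="node2"),  # reads from the CSV
    ])
    return {"complete": complete, "partial": partial, "__default__": complete}
```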

In addition to the options suggested by Lorena Balan above, you can use a CachedDataSet. Your catalog entry will look similar to this:
my_cached_dataset:
  type: CachedDataSet
  dataset:
    type: pandas.CSVDataSet
    filepath: path/to/file
CachedDataSet will save the data using the regular underlying dataset and will also populate its internal cache; subsequent loads will then pull the data from that in-memory cache. If the cache is empty (your scenario 2), CachedDataSet will load the data from the underlying [CSV] file.
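If you prefer to construct the dataset in code rather than in catalog.yml, a minimal sketch using the Python API could look like this, assuming kedro.io.CachedDataSet and kedro.extras.datasets.pandas.CSVDataSet are available in your Kedro version; the file path is illustrative.
```python
# Illustrative sketch: a CachedDataSet wrapping a CSV-backed dataset.
from kedro.io import CachedDataSet
from kedro.extras.datasets.pandas import CSVDataSet

my_cached_dataset = CachedDataSet(dataset=CSVDataSet(filepath="path/to/file.csv"))

# The first save writes the CSV and fills the cache, so the next load in the
# same run is served from memory. In a fresh run that starts downstream
# (scenario 2), the load falls back to the CSV file.
```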

Related

How to pass pipeline data to an Azure ML pipeline Databricks step?

I have created an Azure ML pipeline consisting of 4 steps. The first two steps are Python script steps, the 3rd is a Databricks step, and the 4th is also a Python script step. I am creating a PipelineData object and passing it to all subsequent steps.
prepped_data_parameter = PipelineData('prepped_parameter', datastore=data_store)
The 2nd Python step can read the value from the pipeline data, but it is not working in the Databricks step.
I have also tested passing data from one Databricks step to another Databricks step, thinking that the DBFS path might be causing the problem. That does not work either.
The Python script step produces a path like this when I create the pipeline data:
``` /mnt/batch/tasks/shared/LS_root/jobs/******/azureml/bb743894-f7d6-4125-***-bccaa854fb65/mounts/workspaceblobstore/azureml/******-742d-4144-a090-a8ac323fa498/prepped_parameter/ ```
The Databricks step produces a path like this for the same data:
wasbs://azureml-blobstore-*******-983b-47b6-9c30-d544eb05f2c6#*************l001.blob.core.windows.net/azureml/*****-742d-4144-a090-a8ac323fa498/prepped_parameter/
I want to know how I can efficiently pass pipeline data from a Python step to a Databricks step, or vice versa, without manually storing the intermediate data in the datastore and deleting it afterwards.
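For reference, here is a rough sketch of the setup described in the question, assuming the azureml-sdk pipeline classes; the datastore name, compute targets, script and notebook paths, and cluster id are illustrative placeholders, not taken from the question.
```python
# Illustrative sketch of a PipelineData object shared between a Python script
# step and a Databricks step; all resource names are placeholders.
from azureml.core import Workspace, Datastore
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep, DatabricksStep

ws = Workspace.from_config()
data_store = Datastore.get(ws, "workspaceblobstore")

prepped_data_parameter = PipelineData("prepped_parameter", datastore=data_store)

prep_step = PythonScriptStep(
    name="prep",
    script_name="prep.py",
    source_directory="scripts",
    compute_target="cpu-cluster",             # placeholder AmlCompute target
    arguments=["--output", prepped_data_parameter],
    outputs=[prepped_data_parameter],
)

process_step = DatabricksStep(
    name="databricks_process",
    inputs=[prepped_data_parameter],           # the Databricks step consumes it
    notebook_path="/Shared/process",           # placeholder notebook
    run_name="process",
    compute_target="databricks-compute",       # placeholder attached DatabricksCompute
    existing_cluster_id="0000-000000-abcdef",  # placeholder interactive cluster
    allow_reuse=False,
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, process_step])
```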

Parallelism for Entire Kedro Pipeline

I am working on a project where we are processing very large images. The pipeline has several nodes, where each produces output necessary for the next node to run. My understanding is that the ParallelRunner is running the nodes in parallel. It is waiting for each process to finish the 1st node before moving onto the 2nd, etc. My problem is that the inputs take varying amounts of time to complete, so many processes are stuck waiting for others to finish a node when it is not necessary, as each parallel process has no dependency on another, only on its own previously computed results.
Is there a way to run the entire pipeline in parallel on different cores? I do not want each parallel process to wait for the other processes to finish a node. I have the idea that I could accomplish this by creating multiple copies of my Kedro project, modifying their data catalogs to process different parts of the dataset, and then running these in parallel using the subprocess module, but this seems inefficient.
EDIT:
My understanding is that the ParallelRunner is running the nodes in parallel. It is waiting for each process to finish the 1st node before moving onto the 2nd, etc.
Not sure if I understand this correctly, but as soon as a process finishes a node, it will immediately move on to the next node that is ready to be executed. It shouldn't wait on anything.
===
There is an alternative along the same lines as your idea about multiple projects. However, you don't need to create multiple copies of the project to achieve the same result. You can parameterise a run with a certain set of inputs and write a wrapper script (bash, Python, etc.) to invoke as many kedro run commands as you want. For example, if you want a dedicated Kedro run, which will then be in its own process, for each file in the data/01_raw directory, you could do:
for input in data/01_raw/*
do
  file=$(basename "$input")
  kedro run --params=input:"$file"
done
The trick to make this work is to implement a before_pipeline_run hook that dynamically adds a catalog entry using the value of the input parameter. I have a demo repository here to demonstrate this technique: https://github.com/limdauto/demo-kedro-parameterised-runs -- let me know if this addresses your problem.
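A minimal sketch of what such a hook could look like (not necessarily what the demo repository does), assuming a Kedro hooks class; the catalog entry name raw_input and the pandas.CSVDataSet type are illustrative, so swap in whatever dataset type your first node actually consumes.
```python
# Illustrative sketch of a before_pipeline_run hook that registers a dataset
# based on the `input` run parameter.
from kedro.framework.hooks import hook_impl
from kedro.extras.datasets.pandas import CSVDataSet


class ProjectHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # `kedro run --params=input:"my_file.csv"` shows up in extra_params.
        input_file = (run_params.get("extra_params") or {}).get("input")
        if input_file:
            catalog.add(
                "raw_input",  # illustrative catalog entry name
                CSVDataSet(filepath=f"data/01_raw/{input_file}"),
                replace=True,
            )
```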

Does Kedro support Checkpointing/Caching of Results?

Let's say we have multiple long running pipeline nodes.
It seems quite straightforward to checkpoint or cache intermediate results, so that when nodes after a checkpoint are changed or added, only those nodes must be executed again.
Does Kedro provide functionality to make sure that when I run the pipeline, only those steps that have changed are executed?
Also the reverse: is there a way to make sure that all steps that have changed are executed?
Let's say a pipeline producing some intermediate result has changed; will it be executed when I run a pipeline that depends on the output of the first?
TL;DR: Does Kedro have makefile-like tracking of what needs to be done and what not?
I think my question is similar to issue #341, but I do not require support for cyclic graphs.
You might want to have a look at the IncrementalDataSet alongside the partitioned dataset documentation, specifically the section on incremental loads with the incremental dataset, which has a notion of "checkpointing", although checkpointing there is a manual step and not automated like in a makefile. A rough sketch follows below.
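As a rough illustration of the incremental-load idea, here is a sketch using kedro.io.IncrementalDataSet directly; the path, dataset type, and processing loop are illustrative, and the exact constructor arguments may differ between Kedro versions, so treat this as a starting point rather than a drop-in snippet.
```python
# Illustrative sketch: load only the partitions added since the last
# confirmed checkpoint, process them, then advance the checkpoint.
from kedro.io import IncrementalDataSet

dataset = IncrementalDataSet(
    path="data/02_intermediate/partitions",  # illustrative partition directory
    dataset="pandas.CSVDataSet",
)

partitions = dataset.load()  # only partitions newer than the stored checkpoint
for partition_id, data in sorted(partitions.items()):
    ...  # process the new partitions here

dataset.confirm()  # persist the checkpoint so the next run skips these partitions
```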

Writing MapReduce job to concurrently download files?

Not sure if this is a suitable use case for MapReduce: part of the Oozie workflow I'm trying to implement is to download a series of files named with sequential numbers (e.g. 1 through 20). I wanted those files to be downloaded simultaneously (5 files at a time), so I created a Python script (a rough sketch follows the listing below) that creates 5 text files as follows:
1.txt: 1,2,3,4
2.txt: 5,6,7,8
3.txt: 9,10,11,12
4.txt: 13,14,15,16
5.txt: 17,18,19,20
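A rough sketch of what such a grouping script could look like; the chunk size and output filenames simply mirror the listing above, and in the actual workflow these files end up in the streaming job's input directory.
```python
# Illustrative sketch: split the file numbers 1-20 into 5 comma-separated
# groups of 4, one group per text file, matching the listing above.
ids = list(range(1, 21))
chunk_size = 4

for i, start in enumerate(range(0, len(ids), chunk_size), start=1):
    chunk = ids[start:start + chunk_size]
    with open(f"{i}.txt", "w") as f:
        f.write(",".join(str(n) for n in chunk))
```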
Then, for the next step of the workflow, I created a download.sh shell script that consumes a comma-separated list of numbers and downloads the requested files. In the workflow, I set up a streaming action in Oozie, used the directory that contains the files generated above as its input (mapred.input.dir), used download.sh as the mapper command, and "cat" as the reducer command. I assumed that Hadoop would spawn a different mapper for each of the input files above.
This seems to work sometimes: it downloads the files correctly, but sometimes it just gets stuck trying to execute and I don't know why. I noticed that this happens when I increase the number of simultaneous downloads (e.g. instead of 4 numbers per txt file, I would do 20 and so forth).
So my question is: Is this a correct way to implement parallel retrieval of files using MapReduce and OOZIE? If not, how is this normally done using OOZIE? I'm trying to get my CSV files into the HDFS prior to running the Hive script and I'm not sure what the best way would be to achieve that.
After looking deeper into this, it seems that creating an Oozie "fork" node would be the best approach. So I created a fork node, under which I created 6 shell actions that execute download.sh and take the list of file numbers as an argument. I ended up modifying the Python script so it outputs the file numbers that need to be downloaded to STDOUT (instead of saving them on HDFS). I had Oozie capture that output and then pass it as arguments to the download.sh forks.
The Cloudera Hue interface does not provide a way to create fork nodes (at least not one that I was able to find), so I downloaded the workflow.xml file, added the fork nodes myself, and then re-imported it as a new workflow.

Hadoop Load and Store

When I try to run a Pig script which has two "store" statements writing to the same location, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
It hangs; I mean it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode. So in most cases where you are running map-reduce programs, the output file is opened once, data is written, and then it is closed. Given this approach, you cannot write data simultaneously to the same file.
Try writing to separate files and check whether the map-reduce programs still hang. If they do, then there is some other issue.
You can examine the result and the map-reduce logs to analyze what went wrong.
[Edit:]
You cannot write to the same file or append to an existing file. The HDFS append feature is a work in progress.
To work around this you can do two things:
1) If Alert_Message_Count and Warning_Message_Count have the same schema, you could use UNION as suggested by Chris.
2) Do post-processing when the schemas are not the same, i.e. write a map-reduce program to merge the two separate outputs into one.
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps), but I would expect some form of error message rather than it just hanging.
If you open the cluster job tracker and look at the logs for the task, does the log yield anything of note that can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';
