How to pass pipeline data to an Azure ML pipeline Databricks step? - azure-databricks

I have created an Azure ML pipeline consisting of 4 steps. The first two steps are Python script steps, the 3rd is a Databricks step, and the 4th is another Python script step. I create a PipelineData object and pass it to all subsequent steps:
```
prepped_data_parameter = PipelineData('prepped_parameter', datastore=data_store)
```
The 2nd Python script step can read the value from the pipeline data, but the same does not work in the Databricks step.
I have also tested passing data from one Databricks step to another Databricks step, thinking that the DBFS path might be causing the problem. That does not work either.
The Python script step produces a path like this for the pipeline data:
``` /mnt/batch/tasks/shared/LS_root/jobs/******/azureml/bb743894-f7d6-4125-***-bccaa854fb65/mounts/workspaceblobstore/azureml/******-742d-4144-a090-a8ac323fa498/prepped_parameter/ ```
The Databricks step produces this for the same pipeline data:
``` wasbs://azureml-blobstore-*******-983b-47b6-9c30-d544eb05f2c6#*************l001.blob.core.windows.net/azureml/*****-742d-4144-a090-a8ac323fa498/prepped_parameter/ ```
I want to know how I can efficiently pass pipeline data from a Python script step to a Databricks step, or vice versa, without manually storing the intermediate data in the datastore and deleting it afterwards.
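For context, this is roughly how the steps are wired together (a trimmed sketch of my setup; the names, scripts, compute targets and cluster id below are placeholders, not my actual ones):
```
from azureml.core import Workspace, Datastore
from azureml.core.compute import DatabricksCompute
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep, DatabricksStep

ws = Workspace.from_config()
data_store = Datastore.get(ws, "workspaceblobstore")
databricks_compute = DatabricksCompute(workspace=ws, name="my-databricks")  # attached Databricks compute (placeholder name)

# intermediate data shared between the steps
prepped_data_parameter = PipelineData('prepped_parameter', datastore=data_store)

# Python script step that writes to the PipelineData location
prep_step = PythonScriptStep(
    name="prep",
    script_name="prep.py",                      # placeholder script
    arguments=["--prepped", prepped_data_parameter],
    outputs=[prepped_data_parameter],
    compute_target="cpu-cluster",               # placeholder AML compute
    source_directory="scripts",
)

# Databricks step that should consume the same PipelineData; inside the
# Databricks script only the wasbs:// style path shown above is visible,
# not the mounted path the Python script steps see
db_step = DatabricksStep(
    name="databricks-transform",
    python_script_name="transform.py",          # placeholder script
    source_directory="scripts",
    inputs=[prepped_data_parameter],
    compute_target=databricks_compute,
    existing_cluster_id="xxxx-xxxxxx-xxxxxxxx", # placeholder cluster id
    run_name="transform",
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, db_step])
```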

Related

Read from memory for full pipeline, read from files if retry or partial pipeline

How can I make the pipeline run from memory or from file depending on where it starts? I think the features are there, but I am not sure how to write the pipeline to do this.
My use case is:
- normal pipeline: from step 1 to step 10
- partial run: from step 2 to step 10
Imagine that at step 1 I write a dataframe to a CSV, and step 2 needs to read from it. If I run from step 1, I would want to pass that dataframe in memory (to save the read time). But if I start running from step 2, I will need to read from the CSV.
What is the best practice to do so with Kedro?
https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/02_pipelines.html#pipeline-with-circular-dependencies
I can think of 2 ways, depending on your use case.
a) You could use separate environments for this. When running the full pipeline, you use some environment, say regular, where you don't have a catalog entry for the dataset in question (hence it will be turned into a MemoryDataSet), while in a separate dev environment you have an entry in your catalog.yml that saves it as a CSV. It does mean you'd have to run dev from node 1 at least once in order to generate the CSV to be used for subsequent runs.
```
kedro run --env regular
kedro run --env dev
kedro run -e dev --from-nodes node2
```
Relevant docs: https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/02_configuration.html#additional-configuration-environments
b) Another way to do it, if you always want the first node to write to CSV, is to have node1 return 2 outputs (the same data), one saved as pandas.CSVDataSet and one as MemoryDataSet, and to define two different pipelines: a complete pipeline where the second node reads from memory, and a partial one without node1, where node2 reads from the CSV dataset (see the sketch after the commands below).
```
kedro run --pipeline complete
kedro run --pipeline partial
```
https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/02_pipelines.html#running-a-pipeline-by-name
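A minimal Python sketch of option (b), assuming hypothetical node functions make_data and process; the dataset names data_csv, data_mem and result would need matching catalog entries, with data_csv declared as pandas.CSVDataSet and data_mem left out of the catalog (so it stays a MemoryDataSet):
```
import pandas as pd
from kedro.pipeline import Pipeline, node

def make_data():
    # node1: produce the dataframe once and emit it twice,
    # so it can be both persisted to CSV and kept in memory
    df = pd.DataFrame({"a": [1, 2, 3]})
    return df, df

def process(df):
    # node2: downstream work on the dataframe
    return df.describe()

# complete: node2 reads the in-memory copy
complete = Pipeline([
    node(make_data, inputs=None, outputs=["data_csv", "data_mem"], name="node1"),
    node(process, inputs="data_mem", outputs="result", name="node2"),
])

# partial: node1 is skipped and node2 reads the CSV written by an earlier run
partial = Pipeline([
    node(process, inputs="data_csv", outputs="result", name="node2"),
])
```
Registering these two objects under the names complete and partial (wherever your project registers its pipelines, which varies by Kedro version) is what makes kedro run --pipeline complete and kedro run --pipeline partial pick them up.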
In addition to the options suggested by @Lorena Balan, you can use a CachedDataSet. Your catalog entry will look similar to this:
```
my_cached_dataset:
  type: CachedDataSet
  dataset:
    type: pandas.CSVDataSet
    filepath: path/to/file
```
The cached dataset will save the data using the regular underlying dataset and will also populate its internal cache; subsequent loads then pull the data from that memory cache. If the cache is empty (your scenario 2), CachedDataSet will on load pull the data from the underlying [CSV] file.
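If you prefer to build the same thing in code rather than in catalog.yml, a rough equivalent looks like the following (exact import paths depend on your Kedro version, so treat this as a sketch):
```
import pandas as pd
from kedro.io import CachedDataSet
from kedro.extras.datasets.pandas import CSVDataSet

# wrap the persistent CSV dataset in an in-memory cache
cached = CachedDataSet(dataset=CSVDataSet(filepath="path/to/file"))

df = pd.DataFrame({"a": [1, 2, 3]})
cached.save(df)       # writes the CSV and populates the cache
df2 = cached.load()   # served from the memory cache; falls back to the
                      # CSV file when the cache is empty
```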

Does Kedro support Checkpointing/Caching of Results?

Let's say we have multiple long-running pipeline nodes.
It seems quite straightforward to checkpoint or cache the intermediate results, so that when nodes after a checkpoint are changed or added, only those nodes must be executed again.
Does Kedro provide functionality to make sure that, when I run the pipeline, only those steps that have changed are executed?
Also the reverse: is there a way to make sure that all steps that have changed are executed?
Let's say a pipeline producing some intermediate result has changed; will it be executed when I run a pipeline that depends on the output of the first?
TL;DR: Does Kedro have makefile-like tracking of what needs to be done and what not?
I think my question is similar to issue #341, but I do not require support of cyclic graphs.
You might want to have a look at IncrementalDataSet alongside the partitioned dataset documentation, specifically the section on incremental loads with the incremental dataset, which has a notion of "checkpointing", although the checkpointing there is a manual step and not automated like a makefile.
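For illustration, a rough sketch of the incremental dataset in code (the path and dataset type are placeholders, and the exact shape of what load() returns differs slightly between Kedro versions):
```
from kedro.io import IncrementalDataSet

# partitioned CSV files live under data/01_raw/parts/; a checkpoint
# records which partitions have already been confirmed as processed
parts = IncrementalDataSet(path="data/01_raw/parts", dataset="pandas.CSVDataSet")

# load() only returns partitions newer than the current checkpoint,
# keyed by partition id
new_partitions = parts.load()
for partition_id, partition_data in new_partitions.items():
    pass  # process each new partition here

# advancing the checkpoint is an explicit, manual step
parts.confirm()
```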

Running pipelines with data parallellization

I've been running the Kedro tutorials (the Hello World and the spaceflights one) and I'm wondering if it's easily possible to do data parallelization using Kedro.
Imagine the situation where I have a node that needs to be executed on millions of files.
I've seen that there's the kedro run -p option, but this only does task parallelization (as stated here: https://kedro.readthedocs.io/en/latest/03_tutorial/04_create_pipelines.html).
Thanks for any feedback.
Kedro has a number of built-in DataSet classes. For IO parallelization, there is SparkDataSet, which delegates IO parallelization to PySpark: https://kedro.readthedocs.io/en/latest/04_user_guide/09_pyspark.html#creating-a-sparkdataset
Another dataset is DaskDataSet, but this is still WIP in this PR: https://github.com/quantumblacklabs/kedro/pull/97 (if you want to use Dask, you could have a look at that PR and create your own custom dataset).
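To make that concrete, a node in such a setup just takes and returns pyspark DataFrames and lets Spark distribute the actual work. This is a sketch with a made-up column name, and the SparkDataSet import path varies across Kedro versions:
```
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def clean_text(df: DataFrame) -> DataFrame:
    # runs in parallel across Spark executors/partitions, not on the driver
    return (
        df.filter(F.col("text").isNotNull())
          .withColumn("text_length", F.length("text"))
    )
```
With the node's input and output declared as SparkDataSet entries in the catalog, Kedro only passes the (lazy) DataFrame from node to node, and the millions of files behind it are read and processed by the Spark cluster.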

Spark RDD problems

I am starting with Spark and have never worked with Hadoop. I have 10 iMacs on which I have installed Spark 1.6.1 with Hadoop 2.6. I downloaded the precompiled version and just copied the extracted contents into /usr/local/spark/. I did all the environment variable setup (SCALA_HOME, changes to PATH, and other Spark conf). I am able to run both spark-shell and pyspark (with Anaconda's Python).
I have set up the standalone cluster; all the nodes show up on my web UI. Then, using the Python shell (run on the cluster, not locally), I followed this link's Python interpreter word count example.
This is the code I have used
```
from operator import add

def tokenize(text):
    return text.split()

text = sc.textFile("Testing/shakespeare.txt")
words = text.flatMap(tokenize)
wc = words.map(lambda x: (x, 1))
counts = wc.reduceByKey(add)
counts.saveAsTextFile("wc")
```
It gives me an error that the file shakespeare.txt was not found on the slave nodes. Searching around, I understood that if I am not using HDFS then the file should be present on each slave node at the same path. Here is the stack trace - github gist
Now, I have a few questions:
Isn't an RDD supposed to be distributed? That is, shouldn't it have distributed the file to all the nodes (when the action was run on the RDD) instead of requiring me to distribute it myself?
I downloaded Spark prebuilt for Hadoop 2.6, but none of the Hadoop commands are available to set up HDFS. I extracted the Hadoop jar file found in spark/lib hoping to find some executable, but there was nothing. So, what Hadoop-related files are provided in the Spark download?
Lastly, how can I run a distributed application (spark-submit) or a distributed analysis (using pyspark) on the cluster? If I have to create an HDFS, what extra steps are required? Also, how can I create an HDFS here?
If you read the Spark Programming Guide, you will find the answer to your first question:
To illustrate RDD basics, consider the simple program below:
```
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
```
The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
Remember that transformations are executed on the Spark workers (see link, slide n.21).
Regarding your second question, Spark ships only the client libraries needed to use a Hadoop infrastructure. You need to set up the Hadoop cluster (HDFS, etc.) first in order to use it with the libraries bundled in Spark: have a look at Hadoop Cluster Setup.
To answer your last question, I hope that the official documentation helps, in particular Spark Standalone.
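To make the first point concrete: if you don't want to set up HDFS (or copy the file to the same path on every node), for a small file you can read it on the driver and let Spark distribute the data itself. A sketch, assuming the same pyspark shell with sc available:
```
from operator import add

# read the file on the driver only, then ship the contents to the cluster;
# this is only reasonable for files that fit comfortably in driver memory
with open("Testing/shakespeare.txt") as f:
    lines = f.read().splitlines()

text = sc.parallelize(lines)
counts = (text.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(add))
print(counts.take(10))
```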

Writing MapReduce job to concurrently download files?

Not sure if this is a suitable use case for MapReduce: part of the Oozie workflow I'm trying to implement is to download a series of files named with sequential numbers (e.g. 1 through 20). I wanted those files to be downloaded simultaneously (5 files at a time), so I created a Python script that creates 5 text files as follows (a rough sketch of that script follows the listing):
```
1.txt: 1,2,3,4
2.txt: 5,6,7,8
3.txt: 9,10,11,12
4.txt: 13,14,15,16
5.txt: 17,18,19,20
```
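For reference, a minimal sketch of what that generator script could look like (the chunk size of 4 and the 1.txt .. 5.txt names match the listing above; everything else is an assumption):
```
# split the file numbers 1..20 into 5 groups of 4 and write one
# comma-separated list per mapper input file (1.txt .. 5.txt)
file_numbers = list(range(1, 21))
chunk_size = 4

for i in range(0, len(file_numbers), chunk_size):
    chunk = file_numbers[i:i + chunk_size]
    with open("{}.txt".format(i // chunk_size + 1), "w") as out:
        out.write(",".join(str(n) for n in chunk) + "\n")
```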
Then, for the next step of the workflow, I created a download.sh shell script that consumes a comma-separated list of numbers and downloads the requested files. In the workflow, I set up a streaming action in Oozie, used the directory that contains the files generated above as the input (mapred.input.dir), and used download.sh as the mapper command and "cat" as the reducer command. I assumed that Hadoop would spawn a different mapper for each of the input files above.
This seems to work sometimes: it downloads the files correctly, but sometimes it just gets stuck trying to execute and I don't know why. I noticed that this happens when I increase the number of simultaneous downloads (e.g. going from 4 files per txt file to 20 and so forth).
So my question is: is this a correct way to implement parallel retrieval of files using MapReduce and Oozie? If not, how is this normally done with Oozie? I'm trying to get my CSV files into HDFS prior to running the Hive script, and I'm not sure what the best way to achieve that would be.
After looking deeper into this, it seems that creating an Oozie "fork" node would be the best approach. So I created a fork node, under which I created 6 shell actions that execute download.sh and take a list of file numbers as an argument. I ended up modifying the Python script so that it outputs the file numbers that need to be downloaded to STDOUT (instead of saving them on HDFS), and had Oozie capture that output and pass it as arguments to the download.sh forks.
The Cloudera Hue interface does not provide a way to create fork nodes (at least not that I was able to find), so I downloaded the workflow.xml file, added the fork nodes myself, and then re-imported it as a new workflow.
