Running pipelines with data parallelization - kedro

I've been running the Kedro tutorials (the Hello World and the Spaceflights one) and I'm wondering whether data parallelization is easily possible with Kedro.
Imagine a situation where I have a node that needs to be executed on millions of files.
I've seen that there's the option kedro run -p, but this only does task parallelization (as stated here https://kedro.readthedocs.io/en/latest/03_tutorial/04_create_pipelines.html).
Thanks for any feedback.

Kedro has a number of built-in DataSet classes. For IO parallelization, there is SparkDataSet, which delegates the IO parallelization to PySpark: https://kedro.readthedocs.io/en/latest/04_user_guide/09_pyspark.html#creating-a-sparkdataset
Another dataset is DaskDataSet, but this is still WIP in this PR: https://github.com/quantumblacklabs/kedro/pull/97 (if you want to use Dask, you could have a look at that PR and create your own custom dataset).
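As a rough sketch of what that looks like in code (the import path differs between Kedro versions, and the file path is hypothetical):

from kedro.io import DataCatalog
from kedro.contrib.io.pyspark import SparkDataSet  # newer versions: kedro.extras.datasets.spark.SparkDataSet

catalog = DataCatalog({
    "flights": SparkDataSet(
        filepath="data/01_raw/flights/*.parquet",  # hypothetical path
        file_format="parquet",
    )
})

# load() returns a pyspark.sql.DataFrame, so the partitioned, parallel IO
# is handled by Spark itself rather than by the Kedro runner.
flights = catalog.load("flights")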

Related

Does Kedro support Checkpointing/Caching of Results?

Let's say we have multiple long-running pipeline nodes.
It seems quite straightforward to checkpoint or cache the intermediate results, so that when nodes after a checkpoint are changed or added, only those nodes have to be executed again.
Does Kedro provide functionality to make sure that when I run the pipeline, only the steps that have changed are executed?
Also the reverse: is there a way to make sure that all steps that have changed are executed?
Let's say a pipeline producing some intermediate result changed; will it be executed when I run a pipeline that depends on its output?
TL;DR: Does Kedro have Makefile-like tracking of what needs to be done and what does not?
I think my question is similar to issue #341, but I do not require support for cyclic graphs.
You might want to have a look at the IncrementalDataSet, alongside the partitioned dataset documentation, specifically the section on incremental loads with the incremental dataset. It has a notion of "checkpointing", although confirming a checkpoint is a manual step and not automated like a Makefile.
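As a minimal sketch of that manual checkpointing step (assuming the kedro.io.IncrementalDataSet API and a hypothetical directory of CSV partitions):

from kedro.io import IncrementalDataSet

data_set = IncrementalDataSet(
    path="data/01_raw/partitions",  # hypothetical partition directory
    dataset="pandas.CSVDataSet",    # the dataset type string depends on your Kedro version
)

# load() only returns the partitions added since the last confirmed checkpoint,
# keyed by partition id.
new_partitions = data_set.load()

# Nothing is remembered until the checkpoint is explicitly confirmed; this is the
# manual step mentioned above (inside a pipeline you would pass `confirms` to the node).
data_set.confirm()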

Is there a way to parallelize spark.read.load(string*) when reading many files?

I noticed that in spark-shell (Spark 2.4.4), when I do a simple spark.read.format(xyz).load("a","b","c",...), it looks like Spark uses a single IPC client (or "thread") to load the files a, b, c, ... sequentially (they are paths on HDFS).
Is this expected?
The reason I am asking is that in my case I am trying to load 50K files, and the sequential load takes a long time.
Thanks
PS: I am trying to find this in the source code, but I am not sure if this is the right place:
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L180
Might not be an exact "answer" to my original question, but I found the reason in my particular case: the name node's audit log showed some runaway jobs pegging the name node, which greatly slowed down the RPC calls. After killing those bad jobs, Spark's load speed improved greatly.
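For reference, the equivalent call in PySpark with an explicit list of paths (all paths hypothetical) looks like this; note that the driver still has to resolve every root path against the name node before any tasks run, which is why slow name-node RPCs like the ones above can dominate the load time:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("many-files").getOrCreate()

# 50K hypothetical HDFS paths; load() accepts a list as well as a single string
paths = ["hdfs:///data/part-{:05d}".format(i) for i in range(50000)]
df = spark.read.format("parquet").load(paths)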

Best ETL Packages In Python

I have 2 use cases:
Extract, Transform and Load from Oracle / PostgreSQL / Redshift / S3 / CSV to my own Redshift cluster
Schedule the job so it runs daily/weekly (INSERT + TABLE or INSERT + NONE options preferable).
I am currently using:
SQLAlchemy for extracts (works well generally).
PETL for transforms and loads (works well on smaller data sets, but for ~50m+ rows it is slow and the connection to the database(s) times out).
An internal tool for the scheduling component (which stores the transforms in XML and then runs the loads from that XML, and seems rather long and complicated).
I have been looking through this link but would welcome additional suggestions. Exporting to Spark or similar is also welcome if there is an "easier" process where I can just do everything through Python (I'm only using Redshift because it seems like the best option).
You can try pyetl, an ETL framework written in Python 3:
from pyetl import Task, DatabaseReader, DatabaseWriter

# Source and target connections (here the same SQLite file for brevity)
reader = DatabaseReader("sqlite:///db.sqlite3", table_name="source")
writer = DatabaseWriter("sqlite:///db.sqlite3", table_name="target")

# Column mapping between source and target, plus optional per-column transform functions
columns = {"id": "uuid", "name": "full_name"}
functions = {"id": str, "name": lambda x: x.strip()}

Task(reader, writer, columns=columns, functions=functions).start()
How about Python and Pandas? This is what we use for our ETL processing.
I'm using Pandas to access my ETL files. Try doing something like this:
Create a class with all your queries in it.
Create another class that processes the actual data warehouse, using Pandas for the processing and Matplotlib for the graphs.
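A minimal sketch of that class layout (all names, queries and connection strings are hypothetical; for ~50m+ rows into Redshift you would normally stage the data in S3 and COPY it rather than insert through SQLAlchemy):

import pandas as pd
from sqlalchemy import create_engine

class Queries:
    # Keep all extract queries in one place
    SOURCE_ORDERS = "SELECT id, customer_id, amount, order_date FROM orders"

class Warehouse:
    # Extract with SQLAlchemy, transform with pandas, load into the target database
    def __init__(self, source_url, target_url):
        self.source = create_engine(source_url)
        self.target = create_engine(target_url)

    def run_orders(self):
        df = pd.read_sql(Queries.SOURCE_ORDERS, self.source)
        df["order_date"] = pd.to_datetime(df["order_date"])  # example transform
        df.to_sql("orders", self.target, if_exists="append", index=False)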
Consider having a look at the convtools library; it provides lots of data processing primitives, is pure Python, and has zero dependencies.
Since it generates ad hoc Python code under the hood, it sometimes outperforms pandas/polars, so it can fill some gaps in your workflows, especially those of a dynamic nature.

How can Spark take input after it is submitted

I am designing an application which requires very fast responses and needs to retrieve and process a large volume of data (>40 GB) from the Hadoop file system, given one input (command).
I am wondering whether it is possible to cache such a large amount of data in distributed memory using Spark and keep the application running all the time. If I give the application a command, it could start processing data based on that input.
I think caching data of that size is not a problem. However, how can I keep the application running and have it take input?
As far as I know, nothing can be done after the "spark-submit" command...
You can try Spark Job Server and its Named Objects to cache a dataset in distributed memory and use it across various input commands.
The requirement is not entirely clear, but based on my understanding:
1) In spark-submit, after the application JAR you can provide application-specific command-line arguments. But if you want to send commands after the job has started, you can write a Spark Streaming job that processes Kafka messages (see the sketch after this list).
2) HDFS is already optimised for processing large volumes of data. You can cache intermediate reusable data so that it does not get re-computed. But for better performance you might consider something like Elasticsearch/Cassandra, so that data can be fetched/stored even faster.
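A rough sketch of idea (1), keeping the large dataset cached in a long-running job and treating Kafka messages as commands (topic, broker, paths and column names are hypothetical; requires the spark-sql-kafka package, and uses Structured Streaming rather than the older DStream API):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("command-driven").getOrCreate()

# Load the large dataset once and pin it in distributed memory
big_df = spark.read.parquet("hdfs:///data/big_dataset")  # hypothetical path
big_df.cache()
big_df.count()  # force materialisation

commands = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "commands")                    # hypothetical topic
    .load()
)

def handle_batch(batch_df, batch_id):
    # Treat each Kafka message value as a command, e.g. a key to filter on
    for row in batch_df.selectExpr("CAST(value AS STRING) AS cmd").collect():
        big_df.filter(big_df["key"] == row["cmd"]).show()  # hypothetical column

commands.writeStream.foreachBatch(handle_batch).start().awaitTermination()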

Hadoop Spark (MapR) - how does addFile work?

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS, containing hundreds of files which I want to process with Spark.
In the book - Fast Data Processing with Spark
This requires the file to be available on all the nodes in the cluster, which isn't much of a
problem for a local mode. When in a distributed mode, you will want to use Spark's
addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this. Will Spark create a copy of the file on each node?
What I want is for it to read the files present in that directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
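For completeness, the same pattern in PySpark (paths are hypothetical): addFile ships a side file to every executor, and SparkFiles.get resolves its local path inside a task:

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="addfile-demo")
sc.addFile("/path/to/GeoIP.dat")  # local path on the driver

def lookup(ip):
    local_path = SparkFiles.get("GeoIP.dat")  # same file, local to the executor
    # ... open local_path with the GeoIP library and look up `ip` ...
    return ip

sc.parallelize(["1.2.3.4", "5.6.7.8"]).map(lookup).collect()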
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")
