See https://youtu.be/2pgaQIitxiQ at 55:30 and then 55:50.
It suggests that extract-load-transform (ELT) is more suitable for batch processing (at 55:30) and for stream processing (at 55:50).
I understand the idea is that as the stream data comes in, we load it and then transform it.
But isn't it the same concept in batch processing as well? The data comes in, we load it, and then we process it as a batch.
Historically, the target DB would not have the power/capabilities to effectively transform data, so a dedicated transformation server was used: we'd extract data from the source, transform it on the dedicated server, and load it into the target DB (ETL).
As the capabilities of DB servers (especially with the rise of cloud DBs) have improved, a dedicated transformation server is no longer needed, so we can load directly into the target DB and transform the data there (ELT).
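As a rough illustration of the ELT pattern (a minimal sketch only; sqlite3 stands in for the target DB, and the table names and values are invented), the raw data is loaded untouched first and the transformation then runs inside the database:

```python
import sqlite3

# "Target DB" stand-in; in practice this would be a cloud warehouse or similar.
conn = sqlite3.connect(":memory:")

# Load: land the extracted rows untouched in a staging table.
conn.execute("CREATE TABLE staging_orders (order_id INTEGER, amount TEXT)")
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?)",
    [(1, "10.50"), (2, "3.20"), (3, "7.00")],
)

# Transform: let the target DB itself do the work (casting, cleaning, aggregating).
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id, CAST(amount AS REAL) AS amount
    FROM staging_orders
""")

print(conn.execute("SELECT SUM(amount) FROM orders").fetchone())
conn.close()
```

In an ETL pipeline, by contrast, the casting/aggregation step would run on the dedicated transformation server before anything reaches the target DB.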
In relation to your specific question, whether it makes more sense to do (some or all) transformations in the stream or in the target DB depends on the type of transformation and the capabilities of the stream and DB systems. There is no right or wrong answer.
I have a processor that generates time series data in JSON format. Based on the received data I need to make a forecast using machine learning algorithms in Python, then write the new forecast values to another flowfile.
The problem is that when you run such a Python script, it must perform many heavy preprocessing operations: queries to a database, creating complex data structures, initializing forecasting models, etc.
If you use ExecuteStreamCommand, the script will be run again for every flowfile. Is this true?
Can I write a Python script in NiFi that starts once and then receives flowfiles many times, storing the history of previously received data? Or do I need to make an HTTP service that will receive the data from NiFi?
You have a few options:
Build a custom processor. This is my suggested approach. The code would need to be in Java (or Groovy, which provides a more Python-like experience) but would not have Python dependencies, etc. However, I have seen examples of this approach for ML model application (see Tim Spann's examples) and this is generally very effective. The initialization and individual flowfile trigger logic is cleanly separated, and performance is good.
Use InvokeScriptedProcessor. This will allow you to write the code in Python and separate the initialization (pre-processing, DB connections, etc., onScheduled in NiFi processor parlance) from the execution phase (onTrigger); a rough sketch follows this list. Some examples exist but I have not personally pursued this with Python specifically. You can use Python dependencies but not "native modules" (i.e. compiled C code), as the execution engine is still Jython.
Use ExecuteStreamCommand. Not strongly recommended. As you mention, every invocation would require the preprocessing steps to occur, unless you designed your external application in such a way that it ran a long-lived "server" component and each ESC command sent data to it and returned an individual response. I don't know what your existing Python application looks like, but this would likely involve complicated changes. Tim has another example using CDSW to host and deploy the model and NiFi to send it data via HTTP to evaluate.
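For option 2, here is a rough, untested Jython sketch of the InvokeScriptedProcessor pattern (the class name, relationship, and placeholder logic are mine, not a known working script). The expensive setup runs once when the script body is loaded; onTrigger then runs once per flowfile and can keep state on the processor instance:

```python
from org.apache.nifi.processor import Processor, Relationship
from org.apache.nifi.processor.io import StreamCallback
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets

class ForecastProcessor(Processor):

    def __init__(self):
        self.REL_SUCCESS = Relationship.Builder().name("success").build()
        # One-time, expensive initialization: DB queries, model loading, etc.
        # (placeholder only -- substitute your real setup here)
        self.model = None
        self.history = []  # state kept across flowfiles while the script stays loaded

    # Minimal implementations of the Processor interface methods
    def initialize(self, context):
        pass

    def getRelationships(self):
        return set([self.REL_SUCCESS])

    def validate(self, context):
        pass

    def getPropertyDescriptors(self):
        return []

    def getIdentifier(self):
        return None

    def onPropertyModified(self, descriptor, oldValue, newValue):
        pass

    def onTrigger(self, context, sessionFactory):
        session = sessionFactory.createSession()
        try:
            flowFile = session.get()
            if flowFile is None:
                return

            outer = self

            class ForecastCallback(StreamCallback):
                def process(self, inputStream, outputStream):
                    # Read the incoming JSON, remember it, write a forecast out.
                    text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
                    outer.history.append(text)
                    forecast = '{"forecast": "placeholder"}'  # run the model here
                    outputStream.write(bytearray(forecast.encode("utf-8")))

            flowFile = session.write(flowFile, ForecastCallback())
            session.transfer(flowFile, self.REL_SUCCESS)
            session.commit()
        except Exception:
            session.rollback(True)
            raise

# InvokeScriptedProcessor looks for a variable named "processor"
processor = ForecastProcessor()
```

Note that the instance (and thus the history list) persists only while the script stays loaded; it is re-created whenever InvokeScriptedProcessor reloads the script, e.g. on a stop/start.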
Make a custom processor that can do that; Java is more appropriate here. I believe you can do pretty much everything with Java, you just need to find the right libraries. There might be some issues with initialization and preprocessing, but you can handle all of that in the processor's init method in NiFi, which allows you to preserve the state of certain components.
In my use case, I had to build a custom processor that could take in images and count the number of people in each image. For that, I had to load a deep learning model once in the init method, and afterwards the onTrigger method could take a reference to that model every time it processed an image.
Overview
Currently my product maintains a DAL that is separated from the business logic and exposed via a set of services, where each service generally corresponds to an entity, i.e. Car objects are accessed via the CarService. These services are powered by Spring Data repositories and access data (models) stored in both PostgreSQL and Elasticsearch.
We are now processing more and more data (documents in, our models out; or documents in, clustering, models out) and have realized that computation has become a bottleneck. To overcome this we are evaluating Spark or Apache Beam to distribute the computation horizontally, which would solve the problem.
Problem
After looking into the Spark (and Beam) frameworks I have found that they generally provide their own integrations (or plugins) for reading/writing from/to datasources, which in and of itself is great. The problem for me is that I can't find any way for these frameworks to support distributed reading/writing through our current set of services. Spark requires RDDs and Beam requires PCollections, and I'd rather not support two methods of reading/writing from our datastores to accommodate them.
My Question
Has anyone encountered this before? What was your strategy?
Did you go ahead and support 2 types of DAL? If so, were there any caveats with this especially with regards to the ongoing maintenance of the code?
In software engineering, a multi-tier architecture is a client-server architecture in which the presentation, the application processing, and the data management are logically separate processes; this logical separation of concerns helps with performance, scalability, and maintenance.
Keep in mind that tiers are logical levels, which means there may or may not be as many physical layers.
If you are going with Image 1 then there is no need for new DAO layers, but for Image 2 I would suggest creating a separate project and using an EAI pattern so the two projects can communicate.
Image 1:
In Image 1 you process the data, keep the results in the database, and use the same DAO layer to get the data (see the sketch below).
Image 2:
In Image 2 you create a separate layer where you submit your jobs and collect the results directly in your Spring code.
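For the Image 1 route, a hedged PySpark sketch (the connection details, table, and column names are placeholders, and you would need the PostgreSQL JDBC driver on the Spark classpath): Spark reads from and writes back to the same PostgreSQL tables your Spring Data repositories already map, so the existing DAL keeps serving reads unchanged.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-compute").getOrCreate()

jdbc_opts = {
    "url": "jdbc:postgresql://db-host:5432/product",
    "user": "etl_user",
    "password": "secret",
    "driver": "org.postgresql.Driver",
}

# Read the source documents straight from the tables the services already use.
documents = (spark.read.format("jdbc")
             .options(dbtable="documents", **jdbc_opts)
             .load())

# The heavy computation happens in Spark, not in the service layer.
models = documents.groupBy("cluster_id").count()

# Results land in a table that the existing CarService-style services can read.
(models.write.format("jdbc")
       .options(dbtable="document_models", **jdbc_opts)
       .mode("append")
       .save())

spark.stop()
```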
https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html
Apache Spark and big data in general have different architecture styles; please read the following links:
http://lambda-architecture.net/
http://milinda.pathirage.org/kappa-architecture.com/
https://mapr.com/solutions/zeta-enterprise-architecture/
What are the differences between kappa-architecture and lambda-architecture
According to the Spark documentation,
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
I am currently working on a large dataset that, once processed, outputs an even bigger amount of data, which needs to be stored in text files, as is done with the command saveAsTextFile(path).
So far I have been using this method; however, since it is an action (as stated above) and not a transformation, Spark needs to send data from every partition to the driver node, thus slowing down the process of saving quite a bit.
I was wondering if any distributed file saving method (similar to saveAsTextFile()) exists on Spark, enabling each executor to store its own partition by itself.
I think you're misinterpreting what it means to send a result to the driver. saveAsTextFile does not send the data back to the driver. Rather, it sends the result of the save back to the driver once it's complete. That is, saveAsTextFile is distributed. The only case where it's not distributed is if you only have a single partition, or you've coalesced your RDD back to a single partition before calling saveAsTextFile.
What that documentation is referring to is sending the result of saveAsTextFile (or any other "action") back to the driver. If you call collect() then it will indeed send the data to the driver, but saveAsTextFile only sends a success/failure message back to the driver once complete. The save itself is still done on many nodes in the cluster, which is why you'll end up with many files - one per partition.
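A small PySpark sketch of that difference (the output path and partition count are arbitrary examples, not from the original post):

```python
from pyspark import SparkContext

sc = SparkContext(appName="save-vs-collect")

rdd = sc.parallelize(range(1000000), numSlices=8)   # 8 partitions
doubled = rdd.map(lambda x: x * 2)                  # lazy transformation, nothing runs yet

# Distributed write: each of the 8 partitions is written by its own executor task,
# producing part-00000 .. part-00007 under the output directory. Only the
# success/failure of the job comes back to the driver.
doubled.saveAsTextFile("/tmp/doubled-output")

# By contrast, collect() really does ship every element back to the driver:
# rows = doubled.collect()

sc.stop()
```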
IO is always expensive. But sometimes it can seem as if saveAsTextFile is even more expensive, precisely because of the lazy behavior described in that excerpt. Essentially, when saveAsTextFile is called, Spark may first have to perform many or all of the prior transformations before the data can be saved. That is what is meant by laziness.
If you have the Spark UI set up it may give you better insight into what is happening to the data on its way to a save (if you haven't already done that).
According to the Apache Avro project, "Avro is a data serialization system". By saying data serialization system, does it mean that Avro is a product or an API?
Also, I am not quite sure what a data serialization system is. For now, my understanding is that it is a protocol that defines how a data object is passed over the network. Can anyone explain it in an intuitive way, so that it is easier to understand for people with a limited distributed computing background?
Thanks in advance!
So when Hadoop was being written by Doug Cutting, he decided that the standard method of serializing Java objects, Java Object Serialization (Java Serialization), didn't meet his requirements for Hadoop. Namely, these requirements were:
Serialize the data into a compact binary format.
Be fast, both in performance and in how quickly it allows data to be transferred.
Be interoperable, so that other languages plug into Hadoop more easily.
As he described Java Serialization:
It looked big and hairy and I thought we needed something lean and mean
Instead of using Java Serialization they wrote their own serialization framework. The main perceived problem with Java Serialization was that it writes the classname of each object being serialized to the stream, with each subsequent instance of that class containing a 5-byte reference to the first, instead of the classname.
As well as reducing the effective bandwidth of the stream, this causes problems with random access and with sorting of records in a serialized stream. Thus Hadoop serialization doesn't write the classname or the required references, and assumes that the client knows the expected type.
Java Serialization also creates a new object for each one that is deserialized. Hadoop Writables, which implement Hadoop serialization, can be reused. This helps improve the performance of MapReduce, which essentially serializes and deserializes billions of records.
Avro fits into Hadoop in that it approaches serialization in a different manner. The client and server exchange a schema which describes the datastream. This helps make it fast and compact, and importantly it makes it easier to mix languages together.
So Avro defines a serialization format, a protocol for clients and servers to communicate these serial streams, and a way to compactly persist data in files.
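To make that concrete, here is a small sketch along the lines of the Avro Python quickstart (the schema and records are invented; depending on your avro package version the parse function may be spelled Parse):

```python
import json

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# The schema describes the datastream; writer and reader both rely on it.
schema = avro.schema.parse(json.dumps({
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
    ],
}))

# Serialize: records are written in a compact binary form; the schema is stored
# once in the file header rather than with every record.
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 7})
writer.close()

# Deserialize: the reader uses the embedded schema to hand back plain dicts,
# regardless of which language wrote the file.
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
```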
I hope this helps. I thought a bit of Hadoop history would help you understand why Avro is a subproject of Hadoop and what it's meant to help with.
If you have to store information such as a hierarchy or data structure implementation details in a compact file and pass that information over a network, you use data serialization. It is close in spirit to the XML or JSON formats. The benefit is that the information translated into any serialization format can be deserialized to regenerate the classes, objects, and data structures, whatever was serialized.
actual implementation --> serialization --> .xml or .json or .avro --> deserialization --> implementation in original form
Here is the link to the list of serialization formats. Comment if you want further information! :)
Where can I find the Pentaho Kettle architecture? I'm looking for a short wiki, design document, blog post, anything to give a good overview of how things work. This question is not meant for specific "how to" starting guides but rather for a good view of the technology and architecture.
Specific questions I have are:
How does data flow between steps? It would seem everything is in memory - am I right about this?
Is the above true about different transformations as well?
How are the Collect steps implemented?
Any specific performance guidelines for using it?
Is the ftp task reliable and performant?
Any other "Dos and Don'ts" ?
See this PDF.
How does data flow between steps? It would seem everything is in memory - am I right about this?
Data flow is row-based. In a transformation every step produces a 'tuple', i.e. a row with fields. Every field is a pair of data and metadata. Every step has inputs and outputs. A step takes rows from its input, modifies them, and sends them to its outputs. In most cases all the information is in memory. However, steps read data in a streaming fashion (e.g. via JDBC), so typically only a part of the stream's data is in memory at any time.
Is the above true about different transformations as well?
There is a 'job' concept and a 'transformation' concept. Everything written above is mostly true for transformations. Mostly, because a transformation can contain very different steps, and some of them, like collect steps, may try to collect all the data from a stream. Jobs are a way to perform actions that do not follow the 'streaming' concept, like sending an email on success, loading files from the network, or executing different transformations one by one.
How are the Collect steps implemented?
It depends on the particular step. Typically, as said above, collect steps may try to collect all the data from a stream, which can be a cause of OutOfMemory exceptions. If the data is too big, consider replacing 'collect' steps with a different approach to processing the data (for example, use steps that do not collect all the data).
Any specific performance guidelines for using it?
There are a lot, and they depend on the steps the transformation consists of and on the sources of data used. I would rather talk about an exact scenario than give general guidelines.
Is the ftp task reliable and performant?
As far as I remember, FTP is backed by the EdtFTP implementation, and there may be some issues with those steps, like some parameters not being saved, or the HTTP-to-FTP proxy not working, and so on. I would say Kettle in general is reliable and performant, but for some less commonly used scenarios it may not be.
Any other "Dos and Don'ts" ?
I would say the main Do is to understand the tool before starting to use it intensively. As mentioned in this discussion, there is some literature on Kettle/Pentaho Data Integration; you can try searching for it on specific sites.
One of the advantages of Pentaho Data Integration/Kettle is its relatively big community, which you can ask about specific aspects:
http://forums.pentaho.com/
https://help.pentaho.com/Documentation