How do you setup multiple Spark Streaming jobs with different batch durations? - hadoop

We are in the beginning phases of transforming the current data architecture of a large enterprise and I am currently building a Spark Streaming ETL framework in which we would connect all of our sources to destinations (source/destinations could be Kafka topics, Flume, HDFS, etc.) through transformations. This would look something like:
SparkStreamingEtlManager.addEtl(Source, Transformation*, Destination)
SparkStreamingEtlManager.streamEtl()
streamingContext.start()
The assumptions is that, since we should only have one SparkContext, we would deploy all of the ETL pipelines in one application/jar.
The problem with this is that the batchDuration is an attribute of the context itself and not of the ReceiverInputDStream (Why is this?). Do we need to therefore have multiple Spark Clusters, or, allow for multiple SparkContexts and deploy multiple applications? Is there any other way to control the batch duration per receiver?
Please let me know if any of my assumptions are naive or need to be rephrased. Thanks!

In my experience, different streams have different tuning requirements. Throughput, latency, capacity of the receiving side, SLAs to be respected, etc.
To cater for that multiplicity, we require to configure each Spark Streaming job to address said specificity. So, not only batch interval but also resources like memory and cpu, data partitioning, # of executing nodes (when the loads are network bound).
It follows that each Spark Streaming job becomes a separate job deployment on a Spark Cluster. That will also allow for monitoring and management of separate pipelines independently of each other and help in the further fine-tuning of the processes.
In our case, we use Mesos + Marathon to manage our set of Spark Streaming jobs running 3600x24x7.

Related

How different is the intent for Apache NiFi Stateless vs Apache NiFi MiNiFi

I'm looking at performing simple operations at scale (Data Extraction from files with Layout aware processing) the engine for which has a large startup time and the processing time on a file itself is in the order of a few minutes. Using a NiFi cluster (16+ nodes) for such processing results in the Apache NiFi cluster taking about 45 minutes for the cluster startup and being available (Deployment in Kubernetes). I was looking to see if Apache NiFi MiNiFi or Apache NiFi Stateless will be handy here to reduce the cluster startup time and also allow me to scale out the processing as required in an easier manner. Which of the two would be a better fit? I understand that MiNiFi itself is more for a data collection use case but was wondering if it still would fit my use case?
NiFi Stateless is basically an alternate runtime for NiFI flows. It is lighweight and doesn't persist data between restarts (hence the name :)
I think the information listed here is quite comprehensive:
https://github.com/apache/nifi/blob/main/nifi-stateless/nifi-stateless-assembly/README.md
https://bryanbende.com/development/2021/11/10/apache-nifi-stateless
NiFi MiNiFi is essentially a headless NiFi (doesn't support authoring flows) and has justa subset of NARs. On the other hand supports the C2 protocol and can download flows that way.
Based on your description if you have a single source and destination and you are ok with state lost between restarts stateless seems to be a better fit.

In memory processing with Apache Beam

I am running my own GRPC server collecting events coming from various data sources. The server is developed in Go and all the event sources send the events in a predefined format as a protobuf message.
What I want to do is to process all these events with Apache Beam in memory.
I looked through the docs of Apache Beam and couldn't find a sample that does something like I want. I'm not going to use Kafka, Flink or any other streaming platform, just process the messages in memory and output the results.
Can someone show me a direction of a right way to start coding a simple stream processing app?
Ok, first of all, Apache Beam is not a data processing engine, it's an SDK that allows you to create a unified pipeline and run it on different engines, like Spark, Flink, Google Dataflow, etc. So, to run a Beam pipeline you would need to leverage any of supported data processing engine or use DirectRunner, which will run your pipeline locally but (!) it has many limitations and was mostly developed for testing purposes.
As every pipeline in Beam, one has to have a source transform (bounded or unbounded) which will read data from your data source. I can guess that in your case it will be your GRPC server which should retransmit collected events. So, for the source transform, you either can use already implemented Beam IO transforms (IO connectors) or create your own since there is no GrpcIO or something similar for now in Beam.
Regarding the processing data in memory, I'm not sure that I fully understand what you meant. It will mostly depend on used data processing engine since in the end, your Beam pipeline will be translated in, for example, Spark or Flink pipeline (if you use SparkRunner or FlinkRunner accordingly) before actual running and then data processing engine will manage the pipeline workflow. Most of the modern engines do their best efforts to keep all processed data in memory and flush it on disk only in the last resort.

Deploy 2 different topologies on a single Nimbus with 2 different hardware

I have 2 sets of storm topologies in use today, one is up 24/7, and does it's own work.
The other, is deployed on demand, and handles a much bigger loads of data.
As of today, we have N supervisors instances, all from the same type of hardware (CPU/RAM), I'd like my on demand topology to run on stronger hardware, but as far as I know, there's no way to control which supervisor is assigned to which topology.
So if I can't control it, it's possible that the 24/7 topology would assign one of the stronger workers to itself.
Any ideas, if there is such a way?
Thanks in advance
Yes, you can control which topologies go where. This is the job of the scheduler.
You very likely want either the isolation scheduler or the resource aware scheduler. See https://storm.apache.org/releases/2.0.0-SNAPSHOT/Storm-Scheduler.html and https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html.
The isolation scheduler lets you prevent Storm from running any other topologies on the machines you use to run the on demand topology. The resource aware scheduler would let you set the resource requirements for the on demand topology, and preferentially assign the strong machines to the on demand topology. See the priority section at https://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html#Topology-Priorities-and-Per-User-Resource.

deploy bolts/spout to a specific supervisor

We are running a storm application using a single type on instance in AWS and a single topology to run our system.
This is causing some resource limitation issues.
The way we want to address this is by splitting our IO intense bolts into a cluster of a few dozens t1.small machines (for example) and all our CPU intense bolts to two large machines with lots of cpu & memory.
Basically what i am asking is, is there a way to start all this supervisors and then deploy one topology that include cpu intense bolts on the big machines and to the small machines the deploy IO bolts?
You can implement a custom scheduler using interface IScheduler.
See
http://www.exogeni.net/2015/04/enabling-site-aware-scheduling-for-apache-storm-in-exogeni/
https://dcvan24.wordpress.com/2015/04/07/metadata-aware-custom-scheduler-in-storm/
https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/DemoScheduler.java

What additional benefit does Yarn bring to the existing map reduce?

Yarn differs in its infrastructure layer from the original map reduce architecture in the following way:
In YARN, the job tracker is split into two different daemons called Resource Manager and Node Manager (node specific). The resource manager only manages the allocation of resources to the different jobs apart from comprising a scheduler which just takes care of the scheduling jobs without worrying about any monitoring or status updates. Different resources such as memory, cpu time, network bandwidth etc. are put into one unit called the Resource Container. There are different AppMasters running on different nodes which talk to a number of these resource containers and accordingly update the Node Manager with the monitoring/status details.
I want to know that how does using this kind of an approach increase the performance from the map-reduce perspective? Also, if there is any definitive content on the motivation behind Yarn and its benefits over the existing implementation of Map-reduce, please point me to the same.
Here are some of the articles (1, 2, 3) about YARN. These talk about the benefits of using YARN.
YARN is more general than MR and it should be possible to run other computing models like BSP besides MR. Prior to YARN, it required a separate cluster for MR, BSP and others. Now they they can coexist in a single cluster, which leads to higher usage of the cluster. Here are some of the applications ported to YARN.
From a MapReduce perspective in legacy MR there are separate slots for Map and Reduce tasks, but in YARN their is no fixed purpose of a container. The same container can be used for a Map task, Reduce task, Hama BSP Task or something else. This leads to better utilization.
Also, it makes it possible to run different versions of Hadoop in the same cluster which is not possible with legacy MR, which makes is easy from a maintenance point.
Here are some of the additional links for YARN. Also, Hadoop: The Definitive Guide, 3rd Edition has an entire section dedicated to YARN.
FYI, it had been a bit controversial to develop YARN instead of using some of frameworks which had been doing something similar and had been running for ages successfully with bugs ironed out.
I do not think that Yarn will speedup the existing MR framework. Looking into architecture we can see that the system now is more modular - but modularity usually contradicts higher performance.
It can be claimed that YARN has nothing to do with MapReduce. MapReduce just became one of the YARN applications. You can see it as moving from some embedded program to embeded OS with program within it
At the same time Yarn opens the door for different MR implementations with different frameworks. For example , if we assume that our dataset is smaller then cluster memory we can get much better performance. I think http://www.spark-project.org/ is one such example
To summarize it: Yarn does not improve the existing MR, but will enable other MR implementations to be better in all aspects.
All the above answers covered lot of information: I am simplifying all the information as follows:
MapReduce: YARN:
1. It is Platform plus Application It is a Platform in Hadoop 2.0 and
in Hadoop 1. 0 and it is only of doesn't exist in Hadoop 1.0
the applications in Hadoop 2.0
2. It is single use system i.e., It is multi purpose system, We can run
We can run MapReduce jobs only. MapReduce, Spark, Tez, Flink, BSP, MPP,
MPI, Giraph etc... (General Purpose)
3. JobTracker scalability i.e., Both Resource Management and
Both Resource Management and Application Management gets separated &
Job Management managed by RM+NM, Paradigm specific AMs
respectively.
4. Poor Resource Management Flexible Resource Management i.e.,
system i.e., slots (map/reduce) containers.
5. It is not highly available High availability and reliability.
6. Scaled out up to 5000 nodes Scaled out 10000 plus nodes.
7. Job->tasks Application -> DAG of Jobs -> tasks
8. Classical MapReduce = MapReduce Yarn MapReduce = MapReduce API +
API + MapReduce FrameWork MapReduce FrameWork + YARN System
+ MapReduce System So MR programs which were written over
Hadoop 1.0 run over Yarn also with out
changing a single line of code i.e.,
backward compatibility.
Let's see Hadoop 1.0 drawbacks, which have been addressed by Hadoop 2.0 with addition of Yarn.
Issue of Scalability : Job Tracker runs on a single machine even though you have thousands of nodes in Hadoop cluster. The responsibilities of Job tracker : Resource management, Job and Task schedule and monitoring. Since all these processes are running on a single node, this model is not scalable.
Issue of availability ( Single point of failure): Job Tracker is a single point of failure.
Resource utilization: Due to predefined number of Map & Reduce task slots, resources are not utilized properly. When all Mapper nodes are busy, Reducer nodes are idle and can't be used to process Mapper tasks.
Tight integration with Map Reduce framework: Hadoop 1.x can run Map reduce jobs only. Support for jobs other than Map Reduce jobs does not exists.
Now single Job Tracker bottleneck has been removed with YARN architecture in Hadoop 2.x
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is pure scheduler in the sense that it performs no monitoring or tracking of status for the application.
The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
Now advantages of YARN
Scalability issues have been resolved
No single point of failure. All components are highly available
Resource utilization has been improved with proper utilization of Map and reduce slots.
Non Map Reduce Jobs can be submitted
It looks like this link might be what you're looking for: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/.
My understanding is that YARN is supposed to be more generic. You can create your own YARN applications that negotiate directly with the Resource Manager for resources (1), and MapReduce is just one of several Application Managers that already exist (2).

Resources