Databricks Structured Streaming: where are the reliable receivers in the streaming architecture? - spark-streaming

All, trying to understand the Databricks Structured Streaming architecture.
Is this architecture diagram relevant for Structured Streaming as well?
If so, here are my questions:
Q1: I see the concept of reliable receivers here. Where do these reliable receivers live, on the driver or on the workers? In other words, does the read from the source happen on the workers or on the driver?
Q2: As the official Spark Streaming diagram shows, a receiver is a single machine that receives records. So if we have 20 partitions in the Event Hubs source, are we limited by the driver's core count for the maximum number of concurrent reads? In other words, can we only perform concurrent reads from the source, not parallel ones?
Q3: Related to Q2, does this mean that parallelism in Structured Streaming can be achieved only for processing?
Below is my version of the architecture; please let me know if it needs any changes.
Thanks in advance.

As per my understanding of the Spark Streaming documentation:
Answer for Q1: The receivers live on the worker nodes.
Answer for Q2: Since the receivers run on workers, in a cluster the driver's cores do not limit the receivers. Each receiver occupies a single core and is allocated in a round-robin fashion.
Answer for Q3: Read parallelism can be achieved by increasing the number of receivers/partitions on the source.
This is documented here.
Please correct me if this is incorrect. Thanks.
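One additional note on Structured Streaming specifically: as far as I understand, it has no long-running receivers at all. The driver plans each micro-batch and the executors read the source partitions in parallel, typically one task per source partition, so 20 partitions can be read by up to 20 parallel tasks. A minimal sketch, using the Kafka source for illustration (broker address and topic name are placeholders; the Event Hubs connector follows a similar readStream shape):

import org.apache.spark.sql.SparkSession

object StructuredReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-read-sketch")
      .getOrCreate()

    // No receiver here: each micro-batch is planned on the driver and the
    // executors read the source partitions in parallel, one task per partition.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "events")                     // placeholder topic
      .load()

    val query = events
      .selectExpr("CAST(value AS STRING) AS body")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}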

Related

Apache Kafka's Streams API vs Spark Streaming

I am comparing the throughput of Spark Streaming and Kafka Streams. My results show that Kafka Streams has a higher throughput than Spark Streaming. Is this correct? Shouldn't it be the other way around?
Thanks
No single streaming platform is universally faster than all others for every use case. Don't get fooled by benchmarketing results that compare apples to oranges (like Kafka Streams reading from a disk-based source vs. Spark Streaming reading from an in-memory source). You haven't posted your test, but it is entirely possible that it represents a use case (and test environment) in which Kafka Streams is indeed faster.

Throughput for Kafka, Spark, Elasticsearch Stack on GCP/Dataproc

I'm working on a research project where I installed a complete data analysis pipeline on Google Cloud Platform. We estimate unique visitors per URL in real time using HyperLogLog on Spark. I used Dataproc to set up the Spark cluster. One goal of this work is to measure the throughput of the architecture depending on the cluster size. The Spark cluster has three nodes (minimal configuration).
A data stream is simulated with our own data generators written in Java, using the Kafka producer API. The architecture looks as follows:
Data generators -> Kafka -> Spark Streaming -> Elasticsearch.
The problem is: as I increase the number of produced events per second on my data generators beyond ~1000 events/s, the input rate in my Spark job suddenly collapses and begins to vary a lot.
As you can see in the screenshot from the Spark Web UI, the processing times and scheduling delays stay constantly short while the input rate goes down.
Screenshot from Spark Web UI
I tested it with a very simple Spark job that only does a simple mapping, to exclude causes like slow Elasticsearch writes or problems with the job itself. Kafka also seems to receive and send all the events correctly.
Furthermore, I experimented with the Spark configuration parameters
spark.streaming.kafka.maxRatePerPartition and spark.streaming.receiver.maxRate
with the same result.
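For reference, this is roughly how such parameters are applied (spark-shell style snippet; the values, app name and batch interval below are placeholders, not my actual settings):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-throughput-test") // placeholder app name
  // Cap for the direct (receiver-less) Kafka integration, per partition per second.
  .set("spark.streaming.kafka.maxRatePerPartition", "2000")
  // Cap for receiver-based sources; has no effect on the direct approach.
  .set("spark.streaming.receiver.maxRate", "2000")

// Placeholder batch interval.
val ssc = new StreamingContext(conf, Seconds(2))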
Does anybody have any ideas about what is going wrong here? It really seems to come down to the Spark job or Dataproc... but I'm not sure. All CPU and memory utilizations seem to be okay.
EDIT: Currently I have two Kafka partitions on that topic (placed on one machine). But I think Kafka should manage more than 1500 events/s even with only one partition; the problem was also present with one partition at the beginning of my experiments. I use the direct approach with no receivers, so Spark reads concurrently from the topic with two worker nodes.
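For clarity, the receiver-less direct read looks roughly like this (a sketch assuming the spark-streaming-kafka-0-10 integration; broker, group id and topic are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val ssc = new StreamingContext(new SparkConf().setAppName("direct-read-sketch"), Seconds(2))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",      // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "throughput-test",  // placeholder group id
  "auto.offset.reset"  -> "latest"
)

// No receivers: each Kafka partition maps to an RDD partition, so the
// executors read the topic partitions in parallel.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams) // placeholder topic
)

stream.map(record => record.value).count().print()

ssc.start()
ssc.awaitTermination()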
EDIT 2: I found out what causes this bad throughput. I forgot to mention one component in my architecture. I use one central Flume agent to log all the events from my simulator instances via log4j over netcat. This Flume agent is the cause of the performance problem! I changed the log4j configuration to use asynchronous loggers (https://logging.apache.org/log4j/2.x/manual/async.html) via Disruptor. I scaled the Flume agent up to more CPU cores and RAM and changed the channel to a file channel, but it still performs badly. No effect... any other ideas on how to tune Flume performance?
Hard to say given the sparse information. I would suspect a memory issue - at some point, the servers may even start swapping. So, check the JVM memory utilization and swapping activity on all servers. Elasticsearch should be capable of handling ~15,000 records/second with little tweaking. Check the free and committed RAM on the servers.
As I mentioned before, CPU and RAM utilizations are totally fine. I found a "magic limit"; it seems to be exactly 1500 events per second. As soon as I exceed this limit, the input rate immediately begins to wobble.
The mysterious thing is that processing times and scheduling delays stay constant. So one can exclude backpressure effects, right?
The only thing I can guess is a technical limit with GCP/Dataproc... I didn't find any hints in the Google documentation.
Some other ideas?

Parallelism in Apache Storm

I am new to Apache Storm and trying to design a simple topology for my use case. The explanation for parallelism in Storm (Understanding the Parallelism of a Storm Topology) has left me with two queries:
1) Is it safe to assume that the same worker will have the executors for my spout as well as my bolt if I have only one worker?
2) Inter-worker communication uses ZeroMQ, which goes over the network, as opposed to the LMAX Disruptor queues used for intra-worker communication, which are faster as they are in-memory. Should I create a single worker for better performance?
Please answer the above queries and correct my understanding if incorrect.
1) Yes.
2) Using one worker per topology per machine is recommended, since inter-process communication is much more expensive in Storm.
refer to : https://storm.apache.org/documentation/FAQ.html
In my experience as well, using multiple workers on one machine for the same topology has a negative impact on throughput.
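For illustration, here is a rough sketch of wiring a topology onto a single worker (assuming Storm 1.x package names; MySpout and MyBolt are hypothetical placeholders for your own spout and bolt implementations):

import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.topology.TopologyBuilder

object SingleWorkerTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    // MySpout / MyBolt are placeholders for your IRichSpout / IRichBolt classes.
    builder.setSpout("spout", new MySpout, 2)                       // 2 spout executors
    builder.setBolt("bolt", new MyBolt, 4).shuffleGrouping("spout") // 4 bolt executors

    val conf = new Config
    // A single worker JVM: all executors above run inside it, so spout -> bolt
    // tuples travel over in-process disruptor queues rather than the
    // inter-worker network transport.
    conf.setNumWorkers(1)

    StormSubmitter.submitTopology("single-worker-demo", conf, builder.createTopology())
  }
}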

Is Spark Streaming with a custom receiver a more generalized replacement for Flume in all use cases?

Our use case is (1) consuming data from ActiveMQ, (2) performing transformations through a general purpose reusable streaming process, and then (3) publishing to Kafka. In our case, step (2) would be a reusable Spark Streaming 'service' that would provide an event_source_id, enrich each record with metadata, and then publish to Kafka.
The straightforward approach I see is ActiveMQ -> Flume -> Spark Streaming -> Kafka.
Flume seems like an unnecessary extra step and extra network traffic. As far as I can tell, a Spark Streaming custom receiver would provide a more general solution for ingestion into Hadoop (step 1) and allow more flexibility for transforming the data, as it is an inherent step in Spark Streaming itself, the downside being a loss of coding ease.
I would love to gain some insight from my more experienced peers as we are in the beginning stages of transforming a large data architecture; please help with any suggestions/insights/alternatives you can think of.
Thank you world
In theory, Flume should help you create a more efficient ingestion path into HDFS.
If you use Spark Streaming, then depending on how large you make your micro-batches it may not be that efficient - but if your use case needs to be closer to real time, then I think you could do it with Spark Streaming directly, yes.
Most applications would want to store the original data in HDFS so as to be able to refer back to it. Flume would help with that - but if you don't have that need, you may want to skip it. Also, you could always persist your RDD in Spark at any point.
Also, if you want to consume in real time, you may want to look at Storm.
Your use case is weakly defined though, so more info on the constraints (volume, time requirements, how you want to expose this info, etc.) would help in getting more concrete answers.
EDIT: Here is a link where they go from a 1-hour Flume + Hadoop cycle to 5-second cycles - still using Flume to help with ingestion scalability. So it's up to your use case whether to use Flume there or not... I'd say it makes sense to separate the ingestion layer if you want that data to, e.g., be consolidated in a lambda-like architecture.
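If you do go down the custom receiver route, the skeleton is fairly small. A rough sketch, where the actual ActiveMQ/JMS polling is left as a hypothetical placeholder (pullFromActiveMQ):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch of a custom receiver; pullFromActiveMQ() is a hypothetical stand-in
// for real JMS consumer code against the ActiveMQ broker.
class ActiveMQReceiver(brokerUrl: String, queueName: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Spark calls onStart() on an executor; it must not block, so the
    // consume loop runs on its own thread.
    new Thread("ActiveMQ Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // Close the JMS connection here; the receive loop exits once isStopped() is true.
  }

  private def receive(): Unit = {
    while (!isStopped()) {
      pullFromActiveMQ().foreach(msg => store(msg)) // store() hands records to Spark
    }
  }

  // Hypothetical blocking poll of the ActiveMQ queue.
  private def pullFromActiveMQ(): Option[String] = None
}

You would then wire it in with ssc.receiverStream(new ActiveMQReceiver(brokerUrl, queueName)) and continue with the enrichment and the Kafka sink from there.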

Can druid replace hadoop?

Druid is used for both real-time and batch processing. But can it totally replace Hadoop?
If not, why? That is, what is the advantage of Hadoop over Druid?
I have read that Druid is used along with Hadoop. So can the use of Hadoop be avoided?
We are talking about two slightly related but very different technologies here.
Druid is a real-time analytics system and is a perfect fit for timeseries and time based events aggregation.
Hadoop is HDFS (a distributed file system) + Map Reduce (a paradigm for executing distributed processes), which together have created an eco system for distributed processing and act as underlying/influencing technology for many other open source projects.
You can set up Druid to use Hadoop, that is, to fire MR jobs to index batch data and to read its indexed data from HDFS (of course, it will cache it on the local disk).
If you want to ignore Hadoop, you can do your indexing and loading from a local machine as well, of course with the penalty of being limited to one machine.
Can you avoid using Hadoop with Druid? Yes, you can stream data in real-time into a Druid cluster rather than batch-loading it with Hadoop. One way to do this is to stream data into Kafka, which will handle incoming events and pass them into Storm, which can then process and load them into Druid Realtime nodes.
Typically this setup is used with Hadoop in parallel, because streamed real-time data comes with its own baggage and often needs to be fixed up and backfilled. That whole architecture has been dubbed "Lambda" by some.
Druid is used for both real-time and batch processing. But can it totally replace Hadoop? If not, why?
It depends on your use cases. Have a look at the official Druid documentation.
Druid is a good choice for the use cases below:
Insert rates are very high, but updates are less common
Most queries are aggregations and reporting queries, with low latency requirements of 100 ms to a few seconds.
Data has a time component
Load data from Kafka, HDFS, flat files, or object storage like Amazon S3
Druid is not a good choice for the use cases below:
Need low-latency updates of existing records using a primary key. Druid supports streaming inserts, but not streaming updates
Building an offline reporting system where query latency is not very important.
Queries that involve big joins
So if you are looking for an offline reporting system where query latency is not important, Hadoop may score better in that scenario.
