Apache Storm: maximum tuple size limit

What is the maximum tuple size Apache Storm can handle (if any)?
I couldn't find any information about this in the official Storm documentation, which leads me to think there is no limit (besides the heap size, of course). On the storm-user mailing list, I found a thread asking about this, but it turned out the user's error was due to a serialization issue.
My topology needs to process tuples of a few megabytes and send them to Kafka. I'm currently hitting Kafka's 1 MB message limit, but before working on that, I want to know whether there is any limit on Storm's side.

I don't believe there's a limit. Storm serializes tuples with Kryo by default and falls back to Java serialization for unregistered classes (useful for prototyping, but not recommended for production workloads) when sending tuples between workers.
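If the bottleneck turns out to be Kafka rather than Storm, the 1 MB ceiling is Kafka's default message size. Below is a minimal, hedged sketch of raising it on the producer side; the broker address, topic name, and 5 MB cap are placeholder values, and the broker's message.max.bytes / replica.fetch.max.bytes plus the consumer's fetch size would have to be raised to match.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class LargeMessageProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");  // placeholder
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Allow producer requests above the 1 MB default. The broker's
        // message.max.bytes (and replica.fetch.max.bytes) and the consumer's
        // fetch size must be raised accordingly, otherwise the broker will
        // reject the records or the consumer will never fetch them.
        props.put("max.request.size", Integer.toString(5 * 1024 * 1024));

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = new byte[3 * 1024 * 1024];  // a ~3 MB tuple body
            producer.send(new ProducerRecord<>("large-tuples", "key-1", payload)); // placeholder topic
        }
    }
}
```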

Related

Databricks Structured Streaming: where are the reliable receivers in the streaming architecture?

All, I'm trying to understand the Databricks Structured Streaming architecture.
Is this architecture diagram relevant for Structured Streaming as well?
If so here are my questions:
Q1: I see the concept of reliable receivers here. Where do these reliable receivers live, on the driver or on the workers? In other words, does the reading from the source happen on the workers or on the driver?
Q2: As we see in the official Spark Streaming diagram, a receiver is a single machine that receives records. So if we have 20 partitions in the Event Hubs source, are we limited by the driver's core count for the maximum number of concurrent reads? In other words, can we only perform concurrent reads from the source, not parallel reads?
Q3: Related to Q2, does this mean that parallelism in Structured Streaming can be achieved only for processing?
Below is my version of the architecture; please let me know if it needs any changes.
Thanks in advance.
As per my understanding of the Spark Streaming documentation:
Answer for Q1: The receivers live on the worker nodes.
Answer for Q2: Since the receivers run on the workers, the driver's cores do not limit the number of receivers in a cluster. Each receiver occupies a single core and is allocated in a round-robin fashion.
Answer for Q3: Read parallelism can be achieved by increasing the number of receivers/partitions on the source.
This is documented here.
Please correct me if this is incorrect. Thanks.
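For what it's worth, here is a minimal sketch of how a Structured Streaming read looks with the built-in Kafka source (used here only as a stand-in for Event Hubs; the broker address and topic are placeholders). There is no long-running receiver in this model: each micro-batch schedules one read task per source partition on the executors, so 20 partitions allow up to 20 parallel reads independent of the driver's core count.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class StructuredReadSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("structured-read-sketch")
            .getOrCreate();

        // Receiver-less source: read parallelism follows the topic's partitions.
        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092") // placeholder
            .option("subscribe", "events")                    // placeholder topic
            .load();

        events.writeStream()
            .format("console")
            .start()
            .awaitTermination();
    }
}
```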

Throughput for Kafka, Spark, Elasticsearch Stack on GCP/Dataproc

I'm working on a research project where I installed a complete data analysis pipeline on Google Cloud Platform. We estimate unique visitors per URL in real time using HyperLogLog on Spark. I used Dataproc to set up the Spark cluster. One goal of this work is to measure the throughput of the architecture depending on the cluster size. The Spark cluster has three nodes (minimal configuration).
A data stream is simulated with my own data generators, written in Java using the Kafka producer API. The architecture looks as follows:
Data generators -> Kafka -> Spark Streaming -> Elasticsearch.
The problem: as I increase the number of events per second produced by my data generators beyond ~1000 events/s, the input rate in my Spark job suddenly collapses and begins to vary a lot.
As you can see in the screenshot from the Spark Web UI, the processing times and scheduling delays stay consistently short while the input rate goes down.
Screenshot from Spark Web UI
I tested it with a deliberately simple Spark job that only does a basic mapping, to rule out causes like slow Elasticsearch writes or problems with the job itself. Kafka also seems to receive and send all the events correctly.
Furthermore, I experimented with the Spark configuration parameters
spark.streaming.kafka.maxRatePerPartition and spark.streaming.receiver.maxRate
with the same result.
Does anybody have an idea what is going wrong here? It really seems to be down to the Spark job or Dataproc... but I'm not sure. All CPU and memory utilization looks okay.
EDIT: Currently I have two Kafka partitions on that topic (both on one machine). But I think Kafka should manage well over 1500 events/s even with a single partition, and the problem was already there with one partition at the beginning of my experiments. I use the direct approach with no receivers, so Spark reads from the topic concurrently with two worker nodes.
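For context, here is a minimal sketch of that receiver-less direct read with the spark-streaming-kafka-0-10 integration, including the per-partition rate cap mentioned above. The broker address, group id, topic name, batch interval, and rate value are placeholder assumptions, not the exact setup used here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public final class DirectStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
            .setAppName("direct-kafka-benchmark")
            // Cap the per-partition read rate; with 2 partitions this allows
            // at most 2 x 1000 records per second.
            .set("spark.streaming.kafka.maxRatePerPartition", "1000");

        JavaStreamingContext jssc =
            new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker:9092");   // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "throughput-test");        // placeholder
        kafkaParams.put("auto.offset.reset", "latest");

        // Direct (receiver-less) stream: one Spark task per Kafka partition,
        // so two partitions are read by two tasks in parallel.
        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Arrays.asList("events"), kafkaParams));    // placeholder topic

        stream.map(ConsumerRecord::value).count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```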
EDIT 2: I found out what causes this bad throughput. I forgot to mention one component in my architecture: I use one central Flume agent to collect all the events from my simulator instances via log4j over netcat. This Flume agent is the cause of the performance problem! I changed the log4j configuration to use asynchronous loggers (https://logging.apache.org/log4j/2.x/manual/async.html) via the Disruptor. I also scaled the Flume agent up to more CPU cores and RAM and changed the channel to a file channel, but performance is still bad. No effect... any other ideas on how to tune Flume?
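For reference, the all-asynchronous Log4j 2 setup mentioned above boils down to selecting the async context selector before any logger is created. This is only a sketch of that one step, assuming Log4j 2 with the LMAX Disruptor jar on the classpath (newer Log4j versions also accept the log4j2.contextSelector property name).

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public final class AsyncLoggingBootstrap {
    public static void main(String[] args) {
        // Must be set before the first Logger is obtained; equivalent to
        // passing -DLog4jContextSelector=... on the generator's JVM command line.
        System.setProperty("Log4jContextSelector",
            "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");

        Logger log = LogManager.getLogger(AsyncLoggingBootstrap.class);
        log.info("event generator started with all-async loggers");
    }
}
```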
Hard to say given the sparse amount of information. I would suspect a memory issue: at some point, the servers may even start swapping. So, check the JVM memory utilization and swapping activity on all servers. Elasticsearch should be capable of handling ~15,000 records/second with a little tweaking. Check the free and committed RAM on the servers.
As I mentioned before, CPU and RAM utilization is totally fine. I found a "magic limit"; it seems to be exactly 1500 events per second. As soon as I exceed this limit, the input rate immediately begins to wobble.
The mysterious thing is that processing times and scheduling delays stay constant, so one can rule out backpressure effects, right?
The only thing I can guess is a technical limit with GCP/Dataproc... but I didn't find any hints in the Google documentation.
Any other ideas?

Where does Apache Storm store tuples before a node is available to process it?

I am reading up on Apache Storm to evaluate if it is suited for our real time processing needs.
One thing that I couldn't figure out so far is where it stores the tuples while the next node is not available to process them. For example, let's say spout A produces 1000 tuples per second, but the next level of bolts (which process spout A's output) can only collectively consume 500 tuples per second. What happens to the other tuples? Does it have a disk-based buffer (or something else) to account for this?
Storm uses internal in-memory message queues. Thus, if a bolt cannot keep up with processing, the messages are buffered there.
Before Storm 1.0.0, those queues could grow without bound (i.e., you get an out-of-memory exception and your worker dies). To protect against data loss, you need to make sure that the spout can re-read the data (see https://storm.apache.org/releases/1.0.0/Guaranteeing-message-processing.html).
You can use the max.spout.pending parameter to limit the number of tuples in flight and tackle this problem, though.
As of Storm 1.0.0, backpressure is supported (see https://storm.apache.org/2016/04/12/storm100-released.html). This allows a bolt to notify its upstream producers to "slow down" if a queue grows too large (and to speed up again once the queue drains). In your spout-bolt example, the spout would slow down its emit rate in this case.
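As a rough illustration of the max.spout.pending route, here is a minimal sketch; MySpout and MyBolt are hypothetical stand-ins for your own reliable spout and bolt, and the package names assume Storm 1.x. The cap only takes effect when the spout emits tuples with message IDs so they can be acked.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public final class PendingCapTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout-a", new MySpout(), 1);      // hypothetical spout
        builder.setBolt("slow-bolt", new MyBolt(), 4)       // hypothetical bolt
               .shuffleGrouping("spout-a");

        Config conf = new Config();
        // At most 500 tuples emitted by each spout task may be un-acked at
        // any time; the spout pauses until earlier tuples are acked or failed.
        conf.setMaxSpoutPending(500);
        conf.setNumWorkers(2);

        StormSubmitter.submitTopology("capped-topology", conf,
            builder.createTopology());
    }
}
```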
Typically, Storm spouts read off of some persistent store and track the completion of tuples to determine when it is safe to remove or ack a message in that store. Vanilla Storm itself does not persist tuples; tuples are replayed from the source in the event of a failure.
I have to agree with others that you should check out Heron. Stream processing frameworks have advanced significantly since the inception of Storm.

Tools to test a BigData Pipeline end to end?

I have this pipeline: Webserver+rsyslog->Kafka->Logstash->ElasticSearch->Kibana
I have found these tools to help test my pipeline:
Generate webserver load by spinning up jmeter EC2 instances with jmeter-ec2
Generate load on Kafka and help graph throughput with Sangrenel
I am wondering if anyone had any other suggestions for testing components or end-to-end testing? Thanks.
Great question! I am looking for something similar but may settle on a simple home-grown solution.
Set up a Storm cluster with bolts writing data to Kafka. One thing to watch out for is the ID/key, so that your messages are distributed across multiple partitions. The point of Storm here is to have a distributed set of publishers. As an alternative to Storm, you can run multiple producers, say with a KafkaAppender (a minimal producer sketch follows this answer).
Once you know your Kafka performance, connect Logstash to the loaded topic and let it drain as fast as possible. You may find useful information with Kafka Manager or by connecting to JMX (there are many tools for that).
The simplest way to monitor Elasticsearch is Marvel.
The performance of Kibana depends on the amount of data your queries return, but the smallest interval is still 5 seconds.
In my experience, Logstash performance depends on data size and grok complexity. Elasticsearch performance is mostly a matter of cluster size and shard/template configuration. The fastest component in your setup will always be Kafka (bounded by ack and ZooKeeper settings).
Also, if you control data generation, you can compare the time a record was generated with Logstash's @timestamp and measure the lag.
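If the Storm route is overkill, a plain multi-threaded Kafka producer can serve as the load generator. The sketch below is only illustrative: the broker address, topic name, thread count, and message count are assumptions to be replaced, and varying the key spreads records across partitions as noted above.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class LoadGenerator {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");  // placeholder
        props.put("acks", "1");                         // ack setting strongly affects throughput
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        final int threads = 4;                 // concurrent publishers
        final int messagesPerThread = 100_000; // load per publisher

        Runnable publisher = () -> {
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < messagesPerThread; i++) {
                    // Distinct keys spread records across all partitions.
                    String key = Thread.currentThread().getName() + "-" + i;
                    producer.send(new ProducerRecord<>("weblogs", key,   // placeholder topic
                        "synthetic log line " + i));
                }
            }
        };

        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(publisher, "publisher-" + t);
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```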

Capacity of Bolt in Apache Storm is over 1

As per the given link, the capacity of a bolt is the percentage of time spent executing. Therefore this value should always be smaller than 1, but in my topology I have observed that it goes above 1 in some cases. How is this possible, and what does it mean?
http://i.stack.imgur.com/rwuRP.png
It means that your bolt is running over capacity and your topology will fall behind in processing if the bolt is unable to catch up.
When you see a bolt that is running over (or close to over) capacity, that is your clue that you need to start tuning performance and tweaking parallelism.
Some things you can do:
Increase the parallelism of the bolt by increasing the number of executors & tasks (see the sketch at the end of this answer).
Do some simple profiling within your slow bolts to see if you have a performance problem.
You can get more detail about what happens in your bolts using Storm's metrics:
https://storm.apache.org/documentation/Metrics.html
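As a rough sketch of the first point (increasing executors and tasks), assuming Storm 1.x package names; MySpout and MyBolt are hypothetical stand-ins for your own components. Provisioning more tasks than executors leaves headroom so the executor count can later be raised with storm rebalance without redeploying.

```java
import org.apache.storm.topology.TopologyBuilder;

public final class ParallelismSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("events", new MySpout(), 2);   // hypothetical spout, 2 executors

        // 8 executors (threads) for the overloaded bolt, backed by 16 tasks
        // so parallelism can later be scaled up via `storm rebalance`.
        builder.setBolt("hot-bolt", new MyBolt(), 8)    // hypothetical bolt
               .setNumTasks(16)
               .shuffleGrouping("events");
    }
}
```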
