I am new to Apache Storm and trying to design a simple topology for my use case. The explanation for parallelism in Storm (Understanding the Parallelism of a Storm Topology) has left me with two queries:
1) Is it safe to assume that the same worker will have the executors for my spout as well as my bolt if I have only one worker?
2) Inter-worker communication uses ZeroMQ, which communicates over the network, as opposed to the LMAX Disruptors used for intra-worker communication, which are faster as they are in-memory. Should I create a single worker for better performance?
Please answer the above queries and correct my understanding if incorrect.
1) Yes.
2) Using one worker per topology per machine is recommended, since inter-worker communication is much more expensive in Storm than intra-worker communication.
Refer to: https://storm.apache.org/documentation/FAQ.html
In my experience as well, using multiple workers on one machine for the same topology has a negative impact on throughput.
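To make that concrete, here is a minimal sketch of pinning a topology to a single worker via Config.setNumWorkers. It uses the TestWordSpout and TestWordCounter helper classes that ship with storm-core; the topology name and layout are made up, and on pre-1.0 Storm the imports live under backtype.storm instead of org.apache.storm:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.testing.TestWordCounter;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;

    public class SingleWorkerTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new TestWordSpout(), 1);
            builder.setBolt("counter", new TestWordCounter(), 2)
                   .shuffleGrouping("words");

            Config conf = new Config();
            // One worker process: all executors share the same JVM, so every
            // tuple transfer stays in-memory (LMAX Disruptor) and none
            // crosses the network.
            conf.setNumWorkers(1);

            StormSubmitter.submitTopology("single-worker-topology", conf,
                    builder.createTopology());
        }
    }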
All, trying to understand the Databricks Structured Streaming architecture.
Is this architecture diagram relevant for Structured Streaming as well?
If so here are my questions:
Q1: I see here the concept of reliable receivers. Where do these reliable receivers live, on the driver or on a worker? In other words, does reading from the source happen on the worker or on the driver?
Q2: As we see in the official Spark Streaming diagram, a receiver is a single machine that receives records. So if we have 20 partitions in the Event Hubs source, are we limited by the driver's core count for the maximum number of concurrent reads? In other words, can we only perform concurrent reads of the source, not parallel ones?
Q3: Related to Q2, does this mean that parallelism in Structured Streaming can be achieved only for processing?
Below is my version of the architecture; please let me know if it needs any changes.
Thanks in advance.
As per my understanding of the Spark Streaming documentation:
Answer for Q1: The receivers live on the worker nodes.
Answer for Q2: Since the receivers run on workers, in a cluster the driver's cores do not limit the receivers. Each receiver occupies a single core and is allocated to workers in a round-robin fashion.
Answer for Q3: Read parallelism can be achieved by increasing the number of receivers/partitions on the source.
This information is documented here.
Please correct me if this is incorrect. Thanks.
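To make the receiver-based read parallelism from Q2/Q3 concrete, here is a minimal sketch using the DStream API, assuming a plain socket source on a hypothetical host "stream-host" and port 9999; each socketTextStream call creates one receiver, and each receiver occupies one core on a worker:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class MultiReceiverSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("multi-receiver-sketch");
            JavaStreamingContext ssc =
                    new JavaStreamingContext(conf, Durations.seconds(10));

            // Five receivers -> five parallel read streams; the scheduler
            // pins each receiver to one worker core.
            List<JavaDStream<String>> streams = new ArrayList<>();
            for (int i = 0; i < 5; i++) {
                streams.add(ssc.socketTextStream("stream-host", 9999));
            }

            // Union the per-receiver streams into one DStream for processing.
            JavaDStream<String> unified = streams.get(0);
            for (int i = 1; i < streams.size(); i++) {
                unified = unified.union(streams.get(i));
            }
            unified.print();

            ssc.start();
            ssc.awaitTermination();
        }
    }

Note that with the built-in Kafka/Event Hubs sources in Structured Streaming there are no long-running receivers at all; reads are scheduled as ordinary tasks on the workers, one per source partition, so read parallelism follows the partition count.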
I have just started learning about Apache Storm. One thing that I cannot understand is whether the entire topology is replicated on at least one worker process on a supervisor node. If that is the case, then isn't a component in the topology which is very compute-intensive (and which would possibly perform better executed on a single machine by itself) a potential bottleneck? If not, I assume Nimbus in some way "distributes" parts of the topology across the cluster. How does it know how to optimally "distribute" the topology?
Storm does not replicate a topology. If you deploy a topology, all executor threads are distributed evenly over all worker nodes (using a round-robin scheduling mechanism). The number of worker nodes a topology can use can be configured via Config.setNumWorkers(int).
If you have a compute-intensive bolt and you want to ensure that it is deployed to its own worker, you need to implement a custom scheduler. See here for more details: https://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
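As a rough illustration of what such a pluggable scheduler looks like, here is a minimal sketch using the Storm 1.x packages; the component id "heavy-bolt" and the hostname "dedicated-host" are made-up placeholders. It pins the compute-intensive bolt's executors to one supervisor and delegates everything else to the built-in EvenScheduler:

    import java.util.List;
    import java.util.Map;
    import org.apache.storm.scheduler.Cluster;
    import org.apache.storm.scheduler.EvenScheduler;
    import org.apache.storm.scheduler.ExecutorDetails;
    import org.apache.storm.scheduler.IScheduler;
    import org.apache.storm.scheduler.SupervisorDetails;
    import org.apache.storm.scheduler.Topologies;
    import org.apache.storm.scheduler.TopologyDetails;
    import org.apache.storm.scheduler.WorkerSlot;

    public class DedicatedBoltScheduler implements IScheduler {

        @Override
        public void prepare(Map conf) { }

        @Override
        public void schedule(Topologies topologies, Cluster cluster) {
            for (TopologyDetails topology : topologies.getTopologies()) {
                if (!cluster.needsScheduling(topology)) {
                    continue;
                }
                // Executors per component that still need a slot.
                Map<String, List<ExecutorDetails>> pending =
                        cluster.getNeedsSchedulingComponentToExecutors(topology);
                List<ExecutorDetails> heavy = pending.get("heavy-bolt");
                if (heavy == null) {
                    continue;
                }
                for (SupervisorDetails supervisor : cluster.getSupervisors().values()) {
                    if (!"dedicated-host".equals(supervisor.getHost())) {
                        continue;
                    }
                    List<WorkerSlot> slots = cluster.getAvailableSlots(supervisor);
                    if (!slots.isEmpty()) {
                        // Pin all "heavy-bolt" executors to one slot on that host.
                        cluster.assign(slots.get(0), topology.getId(), heavy);
                    }
                }
            }
            // Let the default even scheduler place the remaining executors.
            new EvenScheduler().schedule(topologies, cluster);
        }
    }

The scheduler is registered in storm.yaml on the Nimbus node via the storm.scheduler property (set to the fully qualified class name), with the jar on Nimbus's classpath.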
I am trying to learn the parallelism and scalability features offered by Storm and have read the following article: http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html. I am confused about whether Storm supports data parallelism or task parallelism. What I could understand (I may be wrong) is that Storm supports task parallelism (since the degree of parallelism is restricted by the number of tasks in the topology). If this is the case, then how can it be used for large-scale parallel data processing, which requires data parallelism?
Any help would be greatly appreciated. Thanks :)
Storm does not follow textbook terminology. In fact, Storm supports data, task, and pipeline parallelism.
If you have an operator and assign a parallelism larger than one (parallelism_hint), you get as many threads as specified by the parameter, each executing the same code on different data, i.e., you get data parallelism. You can further assign the parameter number_of_tasks (which must be >= parallelism_hint) to split the input data into number_of_tasks partitions/substreams (i.e., more partitions than executors). Thus, some executor threads need to process multiple partitions/substreams (called tasks in Storm). This does not increase the parallelism (only, perhaps, the concurrency). However, it allows the number of executors to be changed at runtime.
As you have multiple spouts and bolts in your topology, and all those spouts and bolts are executed in different threads and even on different machines, you also have task parallelism (not to be confused with Storm's usage of the term task!). As there are producer/consumer relationships between spouts/bolts, you additionally get pipeline parallelism here, which is a special form of task parallelism. Another form of task parallelism in Storm is the ability to run multiple topologies at the same time.
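A minimal sketch of how these knobs are set (again with the test spout/bolt that ship with storm-core; the numbers are arbitrary): parallelism_hint is the third argument of setSpout/setBolt, the task count is set via setNumTasks, and the worker count via Config.setNumWorkers:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.testing.TestWordCounter;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismSketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new TestWordSpout(), 2);   // 2 executors
            builder.setBolt("counter", new TestWordCounter(), 2) // 2 executors...
                   .setNumTasks(4)     // ...sharing 4 tasks (2 tasks per thread)
                   .shuffleGrouping("words");

            Config conf = new Config();
            conf.setNumWorkers(2);     // spread executors over 2 worker processes

            StormSubmitter.submitTopology("parallelism-sketch", conf,
                    builder.createTopology());
        }
    }

Because the bolt has 4 tasks, its executor count can later be raised at runtime up to 4, e.g. with "storm rebalance parallelism-sketch -e counter=4".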
We have Storm configured on a single-node development server with most of the configuration set to defaults (not local mode).
The Storm nimbus, supervisor, and workers all run on that single node, and the UI is configured as well.
AFAIK, parallelism and configuration differ from topology to topology.
I think finding the right parallelism and configuration is a matter of trial and error only.
So, to find the best parallelism, we started testing our Storm topology with various configurations on that single node.
Strangely, the results were not what we expected:
Our topology processes a stream of XML files from an HDFS directory.
It has a single spout (parallelism always 1) and four bolts.
Single worker
Whatever the topology parallelism, we get almost the same performance results (the rate of data processed).
Multiple workers
Whatever the topology parallelism, we get performance similar to the single worker for a while (in most cases 10 minutes).
But after that, the complete topology gets restarted without any error traces.
We observed that whatever data was processed in 20 minutes with a single worker took 90 minutes with 5 workers at the same parallelism.
Also, the topology restarted 7 times with 5 workers.
And CPU usage is relatively high.
(Someone else has also faced this topology restart issue, http://search-hadoop.com/m/LrAq5ZWeaU, but there is no answer.)
After testing many configurations, we found that a single worker with a low degree of parallelism (each bolt with 2 or 3 instances) works better than high parallelism or more workers.
Ideally, the performance of a Storm topology should be better with more workers/parallelism.
Apparently this rule does not hold here.
Why can't we set more than a single worker on a single node?
What is the maximum number of workers that can be run on a single node?
What Storm configuration changes are needed to scale the performance? (I have tried nimbus.childopts and worker.childopts.)
If your CPU usage is high on the one node, then you're not going to get any better performance as you increase parallelism. If you do increase parallelism, there will just be greater contention for a constant number of CPU cycles. Not knowing any more about your specific topology, I can only suggest that you look for ways to reduce the CPU usage across your bolts and spouts. Only then would it make sense to add more bolt and spout instances.
I have an EvaluationBolt (e.g., for memory monitoring) and I want to make sure one executor for it runs on every worker process (which in my case is one per physical node, i.e., supervisor.slots.ports is configured with only port 6700). On this topic I found the following question:
How bolts and spouts are shared among workers?
But it does not state how, or whether, I can control the distribution of bolts and spouts myself. Can one somehow configure the scheduler manually?
Cheers,
Tomi
The complicated but correct route is to write a custom Storm scheduler: http://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/.
What I've also found is that the Storm scheduler by default performs round-robin scheduling between hosts, so most of the time you can just use the built-in scheduler to distribute your tasks equally over all hosts.
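Given that default round-robin behavior, here is a minimal sketch (assuming four supervisor nodes with one slot each; EvaluationBolt stands for the asker's own bolt class, and the spout is a storm-core test class) that sets the bolt's parallelism equal to the worker count so the even scheduler lands one executor on each worker:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;

    public class OnePerWorkerSketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("input", new TestWordSpout(), 1);
            // 4 executors over 4 workers -> one EvaluationBolt instance per
            // worker. EvaluationBolt is the asker's monitoring bolt.
            builder.setBolt("eval", new EvaluationBolt(), 4)
                   .shuffleGrouping("input");

            Config conf = new Config();
            conf.setNumWorkers(4); // one worker per node (only port 6700 per supervisor)

            StormSubmitter.submitTopology("monitored-topology", conf,
                    builder.createTopology());
        }
    }

This relies on the even scheduler's round-robin placement and is not a hard guarantee; for a strict one-executor-per-worker constraint you would still need the custom scheduler route above.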