Apache Storm - Mapping a topology onto a Storm cluster - apache-storm

I have read many sites about Storm, but I still cannot map a topology onto a Storm cluster properly.
Please help me understand this.
In a Storm cluster there are terms like
Supervisor
Worker node
Worker process
Workers
Slots
Executor
Tasks
In a topology, there are
Spout
Bolt
It is also possible to configure
numWorkers
parallelism
So can anyone please relate all these things?
I want to know, for example, whether each spout/bolt acts as an executor or as a task.
If a parallelism hint is given, which entity's count will increase?
If numWorkers is set, which count is that?
I want to map all these things onto the Storm cluster.
I have already worked on a project, so I know what a topology is.

Physical Cluster Setup:
The term node usually refers to a physical machine (or a VM) in your cluster. On each node a supervisor runs in its own JVM. Each supervisor has worker slots. This is a logical configuration that tells how many workers can be started by that supervisor. Each worker (if started) runs in its own JVM (thus, some people call it a worker process). In summary: on a node there is one supervisor JVM and up to number-of-worker-slots worker JVMs. Therefore, the node a worker JVM is running on can be called a worker node. While the supervisor runs all the time, workers are started on demand, i.e., when topologies are deployed, and stopped when a topology is killed. Within a worker, executors run as threads (i.e., each executor maps to its own thread).
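As a rough illustration of that hierarchy, here is a small Python sketch (not Storm code; the node names, slot count, and worker count are all made up for the example):

```python
# Hypothetical cluster: 2 nodes, each supervisor configured with 3 worker slots.
# Every started worker is a separate JVM; executors are threads inside it.
nodes = ["node-1", "node-2"]
slots_per_supervisor = 3

cluster = {
    node: {
        "supervisor_jvm": 1,                   # always running
        "worker_slots": slots_per_supervisor,  # logical capacity, not processes
        "worker_jvms": [],                     # started only when topologies deploy
    }
    for node in nodes
}

# Deploying a topology that requests 4 workers fills slots across the nodes:
for i in range(4):
    node = nodes[i % len(nodes)]               # spread workers over nodes
    cluster[node]["worker_jvms"].append(f"worker-{i}")

print(cluster["node-1"]["worker_jvms"])        # ['worker-0', 'worker-2']
```

Note that the slot count only caps how many worker JVMs a supervisor may start; the `worker_jvms` list stays empty until a topology actually needs them.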
Logical Topology Setup:
Topologies are built from Spouts (also called sources, i.e., operators with no incoming data stream) and Bolts (regular operators with at least one incoming data stream and any number of outgoing data streams; if there is no outgoing data stream, a Bolt is also called a sink). For each Spout/Bolt you can configure two parameters:
the number of tasks
the dop (degree of parallelism, called parallelism_hint), i.e., the number of executors you want to have for a Spout/Bolt
Tasks are logical units of work (i.e., something passive). Let's assume you use the fieldsGrouping connection pattern. The data stream is then partitioned into number-of-tasks many sub-streams. Tasks are assigned to executors, i.e., each executor processes one or multiple tasks. This implies that you cannot have fewer tasks than executors (i.e., than the parallelism); otherwise, there would be a thread without any work to do.
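As an illustration of that partitioning, here is a Python sketch (the CRC-based routing and all numbers are invented for the example; Storm's actual hashing differs):

```python
import zlib

# Illustrative numbers: a bolt with 6 tasks spread over 3 executors.
num_tasks = 6
num_executors = 3          # parallelism hint; must be <= num_tasks

def task_for(key):
    # fieldsGrouping idea: the target task depends only on the key and the
    # (fixed) number of tasks, so equal keys always reach the same task.
    return zlib.crc32(key.encode()) % num_tasks

# Tasks are assigned to executors; each executor processes one or more tasks.
executor_of = {task: task % num_executors for task in range(num_tasks)}

task = task_for("user-42")
print(f"key 'user-42' -> task {task} on executor {executor_of[task]}")
```

With 6 tasks and 3 executors, each executor ends up processing 2 tasks; with 3 tasks and 6 executors the assignment would leave idle threads, which is why tasks >= executors must hold.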
See the Storm documentation for further details (https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html). Furthermore, there are many other questions on SO about tasks/executors in Storm.
Last but not least, you can configure the numberOfWorkers for a topology. This parameter indicates how many workers should be started to run the topology. The overall number of executors for a topology is the sum of dops over all Spouts/Bolts. All executors are distributed evenly over all available worker JVMs.
Furthermore, a single worker can only run executors of a single topology. This is done for fault-tolerance reasons, i.e., topologies are isolated from each other. At the same time, a worker itself can run any number of executors.
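A Python sketch of that even distribution (the component names, dops, and the round-robin assignment are invented for the example):

```python
# Hypothetical topology: the sum of dops over all Spouts/Bolts
# gives the total number of executors.
dops = {"spout": 2, "split-bolt": 4, "count-bolt": 3}   # made-up components
num_workers = 3                                          # numberOfWorkers

executors = [f"{comp}[{i}]" for comp, dop in dops.items() for i in range(dop)]

# Distribute all 9 executors evenly over the 3 worker JVMs.
workers = {w: [] for w in range(num_workers)}
for idx, executor in enumerate(executors):
    workers[idx % num_workers].append(executor)          # round-robin

for w, exs in workers.items():
    print(f"worker-{w}: {exs}")
```

Each of the 3 workers ends up hosting 3 executor threads, regardless of which component an executor belongs to.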

Related

Apache storm: why and how to choose number of tasks per executor?

According to the official documentation:
How many instances to create for a spout/bolt. A task runs on a thread with zero or more other tasks for the same spout/bolt. The number of tasks for a spout/bolt is always the same throughout the lifetime of a topology, but the number of executors (threads) for a spout/bolt can change over time. This allows a topology to scale to more or less resources without redeploying the topology or violating the constraints of Storm (such as a fields grouping guaranteeing that the same value goes to the same task)
My questions are:
Under what circumstances would I choose to run multiple tasks in one executor?
If I do use multiple tasks in one executor, what might be reasons to choose a different number of tasks per executor between my spout and my bolt (such as 2 tasks per bolt executor but only 1 task per spout executor)?
I thought https://stackoverflow.com/a/47714449/8845188 was a fine answer, but I'll try to reword it as examples:
The number of tasks for a component (e.g. spout or bolt) is set in stone when you submit the topology, while the number of executors can be changed without redeploying the topology. The number of executors is always less than or equal to the number of tasks for a component.
Question 1
You wouldn't normally have a reason to run e.g. 2 tasks in 1 executor, but if you currently have a low load and expect a high load later, you may choose to submit the topology with a high number of tasks but a low number of executors. You could of course just submit the topology with as many executors as you expect to need, but using many threads when you only need a few is inefficient due to context switching and potential resource contention.
For example, let's say you submit your topology so the spout has 4 tasks and 4 executors (one per executor). When your load increases, you can't scale further, because 4 is the maximum number of executors you can have. You now have to redeploy the topology in order to scale with the load.
Let's say instead you submit your topology so the spout has 32 tasks and 4 executors (8 tasks per executor). When the load increases, you can increase the number of executors to 32, even though you started out with only 4. You can do this scaling up without redeploying the topology.
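The reason this works is that fields-grouping routing depends only on the (fixed) task count, so adding executors moves tasks between threads without reshuffling which key goes to which task. A Python sketch (the hashing and the modulo assignment are illustrative, not Storm's exact scheme):

```python
import zlib

num_tasks = 32                       # fixed when the topology is submitted

def task_for(key):
    # Routing depends only on the key and num_tasks, never on executor count.
    return zlib.crc32(key.encode()) % num_tasks

def assign(num_executors):
    # Which executor hosts which task, for a given executor count.
    return {task: task % num_executors for task in range(num_tasks)}

before = assign(4)                   # 8 tasks per executor
after = assign(32)                   # 1 task per executor, after scaling up

task = task_for("user-42")
# The task for the key is unchanged; only its hosting executor moved.
print(task, before[task], after[task])
```

Because `task_for` never consults the executor count, the fields-grouping guarantee (same value, same task) survives the scale-up.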
Question 2
Let's say your topology has a spout A, and a bolt B. Let's say bolt B does some heavyweight work (e.g. can do 10 tuples per executor per second), while the spout is lightweight (e.g. can do 1000 tuples per executor per second). Let's say your load is initially 20 messages per second into the topology, but you expect that to grow.
In this case it makes sense to configure your spout with 1 executor and 1 task, since it's likely to be idle most of the time. At the same time you want to configure your bolt with a high number of tasks, so you can scale its number of executors, and at least 2-3 executors to start with.
Config#TOPOLOGY_TASKS -> How many tasks to create per component.
A task performs the actual data processing and is run within its parent executor’s thread of execution. Each spout or bolt that you implement in your code executes as many tasks across the cluster.
The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads <= #tasks.
By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread (which is usually what you want anyways).
Also be aware that:
The number of executor threads can be changed after the topology has been started.
The number of tasks of a topology is static.
There is another case where having more tasks than executors makes sense.
Let's suppose you have 2 tasks of the same bolt running on a single executor (thread), and suppose you are calling a relatively long-running (perhaps 1 second) database subroutine whose result is needed before proceeding further.
Case 1 - Your database call runs on the executor thread; it pauses the thread for a while, and you gain nothing by running 2 tasks.
Case 2 - You refactor your database call code to spawn a new thread and execute there. In this case, your main executor thread does not hang, and it can start processing the second bolt task while the newly spawned thread fetches data from the database.
Unless you introduce your own parallelism within the component, I see no performance gain and no reason to run multiple tasks, apart from the maintenance reasons mentioned in other answers.
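The difference between the two cases can be sketched in Python, simulating the database call with a sleep (the durations are illustrative):

```python
import threading
import time

def db_call():
    time.sleep(0.2)        # stand-in for a long-running database round trip

# Case 1: two tasks share one executor thread, so the calls run back to back.
start = time.monotonic()
for _task in range(2):
    db_call()
sequential = time.monotonic() - start

# Case 2: each task hands its call to a spawned thread, so the calls overlap
# and the executor thread is free to move on to the second task.
start = time.monotonic()
threads = [threading.Thread(target=db_call) for _task in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
overlapped = time.monotonic() - start

print(f"sequential: {sequential:.1f}s, overlapped: {overlapped:.1f}s")
```

With these numbers the sequential case takes roughly twice as long as the overlapped one, which is exactly the gain described for Case 2.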

Apache Storm: Is the topology replicated on at least one worker on a supervisor node?

I have just started learning about Apache Storm. One thing that I cannot understand is whether the entire topology is replicated on at least one worker process on a supervisor node. If that is the case, then isn't a component of the topology that is very compute-intensive (and possibly performs better executed on a single machine by itself) a potential bottleneck? If not, I assume Nimbus in some way "distributes" parts of the topology across the cluster. How does it know how to optimally "distribute" the topology?
Storm does not replicate a topology. If you deploy a topology, all executor threads are distributed evenly over all workers (using a round-robin scheduling mechanism). The number of workers a topology can use can be configured via Config.setNumWorkers(int).
If you have a compute-intensive bolt and you want to ensure that it is deployed to its own worker, you need to implement a custom scheduler. See here for more details: https://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/

Issues with storm execution in single node

We have Storm configured on a single-node development server with most of the configuration set to defaults (not local mode).
Storm Nimbus, the supervisor, and the workers run on that single node only, and the UI is also configured.
AFAIK, parallelism and configuration differ from topology to topology.
I think finding the right parallelism and configuration is possible only by trial and error.
So, to find the best parallelism, we started testing our Storm topology with various configurations on a single node.
Strangely, the results are unexpected:
Our topology processes a stream of XML files from an HDFS directory.
It has a single spout (parallelism always 1) and four bolts.
Single worker
Whatever the topology parallelism, we get almost the same performance (rate of data processed).
Multiple workers
Whatever the topology parallelism, we get performance similar to the single worker for some time (in most cases about 10 minutes).
But after that, the complete topology restarts without any error traces.
We observed that data processed in 20 minutes with a single worker took 90 minutes with 5 workers at the same parallelism.
The topology also restarted 7 times with 5 workers.
And CPU usage is relatively high.
(Someone else also had faced this topology restart issue http://search-hadoop.com/m/LrAq5ZWeaU but no answer)
After testing many configurations, we found that a single worker with low parallelism (each bolt with 2 or 3 instances) works better than high parallelism or more workers.
Ideally, the performance of a Storm topology should improve with more workers/parallelism.
Apparently that rule does not hold here.
Why can't we set more than a single worker on a single node?
What is the maximum number of workers that can run on a single node?
What Storm configuration changes are needed to scale performance? (I have tried nimbus.childopts and worker.childopts.)
If your CPU usage is high on the one node, then you're not going to get any better performance as you increase parallelism; there will just be greater contention for a constant number of CPU cycles. Not knowing any more about your specific topology, I can only suggest that you look for ways to reduce the CPU usage across your bolts and spouts. Only then would it make sense to add more bolt and spout instances.

Controlling distribution of bolts in Storm?

I have an EvaluationBolt (e.g. for memory monitoring) and I want to make sure one executor for it runs on every worker process (which in my case is one per physical node, i.e. supervisor.slots.ports is configured with only port 6700). On this topic I found this question:
How bolts and spouts are shared among workers?
But it does not state how, or whether, I can control the distribution of bolts and spouts myself. Can one somehow configure the scheduler manually?
Cheers,
Tomi
The complicated and correct route is to write a Storm scheduler: http://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/.
What I've also found is that the Storm scheduler by default performs round-robin scheduling between hosts, so most of the time you can just use the built-in scheduler to distribute your tasks equally over all hosts.

What are the reasons to configure more than one worker per cluster node in Apache Storm?

In the following, I refer to this article: Understanding the Parallelism of a Storm Topology by Michael G. Noll
It seems to me that a worker process may host an arbitrary number of executors (threads) to run an arbitrary number of tasks (instances of topology components). Why should I configure more than one worker per cluster node?
The only reason I see is that a worker can only run a subset of at most one topology. Hence, if I want to run multiple topologies on the same cluster, I would need to configure the same number of workers per cluster node as the number of topologies to be run.
(Example: This is because I would want to be flexible in case some cluster nodes fail. If, for example, only one cluster node remains, I need at least as many worker processes as there are topologies running on the cluster in order to keep all topologies running.)
Is there any other reason? Especially, is there any reason to configure more than one worker per cluster node if running only one topology? (Better fail-safety, etc.)
It balances the cost of running a supervisor daemon per node against the impact of a worker crashing. If you have one large, monolithic worker JVM, one crash impacts everything running in that worker, and badly behaving portions of your worker impact more residents. By having more than one worker per node, you make your supervisor more efficient, and you get something of a bulkhead pattern, moving away from the all-or-nothing approach.
The shared resources I refer to could be yours or Storm's; several pieces of Storm's architecture are shared per JVM and could create contention problems. Specifically, I am referring to the receive and send threads and the underlying network pieces. Documented here.
