What is the benefit of running multiple tasks in one executor in a Storm topology? Apart from doing several things in the same thread, do we gain any speed or parallelism?
Michael G. Noll wrote a great tutorial that should help you understand Storm parallelism.
Usually a topology runs one task per executor. However, since you cannot increase the number of tasks while a topology is running, you can declare multiple tasks per executor in order to scale up parallelism over time.
There is no specific use case for having multiple tasks per executor other than the option to increase the topology's parallelism later.
I am trying to learn the parallelism and scalability features offered by Storm and read the following article: http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html. I am confused about whether Storm supports data parallelism or task parallelism. My understanding (I may be wrong) is that Storm supports task parallelism, since the degree of parallelism is restricted by the number of tasks in the topology. If that is the case, how can it be used for large-scale parallel data processing, which requires data parallelism?
Any help would be greatly appreciated. Thanks :)
Storm does not follow textbook terminology. In fact, Storm supports data, task, and pipelined parallelism.
If you have an operator and assign a parallelism larger than one (parallelism_hint), you get as many threads as specified by the parameter, each executing the same code on different data, i.e., you get data parallelism. You can further set the parameter number_of_tasks (which must be >= parallelism_hint) to split the input data into number_of_tasks partitions/substreams (i.e., more partitions than executors). Thus, some executor threads need to process multiple partitions/substreams (called tasks in Storm). This does not increase the parallelism (though it may increase concurrency). However, it allows you to change the number of executors at runtime.
Because you have multiple spouts and bolts in your topology, and all of them are executed in different threads and even on different machines, you also have task parallelism (not to be confused with Storm's usage of the term task!). As there are producer/consumer relationships between spouts and bolts, you also get pipeline parallelism here, which is a special form of task parallelism. Another form of task parallelism in Storm is the ability to run multiple topologies at the same time.
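The relationship between partitions (tasks) and executor threads can be sketched with a small self-contained model. This is not Storm code: the class TaskRouting, its method names, and the hash and modulo placement rules are assumptions made for illustration, and Storm's real routing and assignment may differ. In the Storm API the two knobs correspond to the parallelism hint given when a component is declared and to setNumTasks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only, not part of the Storm API.
class TaskRouting {
    // Assumption: a key is mapped to one of numTasks partitions by hashing.
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Assumption: tasks are spread over executor threads by simple modulo.
    static int executorFor(int taskId, int numExecutors) {
        return taskId % numExecutors;
    }

    // Lists which task ids each executor thread would serve.
    static List<List<Integer>> tasksPerExecutor(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numTasks < numExecutors) {
            throw new IllegalArgumentException("need 1 <= #executors <= #tasks");
        }
        List<List<Integer>> result = new ArrayList<>();
        for (int e = 0; e < numExecutors; e++) {
            result.add(new ArrayList<>());
        }
        for (int t = 0; t < numTasks; t++) {
            result.get(executorFor(t, numExecutors)).add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        int numTasks = 4; // fixed for the lifetime of the topology
        System.out.println("2 executors: " + tasksPerExecutor(numTasks, 2));
        System.out.println("4 executors: " + tasksPerExecutor(numTasks, 4));
        // The key-to-task mapping depends only on numTasks, not on the executor count.
        System.out.println("task for key user-42: " + taskFor("user-42", numTasks));
    }
}
```

In this model the partition a key lands in depends only on the number of tasks, so the number of threads serving those partitions can change without re-partitioning the data.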
I am applying parallelism to my Storm topology. I have set the number of worker nodes to 1.
Example#1
I set both the number of tasks and the number of executors for a particular component to "2".
Example#2: no of tasks < no of executors
I set the number of tasks to "1" and the number of executors to "2" for a particular component.
Example#3: no of tasks > no of executors
I set the number of tasks to "5" and the number of executors to "1" for a particular component.
I do not understand which of the above examples gives the best parallelism for the topology. Which one actually benefits from Storm's parallelism? Please help me understand this.
Did you read this article? https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
To get good performance, you should set the number of executors to the number of available cores (each executor runs in its own thread). Using more tasks than executors is only beneficial if you want to change the parallelism at runtime.
Your "example#2" is not a valid configuration: #tasks >= #executors must always hold (otherwise, there would be threads with no work).
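To make the three examples from the question concrete, here is a minimal sketch that applies the rule #tasks >= #executors and reports how many threads and how many tasks per thread each configuration would give. ParallelismCheck and describe are hypothetical names, not part of Storm, and the sketch only encodes the rule stated above, not Storm's actual handling of such a configuration.

```java
// Hypothetical helper for illustration only, not part of the Storm API.
class ParallelismCheck {
    // Returns a short verdict for one component's configuration.
    static String describe(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numTasks < numExecutors) {
            return "invalid: #tasks >= #executors must hold";
        }
        // Ceiling division: the busiest thread serves this many tasks.
        int maxTasksPerThread = (numTasks + numExecutors - 1) / numExecutors;
        return numExecutors + " thread(s), up to " + maxTasksPerThread + " task(s) per thread";
    }

    public static void main(String[] args) {
        System.out.println("Example#1: " + describe(2, 2));
        System.out.println("Example#2: " + describe(1, 2));
        System.out.println("Example#3: " + describe(5, 1));
    }
}
```

Under this model only Example#1 runs two threads in parallel. Example#3 runs its five tasks one after another in a single thread, which leaves room to raise the executor count later.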
I have an EvaluationBolt (e.g. for memory monitoring) and I want to make sure that one executor for it runs on every worker process (which in my case is one per physical node, i.e. supervisor.slots.ports is configured with port 6700 only). On the topic I found this question:
How bolts and spouts are shared among workers?
But it does not state how, or whether, I can control the distribution of bolts and spouts myself. Can one somehow configure the scheduler manually?
Cheers,
Tomi
The complicated and correct route is to write a Storm scheduler: http://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/.
What I've also found is that the Storm scheduler by default performs round-robin scheduling between hosts, so most of the time you can just use the built-in scheduler to distribute your tasks equally over all hosts.
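As a rough illustration of what round-robin placement over hosts means, the following sketch spreads executor indices over a list of hosts. It is a simplified model, not Storm's scheduler code: RoundRobinPlacement, place, and the host names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of round-robin placement, not Storm's scheduler.
class RoundRobinPlacement {
    // Assigns executor indices 0..numExecutors-1 to hosts in turn.
    static Map<String, List<Integer>> place(List<String> hosts, int numExecutors) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        Map<String, List<Integer>> placement = new LinkedHashMap<>();
        for (String host : hosts) {
            placement.put(host, new ArrayList<>());
        }
        for (int e = 0; e < numExecutors; e++) {
            String host = hosts.get(e % hosts.size());
            placement.get(host).add(e);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("node-a", "node-b", "node-c"); // made-up names
        System.out.println("3 executors: " + place(hosts, 3));
        System.out.println("7 executors: " + place(hosts, 7));
    }
}
```

In this model, giving the bolt as many executors as there are hosts puts one executor on each host. The real scheduler offers no such guarantee, which is why a custom scheduler remains the reliable route.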
I want to know how many MapReduce jobs can be submitted/run simultaneously in a single-node Hadoop environment. Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
You can submit as many jobs as you want; they will be queued up, and the scheduler will run them in FIFO order (by default) as resources become available. The number of jobs Hadoop actually executes at once depends on the factors John described above.
The number of Reducer slots is set when the cluster is configured. This will limit the number of MapReduce jobs based on the number of Reducers each job requests. Mappers are generally more limited by number of DataNodes and # of processors per node.
I'm trying to learn Twitter Storm by following the great article "Understanding the parallelism of a Storm topology".
However, I'm a bit confused by the concept of "task". Is a task a running instance of the component (spout or bolt)? An executor having multiple tasks is actually saying that the same component is executed multiple times by the executor, am I correct?
Moreover, in a general parallelism sense, Storm will spawn a dedicated thread (executor) for a spout or bolt, but what does an executor (thread) having multiple tasks contribute to the parallelism? I think that, since a thread executes sequentially, having multiple tasks in a thread only makes the thread a kind of "cached" resource, which avoids spawning a new thread for the next task run. Am I correct?
I may clear up this confusion by myself after taking more time to investigate, but you know, we both love Stack Overflow ;-)
Thanks in advance.
Disclaimer: I wrote the article you referenced in your question above.
However, I'm a bit confused by the concept of "task". Is a task a running instance of the component (spout or bolt)? An executor having multiple tasks is actually saying that the same component is executed multiple times by the executor, am I correct?
Yes, and yes.
Moreover, in a general parallelism sense, Storm will spawn a dedicated thread (executor) for a spout or bolt, but what does an executor (thread) having multiple tasks contribute to the parallelism?
Running more than one task per executor does not increase the level of parallelism -- an executor always has one thread that it uses for all of its tasks, which means that tasks run serially on an executor.
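The point that tasks on one executor run serially can be illustrated with plain Java threads. This is a toy model and not how Storm is implemented: SerialTasksDemo, runOnOneExecutor, and the thread name executor-0 are hypothetical, and each "task" here is just a log entry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model for illustration only: one thread standing in for one executor.
class SerialTasksDemo {
    // Runs numTasks "tasks" on a single thread and records where each one ran.
    static List<String> runOnOneExecutor(int numTasks) {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        Thread executor = new Thread(() -> {
            // One loop, one thread: the tasks are visited one after another.
            for (int taskId = 0; taskId < numTasks; taskId++) {
                log.add("task-" + taskId + " ran on " + Thread.currentThread().getName());
            }
        }, "executor-0");
        executor.start();
        try {
            executor.join(); // wait until the single thread has finished all tasks
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", ex);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : runOnOneExecutor(3)) {
            System.out.println(line);
        }
    }
}
```

Every entry reports the same thread name, so adding tasks to this executor adds work to the thread's loop rather than adding parallelism.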
As I wrote in the article, please note that:
The number of executor threads can be changed after the topology has been started (see storm rebalance command).
The number of tasks of a topology is static.
And by definition there is the invariant of #executors <= #tasks.
So one reason for having 2+ tasks per executor thread is to give you the flexibility to expand/scale up the topology through the storm rebalance command in the future without taking the topology offline. For instance, imagine you start out with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. Here you could opt for running the topology at the anticipated parallelism level of 25 machines already on the 15 initial boxes (which is of course slower than 25 boxes). Once the additional 10 boxes are integrated you can then storm rebalance the topology to make full use of all 25 boxes without any downtime.
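The arithmetic behind the 15-to-25-machine scenario can be sketched as follows. The sketch assumes a component with 25 tasks and one executor per machine, which is one reading of the example and not stated in the original. RebalanceSketch and its methods are hypothetical, and capping a request at the task count is a modelling assumption that simply mirrors the invariant #executors <= #tasks.

```java
// Hypothetical arithmetic sketch, not part of Storm.
class RebalanceSketch {
    // Ceiling division: how many tasks the busiest executor has to serve.
    static int maxTasksPerExecutor(int numTasks, int numExecutors) {
        return (numTasks + numExecutors - 1) / numExecutors;
    }

    // Assumption: the executor count is kept within 1..numTasks,
    // because the number of tasks is static.
    static int executorsAfterRebalance(int numTasks, int requestedExecutors) {
        return Math.min(numTasks, Math.max(1, requestedExecutors));
    }

    public static void main(String[] args) {
        int numTasks = 25; // declared up front, fixed while the topology runs

        int before = executorsAfterRebalance(numTasks, 15);
        System.out.println("15 boxes: " + before + " executors, up to "
                + maxTasksPerExecutor(numTasks, before) + " tasks each");

        int after = executorsAfterRebalance(numTasks, 25);
        System.out.println("25 boxes: " + after + " executors, up to "
                + maxTasksPerExecutor(numTasks, after) + " tasks each");

        // Asking for more executors than tasks gains nothing in this model.
        System.out.println("request 40: " + executorsAfterRebalance(numTasks, 40) + " executors");
    }
}
```

Had the topology been started with only 15 tasks, this model could never use more than 15 executors, which is why the extra tasks are declared in advance.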
Another reason to run 2+ tasks per executor is for (primarily functional) testing. For instance, if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside all the other stuff running on the machine, you can still run 30 tasks (here: 15 per executor) to see whether code such as your custom Storm grouping is working as expected.
In practice we normally run 1 task per executor.
PS: Note that Storm will actually spawn a few more threads behind the scenes. For instance, each executor has its own "send thread" that is responsible for handling outgoing tuples. There are also "system-level" background threads for e.g. acking tuples that run alongside "your" threads. IIRC the Storm UI counts those acking threads in addition to "your" threads.