I have a question about Storm functionality. Suppose I have a spout that reads a CSV file and emits records chunk by chunk, that is, it emits 100 records at a time to the bolt.
My question is whether a single chunk, when received by the bolt, is sent to only one executor or is divided among different executors for the sake of parallelism.
Note: The bolt has 5 executors.
What do you mean by "it emits 100 records at a time"? Does it mean that a single tuple contains 100 CSV lines? Or do you emit 100 tuples (each containing a single CSV line) in a single nextTuple() call?
For the first case, Storm cannot parallelize those 100 lines within a single tuple. Storm can only send different tuples to different executors.
For the second case, Storm will send the 100 tuples to different executors (of course, depending on the connection pattern you have chosen).
One side remark: it is considered bad practice to emit multiple tuples in a single call to nextTuple(). If nextTuple() blocks for any reason, the spout thread is blocked and cannot (for example) react to incoming acks. Best practice is to emit a single tuple for each call to nextTuple(). If no tuple is available to be emitted, you should return (without emitting) rather than block until a tuple becomes available.
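The nextTuple() advice above can be sketched without Storm on the classpath. The class below is a hypothetical stand-in, not Storm's spout interface: `offer`, `emitted`, and the in-memory queue are assumptions made for illustration. It only models the contract of emitting at most one tuple per call and returning immediately when nothing is available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Stand-in for a spout; this is NOT Storm's API, only a model of the nextTuple() contract. */
final class OneTuplePerCallSpout {
    // Filled by some reader (hypothetical), for example a thread parsing the CSV file.
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    // Records what was emitted, in place of a real output collector.
    private final List<String> emitted = new ArrayList<>();

    void offer(String csvLine) {
        pending.add(csvLine);
    }

    /** Emits at most one tuple and never blocks. */
    void nextTuple() {
        String line = pending.poll(); // returns null immediately when the queue is empty
        if (line == null) {
            return; // nothing to emit: hand control back to the caller
        }
        emitted.add(line);
    }

    List<String> emitted() {
        return emitted;
    }
}
```

Calling nextTuple() on an empty queue is a no-op here, so the calling thread stays free for other work such as handling acks.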
Executors are threads.
If you do not explicitly configure the number of tasks (instances), Storm runs 1 task per executor by default. So in practice, 5 different instances of the bolt are running, handled by 5 different threads (1 thread per task).
So ideally the tuples you emit will be processed by 5 different threads simultaneously.
I am new to Apache Storm and am wondering how the parallelism hint works.
For example, we have one stream containing two tuples <4>,<6>, one spout with only one task per executor, and one bolt that performs some operation on the tuples with a parallelism hint of 2, so we have two executors of this bolt, namely A and B. Regarding this, I have 3 questions.
Considering the above scenario, is it possible that the tuple containing the value 4 is processed by A and the tuple containing the value 6 is processed by B?
If processing is done in this manner, i.e. as mentioned in question (1), won't it affect operations in which sequence matters?
If processing is not done in this manner, meaning both tuples go to the same executor, then what is the benefit of parallelism?
Considering the above scenario, is it possible that the tuple containing the value 4 is processed by A and the tuple containing the value 6 is processed by B?
Yes.
If processing is done in this manner, i.e. as mentioned in question (1), won't it affect operations in which sequence matters?
It depends. You most likely have control over the sequence of the tuples in your spout. If sequence matters, it is advisable to either reduce parallelism or use fields grouping, to make sure tuples which depend on each other go to the same executor. If sequence does not matter use shuffleGrouping or localOrShuffleGrouping to get benefits from parallel processing.
If processing is not done in this manner, meaning both tuples go to the same executor, then what is the benefit of parallelism?
If both tuples go to the same executor, there is no benefit, obviously.
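To make the fields grouping point concrete, here is a minimal sketch of key-based routing. It is a simplified model, not Storm's implementation: the name `targetTask` and the use of `Objects.hashCode` modulo the task count are assumptions for illustration. What it shows is that the target depends only on the key, so tuples with equal keys always reach the same task and keep their relative order there.

```java
import java.util.Objects;

final class FieldsGroupingSketch {
    /** Picks a target task from the grouping key alone, so equal keys always map to the same task. */
    static int targetTask(Object key, int numTasks) {
        if (numTasks < 1) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        // floorMod keeps the result in [0, numTasks) even for negative hash codes.
        return Math.floorMod(Objects.hashCode(key), numTasks);
    }
}
```

Note that key-based routing does not promise an even spread: in this model the keys 4 and 6 both land on task 0 when there are two tasks, which is the "no benefit" situation described above.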
According to the official documentation:
How many instances to create for a spout/bolt. A task runs on a thread with zero or more other tasks for the same spout/bolt. The number of tasks for a spout/bolt is always the same throughout the lifetime of a topology, but the number of executors (threads) for a spout/bolt can change over time. This allows a topology to scale to more or less resources without redeploying the topology or violating the constraints of Storm (such as a fields grouping guaranteeing that the same value goes to the same task)
My questions are:
Under what circumstances would I choose to run multiple tasks in one executor?
If I do use multiple tasks in one executor, what might be reasons that I would choose different number of tasks per executor between my spout and my bolt (such as 2 tasks per bolt executor but only 1 task per spout executor)?
I thought https://stackoverflow.com/a/47714449/8845188 was a fine answer, but I'll try to reword it as examples:
The number of tasks for a component (e.g. spout or bolt) is set in stone when you submit the topology, while the number of executors can be changed without redeploying the topology. The number of executors is always less than or equal to the number of tasks for a component.
Question 1
You wouldn't normally have a reason to choose running e.g. 2 tasks in 1 executor, but if you currently have a low load but expect a high load later, you may choose to submit the topology with a high number of tasks but a low number of executors. You could of course just submit the topology with as many executors as you expect to need, but using many threads when you only need a few is inefficient due to context switching and/or potential resource contention.
For example, let's say you submit your topology so the spout has 4 tasks and 4 executors (one task per executor). When your load increases, you can't scale further because 4 is the maximum number of executors you can have. You now have to redeploy the topology in order to scale with the load.
Let's say instead you submit your topology so the spout has 32 tasks and 4 executors (8 tasks per executor). When the load increases, you can increase the number of executors to 32, even though you started out with only 4. You can do this scaling up without redeploying the topology.
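The two scenarios can be modeled in a few lines. This is only a sketch of the constraint "executors <= tasks, tasks fixed at submit time"; the class `ComponentParallelism` is hypothetical, and clamping an oversized request to the task count is a modeling choice here, not a statement about how Storm reports such a request.

```java
final class ComponentParallelism {
    final int tasks;     // fixed when the topology is submitted
    final int executors; // may change later, but never exceeds tasks

    ComponentParallelism(int tasks, int executors) {
        if (tasks < 1 || executors < 1) {
            throw new IllegalArgumentException("tasks and executors must be positive");
        }
        this.tasks = tasks;
        this.executors = Math.min(executors, tasks);
    }

    /** Models a rebalance: the task count is carried over unchanged. */
    ComponentParallelism rebalance(int requestedExecutors) {
        return new ComponentParallelism(tasks, requestedExecutors);
    }

    /** Largest number of tasks any single executor has to run. */
    int maxTasksPerExecutor() {
        return (tasks + executors - 1) / executors;
    }
}
```

With 4 tasks, asking for 8 executors still leaves 4. With 32 tasks and 4 executors, each executor runs 8 tasks, and a later rebalance to 32 executors brings that down to 1.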
Question 2
Let's say your topology has a spout A, and a bolt B. Let's say bolt B does some heavyweight work (e.g. can do 10 tuples per executor per second), while the spout is lightweight (e.g. can do 1000 tuples per executor per second). Let's say your load is initially 20 messages per second into the topology, but you expect that to grow.
In this case it makes sense that you might configure your spout with 1 executor and 1 task, since it's likely to be idle most of the time. At the same time you want to configure your bolt with a high number of tasks so you can scale the number of executors for it, and at least 2-3 executors to start.
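The sizing in this example is simple arithmetic, sketched below. The helper `executorsNeeded` is hypothetical, and it assumes throughput scales linearly with the number of executors, which real topologies only approximate.

```java
final class ExecutorEstimate {
    /** Smallest executor count whose combined throughput covers the load (at least 1). */
    static int executorsNeeded(int loadPerSecond, int tuplesPerExecutorPerSecond) {
        if (loadPerSecond < 0 || tuplesPerExecutorPerSecond <= 0) {
            throw new IllegalArgumentException("load must be >= 0 and rate must be > 0");
        }
        // Integer ceiling division, then never go below one executor.
        int needed = (loadPerSecond + tuplesPerExecutorPerSecond - 1) / tuplesPerExecutorPerSecond;
        return Math.max(1, needed);
    }
}
```

For the numbers above, the bolt needs 2 executors for 20 messages per second at 10 per executor, with no headroom, which is why starting with 2-3 is reasonable. The spout needs only 1. If the load grew to 200 messages per second, the bolt would need 20 executors, so its task count should be at least that high from the start.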
Config#TOPOLOGY_TASKS -> How many tasks to create per component.
A task performs the actual data processing and is run within its parent executor’s thread of execution. Each spout or bolt that you implement in your code executes as many tasks across the cluster.
The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads <= #tasks.
By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread (which is usually what you want anyways).
Also be aware that:
The number of executor threads can be changed after the topology has been started.
The number of tasks of a topology is static.
There is another case where running multiple tasks per executor can make sense.
Let's suppose you have 2 tasks of the same bolt running on a single executor (thread). Let's also suppose you are calling a relatively long-running (maybe 1 second) database subroutine, and the result is needed before proceeding further.
Case 1 - Your database call runs on the executor thread, which pauses for a while, and you do not gain anything by running 2 tasks.
Case 2 - You refactor your database call code to spawn a new thread and execute there. In this case, your main executor thread does not hang, and it can start processing for the second bolt task while the newly spawned thread fetches data from the database.
Unless you introduce your own parallelism within the component, I see no performance gain and no reason to run multiple tasks apart from the maintenance reasons mentioned in other answers.
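The difference between Case 1 and Case 2 can be sketched in plain Java. Everything here is an assumption for illustration: `slowLookup` stands in for the database call (with a short sleep instead of a 1 second query), and the thread pool stands in for the "newly spawned thread". It is not tied to Storm's bolt interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class OffloadedLookupSketch {
    /** Hypothetical slow lookup standing in for the database call. */
    static String slowLookup(String key) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "row-for-" + key;
    }

    /** Case 1: each lookup runs on the calling thread, one after the other. */
    static List<String> blocking(List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String k : keys) {
            out.add(slowLookup(k));
        }
        return out;
    }

    /** Case 2: lookups are handed to a pool, so the caller can start the next one immediately. */
    static List<String> offloaded(List<String> keys) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, keys.size()));
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String k : keys) {
                futures.add(CompletableFuture.supplyAsync(() -> slowLookup(k), pool));
            }
            List<String> out = new ArrayList<>();
            for (CompletableFuture<String> f : futures) {
                out.add(f.join()); // wait for each result, preserving input order
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```

Both methods return the same results; only the waiting overlaps in the second one. In a real bolt, whether results may be emitted or acked from such helper threads depends on the Storm version, so that should be checked against its documentation before relying on this pattern.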
I have already read related material about Storm parallelism, but some things are still unclear. Take tweet processing as an example. Generally, what we do is retrieve a stream of tweets, count the number of words in each tweet, and write the counts to a local file.
My question is how to understand the parallelism value of spouts and bolts. In builder.setSpout and builder.setBolt we can assign a parallelism value. But in the case of counting words in tweets, is it correct that only one spout should be set? Multiple spout instances are regarded as copies of the same spout, so identical tweets flow into each of them. If that is the case, what is the value of setting more than one spout?
Another unclear point is how work is assigned to bolts. Does the parallel mechanism work such that Storm finds a currently available bolt to process the next tuple emitted by the spout? I revised the basic tweet counting code so that the final counts are written to a specific directory; however, all results are actually combined in one file on Nimbus. So after the data is processed on the supervisors, are all results sent back to Nimbus? If so, what is the communication mechanism between Nimbus and the supervisors?
I really want to figure out these problems! I appreciate any help!
Setting the parallelism for spouts larger than one requires that the user code does different things for different instances. Otherwise (as you mentioned already), data is just sent through the topology multiple times. For example, you can have a list of ports you want to listen to (or a list of different Kafka topics). Thus, you need to ensure that different instances listen to different ports or topics. This can be achieved in the open(...) method by looking at topology metadata such as the instance's own task ID and the degree of parallelism (dop). As each instance has a unique ID, you can partition your ports/topics such that each instance picks different ports/topics from the overall list.
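The partitioning step can be sketched as a pure function. The helper `assignedTo` is hypothetical; in a real spout the two integers would come from the topology metadata passed to open(...) (to my knowledge `TopologyContext` offers `getThisTaskIndex()` and `getComponentTasks(...)` for this, but treat that as an assumption and check the Javadoc of your Storm version).

```java
import java.util.ArrayList;
import java.util.List;

final class SourcePartitioner {
    /** Returns the entries that the instance with the given index should read. */
    static <T> List<T> assignedTo(int taskIndex, int parallelism, List<T> allSources) {
        if (parallelism < 1 || taskIndex < 0 || taskIndex >= parallelism) {
            throw new IllegalArgumentException("need 0 <= taskIndex < parallelism");
        }
        List<T> mine = new ArrayList<>();
        // Every parallelism-th entry, starting at this instance's own index.
        for (int i = taskIndex; i < allSources.size(); i += parallelism) {
            mine.add(allSources.get(i));
        }
        return mine;
    }
}
```

With five topics and a parallelism of 2, instance 0 reads the topics at positions 0, 2, and 4, and instance 1 reads those at positions 1 and 3, so no topic is read twice and none is skipped.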
About parallelism: this depends on the connection pattern you use when plugging your topology together. For example, using shuffleGrouping results in a round-robin distribution of your emitted tuples to the consuming bolt instances. In this case, Storm does not "look" at whether any bolt instance is available for processing. Tuples are simply transferred and buffered at the receiver if necessary.
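The round-robin behaviour described above can be modeled in a few lines. This is a deliberately simplified sketch: `distribute` is a hypothetical helper, and Storm's actual shuffle grouping may order the targets differently. The point is only that tuples are dealt out in turn without asking whether a consumer is idle.

```java
final class RoundRobinSketch {
    /** Counts how many of the tuples each consumer receives under plain round robin. */
    static int[] distribute(int tuples, int consumers) {
        if (tuples < 0 || consumers < 1) {
            throw new IllegalArgumentException("need tuples >= 0 and consumers >= 1");
        }
        int[] received = new int[consumers];
        for (int i = 0; i < tuples; i++) {
            received[i % consumers]++; // next consumer in turn, regardless of its backlog
        }
        return received;
    }
}
```

For 100 tuples and 5 bolt instances, each instance receives 20 tuples in this model.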
Furthermore, Nimbus and the supervisors only exchange metadata. There is no data flow (i.e., flow of tuples) between them.
In some cases, such as Kafka's consumer groups, you have queue behaviour: if one consumer reads a message from the queue, the other consumers will read different messages from the queue.
This distributes the read load from the queue across all workers.
In those cases you can have multiple spouts reading from the queue.
I want to fire multiple web requests in parallel and then aggregate the data in a Storm topology. Which of the following ways is preferred?
1) Create multiple threads within a bolt
2) Create multiple bolts and create a merging bolt to aggregate the data.
I would like to create multiple threads within a bolt, because merging data in another bolt is not a simple process. But I see there are some concerns about that, which I found on the internet:
https://mail-archives.apache.org/mod_mbox/storm-user/201311.mbox/%3CCAAYLz+pUZ44GNsNNJ9O5hjTr2rZLW=CKM=FGvcfwBnw613r1qQ#mail.gmail.com%3E
but I didn't find a clear reason why not to create multiple threads. Any pointers will help.
On a side note, does that mean I should not use Java 8's parallel streams either, as described in https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html?
Increase the number of tasks for the bolt; it's like spawning multiple instances of the same bolt. Also increase the number of executors (threads) to handle them evenly.
Make sure #executors <= #tasks. Storm will do the rest for you.
I just wanted to know the actual relevance of tasks in Storm with respect to output or performance, since they have nothing to do with parallelism. Will choosing more than 1 task for a component change the output? What will the flow be then? And if I choose a number of tasks greater than the number of executors, how does that change the flow or the output (here I am just taking the basic word count example)?
It would be very helpful if anybody could explain this to me, with or without an example.
For example:
I have a topology with 3 bolts and 1 spout, and I have specified only 2 worker ports, which means all 4 components (1 spout and 3 bolts) will run on these workers only. I have specified 2 executors for the first bolt, which means 2 threads of that bolt will run in parallel. Now if I set the number of tasks to 3, what difference will this make to output or performance?
And if I have specified fields grouping, will the grouping be spread across the different executors (please correct me if I am wrong)?
Did you read this article? https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
To pick up your example: If you set #tasks=3 and specify 2 executors using fieldsGrouping, the data will be partitioned into 3 substreams (= #tasks). 2 substreams go to one executor and the third to the second executor. However, using 3 tasks and 2 executors allows you to increase the number of executors to 3 using the rebalance command.
As long as you do not want to increase the number of executors during execution, #tasks should be equal to #executors (i.e., just don't specify #tasks).
For your example (if you don't want to change the parallelism at runtime), you will most likely get an imbalanced workload across the two executors (one executor processes about 67% of the data, the other about 33%). However, this is only a problem in this special case and not in general. If you assume you have 4 tasks, each executor processes 2 substreams and no imbalance occurs.
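The imbalance can be worked out with a short calculation. The helper `shares` is hypothetical, and it assumes that every substream carries the same amount of data and that tasks are spread over executors as evenly as possible.

```java
final class SubstreamShare {
    /** Fraction of the data each executor handles when tasks are spread as evenly as possible. */
    static double[] shares(int tasks, int executors) {
        if (tasks < 1 || executors < 1 || executors > tasks) {
            throw new IllegalArgumentException("need 1 <= executors <= tasks");
        }
        int base = tasks / executors;  // tasks every executor gets
        int extra = tasks % executors; // executors that get one more task
        double[] result = new double[executors];
        for (int i = 0; i < executors; i++) {
            int mine = base + (i < extra ? 1 : 0);
            result[i] = (double) mine / tasks;
        }
        return result;
    }
}
```

With 3 tasks on 2 executors the shares are two thirds and one third. With 4 tasks on 2 executors both shares are one half.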