Is there any cap on the number of tuples generated out of one tuple from an Apache Storm Bolt?

One of our Bolts breaks a large tuple message down into its children and then emits these children as tuples; sometimes there can be 10000 children.
This bombardment of tuples chokes our topology.
Is there any cap/ceiling value on the number of tuples generated out of one tuple in a Bolt?
We need to send these children further down the topology so that the state of each child can be updated according to the state of its parent.

There is a cap where Storm's algorithm for tracking tuples breaks down, but that cap is at the point where you start to see common collisions between 64-bit random values. So no, there effectively isn't a cap.
What you might run into is that it takes too long to process all the child tuples, so the whole tuple tree hits the tuple timeout. Either you can increase the timeout, or you can take a detour through e.g. Kafka so the entire processing doesn't have to happen within a single tuple tree's processing time.
A setup like
Topology A: source -> splitter -> Kafka
Topology B: Kafka -> processing
lets you process each child individually instead of having to handle all 10k tuples within the message timeout of the parent.
Please elaborate if you meant something else by your topology being choked.
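For illustration, here is a minimal sketch of what the splitter in Topology A could look like (SplitterBolt and the "children" field are made up for this example, not your actual code). Each child is emitted anchored to the parent tuple, and the parent is acked once all children are handed off; a Kafka writer bolt downstream would then persist the children so Topology B can process them independently:

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    import java.util.List;
    import java.util.Map;

    public class SplitterBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple parent) {
            @SuppressWarnings("unchecked")
            List<String> children = (List<String>) parent.getValueByField("children");
            for (String child : children) {
                // Anchoring to `parent` keeps the whole tree tracked by the acker.
                collector.emit(parent, new Values(child));
            }
            collector.ack(parent);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("child"));
        }
    }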

Related

Why does Storm use XOR, instead of a counter, to ensure that every Bolt in a topology has executed successfully?

I am a beginner with Storm. Storm's creator came up with a very impressive method to check that every Bolt in the topology has processed a tuple, which uses XOR.
But I am wondering why he did not just use a counter: when a Bolt executes successfully, the counter is decremented by one, so when the counter reaches 0 the whole task is complete.
Thanks
I believe one can reason why counters are not only inefficient but also an incorrect acker tracking mechanism in an always-running topology.
A Storm tuple tree can in itself be a complex DAG. When a bolt receives acks from multiple downstream sources, what is it to do with the counters? Should it increment them, should it decrement them? In what order?
Storm tuples have random message IDs, while counters would be finite. A topology runs forever, emitting billions of tuples. How would you map the 673686557th tuple to a counter ID? With XOR, you only have a single state to maintain and broadcast.
XOR operations are hardware instructions that execute extremely efficiently. Counters would require huge amounts of storage, have overflow problems, and defeat the original requirement of a solution with low space overhead.
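To make the XOR idea concrete, here is a small standalone sketch (not Storm's actual implementation): each anchored tuple ID is XORed into a single 64-bit value once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when the whole tree has been acked, regardless of ack order:

    import java.util.Random;

    public class XorAckValSketch {
        public static void main(String[] args) {
            Random rnd = new Random();
            long ackVal = 0L;
            long[] childIds = new long[5];

            // "Emit" five anchored child tuples with random 64-bit IDs.
            for (int i = 0; i < childIds.length; i++) {
                childIds[i] = rnd.nextLong();
                ackVal ^= childIds[i];        // XOR in on emit
            }

            // "Ack" them in an arbitrary order.
            for (int i : new int[] {3, 0, 4, 1, 2}) {
                ackVal ^= childIds[i];        // XOR in again on ack
            }

            System.out.println(ackVal == 0);  // true: the tree is fully acked
        }
    }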

How does Apache Storm parallelism work?

I am new to Apache Storm and wondering how the parallelism hint works.
For example: we have one stream containing two tuples, <4> and <6>, one spout with only one task per executor, and one bolt that performs some operation on the tuples and has a parallelism hint of 2, so we have two executors of this bolt, namely A and B. Regarding this I have 3 questions:
1. Considering the above scenario, is it possible that the tuple which contains value 4 is processed by A and the tuple which contains value 6 is processed by B?
2. If processing is done in this manner, i.e. as described in question (1), won't it impact operations in which sequence matters?
3. If processing is not done in this manner, meaning both tuples go to the same executor, then what is the benefit of parallelism?
Considering the above scenario, is it possible that the tuple which contains value 4 is processed by A and the tuple which contains value 6 is processed by B?
Yes.
If processing is done in this manner, i.e. as described in question (1), won't it impact operations in which sequence matters?
It depends. You most likely have control over the sequence of the tuples in your spout. If sequence matters, it is advisable to either reduce parallelism or use fields grouping, to make sure tuples which depend on each other go to the same executor. If sequence does not matter, use shuffleGrouping or localOrShuffleGrouping to get the benefits of parallel processing.
If processing is not done in this manner, meaning both tuples go to the same executor, then what is the benefit of parallelism?
If both tuples go to the same executor, there is no benefit, obviously.
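To illustrate the difference, here is a minimal wiring sketch (NumberSpout, NumberBolt, and the "key" field are hypothetical placeholders, not a real API beyond TopologyBuilder itself). With shuffleGrouping, tuples <4> and <6> may land on different executors; with fieldsGrouping, tuples sharing the same key always reach the same executor, preserving per-key order:

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class GroupingSketch {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("numbers-spout", new NumberSpout());

            // Two executors; tuples are distributed round-robin, so order
            // across executors is not guaranteed.
            builder.setBolt("shuffled-bolt", new NumberBolt(), 2)
                   .shuffleGrouping("numbers-spout");

            // Two executors, but all tuples with the same "key" field value
            // are routed to the same executor.
            builder.setBolt("keyed-bolt", new NumberBolt(), 2)
                   .fieldsGrouping("numbers-spout", new Fields("key"));
        }
    }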

What happens to tuples which are not acked?

Let's say I have an Apache Storm topology processing some tuples. I ack most of them but sometimes, due to an error, they are not processed and therefore not acked.
What happens to these 'lost' tuples? Does Storm fail them automatically, or should I do that explicitly every time?
From Storm's docs:
http://storm.apache.org/releases/1.2.2/Guaranteeing-message-processing.html
By failing the tuple explicitly, the spout tuple can be replayed faster than if you waited for the tuple to time out (30 seconds by default).
Every tuple you process must be acked or failed. Storm uses memory to track each tuple, so if you don't ack/fail every tuple, the task will eventually run out of memory.
What happens to these 'lost' tuples? Does Storm fail them automatically?
Yes, Storm fails them automatically after the tuple timeout, but it is better for you to do that explicitly.
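As a concrete sketch, a bolt that acks on success and fails explicitly on error might look like this (the process(...) helper is hypothetical). Failing explicitly triggers an immediate replay instead of waiting for the ~30 second default timeout:

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    import java.util.Map;

    public class AckOrFailBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                process(input);            // hypothetical processing step that may throw
                collector.ack(input);      // mark the tuple as fully processed
            } catch (Exception e) {
                collector.fail(input);     // trigger an immediate replay from the spout
            }
        }

        private void process(Tuple input) { /* ... */ }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }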

In Apache Storm, what does collector.fail do?

In Apache Storm, what does collector.fail do?
Does it replay the tuple from the source (spout), or does it just replay the tuple from the last bolt from which it was emitted?
Note: I'm not anchoring my tuples, so what happens in that case?
As stated in the documentation on Guaranteeing Message Processing, the tuple will be replayed from the spout that generated it.
Each word tuple is anchored by specifying the input tuple as the first argument to emit. Since the word tuple is anchored, the spout tuple at the root of the tree will be replayed later on if the word tuple failed to be processed downstream. In contrast, let's look at what happens if the word tuple is emitted like this:
_collector.emit(new Values(word));
Emitting the word tuple this way causes it to be unanchored. If the tuple fails to be processed downstream, the root tuple will not be replayed. Depending on the fault-tolerance guarantees you need in your topology, sometimes it's appropriate to emit an unanchored tuple.
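To put the two variants side by side, here is a sketch of a bolt's execute method, assuming a _collector field and the same word-splitting example as in the docs quote above:

    public void execute(Tuple tuple) {
        String word = tuple.getString(0);

        // Anchored: ties the word tuple to `tuple`'s tree, so a downstream
        // fail() causes the spout tuple at the root to be replayed.
        _collector.emit(tuple, new Values(word));

        // Unanchored: not part of any tree; a downstream fail() will not
        // trigger a replay of the root tuple.
        _collector.emit(new Values(word));

        _collector.ack(tuple);
    }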

Understanding Storm parallelism

I have already read related material about Storm parallelism but some things are still unclear to me. Suppose we take tweet processing as an example. Generally what we are doing is retrieving the tweet stream, counting the number of words of each tweet, and writing the numbers to a local file.
My question is how to understand the parallelism value of spouts as well as bolts. Within the builder.setSpout and builder.setBolt functions we can assign the parallelism value. But in the case of word counting of tweets, is it correct that only one spout should be set? Would more than one spout just be copies of the first spout, so that identical tweets flow into several spouts? If that is the case, what is the value of setting more than one spout?
Another unclear point is how work is assigned to bolts. Is parallelism achieved by Storm finding a currently available bolt instance to process the next emitted tuple? I revised the basic tweet counting code so the final counting results are written into a specific directory; however, all the results end up combined in one file on nimbus. So after processing data on the supervisors, are all results sent back to nimbus? If this is true, what is the communication mechanism between nimbus and supervisors?
I really want to figure out these problems! I'd appreciate any help!
Setting the parallelism for spouts larger than one requires that the user code does different things for different instances; otherwise (as you mentioned already), data is just sent through the topology twice. For example, you can have a list of ports you want to listen to (or a list of different Kafka topics). You then need to ensure that different instances listen to different ports or topics. This can be achieved in the open(...) method by looking at topology metadata like the instance's own task ID and the degree of parallelism (dop). As each instance has a unique ID, you can partition your ports/topics such that each instance picks different ports/topics from the overall list, as in the sketch below.
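As an illustration, here is a minimal sketch of such a spout (PortSpout and its port list are made up for this example). Each instance uses its task index and the component's total task count to pick a disjoint subset of the work:

    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    public class PortSpout extends BaseRichSpout {
        private final List<Integer> allPorts = Arrays.asList(9001, 9002, 9003, 9004);
        private List<Integer> myPorts;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            int taskIndex = context.getThisTaskIndex();   // 0 .. dop-1
            int dop = context.getComponentTasks(context.getThisComponentId()).size();
            myPorts = new ArrayList<>();
            for (int i = taskIndex; i < allPorts.size(); i += dop) {
                myPorts.add(allPorts.get(i));             // disjoint round-robin split
            }
            // ... open listeners only on myPorts ...
        }

        @Override
        public void nextTuple() { /* read from myPorts and emit */ }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }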
About parallelism: this depends on the connection pattern you use when plugging your topology together. For example, using shuffleGrouping results in a round-robin distribution of your emitted tuples over the consuming bolt instances. For this case, Storm does not "look" at whether any bolt instance is available for processing; tuples are simply transferred and buffered at the receiver if necessary.
Furthermore, Nimbus and Supervisors only exchange metadata. There is no dataflow (i.e., flow of tuples) between them.
In some cases, like Kafka's consumer groups, you get queue behaviour, which means that if one consumer reads a message from the queue, other consumers will read different messages from the queue.
This distributes the read load from the queue across all workers.
In those cases you can have multiple spouts reading from the queue.
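For example, with the storm-kafka-client API a sketch could look like the following (the broker address, topic name, and group id are placeholders). All four spout tasks join the same consumer group and therefore share the topic's partitions instead of duplicating the stream:

    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.topology.TopologyBuilder;

    public class KafkaGroupSketch {
        public static void main(String[] args) {
            KafkaSpoutConfig<String, String> spoutConf =
                KafkaSpoutConfig.builder("kafka:9092", "tweets")
                                .setProp("group.id", "tweet-readers")
                                .build();

            TopologyBuilder builder = new TopologyBuilder();
            // Parallelism hint 4: four consumers in one group share partitions.
            builder.setSpout("tweet-spout", new KafkaSpout<>(spoutConf), 4);
        }
    }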
