How is the data consistent when we have many instances of the same spout running in Apache Storm

I want to use Apache Storm in one of my projects, and I have a concern regarding its parallelism technique. By design we can give hints on how many instances of each component we want to run.
For example, if there are 4 executors running the same spout, which is supposed to read data from an external source and transform it into tuples, how does Storm ensure that no two spouts read the same data?
Help would be appreciated.

Related

Does Storm support task or data parallelism?

I am trying to learn the parallelism and scalability features offered by Storm and read the following article: http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html. I am confused about whether Storm supports data or task parallelism. What I could understand (I may be wrong) is that Storm supports task parallelism, since the degree of parallelism is restricted by the number of tasks in the topology. If this is the case, how can it be used for large-scale parallel data processing, which requires data parallelism?
Any help would be greatly appreciated. Thanks :)
Storm does not follow textbook terminology. In fact, Storm supports data, task, and pipelined parallelism.
If you have an operator and assign a parallelism larger than one (parallelism_hint), you get as many threads as specified by the parameter, each executing the same code on different data, i.e., you get data parallelism. You can further assign the parameter number_of_tasks (which must be >= parallelism_hint) to split the input data into number_of_tasks partitions/substreams, i.e., more partitions than executors. Thus, some executor threads need to process multiple partitions/substreams (called tasks in Storm). This does not increase the parallelism (maybe the concurrency). However, it allows changing the number of executors at runtime.
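A minimal sketch of how the two knobs look in the TopologyBuilder API (the spout/bolt classes and component names here are made up):

TopologyBuilder builder = new TopologyBuilder();
// parallelism_hint = 2 -> two executor threads running the same spout code
// setNumTasks(4)      -> four tasks (partitions), so each executor runs two;
//                        the surplus tasks allow rebalancing to more executors later
builder.setSpout("words", new WordSpout(), 2).setNumTasks(4);
builder.setBolt("count", new CountBolt(), 2)
       .setNumTasks(4)
       .fieldsGrouping("words", new Fields("word"));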
As you have multiple spouts and bolts in your topology, and all those spouts and bolts are executed in different threads and even on different machines, you also have task parallelism here (not to be confused with Storm's usage of the term task!). As there are producer/consumer relationships between spouts/bolts, you additionally get pipeline parallelism, which is a special form of task parallelism. Another form of task parallelism in Storm is the ability to run multiple topologies at the same time.

Understanding Storm parallelism

I have already read related materials about Storm parallelism, but some things are still unclear. Suppose we take tweet processing as an example. Generally, what we are doing is retrieving a stream of tweets, counting the number of words in each tweet, and writing the numbers into a local file.
My question is how to understand the parallelism value of spouts as well as bolts. Within builder.setSpout and builder.setBolt we can assign the parallelism value. But in the case of counting words of tweets, is it correct that only one spout should be set? More than one spout would be regarded as copies of the same spout, so that identical tweets flow into several spouts. If that is the case, what is the value of setting more than one spout?
Another unclear point is how work is assigned to bolts. Is the parallel mechanism achieved in the way that Storm finds a currently available bolt to process the next emitted tuple? I revised the basic tweet-counting code so that the final counting results would be written into a specific directory; however, all results are actually combined in one file on nimbus. Therefore, after processing data on the supervisors, all results are sent back to nimbus. If this is true, what is the communication mechanism between nimbus and the supervisors?
I really want to figure out these problems!!! Any help is appreciated!!
Setting the parallelism of a spout to larger than one requires that the user code do different things for different instances. Otherwise (as you mentioned already), the same data is just sent through the topology multiple times. For example, you could have a list of ports you want to listen to (or a list of different Kafka topics). You then need to ensure that different instances listen to different ports or topics. This can be achieved in the open(...) method by looking at topology metadata, such as the spout's own task ID and the degree of parallelism (dop). As each instance has a unique ID, you can partition your ports/topics such that each instance picks different ports/topics from the overall list.
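A minimal sketch of that approach, assuming Storm 2.x package names; the class, the topic names, and the elided consuming logic are illustrative only:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;

public class PartitionedTopicSpout extends BaseRichSpout {
    // Hypothetical overall list of topics; every instance sees the same list.
    private static final List<String> ALL_TOPICS =
            Arrays.asList("topic-0", "topic-1", "topic-2", "topic-3");

    private List<String> myTopics;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        // Which instance am I (0 .. dop-1), and how many instances are there?
        int taskIndex = context.getThisTaskIndex();
        int dop = context.getComponentTasks(context.getThisComponentId()).size();

        // Round-robin partitioning: instance i takes topics i, i+dop, i+2*dop, ...
        myTopics = new ArrayList<>();
        for (int i = taskIndex; i < ALL_TOPICS.size(); i += dop) {
            myTopics.add(ALL_TOPICS.get(i));
        }
        // ... connect and subscribe to myTopics only
    }

    @Override
    public void nextTuple() {
        // ... poll myTopics and emit one tuple per message
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}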
About parallelism: this depends on the connection pattern you use when plugging your topology together. For example, using shuffleGrouping results in a round-robin distribution of the emitted tuples over the consuming bolt instances. For this case, Storm does not "look" at whether any bolt instance is available for processing. Tuples are simply transferred, and buffered at the receiver if necessary.
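For instance (names made up):

builder.setBolt("counter", new CounterBolt(), 4)
       .shuffleGrouping("tweets"); // each tuple goes to exactly one of the 4
                                   // executors, distributed round-robin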
Furthermore, Nimbus and the Supervisors only exchange metadata. There is no dataflow (i.e., no flow of tuples) between them.
In some cases, such as Kafka's consumer groups, you get queue behaviour: if one consumer reads from the queue, other consumers will read different messages from the queue.
This distributes the read load across all workers.
In those cases you can have multiple spouts reading from the queue.
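A sketch with the storm-kafka-client spout (the broker address, topic, and group id are made up); the key point is that both executors share one consumer group, so Kafka assigns them disjoint partitions:

KafkaSpoutConfig<String, String> spoutConf = KafkaSpoutConfig
        .builder("broker:9092", "LIVEDATA")
        .setProp(ConsumerConfig.GROUP_ID_CONFIG, "live-readers")
        .build();
// Two executors, one consumer group: no message is read twice.
builder.setSpout("live-spout", new KafkaSpout<>(spoutConf), 2);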

Creating threads in Storm Bolt

I want to fire multiple web requests in parallel and then aggregate the data in a Storm topology. Which of the following ways is preferred?
1) Create multiple threads within a bolt
2) Create multiple bolts and create a merging bolt to aggregate the data
I would like to create multiple threads within a bolt because merging data in another bolt is not a simple process. But I see there are some concerns around that which I found on the internet:
https://mail-archives.apache.org/mod_mbox/storm-user/201311.mbox/%3CCAAYLz+pUZ44GNsNNJ9O5hjTr2rZLW=CKM=FGvcfwBnw613r1qQ#mail.gmail.com%3E
but I didn't get a clear reason why not to create multiple threads. Any pointers will help.
On a side note, does that mean I should not use Java 8's parallel streams either, as described in https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html?
Increase the number of tasks for the bolt; it's like spawning multiple instances of the same bolt. Also increase the number of executors (threads) so the tasks are handled evenly.
Make sure #executors <= #tasks. Storm will do the rest for you.
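In code that could look like this (the bolt classes and component names are made up); the merging bolt from option 2 can stay a single instance via globalGrouping:

// 8 tasks served by 4 executor threads; `storm rebalance` can later raise
// the thread count up to 8 without redeploying the topology.
builder.setBolt("fetch", new WebRequestBolt(), 4).setNumTasks(8)
       .shuffleGrouping("urls");
builder.setBolt("aggregate", new MergeBolt(), 1)
       .globalGrouping("fetch"); // one instance sees all fetched results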

What is the difference between Trident topologies and Storm topologies?

I am not getting a good picture of the difference between Trident topologies and Storm topologies. While creating a spout I am using BaseRichSpout, and I am using a TridentTopology object while creating topologies, and it works.
So, can someone explain the difference between them?
Trident is an abstraction on top of Storm that provides exactly-once processing. Storm uses Spouts, Bolts, and Sinks. Trident uses Functions, Filters, and States, and transforms these into Bolts. If you're not sure you need Trident, I'd suggest you use plain Storm and not mix the two.
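Side by side, the two APIs look roughly like this (the spout and function classes are made up; Trident compiles the stream operations into bolts for you):

// Plain Storm: you wire spouts and bolts yourself.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("tweets", new TweetSpout(), 1);
builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("tweets");

// Trident: you declare functions on a stream; Trident turns them into
// bolts and adds the bookkeeping needed for exactly-once semantics.
TridentTopology trident = new TridentTopology();
trident.newStream("tweets", new TweetSpout())
       .each(new Fields("tweet"), new SplitFunction(), new Fields("word"));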

Is it possible to create multiple spouts in one topology? How?

I have two topics, BACKUPDATA and LIVEDATA.
What is the best solution for reading both topics?
1. Two different topologies?
2. One topology with two spouts?
I tried two different topologies, but Storm is not allocating slots to the second topology.
Yes, you can use multiple spouts in a topology.
builder.setSpout("kafka-spout1", new KafkaSpout(spoutConf1), 1);
builder.setSpout("kafka-spout2", new KafkaSpout(spoutConf2), 1);
Well, which configuration to choose depends on how you process the data.
If you create a separate topology for each topic, a failure in one topology won't affect the other, but it will increase the running cost.
In the case of a single topology with multiple spouts, both streams are affected by each other's failures. If you want to combine the data from both topics at the same time, you should use multiple spouts.
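If you do combine them, a single bolt can subscribe to both spouts (the bolt class is made up):

// One bolt consumes both streams; tuples from either topic arrive
// interleaved at the same executor(s).
builder.setBolt("combine", new CombineBolt(), 2)
       .shuffleGrouping("kafka-spout1")
       .shuffleGrouping("kafka-spout2");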
