Understanding Storm parallelism - parallel-processing

I have already read related materials about Storm parallelism, but some things are still unclear to me. Let's take tweet processing as an example: we retrieve a stream of tweets, count the number of words in each tweet, and write the counts to a local file.

My question is how to understand the parallelism value of spouts and bolts. We can assign a parallelism value within builder.setSpout and builder.setBolt. But in the case of counting words in tweets, is it correct that only one spout should be set? Wouldn't more than one spout just be copies of the same spout, so that identical tweets flow into several spouts? If that is the case, what is the value of setting more than one spout?

Another unclear point is how work is assigned to bolts. Is the parallel mechanism achieved by Storm finding a currently available bolt to process the next emitted tuple? I revised the basic tweet-counting code so that the final counting results are written into a specific directory; however, all results actually end up combined in one file on Nimbus. So after the data is processed on the supervisors, are all results sent back to Nimbus? If so, what is the communication mechanism between Nimbus and the supervisors?

I really want to figure these problems out! Any help is much appreciated!

Setting the parallelism of a spout larger than one requires that the user code do different things in different instances; otherwise (as you mentioned already), the data is just sent through the topology multiple times. For example, you can have a list of ports you want to listen to (or a list of different Kafka topics). You then need to ensure that different instances listen to different ports or topics. This can be achieved in the open(...) method by looking at topology metadata such as the instance's own task ID and the degree of parallelism. As each instance has a unique ID, you can partition your ports/topics such that each instance picks different ports/topics from the overall list.
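To illustrate, here is a rough sketch of such a spout (the topic list, class name, and output field are made up; Storm 1.x-style API):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;

    public class PartitionedTopicSpout extends BaseRichSpout {
        // Hypothetical overall list; in practice this might come from the config.
        private final List<String> allTopics =
                Arrays.asList("topic-a", "topic-b", "topic-c", "topic-d");
        private List<String> myTopics;
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            // How many instances of this spout exist, and which one is this?
            int numTasks = context.getComponentTasks(context.getThisComponentId()).size();
            int myIndex = context.getThisTaskIndex();
            // Round-robin partition of the list: each instance gets a disjoint subset.
            myTopics = new ArrayList<>();
            for (int i = myIndex; i < allTopics.size(); i += numTasks) {
                myTopics.add(allTopics.get(i));
            }
            // ... connect/subscribe to myTopics only ...
        }

        @Override
        public void nextTuple() {
            // ... poll myTopics and emit tuples via this.collector ...
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("tweet"));
        }
    }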
About parallelism: this depends on the connection pattern you use when plugging your topology together. For example, using shuffleGrouping results in a round-robin distribution of the emitted tuples over the consuming bolt instances. In this case, Storm does not "look" to see whether a bolt instance is available for processing; tuples are simply transferred and buffered at the receiver if necessary.
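For illustration, a minimal wiring sketch (TweetSpout and WordCountBolt are hypothetical placeholders):

    import org.apache.storm.topology.TopologyBuilder;

    TopologyBuilder builder = new TopologyBuilder();
    // One spout instance, four bolt instances.
    builder.setSpout("tweets", new TweetSpout(), 1);
    // shuffleGrouping sends the spout's tuples round-robin over the four
    // WordCountBolt instances, regardless of how busy each instance is.
    builder.setBolt("count", new WordCountBolt(), 4)
           .shuffleGrouping("tweets");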
Furthermore, Nimbus and the supervisors only exchange metadata. There is no dataflow (i.e., no flow of tuples) between them.

In some cases, such as Kafka consumer groups, you get queue behavior: if one consumer reads a message from the queue, other consumers will read different messages from the queue. This distributes the read load across all workers. In those cases you can have multiple spouts reading from the same queue, as shown below.
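For example, a rough sketch using the storm-kafka-client KafkaSpout (broker address, topic, and group ID are made up); all four spout tasks join the same consumer group, so each reads a disjoint subset of the topic's partitions:

    KafkaSpoutConfig<String, String> spoutConfig = KafkaSpoutConfig
            .builder("kafka-broker:9092", "tweets")
            .setProp(ConsumerConfig.GROUP_ID_CONFIG, "tweet-counters")
            .build();
    builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 4);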

Related

Flink how are partitions of a stream associated with the parallelism?

I am new to Flink and I'm trying to understand a few things. I've got a theory which I am trying to confirm. It goes like this:

1. Parallelism refers to how many parallel "machines" (these could be threads or different machines, as I understand it; correct me if I'm wrong) will run my job.
2. Flink by default partitions the stream in a round-robin manner to take advantage of the job's parallelism.
3. If the programmer defines a partitioning strategy (for example with keyBy), then this strategy is followed instead of the default round-robin.
4. If the parallelism is set to 1, then partitioning the stream has no effect on processing speed, since the whole stream ends up being processed by the same machine. In this case, the only benefit of partitioning a stream (with keyBy) is that the stream can be processed in a keyed context.
5. keyBy guarantees that elements with the same key (the same group) will be processed by the same "machine", but it doesn't mean that this machine will only process elements of that group. It could process elements of other groups as well, but it processes each group as if it were the only one, independently of the others.
6. Setting a parallelism of 3 while the maximum number of partitions my partitioning strategy can spawn is 2 is kind of meaningless, as only 2 of the 3 "machines" will end up processing the two partitions.
Can somebody tell me if these points are correct? Please correct me where I'm wrong.
Thank you in advance for your time.
I think you've got it. To expand on point 6: if your job uses a keyBy to do repartitioning, as in

    source
        .keyBy(...)
        .window(...)
        .sinkTo(...)

then in a case where the source is a Kafka topic with only 2 partitions, the source operator will only have 2 active instances, but for the window and sink all 3 instances will have meaningful work to do (assuming there are enough distinct keys).
Also, while we don't talk about it much, there's also horizontal parallelism you can exploit. For example, in the job outlined above, the source task will run in one Java thread, and the task with the window and sink will run in another thread. (These are separate tasks because the keyBy forces a network shuffle.) If you give each task slot enough hardware resources, then these tasks will be able to run more-or-less independently (there's a bit of coupling, since they're in the same JVM).
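To make the shape of that job concrete, here is a minimal sketch (the socket source is just a stand-in for the 2-partition Kafka source; the class name and key choice are made up for illustration):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class KeyByParallelismSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(3); // default parallelism for every operator

            env.socketTextStream("localhost", 9999)  // stand-in source
               .keyBy(word -> word)                  // forces a network shuffle into 3 subtasks
               .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
               .reduce((a, b) -> a + " " + b)        // window + sink run as a separate task
               .print();                             // stand-in sink, chained to the window

            env.execute("keyBy parallelism sketch");
        }
    }

Because of the keyBy, the source task and the window/sink task run in different threads, as described above.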

Where does Apache Storm store tuples before a node is available to process it?

I am reading up on Apache Storm to evaluate if it is suited for our real-time processing needs.
One thing that I couldn't figure out so far is: where does it store tuples while the next node is not available to process them? E.g., let's say spout A produces 1000 tuples per second, but the next level of bolts (which process spout A's output) can only collectively consume 500 tuples per second. What happens to the other tuples? Does Storm have a disk-based buffer (or something else) to account for this?
Storm uses internal in-memory message queues. Thus, if a bolt cannot keep up processing, the messages are buffered there.
Before Storm 1.0.0 those queues could grow without bound (i.e., you got an out-of-memory exception and your worker died). To protect against data loss, you need to make sure that the spout can re-read the data (see https://storm.apache.org/releases/1.0.0/Guaranteeing-message-processing.html).
You can also use the "max.spout.pending" parameter to limit the number of tuples in flight and tackle this problem.
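For example (the limit of 500 is illustrative; the cap only takes effect when tuples are emitted with message IDs, i.e., reliable processing is enabled):

    import org.apache.storm.Config;

    Config conf = new Config();
    // At most 500 unacked tuples per spout task in flight at once;
    // nextTuple() is not called while the limit is reached.
    conf.setMaxSpoutPending(500);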
As of Storm 1.0.0, backpressure is supported (see https://storm.apache.org/2016/04/12/storm100-released.html). This allows a bolt to notify its upstream producers to "slow down" if a queue grows too large (and to speed up again once the queue drains). In your spout-bolt example, the spout would slow down its emission of messages in this case.
Typically, Storm spouts read from some persistent store and track the completion of tuples to determine when it's safe to remove or ack a message in that store. Vanilla Storm itself does not persist tuples; tuples are replayed from the source in the event of a failure.
I have to agree with others that you should check out Heron. Stream processing frameworks have advanced significantly since the inception of Storm.

Creating threads in Storm Bolt

I want to fire multiple web requests in parallel and then aggregate the data in a Storm topology. Which of the following ways is preferred?
1) Create multiple threads within a bolt.
2) Create multiple bolts and a merging bolt to aggregate the data.
I would like to create multiple threads within a bolt, because merging the data in another bolt is not a simple process. But I have seen some concerns around that on the internet:
https://mail-archives.apache.org/mod_mbox/storm-user/201311.mbox/%3CCAAYLz+pUZ44GNsNNJ9O5hjTr2rZLW=CKM=FGvcfwBnw613r1qQ#mail.gmail.com%3E
but I didn't get a clear reason why not to create multiple threads. Any pointers will help.
On a side note, does that mean I should not use Java 8's parallel streams either, as described in https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html?
Increase the number of tasks for the bolt; it's like spawning multiple instances of the same bolt. Also increase the number of executors (threads) so the tasks are handled evenly.
Make sure #executors <= #tasks. Storm will do the rest for you.
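For example (WebRequestBolt and the component names are hypothetical):

    // 4 executors (threads) running 8 tasks, i.e. 2 tasks per executor.
    // #executors <= #tasks holds, and the topology can later be rebalanced
    // to spread the 8 tasks over more executors without redeploying.
    builder.setBolt("web-requests", new WebRequestBolt(), 4)
           .setNumTasks(8)
           .shuffleGrouping("url-spout");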

Storm data structures - map vs separated values?

I'm using Storm to parse and save data from Kafka. The data comes in as some identifiers plus a map<string,string> of varying size. After some munging, the end goal is Cassandra.
Should I send the data as one block of tuples or split up the map and send each piece separately?
A tuple should represent a "unit of work" for the next bolt in the stream. If you think of your map as a single entity that gets processed as a single, albeit complex, object then the map should be emitted as a single tuple. If you want different bolts independently processing different map attributes, then break the map into subsequently processable subsets of attributes and emit multiple tuples.
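As a sketch, the two options inside a bolt's execute() method could look like this (id and attributeMap are placeholders; the declared output fields must match what you emit):

    // Option 1: the whole map is one unit of work -> one tuple.
    collector.emit(new Values(id, attributeMap));

    // Option 2: each map entry is an independent unit of work -> one tuple per entry.
    for (Map.Entry<String, String> entry : attributeMap.entrySet()) {
        collector.emit(new Values(id, entry.getKey(), entry.getValue()));
    }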
It depends on the size of the tuple you want to send.
Every tuple you emit in Storm is serialized as a message to transmit from one executor to another. You should also take the performance of Netty (used for inter-worker communication) and the LMAX Disruptor (used for intra-worker communication) into consideration, since recent versions of Storm rely on them. That is, settings like
Config.TOPOLOGY_RECEIVER_BUFFER_SIZE
Config.TOPOLOGY_TRANSFER_BUFFER_SIZE
Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE
Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE
should be taken into account. You could take a look at Understanding the Internal Message Buffers of Storm for more details.

Is it possible to create multiple spouts in one topology? how?

I have two topics, BACKUPDATA and LIVEDATA.
What is the best solution for reading both topics?
1. Two different topologies?
2. One topology with two spouts?
I tried two different topologies, but Storm did not allocate slots to the second topology.
Yes, you can use multiple spouts in a topology.
builder.setSpout("kafka-spout1", new KafkaSpout(spoutConf1), 1);
builder.setSpout("kafka-spout2", new KafkaSpout(spoutConf2), 1);
Well, the configuration depends on how you process the data.
If you create a separate topology for each topic, a failure in one topology won't affect the other, but running two topologies costs more.
With a single topology and multiple spouts, both spouts are affected by each other's failures. If you want to combine the data from both topics at the same time, you should use multiple spouts in one topology.
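Building on the snippet above, a single (hypothetical) MergeBolt can subscribe to both spouts:

    builder.setSpout("kafka-spout1", new KafkaSpout(spoutConf1), 1);
    builder.setSpout("kafka-spout2", new KafkaSpout(spoutConf2), 1);
    builder.setBolt("merge", new MergeBolt(), 2)
           .shuffleGrouping("kafka-spout1")
           .shuffleGrouping("kafka-spout2");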
