The complete latency for the topology and the spout is always reported as zero, and the acker latency is zero as well. The acked counts for the bolts look fine. I am using Storm 1.1.1. My topology reads a text file and classifies the text using Naive Bayes in a distributed Apache Storm environment.
I think you need to anchor the tuples in order to be able to track the latency:
http://storm.apache.org/releases/1.0.6/Guaranteeing-message-processing.html
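As a minimal sketch of what anchoring looks like (the class name and field names here are hypothetical, not from the question): passing the input tuple as the first argument to emit ties the new tuple to the spout tuple's tracking tree, and the explicit ack is what lets the ackers measure complete latency.

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import java.util.Map;

// Hypothetical bolt illustrating anchored emits.
public class ClassifierBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String text = input.getString(0);
        // emit(input, ...) anchors the new tuple to the input tuple;
        // an un-anchored emit(new Values(...)) would not be tracked.
        collector.emit(input, new Values(text.toLowerCase()));
        collector.ack(input);  // without this ack, complete latency stays at 0
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("classified"));
    }
}
```

Note that the spout side also has to emit each tuple with a message ID (`collector.emit(values, msgId)`); tuples emitted without one are never tracked by the ackers, which is another reason the latency can show up as zero.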
Related
I want to use Apache Storm in one of my projects. I have a concern about its parallelism technique. By definition, we can give hints on how many instances of each component we want to run.
For example, if there are 4 executors running the same spout, which is supposed to read data from an external source and transform it into tuples, how does Storm ensure that no two or more spouts get the same data?
Help would be appreciated.
New to Storm and just understanding the concept of Spouts and how to achieve parallelism in them.
I have defined a spout A with 3 tasks and 3 executors, and 1 bolt (let's not worry about the bolt). Let's assume each spout task
is assigned a dedicated worker. That means there are 3 spout instances ready to receive a stream. A message (say X) enters the topology. How is this handled by the spout?
a. Will all the spouts receive the message X? If yes, then all 3 spouts will process it, and the same message is processed multiple times, right?
b. Who will decide in above case which spout should receive this stream?
c. Is it possible to balance the load across the spouts?
d. Is it that there should be only one spout in the topology ?
P.S.: Consider this a general spout; don't confuse it with the Kafka spout.
Storm is just a framework; your questions are really determined by the implementation of the spout code. So, sadly, there is no way to discuss a "general spout". We have to discuss a specific spout.
Let's take the Kafka spout as an example. Basically, it is no different from a normal Kafka consumer. The Kafka spout has logic to distribute partitions across the spout tasks, and load balancing is handled at that point as well: one partition is consumed by exactly one spout task, so no data is read more than once.
I am reading up on Apache Storm to evaluate if it is suited for our real time processing needs.
One thing that I couldn't figure out until now is: where does it store tuples while the next node is not available to process them? For example, say spout A produces 1000 tuples per second, but the next level of bolts (which process spout A's output) can only collectively consume 500 tuples per second. What happens to the other tuples? Does it have a disk-based buffer (or something else) to account for this?
Storm uses internal in-memory message queues. Thus, if a bolt cannot keep up with processing, the messages are buffered there.
Before Storm 1.0.0, those queues could grow without bound (i.e., you get an out-of-memory exception and your worker dies). To protect against data loss, you need to make sure the spout can re-read the data (see https://storm.apache.org/releases/1.0.0/Guaranteeing-message-processing.html).
You can use the `topology.max.spout.pending` parameter to limit the number of tuples in flight and tackle this problem.
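For example, a minimal sketch of setting this through the Java `Config` API (the value 500 is purely illustrative; the right value depends on your tuple sizes and processing latency):

```java
import org.apache.storm.Config;

public class PendingConfig {
    public static Config build() {
        Config conf = new Config();
        // Cap on un-acked tuples per spout task. Only effective when the
        // spout emits tuples with message IDs (i.e., reliable mode).
        conf.setMaxSpoutPending(500);
        return conf;
    }
}
```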
As of Storm 1.0.0, backpressure is supported (see https://storm.apache.org/2016/04/12/storm100-released.html). This allows a bolt to notify its upstream producers to "slow down" if a queue grows too large (and to speed up again once the queue drains). In your spout-bolt example, the spout would slow down its emit rate in this case.
Typically, Storm spouts read off of some persistent store and track the completion of tuples to determine when it's safe to remove or ack a message in that store. Vanilla Storm itself does not persist tuples; tuples are replayed from the source in the event of a failure.
I have to agree with others that you should check out Heron. Stream processing frameworks have advanced significantly since the inception of Storm.
I have already read related materials about Storm parallelism, but some things are still unclear to me. Take tweet processing as an example: retrieve the tweet stream, count the number of words in each tweet, and write the counts to a local file.
My question is how to understand the parallelism value of spouts and bolts. Within builder.setSpout and builder.setBolt we can assign the parallelism value. But in the tweet word-counting case, is it correct that only one spout should be set? Are additional spouts just copies of the first, so that identical tweets flow into several spouts? If that is the case, what is the value of setting more than one spout?
Another unclear point is how work is assigned to bolts. Does the parallelism mechanism work by Storm finding a currently available bolt instance to process the next emitted tuple? I revised the basic tweet-counting code so the final counts are written to a specific directory; however, all results actually end up combined in one file on nimbus. So it seems that after processing on the supervisors, all results are sent back to nimbus. If this is true, what is the communication mechanism between nimbus and the supervisors?
I really want to figure these out! Any help would be appreciated!
Setting the parallelism of a spout larger than one requires that the user code do different things in different instances. Otherwise (as you mentioned already), the data is just sent through the topology multiple times. For example, you can have a list of ports you want to listen to (or a list of different Kafka topics). You then need to ensure that different instances listen to different ports or topics. This can be achieved in the open(...) method by looking at topology metadata such as the instance's own task ID and the degree of parallelism. As each instance has a unique ID, you can partition your ports/topics such that each instance picks different ports/topics from the overall list.
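As a concrete sketch of that partitioning step (the class and method names are my own, not Storm API): inside open(...) you could read the task index via `context.getThisTaskIndex()` and the total task count via `context.getComponentTasks(context.getThisComponentId()).size()`, then pick a disjoint slice of the topic/port list:

```java
import java.util.ArrayList;
import java.util.List;

public class TopicPartitioner {
    // Round-robin assignment of topics to spout instances: the instance
    // with 0-based task index i out of n tasks takes every n-th topic
    // starting at position i.
    public static List<String> assign(List<String> topics, int taskIndex, int numTasks) {
        List<String> mine = new ArrayList<>();
        for (int i = taskIndex; i < topics.size(); i += numTasks) {
            mine.add(topics.get(i));
        }
        return mine;
    }
}
```

Because each instance gets a disjoint slice, no two spout tasks read the same topic, so no duplicate data enters the topology.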
About parallelism: this depends on the connection pattern you use when plugging your topology together. For example, using shuffleGrouping results in a round-robin distribution of the emitted tuples over the consuming bolt instances. For this case, Storm does not "look" at whether a bolt instance is available for processing; tuples are simply transferred and buffered at the receiver if necessary.
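For instance, wiring a topology with shuffleGrouping looks like this (TweetSpout and WordCountBolt are hypothetical placeholders); the spout's output is round-robined over the four bolt executors:

```java
import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("tweets", new TweetSpout(), 2);      // TweetSpout: hypothetical
builder.setBolt("counter", new WordCountBolt(), 4)    // WordCountBolt: hypothetical
       .shuffleGrouping("tweets");                    // round-robin over 4 executors
```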
Furthermore, Nimbus and the Supervisors only exchange metadata. There is no dataflow (i.e., no flow of tuples) between them.
In some cases, such as Kafka's consumer groups, you get queue behaviour: when one consumer reads a message from the queue, other consumers will read different messages.
This distributes the read load across all workers.
In those cases you can have multiple spouts reading from the queue.
As per the given link, the capacity of a bolt is the percentage of time spent executing. Therefore this value should always be smaller than 1. But in my topology I have observed that it goes above 1 in some cases. How is that possible, and what does it mean?
http://i.stack.imgur.com/rwuRP.png
It means that your bolt is running over capacity and your topology will fall behind in processing if the bolt is unable to catch up.
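As far as I know, the UI computes capacity roughly as (number of tuples executed × average execute latency) ÷ length of the measurement window, i.e., an estimate of the fraction of wall-clock time spent inside execute(). A small sketch of that arithmetic (the helper class is mine, not Storm code):

```java
public class BoltCapacity {
    // Rough form of Storm UI's capacity metric: estimated fraction of the
    // measurement window the executor spent inside execute().
    public static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return (executed * executeLatencyMs) / windowMs;
    }
}
```

With 120,000 executions at 6 ms average latency over a 10-minute (600,000 ms) window, the estimate is 1.2: more estimated execute time than wall-clock time. Because the number is built from sampled counts and averaged latencies rather than measured directly, it can legitimately come out above 1, and anything near or above 1 means the bolt is saturated.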
When you see a bolt that is running over (or close to over) capacity, that is your clue that you need to start tuning performance and tweaking parallelism.
Some things you can do:
Increase the parallelism of the bolt by increasing the number of executors & tasks.
Do some simple profiling within your slow bolts to see if you have a performance problem.
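A sketch of what the first point can look like (assuming a TopologyBuilder named builder and a hypothetical SlowBolt): the parallelism hint sets the executor count, and setNumTasks fixes a higher task count so executors can later be scaled up with `storm rebalance` without redeploying the topology:

```java
builder.setBolt("slow-bolt", new SlowBolt(), 4)  // 4 executors initially
       .setNumTasks(8)                           // 8 tasks; can scale to 8 executors later
       .shuffleGrouping("spout");
```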
You can get more detail about what happens in your bolts using Storm metrics:
https://storm.apache.org/documentation/Metrics.html
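As a small sketch of that API (the bolt and metric names are my own; `CountMetric` and `registerMetric` are from the Storm metrics API), you can register a counter in a bolt's prepare and increment it per tuple:

```java
import org.apache.storm.metric.api.CountMetric;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import java.util.Map;

public class MeteredBolt extends BaseRichBolt {
    private transient CountMetric processed;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        processed = new CountMetric();
        // Report this counter to the registered metrics consumers every 60 s.
        context.registerMetric("processed", processed, 60);
    }

    @Override
    public void execute(Tuple input) {
        processed.incr();  // count every tuple handled
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}
```

For the values to show up anywhere, you also need to register a metrics consumer in the topology config (e.g., the built-in LoggingMetricsConsumer), as described on the Metrics page linked above.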