I want to see the performance of each bolt so that I can decide how much parallelism to give it.
Several fields in the Storm UI are confusing, so I would be glad if you could explain them.
Capacity (last 10m) - is this the average capacity per second over the last 10 minutes for a single executor?
For example, if Capacity is 1.2, does that mean a single executor processed 1.2 messages per second on average?
Execute latency and Process latency - are these averages, or the values for the last processed message?
What is the difference between them?
And how do they differ from Capacity?
I found a great article describing the Storm UI: http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
According to it:
capacity (last 10m) – If this is around 1.0, the corresponding Bolt is running as fast as it can, so you may want to increase the Bolt’s parallelism. This is (number executed * average execute latency) / measurement time.
Execute latency (ms) – The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple.
Process latency (ms) – The average time it takes to Ack a Tuple after it is first received. Bolts that join, aggregate or batch may not Ack a tuple until a number of other Tuples have been received.
Also, I found the documentation for the Storm UI REST API very useful for understanding what the fields mean: https://github.com/apache/storm/blob/master/STORM-UI-REST-API.md
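To make the capacity formula quoted above concrete, here is a minimal sketch. The function name and the sample numbers are made up for illustration; only the formula itself comes from the article.

```python
def capacity(executed, avg_execute_latency_ms, window_ms=10 * 60 * 1000):
    """Fraction of the measurement window spent inside execute().

    Implements (number executed * average execute latency) / measurement time.
    The default window is the UI's 10 minutes, expressed in milliseconds.
    """
    return (executed * avg_execute_latency_ms) / window_ms


# Hypothetical executor: 60,000 tuples in 10 minutes at 5 ms each.
# It was busy for 300,000 ms out of 600,000 ms.
print(capacity(executed=60_000, avg_execute_latency_ms=5.0))
```

So capacity is a busy-time ratio, not a throughput: a value of 1.2 does not mean 1.2 messages per second.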
In my topology I see around 1 - 2 ms of latency when transferring tuples from spouts to bolts or from bolts to bolts. I calculate latency using nanosecond timestamps, which is possible because the whole topology runs inside a single worker.
The topology runs on a cluster with production-capable hardware.
To my understanding, tuples need not be serialized/deserialized in this case, as everything is inside a single JVM. I have set the parallelism hint for most spouts and bolts to 5, and the spouts only produce events at a rate of 100 per second. I don't think the high latency is due to queuing of events, because I don't see latency increasing over time. There is no memory increase either. Log levels are set to ERROR. CPU usage is in the range of 200 to 300%.
What could be causing this latency? I was expecting only a few microseconds for tuple transfer.
I'm going to assume you're using one of the released Storm versions, and not 2.0.0-SNAPSHOT, since the queueing implementation has changed in that version.
I think it's likely that the delay is because Storm batches up tuples before delivering them to the consumer. Take a look at https://github.com/apache/storm/blob/v1.2.1/storm-core/src/jvm/org/apache/storm/utils/DisruptorQueue.java#L247, and also look at the Flusher class in that file. When a spout/bolt publishes a tuple, it is put into the _currentBatch list. It stays there until either enough tuples have been received so the batch is "big enough" (you can look at the _inputBatchSize variable to figure out when this is), or until the Flusher is triggered (happens by default once per millisecond).
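The batching behaviour described above can be illustrated with a deliberately simplified model. This is not Storm's DisruptorQueue code: the function name is hypothetical, time is in integer microseconds, and the flusher is assumed to tick at exact multiples of the flush interval.

```python
def delivery_delays_us(publish_times_us, batch_size, flush_interval_us=1000):
    """Delay between publishing a tuple and its batch being handed over.

    A batch is delivered when it holds batch_size tuples, or at the next
    flusher tick after its first tuple, whichever comes first.
    publish_times_us must be sorted in ascending order.
    """
    def next_tick(t):
        # First tick strictly after time t.
        return (t // flush_interval_us + 1) * flush_interval_us

    delays, batch = [], []
    for t in publish_times_us:
        # A tick that fired before this tuple arrived flushed the old batch.
        if batch and next_tick(batch[0]) <= t:
            tick = next_tick(batch[0])
            delays.extend(tick - p for p in batch)
            batch = []
        batch.append(t)
        if len(batch) >= batch_size:  # batch is "big enough"
            delays.extend(t - p for p in batch)
            batch = []
    if batch:  # leftovers wait for the next tick
        tick = next_tick(batch[0])
        delays.extend(tick - p for p in batch)
    return delays


# 100 events per second: one tuple every 10 ms, so a batch never fills up.
print(delivery_delays_us([300, 10_300, 20_300], batch_size=100))
```

In this model, at a low event rate every tuple waits for the flusher, so the delay is bounded by the flush interval rather than by the few microseconds the hand-off itself takes.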
This is about Storm metrics. I do not understand the relationship between the send queue arrival rate and the receive queue arrival rate.
For example, with acking enabled, if a spout receives one tuple and emits one tuple, is the RQ arrival rate : SQ arrival rate ratio 1:2?
Also, if the system is not stable, can this equation change?
Spout instances in Storm do not have a receive queue (only a send queue), so I assume you are referring to bolts?
Although it is a little old, this article by Michael Noll gives a good overview of the internal queues within the workers.
To answer your question: the ratio between the queues will not always be 2:1. The disruptor queues report their metrics averaged over the user-configurable topology.builtin.metrics.bucket.size.secs, so this will obscure some of the difference. Also, all metrics are subject to a sample ratio, set by the topology.stats.sample.rate config variable, which by default is 0.05 (only 5%, or 1 in 20, of transferred tuples); this can also cause the reported numbers to be off.
Also, depending on the code in your bolts, 1 input tuple may produce many output tuples so you would have to take this into account in any ratios you were calculating.
You refer to the stability of an equation in your question. The arrival rate is not based on any queuing theory equation and is simply the number of tuples that are put on the queue in a metric.bucket period divided by the period length in seconds. However, Storm does report a queue sojourn time metric. This is based on a very simple queuing theory equation that is not reliable for unstable queue systems and should be avoided.
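As a small illustration of the arrival rate definition above (tuples put on the queue in a bucket period divided by the period length), here is a sketch with made-up numbers. The function name and the fan-out of 3 are assumptions, and ack tuples and sampling are ignored.

```python
def arrival_rate(tuples_in_bucket, bucket_size_secs):
    """Tuples put on a queue during one metrics bucket, per second."""
    return tuples_in_bucket / bucket_size_secs


bucket_secs = 60
received = 600          # tuples placed on the bolt's receive queue
fan_out = 3             # hypothetical: each input tuple yields 3 output tuples
sent = received * fan_out

rq_rate = arrival_rate(received, bucket_secs)
sq_rate = arrival_rate(sent, bucket_secs)
print(rq_rate, sq_rate, sq_rate / rq_rate)
```

The ratio here follows from the bolt's own fan-out, which is why no fixed 1:2 relationship can be assumed.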
How is 'capacity' calculated?
From their documentation:
The "capacity" metric is very useful and tells you what % of the time in the last 10 minutes the bolt spent executing tuples. If this value is close to 1, then the bolt is "at capacity" and is a bottleneck in your topology. The solution to at-capacity bolts is to increase the parallelism of that bolt.
I don't quite understand % of time. So if the value is 0.75 - what does it really mean?
It's the fraction of time that the bolt is busy rather than idle. 0.75 would mean that the bolt spends 75% of the time executing tuples and the remaining 25% waiting for new data to process.
Let's say you receive a new input tuple every second but your bolt takes 0.1 seconds to process it: the bolt will be idle 90% of the time and the capacity will be 0.1.
Another example: imagine you receive more data in real time than you can process, and you have two bolts where the task done by the first bolt takes more time than the second, so the first bolt is your bottleneck. The capacity of the first bolt will be around 1 and the capacity of the second will be below 1.
In both examples above, you can determine the parallelism (or processing power) that you need to set up for each bolt by looking at this number.
If the first bolt's capacity is 1 and the second's is 0.5, you probably want to give the first bolt twice as many executors as the second. At the same time (and most importantly), you have to increase the number of executors until that bolt's capacity is below 1, so you are sure that your topology is able to keep up and process all the data that is coming in real time.
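As a rough back-of-the-envelope version of this reasoning, the sketch below estimates an executor count from an observed capacity. It assumes load spreads evenly across executors and that capacity scales linearly with their number, which real topologies only approximate. The function name and the 0.8 target are arbitrary choices for illustration.

```python
import math


def executors_needed(current_executors, observed_capacity, target_capacity=0.8):
    """Estimate how many executors bring capacity down to the target.

    Assumes total busy time stays constant and is split evenly, so
    new_capacity = observed_capacity * current_executors / new_executors.
    """
    estimate = current_executors * observed_capacity / target_capacity
    return max(1, math.ceil(estimate))


# A bolt at capacity 1.0 with 2 executors, aiming for about 0.5:
print(executors_needed(2, 1.0, target_capacity=0.5))
```

Treat the result as a starting point and then check the capacity column again after redeploying.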
Topology with 1 executor assigned to Query Normalizer
Topology with 4 executor assigned to Query Normalizer
Initially I was running my topology with only 1 executor assigned to QueryNormalizer. The execute latency was 8.952 and the process latency was 12.857.
To make it faster, I changed the number of executors for QueryNormalizer to 4. The execute latency changed to 197.616 and the process latency to 59.132.
According to the definition of Execute latency – The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple.
So, what I understand is that it should be lower if I increase the number of executors, since parallelism should increase with the number of executors.
Am I misinterpreting something?
Also, there is a huge difference between the emitted, transferred and executed fields. Is this normal?
Also, should process latency always be lower than execute latency?
Which of the topologies shown above performs better? And how should I decide which topology is running better than the other by looking at the bolt data?
Have a look at the "complete latency" on the spout. That is the average time tuples spend inside your topology, and it has decreased.
So, what I understand is that it should be lower if I increase the number of executors, since parallelism should increase with the number of executors.
It means you now have 4 units processing tuples. Each unit processes 1 tuple at a time, which "theoretically" lets you process 4 tuples at the same time instead of 1. Do your tuples always look the same? That is, do they always have the same complexity?
Also, there is a huge difference between the emitted, transferred and executed fields. Is this normal?
"Executed" means how many tuples your bolt consumed. "Emitted" means how many tuples your bolt generated (in your case I see each consumed tuple generating around 4 new tuples). "Transferred" means how many emitted tuples were transferred to other bolts. For example, if two bolts consume from the emitting bolt, transferred would be equal to 2 * the number of tuples emitted.
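The relationship between the three counters can be written down as a tiny sketch. The function and parameter names are hypothetical, and the model ignores sampling and assumes every downstream subscriber receives every emitted tuple.

```python
def bolt_counters(executed, emitted_per_input, downstream_subscribers):
    """Expected UI counters for a bolt under the simplified model above."""
    emitted = executed * emitted_per_input
    transferred = emitted * downstream_subscribers
    return {"executed": executed, "emitted": emitted, "transferred": transferred}


# 100 consumed tuples, 4 output tuples each, 2 bolts subscribed downstream.
print(bolt_counters(executed=100, emitted_per_input=4, downstream_subscribers=2))
```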
Also, should process latency always be lower than execute latency?
Not necessarily. Take, for example, Nathan Marz's definition:
Process latency is time until tuple is acked, execute latency is time spent in execute for a tuple
and I can give you an example of one of my topologies where this does not happen:
Which of the topologies shown above performs better? And how should I decide which topology is running better than the other by looking at the bolt data?
Well, let them run for a longer period of time. Both processed fewer than 1000 tuples, so the sample size is too small. Ultimately, the metrics that matter are the "complete latency" on the spout and the number of failed tuples.
I am working with Apache Storm, and I see a huge difference between executed and acked.
Following is the screenshot from the Storm UI.
What can we do to make acked equal to executed? I tried increasing the number of ackers, but that did not help.
To make it clear, let me explain what the two values mean. "Executed" represents how many times the execute method is called for the bolt. "Acked" means how many times the bolt calls ack.
From the snapshot above, booking_bolt ran its "execute" method 23300 times and called ack only 500 times.
So maybe ack or fail is not called every time in the bolt's execute method.
From Michael G. Noll's training: Why does the Storm UI report seemingly incorrect numbers?
Storm samples incoming tuples when computing statistics in order to increase performance.
Sample rate is configured via topology.stats.sample.rate.
0.05 is the default value
Here, Storm will pick a random event of the next 20 events in which to increase the metric count by 20. So if you have 20 tasks for that bolt, your stats could be off by +/- 380.
1.00 forces Storm to count everything exactly
This gives you accurate numbers at the cost of a big performance hit. For testing purposes however this is acceptable and often quite helpful.
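The sampling scheme described above can be mimicked with a short simulation. This is a simplified model of that description, not Storm's actual sampler, and the function name is made up.

```python
import random


def sampled_count(n_events, sample_rate=0.05, rng=None):
    """Count n_events the way a 1-in-N sampler would report them.

    For each window of N = 1/sample_rate events, one random position is
    picked; if an event arrives at that position, the counter jumps by N.
    """
    rng = rng or random.Random()
    window = round(1 / sample_rate)    # 20 for the default rate of 0.05
    count, seen = 0, 0
    while seen < n_events:
        pick = rng.randrange(window)   # position inside the next window
        if seen + pick < n_events:     # only counted if that event arrived
            count += window
        seen += window
    return count


# One task that really processed 50 tuples reports either 40 or 60.
print(sampled_count(50, rng=random.Random(1)))
```

In this model a single task is off by at most 19, which matches the +/- 380 figure for 20 tasks. Passing sample_rate=1.0 makes the count exact.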