how to understand arrival rate about apache storm disruptor queue - apache-storm

About storm metric. I do not understand the relationship between send queue arrival rate and receive queue arrival rate.
For example, when open ACK, if a spout receive one tuple , and it emit one tuple. whether the RQ arrival rate : SQ arrival rate = 1:2?
Besides, if system not stable. this Equation may be change?

Spout instances in Storm do not have a receive queue (only a send queue)? I assume you are referring to bolts?
Although it is a little old this article by Michael Noll gives a good overview of the internal queues within the workers.
To answer your question. The ratio between the queues will not always be 2:1. The disruptor queues report their metrics averaged over the user configurable topology.builtin.metrics.bucket.size.secs so this will obscure some of the difference. Also all metrics are subject to a sample ratio, set by the topology.stats.sample.rate config variable - which by default is only 20% of transferred tuples, this can also cause the reported numbers to be off.
Also, depending on the code in your bolts, 1 input tuple may produce many output tuples so you would have to take this into account in any ratios you were calculating.
You refer to the stability of an equation in your question. The arrival rate is not based on any queuing theory equation and is simply the number of tuples that are put on the queue in a metric.bucket period divided by the period length in seconds. However, Storm does report a queue sojourn time metric. This is based on a very simple queuing theory equation that is not reliable for unstable queue systems and should be avoided.

Related

Differnces between __execute-count value and values gathered by the Metrics Reporting API v2

I have run a topology, and I used the Meter type in metric Reporting API v2. In the execute method I mark this metric. So it will mark an event whenever the execute method is called. But when I compare this value with the __execute-count, I see huge differences. Does anyone know why this happens?
These are the values from my log which are gathered at the same time:
9:v7 __execute-count {v0:v7=44500}
9:v7 tuple_inRate.count 664129
Update:
When I use the mark method on the Meter metric, I will get different results in comparison with the Counter metric. But still, I do not understand why the values from the counter metric (tuple counter) are not the same as the __execute-count.
As given in this answer, Storms Internal Metrics are just estimated by a percentage of the real data flow. Initially, it uses 5% of incoming tuples to make those estimations. This may lead to inaccuracies for extreme high or low throughputs.
EDIT: The documentation describes the following:
In general all of these tuple count metrics are randomly sub-sampled unless otherwise stated. This means that the counts you see both on the UI and from the built in metrics are not necessarily exact. In fact by default we sample only 5% of the events and estimate the total number of events from that. The sampling percentage is configurable per topology through the topology.stats.sample.rate config. Setting it to 1.0 will make the counts exact, but be aware that the more events we sample the slower your topology will run (as the metrics are counted in the same code path as tuples are processed). This is why we have a 5% sample rate as the default.
EDIT 2 In this post, there is more information about the estimation:
The way it works is that if you choose a sampling rate of 0.05, it will pick a random element of the next 20 events in which to increase the count by 20. So if you have 20 tasks for that bolt, your stats could be off by +-380.
By the way, execute_count is just an increasing number, while your tuple_inRate.count is a rate, isn`t it?

High latency between spout -> bolt and bolt -> bolts

In my topology I see around 1 - 2 ms latency when transferring tuples from spouts to bolts or from bolts to bolts. I am calculating latency using nanosecond timestamps because the whole topology runs inside a single worker.
Topology is run in a cluster which runs in a production capable hardware.
To my understanding, tuples need not be serialized/de-serialized in this case as everything is inside single JVM. I have set parallelism hint for most spouts and bolts to 5 and spouts only produce events at a rate of 100 per second. I dont think high latency is due to queuing of events because I dont see any increase of latency with time. No memory increase either. log levels are set to ERROR. CPU usage is in the range of 200 to 300 %.
what could be causing this latency? I was expecting only few us's for tuple transfer.
I'm going to assume you're using one of the released Storm versions, and not 2.0.0-SNAPSHOT, since the queueing implementation has changed in that version.
I think it's likely that the delay is because Storm batches up tuples before delivering them to the consumer. Take a look at https://github.com/apache/storm/blob/v1.2.1/storm-core/src/jvm/org/apache/storm/utils/DisruptorQueue.java#L247, and also look at the Flusher class in that file. When a spout/bolt publishes a tuple, it is put into the _currentBatch list. It stays there until either enough tuples have been received so the batch is "big enough" (you can look at the _inputBatchSize variable to figure out when this is), or until the Flusher is triggered (happens by default once per millisecond).

Storm latency caused by ack

I was using kafka-storm to connect kafka and storm. I have 3 servers running zookeeper, kafka and storm. There is a topic 'test' in kafka that has 9 partitions.
In the storm topology, the number of KafkaSpout executor is 9 and by default, the number of tasks should be 9 as well. And the 'extract' bolt is the only bolt connected to KafkaSpout, the 'log' spout.
From the UI, there is a huge rate of failure in the spout. However, he number of executed message in bolt = the number of emitted message - the number of failed mesage in bolt. This equation is almost matched when the failed message is empty at the beginning.
Based on my understanding, this means that the bolt did receive the message from spout but the ack signals are suspended in flight. That's the reason why the number of acks in spout are so small.
This problem might be solved by increase the timeout seconds and spout pending message number. But this will cause more memory usage and I cannot increase it to infinite.
I was wandering if there is a way to force storm ignore the ack in some spout/bolt, so that it will not waiting for that signal until time out. This should increase the throughout significantly but not guarantee for message processing.
if you set the number of ackers to 0 then storm will automatically ack every sample.
config.setNumAckers(0);
please note that the UI only measures and shows 5% of the data flow.
unless you set
config.setStatsSampleRate(1.0d);
try increasing the bolt's timeout and reducing the amount of topology.max.spout.pending.
also, make sure the spout's nextTuple() method is non blocking and optimized.
i would also recommend profiling the code, maybe your storm Queues are being filled and you need to increase their sizes.
config.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE,32);
config.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE,16384);
config.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE,16384);
Your capacity numbers are a bit high, leading me to believe that you're really maximizing the use of system resources (CPU, memory). In other words, the system seems to be bogged down a bit and that's probably why tuples are timing out. You might try using the topology.max.spout.pending config property to limit the number of inflight tuples from the spout. If you can reduce the number just enough, the topology should be able to efficiently handle the load without tuples timing out.

Why is there a huge lag between executed and acked in storm topology

I am working on apache storm , i see there is a huge difference between executed and acked .
Following is the screenshot from Storm UI
What can we do to make acks equal to executed , i tried increasing the number of packers but that was of no help
to make it clear, I would like try to explain the two values' meaning. "Executed" represents how many times execute method is called for the bot. "Acked" means how many times the bolt calls ack.
From the snapshot above, it means booking_bolt executes "execute" method for 23300 times and call acked only 500 times.
So maybe in bolt's execute method, ack or fail is not called everytime.
From Michael G. Noll training : Why does the Storm UI report seemingly incorrect numbers?
Storm samples incoming tuples when computing statistics in order to increase performance.
Sample rate is configured via topology.stats.sample.rate.
0.05 is the default value
Here, Storm will pick a random event of the next 20 events in which to increase the metric count by 20. So if you have 20 tasks for that bolt, your stats could be off by +/- 380.
1.00 forces Storm to count everything exactly
This gives you accurate numbers at the cost of a big performance hit. For testing purposes however this is acceptable and often quite helpful.

what does each field of strm UI mean?

I want to see performance of each bolt and decide the number of parallelism.
In storm UI there are several fields which is confusing, so would be glad if you can tell me.
Capacity(last 10m) - average capacity per one second in last 10 minute of a single executor?
For example, if Capcity is 1.2, does that mean single executor processed 1.2 messages per second in average?
Execute latency and Process latency - Is it average value or value of last processed message?
and what is the difference between them?
and what is the difference between them and Capacity?
I have found a great article describing the Storm UI. You can reach it by link: http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
So, we have that:
capacity (last 10m) – If this is around 1.0, the corresponding Bolt is running as fast as it can, so you may want to increase the Bolt’s parallelism. This is (number executed * average execute latency) / measurement time.
Execute latency (ms) – The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple.
Process latency (ms) – The average time it takes to Ack a Tuple after it is first received. Bolts that join, aggregate or batch may not Ack a tuple until a number of other Tuples have been received.
Also, I found that documentation for the Storm UI REST API can be very useful for understanding the fields meaning: https://github.com/apache/storm/blob/master/STORM-UI-REST-API.md

Resources