High latency between spout -> bolt and bolt -> bolts - apache-storm

In my topology I see around 1 - 2 ms latency when transferring tuples from spouts to bolts or from bolts to bolts. I am calculating latency using nanosecond timestamps because the whole topology runs inside a single worker.
Topology is run in a cluster which runs in a production capable hardware.
To my understanding, tuples need not be serialized/de-serialized in this case as everything is inside single JVM. I have set parallelism hint for most spouts and bolts to 5 and spouts only produce events at a rate of 100 per second. I dont think high latency is due to queuing of events because I dont see any increase of latency with time. No memory increase either. log levels are set to ERROR. CPU usage is in the range of 200 to 300 %.
what could be causing this latency? I was expecting only few us's for tuple transfer.

I'm going to assume you're using one of the released Storm versions, and not 2.0.0-SNAPSHOT, since the queueing implementation has changed in that version.
I think it's likely that the delay is because Storm batches up tuples before delivering them to the consumer. Take a look at https://github.com/apache/storm/blob/v1.2.1/storm-core/src/jvm/org/apache/storm/utils/DisruptorQueue.java#L247, and also look at the Flusher class in that file. When a spout/bolt publishes a tuple, it is put into the _currentBatch list. It stays there until either enough tuples have been received so the batch is "big enough" (you can look at the _inputBatchSize variable to figure out when this is), or until the Flusher is triggered (happens by default once per millisecond).

Related

Storm ackers are limiting the performance

I was running a topology with many bolts. From the storm's ui, I can see that the Execute latency and Process latency of all bolts are very small (<1ms). However, the Complete latency of my Spouts raised up to 30s.
I thought such a huge discrepancy is caused by the ackers. Because, the ackers executed 101,522,080 times but only emitted 2,673,260, which means, if I'm correct, there are around 100,000,000 tuples are flying in the topology and waiting for Ack signal.
I tried to set the Ack numbers to 0 and disable Ack at all. But it turned out the entire system is running out of control. Also tried to double the number of ackers, but the situation does not get better.
Is the acker the real problem that limited the performance? And how to optimize such an issue?
First, setting number of ackers to zero means your spout emits all tuple when they are available so your topology encounters performance problems, failed tuples and piled up messages in consumer/spout side. Because all of your tuples acked (marked as executed ) instantly before the tuple executed in all bolts and TOPOLOGY_MAX_SPOUT_PENDING can't do its real duty.
In my opinion, first try to figure out the best TOPOLOGY_MAX_SPOUT_PENDING count for your topology. Then, tune ackers count.you can double it the number of workers and watch the performance from Strom UI.

Storm latency caused by ack

I was using kafka-storm to connect kafka and storm. I have 3 servers running zookeeper, kafka and storm. There is a topic 'test' in kafka that has 9 partitions.
In the storm topology, the number of KafkaSpout executor is 9 and by default, the number of tasks should be 9 as well. And the 'extract' bolt is the only bolt connected to KafkaSpout, the 'log' spout.
From the UI, there is a huge rate of failure in the spout. However, he number of executed message in bolt = the number of emitted message - the number of failed mesage in bolt. This equation is almost matched when the failed message is empty at the beginning.
Based on my understanding, this means that the bolt did receive the message from spout but the ack signals are suspended in flight. That's the reason why the number of acks in spout are so small.
This problem might be solved by increase the timeout seconds and spout pending message number. But this will cause more memory usage and I cannot increase it to infinite.
I was wandering if there is a way to force storm ignore the ack in some spout/bolt, so that it will not waiting for that signal until time out. This should increase the throughout significantly but not guarantee for message processing.
if you set the number of ackers to 0 then storm will automatically ack every sample.
config.setNumAckers(0);
please note that the UI only measures and shows 5% of the data flow.
unless you set
config.setStatsSampleRate(1.0d);
try increasing the bolt's timeout and reducing the amount of topology.max.spout.pending.
also, make sure the spout's nextTuple() method is non blocking and optimized.
i would also recommend profiling the code, maybe your storm Queues are being filled and you need to increase their sizes.
config.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE,32);
config.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE,16384);
config.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE,16384);
Your capacity numbers are a bit high, leading me to believe that you're really maximizing the use of system resources (CPU, memory). In other words, the system seems to be bogged down a bit and that's probably why tuples are timing out. You might try using the topology.max.spout.pending config property to limit the number of inflight tuples from the spout. If you can reduce the number just enough, the topology should be able to efficiently handle the load without tuples timing out.

Storm topology processing slowing down gradually

I have been reading about apache Storm tried few examples from storm-starter. Also learnt about how to tune the topology and how to scale it to perform fast enough to meet the required throughput.
I have created example topology with acking enabled, i am able to achieve 3K-5K messages processing per second. It performs really fast in initial 10 to 15min or around 1mil to 2mil message and then it starts slowing down. On storm UI, I can see the overall latency starts going up gradually and does not comes back, after a while the processing drops to only few hundred a second. I am getting exact same behavior for all the typologies i tried, the simplest one is to just read from kafka using KafkaSpout and send it to transform bolt parse the msg and send it to kafka again using KafkaBolt. The parser is very fast as it takes less than a millisecond to parse the message. I tried few option of increasing/describing the parallelism, changing the buffer sizes etc. but same behavior. Please help me to find out the reason for gradual slowness in the topology. Here is the config i am using
1 Nimbus machine (4 CPU) 24GB RAM
2 Supervisor machines (8CPU) and using 1 thread per core with 24GB RAM
4 Node kafka cluster running on above 2 supervisor machines (each topic has 4 partitions)
KafkaSpout(2 parallelism)-->TransformerBolt(8)-->KafkaBolt(2)
topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.spout.max.batch.size: 65536
topology.transfer.buffer.size: 32
topology.receiver.buffer.size: 8
topology.max.spout.pending: 250
At the start
After few minutes
After 45 min - latency started going up
After 80 min - Latency will keep going up and will go till 100 sec by the time it reaches 8 to 10mil messages
Visual VM screenshot
Threads
Pay attention to the capacity metric on RT_LEFT_BOLT, it is very close to 1; which explains why your topology is slowing down.
From the Storm documentation:
The Storm UI has also been made significantly more useful. There are new stats "#executed", "execute latency", and "capacity" tracked for all bolts. The "capacity" metric is very useful and tells you what % of the time in the last 10 minutes the bolt spent executing tuples. If this value is close to 1, then the bolt is "at capacity" and is a bottleneck in your topology. The solution to at-capacity bolts is to increase the parallelism of that bolt.
Therefore, your solution is to add more executors (and tasks) to that given bolt (RT_LEFT_BOLT). Another thing you can do is reduce the number of executors on RT_RIGHT_BOLT the capacity indicates you don't need that many executors, probably 1 or 2 can do the job.
The issue was due to GC setting with newgen params, it was not using the allocated heap completely so internal storm queues were getting full and running out of memory. The strange thing was that storm did not throw out of memory error, it just got stalled, with the help of visual vm i was able to trace it down.

Apache Storm: what happens to a tuple when no bolts are available to consume it?

If it's linked to another bolt, but no instances of the next bolt are available for a while. How long will it hang around? Indefinitely? Long enough?
How about if many tuples are waiting, because there is a line or queue for the next available bolt. Will they merge? Will bad things happen if too many get backed up?
By default tuples will timeout after 30 seconds after being emitted; You can change this value, but unless you know what you are doing don't do it (topology.message.timeout.secs)
Failed and timeout out tuples will be replayed by the spout, if the spout is reading from a reliable data source (eg. kafka); this is, storm has guaranteed message processing. If you are codding your own spouts, you might want to dig deep into this.
You can see if you are having timeout tuples on storm UI, when tuples are failing on the spout but not on the bolts.
You don't want tuples to timeout inside your topology (for example there is a performance penalty on kafka for not reading sequential). You should adjust the capacity of your topology process tuples (this is, tweak the bolt parallelism by changing the number of executors) and by setting the parameter topology.max.spout.pending to a reasonable conservative value.
increase the topology.message.timeout.secs parameter is no real solution, because soon or late if the capacity of your topology is not enough the tuples will start to fail.
topology.max.spout.pending is the max number of tuples that can be waiting. The spout will emit more tuples as long the number of tuples not fully processed is less than the given value. Note that the parameter topology.max.spout.pending is per spout (each spout has it's internal counter and keeps track of the tuples which are not fully processed).
There is a deserialize-queue for buffering the coming tuples, if it hangs long enough, the queue will be full,and tuples will be lost if you don't use the ack function to make sure it will be resent.
Storm just drops them if the tuples are not consumed until timeout. (default is 30 seconds)
After that, Storm calls fail(Object msgId) method of Spout. If you want to replay the failed tuples, you should implement this function. You need to keep the tuples in memory, or other reliable storage systems, such as Kafka, to replay them.
If you do not implement the fail(Object msgId) method, Storm just drops them.
Reference: https://storm.apache.org/documentation/Guaranteeing-message-processing.html

Apache Storm : Relation between Executors ,Execute latency and Process latency?

Topology with 1 executor assigned to Query Normalizer
Topology with 4 executor assigned to Query Normalizer
Initially I was running my topology with only 1 executor assigned to QueryNormalizer. The execute latency was 8.952 and process latency was 12.857.
To make it faster I changed the number of executors in QueryNormalizer to 4.The execute latency changed to 197.616 and process latency to 59.132.
According to the definition of Execute latency – The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple.
So, What I understand is it should be low if I increase the number of executors.As the parallelism should increase as the executor increases.
Am I misinterpreting something ?
Also, there is a huge difference between the emitted,transmitted and executed fields. Is this normal ?
Also, Should process latency be always lower than the execute latency ?
Which of the above shown topologies are better performance wise ? Also, How should I decide which topology is running better than the other , seeing the bolts data ?
Have a look at "complete latency" in the spout, that is the value the tuples spend in average inside in your topology, it had decreed.
So, What I understand is it should be low if I increase the number of executors.As the parallelism should increase as the executor increases.
it means you have now 4 units processing tuples, each unit process 1 tuple at the time, "theoretically" let you process 4 tuples at the same time instead of 1. Do your tuples look always the the same? this is, do they have always the same complexity?
Also, there is a huge difference between the emitted,transmitted and executed fields. Is this normal ?
executed means how many tuples your bolt consumed; emitted means how many tuples your bolt generated (in your case i see each consumed tuple is generating around 4 new tuples); transfered means how many emitted tuples were transfered to other bolts, for example you have two bolts consuming from the bolt emitting, in this case transfered would be equal a 2 * nr of tuples emitted.
Also, Should process latency be always lower than the execute latency ?
Not necessaly, have for example at Nathan Marz definition:
Process latency is time until tuple is acked, execute latency is time spent in execute for a tuple
and I can give you an example of one of my topologies where this does not happen:
Which of the above shown topologies are better performance wise ? Also, How should I decide which topology is running better than the other , seeing the bolts data ?
well let them run for a longer period of time. Both processed less than 1000 tuples, the size of the sample is too small. Ultimately the metric is the "complete latency" on the spout and the number of failed tuples.

Resources