Storm processing time - time

How can I get the processing time (in seconds) for all bolts in a topology for a given number of tuples? I'd like to probe Storm performance in terms of number of tuples, time, and nodes.

If you want to measure this type of performance, you have to write your own code to do it. There are some good options to help you:
Storm metrics (built into Storm 0.9.0+)
Metrics (formerly Codahale metrics)
The poor man's approach is to attach a start time to each tuple when you begin processing it and calculate the elapsed time yourself, but this is unreliable on a cluster because the machines' clocks may differ slightly.
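As a rough illustration of the do-it-yourself approach, here is a minimal sketch that only times work inside a single JVM, so clock differences between machines do not come into play. LatencyStats and its methods are hypothetical names and not part of Storm; in a real bolt you would wrap the body of execute() in something like time(...), or hand the numbers to one of the metrics libraries listed above.

```java
import java.util.function.Supplier;

// Hypothetical helper (not a Storm class): accumulates per-call timings.
class LatencyStats {
    private long count;
    private long totalNanos;
    private long minNanos = Long.MAX_VALUE;
    private long maxNanos;

    // Record one measured duration in nanoseconds.
    synchronized void record(long nanos) {
        count++;
        totalNanos += nanos;
        minNanos = Math.min(minNanos, nanos);
        maxNanos = Math.max(maxNanos, nanos);
    }

    // Time a unit of work with the monotonic clock and record the result.
    <T> T time(Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            record(System.nanoTime() - start);
        }
    }

    synchronized long count() { return count; }
    synchronized double averageMillis() { return count == 0 ? 0.0 : totalNanos / 1e6 / count; }
    synchronized double minMillis() { return count == 0 ? 0.0 : minNanos / 1e6; }
    synchronized double maxMillis() { return maxNanos / 1e6; }

    public static void main(String[] args) {
        LatencyStats stats = new LatencyStats();
        for (int i = 0; i < 1000; i++) {
            // Stand-in for the per-tuple work a bolt would do.
            stats.time(() -> Math.sqrt(12345.0));
        }
        System.out.printf("n=%d avg=%.4f ms min=%.4f ms max=%.4f ms%n",
                stats.count(), stats.averageMillis(), stats.minMillis(), stats.maxMillis());
    }
}
```

System.nanoTime() is used because it is monotonic within one JVM; it cannot be compared across machines, which is why this sketch measures per-bolt time rather than end-to-end time.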

Related

How can I find out which stream tasks are slowest?

I have a number of Kafka Streams applications, some with a fairly complex topology. The performance of the application is much slower than I would like. I suspect this is because of an unoptimized Stream Task.
I need to identify which tasks are the slowest, so I can work on improving the performance.
An example of what would be helpful is a list of task names, e.g. KSTREAM-FILTER-0000000007 or --> my-application.join-information-with-prototypes-subscription-response-sink and next to each a minimum, maximum, and average processing time in milliseconds. I can then pick out the slowest tasks, and try to improve the performance.
I'm using the Kafka Streams Java DSL, in a Kotlin application.
I've found that Kafka Streams exposes metrics using JMX. I've tried using JConsole to investigate, but the available metrics were not at all clear. There doesn't seem to be a practical way to use the information, like querying or sorting.

High latency between spout -> bolt and bolt -> bolts

In my topology I see around 1 - 2 ms latency when transferring tuples from spouts to bolts or from bolts to bolts. I am calculating latency using nanosecond timestamps because the whole topology runs inside a single worker.
The topology runs in a cluster on production-capable hardware.
To my understanding, tuples need not be serialized/de-serialized in this case, as everything is inside a single JVM. I have set the parallelism hint for most spouts and bolts to 5, and the spouts only produce events at a rate of 100 per second. I don't think the high latency is due to queuing of events, because I don't see any increase in latency over time. There is no memory increase either. Log levels are set to ERROR. CPU usage is in the range of 200 to 300%.
What could be causing this latency? I was expecting only a few microseconds for tuple transfer.
I'm going to assume you're using one of the released Storm versions, and not 2.0.0-SNAPSHOT, since the queueing implementation has changed in that version.
I think it's likely that the delay is because Storm batches up tuples before delivering them to the consumer. Take a look at https://github.com/apache/storm/blob/v1.2.1/storm-core/src/jvm/org/apache/storm/utils/DisruptorQueue.java#L247, and also look at the Flusher class in that file. When a spout/bolt publishes a tuple, it is put into the _currentBatch list. It stays there until either enough tuples have been received so the batch is "big enough" (you can look at the _inputBatchSize variable to figure out when this is), or until the Flusher is triggered (happens by default once per millisecond).
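To make the batching behaviour concrete, here is a simplified model of a size-or-timer batch. This is not Storm's DisruptorQueue code; BatchingBuffer and its methods are hypothetical names, and the timer thread is replaced by an explicit flush() call so that the example stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of size-or-timer batching (not Storm's implementation).
class BatchingBuffer<T> {
    private final int batchSize;
    private final List<T> currentBatch = new ArrayList<>();
    private final List<T> delivered = new ArrayList<>();

    BatchingBuffer(int batchSize) {
        this.batchSize = batchSize;
    }

    // Producer side: the item is only buffered; it reaches the consumer
    // when the batch becomes full.
    void publish(T item) {
        currentBatch.add(item);
        if (currentBatch.size() >= batchSize) {
            flush();
        }
    }

    // Timer side: in a real system a background flusher would call this periodically.
    void flush() {
        delivered.addAll(currentBatch);
        currentBatch.clear();
    }

    List<T> delivered() { return delivered; }
    int buffered() { return currentBatch.size(); }

    public static void main(String[] args) {
        BatchingBuffer<String> queue = new BatchingBuffer<>(100);
        queue.publish("tuple-1");
        System.out.println("visible before flush: " + queue.delivered().size());
        queue.flush(); // simulated flusher tick
        System.out.println("visible after flush: " + queue.delivered().size());
    }
}
```

In this model, a low-rate producer (such as 100 tuples per second against a batch size larger than one) never fills a batch between flusher ticks, so each tuple waits for the next tick. With a flush interval of about one millisecond, that would be consistent with a delay on the order of a millisecond per hop.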

Storm ackers are limiting the performance

I was running a topology with many bolts. From the Storm UI, I can see that the Execute latency and Process latency of all bolts are very small (<1ms). However, the Complete latency of my spouts rose to 30s.
I think this huge discrepancy is caused by the ackers, because the ackers executed 101,522,080 times but emitted only 2,673,260 times, which means, if I'm correct, that around 100,000,000 tuples are in flight in the topology, waiting for an ack signal.
I tried setting the number of ackers to 0 to disable acking entirely, but then the entire system ran out of control. I also tried doubling the number of ackers, but the situation did not get better.
Are the ackers the real problem limiting the performance? And how can I optimize this?
First, setting the number of ackers to zero means your spout emits every tuple as soon as it is available, so your topology runs into performance problems, failed tuples, and messages piling up on the consumer/spout side. This is because all of your tuples are acked (marked as executed) instantly, before they have been executed in all bolts, so TOPOLOGY_MAX_SPOUT_PENDING can't do its job.
In my opinion, first try to figure out the best TOPOLOGY_MAX_SPOUT_PENDING value for your topology. Then tune the acker count. You can try double the number of workers and watch the performance in the Storm UI.
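One way to pick a starting value for TOPOLOGY_MAX_SPOUT_PENDING is Little's Law (tuples in flight = throughput x time in the system). The sketch below is only that: a back-of-the-envelope estimate under the assumption that the limit applies per spout task. SpoutPendingEstimate and pendingFor are hypothetical names, not Storm APIs, and the numbers in main are made up for illustration.

```java
// Hypothetical helper: estimates a starting max-spout-pending value via Little's Law.
class SpoutPendingEstimate {

    // tuplesPerSecond:      throughput you want the whole topology to sustain
    // targetLatencySeconds: complete latency you are willing to accept
    // spoutTasks:           number of spout tasks sharing the load (assumed per-task limit)
    static long pendingFor(double tuplesPerSecond, double targetLatencySeconds, int spoutTasks) {
        if (tuplesPerSecond <= 0 || targetLatencySeconds <= 0 || spoutTasks < 1) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        double inFlight = tuplesPerSecond * targetLatencySeconds; // Little's Law
        return Math.max(1L, (long) Math.ceil(inFlight / spoutTasks));
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 10,000 tuples/s, 0.5 s target latency, 5 spout tasks.
        System.out.println(pendingFor(10_000, 0.5, 5));
    }
}
```

The result is only a first guess; the actual value still has to be tuned by watching the Complete latency and failed-tuple counts in the Storm UI.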

Impact of more executors than cpu/cores in Storm Cluster

I have started using Apache Storm recently. Right now I am focusing on performance testing and tuning for one of my applications (it pulls data from a NoSQL database, formats it, and publishes to a JMS queue for consumption by the requester) to enable more parallel request processing at a time. I have been able to tune the topology by changing the number of bolts, MAX_SPOUT_PENDING, etc., and to throttle data flow within the topology using a ticking approach.
I wanted to know what happens when we define more parallelism than the number of cores we have. In my case I have a single-node, single-worker topology and the machine has 32 cores, but the total number of executors (for all the spouts and bolts) is 60. So my questions are:
Does this high number really help with processing requests, or does it actually degrade performance, since I believe there will be more context switching between bolt tasks competing for cores?
If I define 20 (just a random selection) executors for a bolt and my code flow never needs to use that bolt, will this impact performance? How does Storm handle this situation?
This is a very general question, so the answer is (as always): it depends.
If your load is large and a single executor fully utilizes a core, having more executors cannot give you any throughput improvement. If there is any impact, it might be negative (also with regard to contention on the internally used queues that all executors need to read from and write to for tuple transfer).
If your load is "small" and does not fully utilize your CPUs, it won't matter either; you would not gain or lose anything, because your cores are not fully utilized and you have some headroom left over anyway.
Furthermore, consider that Storm spawns some more threads within each worker. Thus, if your executors fully utilize your hardware, those threads will also be impacted.
Overall, you should not run your topologies so that they utilize your cores completely anyway, but leave some headroom for small "spikes" etc. In operation, 80% CPU utilization might be a good value. As a rule of thumb, one executor per core should be ok.
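The rule of thumb above can be turned into a small calculation. This is a sketch that assumes every executor is CPU-bound and can saturate one core; ExecutorBudget and suggestedExecutors are hypothetical names, not Storm APIs.

```java
// Hypothetical helper: derives an executor budget from core count and target utilization.
class ExecutorBudget {

    // One executor per core, scaled down to leave headroom for spikes
    // and for the extra threads Storm runs inside each worker.
    static int suggestedExecutors(int cores, double targetUtilization) {
        if (cores < 1 || targetUtilization <= 0.0 || targetUtilization > 1.0) {
            throw new IllegalArgumentException("cores >= 1 and 0 < targetUtilization <= 1 required");
        }
        return Math.max(1, (int) Math.floor(cores * targetUtilization));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores=" + cores
                + " suggested executors=" + suggestedExecutors(cores, 0.8));
    }
}
```

For the 32-core machine from the question and an 80% target, this works out to 25 CPU-bound executors, well below the 60 currently configured.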

How many tuples in one batch is reasonable in Storm?

I am new to Storm. When I try Trident with the tutorial example, there are usually very few tuples in one batch (usually no more than 10).
Trident aims to provide high throughput, say millions of messages per second.
So I want to ask: how many tuples in one batch is reasonable in the real world?
There is no straightforward answer to that question. It all depends on your workload and what kind of topology you are running. Once you have the desired topology, you can look at the overall throughput metrics and keep bumping up the batch size until you start seeing performance issues, then debug those. If the issues are just due to the way your processing is structured and you cannot improve it any further, you can settle for a batch size smaller than that.
From:
https://groups.google.com/forum/?fromgroups=#!topic/storm-user/IfMR-kHvkBg
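The "keep bumping up the batch size" procedure described in the answer can be sketched as a simple search loop. Everything here is an assumption for illustration: BatchSizeSearch is a hypothetical name, the doubling step and the 5% minimum gain are arbitrary choices, and the throughput function in main is synthetic. In practice measureThroughput would run the topology with the given batch size and read the throughput from your metrics.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical helper: doubles the batch size while measured throughput keeps improving.
class BatchSizeSearch {

    // minGain is the relative improvement required to keep going (0.05 = 5%).
    static int findBatchSize(int start, int max, double minGain,
                             IntToDoubleFunction measureThroughput) {
        if (start < 1 || max < start) {
            throw new IllegalArgumentException("require 1 <= start <= max");
        }
        int best = start;
        double bestThroughput = measureThroughput.applyAsDouble(start);
        // long loop variable avoids int overflow when doubling near Integer.MAX_VALUE.
        for (long size = 2L * start; size <= max; size *= 2) {
            double throughput = measureThroughput.applyAsDouble((int) size);
            if (throughput < bestThroughput * (1.0 + minGain)) {
                break; // no meaningful improvement: stop at the previous size
            }
            best = (int) size;
            bestThroughput = throughput;
        }
        return best;
    }

    public static void main(String[] args) {
        // Synthetic throughput curve that saturates at 50,000 tuples/s.
        IntToDoubleFunction synthetic = size -> Math.min(size * 100.0, 50_000.0);
        System.out.println(findBatchSize(10, 100_000, 0.05, synthetic));
    }
}
```

The loop stops at the last size that still gave a meaningful gain, which mirrors the advice to settle for a batch size just below the point where problems appear.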
