Long delays between processing of two consecutive kafka batches (using ruby/karafka consumer) - ruby

I am using Karafka to read from a topic and call an external service. Each call to the external service takes roughly 300 ms. With 3 consumers (3 pods in k8s) running in the consumer group, I expect to achieve 10 events per second. I see these log lines, which also confirm the 300 ms expectation for processing each individual event.
However, the overall throughput doesn't add up. Each Karafka process seems stuck for a long time between processing two batches of events.
The following instrumentation around the consume method implies that the consumer code itself is not where the time goes.
https://github.com/karafka/karafka/blob/master/lib/karafka/backends/inline.rb#L12
INFO Inline processing of topic production.events with 8 messages took 2571 ms
INFO 8 messages on production.events topic delegated to xyz
However, I notice two things:
When I tail logs on the 3 pods, only one of the 3 pods seems to emit logs at a time. This does not make sense to me, as all partitions have enough events and each consumer should be able to consume in parallel.
Although the above message shows roughly 321 ms (2571/8) per event, in reality I see the logs stalled for a long duration between the processing of two batches. I am curious: where is that time going?
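The expectation above can be written out as a small calculation. This is only a sketch of the arithmetic in the question; the method names are made up for illustration, and it assumes each consumer handles one event at a time.

```ruby
# Back-of-the-envelope numbers taken from the question.
CALL_SECONDS = 0.3 # roughly 300 ms per external call
CONSUMERS    = 3   # one pod per partition

# Events per second if every consumer processes events back to back.
def expected_throughput(consumers, seconds_per_event)
  consumers / seconds_per_event
end

# Average time per event inside one logged batch.
def per_event_ms(batch_ms, batch_size)
  batch_ms.fdiv(batch_size)
end

puts expected_throughput(CONSUMERS, CALL_SECONDS).round(1)
puts per_event_ms(2571, 8).round
```

Any shortfall against the first number has to come from time spent outside the consume method, which is what the rest of the question is about.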
======
Edit:
There is some skew in the distribution of data across brokers - as we recently expanded our brokers from 3 to total of 6. However, none of the brokers is under cpu or disk pressure. This is a new cluster, and hardly 4-5% cpu is used at peak times.
Our data is evenly distributed across 3 partitions. I say this because the last offset is roughly the same for each partition.
Partition | FirstOffset | LastOffset | Size | LeaderNode | ReplicaNodes | In-syncReplicaNodes | OfflineReplicaNodes | PreferredLeader | Under-replicated
0 | 2174152 | 3567554 | 1393402 | 5 | 5,4,3 | 3,4,5 | | Yes | No
1 | 2172222 | 3566886 | 1394664 | 4 | 4,5,6 | 4,5,6 | | Yes | No
2 | 2172110 | 3564992 | 1392882 | 1 | 1,6,4 | 1,4,6 | | Yes | No
However, I do see that one consumer perpetually lags behind the other two.
The following table shows the lag for my consumers. There is one consumer process for each partition:
Partition | First Offset | Last Offset | Consumer Offset | Lag
0 | 2174152 | 3566320 | 2676120 | 890200
1 | 2172222 | 3565605 | 3124649 | 440956
2 | 2172110 | 3563762 | 3185587 | 378175
Combined lag: 1709331
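For reference, the lag figures are simply the last offset minus the consumer offset. The sketch below recomputes them and adds a rough drain-time estimate; the estimate assumes about 300 ms per event and no new messages arriving, and `hours_to_drain` is a hypothetical helper, not part of Karafka.

```ruby
# Offsets copied from the lag table above.
PARTITIONS = {
  0 => { last: 3_566_320, consumer: 2_676_120 },
  1 => { last: 3_565_605, consumer: 3_124_649 },
  2 => { last: 3_563_762, consumer: 3_185_587 }
}.freeze

# Lag per partition: how far the consumer is behind the newest offset.
def lag_by_partition(partitions)
  partitions.transform_values { |o| o[:last] - o[:consumer] }
end

# Hours needed to work off a lag at a fixed processing rate.
def hours_to_drain(lag, events_per_second)
  lag / events_per_second / 3600.0
end

lags = lag_by_partition(PARTITIONS)
puts lags.inspect
puts lags.values.sum
puts hours_to_drain(lags[0], 1 / 0.3).round(1)
```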
Here is a screenshot of the logs from all 3 consumers. Note the big difference between the time spent in each invocation of the consume function and the interval between two adjacent invocations. Basically, I want to explain and/or reduce that waiting time. There are 100k+ events in this topic and my dummy Karafka applications are able to retrieve them quickly, so the Kafka brokers are not the issue.
Update after setting max_wait_time to 1 second (previously 5 seconds)
It seems that the issue is resolved after reducing the wait config. Now the difference between two consecutive log lines is roughly equal to the time spent in consume:
2021-06-24 13:43:23.425 Inline processing of topic x with 7 messages took 2047 ms
2021-06-24 13:43:27.787 Inline processing of topic x with 11 messages took 3347 ms
2021-06-24 13:43:31.144 Inline processing of topic x with 11 messages took 3344 ms
2021-06-24 13:43:34.207 Inline processing of topic x with 10 messages took 3049 ms
2021-06-24 13:43:37.606 Inline processing of topic x with 11 messages took 3388 ms
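One rough way to quantify this from the log lines themselves is sketched below. It assumes the log line is written when a batch finishes and that all lines fall on the same day; the helper names are hypothetical. The idle time before a batch is then the gap to the previous log line minus the batch's own processing time.

```ruby
# Matches the time of day, message count, and duration in one log line.
LOG_LINE = /(\d{2}):(\d{2}):(\d{2})\.(\d{3}) Inline processing of topic \S+ with (\d+) messages took (\d+) ms/

LOG_LINES = [
  '2021-06-24 13:43:23.425 Inline processing of topic x with 7 messages took 2047 ms',
  '2021-06-24 13:43:27.787 Inline processing of topic x with 11 messages took 3347 ms',
  '2021-06-24 13:43:31.144 Inline processing of topic x with 11 messages took 3344 ms',
  '2021-06-24 13:43:34.207 Inline processing of topic x with 10 messages took 3049 ms',
  '2021-06-24 13:43:37.606 Inline processing of topic x with 11 messages took 3388 ms'
].freeze

# Returns the log time (ms since midnight) and processing time, or nil.
def parse_line(line)
  m = LOG_LINE.match(line)
  return nil unless m

  h, min, s, ms, count, took = m.captures.map(&:to_i)
  { logged_at_ms: ((h * 60 + min) * 60 + s) * 1000 + ms, count: count, took_ms: took }
end

# Idle time before each batch except the first, in milliseconds.
def idle_gaps_ms(lines)
  entries = lines.map { |l| parse_line(l) }.compact
  entries.each_cons(2).map do |prev, cur|
    (cur[:logged_at_ms] - prev[:logged_at_ms]) - cur[:took_ms]
  end
end

puts idle_gaps_ms(LOG_LINES).inspect
```

Running the same calculation on logs captured with the old 5 second setting would show how much of the gap was waiting rather than processing.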

There are a couple of problems you may be facing. It is a bit of guesswork on my side without more details, but let's give it a shot.
From the Kafka perspective
Are you sure you're evenly distributing data across partitions? Maybe it is eating up things from one partition?
What you wrote here:
INFO Inline processing of topic production.events with 8 messages took 2571 ms
This indicates that there was a batch of 8 processed altogether by a single consumer. This could indicate that the data is not distributed evenly.
From the performance perspective
There are two performance properties that can affect your understanding of how Karafka operates: throughput and latency.
Throughput is the number of messages that can be processed in a given time.
Latency is the time it takes a message to get from the moment it was produced to the moment it has been processed.
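To make the distinction concrete, here is a minimal sketch with made-up timestamps: four messages produced at the same moment and processed one after another at 300 ms each. The method names and sample values are illustrative only.

```ruby
# Each sample is [produced_at_ms, processed_at_ms] (hypothetical values).
SAMPLES = [[0, 300], [0, 600], [0, 900], [0, 1200]].freeze

# Messages per second over the whole observed window.
def throughput(samples)
  window_ms = samples.map(&:last).max - samples.map(&:first).min
  samples.size * 1000.0 / window_ms
end

# Average time from production to completed processing, in ms.
def mean_latency(samples)
  samples.sum { |produced, processed| processed - produced } / samples.size.to_f
end

puts throughput(SAMPLES).round(2)
puts mean_latency(SAMPLES)
```

Throughput stays at the per-consumer rate, while latency grows for messages that sit further back in the batch.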
As far as I understand, all messages are being produced. You could try playing with the Karafka settings, in particular this one: https://github.com/karafka/karafka/blob/83a9a5ba417317495556c3ebb4b53f1308c80fe0/lib/karafka/setup/config.rb#L114
From the logger perspective
The logger that is being used flushes data from time to time, so you won't see it immediately but after a short delay. You can validate this by looking at the timestamps in the log.

Related

how to optimally use nifi wait processor

I am currently creating a flow in which I will be merging the results of 10K HTTP responses. I have a couple of questions (please refer to the image below; I am numbering my questions as per the image).
1) As the queue is becoming too long, is it OK to set "concurrent tasks" to 10 for InvokeHTTP? What should drive this? The # of cores on the server?
2) Wait is showing quite a big number. Is this just the # of bytes it is writing, or is it using that much memory? If this is just a write, then I might be OK, but if it is some internal queue, then I may soon run out of memory.
Does it make sense to reduce this number by increasing "Run Schedule" from 0 to, say, 20 sec?
3) What exactly is "Back Pressure Data Size Threshold"? The value is set at 1 GB. Does it mean that if the size of the flow files in the queue is more than that, NiFi will start dropping them? Or will it somehow stop the upstream processor from processing?
1) Yes, increasing concurrent tasks on InvokeHTTP would probably make sense. I wouldn't jump right to 10, but would test increasing from 1 to 2, 2 to 3, etc. until it seems to be working better. Concurrent tasks is the number of threads that can concurrently execute the processor. The total number of threads for your NiFi instance is defined in the controller settings, from the top right menu, under Timer Driven threads. You should set the timer driven threads based on the # of CPUs/cores you have.
2) The stats on the processor are totals for the last 5 mins, so "In" is the total size of all the flow files that have come in to the processor in the last 5 mins. You can see "Out" is almost the same # which means almost all the flow files in have also been transferred out.
3) Back-pressure stops the upstream processor from executing until the back pressure threshold is reduced. The data size threshold is saying "when the total size of all flow files in the queue exceeds 1GB, then stop executing the upstream processor so that no more data enters the queue while the downstream processor works on the queue". In the case of a self-loop connection, I think back-pressure won't stop the processor from executing otherwise it will end up in a dead-lock where it can't produce more data but also can't work off the queue. In any case, data is never dropped unless you set flow file expiration on the queue.
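The behaviour described in 3) can be illustrated with a toy model. This is not NiFi code; the `Connection` class and its methods are invented for the example, and it only models the data size threshold: the upstream side is held back once the queued size reaches the threshold, and nothing already queued is dropped.

```ruby
# Toy model of a connection with a data-size back-pressure threshold.
class Connection
  attr_reader :queued_bytes, :flow_files

  def initialize(threshold_bytes)
    @threshold_bytes = threshold_bytes
    @flow_files = []
    @queued_bytes = 0
  end

  # The upstream processor may run only while the queue is below the threshold.
  def upstream_may_run?
    @queued_bytes < @threshold_bytes
  end

  # Returns false (and queues nothing) while back pressure is applied.
  def enqueue(size_bytes)
    return false unless upstream_may_run?

    @flow_files << size_bytes
    @queued_bytes += size_bytes
    true
  end

  # The downstream processor works one flow file off the queue.
  def dequeue
    size = @flow_files.shift
    @queued_bytes -= size if size
    size
  end
end

conn = Connection.new(1000)
3.times { conn.enqueue(400) } # third write pushes the queue past the threshold
puts conn.upstream_may_run?   # upstream is now held back
conn.dequeue
puts conn.upstream_may_run?   # downstream made room, upstream may run again
```

In this model the threshold is checked before the upstream side runs, so the queue can end up slightly above it; that is a simplification chosen for the sketch.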

Azure Table Increased Latency

I'm trying to create an app which can efficiently write data into Azure Table. In order to test storage performance, I created a simple console app, which sends hardcoded entities in a loop. Each entry is 0.1 kByte. Data is sent in batches (100 items in each batch, 10 kBytes each batch). For every batch, I prepare entries with the same partition key, which is generated by incrementing a global counter - so I never send more than one request to the same partition. Also, I control a degree of parallelism by increasing/decreasing the number of threads. Each thread sends batches synchronously (no request overlapping).
If I use 1 thread, I see 5 requests per second (5 batches, 500 entities). At that time Azure portal metrics shows table latency below 100ms - which is quite good.
If I increase the number of threads to 12, I see a 12x increase in outgoing requests. This rate stays stable for a few minutes. But then, for some reason, I start being throttled: I see latency increase and the number of requests drop.
Below you can see account metrics. The highlighted point shows 2.31K transactions (batches) per minute, which is 3850 entries per second. If the number of threads is increased to 50, latency increases up to 4 seconds and the transaction rate drops to 700 requests per second.
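The conversion between batches per minute and entries per second is sketched below. It reads the highlighted point as roughly 2,310 batches per minute, which is what 3850 entries per second implies at 100 entities per batch; the method name is made up for illustration.

```ruby
ENTITIES_PER_BATCH = 100 # batch size stated in the question

# Converts a batch rate per minute into an entity rate per second.
def entities_per_second(batches_per_minute, entities_per_batch = ENTITIES_PER_BATCH)
  batches_per_minute * entities_per_batch / 60.0
end

puts entities_per_second(2310)   # highlighted point on the metrics graph
puts entities_per_second(5 * 60) # single thread at 5 batches per second
```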
According to the documentation, I should be able to send up to 20K transactions per second within one account (my test account is used only for my performance test). 20K batches mean 200K entries. So the question is: why am I being throttled after 3K entries?
Test details:
Azure Datacenter: West US 2.
My location: Los Angeles.
App is written in C# and uses the CosmosDB.Table NuGet package with the following configuration: ServicePointManager.DefaultConnectionLimit = 250, Nagle's algorithm is disabled.
Host machine is quite powerful with 1Gb internet link (i7, 8 cores, no high CPU, no high memory is observed during the test).
PS: I've read docs
The system's ability to handle a sudden burst of traffic to a partition is limited by the scalability of a single partition server until the load balancing operation kicks-in and rebalances the partition key range.
and waited for 30 mins, but the situation didn't change.
EDIT
I got a comment that E2E Latency doesn't reflect server problem.
So below is a new graph which shows not only E2E latency but also the server's one. As you can see they are almost identical and that makes me think that the source of the problem is not on the client side.

High latency between spout -> bolt and bolt -> bolts

In my topology I see around 1 - 2 ms latency when transferring tuples from spouts to bolts or from bolts to bolts. I am calculating latency using nanosecond timestamps because the whole topology runs inside a single worker.
The topology runs in a cluster on production-capable hardware.
To my understanding, tuples need not be serialized/de-serialized in this case, as everything is inside a single JVM. I have set the parallelism hint for most spouts and bolts to 5, and spouts only produce events at a rate of 100 per second. I don't think the high latency is due to queuing of events, because I don't see any increase in latency over time. No memory increase either. Log levels are set to ERROR. CPU usage is in the range of 200 to 300%.
What could be causing this latency? I was expecting only a few microseconds for tuple transfer.
I'm going to assume you're using one of the released Storm versions, and not 2.0.0-SNAPSHOT, since the queueing implementation has changed in that version.
I think it's likely that the delay is because Storm batches up tuples before delivering them to the consumer. Take a look at https://github.com/apache/storm/blob/v1.2.1/storm-core/src/jvm/org/apache/storm/utils/DisruptorQueue.java#L247, and also look at the Flusher class in that file. When a spout/bolt publishes a tuple, it is put into the _currentBatch list. It stays there until either enough tuples have been received so the batch is "big enough" (you can look at the _inputBatchSize variable to figure out when this is), or until the Flusher is triggered (happens by default once per millisecond).
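The mechanism can be sketched with a toy queue. This is not Storm's DisruptorQueue; the class below is invented for illustration and the flusher is called by hand rather than by a timer. It shows why a tuple published at a low rate waits until either the batch fills up or the next flush.

```ruby
# Toy model: tuples wait in a batch until it is full or a flush happens.
class BatchingQueue
  attr_reader :delivered

  def initialize(batch_size)
    @batch_size = batch_size
    @current_batch = []
    @delivered = []
  end

  # Adds a tuple; delivers the batch only once it is "big enough".
  def publish(tuple)
    @current_batch << tuple
    flush if @current_batch.size >= @batch_size
  end

  # Stands in for the periodic flusher; delivers whatever is pending.
  def flush
    return if @current_batch.empty?

    @delivered << @current_batch
    @current_batch = []
  end
end

queue = BatchingQueue.new(3)
queue.publish(:a)
queue.publish(:b)
puts queue.delivered.inspect # nothing delivered yet, batch not full
queue.flush
puts queue.delivered.inspect # the flush hands over the partial batch
```

At 100 events per second a batch rarely fills between flushes, so in this model the flush interval dominates the observed delay.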

Recovery techniques for Spark Streaming scheduling delay

We have a Spark Streaming application that has basically zero scheduling delay for hours, but then it suddenly jumps up to multiple minutes and spirals out of control. This happens after a while even if we double the batch interval.
We are not sure what causes the delay to happen (theories include garbage collection). The cluster has generally low CPU utilization regardless of whether we use 3, 5 or 10 slaves.
We are really reluctant to further increase the batch interval, since the delay is zero for such long periods. Are there any techniques to improve recovery time from a sudden spike in scheduling delay? We've tried seeing if it will recover on its own, but it takes hours if it even recovers at all.
Open the batch links and identify which stages are delayed. Is there any external access to other DBs/applications that could be contributing to this delay?
Go into each job and look at the data/records processed by each executor. You can find problems here.
There may be skewness in data partitions as well. If the application is reading data from kafka and processing it, then there can be skewness in data across cores if the partitioning is not well defined. Tune the parameters: # of kafka partitions, # of RDD partitions, # of executors, # of executor cores.
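A simple way to put a number on skew is to compare the busiest partition with the average. The sketch below uses made-up record counts and a hypothetical `skew_ratio` helper; a ratio near 1.0 means the load is even, while a larger ratio means one partition dominates.

```ruby
# Ratio of the largest partition to the mean partition size.
def skew_ratio(records_per_partition)
  mean = records_per_partition.sum.fdiv(records_per_partition.size)
  records_per_partition.max / mean
end

puts skew_ratio([1000, 1000, 1000, 1000]) # evenly spread (hypothetical counts)
puts skew_ratio([3400, 200, 200, 200])    # one hot partition (hypothetical counts)
```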

Storm topology processing slowing down gradually

I have been reading about Apache Storm and tried a few examples from storm-starter. I also learnt how to tune a topology and how to scale it to perform fast enough to meet the required throughput.
I have created an example topology with acking enabled, and I am able to achieve 3K-5K messages processed per second. It performs really fast for the initial 10 to 15 min, or around 1 to 2 million messages, and then it starts slowing down. On the Storm UI, I can see the overall latency going up gradually and not coming back down; after a while the processing drops to only a few hundred messages a second. I am getting exactly the same behavior for all the topologies I tried. The simplest one just reads from Kafka using KafkaSpout, sends the message to a transform bolt that parses it, and sends it to Kafka again using KafkaBolt. The parser is very fast: it takes less than a millisecond to parse a message. I tried a few options such as increasing/decreasing the parallelism, changing the buffer sizes, etc., but saw the same behavior. Please help me find the reason for the gradual slowdown of the topology. Here is the config I am using:
1 Nimbus machine (4 CPU) 24GB RAM
2 Supervisor machines (8CPU) and using 1 thread per core with 24GB RAM
4 Node kafka cluster running on above 2 supervisor machines (each topic has 4 partitions)
KafkaSpout(2 parallelism)-->TransformerBolt(8)-->KafkaBolt(2)
topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.spout.max.batch.size: 65536
topology.transfer.buffer.size: 32
topology.receiver.buffer.size: 8
topology.max.spout.pending: 250
At the start
After few minutes
After 45 min - latency started going up
After 80 min - Latency will keep going up and will go till 100 sec by the time it reaches 8 to 10mil messages
Visual VM screenshot
Threads
Pay attention to the capacity metric on RT_LEFT_BOLT, it is very close to 1; which explains why your topology is slowing down.
From the Storm documentation:
The Storm UI has also been made significantly more useful. There are new stats "#executed", "execute latency", and "capacity" tracked for all bolts. The "capacity" metric is very useful and tells you what % of the time in the last 10 minutes the bolt spent executing tuples. If this value is close to 1, then the bolt is "at capacity" and is a bottleneck in your topology. The solution to at-capacity bolts is to increase the parallelism of that bolt.
Therefore, your solution is to add more executors (and tasks) to that bolt (RT_LEFT_BOLT). Another thing you can do is reduce the number of executors on RT_RIGHT_BOLT: its capacity indicates you don't need that many executors, and 1 or 2 can probably do the job.
The issue was due to the GC settings with the newgen params. The JVM was not using the allocated heap completely, so the internal Storm queues were filling up and running out of memory. The strange thing was that Storm did not throw an out-of-memory error; it just stalled. With the help of VisualVM I was able to trace it down.
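As a rough sketch of what the capacity number means, the code below computes the fraction of a 10 minute window a bolt spent executing, following the definition quoted above. The formula is my reading of that definition, the input numbers are made up, and `executors_needed` is a hypothetical sizing heuristic rather than anything Storm provides.

```ruby
WINDOW_MS = 10 * 60 * 1000 # the 10 minute window from the quoted definition

# Fraction of the window spent executing tuples.
def capacity(executed, execute_latency_ms, window_ms = WINDOW_MS)
  executed * execute_latency_ms / window_ms.to_f
end

# Executors needed to bring the per-executor capacity down to a target.
def executors_needed(current_capacity, current_executors, target = 0.5)
  (current_capacity * current_executors / target).ceil
end

cap = capacity(576_000, 1.0) # hypothetical: 576k tuples at 1 ms each
puts cap.round(2)
puts executors_needed(cap, 2)
```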

Resources