Number of Kafka consumer clients decreases after rebalancing - Spring

I've noticed that after a period of time (for example, two days) the consumer group concurrency becomes lower than the value I configured.
I use Spring Boot, and here is my code sample:
factory.setConcurrency(10);
When I run the following Kafka command right after starting the consumer, it shows exactly 10 distinct consumer clients:
bin/kafka-consumer-groups.sh --describe --group samplaConsumer --bootstrap-server localhost:9092
After a period of time, running the same command shows fewer consumer clients, for example 6 distinct clients managing those 10 partitions.
How can I fix this so that the number of clients stays constant after rebalancing (or whatever else is happening)?

I found out that if a consumer client takes longer than max.poll.interval.ms to process the polled records, the consumer is considered failed and the group rebalances.
max.poll.interval.ms The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before the expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
I also found that when this happens repeatedly, the consumer client is considered dead and does not rejoin the group, so the number of concurrent consumer clients decreases.
One solution I came across is to decrease max.poll.records so that processing each polled batch takes less time than max.poll.interval.ms:
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50); // default is 500
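For reference, a minimal sketch of how these settings could fit together in a Spring Kafka configuration; the bean layout, the max.poll.interval.ms value, and the deserializers are illustrative assumptions rather than code from the original post:

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "samplaConsumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Fewer records per poll, so each batch is processed well within max.poll.interval.ms.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);
        // Optionally give slow processing more headroom (illustrative value).
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        factory.setConcurrency(10); // one consumer thread per partition of the 10-partition topic
        return factory;
    }
}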


How to configure Spring SimpleMessageListenerContainer receiveTimeout in order to scale up to a reasonable number of consumers

Use case
A backend consuming messages at a varying rate and inserting them into a DB.
Today in production my SimpleMessageListenerContainer scales to maxConcurrentConsumers even when that is not necessary to handle the traffic rate.
Problem
I am trying to find the proper configuration of Spring's SimpleMessageListenerContainer so that Spring scales the number of consumers up/down to just what is needed to handle the incoming traffic.
With a fixed injection rate on a single-node RabbitMQ, I have noticed that the scaling process stabilizes at
numberOfConsumers = (injectionRate * receiveTimeoutInMilliseconds) / 1000
For example:
injection rate : 100 msg/s
container.setReceiveTimeout(100L); // 100 ms
--> consumers 11
--> Consumer capacity 100%
injection rate : 100 msg/s
container.setReceiveTimeout(1000L); // 1 s - default
--> consumers 101
--> Consumer capacity 100%
Knowing that more consumers mean more threads and more AMQP channels, I am wondering why the scaling algorithm is not linked to the consumer capacity metric, and why the default receive timeout is set to 1 second.
See the documentation https://docs.spring.io/spring-amqp/docs/current/reference/html/#listener-concurrency
In addition, a new property called maxConcurrentConsumers has been added and the container dynamically adjusts the concurrency based on workload. This works in conjunction with four additional properties: consecutiveActiveTrigger, startConsumerMinInterval, consecutiveIdleTrigger, and stopConsumerMinInterval. With the default settings, the algorithm to increase consumers works as follows:
If the maxConcurrentConsumers has not been reached and an existing consumer is active for ten consecutive cycles AND at least 10 seconds has elapsed since the last consumer was started, a new consumer is started. A consumer is considered active if it received at least one message in batchSize * receiveTimeout milliseconds.
With the default settings, the algorithm to decrease consumers works as follows:
If there are more than concurrentConsumers running and a consumer detects ten consecutive timeouts (idle) AND the last consumer was stopped at least 60 seconds ago, a consumer is stopped. The timeout depends on the receiveTimeout and the batchSize properties. A consumer is considered idle if it receives no messages in batchSize * receiveTimeout milliseconds. So, with the default timeout (one second) and a batchSize of four, stopping a consumer is considered after 40 seconds of idle time (four timeouts correspond to one idle detection).
Practically, consumers can be stopped only if the whole container is idle for some time. This is because the broker shares its work across all the active consumers.
So, when you reduce the receiveTimeout you would need a corresponding increase in the idle/active triggers.
The default is 1 second as a reasonable compromise between not spinning an idle consumer too quickly and staying responsive to a container stop() operation (idle consumers are blocked for the duration of the timeout). Increasing it makes the container less responsive to stop().
It is generally unnecessary to set it lower than 1 second.
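To make those knobs concrete, here is a hedged sketch of a SimpleMessageListenerContainer configured with a shorter receiveTimeout and correspondingly higher idle/active triggers; the queue name and every numeric value are illustrative assumptions, not a recommended configuration:

import org.springframework.amqp.core.MessageListener;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

public class ListenerContainerConfig {

    public static SimpleMessageListenerContainer container(ConnectionFactory connectionFactory) {
        SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(connectionFactory);
        container.setQueueNames("backend.queue");       // illustrative queue name
        container.setConcurrentConsumers(2);            // lower bound of the consumer pool
        container.setMaxConcurrentConsumers(20);        // upper bound of the consumer pool
        container.setReceiveTimeout(100L);              // shorter poll timeout (ms)...
        container.setConsecutiveActiveTrigger(100);     // ...so raise the active trigger (default 10)
        container.setConsecutiveIdleTrigger(100);       // ...and the idle trigger (default 10)
        container.setStartConsumerMinInterval(10_000L); // default: 10 s between consumer starts
        container.setStopConsumerMinInterval(60_000L);  // default: 60 s between consumer stops
        MessageListener listener = message -> {
            // insert the message into the DB (processing omitted)
        };
        container.setMessageListener(listener);
        return container;
    }
}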

My topology's processing rate is about 2500 messages per second, but the complete latency is about 7 ms. Shouldn't it equal 1000 / 2500 = 0.4 ms?

My topology reads from RabbitMQ, and its processing rate is about 2500 messages per second, but the complete latency is about 7 ms. Shouldn't it equal 1000 / 2500 = 0.4 ms?
Topology summary: (screenshot not reproduced here)
Please help me understand what the complete latency parameter means in my case.
The topology processes messages from a RabbitMQ queue at a rate of about 2500/sec.
RabbitMQ screenshot: (screenshot not reproduced here)
According to the Storm docs, the complete latency is just for spouts: it is the average amount of time it took for ack or fail to be called for a tuple after it was emitted.
So it is the time between your RabbitMQ spout emitting a tuple and the last bolt acking it.
Storm has an internal queue for backpressure; its maximum size is defined by the topology.max.spout.pending setting. If you set it to a high value, your RabbitMQ consumer reads messages from the broker to fill this queue ahead of the actual processing by the bolts, which skews the measured latency of your topology.
In the RabbitMQ panel you see how fast messages are consumed from the queue, not how fast they are processed, so you are comparing two different things.
To measure latency I would recommend running your topology for a couple of days; the 202 seconds shown in your screenshot is too short a window.
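As an illustration of the topology.max.spout.pending point above, here is a minimal sketch of capping it at submission time; the topology name, the value 250, and the org.apache.storm package (Storm 1.x and later) are assumptions, and the spout/bolt wiring is elided:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitWithSpoutCap {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... wire up the RabbitMQ spout and the bolts here ...

        Config conf = new Config();
        // Cap un-acked tuples in flight per spout task so the complete latency
        // reflects processing time rather than time spent queued ahead of the bolts.
        conf.setMaxSpoutPending(250); // illustrative value

        StormSubmitter.submitTopology("rabbit-topology", conf, builder.createTopology());
    }
}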

Why does SwiftMQ show flow control behaviour even when flow control is disabled?

I'm trying to benchmark the performance of SwiftMQ 5.0.0 with a producer and a consumer application I wrote, so that I can vary the number of producer threads and consumer threads. I have added a delay in the consumer to simulate the time taken to process a message. I ran a test with the number of producer threads fixed at 2, varying the number of consumer threads from 20 to 92 in steps of 4.
Initially, the producer rate starts high and the consumer rate is low (as expected, given the added delay and the small number of consumer threads).
As the number of consumer threads increases, the producer rate drops and the consumer rate increases; they become equal at around 48 consumer threads.
After that, as the number of consumer threads increases further, both producer and consumer rates keep increasing linearly. I am wondering what the reason for this behavior is.
See this image for the result graph.
Notes:
I have disabled flow control at the queue level by setting flowcontrol-start-queuesize="-1".
I also have not set a value for inbound-flow-control-enabled in the routing swiftlet (I believe it defaults to false).
Any help on this matter is much appreciated. TIA
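For context, the consumer-side delay described in the question could look roughly like the following plain-JMS sketch; the queue name, the 20 ms sleep, and the class layout are illustrative assumptions, not the original benchmark code:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;

public class DelayedConsumer {

    // The ConnectionFactory is assumed to be looked up from SwiftMQ's JNDI elsewhere.
    public static void start(ConnectionFactory connectionFactory) throws Exception {
        Connection connection = connectionFactory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("benchmark.queue"); // illustrative queue name
        MessageConsumer consumer = session.createConsumer(queue);
        consumer.setMessageListener(message -> {
            try {
                Thread.sleep(20); // simulated per-message processing time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        connection.start(); // each consumer thread runs its own session/consumer like this
    }
}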

How to determine the number of consumers in a consumer group in Spring Boot?

I'm using the @KafkaListener annotation to listen to a specific topic. However, I suddenly noticed a big lag between the producers sending messages and the consumers receiving them. I then increased the number of partitions of the topic and the issue was solved.
After some research I realized that the number of consumers in a consumer group should not exceed the number of partitions; otherwise some of the consumers will be inactive.
So in Spring Boot, is each individual @KafkaListener considered a single consumer? If not, how can I find the exact number of consumers in a consumer group so that I can configure the partitions properly?
Is each individual @KafkaListener considered a single consumer?
No, it is a consumer group, which can have one (the default) or more consumer threads (containers). You can use the concurrency property to override the container factory's default.
As you figured out, the number of the topic's partitions determines the maximum level of parallelism. If the concurrency is greater than the number of partitions, the concurrency is adjusted down so that each container gets one partition.
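A minimal sketch of that relationship, assuming spring-kafka 2.2 or later (which added the concurrency attribute on the annotation); the topic, group id, and concurrency value are illustrative:

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    // One @KafkaListener = one consumer group membership backed by `concurrency` consumer threads.
    // With a 10-partition topic, concurrency = "10" gives each thread one partition; a higher
    // value would be adjusted down, as described in the answer above.
    @KafkaListener(topics = "orders", groupId = "order-group", concurrency = "10")
    public void listen(String message) {
        // process the message
    }
}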

Storm latency caused by ack

I was using kafka-storm to connect Kafka and Storm. I have 3 servers running ZooKeeper, Kafka, and Storm. There is a topic 'test' in Kafka that has 9 partitions.
In the Storm topology, the number of KafkaSpout executors is 9 and, by default, the number of tasks should be 9 as well. The 'extract' bolt is the only bolt connected to the KafkaSpout (the 'log' spout).
From the UI, there is a huge failure rate in the spout. However, the number of executed messages in the bolt = the number of emitted messages - the number of failed messages in the bolt. This equation almost holds while the failed count is still empty at the beginning.
Based on my understanding, this means the bolt did receive the messages from the spout, but the ack signals are stuck in flight. That's why the number of acks in the spout is so small.
This problem might be solved by increasing the timeout seconds and the spout pending message count, but that means more memory usage, and I cannot increase them indefinitely.
I was wondering if there is a way to force Storm to ignore the acks for some spout/bolt, so that it does not wait for that signal until timing out. This should increase the throughput significantly, at the cost of losing the message-processing guarantee.
If you set the number of ackers to 0, then Storm will automatically ack every tuple:
config.setNumAckers(0);
Please note that the UI only samples and shows 5% of the data flow, unless you set
config.setStatsSampleRate(1.0d);
Try increasing the message timeout and reducing the value of topology.max.spout.pending.
Also, make sure the spout's nextTuple() method is non-blocking and optimized.
I would also recommend profiling the code; maybe Storm's internal queues are filling up and you need to increase their sizes:
config.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
config.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
config.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
Your capacity numbers are a bit high, leading me to believe that you're really maximizing the use of system resources (CPU, memory). In other words, the system seems to be bogged down a bit and that's probably why tuples are timing out. You might try using the topology.max.spout.pending config property to limit the number of inflight tuples from the spout. If you can reduce the number just enough, the topology should be able to efficiently handle the load without tuples timing out.
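Pulling these suggestions together, a hedged sketch of the relevant Config calls; the numeric values are illustrative and not tuned for the topology in the question:

import org.apache.storm.Config;

public class TuningConfig {

    public static Config build() {
        Config conf = new Config();
        conf.setStatsSampleRate(1.0d); // report every tuple in the UI instead of the default 5% sample
        conf.setNumAckers(0);          // option A: drop ack tracking entirely (no replay guarantee)
        // Option B: keep acking, but give tuples more time and cap how many are in flight:
        // conf.setMessageTimeoutSecs(60);
        // conf.setMaxSpoutPending(1000);
        return conf;
    }
}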

Resources