Spring Boot Kafka: Commit cannot be completed since the group has already rebalanced

Today, in my Spring Boot application with a single-instance Kafka, I faced the following issue:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
be completed since the group has already rebalanced and assigned the
partitions to another member. This means that the time between
subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this either
by increasing the session timeout or by reducing the maximum size of
batches returned in poll() with max.poll.records.
What could be the reason for this, and how do I fix it? As far as I understand, my consumer was blocked for too long and didn't respond to the heartbeat in time, and I should adjust Kafka properties to address this. Could you please tell me exactly which properties to adjust, and where: on the Kafka side, or on my application's Spring Kafka side?

By default, Kafka returns a batch of records of at least fetch.min.bytes (default 1), capped by max.poll.records (default 500) and fetch.max.bytes (default 52428800); if not enough data is available, the broker waits up to fetch.max.wait.ms (default 500) before returning what it has. Your consumer is expected to do some work on that data and then call poll() again, and that work is expected to complete within max.poll.interval.ms (default 300000, i.e. 5 minutes). If poll() is not called before this timeout expires, the consumer is considered failed and the group rebalances in order to reassign the partitions to another member.
So to fix your issue, reduce the number of messages returned per poll (max.poll.records), or increase max.poll.interval.ms, to avoid timing out and triggering a rebalance.
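For example, with Spring Kafka both knobs can be set on the consumer factory. A minimal sketch, assuming Spring Boot with spring-kafka on the classpath; the broker address, group id, and the concrete values are illustrative, not taken from the question:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class ConsumerTuningConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // illustrative group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Fetch fewer records per poll so each batch finishes quickly...
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);
        // ...and/or allow more time between polls before the group rebalances.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000); // 10 minutes, illustrative
        return new DefaultKafkaConsumerFactory<>(props);
    }
}

Reducing max.poll.records keeps failure detection fast, while raising max.poll.interval.ms delays it for genuinely dead consumers; pick whichever matches how long your per-batch processing can legitimately take.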

Related

Spring Boot thread pool executor RestTemplate behavior when queueCapacity is 0 is decreasing performance for a REST API application

I am stuck with a strange problem and not able to find its root cause. This is my RestTemplate thread pool executor configuration:
connectionRequestTimeout: 60000
connectTimeout: 60000
socketTimeout: 60000
responseTimeout: 60000
connectionpoolmax: 900
defaultMaxPerRoute: 20
corePoolSize: 10
maxPoolSize: 300
queueCapacity: 0
keepAliveSeconds: 1
allowCoreThreadTimeOut: true
1) I know that since queueCapacity is 0, the thread pool executor will create a SynchronousQueue. The first issue is that if I give it a positive integer value such as 50, application performance decreases. As per my understanding, we should only be using a SynchronousQueue in rare cases, not in a Spring Boot REST API application like mine.
2) Second, I want to understand how a SynchronousQueue works in a Spring Boot REST API application deployed on a server (Tomcat). I know a SynchronousQueue has zero capacity, so a producer blocks until a consumer is available or a thread is created. But who are the consumer and producer in this case, given that all requests are served by a web or application server? How will the SynchronousQueue actually work here?
I am checking the performance by running a JMeter script on my machine. The script sustains more load with queueCapacity 0 than with a value > 0.
I really appreciate any insight.
1) Don't set the queueCapacity explicitly; otherwise it is bound to degrade performance, since we would be limiting the number of incoming requests that can reside in the queue waiting to be taken up once one of the threads in the fixed pool becomes available. A sketch follows the quote below.
ThreadPoolTaskExecutor has a default configuration of the core pool
size of 1, with unlimited max pool size and unlimited queue capacity.
https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/scheduling/concurrent/ThreadPoolTaskExecutor.html
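For reference, a sketch of an executor bean using the question's values but leaving queueCapacity at its default; the bean name is illustrative, and the caveat in the comments is standard java.util.concurrent.ThreadPoolExecutor behavior:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class ExecutorConfig {

    @Bean
    public ThreadPoolTaskExecutor restTemplateExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);  // values mirror the question's settings
        executor.setMaxPoolSize(300);
        // queueCapacity deliberately left at its default (Integer.MAX_VALUE,
        // i.e. an unbounded LinkedBlockingQueue). Note the standard
        // ThreadPoolExecutor caveat: with an unbounded queue the pool never
        // grows past corePoolSize, because tasks are queued before extra
        // threads are created; queueCapacity = 0 switches to a
        // SynchronousQueue, which hands each task directly to a thread.
        executor.setKeepAliveSeconds(1);
        executor.setAllowCoreThreadTimeOut(true);
        executor.initialize();
        return executor;
    }
}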
2) In a SynchronousQueue, pairs of insert and remove operations always occur simultaneously, so the queue never actually contains anything. It passes data synchronously to another thread: the producer waits for the other party to take the data instead of just putting it and returning. A short runnable demonstration appears at the end of this answer.
Read more:
https://javarevisited.blogspot.com/2014/06/synchronousqueue-example-in-java.html
https://www.baeldung.com/thread-pool-java-and-guava
I hope this answer helps in some way.
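To make the hand-off behavior concrete, here is a minimal, self-contained demonstration (my own sketch, not from the linked articles):

import java.util.concurrent.SynchronousQueue;

// Minimal demonstration of SynchronousQueue's hand-off semantics:
// put() blocks until another thread is ready to take().
public class SynchronousQueueDemo {

    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<String> queue = new SynchronousQueue<>();

        Thread consumer = new Thread(() -> {
            try {
                Thread.sleep(1000);                     // simulate a slow consumer
                System.out.println("took: " + queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        System.out.println("producer: offering...");
        queue.put("task");                              // blocks ~1s until take() runs
        System.out.println("producer: handed off");     // prints only after the take
        consumer.join();
    }
}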

What is the right way to handle poll interval for spring kafka for a delay queue implementation?

I am implementing a sort of delay queue with Kafka. Every message received by the Spring Kafka listener container for a certain topic (say t1) is meant to be delayed for a certain time (say d minutes) and then sent to another topic (say t2).
Currently I am doing this in the Spring Kafka listener container method (AcknowledgingConsumerAwareMessageListener):
receive the msg from t1
pause the listener container
sleep for d minutes if required
resume the listener container
send the msg to t2
I understand the heartbeat thread is a separate thread and will not be impacted by the above steps, but polling happens on the same thread as processing, and only after the record is processed, as per this answer. I have set my #KafkaListener properties to "max.poll.interval.ms=2xd" so that it doesn't time out, yet I get a NonResponsiveConsumerEvent (with timeSinceLastPoll) from the KafkaEventListener. Even when I don't set max.poll.interval.ms in the #KafkaListener properties, I still get the same NonResponsiveConsumerEvent. In both cases, the message is processed only once and sent to t2.
Questions
If the poll doesn't happen within max.poll.interval.ms while the listener container is paused, what is the consequence? What about when the container is not paused? (I have configured the consumer for manual ack.)
Should I spawn a separate thread for sleeping and resuming the container and thus free the container processing thread to poll? Does it matter?
versions: Spring Boot 2.1.8, Spring Kafka 2.2.8
sleep for d minutes if required
You can't "sleep" the consumer thread for more than max.poll.interval.ms.
The whole point of pausing the container is so that it continues to poll (but will never get any new records until resumed).
If you actually sleep in the listener there is no point in pausing; you just need to increase the max.poll.interval.ms appropriately.
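A rough sketch of that second approach: pause the container, hand the delay to a scheduler thread, and resume from there. The listener id, topic names, and delay are illustrative, a TaskScheduler bean is assumed to exist, and manual-ack/redelivery handling is omitted for brevity:

import java.time.Duration;
import java.time.Instant;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.KafkaListenerEndpointRegistry;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.TaskScheduler;
import org.springframework.stereotype.Component;

@Component
public class DelayingListener {

    private final KafkaListenerEndpointRegistry registry;
    private final TaskScheduler scheduler;  // e.g. a ThreadPoolTaskScheduler bean
    private final KafkaTemplate<String, String> template;

    public DelayingListener(KafkaListenerEndpointRegistry registry,
                            TaskScheduler scheduler,
                            KafkaTemplate<String, String> template) {
        this.registry = registry;
        this.scheduler = scheduler;
        this.template = template;
    }

    @KafkaListener(id = "t1Listener", topics = "t1") // id and topic are illustrative
    public void listen(String message) {
        // Pause the container: it keeps polling (so no rebalance occurs),
        // but delivers no further records until resumed.
        registry.getListenerContainer("t1Listener").pause();
        // Resume and forward from a scheduler thread instead of sleeping here,
        // which frees the consumer thread to keep calling poll().
        scheduler.schedule(() -> {
            template.send("t2", message);
            registry.getListenerContainer("t1Listener").resume();
        }, Instant.now().plus(Duration.ofMinutes(5))); // d = 5 minutes, illustrative
    }
}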

IIB Collector Node and transactions

I am using a Collector Node in my message flow. It is configured to collect 50 messages or wait for 30 seconds. Under load testing, WebSphere MQ sometimes reports that a long-running transaction has been detected, and the PID corresponds to the PID of the application's execution group. The question is: is it possible that the Collector Node does not commit its internal transaction while waiting for the messages or for the timeout to expire?
The MQInput node is where the transactionality is specified. This is described in the IIB v10 KC page Developing integration solutions > Developing message flows > Message flow behavior > Changing message flow behavior > Configuring transactionality for message flows > Configuring MQ nodes for transactions
If you set the property to Yes (the default option): if a transaction is not already inflight, the node starts a transaction.
The Collector Node does not commit until it times out or reaches the count. See the IIB v10 KC page Reference > Message flow development > Built-in nodes > Collector node
All input messages that are received under sync point from a transaction or thread by the Collector node are stored in internal queues. Storing the input messages under sync point ensures that the messages remain in a consistent state for the outgoing thread to process; such messages are available only at the end of the transaction or thread that propagates the input messages.
A new transaction is created when a message collection is complete, and is propagated to the next node.
Whenever you configure a node (one that is eligible per the IBM documentation) to work under a transaction, it does not commit until the unit of work completes. In your case, since 50 messages (if they arrive within 30 seconds) are requested in one unit of work, the message flow containing the Collector Node, and all other nodes in that flow, commit only once all 50 messages have been successfully processed. During this period the queue manager has to maintain the in-flight state in its logs, which, as described below, may need to be enlarged. Any large unit of work causes this issue, irrespective of the node used.
Since your issue deals with MQ long running transaction, ensure you have enough MQ log space for transaction handling by the queue manager.
To increase the MQ log space, edit the file below and increase the primary and secondary file counts:
==> IBM\WebSphere MQ\qmgrs\QMNAME\qm.ini
Below are the settings to increase; the defaults are 3 and 2. Ensure you have enough disk space for whatever numbers you increase them to, and restart the queue manager once the qm.ini file has been updated.
Log:
LogPrimaryFiles=3
LogSecondaryFiles=2
Link to the MQ log configuration documentation:
https://www.ibm.com/support/knowledgecenter/en/SSFKSJ_9.0.0/com.ibm.mq.con.doc/q018710_.htm
Hope this helps.

Kafka consumer not committing offset correctly

I had a Kafka consumer defined with the following properties :
session.timeout.ms = 60000
heartbeat.interval.ms = 6000
We noticed a lag of ~2000 messages and saw (via our app logs) that the same message was being consumed multiple times by the consumer. We also noticed that some messages took ~10 seconds to be completely processed. Our suspicion was that the consumer was not committing the offset properly (or was repeatedly committing the same old offset), which is why the same message was being picked up again.
To fix this, we introduced a few more properties :
auto.commit.interval.ms=20000 // to ensure the commit happens only after message processing is completed
max.poll.records=10 // to make the consumer pick only 10 messages in one go
And, we set the concurrency to 1.
This fixed our issue. The lag started to reduce and ultimately came to 0.
But, I am still unclear why the problem occurred in the first place.
As I understand, by default:
enable.auto.commit = true
auto.commit.interval.ms=5000
So, ideally the consumer should have been committing every 5 seconds. If the message was not completely processed within this timeframe, what happens? Which offset is committed by the consumer? Did the problem occur due to the large default poll record size (500)?
Also, about the poll() method, I read that:
The poll() call is issued in the background at the set auto.commit.interval.ms.
So, if poll() was originally taking place every 5 seconds (the default auto.commit.interval.ms), why was it not committing the latest offset? Because the consumer was still not done processing it? Then it should have committed that offset at the next 5-second interval.
Can someone please answer these queries and explain why the original problem occurred?
If you are using Spring for Apache Kafka, we recommend setting enable.auto.commit to false so that the container will commit the offsets in a more deterministic fashion (either after each record, or each batch of records - the default).
Most likely, the problem was max.poll.interval.ms which is 5 minutes by default. If your batch of messages take longer than this you would have seen that behavior. You can either increase max.poll.interval.ms or, as you have done, reduce max.poll.records.
The key is that you MUST process the records returned by the poll in less than max.poll.interval.ms.
Also, about the poll() method, I read that:
The poll() call is issued in the background at the set auto.commit.interval.ms.
That is incorrect; poll() is NOT called in the background; heartbeats are sent in the background since KIP-62.
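A minimal sketch of the container-managed commit recommended above, assuming a recent Spring Kafka version; the broker address and group id are illustrative, and RECORD ack mode is chosen here (BATCH is the default):

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties.AckMode;

@Configuration
public class OffsetCommitConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // illustrative
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Let the container, not the kafka-clients timer, decide when to commit.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(new DefaultKafkaConsumerFactory<>(props));
        // Commit after each record is processed (BATCH is the default).
        factory.getContainerProperties().setAckMode(AckMode.RECORD);
        return factory;
    }
}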

spring DMLC poller frequency change

Is it possible to change the poller frequency OOTB in DefaultMessageListenerContainer? If so, is it a dynamic configuration?
See the receiveTimeout property - the thread blocks for up to this time until a message arrives; yes it can be changed after the container starts - but it won't take effect until the thread is released by the client library.
It defaults to 1 second; if there's no message, the container immediately loops around and receives again.
Setting it to a too-high value will make the container less responsive to stop() invocations.
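A minimal sketch of setting it, assuming a javax.jms ConnectionFactory bean; the destination name and listener body are illustrative:

import javax.jms.ConnectionFactory;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.listener.DefaultMessageListenerContainer;

@Configuration
public class JmsContainerConfig {

    @Bean
    public DefaultMessageListenerContainer listenerContainer(ConnectionFactory connectionFactory) {
        DefaultMessageListenerContainer container = new DefaultMessageListenerContainer();
        container.setConnectionFactory(connectionFactory);
        container.setDestinationName("my.queue");  // illustrative destination
        container.setMessageListener((javax.jms.MessageListener) message -> {
            // process the message (illustrative no-op)
        });
        // How long each receive() blocks waiting for a message before the
        // container loops around; also bounds how long stop() may wait.
        container.setReceiveTimeout(1000); // default is 1 second
        return container;
    }
}

Note that setReceiveTimeout can also be called while the container is running; the new value takes effect once the current blocking receive returns.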
