Spring camel kafka - Re-balancing and removing consumer - spring-boot

We have seen it where a consumer is removed from the consumer group, but I cant understand why.
As you can see from the errors below it suggests a timeout on Poll()
The TPS is less than 1, so very low, and each request takes around 200ms to ingest and push to DB.
This happened on 2 occasions in the within days of each other.
Result was that the service no longer read the message from the partition and a restart was required (Not good when you don't have alerting on offset buildup)
Any help/pointers would be greatly appreciated
Spring boot 2.5.13
Camel 3.16.0
2 Java applications (One in each DC)
1 Topic with 2 partitions
ERROR org.apache.camel.processor.errorhandler.DeadLetterChannel - log - Failed delivery for (MessageId: 4AA2CA19996CA12-000000000000424E on ExchangeId: 4AA2CA19996CA12-000000000000424E). On delivery attempt: 0 caught: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
WARN org.apache.camel.component.kafka.KafkaFetchRecords - handlePollErrorHandler - Deferring processing to the exception handler based on polling exception strategy
ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - handle - [Consumer clientId=consumer-pdr-writer-service-2, groupId=pdr-writer-service] Offset commit failed on partition MY-TOPIC-0 at offset 166742: The coordinator is not aware of this member.
auto.commit.interval.ms = 5000
auto.offset.reset = latest
connections.max.idle.ms = 540000
session.timeout.ms = 10000
max.poll.interval.ms = 300000
max.poll.records = 500
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
group.id = a438f569-5701-4a83-885c-9111dfcbc743
group.instance.id = null
heartbeat.interval.ms = 3000
enable.auto.commit = true
A log we only saw once, at the same time we had these issues.
Requesting the consumer to retry polling the same message based on polling exception strategy
Exception org.apache.kafka.common.errors.TimeoutException caught while polling TOPIC-NAME-Thread 0 from kafka topic TOPIC-NAME at offset {TOPIC-NAME/1=166743}: Timeout of 5000ms expired before successfully committing offsets {TOPIC-NAME-1=OffsetAndMetadata{offset=166744, leaderEpoch=null, metadata=''}}
ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - handle - [Consumer clientId=consumer-pdr-writer-service-2, groupId=pdr-writer-service] Offset commit failed on partition TOPIC-NAME-1 at offset 166744: The coordinator is not aware of this member.

Related

messages duplicated during rebalancing after service recovery from Kafka SSLHandshakeException

Current setup - Our Springboot application consumes messages from Kafka topic,We are processing one message at a time (we are not using streams).Below are the config properties and version being used.
ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG- 30000
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG-earliest
ContainerProperties.AckMode-RECORD
Spring boot version-2.5.7
Spring-kafka version- 2.7.8
Kafks-clients version-2.8.1
number of partitions- 6
consumer group- 1
consumers- 2
Issue - When springboot application stays idle for longer time(idle time varying from 4 hrs to 3 days).We are seeing org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Exception error message - org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: java.security.cert.CertificateException: No subject alternative DNS name matching kafka-2.broker.emh-dev.service.dev found.
2022-04-07 06:58:42.437 ERROR 24180 --- [ntainer#0-0-C-1] o.s.k.l.KafkaMessageListenerContainer : Authentication/Authorization Exception, retrying in 10000 ms
After service recover we are seeing message duplication with same partition and offsets which is inconsistent.
Below are the exception:
Consumer clientId=XXXXXX, groupId=XXXXXX] Offset commit failed on partition XXXXXX at offset 354: The coordinator is not aware of this member
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records

Spring cloud stream - Kafka consumer consuming duplicate messages with StreamListener

With our spring boot app, we notice kafka consumer consuming message twice randomly once in a while only in prod env. We have 6 instances with 6 partitions deployed in PCF.We have caught messages with same offset and partition received twice in same topic which causes duplicates and is a business critical for us.
We haven't noticed this in non production env and it is hard to reproduce in non prod env. We have recently switched to Kafka and we are not able to find out the root issue.
We are using spring-cloud-stream/spring-cloud-stream-binder-kafka- 2.1.2
Here is the Config:
spring:
cloud:
stream:
default.consumer.concurrency: 1
default-binder: kafka
bindings:
channel:
destination: topic
content_type: application/json
autoCreateTopics: false
group: group
consumer:
maxAttempts: 1
kafka:
binder:
autoCreateTopics: false
autoAddPartitions: false
brokers: brokers list
bindings:
channel:
consumer:
autoCommitOnError: true
autoCommitOffset: true
configuration:
max.poll.interval.ms: 1000000
max.poll.records: 1
group.id: group
We use #Streamlisteners to consume the messages.
Here is the instance we received duplicate and the error message captured in server logs.
ERROR 46 --- [container-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator
: [Consumer clientId=consumer-3, groupId=group] Offset commit failed
on partition topic-0 at offset 1291358: The coordinator is not aware
of this member. ERROR 46 --- [container-0-C-1]
o.s.kafka.listener.LoggingErrorHandler : Error while processing:
null OUT org.apache.kafka.clients.consumer.CommitFailedException:
Commit cannot be completed since the group has already rebalanced and
assigned the partitions to another member. This means that the time
between subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this
either by increasing the session timeout or by reducing the maximum
size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:871)
~[kafka-clients-2.0.1.jar!/:na]
There is no crash and all the instances are healthy at the time of duplicate. Also there is confusion with error log - Error while processing: null since Message was successfully processed twice. And max.poll.interval.ms: 100000 which is about 16 minutes and it is supposed to be enough time to process any message for the system and session timeout and heartbit config is default. Duplicate is received within 2 seconds in most of the instances.
Any configs that we are missing ? Any suggestion/help is highly appreciated.
Commit cannot be completed since the group has already rebalanced
A rebalance occurred because your listener took too long; you should adjust max.poll.records and max.poll.interval.ms to make sure you can always handle the records received within the timelimit.
In any case, Kafka does not guarantee exactly once delivery, only at least once delivery. You need to add idempotency to your application and detect/ignore duplicates.
Also, keep in mind StreamListener and the annotation-based programming model has been deprecated for 3+ years and has been removed from the current main, which means the next release will not have it. Please migrate your solution to a functional based programming model

KAFKA SINK CONNECT: WARN Bulk request 167 failed. Retrying request

I have a data process with Input Topic,Kafka Stream and Output Topic connected to a sink connect for Elasticsearch.
At the beginning of this operation, the data ingestion is done satisfactorily, but when the process has been running for a longer time, Elasticsearch ingestion from connector starts to fail.
I have been checking all the Workers logs and I get the following message which I suspect may be the reason:
[2021-10-21 11:22:14,246] WARN Bulk request 168 failed. Retrying request. (io.confluent.connect.elasticsearch.ElasticsearchClient:335)
java.net.SocketTimeoutException: 3,000 milliseconds timeout on connection http-outgoing-643 [ACTIVE]
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
at java.base/java.lang.Thread.run(Thread.java:829)
[2021-10-21 11:27:23,858] INFO [Consumer clientId=connector-consumer-ElasticsearchSinkConnector-topic01-0, groupId=connect-ElasticsearchSinkConnector-topic01] Member connector-consumer-ElasticsearchSinkConnector-topic01-0-41b68d34-0f00-4887-b54e-79561fffb5e5 sending LeaveGroup request to coordinator kafka1:9092 (id: 2147483646 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:1042)
I have tried to change the connector configuration, but I don't understand the main reason for this problem to fix it.
Connector Configuration:
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
connection.password=xxxxx
topics=output_topic
value.converter.schemas.enable=false
connection.username=user-x
name=ElasticsearchSinkConnector-output_topic
connection.url=xxxxxxx
value.converter=org.apache.kafka.connect.json.JsonConverter
key.ignore=true
key.converter=org.apache.kafka.connect.storage.StringConverter
schema.ignore=true
Is it possible that the Bulk Warn causes a loss of data?
Any help would be appreciated
you can try adding
"flush.timeout.ms": 30000

seekToCurrentErrorHandler fails in case there are multiple failed records from different partitions if FixedBackOff is set as FixedBackOff(0L, 1)

With spring-kafka-2.5.4.RELEASE version, when there are multiple failed records from different partitions, seekToCurrentErrorHandler fails if FixedBackOff is set with maxAttempts as 1 and interval other than -1L.
SeekToCurrentErrorHandler seekToCurrentErrorHandler = new SeekToCurrentErrorHandler(,new FixedBackOff(0L, 1));
Although setting a value for interval other than -1L doesn't make sense when the maxAttemps count is 1 (as there will be no retry and hence no retry interval), shouldn't it either fail at startup complaining same or should be handled appropriately?.
It fails at run time when there are multiple failed records from different partitions with below error.
ERROR org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer - Error handler threw an exception
org.springframework.kafka.KafkaException: Seek to current after exception; nested exception is org.springframework.kafka.listener.ListenerExecutionFailedException: <some IO Exception here, not one of them defined in FailedRecordProcessor.configureDefaultClassifier()>
at org.springframework.kafka.listener.SeekUtils.seekOrRecover(SeekUtils.java:157)
This seems to be with the below line.
Line 96 of FailedRecordTracker(i.e. if (nextBackOff != BackOffExecution.STOP) { )
https://github.com/spring-projects/spring-kafka/blob/v2.5.4.RELEASE/spring-kafka/src/main/java/org/springframework/kafka/listener/FailedRecordTracker.java#L96
which subsequently is resulting in entry to line 157 of SeekUtils(i.e. throw new KafkaException("Seek to current after exception", level, thrownException);)
https://github.com/spring-projects/spring-kafka/blob/v2.5.4.RELEASE/spring-kafka/src/main/java/org/springframework/kafka/listener/SeekUtils.java#L157
Perhaps you are migrating from an older version.
maxAttempts in FixedBackOff means max retry attempts so should be 0 for no retries.
See https://docs.spring.io/spring-kafka/docs/2.5.10.RELEASE/reference/html/#seek-to-current
Starting with version 2.3, a BackOff can be provided to the SeekToCurrentErrorHandler and DefaultAfterRollbackProcessor so that the consumer thread can sleep for some configurable time between delivery attempts. Spring Framework provides two out of the box BackOff s, FixedBackOff and ExponentialBackOff. The maximum back off time must not exceed the max.poll.interval.ms consumer property, to avoid a rebalance.
IMPORTANT: Previously, the configuration was "maxFailures" (which included the first delivery attempt). When using a FixedBackOff, its maxAttempts property represents the number of delivery retries (one less than the old maxFailures property). Also, maxFailures=-1 meant retry indefinitely with the old configuration, with a BackOff you would set the maxAttempts to Long.MAX_VALUE for a FixedBackOff and leave the maxElapsedTime to its default in an ExponentialBackOff.

How Kafka retries works with request.timeout.?

I have configured my Producer with request.timeout.ms = 70,0000ms and retries=5. I have doubt how this actually works,
After this "request.timeout.ms=70,000" expires it retries for 5 times or within given "request.timeout.ms=70,000" it retries for 5 time with retry.backoff.ms value.?
There are 3 important configs to be aware of:
"request.timeout.ms" - time to retry a single request
"delivery.timeout.ms" - time to complete the entire send operation
"retries" - how many times to retry when the broker responds with retriable errors.
The Apache Kafka recommendation is to set "delivery.timeout.ms" and leave the other two configurations with their default value. The idea is that the main thing you as a user should worry about is how long you want to way for Kafka to figure things out before giving up on it. It doesn't really matter what is taking Kafka so long - the connection, getting metadata, long queues, etc, the only thing that matters is how long you are willing to wait.
Now to your question - request.timeout.ms applies on each retry. So Producer will send the recordbatch to Kafka, and if there's no response after 70,000ms it will consider this a failure and retry. Note that most errors (say, NoLeaderForPartition) will return from the broker much faster (which is why retry backoffs are needed).
Reasoning about delivery times with retries + request.timeout.ms turned out to be near impossible even for those who wrote the producer. Hence, the introduction of delivery.time.ms with a very clear contract.

Resources