seekToCurrentErrorHandler fails in case there are multiple failed records from different partitions if FixedBackOff is set as FixedBackOff(0L, 1) - spring-boot

With spring-kafka 2.5.4.RELEASE, when records from several different partitions fail, SeekToCurrentErrorHandler fails if FixedBackOff is configured with maxAttempts set to 1 and an interval other than -1L.
SeekToCurrentErrorHandler seekToCurrentErrorHandler = new SeekToCurrentErrorHandler(,new FixedBackOff(0L, 1));
Setting an interval other than -1L makes little sense when maxAttempts is 1 (there will be no retry and hence no retry interval), but shouldn't the handler either fail at startup with a message saying so, or handle the configuration appropriately?
Instead, it fails at run time when there are multiple failed records from different partitions, with the error below.
ERROR org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer - Error handler threw an exception
org.springframework.kafka.KafkaException: Seek to current after exception; nested exception is org.springframework.kafka.listener.ListenerExecutionFailedException: <some IO Exception here, not one of them defined in FailedRecordProcessor.configureDefaultClassifier()>
at org.springframework.kafka.listener.SeekUtils.seekOrRecover(SeekUtils.java:157)
The cause seems to be the following line.
Line 96 of FailedRecordTracker, i.e. if (nextBackOff != BackOffExecution.STOP) {
https://github.com/spring-projects/spring-kafka/blob/v2.5.4.RELEASE/spring-kafka/src/main/java/org/springframework/kafka/listener/FailedRecordTracker.java#L96
This in turn leads to line 157 of SeekUtils, i.e. throw new KafkaException("Seek to current after exception", level, thrownException);
https://github.com/spring-projects/spring-kafka/blob/v2.5.4.RELEASE/spring-kafka/src/main/java/org/springframework/kafka/listener/SeekUtils.java#L157

Perhaps you are migrating from an older version.
maxAttempts in FixedBackOff means the maximum number of retry attempts, so it should be 0 for no retries.
See https://docs.spring.io/spring-kafka/docs/2.5.10.RELEASE/reference/html/#seek-to-current
Starting with version 2.3, a BackOff can be provided to the SeekToCurrentErrorHandler and DefaultAfterRollbackProcessor so that the consumer thread can sleep for some configurable time between delivery attempts. Spring Framework provides two out of the box BackOffs, FixedBackOff and ExponentialBackOff. The maximum back off time must not exceed the max.poll.interval.ms consumer property, to avoid a rebalance.
IMPORTANT: Previously, the configuration was "maxFailures" (which included the first delivery attempt). When using a FixedBackOff, its maxAttempts property represents the number of delivery retries (one less than the old maxFailures property). Also, maxFailures=-1 meant retry indefinitely with the old configuration, with a BackOff you would set the maxAttempts to Long.MAX_VALUE for a FixedBackOff and leave the maxElapsedTime to its default in an ExponentialBackOff.
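The mapping described in the note above can be sketched in plain Java. The class and method names here are hypothetical and not part of spring-kafka; rejecting values below 1 (other than -1) is also an assumption made for the sketch.

```java
final class BackOffMigration {

    /**
     * Translates the old maxFailures setting (which counted the first delivery)
     * into a FixedBackOff maxAttempts value (which counts retries only).
     */
    static long maxAttemptsFromMaxFailures(int maxFailures) {
        if (maxFailures == -1) {
            // Old "retry indefinitely" becomes Long.MAX_VALUE retries.
            return Long.MAX_VALUE;
        }
        if (maxFailures < 1) {
            // Assumption: other non-positive values are treated as invalid here.
            throw new IllegalArgumentException("maxFailures must be -1 or >= 1");
        }
        // One less than the old value, because the first delivery is not a retry.
        return maxFailures - 1L;
    }
}
```

Under this mapping, a single delivery with no retries corresponds to new FixedBackOff(0L, 0L), whereas FixedBackOff(0L, 1) allows one retry after the initial delivery.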

Related

Spring camel kafka - Re-balancing and removing consumer

We have seen a consumer removed from its consumer group, but I can't understand why.
As you can see from the errors below, they suggest a timeout on poll().
The TPS is less than 1, so very low, and each request takes around 200ms to ingest and push to the DB.
This happened on two occasions within days of each other.
The result was that the service no longer read messages from the partition and a restart was required (not good when you don't have alerting on offset buildup).
Any help/pointers would be greatly appreciated.
Spring boot 2.5.13
Camel 3.16.0
2 Java applications (One in each DC)
1 Topic with 2 partitions
ERROR org.apache.camel.processor.errorhandler.DeadLetterChannel - log - Failed delivery for (MessageId: 4AA2CA19996CA12-000000000000424E on ExchangeId: 4AA2CA19996CA12-000000000000424E). On delivery attempt: 0 caught: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
WARN org.apache.camel.component.kafka.KafkaFetchRecords - handlePollErrorHandler - Deferring processing to the exception handler based on polling exception strategy
ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - handle - [Consumer clientId=consumer-pdr-writer-service-2, groupId=pdr-writer-service] Offset commit failed on partition MY-TOPIC-0 at offset 166742: The coordinator is not aware of this member.
auto.commit.interval.ms = 5000
auto.offset.reset = latest
connections.max.idle.ms = 540000
session.timeout.ms = 10000
max.poll.interval.ms = 300000
max.poll.records = 500
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
group.id = a438f569-5701-4a83-885c-9111dfcbc743
group.instance.id = null
heartbeat.interval.ms = 3000
enable.auto.commit = true
A log we only saw once, at the same time we had these issues.
Requesting the consumer to retry polling the same message based on polling exception strategy
Exception org.apache.kafka.common.errors.TimeoutException caught while polling TOPIC-NAME-Thread 0 from kafka topic TOPIC-NAME at offset {TOPIC-NAME/1=166743}: Timeout of 5000ms expired before successfully committing offsets {TOPIC-NAME-1=OffsetAndMetadata{offset=166744, leaderEpoch=null, metadata=''}}
ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - handle - [Consumer clientId=consumer-pdr-writer-service-2, groupId=pdr-writer-service] Offset commit failed on partition TOPIC-NAME-1 at offset 166744: The coordinator is not aware of this member.
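As a rough sanity check of the figures quoted in this question, the worst-case time to process one full batch can be compared with max.poll.interval.ms. The class and method names are hypothetical, and the sketch assumes every record takes the stated 200ms.

```java
final class PollBudget {

    /** Worst-case time to process one full batch, in milliseconds. */
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    /** True if a full batch can be processed before max.poll.interval.ms expires. */
    static boolean fitsWithin(long maxPollIntervalMillis, int maxPollRecords, long perRecordMillis) {
        return worstCaseBatchMillis(maxPollRecords, perRecordMillis) < maxPollIntervalMillis;
    }
}
```

With max.poll.records = 500 and 200ms per record, a full batch takes 100000ms, which is below the configured 300000ms. Under these assumptions, steady-state processing alone would not exceed the poll interval.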

messages duplicated during rebalancing after service recovery from Kafka SSLHandshakeException

Current setup - Our Spring Boot application consumes messages from a Kafka topic. We process one message at a time (we are not using streams). Below are the config properties and versions being used.
ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG- 30000
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG-earliest
ContainerProperties.AckMode-RECORD
Spring boot version-2.5.7
Spring-kafka version- 2.7.8
Kafka-clients version- 2.8.1
number of partitions- 6
consumer group- 1
consumers- 2
Issue - When the Spring Boot application stays idle for a long time (idle time varying from 4 hours to 3 days), we see org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed.
Exception error message - org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: java.security.cert.CertificateException: No subject alternative DNS name matching kafka-2.broker.emh-dev.service.dev found.
2022-04-07 06:58:42.437 ERROR 24180 --- [ntainer#0-0-C-1] o.s.k.l.KafkaMessageListenerContainer : Authentication/Authorization Exception, retrying in 10000 ms
After the service recovers, we see duplicated messages with the same partition and offset; this does not happen consistently.
Below are the exceptions:
Consumer clientId=XXXXXX, groupId=XXXXXX] Offset commit failed on partition XXXXXX at offset 354: The coordinator is not aware of this member
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records

KafkaConsumer poll() behavior understanding

Trying to understand (new to Kafka) how the poll event loop in Kafka works.
Use case: 25 records on the topic, max poll size is set to 5.
max.poll.interval.ms = 5000 //5 seconds by default
max.poll.records = 5
Sequence of tasks
Poll the records from the topic.
Process the records in a for loop.
Some processing logic runs for each record and either passes or fails.
If the logic passes, the record (with its offset) is added to a map.
Then it is committed using a commitSync call.
If it fails, the loop breaks and whatever succeeded before this point is committed. The problem starts after this.
The next poll just keeps moving on in batches of 5 even after an error. Is this expected?
What we expect is that the loop breaks, the offsets up to the last successfully processed message get committed, and the next poll continues from the failed message.
Example: the first poll returns 5 messages; offsets 1 and 2 succeed and are committed, then offset 3 fails. The poll calls keep moving on to the next batches (5-10, 10-15). If there is an error in between, we expect polling to stop at that point: the next poll should start from offset 3 in the first case, or, if processing fails at offset 8 in the second batch, from offset 8, not from the start of the next max-poll batch. In case it matters: this is a Spring Boot project and enable.auto.commit is false.
I have tried finding this in the documentation, but with no success.
I tried tweaking max.poll.interval.ms, but that did not help.
EDIT: I have not accepted an answer because there is no direct solution for a custom consumer. Keeping this for informational purposes.
max.poll.interval.ms is milliseconds, not seconds so it should be 5000.
Once the records have been returned by the poll (and offsets not committed), they won't be returned again unless you restart the consumer or perform seek() operations on the consumer to reset the offset to the unprocessed ones.
The Spring for Apache Kafka project provides a SeekToCurrentErrorHandler to perform this task for you.
If you are using the consumer yourself (which it sounds like), you must do the seeks.
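The effect of seeking can be illustrated with a small in-memory simulation of the use case in the question (25 records, batches of 5, one failure). FakeConsumer is a hypothetical stand-in for a single partition, not the KafkaConsumer API, and the sketch assumes the failing record succeeds when it is seen a second time. Offset commits are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

final class SeekOnFailureDemo {

    /** Hypothetical stand-in for one partition; it only tracks a fetch position. */
    static final class FakeConsumer {
        private final int logEnd;
        private final int maxPollRecords;
        private int position = 0;

        FakeConsumer(int logEnd, int maxPollRecords) {
            this.logEnd = logEnd;
            this.maxPollRecords = maxPollRecords;
        }

        /** Returns the next batch of offsets and advances the position past them. */
        List<Integer> poll() {
            List<Integer> batch = new ArrayList<>();
            while (position < logEnd && batch.size() < maxPollRecords) {
                batch.add(position++);
            }
            return batch;
        }

        /** Moves the fetch position so the next poll starts at the given offset. */
        void seek(int offset) {
            position = offset;
        }
    }

    /** Returns the offsets that were processed successfully, in processing order. */
    static List<Integer> run(int total, int maxPoll, int failOnceAt, boolean seekOnFailure) {
        FakeConsumer consumer = new FakeConsumer(total, maxPoll);
        List<Integer> processed = new ArrayList<>();
        boolean alreadyFailed = false;
        while (true) {
            List<Integer> batch = consumer.poll();
            if (batch.isEmpty()) {
                break;
            }
            for (int offset : batch) {
                if (offset == failOnceAt && !alreadyFailed) {
                    alreadyFailed = true;
                    if (seekOnFailure) {
                        consumer.seek(offset); // rewind to the failed record
                    }
                    break; // stop processing the rest of this batch
                }
                processed.add(offset);
            }
        }
        return processed;
    }
}
```

Without the seek, the failed record and the rest of its batch are never seen again, because the position has already advanced past them. With the seek, the next poll starts at the failed offset.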
On failure, you can manually seek to the first offset of the poll for all the assigned partitions. I am not sure how this works with the Spring consumer.
Sample code for seeking back to the beginning of the batch with a plain consumer.
The code below gets the list of records per partition and then seeks to the offset of the first record in each list.
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer}
import scala.jdk.CollectionConverters._ // Scala 2.13+; needed because partitions() returns a java.util.Set

// Rewind every partition in this batch to the offset of its first record,
// so that the next poll() returns the same records again.
def seekBack(consumer: KafkaConsumer[String, String], records: ConsumerRecords[String, String]): Unit =
  records.partitions().asScala.foreach { partition =>
    val firstOffset = records.records(partition).get(0).offset()
    consumer.seek(partition, firstOffset)
  }
One caveat for production: you don't want to seek back every time, only for transient errors; otherwise you will end up retrying infinitely.

Spring-retry restarts after maxAttempts reached when no recover method is provided

I am playing with spring-retry on spring-boot 1.5.21 and noticed that spring-retry restarts when maxAttempts is reached and no recover method is implemented.
It works as expected if a proper recover method is implemented. Without a recover method, retry doesn't stop at maxAttempts but restarts. The number of restarts equals the configured maxAttempts. E.g. with maxAttempts = 3, the call executes 9 times (3 retries * 3 restarts).
Using annotations to setup the retry block
@Retryable(include = {ResourceAccessException.class}, maxAttemptsExpression = "${retry.maxAttempts}", backoff = @Backoff(delayExpression = "${retry.delay}", multiplierExpression = "${retry.delay-multiplier}"))
Expected result with maxAttempts = 3: retry stops after 3 attempts.
Actual result: the 3 attempts are restarted 3 times, for a total of 9 retries.
The above occurs ONLY when no recover method is provided. According to the documentation, the recover method is optional, and I have no need for one, since there is no valid recovery in my case for a failed REST service call (no redundant service is available).
If there is no recoverer, the final exception is thrown.
If the source of the call is a listener container (e.g. RabbitMQ, JMS) then the delivery will be retried.
That's the whole point of a recoverer.
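The multiplication seen in the question can be reproduced with two nested retry loops in plain Java. This is only a simulation: the names are hypothetical, neither spring-retry nor a listener container is involved, and the outer redelivery limit of 3 is an assumption chosen to match the numbers reported above.

```java
final class NestedRetryDemo {

    private static int invocations;

    /** The business call; it fails every time in this sketch. */
    private static void alwaysFails() {
        invocations++;
        throw new IllegalStateException("simulated failure");
    }

    /** Inner retry: rethrows the last exception unless a recoverer absorbs it. Assumes maxAttempts >= 1. */
    private static void retrying(int maxAttempts, Runnable recoverer) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                alwaysFails();
                return;
            } catch (RuntimeException e) {
                last = e;
            }
        }
        if (recoverer != null) {
            recoverer.run(); // the failure is handled here, so nothing propagates
            return;
        }
        throw last;
    }

    /** Outer loop standing in for a container that redelivers when an exception escapes. */
    static int deliver(int maxAttempts, int maxDeliveries, Runnable recoverer) {
        invocations = 0;
        for (int delivery = 0; delivery < maxDeliveries; delivery++) {
            try {
                retrying(maxAttempts, recoverer);
                break; // no exception escaped, so no redelivery
            } catch (RuntimeException e) {
                // exception escaped the inner retry: the message is delivered again
            }
        }
        return invocations;
    }
}
```

With 3 inner attempts and 3 deliveries, the call runs 9 times when there is no recoverer and 3 times when a recoverer (even an empty one) absorbs the final exception.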

Kafka how to set producer retries to Infinity

How can I set the spring-boot property spring.kafka.producer.retries to Integer.MAX_VALUE?
Does leaving this property unset work, or will it then default to 0?
See the Kafka default in the KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
According to the Kafka docs it defaults to Integer.MAX_VALUE (at least with the current version), which concurs with the KIP.
The default value of ProducerConfig.RETRIES_CONFIG is 2147483647, so leaving the retries property undefined should keep that default.
By default it is 2147483647, which is Integer.MAX_VALUE; you can set it to any value in [0,...,2147483647].
From the retries docs:
Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first. Note additionally that produce requests will be failed before the number of retries has been exhausted if the timeout configured by delivery.timeout.ms expires first before successful acknowledgement. Users should generally prefer to leave this config unset and instead use delivery.timeout.ms to control retry behavior.
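A minimal sketch of the two settings mentioned above, using only java.util.Properties. The class name and the timeout value passed in are illustrative assumptions; the keys are the Kafka producer config names quoted in the docs. Setting retries explicitly is shown only for completeness, since the answers above state that Integer.MAX_VALUE is already the default.

```java
import java.util.Properties;

final class ProducerRetryConfig {

    /** Builds producer properties that bound retrying by time rather than by count. */
    static Properties build(long deliveryTimeoutMillis) {
        Properties props = new Properties();
        // Explicit here for illustration; per the answers above this is already the default.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Upper bound on the total time spent sending a record, retries included.
        props.setProperty("delivery.timeout.ms", String.valueOf(deliveryTimeoutMillis));
        return props;
    }
}
```

For example, ProducerRetryConfig.build(120000L) keeps retrying until two minutes have passed for a record, at which point the send fails.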
