Spring-retry restarts after maxAttempts reached when no recover method is provided - spring-boot

I am playing with spring-retry on spring-boot 1.5.21 and noticing that spring-retry restarts once maxAttempts is reached when no recover method is implemented.
It works as expected if a proper recover method is implemented. Without a recover method, retry doesn't stop at maxAttempts but starts over. The number of restarts equals the configured maxAttempts. E.g., with maxAttempts = 3, the method executes 9 times (3 attempts * 3 restarts).
I am using annotations to set up the retry block:
@Retryable(include = {ResourceAccessException.class}, maxAttemptsExpression = "${retry.maxAttempts}", backoff = @Backoff(delayExpression = "${retry.delay}", multiplierExpression = "${retry.delay-multiplier}"))
Expected result with maxAttempts = 3: retry stops after 3 attempts.
Actual result: retry runs the 3 attempts 3 times over, for a total of 9 executions.
The above occurs ONLY when no recover method is provided. Based on the documentation, a recover method is optional, and I have no need for one, since there is no valid recovery in my case for a failed REST service call (no redundant service is available).

If there is no recoverer, the final exception is thrown.
If the source of the call is a listener container (e.g. RabbitMQ, JMS) then the delivery will be retried.
That's the whole point of a recoverer.
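The 3 x 3 = 9 pattern from the question can be reproduced with a plain-Java toy model. This is only a sketch and does not use spring-retry: the inner loop stands in for a @Retryable method without a recoverer, and the outer loop stands in for a caller (such as a listener container) that redelivers after the final exception. The class and method names, and the outer redelivery count of 3, are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class NestedRetryDemo {

    /** Stand-in for a @Retryable method with no recoverer: rethrows once attempts are exhausted. */
    static void retryThenRethrow(int maxAttempts, Runnable call) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                call.run();
                return; // success, stop retrying
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // no recoverer: the final exception goes to the caller
    }

    /** Stand-in for a caller that redelivers a failed message (assumed: up to `deliveries` times). */
    static int countCalls(int deliveries, int maxAttempts) {
        AtomicInteger calls = new AtomicInteger();
        Runnable alwaysFailing = () -> {
            calls.incrementAndGet();
            throw new IllegalStateException("service unavailable");
        };
        for (int delivery = 1; delivery <= deliveries; delivery++) {
            try {
                retryThenRethrow(maxAttempts, alwaysFailing);
                break; // success: no redelivery needed
            } catch (RuntimeException e) {
                // the final exception reached the caller, which redelivers
            }
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println(countCalls(3, 3)); // 3 deliveries x 3 attempts
        System.out.println(countCalls(1, 3)); // a caller that does not redeliver
    }
}
```

If the real caller does not redeliver, the method runs only maxAttempts times, which is why it is worth checking what invokes the retryable method.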

Related

seekToCurrentErrorHandler fails in case there are multiple failed records from different partitions if FixedBackOff is set as FixedBackOff(0L, 1)

With spring-kafka-2.5.4.RELEASE version, when there are multiple failed records from different partitions, seekToCurrentErrorHandler fails if FixedBackOff is set with maxAttempts as 1 and interval other than -1L.
SeekToCurrentErrorHandler seekToCurrentErrorHandler = new SeekToCurrentErrorHandler(,new FixedBackOff(0L, 1));
Although setting an interval other than -1L doesn't make sense when the maxAttempts count is 1 (as there will be no retry and hence no retry interval), shouldn't it either fail at startup with a complaint about this or be handled appropriately?
It fails at run time, when there are multiple failed records from different partitions, with the error below.
ERROR org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer - Error handler threw an exception
org.springframework.kafka.KafkaException: Seek to current after exception; nested exception is org.springframework.kafka.listener.ListenerExecutionFailedException: <some IO Exception here, not one of them defined in FailedRecordProcessor.configureDefaultClassifier()>
at org.springframework.kafka.listener.SeekUtils.seekOrRecover(SeekUtils.java:157)
This seems to come from line 96 of FailedRecordTracker (i.e. if (nextBackOff != BackOffExecution.STOP) { ):
https://github.com/spring-projects/spring-kafka/blob/v2.5.4.RELEASE/spring-kafka/src/main/java/org/springframework/kafka/listener/FailedRecordTracker.java#L96
which subsequently results in reaching line 157 of SeekUtils (i.e. throw new KafkaException("Seek to current after exception", level, thrownException);):
https://github.com/spring-projects/spring-kafka/blob/v2.5.4.RELEASE/spring-kafka/src/main/java/org/springframework/kafka/listener/SeekUtils.java#L157
Perhaps you are migrating from an older version.
maxAttempts in FixedBackOff means max retry attempts so should be 0 for no retries.
See https://docs.spring.io/spring-kafka/docs/2.5.10.RELEASE/reference/html/#seek-to-current
Starting with version 2.3, a BackOff can be provided to the SeekToCurrentErrorHandler and DefaultAfterRollbackProcessor so that the consumer thread can sleep for some configurable time between delivery attempts. Spring Framework provides two out of the box BackOffs, FixedBackOff and ExponentialBackOff. The maximum back off time must not exceed the max.poll.interval.ms consumer property, to avoid a rebalance.
IMPORTANT: Previously, the configuration was "maxFailures" (which included the first delivery attempt). When using a FixedBackOff, its maxAttempts property represents the number of delivery retries (one less than the old maxFailures property). Also, maxFailures=-1 meant retry indefinitely with the old configuration, with a BackOff you would set the maxAttempts to Long.MAX_VALUE for a FixedBackOff and leave the maxElapsedTime to its default in an ExponentialBackOff.
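To make the "retries, not deliveries" meaning of maxAttempts concrete, here is a small plain-Java model of a fixed back-off. It is a simplified stand-in written for this answer, not Spring's FixedBackOff class; the only detail borrowed from Spring is the STOP sentinel value of -1. The class and method names are hypothetical.

```java
final class FixedBackOffModel {

    static final long STOP = -1L; // same sentinel value as Spring's BackOffExecution.STOP

    private final long interval;
    private final long maxAttempts;
    private long currentAttempts = 0;

    FixedBackOffModel(long interval, long maxAttempts) {
        this.interval = interval;
        this.maxAttempts = maxAttempts;
    }

    /** Returns the wait before the next retry, or STOP when the retries are used up. */
    long nextBackOff() {
        currentAttempts++;
        return currentAttempts <= maxAttempts ? interval : STOP;
    }

    /** Counts how often a record that always fails is handed to the listener. */
    static int deliveriesOfAlwaysFailingRecord(long interval, long maxAttempts) {
        FixedBackOffModel backOff = new FixedBackOffModel(interval, maxAttempts);
        int deliveries = 0;
        do {
            deliveries++; // the listener is invoked and fails
        } while (backOff.nextBackOff() != STOP);
        return deliveries;
    }

    public static void main(String[] args) {
        System.out.println(deliveriesOfAlwaysFailingRecord(0L, 0)); // no retries
        System.out.println(deliveriesOfAlwaysFailingRecord(0L, 1)); // one retry
    }
}
```

In this model, maxAttempts = 0 gives a single delivery and maxAttempts = 1 gives two, which matches the "one less than the old maxFailures" note above.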

How Kafka retries works with request.timeout.?

I have configured my producer with request.timeout.ms = 70,000 ms and retries = 5. I am unsure how this actually works:
Does it retry 5 times after request.timeout.ms = 70,000 expires, or does it retry 5 times within the given request.timeout.ms = 70,000, using the retry.backoff.ms value?
There are 3 important configs to be aware of:
"request.timeout.ms" - how long the producer waits for the response to a single request before treating that request as failed
"delivery.timeout.ms" - time to complete the entire send operation, including all retries
"retries" - how many times to retry when the broker responds with retriable errors.
The Apache Kafka recommendation is to set "delivery.timeout.ms" and leave the other two configurations at their default values. The idea is that the main thing you as a user should worry about is how long you want to wait for Kafka to figure things out before giving up on it. It doesn't really matter what is taking Kafka so long (the connection, getting metadata, long queues, etc.); the only thing that matters is how long you are willing to wait.
Now to your question: request.timeout.ms applies to each retry. So the producer will send the record batch to Kafka, and if there is no response after 70,000 ms it will consider this a failure and retry. Note that most errors (say, NoLeaderForPartition) will return from the broker much faster (which is why retry backoffs are needed).
Reasoning about delivery times with retries + request.timeout.ms turned out to be near impossible even for those who wrote the producer. Hence the introduction of delivery.timeout.ms, with a very clear contract.
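As a rough illustration of how the three settings interact, the sketch below builds the producer settings as plain java.util.Properties and estimates how many attempts could fully time out before delivery.timeout.ms expires. The property keys are real Kafka producer config names; the values 120000 for delivery.timeout.ms and 100 for retry.backoff.ms are assumed here, and the arithmetic is a deliberate simplification that ignores batching, metadata fetches and queue time. The class and helper names are hypothetical.

```java
import java.util.Properties;

final class ProducerTimeoutSketch {

    /** Producer settings as plain string properties (values are assumptions for the example). */
    static Properties producerProps(long deliveryTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("delivery.timeout.ms", Long.toString(deliveryTimeoutMs)); // bound for the whole send
        props.setProperty("request.timeout.ms", "70000"); // wait per in-flight request
        props.setProperty("retries", "5");
        props.setProperty("retry.backoff.ms", "100");
        return props;
    }

    /** Rough count of attempts that can run into request.timeout.ms before the delivery timeout. */
    static int attemptsThatCanTimeOut(Properties p) {
        long delivery = Long.parseLong(p.getProperty("delivery.timeout.ms"));
        long request = Long.parseLong(p.getProperty("request.timeout.ms"));
        long backoff = Long.parseLong(p.getProperty("retry.backoff.ms"));
        int maxAttempts = Integer.parseInt(p.getProperty("retries")) + 1; // first send + retries

        int attempts = 0;
        long elapsed = 0;
        while (attempts < maxAttempts && elapsed + request <= delivery) {
            elapsed += request + backoff; // one timed-out request, then the back-off pause
            attempts++;
        }
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println(attemptsThatCanTimeOut(producerProps(120_000L)));
        System.out.println(attemptsThatCanTimeOut(producerProps(600_000L)));
    }
}
```

With a 70,000 ms request timeout, a 120,000 ms delivery timeout leaves room for only one full request timeout in this model, so retries = 5 would never be exhausted by timeouts alone; only a much larger delivery timeout lets all 6 attempts time out.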

Circuit Breaker for an orchestrator

I have a facade which calls 3 different services for some types of requests and then orchestrates their responses before sending a response back to the client. It is mandatory that all 3 services are up and serving as expected. The client request cannot be served if even one of them is down.
I am looking for a circuit breaker to solve this problem. The circuit breaker should respond with an error code if even one of the services is down. I was checking the resilience4j circuit breaker and it doesn't fit my problem.
https://resilience4j.readme.io/docs/circuitbreaker
Is there any other open source available?
Why doesn't it fit your problem?
You can protect every service with a CircuitBreaker. As soon one of the CircuitBreakers is open, you can short circuit and directly return an error response to your client.
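A minimal plain-Java sketch of that idea follows. It uses a hand-rolled consecutive-failure breaker as a stand-in, not Resilience4j's CircuitBreaker, and it has no half-open state or sliding window. All names, the threshold of 2, and the "502"/"503" response strings are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

final class PerServiceBreakerDemo {

    /** Toy breaker: opens after a number of consecutive failures and never closes again. */
    static final class ToyBreaker {
        private final int failureThreshold;
        private int consecutiveFailures = 0;

        ToyBreaker(int failureThreshold) {
            this.failureThreshold = failureThreshold;
        }

        boolean isOpen() {
            return consecutiveFailures >= failureThreshold;
        }

        <T> T call(Supplier<T> protectedCall) {
            if (isOpen()) {
                throw new IllegalStateException("circuit open");
            }
            try {
                T result = protectedCall.get();
                consecutiveFailures = 0; // a success resets the count
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                throw e;
            }
        }
    }

    /** Facade: answer with an error right away if any breaker is open, else call every service. */
    static String handle(List<ToyBreaker> breakers, List<Supplier<String>> services) {
        for (ToyBreaker breaker : breakers) {
            if (breaker.isOpen()) {
                return "503"; // short circuit: no service is called
            }
        }
        StringBuilder combined = new StringBuilder();
        try {
            for (int i = 0; i < services.size(); i++) {
                combined.append(breakers.get(i).call(services.get(i)));
            }
        } catch (RuntimeException e) {
            return "502"; // a service failed during this request
        }
        return combined.toString();
    }

    /** Sends three requests while the second service is down and returns the responses. */
    static String demo() {
        Supplier<String> first = () -> "A";
        Supplier<String> second = () -> { throw new IllegalStateException("service 2 is down"); };
        Supplier<String> third = () -> "C";
        List<Supplier<String>> services = Arrays.asList(first, second, third);
        List<ToyBreaker> breakers =
                Arrays.asList(new ToyBreaker(2), new ToyBreaker(2), new ToyBreaker(2));
        List<String> responses = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            responses.add(handle(breakers, services));
        }
        return String.join(",", responses);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The first two requests fail against the broken service; once its breaker is open, the third request is rejected before any service is called.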
A CircuitBreaker works on a protected function, as below:
Thread <-> CircuitBreaker <-> Protected_Function
A Protected_Function can call 1 or more microservices. Mostly we use 1 Protected_Function per external microservice call, because we can then tune resilience based on the profile or behavior of that particular microservice. But as your requirement is different, we can have 3 calls under 1 Protected_Function.
As per your explanation above, your facade is calling 3 microservices (assume in series). What you can do is call your facade, or all 3 services, through or inside a Protected_Function:
@CircuitBreaker(name = "OVERALL_PROTECTION")
public Your_Response Protected_Function (Your_Request) {
    Call_To_Service_1;
    Call_To_Service_2;
    Call_To_Service_3;
    return Orchestrate_Your_Response;
}
Further, you can configure the resilience for OVERALL_PROTECTION in your YAML property file as below (I have used a count-based sliding window):
resilience4j.circuitbreaker:
  backends:
    OVERALL_PROTECTION:
      registerHealthIndicator: true
      slidingWindowSize: 100 # start rate calc after 100 calls
      minimumNumberOfCalls: 100 # minimum calls before the CircuitBreaker can calculate the error rate.
      permittedNumberOfCallsInHalfOpenState: 10 # number of permitted calls when the CircuitBreaker is half open
      waitDurationInOpenState: 10s # time that the CircuitBreaker should wait before transitioning from open to half-open
      failureRateThreshold: 50 # failure rate threshold in percentage
      slowCallRateThreshold: 100 # consider all transactions under interceptor for slow call rate
      slowCallDurationThreshold: 2s # if a call is taking more than 2s then increase the error rate
      recordExceptions: # increment error rate if following exception occurs
        - org.springframework.web.client.HttpServerErrorException
        - java.io.IOException
        - org.springframework.web.client.ResourceAccessException
You can also use a time-based sliding window instead of a count-based one if you wish. I have added a # comment next to each parameter in the configuration to explain it.
resilience4j.retry:
  instances:
    OVERALL_PROTECTION:
      maxRetryAttempts: 5
      waitDuration: 100
      retryExceptions:
        - org.springframework.web.client.HttpServerErrorException
        - java.io.IOException
        - org.springframework.web.client.ResourceAccessException
The above configuration will make up to 5 attempts if one of the exceptions listed under retryExceptions occurs (in Resilience4j this count includes the initial call).
resilience4j.ratelimiter:
  instances:
    OVERALL_PROTECTION:
      timeoutDuration: 100ms # The default wait time a thread waits for a permission
      limitRefreshPeriod: 1000 # The period of a limit refresh. After each period the rate limiter sets its permissions count back to the limitForPeriod value
      limitForPeriod: 25 # The number of permissions available during one limit refresh period
The above configuration will allow at most 25 calls per second.

KafkaConsumer poll() behavior understanding

I am new to Kafka and trying to understand how the poll event loop works.
Use case: 25 records on the topic, max poll size is set to 5.
max.poll.interval.ms = 5000 //5 seconds by default
max.poll.records = 5
Sequence of tasks:
1. Poll the records from the topic.
2. Process the records in a for loop.
3. Run some processing logic, which either passes or fails.
4. If the logic passes, the record (with its offset) is added to a map.
5. The map is then committed using a commitSync call.
6. If the logic fails, the loop breaks and whatever succeeded before is committed. The problem starts after this.
The next poll just keeps moving in batches of 5 even after an error. Is that expected?
What we expect is that the loop breaks, the offsets up to the last successfully processed message get committed, and the next poll continues from the failed message.
Example: in the 1st batch, 5 messages are polled; offsets 1 and 2 are successful and committed, then the 3rd fails. The poll calls keep moving to the next batches, like 5-10, 10-15. If there is an error in between, we expect it to stop at that point, and the next poll should start from 3 in the first case. If it fails in the 2nd batch at 8, the next poll should start from the 8th offset, not from the next batch position given by the max poll settings (which would be like 5 in this case).
In case it matters: this is a Spring Boot project and enable.auto.commit is false.
I have tried finding this in the documentation, with no luck.
I also tried tweaking max.poll.interval.ms, but that did not help.
EDIT: I have not accepted an answer because there is no direct solution for a custom consumer. Keeping this for informational purposes.
max.poll.interval.ms is milliseconds, not seconds so it should be 5000.
Once the records have been returned by the poll (and offsets not committed), they won't be returned again unless you restart the consumer or perform seek() operations on the consumer to reset the offset to the unprocessed ones.
The Spring for Apache Kafka project provides a SeekToCurrentErrorHandler to perform this task for you.
If you are using the consumer yourself (which it sounds like), you must do the seeks.
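The effect of seeking can be seen in a toy single-partition model written in plain Java. This is only a sketch: ToyConsumer is a hypothetical stand-in that mimics how a consumer's position advances on every poll, it is not the KafkaConsumer API, and the log size of 10 and batch size of 5 are assumed values.

```java
import java.util.ArrayList;
import java.util.List;

final class SeekOnFailureDemo {

    /** Toy single-partition consumer: the position advances on poll, whether or not offsets are committed. */
    static final class ToyConsumer {
        private final int logEndOffset;
        private final int maxPollRecords;
        private int position = 0;

        ToyConsumer(int logEndOffset, int maxPollRecords) {
            this.logEndOffset = logEndOffset;
            this.maxPollRecords = maxPollRecords;
        }

        List<Integer> poll() {
            List<Integer> batch = new ArrayList<>();
            while (batch.size() < maxPollRecords && position < logEndOffset) {
                batch.add(position++);
            }
            return batch;
        }

        void seek(int offset) {
            position = offset;
        }
    }

    /** Processes the log; offset `failOnce` fails the first time it is seen. Returns the processed offsets. */
    static List<Integer> run(boolean seekOnFailure, int failOnce) {
        ToyConsumer consumer = new ToyConsumer(10, 5);
        List<Integer> processed = new ArrayList<>();
        boolean alreadyFailed = false;
        List<Integer> batch;
        while (!(batch = consumer.poll()).isEmpty()) {
            for (int offset : batch) {
                if (offset == failOnce && !alreadyFailed) {
                    alreadyFailed = true;
                    if (seekOnFailure) {
                        consumer.seek(offset); // the next poll starts at the failed record
                    }
                    break; // abandon the rest of this batch
                }
                processed.add(offset);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(false, 2)); // without seek, offsets 2-4 are skipped
        System.out.println(run(true, 2));  // with seek, processing resumes at offset 2
    }
}
```

Without the seek, the rest of the failed batch is never seen again in this model, which is the behavior described in the question.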
You can manually seek to the beginning offset of the poll for all the assigned partitions on failure. I am not sure how this is done with the Spring consumer.
Sample code for seeking back to the beginning offset with a plain consumer:
In the code below, I get the list of records per partition and then take the offset of the first record to seek to.
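import scala.collection.JavaConverters._ // needed to iterate the java.util.Set returned by partitions()

// Rewind every partition in this batch to the offset of its first record.
// `consumer` is the KafkaConsumer[String, String] that returned `records`.
def seekBack(records: ConsumerRecords[String, String]): Unit = {
  records.partitions().asScala.foreach { partition =>
    val partitionedRecords = records.records(partition)
    val offset = partitionedRecords.get(0).offset()
    consumer.seek(partition, offset)
  }
}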
One problem with doing this in production is that you don't want to seek back every time, only when you have a transient error; otherwise you will end up retrying infinitely.

How JTA/JTS handle transaction time out issue?

Below is my understanding of how JTA/JTS handles the transaction timeout issue, but I cannot find any document or material to back it up. Is my understanding right? Do you know of any material that covers this issue?
The application server iterates through all the transactions to check for timeouts. If a transaction timeout occurs, the application server marks the transaction for rollback and logs the details. But the application server neither throws an exception nor interrupts the transaction at this moment. When the transaction thread later attempts to access another transactional resource (like JDBC/JMS), the transactional resource, which implements the JTA interface, checks the rollback flag before going further. At that moment, a RollbackException is thrown.
==========
Test Case 1:
Set transaction timeout to 10 secs
I. Transaction begin
II. Sleep 20 secs
III. System out "Sleep end"
Result: The timeout occurs at the 10th second and the timeout details are logged to system out, but no exception is thrown. "Sleep end" is printed.
==========
Test Case 2:
Set transaction timeout to 10 secs
I. Transaction begin
II. Sleep 20 secs
III. Access db 1st time
IV. Access db 2nd time
V. System out "Sleep end"
Result: The timeout occurs at the 10th second and the timeout details are logged to system out, but no exception is thrown at that point. The exception is thrown on the 1st db access. "Sleep end" is not printed.
==========
Test Case 3:
Set transaction timeout to 10 secs
I. Transaction begin
II. Access db and db deadlock
Result: The timeout occurs at the 10th second and the timeout details are logged to system out. No exception is thrown and the transaction thread is stuck. So transaction timeout control cannot handle the db timeout issue. I am quite confused about this.
In my understanding, the above behavior should be the same when using Spring transaction management (JTA) and EJB. Am I right?
Thanks for your help!
I tested this, and the tests confirm that my understanding is correct.
Summarize the result as below:
• Transaction timeout control only affects transactional activities (e.g. accessing a DB or sending a JMS message).
• The application server does not interrupt the current transaction thread immediately when the timeout occurs; it only logs the details. The timeout exception is thrown when the transaction commits or attempts the next transactional activity.
• A DB deadlock cannot be handled by transaction timeout control. But DB2 has a deadlock prevention mechanism that releases the deadlock and rolls back the transaction in some cases.
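The first two bullets can be pictured with a toy model in plain Java. This is a sketch of the observed behavior only, not the JTA API: ToyTransaction, its rollback-only check and the simulated clock are all assumptions, and a real transaction manager would surface the failure as a RollbackException or similar rather than the IllegalStateException used here.

```java
final class TxTimeoutDemo {

    /** Toy transaction: passing the deadline only marks it rollback-only; nothing interrupts the thread. */
    static final class ToyTransaction {
        private final long deadlineMillis;

        ToyTransaction(long startMillis, long timeoutMillis) {
            this.deadlineMillis = startMillis + timeoutMillis;
        }

        boolean isRollbackOnly(long nowMillis) {
            return nowMillis >= deadlineMillis;
        }

        /** Stand-in for a JDBC/JMS call: the resource checks the flag before doing any work. */
        void accessResource(long nowMillis) {
            if (isRollbackOnly(nowMillis)) {
                throw new IllegalStateException("transaction marked rollback-only after timeout");
            }
        }
    }

    /** Replays test case 1 (no db access) or test case 2 (db access after the sleep). */
    static String run(boolean accessDbAfterSleep) {
        ToyTransaction tx = new ToyTransaction(0L, 10_000L); // 10 s timeout
        long now = 20_000L; // simulated 20 s sleep; the thread keeps running past the timeout
        if (accessDbAfterSleep) {
            try {
                tx.accessResource(now);
            } catch (IllegalStateException e) {
                return "exception on first db access";
            }
        }
        return "Sleep end";
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // test case 1
        System.out.println(run(true));  // test case 2
    }
}
```

Test case 3 is not modeled: a thread blocked inside the database never reaches the check, which is why the timeout cannot break a deadlock.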

Resources