Resilience4j circuit breaker doesn't open when slowCallRateThreshold is reached? - spring-boot

I've decorated the following method with a Resilience4j CircuitBreaker:
@CircuitBreaker(name = "getSwaggerFileName")
private Optional<String> getSwaggerFileName(String url) {
    return Optional.ofNullable(restTemplate.getForObject(url, SwaggerQueryResponse.class))
            .flatMap(SwaggerQueryResponse::getFileName);
}
My circuit breaker config is stored in application.yaml:
resilience4j:
  circuitbreaker:
    configs:
      default:
        slowCallRateThreshold: 50
        slowCallDurationThreshold: 1
        slidingWindowSize: 20
        waitDurationInOpenState: 60000
    instances:
      getSwaggerFileName:
        baseConfig: default
This file is definitely being found, and a similar setup works fine with the @RateLimiter annotation. I've lowered slowCallDurationThreshold from 2000ms to 1ms, and even tried adding a Thread.sleep call to the method to verify that it takes longer than the duration threshold. The method runs around 100 times (edit: the problem still occurs when slidingWindowSize and minimumNumberOfCalls are set well below this number).
But my circuit breaker doesn't open, even after around 100 calls to this method. As I see it, it should open if more than 50% of those calls take more than 1ms. Am I missing something?

It's because the method is private. The method must be public so that it can be proxied.
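The proxying point can be illustrated with a plain JDK dynamic proxy. This is only a minimal sketch, not how Spring wires the annotation internally: FileNameService, FileNameServiceImpl, and the intercepted counter are hypothetical names invented for the example. It shows that an interceptor only sees calls that go through the proxy's public interface, while the private helper called inside the target is never intercepted.

```java
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

public class ProxyDemo {

    /** Hypothetical public API of a bean; only these methods can be intercepted. */
    public interface FileNameService {
        String lookup(String url);
    }

    public static class FileNameServiceImpl implements FileNameService {
        @Override
        public String lookup(String url) {
            // This internal call goes straight to the object, not through any proxy.
            return helper(url);
        }

        private String helper(String url) {
            return "file-for-" + url;
        }
    }

    /** Counts how many calls the interceptor actually saw. */
    public static final AtomicInteger intercepted = new AtomicInteger();

    /** Wraps the target in a proxy whose handler stands in for a circuit breaker aspect. */
    public static FileNameService wrap(FileNameService target) {
        return (FileNameService) Proxy.newProxyInstance(
                FileNameService.class.getClassLoader(),
                new Class<?>[] {FileNameService.class},
                (proxy, method, args) -> {
                    intercepted.incrementAndGet();
                    return method.invoke(target, args);
                });
    }

    public static void main(String[] args) {
        FileNameService service = wrap(new FileNameServiceImpl());
        System.out.println(service.lookup("a"));
        // Only the public lookup call was seen; the private helper was not.
        System.out.println("intercepted calls: " + intercepted.get());
    }
}
```

In the question's case the annotated method itself is private, so there is no proxied entry point at all and the circuit breaker never records a call.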

Related

spring cloud stream kafka batch does not consume messages in every 15 minutes even after increasing this config, 'fetch.max.wait.ms'

I want to consume messages in batch mode every 15 minutes.
For that I have set these properties:
spring.cloud.stream.kafka.binder.consumer-properties.max.poll.records=5000000
spring.cloud.stream.kafka.binder.consumer-properties.fetch.max.wait.ms=900000
spring.cloud.stream.kafka.binder.consumer-properties.fetch.min.bytes=500000000
Consuming messages works fine when I set spring.cloud.stream.kafka.binder.consumer-properties.fetch.max.wait.ms to between 10000 and 30000 (10 to 30 seconds).
But if I increase fetch.max.wait.ms to 1 minute or more, no messages are consumed even after the wait time is over.
I know the default value is 500 ms, but will there be an issue if I increase it?
And how can I get the desired behaviour (the consumer waits 10-15 minutes before consuming the next batch)?
Can I use max.poll.interval.ms for that?
I was able to consume messages every 15 minutes by setting this property:
spring.cloud.stream.kafka.binder.consumer-properties.max.poll.interval.ms=1000000
and by setting the idle time between polls through a container property:
@Bean
public ListenerContainerCustomizer<AbstractMessageListenerContainer<?, ?>> customizer() {
    return (container, dest, group) ->
            container.getContainerProperties().setIdleBetweenPolls(idlePollTimeout);
}

Retry with 30 minutes delay

I need to call an external REST service. If the first attempt fails, I have to call it again after 30 minutes. I can call it at most 3 times this way.
I know Spring has RetryTemplate for retries, but I don't think it fits my case: I have to do this for more than 1000 records.
Any idea how I can achieve this in Spring?
Use a TaskScheduler.
scheduler.schedule(() -> { ... },
        new Date(System.currentTimeMillis() + 30 * 60_000));
Keep track of the number of attempts and, if they are not exhausted, re-schedule.
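The "keep track and re-schedule" idea can be sketched with the JDK's ScheduledExecutorService, used here in place of Spring's TaskScheduler so the example runs without Spring. The class name DelayedRetry and its parameters are assumptions made for the illustration, and the demo uses a 50 ms delay where the real case would pass 30 minutes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class DelayedRetry {
    private final ScheduledExecutorService scheduler;
    private final long delayMillis;
    private final int maxAttempts;

    public DelayedRetry(ScheduledExecutorService scheduler, long delayMillis, int maxAttempts) {
        this.scheduler = scheduler;
        this.delayMillis = delayMillis;
        this.maxAttempts = maxAttempts;
    }

    /** Runs the call now; the future completes with the attempt number that succeeded. */
    public CompletableFuture<Integer> submit(BooleanSupplier call) {
        CompletableFuture<Integer> result = new CompletableFuture<>();
        attempt(call, 1, result);
        return result;
    }

    private void attempt(BooleanSupplier call, int attemptNo, CompletableFuture<Integer> result) {
        boolean ok;
        try {
            ok = call.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // treat an exception like a failed call
        }
        if (ok) {
            result.complete(attemptNo);
        } else if (attemptNo >= maxAttempts) {
            result.completeExceptionally(
                    new IllegalStateException("gave up after " + attemptNo + " attempts"));
        } else {
            // Not exhausted yet: re-schedule the next attempt after the delay.
            scheduler.schedule(() -> attempt(call, attemptNo + 1, result),
                    delayMillis, TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        DelayedRetry retry = new DelayedRetry(scheduler, 50, 3); // 50 ms only for the demo
        // The simulated service fails twice and succeeds on the third call.
        int attempts = retry.submit(() -> calls.incrementAndGet() == 3).join();
        System.out.println("succeeded on attempt " + attempts);
        scheduler.shutdown();
    }
}
```

Because waiting threads are not blocked between attempts, one small scheduler can handle the 1000+ records mentioned in the question. The attempt counts live only in memory, so they would be lost on a restart.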

Circuit Breaker for an orchestrator

I have a facade that calls 3 different services for certain types of requests and orchestrates their responses before replying to the client. All 3 services must be up and serving as expected; the client request cannot be served if even one of them is down.
I am looking for a circuit breaker to solve this problem. It should respond with an error code if even one of the services is down. I looked at the Resilience4j circuit breaker, and it doesn't seem to fit my problem.
https://resilience4j.readme.io/docs/circuitbreaker
Is there any other open source available?
Why doesn't it fit your problem?
You can protect every service with a CircuitBreaker. As soon as one of the CircuitBreakers is open, you can short-circuit and directly return an error response to your client.
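The short-circuit check can be sketched in plain Java as follows. SimpleBreaker is a hypothetical stand-in that only exposes an open/closed flag; a real circuit breaker would set that state itself based on recorded failures. The names and the error string are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class OrchestratorSketch {

    /** Hypothetical stand-in for a real circuit breaker: it only exposes open/closed. */
    public static final class SimpleBreaker {
        private final String name;
        private volatile boolean open;

        public SimpleBreaker(String name) {
            this.name = name;
        }

        public String name() {
            return name;
        }

        public boolean isOpen() {
            return open;
        }

        public void setOpen(boolean open) {
            this.open = open;
        }
    }

    /** Fails fast if any breaker is open; otherwise runs the orchestration. */
    public static String handle(List<SimpleBreaker> breakers, Supplier<String> orchestration) {
        for (SimpleBreaker breaker : breakers) {
            if (breaker.isOpen()) {
                return "ERROR: " + breaker.name() + " is unavailable";
            }
        }
        return orchestration.get();
    }

    public static void main(String[] args) {
        SimpleBreaker first = new SimpleBreaker("service1");
        SimpleBreaker second = new SimpleBreaker("service2");
        SimpleBreaker third = new SimpleBreaker("service3");
        List<SimpleBreaker> all = Arrays.asList(first, second, third);

        System.out.println(handle(all, () -> "OK"));  // all closed: orchestration runs
        second.setOpen(true);
        System.out.println(handle(all, () -> "OK"));  // one open: error returned at once
    }
}
```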
A CircuitBreaker wraps a protected function, as below:
Thread <-> CircuitBreaker <-> Protected_Function
A Protected_Function can call one or more microservices. We usually use one Protected_Function per external microservice call, because that lets us tune resilience to the profile or behavior of that particular microservice. Since your requirement is different, you can put all 3 calls under one Protected_Function.
As per your explanation above, your Façade calls 3 microservices (assume in series). You can call your Façade, or all 3 services, inside a protected function:
@CircuitBreaker(name = "OVERALL_PROTECTION")
public Your_Response Protected_Function (Your_Request) {
    Call_To_Service_1;
    Call_To_Service_2;
    Call_To_Service_3;
    return Orchestrate_Your_Response;
}
You can then configure resilience for OVERALL_PROTECTION in your YAML property file as below (I have used a count-based sliding window):
resilience4j.circuitbreaker:
  backends:
    OVERALL_PROTECTION:
      registerHealthIndicator: true
      slidingWindowSize: 100 # number of calls recorded in the count-based sliding window
      minimumNumberOfCalls: 100 # minimum calls before the CircuitBreaker can calculate the error rate
      permittedNumberOfCallsInHalfOpenState: 10 # number of permitted calls when the CircuitBreaker is half open
      waitDurationInOpenState: 10s # time that the CircuitBreaker should wait before transitioning from open to half-open
      failureRateThreshold: 50 # failure rate threshold in percentage
      slowCallRateThreshold: 100 # slow call rate threshold in percentage; 100 means open only when every recorded call is slow
      slowCallDurationThreshold: 2s # a call taking more than 2s counts as a slow call
      recordExceptions: # increment the error rate if one of the following exceptions occurs
        - org.springframework.web.client.HttpServerErrorException
        - java.io.IOException
        - org.springframework.web.client.ResourceAccessException
You can also use a time-based sliding window instead of a count-based one if you wish. The comment next to each parameter in the configuration explains what it does.
resilience4j.retry:
  instances:
    OVERALL_PROTECTION:
      maxRetryAttempts: 5
      waitDuration: 100
      retryExceptions:
        - org.springframework.web.client.HttpServerErrorException
        - java.io.IOException
        - org.springframework.web.client.ResourceAccessException
The configuration above retries up to 5 times when one of the exceptions listed under retryExceptions occurs.
resilience4j.ratelimiter:
  instances:
    OVERALL_PROTECTION:
      timeoutDuration: 100ms # The default wait time a thread waits for a permission
      limitRefreshPeriod: 1000 # The period of a limit refresh. After each period the rate limiter sets its permissions count back to the limitForPeriod value
      limitForPeriod: 25 # The number of permissions available during one limit refresh period
The configuration above allows at most 25 transactions per second.

SimpleMeterRegistry clears data if data not polled every minute

I have a simple spring boot app with the following config (the project is available here on GitHub):
management:
  metrics:
    export:
      simple:
        mode: step
  endpoints:
    web:
      exposure:
        include: "*"
The above config creates SimpleMeterRegistry and configures its metrics to be step-based, with 60 seconds step. I have one script that sends 50-100 requests per second to the service dummy endpoint and there's the other script that polls the data from /actuator/metrics/http.server.requests every X seconds. When I run the latter script every 60 seconds everything works as expected, but when the script is run every 120 seconds, the response always contains zeros for TOTAL_TIME and COUNT metrics.
Can anyone explain this behavior?
I have read the documentation here. The picture below (image not included) could indicate that a registry will try to aggregate the data for the previous interval only if pollAsRate is called during the current interval. This would explain why it does not work for a 120-second interval. But this is just my assumption; does anyone know what is really happening here?
Spring boot version: 2.1.7.RELEASE
UPDATE
I did a similar test with management.metrics.export.simple.step=10s: it works fine when the polling interval is 10s and does not work when it is 20s. With a 15s interval it works sporadically. So it's definitely related to the step size and polling frequency.
MAX, TOTAL_TIME, and COUNT are values of the Statistic enum.
DistributionStatisticConfig has .expiry(Duration.ofMinutes(2)), which resets some measurements to 0 if no request has been made in the last 2 minutes (120 seconds).
The TimeWindowMax(Clock clock, ...) constructor and the private void rotate() method were written for this purpose. You may see the implementation here.
More Detailed Answer
Finally figured out what is happening.
On every request to /actuator/metrics, MetricsEndpoint is going to merge measures (see here). That is done by collecting values for all meters with measurement.getValue(). The StepMeasurement.getValue() will not simply return the value, it will update the current and the previous intervals and counts, and roll the count (see here and here).
StepMeasurement.getValue
public double getValue() {
    double absoluteCount = (Double)this.f.get();
    double inc = Math.max(0.0D, absoluteCount - this.lastCount.sum());
    this.lastCount.add(inc);
    this.value.getCurrent().add(inc);
    return this.value.poll();
}
StepDouble.poll
public double poll() {
    rollCount(clock.wallTime());
    return previous;
}
How is this related to the polling interval? If you do not poll /actuator/metrics endpoint, the current and previous intervals will not be updated, thus resulting in the current interval not being up-to-date and metrics being recorded for the "wrong" interval.
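The effect can be reproduced with a simplified, single-threaded model of a step value. This is only a sketch and not Micrometer's actual class: StepValueSketch and its explicit nowMillis parameter are invented for the example. It also assumes that when a roll happens more than one step after the previous roll, the "previous" value is reset to zero. That is my reading of the rollCount logic in the Micrometer version discussed here and should be treated as an assumption.

```java
public class StepValueSketch {
    private final long stepMillis;
    private long lastStep;   // index of the step in which the last roll happened
    private double current;  // value accumulated since the last roll
    private double previous; // value reported to pollers

    public StepValueSketch(long stepMillis, long nowMillis) {
        this.stepMillis = stepMillis;
        this.lastStep = nowMillis / stepMillis;
    }

    /** Records activity; note that recording alone never rolls the interval. */
    public void add(double amount) {
        current += amount;
    }

    /** Rolls the interval if a step boundary was crossed, then returns the previous value. */
    public double poll(long nowMillis) {
        long step = nowMillis / stepMillis;
        if (lastStep < step) {
            double completed = current;
            current = 0;
            // Assumption: the accumulated value is kept only when the last roll
            // happened exactly one step ago; otherwise it is dropped.
            previous = (lastStep == step - 1) ? completed : 0.0;
            lastStep = step;
        }
        return previous;
    }

    public static void main(String[] args) {
        StepValueSketch polledEveryStep = new StepValueSketch(60_000, 0);
        polledEveryStep.add(100);
        System.out.println(polledEveryStep.poll(60_000));       // prints 100.0

        StepValueSketch polledEveryOtherStep = new StepValueSketch(60_000, 0);
        polledEveryOtherStep.add(100);
        System.out.println(polledEveryOtherStep.poll(120_000)); // prints 0.0
    }
}
```

Under that assumption, polling at twice the step size always lands two steps after the previous roll, which matches the zeros observed in the question.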

Spring boot with feign and hystrix: Can't get request timeouts to work

I'm having problems getting hystrix timeouts to work. I've created an example project to show this here: https://github.com/stianlagstad/spring-boot-timeout-demo.
In bootstrap.yml I'm setting a timeout like this:
hystrix:
  command:
    default:
      execution.isolation.thread.timeoutInMilliseconds: 60000
      circuitBreaker:
        enabled: true
        sleepWindowInMilliseconds: 300000
      fallback.enabled: false
    # My client
    MyFeignClient#getPost:
      execution.isolation.thread.timeoutInMilliseconds: 1
I expect Hystrix commands to time out after 60 seconds, except for getPost in MyFeignClient, which should time out after 1 millisecond. I'm not seeing that, though. The getPost method returns an answer every time, and I'm pretty sure it takes longer than one millisecond.
I've also tried to set the timeout manually in a test using ConfigurationManager, but that doesn't seem to work either: https://github.com/stianlagstad/spring-boot-timeout-demo/blob/master/src/test/java/com/example/TimeoutDemoApplicationTests.java
How can I make the timeouts I'm setting take effect?
You need to fix your properties in two places.
First, add the property below. Since the Dalston release, Feign's Hystrix support is opt-in. You already have Hystrix on your classpath, so all you need to do is add this property:
feign:
  hystrix:
    enabled: true
Second, you specified the wrong HystrixCommandKey for your Feign client. Change it as shown below:
MyFeignClient#getPost():
You need parentheses after #getPost.
