minimumNumberOfCalls not working in resilience4j - spring-boot

Using Spring Boot 2.4 and Resilience4j 1.5, I have configured my YAML file:
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 50s
        failureRateThreshold: 50
        eventConsumerBufferSize: 10
    instances:
      movieCatalog:
        baseConfig: default
and in the movieCatalog instance:
@RequestMapping("/{userId}")
@CircuitBreaker(name = CATALOG_SERVICE, fallbackMethod = "fallBackCatalog")
public List<CatalogItem> getCatalog(@PathVariable("userId") String userId) {
    UserRating ratings = restTemplate.getForObject("http://ratings-data-service/ratingsdata/users/" + userId, UserRating.class);
    return ratings.getUserRatings().stream()
            .map(rating -> {
                Movie movie = restTemplate.getForObject("http://movie-info-sevice/movies/" + rating.getMovieId(), Movie.class);
                return new CatalogItem(movie.getName(), movie.getDescription(), rating.getRating());
            })
            .collect(Collectors.toList());
}

private List<CatalogItem> fallBackCatalog(Exception e) {
    List<CatalogItem> fallBack = new ArrayList<>();
    fallBack.add(new CatalogItem("movie1", "movie desc", 3));
    return fallBack;
}
I see that when I get an exception in the getCatalog method above, I get the fallback result immediately on the first call. My understanding is that for the first 5 calls I should see the exception, and from the 6th call, since more than 50% of the calls are failures (100% errors here), I should see the fallback result. I had several errors before configuring the fallback method; is there a cache which records the previous calls? And if there is such a cache, it should be cleared when the Spring Boot app is restarted, right? Please explain if I'm missing something; any pointers are greatly appreciated.

The fallback mechanism is like a try/catch. It's independent of your CircuitBreaker configuration.
If you only want to execute a fallback method when the CircuitBreaker is open, then narrow down the scope from Exception to CallNotPermittedException.

While the circuit is open, Resilience4j will fail fast by throwing a CallNotPermittedException until the state changes back to half-open or closed, according to your configuration.
So with a fallback method in place, when the circuit breaker trips to the open state, the caller no longer sees a CallNotPermittedException (or an INTERNAL_SERVER_ERROR response); the fallback result is returned instead.
I also agree with @Robert Winkler
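To illustrate why the fallback fires on the very first failure while the circuit state is still unaffected, here is a minimal plain-Java sketch (not resilience4j's actual implementation; class and constant names are my own) of a count-based sliding window that only evaluates the failure rate once minimumNumberOfCalls outcomes have been recorded:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a count-based sliding window: the breaker cannot open before
// MINIMUM_NUMBER_OF_CALLS outcomes exist, but a fallback that catches plain
// Exception still runs on every single failing call.
class SlidingWindowSketch {
    static final int WINDOW_SIZE = 10;
    static final int MINIMUM_NUMBER_OF_CALLS = 5;
    static final float FAILURE_RATE_THRESHOLD = 50f;

    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure

    /** Records one call outcome and returns true if the breaker should open. */
    boolean recordAndCheck(boolean failed) {
        if (outcomes.size() == WINDOW_SIZE) {
            outcomes.removeFirst(); // evict the oldest outcome
        }
        outcomes.addLast(failed);
        if (outcomes.size() < MINIMUM_NUMBER_OF_CALLS) {
            return false; // not enough data yet: breaker stays closed
        }
        long failures = outcomes.stream().filter(f -> f).count();
        float failureRate = 100f * failures / outcomes.size();
        return failureRate >= FAILURE_RATE_THRESHOLD;
    }
}
```

With the configuration from the question, the first four failing calls each hit the fallback (plain try/catch behaviour) while the breaker stays closed; only the fifth recorded failure reaches the minimum and trips it.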

Related

Writing blocking operations in reactor tests with Spring and State Machine

I'm completely new to reactive programming and I'm really struggling to migrate old integration tests since upgrading to the latest Spring Boot / State Machine.
Most integration tests have the same basic steps:
Call a method that returns a Mono, starts a state machine, and returns an object containing a generated unique id as well as some other info related to the initial request.
With the returned object, call a method that verifies whether a value has been updated in the database (using the information from the object retrieved in step 1).
Poll that method at a fixed interval until either the value has changed or a predefined timeout occurs.
Check another table in the database to see if another object has been updated.
Below is an example:
@Test
void testEndToEnd() {
    var instance = ServiceInstance.buildDefault();
    var updateRequest = UpdateRequest.build(instance);
    // retrieve an update response related to the request,
    // since a unique id is generated when triggering the update request
    // before starting a state machine that goes through different steps
    var updateResponse = service.updateInstance(updateRequest).block();
    await().alias("Check if operation was successful")
            .atMost(Duration.ofSeconds(120))
            .pollInterval(Duration.ofSeconds(2))
            .until(() -> expectOperationState(updateResponse, OperationState.SUCCESS));
    // check if values are updated in secondary table
    assertValuesInTransaction(updateResponse);
}
This was working fine before, but since the latest update it fails with the exception:
java.lang.IllegalStateException: block()/blockFirst()/blockLast() are blocking, which is not supported in thread parallel-6
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:83)
at reactor.core.publisher.Mono.block(Mono.java:1710)
I saw that good practice for testing reactor methods is to use StepVerifier, but I do not see how I can reproduce the part done with Awaitility, polling to see if the value has changed in the DB, since the method that checks the DB returns a Mono and not a Flux that keeps emitting values.
Any idea on how to accomplish this, or how to make the Spring stack accept blocking operations?
Thanks
My current stack :
Spring Boot 3.0.1
Spring State Machine 3.0.1
Spring 6
Junit 5.9.2
So, as discussed in the comments, here is an example with comments. I used flatMap to subscribe to what expectOperationState returns. There is also a Mono.fromCallable which checks the value from some method; if it fails to emit anything within 3 seconds, a timeout exception is thrown. We could also try to get rid of the boolean value from expectOperationState and refactor the code to just return Mono<Void> with a completed signal, but this basically shows how you can achieve what you want.
class TestStateMachine {
    @Test
    void testUntilSomeOperationCompletes() {
        final Service service = new Service();
        final UpdateRequest updateRequest = new UpdateRequest();
        StepVerifier.create(service.updateInstance(updateRequest)
                        .flatMap(updateResponse -> expectOperationState(updateResponse, OperationState.SUCCESS))
                )
                .consumeNextWith(Assertions::assertTrue)
                .verifyComplete();
    }

    private Mono<Boolean> expectOperationState(final UpdateResponse updateResponse, final OperationState success) {
        return Mono.fromCallable(() -> {
                    while (true) {
                        boolean isInDb = checkValueFromDb(updateResponse);
                        if (isInDb) {
                            return true;
                        }
                    }
                })
                .publishOn(Schedulers.single())
                // time out if we do not receive any value from the callable within 3 seconds, so that we do not check forever
                .timeout(Duration.ofSeconds(3));
    }

    private boolean checkValueFromDb(final UpdateResponse updateResponse) {
        return true;
    }
}

class Service {
    Mono<UpdateResponse> updateInstance(final UpdateRequest updateRequest) {
        return Mono.just(new UpdateResponse());
    }
}
Here is an example without using Mono<Boolean>:
class TestStateMachine {
    @Test
    void test() {
        final Service service = new Service();
        final UpdateRequest updateRequest = new UpdateRequest();
        StepVerifier.create(service.updateInstance(updateRequest)
                        .flatMap(updateResponse -> expectOperationState(updateResponse, OperationState.SUCCESS).timeout(Duration.ofSeconds(3)))
                )
                .verifyComplete();
    }

    private Mono<Void> expectOperationState(final UpdateResponse updateResponse, final OperationState success) {
        return Mono.fromCallable(() -> {
                    while (true) {
                        boolean isInDb = checkValueFromDb(updateResponse);
                        if (isInDb) {
                            //return completed Mono
                            return Mono.<Void>empty();
                        }
                    }
                })
                .publishOn(Schedulers.single())
                // time out if we do not receive any value from the callable within 3 seconds, so that we do not check forever
                .timeout(Duration.ofSeconds(3))
                .flatMap(objectMono -> objectMono);
    }

    private boolean checkValueFromDb(final UpdateResponse updateResponse) {
        return true;
    }
}
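Stripped of the reactive wrapper, the loop that Mono.fromCallable runs above is just a poll-until-condition-or-deadline. A minimal plain-Java sketch of that logic (names are my own, not from Awaitility or Reactor):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Repeatedly evaluate a condition until it holds or a deadline passes.
class PollUntil {
    static boolean pollUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (condition.getAsBoolean()) {
                return true; // condition met before the deadline
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore interrupt status
                return false;
            }
        }
        return false; // timed out
    }
}
```

The reactive versions above replace the explicit deadline with `.timeout(Duration.ofSeconds(3))` and move the loop off the test thread via `publishOn(Schedulers.single())`, which is what avoids the `block()` restriction.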

AmqpResourceNotAvailableException: The channelMax limit is reached. Try later

I know that this error was mentioned in other posts (https://github.com/spring-projects/spring-amqp/issues/999 and https://github.com/spring-projects/spring-amqp/issues/853), but I haven't found a solution for my case.
My project defines a set of microservices that publish and consume messages via queues. When I run my stress tests at 200 transactions per second I get this error:
The channelMax limit is reached. Try later
This error only shows up in one microservice; the rest are fine.
I am using in my project:
spring-boot-starter-amqp 2.3.4.RELEASE
spring-rabbit 2.2.11
and my RabbitMQ setup is:
public ConnectionFactory publisherConnectionFactory() {
    final CachingConnectionFactory connectionFactory =
            new CachingConnectionFactory(rabbitMQConfigProperties.getHost(), rabbitMQConfigProperties.getPort());
    connectionFactory.setUsername(rabbitMQConfigProperties.getUser());
    connectionFactory.setPassword(rabbitMQConfigProperties.getPass());
    connectionFactory.setPublisherReturns(true);
    connectionFactory.setPublisherConfirms(true);
    connectionFactory.setConnectionNameStrategy(connecFact -> rabbitMQConfigProperties.getNameStrategy());
    connectionFactory.setRequestedHeartBeat(15);
    return connectionFactory;
}

@Bean(name = "firstRabbitTemplate")
public RabbitTemplate firstRabbitTemplate(MessageDeliveryCallbackService messageDeliveryCallbackService) {
    final RabbitTemplate template = new RabbitTemplate(publisherConnectionFactory());
    template.setMandatory(true);
    template.setMessageConverter(jsonMessageConverter());
    template.setReturnCallback((msg, i, s, s1, s2) -> {
        log.error("Publisher unable to deliver the message {}, queue {}: --------------", s1, s2);
        messageDeliveryCallbackService.returnedMessage(msg, i, s, s1, s2);
    });
    template.setConfirmCallback((correlationData, ack, cause) -> {
        if (!ack) {
            log.error("Message unable to reach exchange, cause {}: ack {} --------------", cause, ack);
        }
    });
    return template;
}
My questions are:
Should I set channelCacheSize and channelCheckoutTimeout? I did a test increasing channelCacheSize to 50, but the issue still happens. What would be the best values for these parameters? I read that channelCheckoutTimeout should be higher than 0, but I don't know what value I should set.
Right now I am processing around 200 transactions per second, but this number will increase progressively.
Thank you in advance.
channel_max is negotiated between the client and server and applies to connections. The default is 2047, so it looks like your broker has imposed a lower limit.
https://www.rabbitmq.com/channels.html#channel-max
When using publisher confirms, returning channels to the cache is delayed until the confirm is received; hence more channels are generally needed when the volume is high.
You can either reconfigure the broker to allow more channels, or change the CacheMode to CONNECTION instead of the default (CHANNEL).
https://docs.spring.io/spring-amqp/docs/current/reference/html/#cachingconnectionfactory
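For intuition on what channelCheckoutTimeout buys you: when it is greater than 0, Spring AMQP treats the channel cache size as a hard limit and makes callers wait for a free channel instead of opening new ones past the broker's channel_max. Here is a plain-Java sketch of that bounded-checkout idea (names are my own, not Spring AMQP's internals):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Bounded channel checkout: at most channelCacheSize channels in flight,
// and a caller waits up to checkoutTimeoutMillis for one to be released.
class ChannelLimiter {
    private final Semaphore permits;
    private final long checkoutTimeoutMillis;

    ChannelLimiter(int channelCacheSize, long checkoutTimeoutMillis) {
        this.permits = new Semaphore(channelCacheSize);
        this.checkoutTimeoutMillis = checkoutTimeoutMillis;
    }

    /** Tries to check a channel out; false means the limit held and we timed out. */
    boolean checkout() throws InterruptedException {
        return permits.tryAcquire(checkoutTimeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Returns a channel to the pool (after the publisher confirm arrives). */
    void release() {
        permits.release();
    }
}
```

In the real factory this corresponds to calling setChannelCacheSize and setChannelCheckoutTimeout on the CachingConnectionFactory; remember that with publisher confirms the release only happens once the confirm is received, which is why high volume needs more permits.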

Spring Sleuth - Tracing Failures

In a microservice environment I see two main benefits from tracing requests through all microservice instances over an entire business process.
Finding latency gaps between or in service instances
Finding roots of failures, whether technical or regarding the business case
With Zipkin there is a tool which addresses the first issue. But how can tracing be used to unveil failures in your microservice landscape? I definitely want to trace all error-afflicted spans, but not every request where nothing went wrong.
As mentioned here a custom Sampler could be used.
Alternatively, you may register your own Sampler bean definition and programmatically make the decision which requests should be sampled. You can make more intelligent choices about which things to trace, for example, by ignoring successful requests, perhaps checking whether some component is in an error state, or really anything else.
So I tried to implement that, but it doesn't work, or I used it wrong.
As the blog post suggested, I registered my own Sampler:
@Bean
Sampler customSampler() {
    return new Sampler() {
        @Override
        public boolean isSampled(Span span) {
            boolean isErrorSpan = false;
            for (String tagKey : span.tags().keySet()) {
                if (tagKey.startsWith("error_")) {
                    isErrorSpan = true;
                }
            }
            return isErrorSpan;
        }
    };
}
And in my controller I create a new Span, which is tagged as an error if an exception is raised:
private final Tracer tracer;

@Autowired
public DemoController(Tracer tracer) {
    this.tracer = tracer;
}

@RequestMapping(value = "/calc/{i}")
public String calc(@PathVariable String i) {
    Span span = null;
    try {
        span = this.tracer.createSpan("my_business_logic");
        return "1 / " + i + " = " + new Float(1.0 / Integer.parseInt(i)).toString();
    } catch (Exception ex) {
        log.error(ex.getMessage(), ex);
        span.logEvent("ERROR: " + ex.getMessage());
        this.tracer.addTag("error_" + ex.hashCode(), ex.getMessage());
        throw ex;
    } finally {
        this.tracer.close(span);
    }
}
Now, this doesn't work. If I request /calc/a, the method Sampler.isSampled(Span) is called before the controller method throws a NumberFormatException. This means that when isSampled() checks the Span, it has no tags yet, and the Sampler method is not called again later in the process. Only if I open up the Sampler and allow every span to be sampled do I see my tagged error span later in Zipkin. In that case Sampler.isSampled(Span) was called only once, but HttpZipkinSpanReporter.report(Span) was executed 3 times.
So what would the use case look like to transmit only traces which have error spans? Is this even a correct way to tag a span with an arbitrary "error_" tag?
The sampling decision is made per trace. That means that when the first request comes in and the span is created, you have to make the decision. You don't have any tags / baggage at that point, so you must not depend on the contents of tags to make this decision. That's a wrong approach.
You are taking a very custom approach. If you want to go that way (which is not recommended), you can create a custom implementation of a SpanReporter - https://github.com/spring-cloud/spring-cloud-sleuth/blob/master/spring-cloud-sleuth-core/src/main/java/org/springframework/cloud/sleuth/SpanReporter.java#L30 . SpanReporter is the one that sends spans to Zipkin. You can create an implementation that wraps an existing SpanReporter implementation and delegates execution to it only when some tag values match. But from my perspective it doesn't sound right.
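The delegating-reporter idea can be sketched in plain Java. Reporter here is a simplified stand-in for Sleuth's SpanReporter, and a span is reduced to its tag map; only the wrap-and-filter structure is the point:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a span reporter: receives a span's tags.
interface Reporter {
    void report(Map<String, String> tags);
}

// Wraps an existing reporter and forwards a span only when one of its
// tags marks an error, dropping everything else.
class ErrorOnlyReporter implements Reporter {
    private final Reporter delegate;

    ErrorOnlyReporter(Reporter delegate) {
        this.delegate = delegate;
    }

    @Override
    public void report(Map<String, String> tags) {
        boolean hasError = tags.keySet().stream().anyMatch(k -> k.startsWith("error_"));
        if (hasError) {
            delegate.report(tags); // forward only error-afflicted spans
        }
    }
}
```

Note this filters at reporting time, after the span is finished and tagged, which is exactly what the Sampler cannot do: the sampler runs before any tags exist.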

Hystrix circuit not closing after downstream service recovers

I'm playing with the bootiful-microservice project by Josh Long. (Brixton subproject)
On the reservation-service I have added a simple status method that can sleep a configurable amount of time to simulate load:
@RequestMapping(method = RequestMethod.GET, value = "/status")
public String status() {
    System.out.println("Checking status");
    try {
        Thread.sleep((long) (rand.nextDouble() * sleepTime));
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    return "All is good";
}
The sleepTime variable is pulled from the Spring Config Server
On the reservation-client I have added an entry point in the gateway:
@FeignClient("reservation-service")
interface ReservationReader {
    @RequestMapping(method = RequestMethod.GET, value = "/reservations")
    Resources<Reservation> readReservations();

    @RequestMapping(method = RequestMethod.GET, value = "/status")
    String status();
}
and I'm using a HystrixCommand:
@HystrixCommand(fallbackMethod = "statusFallback")
@RequestMapping(method = RequestMethod.GET, value = "/status")
public String status() {
    return reader.status();
}

public String statusFallback() {
    return "Bad";
}
This all works well.
I set the sleep time to 1500 ms so that some requests will be above the Hystrix default threshold (1000 ms).
When I start hitting the API I get some failures due to timeouts. If I hit it long enough (50 times seems to work) the circuit breaker triggers and the circuit becomes open.
My understanding is that as the downstream service becomes healthy again Hystrix will try to route 1 call and use it as a health check. If the call is successful circuit should be closed again.
However this is not happening here. The circuit will remain open even after changing the sleeping time to a smaller value (let's say 500ms). None of my calls are routed towards the reservation-services and the fallback is used on every call. The only way I can get the circuit to close again is to restart the reservation-client service.
Did I miss something? Is it an issue with Hystrix, or with the Spring integration?
UPDATE
I did further testing and I can confirm that the circuit will remain open forever, even after the sleep time has been reduced.
However if I use a route in the Zuul configuration I get the expected behaviour. The circuit closes itself if it sees a request that doesn't time out.
I have noticed another difference between forwarding by route compared to doing it manually in Spring. If I create a filter, my /status call on the client does not trigger the filter. When I set up a route (e.g. /foos/status => /status) it triggers the filter and Hystrix behaves properly.
Is that a bug in Spring?
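The expected open-to-half-open behaviour described in the question can be sketched in plain Java (this is not Hystrix's implementation; class, state, and method names are my own): after the sleep window, exactly one trial call is let through, and its outcome decides whether the circuit closes or re-opens.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of circuit-breaker state handling: OPEN rejects calls until the
// sleep window elapses, then a single HALF_OPEN probe is allowed through.
class HalfOpenSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.OPEN;
    private Instant openedAt;
    private final Duration sleepWindow;

    HalfOpenSketch(Duration sleepWindow) {
        this.sleepWindow = sleepWindow;
        this.openedAt = Instant.now();
    }

    /** Returns true if a call may proceed (including the single half-open probe). */
    boolean allowRequest(Instant now) {
        if (state == State.OPEN && now.isAfter(openedAt.plus(sleepWindow))) {
            state = State.HALF_OPEN; // let exactly one trial call through
            return true;
        }
        return state == State.CLOSED;
    }

    void onSuccess() {
        state = State.CLOSED; // trial call succeeded: close the circuit
    }

    void onFailure(Instant now) {
        state = State.OPEN; // trial call failed: back to open
        openedAt = now;
    }

    State state() {
        return state;
    }
}
```

The symptom in the question is that no probe ever seems to reach the downstream service, so the success path that would close the circuit never runs; routing through Zuul apparently restores that path.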

Hibernate sessions fetch stale data when using async controllers and long polling

I have a problem with async controllers in Grails. Consider the following controller:
@Transactional(readOnly=true)
class RentController {
    def myService
    UserProperties props

    def beforeInterceptor = {
        this.props = fetchUserProps()
    }

    //..other actions

    @Transactional
    def rent(Long id) {
        //check some preconditions here, calling various service methods...
        if (!allOk) {
            render status: 403, text: 'appropriate.message.key'
            return
        }
        //now we long poll because most of the time the result will be
        //success within a couple of seconds
        AsyncContext ctx = startAsync()
        ctx.timeout = 5 * 1000 * 60 + 5000
        ctx.start {
            try {
                //wait for external service to confirm - can take a long time or even time out
                //save appropriate domain objects if successful
                //placeRental is also marked with @Transactional (if that makes any difference)
                def result = myService.placeRental()
                if (result.success) {
                    render text: "OK", status: 200
                } else {
                    render status: 400, text: "rejection.reason.${result.rejectionCode}"
                }
            } catch (Throwable t) {
                log.error "Rental process failed", t
                render text: "Rental process failed with exception ${t?.message}", status: 500
            } finally {
                ctx.complete()
            }
        }
    }
}
The controller and service code appear to work fine (though the above code is simplified) but will sometimes cause a database session to get 'stuck in the past'.
Let's say I have a UserProperties instance whose property accountId is updated from 1 to 20 somewhere else in the application while a rent action is waiting in the async block. As the async block eventually terminates one way or another (it may succeed, fail or time out), the app will sometimes get a stale UserProperties instance with accountId: 1. If I refresh the updated user's properties page, I will see accountId: 1 about 1 time in 10 refreshes, while the rest of the time it will be 20 - and this is on my development machine where no one else is accessing the application (though the same behaviour can be observed in production). My connection pool also holds 10 connections, so I suspect there may be a correlation here.
Other strange things happen as well - for example, I will get StaleObjectStateException: Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect) from actions doing something as simple as render (UserProperties.list() as JSON) - after the response had already been rendered (successfully, apart from the noise in the logs) and despite the action being annotated with @Transactional(readOnly=true).
A stale session doesn't seem to appear every time and so far our solution was to restart the server every evening (the app has few users for now), but the error is annoying and the cause was hard to pinpoint. My guess is that a DB transaction doesn't get committed or rolled back because of the async code, but GORM, Spring and Hibernate have many nooks and crannies where things could get stuck.
We're using Postgres 9.4.1 (9.2 on a dev machine, same problem), Grails 2.5.0, Hibernate plugin 4.3.8.1, Tomcat 8, Cache plugin 1.1.8, Hibernate Filter plugin 0.3.2 and the Audit Logging plugin 1.0.1 (other stuff too, obviously, but this feels like it could be relevant). My datasource config contains:
hibernate {
    cache.use_second_level_cache = true
    cache.use_query_cache = false
    cache.region.factory_class = 'org.hibernate.cache.ehcache.SingletonEhCacheRegionFactory'
    singleSession = true
    flush.mode = 'manual'
    format_sql = true
}
Grails bug. And a nasty one: everything seems OK until your app starts acting funny in completely unrelated parts of the app.
