We're using the resilience4j library (Spring Boot 2 starter) and trying to use annotations to apply resilience patterns (as this is the best way that doesn't interfere with the business logic).
Could someone please (maybe even those guys who're developing this lib) clarify the following:
The default aspect order is "Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) )". At the same time, it's not possible to change the bulkhead aspect order (it always goes last), and I'm not sure I understand the motivation for that.
For example, shouldn't an actual IO call be retried (this is how it's typically done) within the same thread (i.e. already within a thread-pool bulkhead) rather than by simply re-submitting the task to the thread pool?
While the previous statement is arguable and probably makes sense in some scenarios, it's absolutely not clear what the sense is of putting the RateLimiter logic before the bulkhead - the rate limiter should limit actual IO calls and not thread-pool submission requests, shouldn't it?
I understand we can do it programmatically, or split the logic into multiple components and put @Bulkhead on the first and the rest of the annotations on the second - but I'm still curious why it was decided to implement such a default aspect order (where bulkhead always goes last), as it seems impractical (unless I'm missing something).
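To make the ordering concern concrete, here is a minimal plain-Java sketch. It does not use resilience4j at all: `withRetry`, `withBulkhead` and `flakyCall` are hypothetical stand-ins I made up, and the "bulkhead" only counts how often it is entered. It shows that with Retry outside the bulkhead every attempt goes through the bulkhead again, while with the bulkhead outside the retries happen inside a single entry.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class DecoratorOrderSketch {

    // Hypothetical stand-in for a retry aspect: calls the wrapped supplier up to maxAttempts times.
    static <T> Supplier<T> withRetry(Supplier<T> inner, int maxAttempts) {
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e; // remember the failure and try again
                }
            }
            throw last;
        };
    }

    // Hypothetical stand-in for a bulkhead aspect: it only counts how often it is entered.
    static <T> Supplier<T> withBulkhead(Supplier<T> inner, AtomicInteger entries) {
        return () -> {
            entries.incrementAndGet();
            return inner.get();
        };
    }

    // Simulated IO call that fails twice and succeeds on the third attempt.
    static Supplier<String> flakyCall() {
        AtomicInteger calls = new AtomicInteger();
        return () -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("simulated IO failure");
            }
            return "ok";
        };
    }

    // Retry ( Bulkhead ( Function ) ): every attempt passes through the bulkhead again.
    static int bulkheadEntriesRetryOutside() {
        AtomicInteger entries = new AtomicInteger();
        withRetry(withBulkhead(flakyCall(), entries), 3).get();
        return entries.get();
    }

    // Bulkhead ( Retry ( Function ) ): the bulkhead is entered once, retries happen inside it.
    static int bulkheadEntriesBulkheadOutside() {
        AtomicInteger entries = new AtomicInteger();
        withBulkhead(withRetry(flakyCall(), 3), entries).get();
        return entries.get();
    }

    public static void main(String[] args) {
        System.out.println("retry outside:    " + bulkheadEntriesRetryOutside());
        System.out.println("bulkhead outside: " + bulkheadEntriesBulkheadOutside());
    }
}
```

With a thread-pool bulkhead, each "entry" in the first variant would correspond to a fresh task submission, which is the behaviour I'm asking about.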
Thanks in advance
I am attempting to accomplish something along these lines with Quarkus and Narayana:
client calls service to start a process that takes a while: /lra/start
This call sets off an LRA, and returns an LRA id used to track the status of the action
client can keep polling some endpoint to determine status
service eventually finishes and marks the action done through the coordinator
client sees that the action has completed, is given the result or makes another request to get that result
Is this a valid use case? Am I visualizing the correct way this tool can work? Based on how the linked guide reads, it seems that the endpoints are more of a passthrough to the coordinator, notifying it that we start and end an LRA. Is there a more programmatic way to interact with the coordinator?
Yes, it might be a valid use case, but in any case please read the MicroProfile LRA specification - https://github.com/eclipse/microprofile-lra.
The idea you describe is more or less one LRA participant executing in a new LRA and polling the status of this execution. This is not exactly what LRA is intended for, but it can certainly be used this way.
The main idea of LRA is the composition of distributed transactions based on the saga pattern. Basically, the point is to coordinate multiple services to achieve consistent results with an eventual consistency guarantee. So you see that the main benefit arises when you can propagate LRA through different services that either all complete their actions or all of their compensation callbacks will be called in case of failures (and, of course, only for the services that executed their actions in the first place). Here is also an example with the LRA propagation https://github.com/xstefank/quarkus-lra-trip-example.
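To illustrate the "all complete or all compensate" idea, here is a minimal in-process sketch in plain Java. It is not the LRA API: the `Step` type and the `run` method are hypothetical, and in a real LRA the coordinator is a separate service that calls the participants' complete/compensate endpoints over HTTP. The sketch only shows the core rule that compensations run, in reverse order, for exactly those steps whose actions were executed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class SagaSketch {

    // One participant: an action plus the compensation that undoes it (hypothetical type).
    static final class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    // Runs the steps in order. If one fails, compensates the already completed
    // steps in reverse order and returns false; returns true if all completed.
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }

    // Demo: flight and hotel are booked, the car booking fails,
    // so the hotel and the flight are cancelled again.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        List<Step> steps = Arrays.asList(
            new Step("flight", () -> log.add("book flight"), () -> log.add("cancel flight")),
            new Step("hotel", () -> log.add("book hotel"), () -> log.add("cancel hotel")),
            new Step("car", () -> { throw new IllegalStateException("no cars left"); },
                            () -> log.add("cancel car")));
        run(steps);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```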
EDIT: Sorry, I forgot to add the programmatic API that allows the same interactions as the annotations - https://github.com/jbosstm/narayana/blob/master/rts/lra/client/src/main/java/io/narayana/lra/client/NarayanaLRAClient.java. However, note that it is not part of the specification and is specific to Narayana.
Can we use both patterns (circuit breaker and bulkhead) together in Spring Boot when developing a microservice?
These are fundamentally different patterns.
A circuit breaker pattern is implemented on the caller, to avoid overwhelming a service which may be struggling to handle calls. A sample implementation in Spring can be found here.
A bulkhead pattern is implemented on the service, to prevent a failure during the handling of a single incoming call impacting the handling of other incoming calls. A sample implementation in Spring can be found here.
The only thing these patterns have in common is that they are both designed to increase the resilience of a distributed system.
While you can certainly use them together in the same service, you must understand that they are not related to each other, as one is concerned with making calls and the other is concerned with handling calls.
Yes, they can be used together, but it's not always necessary.
As @tom redfern said, a circuit breaker is implemented on the caller side. So, if you are sending requests to another service, you should wrap those requests in a circuit breaker specific to that service. Keep in mind that every third-party system or service should have its own circuit breaker. Otherwise, the unavailability of one system will open the shared circuit breaker and impact the requests that you are sending to the others.
More information about the circuit breaker can be found here: https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
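As a rough illustration of "one circuit breaker per downstream service", here is a plain-Java sketch. `SimpleBreaker` and `breakerFor` are hypothetical names, and the breaker is deliberately simplified (it opens after three consecutive failures and never half-opens); a real library such as resilience4j does much more. The point is only that failures against "payments" do not affect calls to "inventory".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class PerServiceBreakerSketch {

    // Deliberately simplified breaker: opens after N consecutive failures and stays open.
    static final class SimpleBreaker {
        private final int failureThreshold;
        private int consecutiveFailures;

        SimpleBreaker(int failureThreshold) {
            this.failureThreshold = failureThreshold;
        }

        synchronized boolean isOpen() {
            return consecutiveFailures >= failureThreshold;
        }

        <T> T call(Supplier<T> remoteCall) {
            if (isOpen()) {
                throw new IllegalStateException("circuit open"); // fail fast, no remote call
            }
            try {
                T result = remoteCall.get();
                synchronized (this) { consecutiveFailures = 0; }
                return result;
            } catch (RuntimeException e) {
                synchronized (this) { consecutiveFailures++; }
                throw e;
            }
        }
    }

    // One breaker per downstream service name, created on first use.
    private final Map<String, SimpleBreaker> breakers = new ConcurrentHashMap<>();

    SimpleBreaker breakerFor(String serviceName) {
        return breakers.computeIfAbsent(serviceName, name -> new SimpleBreaker(3));
    }

    public static void main(String[] args) {
        PerServiceBreakerSketch registry = new PerServiceBreakerSketch();
        for (int i = 0; i < 3; i++) {
            try {
                registry.breakerFor("payments").call(() -> {
                    throw new IllegalStateException("payments down");
                });
            } catch (IllegalStateException expected) {
                // the failure is counted by the payments breaker only
            }
        }
        System.out.println("payments open:  " + registry.breakerFor("payments").isOpen());
        System.out.println("inventory call: " + registry.breakerFor("inventory").call(() -> "in stock"));
    }
}
```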
Also, @tom redfern is right again in the case of bulkheading: this is a pattern which is implemented in the service that is called. So, if you are reacting to external requests by spawning multiple other requests or workloads, you should avoid running all those workloads in a single unit (thread). Instead, separate the workloads into pieces (thread pools), one for each kind of request that you have spawned.
More information about bulkheading can be found here: https://learn.microsoft.com/en-us/azure/architecture/patterns/bulkhead
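A minimal plain-Java sketch of the thread-pool flavour of this idea follows. The pool names and the `demo` method are hypothetical, and the pool sizes are set to 1 only to keep the example small. A stuck "reports" workload occupies its own pool, while the "orders" workload still completes because it runs on a separate pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ThreadPoolBulkheadSketch {

    // Returns {order result, whether the report was still stuck at that time, report result}.
    static String[] demo() {
        // One pool per workload, so a stuck workload cannot starve the other one.
        ExecutorService reportsPool = Executors.newFixedThreadPool(1);
        ExecutorService ordersPool = Executors.newFixedThreadPool(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Simulates a hanging downstream call: blocks until we release it.
            Future<String> slowReport = reportsPool.submit(() -> {
                release.await();
                return "report done";
            });
            Future<String> order = ordersPool.submit(() -> "order done");

            // The order completes although the reports pool is fully occupied.
            String orderResult = order.get(5, TimeUnit.SECONDS);
            boolean reportStillStuck = !slowReport.isDone();

            release.countDown();
            String reportResult = slowReport.get(5, TimeUnit.SECONDS);
            return new String[] {orderResult, String.valueOf(reportStillStuck), reportResult};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            reportsPool.shutdown();
            ordersPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", demo()));
    }
}
```

Had both workloads shared one single-threaded pool, the order would have waited behind the stuck report.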
Your question was if it's possible to use both these patterns in the same microservice. The answer is: yes, you can and very often the situation implies this.
In the traditional way of writing an application, I have divided the application into a set of tasks and execute them sequentially.
Get a list of rules for a given rule group from Redis
Construct facts input and fire the rules.
Compute a response for the request by hitting multiple rules (Rule group A might depend on the Rule group B result).
Send the response back to the caller.
If I were to implement the above steps in a reactive manner using Spring WebFlux, how would I achieve it?
I have used ReactiveRedis to get the data from redis.
ReactiveRedisOperations.opsForValue().get(ruleGroupName) does not return anything until we subscribe() to it. But ReactiveRedisOperations.opsForValue().get(ruleGroupName).subscribe() does not block the calling thread, so execution goes on to the next line in the application without waiting for the subscriber to receive the value.
As my next steps depend on the data returned by Redis, I have used the block() option to make it wait.
In the real world, how does one tackle a situation like this? Thanks in advance.
PS: New to Spring WebFlux and reactive programming.
Instead of separating logical steps by putting them on a new line, like in imperative programming, reactive programming uses method composition and chaining (the operators).
So once you get a Flux<T> or a Mono<T> (here your rules from Redis), you need to chain operators to build up your processing steps in a declarative manner.
A step that transforms each input element <T> into a single corresponding element <R>, in memory and without latency, is typically expressed as a map(Function<T, R>) and produces a Flux<R>. You can in turn chain further operators on that.
A step that either transforms 1 element <T> to N elements <R> and/or does so asynchronously (i.e. the transformation returns a Flux<R> for each T) is typically expressed as a flatMap(Function<T, Publisher<R>>).
Beyond that, there is a rich vocabulary of specialized operators in Reactor that you can explore.
In the end, your goal is to chain all these operators to describe your processing pipeline, which is going to be a Mono<RETURN_TYPE> or Flux<RETURN_TYPE>. RETURN_TYPE in WebFlux can either be a business type that Spring can marshal or one of the Spring response-oriented classes.
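Here is a sketch of the chaining style applied to the steps from the question. So that it runs with the JDK alone, it uses CompletableFuture instead of Reactor: thenApply plays roughly the role of map and thenCompose roughly the role of flatMap on a Mono. `fetchRules`, `fireRules` and `handleRequest` are hypothetical names standing in for the Redis lookup and the rule engine, and unlike a Mono a CompletableFuture is eager, so treat this as an analogy rather than as Reactor code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class PipelineSketch {

    // Hypothetical asynchronous lookup standing in for the reactive Redis call.
    static CompletableFuture<List<String>> fetchRules(String ruleGroup) {
        return CompletableFuture.supplyAsync(
            () -> Arrays.asList(ruleGroup + "-rule1", ruleGroup + "-rule2"));
    }

    // Hypothetical synchronous, in-memory step: "fires" the rules against an input fact.
    static String fireRules(List<String> rules, String fact) {
        return fact + " checked by " + rules;
    }

    // The whole request is described as one chain; nothing here blocks.
    static CompletableFuture<String> handleRequest(String fact) {
        return fetchRules("groupB")
            // synchronous transformation of the value (comparable to map)
            .thenApply(rulesB -> fireRules(rulesB, fact))
            // dependent asynchronous step: group A needs the group B result (comparable to flatMap)
            .thenCompose(resultB -> fetchRules("groupA")
                .thenApply(rulesA -> fireRules(rulesA, resultB)))
            // final shaping of the response
            .thenApply(result -> "response: " + result);
    }

    public static void main(String[] args) {
        // join() is used only at the very edge of this demo to print the result.
        System.out.println(handleRequest("order-42").join());
    }
}
```

In a WebFlux controller you would return the resulting Mono itself and let the framework subscribe to it, instead of calling block() or subscribe() yourself.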
I would like to understand how to detect the failed service (in a fast and reliable manner), i.e. the service that is the root of all the 5xx responses.
Let me try to elaborate. Let's assume we have 300+ microservices and they have only synchronous HTTP interaction via GET requests without any data modifications (we assume this for simplicity). Each customer request may result in calls to ~10 different microservices; moreover, it could be a 'calling chain' of requests, i.e. the API Gateway calls 3 different microservices, each of them calls 1-5 more, each of these 1-5 calls 1-5 more, etc.
We closely monitor 5xx errors on each microservice and react to these errors.
Now one of the microservices fails. It appears to be somewhere at the end of a 'calling chain', which means that other microservices which depend on it will start to return 5xx as well.
Yes, there are circuit breakers, and yes, they become 'triggered / opened' and, instead of calling the downstream service, they return an error right away as well (in most cases we cannot return a good fallback like an empty response).
So we see that a relatively large number of microservices return 5xx: 30-40 microservices return 5xx, and we see 30-40 triggered / opened circuit breakers.
How can we detect the failed microservice, the root of all evil, in a fast manner?
Did anybody encounter this issue?
Regards
You will need to implement a distributed tracing solution that tracks the originating transaction with a global ID. This global identifier is typically called a Correlation ID; it is generated by the very first service that creates the request and is propagated to all the other microservices that work together to fulfill the request.
Take a look at OpenTracing for your implementation needs. It provides libraries for you to add the instrumentation required for identifying faulty microservices in a distributed environment.
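Once every call carries the same correlation ID and records its parent call, finding the root of a failing request becomes a small tree problem. The sketch below is plain Java, not the OpenTracing API: the `Span` type, its fields and `rootCauseServices` are hypothetical. It assumes each service reports one span per call with a parent span ID and an error flag, and it picks the failed spans that have no failed child.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TraceRootCauseSketch {

    // One recorded call within a trace (hypothetical shape, all spans share one correlation ID).
    static final class Span {
        final String spanId;
        final String parentSpanId; // null for the entry point
        final String service;
        final boolean error;

        Span(String spanId, String parentSpanId, String service, boolean error) {
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.service = service;
            this.error = error;
        }
    }

    // A failed span whose children all succeeded (or that has no children)
    // did not merely pass on a downstream error, so it is a root-cause candidate.
    static List<String> rootCauseServices(List<Span> trace) {
        Set<String> parentsOfFailedSpans = new HashSet<>();
        for (Span span : trace) {
            if (span.error && span.parentSpanId != null) {
                parentsOfFailedSpans.add(span.parentSpanId);
            }
        }
        List<String> candidates = new ArrayList<>();
        for (Span span : trace) {
            if (span.error && !parentsOfFailedSpans.contains(span.spanId)) {
                candidates.add(span.service);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<Span> trace = Arrays.asList(
            new Span("1", null, "gateway", true),
            new Span("2", "1", "orders", true),
            new Span("3", "2", "pricing", true),   // deepest failure
            new Span("4", "1", "catalog", false));
        System.out.println(rootCauseServices(trace));
    }
}
```

One caveat: when a circuit breaker is already open, the downstream call is never made, so no child span exists for it. In that case the caller's span would need to record which downstream service the open breaker belongs to.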
However, if you really do have 300 microservices all using synchronous calls... maybe it is time to consider using asynchronous communication to eliminate the temporal coupling inherent in synchronous communication.
I am starting a project using the Spring WebFlux reactive stack, which by default uses Reactor Netty as the server. Please correct me if I'm wrong, but I read that Netty can only have as many event loops as there are processors on the instance.
This means that if a request gets blocked for a second (which should not be the use case, I know, just for example), we would only be able to get a maximum of 1 transaction per second if there is only 1 processor on the instance.
I am wondering how scalable Netty is compared to a servlet container like Tomcat. What are the pros and cons of using Netty vs Tomcat?
I also want to know the ways to optimize Netty configurations to make sure it is production ready.
"This means that if a request gets blocked for a second (which should not be the use case i know, just for example)"
The whole purpose of this stack is to scale hugely on a limited amount of resources (here, threads). This is all built on the critical requirement that every step is asynchronous and non-blocking.
So your "just for example" doesn't make any sense. Yes, if you block for one second, that CPU will only process that single request during that second. But blocking is simply the wrong thing to do here, and everything in the stack is made to help you avoid blocking.
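To see the effect of blocking in isolation, here is a small plain-Java sketch. It does not use Netty: a fixed thread pool stands in for the event loop threads, and `elapsedMillis` is a hypothetical helper. With a single "event loop" thread, three handlers that each block for 100 ms are processed one after another, so the batch takes roughly the sum of the blocking times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class BlockedEventLoopSketch {

    // Runs `requests` handlers that each block for blockMillis on a pool of
    // `eventLoopThreads` threads and returns the total elapsed time in milliseconds.
    static long elapsedMillis(int eventLoopThreads, int requests, long blockMillis) {
        ExecutorService eventLoop = Executors.newFixedThreadPool(eventLoopThreads);
        try {
            long start = System.nanoTime();
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                pending.add(eventLoop.submit(() -> {
                    try {
                        Thread.sleep(blockMillis); // the blocking call you should avoid
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }));
            }
            for (Future<?> handler : pending) {
                handler.get(); // wait until every handler has finished
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            eventLoop.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("1 thread, 3 blocking requests: " + elapsedMillis(1, 3, 100) + " ms");
    }
}
```

A non-blocking handler would instead hand the thread back while waiting, which is what lets a few threads serve many concurrent requests.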