What is the difference between a circuit breaker and a bulkhead pattern? - spring-boot

Can we use both together in Spring Boot during the development of microservice?

These are fundamentally different patterns.
A circuit breaker pattern is implemented on the caller, to avoid overwhelming a service which may be struggling to handle calls. A sample implementation in Spring can be found here.
A bulkhead pattern is implemented on the service, to prevent a failure during the handling of a single incoming call impacting the handling of other incoming calls. A sample implementation in Spring can be found here.
The only thing these patterns have in common is that they are both designed to increase the resilience of a distributed system.
While you can certainly use them together in the same service, you must understand that they are not related to each other, as one is concerned with making calls and the other is concerned with handling calls.

Yes, they can be used together, but it's not always necessary.
As @tom redfern said, a circuit breaker is implemented on the caller side. So, if you are sending requests to another service, you should wrap those requests in a circuit breaker specific to that service. Keep in mind that every third-party system or service should have its own circuit breaker. Otherwise, the unavailability of one system will open the shared circuit breaker and impact the requests that you are sending to the others.
More information about circuit breakers can be found here: https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
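To make the "one circuit breaker per downstream service" advice concrete, here is a minimal, library-free sketch in plain Java. The class names (SimpleCircuitBreaker, BreakerRegistry), the threshold of 3 consecutive failures, and the 30 second open period are all assumptions chosen for illustration; in a Spring Boot service you would normally rely on an existing resilience library rather than hand-rolling this.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Illustrative consecutive-failure circuit breaker; a sketch, not production code. */
class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long the circuit stays open
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis;
    }

    /** Runs the remote call unless the circuit is open; falls back on failure or open circuit. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // fail fast, do not touch the struggling service
        }
        try {
            T result = remoteCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis(); // open (or re-open) the circuit
        }
    }
}

/** Keeps a separate breaker per downstream service name. */
class BreakerRegistry {
    private final Map<String, SimpleCircuitBreaker> breakers = new ConcurrentHashMap<>();

    SimpleCircuitBreaker forService(String serviceName) {
        // Assumed defaults: open after 3 consecutive failures, stay open for 30 seconds.
        return breakers.computeIfAbsent(serviceName, n -> new SimpleCircuitBreaker(3, 30_000));
    }
}
```

Because each service name maps to its own breaker, repeated failures against one dependency leave the breakers of the other dependencies closed.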
Also, @tom redfern is right again in the case of bulkheading: this is a pattern which is implemented in the service that is called. So, if you are reacting to external requests by spawning multiple other requests or workloads, you should avoid running all those workloads in a single unit (thread). Instead, separate the workloads into pieces (thread pools) for each request that you have spawned.
More information about bulkheading can be found here: https://learn.microsoft.com/en-us/azure/architecture/patterns/bulkhead
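A rough sketch of the thread-pool flavour of bulkheading in plain Java follows. The workload names (reports, orders) and the pool size of 2 are hypothetical; the point is only that each kind of work gets its own bounded pool, so a stuck workload cannot consume the threads that another workload needs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Illustrative bulkhead: one small thread pool per kind of workload. */
class ThreadPoolBulkheads {
    // Hypothetical workloads; sizes are assumptions for the example.
    private final ExecutorService reportsPool = Executors.newFixedThreadPool(2);
    private final ExecutorService ordersPool = Executors.newFixedThreadPool(2);

    /** Report work only ever runs on the reports pool. */
    <T> CompletableFuture<T> runReport(Supplier<T> work) {
        return CompletableFuture.supplyAsync(work, reportsPool);
    }

    /** Order work only ever runs on the orders pool. */
    <T> CompletableFuture<T> runOrder(Supplier<T> work) {
        return CompletableFuture.supplyAsync(work, ordersPool);
    }

    void shutdown() {
        reportsPool.shutdownNow();
        ordersPool.shutdownNow();
    }
}
```

If both report threads hang, order requests still complete because they never wait on the reports pool. Note that a fixed thread pool queues excess tasks without limit, so a real implementation would probably also bound the queue.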
Your question was whether it is possible to use both of these patterns in the same microservice. The answer is yes, you can, and very often the situation calls for it.

Related

Implement Spring batch circuit breaker

I am building a Spring Batch job, and in the ItemProcessor step I am consuming an external endpoint and saving the values to the DB. The external endpoint is at times very slow and takes more than 60 seconds to respond. As a workaround I implemented a RestTemplate timeout (15s), but my transaction is still timing out even with the timeout in place. How do I implement circuit breaker techniques here? Are there any out-of-the-box solutions for this in Spring Batch?
how to implement circuit breaker techniques here
You can annotate ItemProcessor#process with @CircuitBreaker (see attributes like maxAttempts, resetTimeout, etc.) from the spring-retry library and add a recovery method that you annotate with @Recover.
Michael Minella gives a complete sample of this very scenario in his talk: Cloud Native Batch Processing. And you can find the code example here.

Use Cases for LRA

I am attempting to accomplish something along these lines with Quarkus and Narayana:
client calls service to start a process that takes a while: /lra/start
This call sets off an LRA, and returns an LRA id used to track the status of the action
client can keep polling some endpoint to determine status
service eventually finishes and marks the action done through the coordinator
client sees that the action has completed, is given the result or makes another request to get that result
Is this a valid use case? Am I visualizing the correct way this tool can work? Based on how the linked guide reads, it seems that the endpoints are more of a passthrough to the coordinator, notifying it that we start and end an LRA. Is there a more programmatic way to interact with the coordinator?
Yes, it might be a valid use case, but in every case please read the MicroProfile LRA specification - https://github.com/eclipse/microprofile-lra.
The idea you describe is more or less one LRA participant executing in a new LRA and polling the status of this execution. This is not totally what the LRA is intended for, but surely can be used this way.
The main idea of LRA is the composition of distributed transactions based on the saga pattern. Basically, the point is to coordinate multiple services to achieve consistent results with an eventual consistency guarantee. So you see that the main benefit arises when you can propagate LRA through different services that either all complete their actions or all of their compensation callbacks will be called in case of failures (and, of course, only for the services that executed their actions in the first place). Here is also an example with the LRA propagation https://github.com/xstefank/quarkus-lra-trip-example.
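The "all complete, or compensate whatever already ran" idea can be illustrated with a small in-memory analogy in plain Java. This is not the MicroProfile LRA API: SagaStep and SagaRunner are hypothetical names, everything runs in one process, and compensating in reverse order is an assumption of this sketch rather than something the text above promises.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** One participant: an action plus the callback that undoes it. */
class SagaStep {
    final String name;
    final Runnable action;
    final Runnable compensation;

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

class SagaRunner {
    /**
     * Runs the steps in order. If one fails, only the steps that already
     * completed are compensated (most recent first). Returns true when
     * every step completed.
     */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }
}
```

In a real LRA the coordinator drives these callbacks across services over the network; the sketch only shows the bookkeeping that makes the eventual result consistent.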
EDIT: Sorry, I forgot to add the programmatic API that allows the same interactions as the annotations - https://github.com/jbosstm/narayana/blob/master/rts/lra/client/src/main/java/io/narayana/lra/client/NarayanaLRAClient.java. However, note that it is not part of the specification and is specific to Narayana.

Microservice Architecture: Can you eliminate the synchronous calls between services completely in a system?

Anywhere you read about microservices, it says microservices should communicate asynchronously. It is understandable why asynchronous communication is preferred, as it removes dependencies and provides low coupling, availability, etc.
Suppose there is a common authorization service that is invoked every time a user calls an API. In this scenario you cannot move further until you have the response from the authorization service. You can call the authorization service asynchronously using async IO, but it is still a request/reply pattern.
Questions I have
Is it possible to get rid of synchronous communication, or more appropriately the request/reply pattern, in a microservices-based system design?
It is possible to implement the request/reply pattern asynchronously through messaging and callbacks, but that adds significant overhead and latency. Is it worth converting every request/reply to asynchronous?
If synchronous calls cannot be eliminated completely, in which scenarios is it OK to have synchronous calls among microservices?
I think the short answer to your question is: the request-reply pattern doesn't mean synchronous. It can also be asynchronous, as you already mentioned.
Long answer:
Request-Reply is just a principle. For example, you send an email to a friend. The message contains data relevant to you, and you are expecting a response but didn't say that explicitly. Your friend will see the email when he gets back from work, and then he may or may not reply to you. Only you know that you need an answer from him.
Now there are a few options while waiting for your response. Either block your entire life until your friend responds (which is synchronous communication), or do something else until the response arrives in your inbox (which is asynchronous).
Now, to the point:
Is it possible to get rid of synchronous communication, or more appropriately the request/reply pattern, in a microservices-based system design?
Yes, you already answered that in your second point. Even though it is possible, I think it should be used only where it is required.
It is possible to implement the request/reply pattern asynchronously through messaging and callbacks, but that adds significant overhead and latency. Is it worth converting every request/reply to asynchronous?
For the right scenario, yes. Messaging systems perform very well, so latency should not be an issue. If a latency problem does occur in a messaging system, there are other options to improve it.
If synchronous calls cannot be eliminated completely, in which scenarios is it OK to have synchronous calls among microservices?
Yes.
There is one more thing that needs to be added. Synchronous doesn't always mean blocking. In a reactive world, if you make an HTTP call to another service, the caller sends the request and then awaits the response in a non-blocking manner. When the response arrives, the caller is notified and processing continues. While "awaiting", the CPU can do other work.
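The difference between blocking and non-blocking request/reply can be sketched with the JDK's CompletableFuture. The remote call is simulated by a Supplier here (an assumption for the sake of a self-contained example); in a real service it would be an HTTP or messaging client.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

class RequestReplyStyles {
    /** Blocking: the calling thread does nothing else until the reply is available. */
    static String blockingCall(Supplier<String> remote) {
        return CompletableFuture.supplyAsync(remote).join();
    }

    /**
     * Non-blocking: a continuation is registered and the method returns
     * immediately; the continuation runs once the reply has arrived.
     */
    static CompletableFuture<String> nonBlockingCall(Supplier<String> remote) {
        return CompletableFuture.supplyAsync(remote)
                .thenApply(reply -> "handled " + reply);
    }
}
```

Both variants are still request/reply; only what the calling thread does while waiting differs.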

Microservices: detecting a failed service ( root of all problems )

I would like to understand how to detect the failed service quickly and reliably, i.e. the service that is the root of all the 5xx responses.
Let me try to elaborate. Let's assume we have 300+ microservices and they only interact via synchronous HTTP GET requests without any data modifications (we assume this for simplicity). Each customer request may turn into calls to ~10 different microservices; moreover, it could be a 'calling chain' of requests, i.e. the API Gateway calls 3 different microservices, each of them calls 1-5 more, each of these 1-5 calls 1-5 more, etc.
We closely monitor 5xx errors on each microservice and react to these errors.
Now one of the microservices fails. It appears to be somewhere at the end of a 'calling chain', which means that other microservices which depend on it will start to return 5xx as well.
Yes, there are circuit breakers, and yes, they become 'triggered / opened', and instead of calling the downstream service they immediately return an error as well (in most cases we cannot return a good fallback like an empty response).
So we see that a relatively large number of microservices return 5xx. Say 30-40 microservices return 5xx, and we see 30-40 triggered / opened circuit breakers.
How can we detect the failed microservice, the root of all evil, quickly?
Did anybody encounter this issue?
Regards
You will need to implement a distributed tracing solution that tracks the originating transaction with a global ID. This global identifier is typically called a Correlation ID; it is generated by the very first service that creates the request and is propagated to all the other microservices that work together to fulfill the request.
Take a look at OpenTracing for your implementation needs. It provides libraries for you to add the instrumentation required for identifying faulty microservices in a distributed environment.
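Once every call carries the same correlation ID, finding the likely culprit becomes a query over the collected trace records. The sketch below is a hypothetical, in-memory illustration in plain Java: the Span record shape and RootCauseFinder are invented for this example and are not part of OpenTracing. It reports the services that failed while none of the services they called failed.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical trace record: one service call within a correlated request. */
class Span {
    final String correlationId;
    final String service;
    final String parentService; // null for the entry point
    final boolean failed;

    Span(String correlationId, String service, String parentService, boolean failed) {
        this.correlationId = correlationId;
        this.service = service;
        this.parentService = parentService;
        this.failed = failed;
    }
}

class RootCauseFinder {
    /** Failed services whose own downstream calls did not fail, for one request. */
    static Set<String> likelyRootCauses(List<Span> spans, String correlationId) {
        List<Span> trace = spans.stream()
                .filter(s -> s.correlationId.equals(correlationId))
                .collect(Collectors.toList());
        // Callers of a failed service probably failed only as a consequence.
        Set<String> callersOfFailures = trace.stream()
                .filter(s -> s.failed && s.parentService != null)
                .map(s -> s.parentService)
                .collect(Collectors.toSet());
        return trace.stream()
                .filter(s -> s.failed && !callersOfFailures.contains(s.service))
                .map(s -> s.service)
                .collect(Collectors.toSet());
    }
}
```

For a chain gateway -> orders -> pricing where all three return 5xx, only pricing is reported, since the other two are callers of a failed service.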
However, if you really do have 300 microservices all using synchronous calls...maybe it is time to consider using asynchronous communications to eliminate the temporal coupling inherent in synchronous communications.

How to handle microservice Interaction when one of the microservice is down

I am new to microservice architecture. Currently I am using Spring Boot for my microservices. In case one of the microservices is down, how should the failover mechanism work?
For example, suppose we have 3 microservices M1, M2, M3. M1 interacts with M2 and M2 interacts with M3. In case the M2 microservice cluster is down, how should we handle this situation?
When any one of the microservices is down, interaction between services becomes very critical, as isolation of failure, resilience and fault tolerance are some of the key characteristics of any microservice-based architecture.
I totally agree with what @jayant answered. In your case, implementing a proper fallback mechanism makes the most sense, and you can implement whatever logic you want based on the use case and the dependencies between M1, M2 and M3.
You can also raise events in your fallback if needed.
Since you are new to microservices, you need to know the common techniques and architecture patterns for resilience and fault tolerance that address the situation you have raised in your question. And since you are using Spring Boot, you can easily add Netflix OSS to your microservices.
Netflix has released Hystrix, a library designed to control points of access to remote systems, services and 3rd party libraries, providing greater tolerance of latency and failure.
It includes the following important characteristics:
Importance of Circuit breaker and Fallback Mechanism:
Hystrix implements the circuit breaker pattern, which is useful when a service failure can cause a cascading failure all the way up to the user. When calls to a particular service exceed circuitBreaker.requestVolumeThreshold (default: 20 requests) and the failure percentage is greater than circuitBreaker.errorThresholdPercentage (default: >50%) in a rolling window defined by metrics.rollingStats.timeInMilliseconds (default: 10 seconds), the circuit opens and further calls are not made.
In cases of error and an open circuit, a fallback can be provided by the developer. Fallbacks may be chained so that the first fallback makes some other business call. Check out Fallback Implementation of Hystrix.
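The trip condition described above boils down to two checks over the counts collected in the rolling window. The following plain-Java sketch is not Hystrix code: TripRule is a hypothetical class, and the exact boundary handling (treating the thresholds as inclusive) is an assumption of this example.

```java
/** Illustrative trip decision over one rolling window; not the Hystrix implementation. */
class TripRule {
    final int requestVolumeThreshold;    // e.g. 20 requests in the window
    final int errorThresholdPercentage;  // e.g. 50 percent

    TripRule(int requestVolumeThreshold, int errorThresholdPercentage) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
    }

    /** Decides whether the circuit should open given the window's request counts. */
    boolean shouldOpen(int totalRequests, int failedRequests) {
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false; // too little traffic in the window to judge
        }
        int errorPercentage = failedRequests * 100 / totalRequests;
        return errorPercentage >= errorThresholdPercentage;
    }
}
```

The volume check matters: without it, a single failed call out of one would count as a 100% error rate and open the circuit.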
Retry:
When a request fails, you may want to have the request retried automatically. Ribbon does this job for us. Keep in mind that in a distributed system a retry can trigger multiple other requests or retries and start a cascading effect.
Here are some Ribbon properties to look at:
# Max number of retries on the same server (excluding the first try)
sample-client.ribbon.MaxAutoRetries=1
# Max number of next servers to retry (excluding the first server)
sample-client.ribbon.MaxAutoRetriesNextServer=1
# Whether all operations can be retried for this client
sample-client.ribbon.OkToRetryOnAllOperations=true
# Interval to refresh the server list from the source
sample-client.ribbon.ServerListRefreshInterval=2000
More details: ribbon properties
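To show how the two retry settings interact, here is a hand-written plain-Java sketch of "retry the same server, then move to the next server". It is not Ribbon's implementation; RetryingClient and its method are hypothetical, and the server is represented by a plain string.

```java
import java.util.List;
import java.util.function.Function;

/** Illustrative retry loop modelled on the two Ribbon settings above. */
class RetryingClient {
    final int maxAutoRetries;            // retries on the same server, excluding the first try
    final int maxAutoRetriesNextServer;  // further servers to try, excluding the first server

    RetryingClient(int maxAutoRetries, int maxAutoRetriesNextServer) {
        this.maxAutoRetries = maxAutoRetries;
        this.maxAutoRetriesNextServer = maxAutoRetriesNextServer;
    }

    /** Returns the first successful response, or rethrows the last failure. */
    <T> T execute(List<String> servers, Function<String, T> call) {
        RuntimeException last = null;
        int serversToTry = Math.min(servers.size(), maxAutoRetriesNextServer + 1);
        for (int s = 0; s < serversToTry; s++) {
            for (int attempt = 0; attempt <= maxAutoRetries; attempt++) {
                try {
                    return call.apply(servers.get(s));
                } catch (RuntimeException e) {
                    last = e; // remember the failure and keep trying
                }
            }
        }
        throw last != null ? last : new IllegalStateException("no servers available");
    }
}
```

With both settings at 1, a request that keeps failing is attempted up to (1 + 1) * (1 + 1) = 4 times, which is why retries stacked across several layers of a call chain can multiply load quickly.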
Bulkhead Pattern:
In general, the goal of the bulkhead pattern is to prevent faults in one part of a system from taking the entire system down.
The bulkhead implementation in Hystrix limits the number of concurrent calls to a component. This way, the number of resources (typically threads) waiting for a reply from the component is limited.
Assume you have a request-based, multi-threaded application (for example a typical web application) that uses three different components, M1, M2, and M3. If requests to component M3 start to hang, eventually all request-handling threads will hang waiting for an answer from M3. This would make the application entirely non-responsive. If requests to M3 are handled slowly, we have a similar problem if the load is high enough.
Implementation details can be found here
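Limiting the number of concurrent calls to a component can be sketched with a JDK Semaphore. This is a minimal illustration and not the Hystrix API; SemaphoreBulkhead is a hypothetical class, and rejecting immediately (instead of waiting for a permit) is a design choice made for this example.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Illustrative bulkhead that caps concurrent calls to one component. */
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free; otherwise returns the fallback right away. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // component is saturated, do not tie up another thread
        }
        try {
            return remoteCall.get();
        } finally {
            permits.release();
        }
    }
}
```

With one bulkhead per component, a hanging M3 can hold at most maxConcurrentCalls request threads, and the remaining threads stay free to serve requests that only need M1 or M2.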
So, these are some factors you need to consider while handling microservice interaction when one of the microservices is down.
As mentioned in the comment, there are many ways you can go about it.
Case 1: all services are independent. This is the trivial case with nothing special to do: call all the services in a blocking or non-blocking way; calling service 2 will in both cases result in a timeout.
Case 2: the services are dependent: M2 depends on M1 and M3 depends on M2.
Option a) M1 can wait for service M2 to come back up, doing periodic pings or checking the registry or naming server to see whether M2 is up.
Option b) Use Hystrix as a circuit breaker implementation and handle the fallback gracefully in M3 or in your orchestrator (whoever is calling these services, i.e. M1, M2, M3, in order).
