How to handle microservice interaction when one of the microservices is down - spring-boot

I am new to microservice architecture. Currently I am using Spring Boot for my microservices. If one of the microservices is down, how should the failover mechanism work?
For example, if we have 3 microservices M1, M2, M3, where M1 interacts with M2 and M2 interacts with M3: in case the M2 microservice cluster is down, how should we handle this situation?

When any one of the microservices is down, interaction between services becomes critical, since isolation of failure, resilience, and fault tolerance are key characteristics of any microservice-based architecture.
I fully agree with what @jayant answered: in your case, implementing a proper fallback mechanism makes the most sense, and you can implement whatever logic your use case and the dependencies between M1, M2, and M3 require.
You can also raise events from your fallback if needed.
Since you are new to microservices, you should know the common techniques and architecture patterns below for resilience and fault tolerance against the situation you raised in your question. And since you are using Spring Boot, you can easily add Netflix OSS to your microservices.
Netflix has released Hystrix, a library designed to control points of access to remote systems, services, and 3rd-party libraries, providing greater tolerance of latency and failure.
It includes the following important characteristics:
Importance of circuit breaker and fallback mechanism:
Hystrix implements the circuit breaker pattern, which is useful when a service failure could cause cascading failures all the way up to the user.
When calls to a particular service exceed circuitBreaker.requestVolumeThreshold (default: 20 requests) and the failure percentage is greater than circuitBreaker.errorThresholdPercentage (default: >50%) in a rolling window defined by metrics.rollingStats.timeInMilliseconds (default: 10 seconds), the circuit opens and further calls are not made.
In case of an error and an open circuit, a fallback can be provided by the developer. Fallbacks may be chained so that the first fallback makes some other business call. Check out Fallback Implementation of Hystrix.
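As an illustration only, here is a minimal sketch of how M1 could wrap its call to M2 with a Hystrix fallback via spring-cloud-starter-hystrix. The Order type, the M2 URL, and the fallback behaviour are placeholders, not something from your question:

    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import org.springframework.stereotype.Service;
    import org.springframework.web.client.RestTemplate;

    @Service
    public class M2Client {

        private final RestTemplate restTemplate;

        public M2Client(RestTemplate restTemplate) {
            this.restTemplate = restTemplate;
        }

        // If M2 is down, slow, or the circuit is open, Hystrix calls the fallback instead.
        @HystrixCommand(fallbackMethod = "getOrderFallback")
        public Order getOrder(String orderId) {
            return restTemplate.getForObject("http://M2/orders/" + orderId, Order.class);
        }

        // Fallback: return a safe default (or raise an event) instead of failing the whole chain.
        public Order getOrderFallback(String orderId) {
            return Order.empty(orderId);
        }
    }

This sketch also assumes @EnableCircuitBreaker on the Spring Boot application class so that @HystrixCommand is picked up.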
Retry:
When a request fails, you may want the request to be retried automatically. Ribbon does this job for us.
In a distributed microservices system, a retry can trigger multiple other requests or retries and start a cascading effect.
Here are some Ribbon properties to look at:
# Max number of retries on the same server (excluding the first try)
sample-client.ribbon.MaxAutoRetries=1
# Max number of next servers to retry (excluding the first server)
sample-client.ribbon.MaxAutoRetriesNextServer=1
# Whether all operations can be retried for this client
sample-client.ribbon.OkToRetryOnAllOperations=true
# Interval to refresh the server list from the source
sample-client.ribbon.ServerListRefreshInterval=2000
More details: ribbon properties
Bulkhead Pattern:
In general, the goal of the bulkhead pattern is to prevent faults in one part of a system from taking the entire system down (bulkhead pattern).
The bulkhead implementation in Hystrix limits the number of concurrent calls to a component. This way, the number of resources (typically threads) waiting for a reply from the component is limited.
Assume you have a request-based, multi-threaded application (for example a typical web application) that uses three different components, M1, M2, and M3. If requests to component M3 start to hang, eventually all request-handling threads will hang waiting for an answer from M3. This would make the application entirely non-responsive. If requests to M3 are handled slowly, we have a similar problem if the load is high enough.
Implementation details can be found here.
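As a rough sketch only (the pool name, sizes, and method are assumptions), limiting concurrent calls to M3 with a dedicated Hystrix thread pool could look like this:

    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

    public class M3Client {

        // A dedicated thread pool ("bulkhead") for calls to M3: at most 10 concurrent
        // calls plus a small queue; anything beyond that is rejected immediately instead
        // of tying up the web container's request-handling threads.
        @HystrixCommand(
            threadPoolKey = "m3Pool",
            threadPoolProperties = {
                @HystrixProperty(name = "coreSize", value = "10"),
                @HystrixProperty(name = "maxQueueSize", value = "5")
            },
            fallbackMethod = "statusFallback")
        public String getStatus(String id) {
            // ... synchronous remote call to M3 goes here ...
            return "UP";
        }

        public String statusFallback(String id) {
            return "UNKNOWN";
        }
    }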
So, these are some factors you need to consider while handling microservice interaction when one of the microservices is down.

As mentioned in the comment, there are many ways you can go about it:
Case 1: all are independent services. This is the trivial case; no need to do anything special. Call the services in a blocking or non-blocking way; calling service M2 will in both cases result in a timeout.
Case 2: services are dependent, M2 depends on M1 and M3 depends on M2.
Option a) M1 can wait for service M2 to come back up, doing periodic pings, or checking the registry or naming server to see whether M2 is up or not (see the sketch below).
Option b) use Hystrix as a circuit breaker implementation and handle the fallback gracefully in M3 or your orchestrator (the component that calls these services, i.e. M1, M2, M3, in order).
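For option a), a hedged sketch using Spring Cloud's DiscoveryClient, assuming the services are registered in Eureka (or another registry) under the service ID "M2":

    import org.springframework.cloud.client.discovery.DiscoveryClient;
    import org.springframework.stereotype.Component;

    @Component
    public class M2AvailabilityChecker {

        private final DiscoveryClient discoveryClient;

        public M2AvailabilityChecker(DiscoveryClient discoveryClient) {
            this.discoveryClient = discoveryClient;
        }

        // Returns true if the registry currently lists at least one M2 instance.
        public boolean isM2Up() {
            return !discoveryClient.getInstances("M2").isEmpty();
        }
    }

You could call this from a scheduled task before resuming calls to M2; the service ID and the polling approach are just one possible arrangement.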

Related

What would be the right ZMQ Pattern?

I am trying to build a ZeroMQ pattern where,
There can be many clients connecting to a single server endpoint
Server will distribute incoming client tasks to available workers (will be mapped to the number of cores on the server)
These tasks are long running (in hours) and need to perform a lot of local I/O
During each task execution (iteration) there will be data/messages (potentially in order of [GB]s) sent back and forth between the client and the server worker
Client and server workers need to know if there are failures/errors on the peer side, so that they can recover (retry) or shutdown gracefully and try later
Based on the above, I presume that the ROUTER/DEALER pattern would be useful. PUB/SUB is discarded as I need to know if the peer fails.
I tried using various combinations of the ROUTER/DEALER pattern but I am unable to ensure that multiple messages from a client reach the same worker within an iteration. I understand that I need to implement a broker/forwarder/device that routes the incoming messages to the right recipient/handler/worker. But I am unable to map the frontend and backend sockets in the broker. I am looking at MajorDomo pattern, but I guess there has to be a simpler broker model that could just route the messages to the assigned worker. (not really get into services)
I am looking for some examples, if there are any or any guidance on what I may be missing. I am trying to build this in Golang.
Q : "What would be the right ZMQ Pattern?"
Based on the complex composition of all the requirements posted under items 1 - 5, I dare to say, The Right would be NOT to use a single one of the standard, built-in, ZeroMQ trivial primitive Communication Archetype Patterns, but to rather create a multi-layered application-specific composition of a ( M + N + 1 hot-standby robust-enough?) (self-resilient?) Signalling-Messaging infrastructure, that covers all your current ( and possibly extensible for any future one ) application-level requirements, like depicted here for a way simpler distributed-computing use-case, where but a trivial remote-SigKILL was implemented.
Yes, the best would be to create ( and maintain ) your own formalised signalling, that the application level can handle and interact across -- like the heart-beating for detecting dead-worker(s) + permitting to re-instate such failed jobs right on-detected failures (most probably re-located and/or re-scheduled to take place & respective resources not statically pre-mapped, but where physically most feasible at the re-instating moment of time - so even more telemetry signalling will help you decide about the re-instating of the such failed micro-jobs).
ZeroMQ is a fabulous framework right for such complex signalling and messaging hierarchies, so your System Architect's imagination is the only ceiling in this concept.
ZeroMQ will take the rest and do all the hard work nice and easily.

What is the difference between a circuit breaker and a bulkhead pattern?

Can we use both together in Spring Boot during the development of a microservice?
These are fundamentally different patterns.
A circuit breaker pattern is implemented on the caller, to avoid overwhelming a service which may be struggling to handle calls. A sample implementation in Spring can be found here.
A bulkhead pattern is implemented on the service, to prevent a failure during the handling of a single incoming call impacting the handling of other incoming calls. A sample implementation in Spring can be found here.
The only thing these patterns have in common is that they are both designed to increase the resilience of a distributed system.
While you can certainly use them together in the same service, you must understand that they are not related to each other, as one is concerned with making calls and the other is concerned with handling calls.
Yes, they can be used together, but it's not always necessary.
As @Tom Redfern said, a circuit breaker is implemented on the caller side. So, if you are sending requests to another service, you should wrap those requests in a circuit breaker specific to that service. Keep in mind that every other third-party system or service should have its own circuit breaker. Otherwise, the unavailability of one system will impact the requests you send to the other by opening its circuit breaker.
More information about the circuit breaker pattern can be found here: https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
@Tom Redfern is also right about bulkheading: this is a pattern implemented in the service that is called. So, if you react to external requests by spawning multiple other requests or workloads, you should avoid running all those workloads in a single unit (thread pool). Instead, separate the workloads into pieces (thread pools), one for each kind of request you spawn.
More information about bulkheading can be found here: https://learn.microsoft.com/en-us/azure/architecture/patterns/bulkhead
Your question was whether it is possible to use both of these patterns in the same microservice. The answer is: yes, you can, and very often the situation calls for it.
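As a hedged illustration only (not taken from the answers above), one common way to combine both patterns in a Spring Boot service is Resilience4j's annotations; the instance names, the downstream URL, and the fallback are assumptions, and the actual limits live in application.yml under resilience4j.circuitbreaker.instances and resilience4j.bulkhead.instances:

    import io.github.resilience4j.bulkhead.annotation.Bulkhead;
    import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
    import org.springframework.stereotype.Service;
    import org.springframework.web.client.RestTemplate;

    @Service
    public class InventoryClient {

        private final RestTemplate restTemplate = new RestTemplate();

        // The circuit breaker protects us (the caller) from a struggling downstream
        // service; the bulkhead caps how many concurrent calls we dedicate to it.
        @CircuitBreaker(name = "inventory", fallbackMethod = "fallback")
        @Bulkhead(name = "inventory")
        public String getStock(String sku) {
            return restTemplate.getForObject("http://inventory/stock/" + sku, String.class);
        }

        public String fallback(String sku, Throwable cause) {
            return "UNKNOWN";
        }
    }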

Microservices: detecting a failed service ( root of all problems )

I would like to understand how to detect the failed service (in a fast and reliable manner), i.e. the service that is the root of all the 5xx responses.
Let me try to elaborate. Let's assume we have 300+ microservices and they have only synchronous HTTP interaction via GET requests, without any data modification (we assume this for simplicity). Each customer request may result in calls to ~10 different microservices, and moreover it could be a 'calling chain' of requests, i.e. the API Gateway calls 3 different microservices, each of them calls 1-5 more, each of these calls 1-5 more, etc.
We closely monitor 5xx errors on each microservice and react to these errors.
Now one of the microservices fails. It happens to be somewhere near the end of a 'calling chain', which means that the other microservices which depend on it will start to return 5xx as well.
Yes, there are circuit breakers; yes, they become triggered/opened, and instead of calling the downstream service they immediately return an error as well (in most cases we cannot return a good fallback such as an empty response).
So we see that a relatively large number of microservices return 5xx: 30-40 microservices return 5xx, and we see 30-40 triggered/opened circuit breakers.
How can we detect the failed microservice, the root of all evil, in a fast manner?
Did anybody encounter this issue?
Regards
You will need to implement a distributed tracing solution that tracks the origin transaction with a global ID. The name of this global identifier is typically called Correlation ID and it is generated by the very first service which creates the request and propagated to all the other microservices that work together to fulfill the request.
Take a look at OpenTracing for your implementation needs. It provides libraries for you to add the instrumentation required for identifying faulty microservices in a distributed environment.
However, if you really do have 300 microservices all using synchronous calls...maybe it is time to consider using asynchronous communications to eliminate the temporal coupling inherent in synchronous communications.
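As a small, hedged sketch of the propagation side (the header name and details are assumptions; instrumentation such as Spring Cloud Sleuth or OpenTracing libraries does this for you), a servlet filter can attach a correlation ID to every incoming request and every log line, so that all 5xx responses from one customer request can be grouped and traced back to the first service that failed:

    import java.io.IOException;
    import java.util.UUID;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.slf4j.MDC;
    import org.springframework.stereotype.Component;
    import org.springframework.web.filter.OncePerRequestFilter;

    @Component
    public class CorrelationIdFilter extends OncePerRequestFilter {

        private static final String HEADER = "X-Correlation-Id";

        @Override
        protected void doFilterInternal(HttpServletRequest request,
                                        HttpServletResponse response,
                                        FilterChain chain) throws ServletException, IOException {
            // Reuse the ID created by the first service in the chain, or create one here.
            String correlationId = request.getHeader(HEADER);
            if (correlationId == null || correlationId.isEmpty()) {
                correlationId = UUID.randomUUID().toString();
            }
            MDC.put("correlationId", correlationId);   // appears in every log line
            response.setHeader(HEADER, correlationId); // echoed back to the caller
            try {
                chain.doFilter(request, response);
            } finally {
                MDC.remove("correlationId");
            }
        }
    }

Outgoing calls would also need to forward the same header (for example via a RestTemplate interceptor) for the chain to stay connected end to end.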

How to handle events processing time between services

Let's say we have two services A and B. B has a relation to A so it needs to know about the existing entities of A.
Service A publishes events every time an entity is created or updated. Service B subscribes to the events published by A and therefore knows about the entities existing in service A.
Problem: The client (UI or other microservices) creates a new entity 'a' and right away creates a new entity 'b' with a reference to 'a'. This is done without much delay, so what happens if service B has not yet received/handled the event from A before it gets the request to create 'b' with its reference to 'a'?
How should this be handled?
1. Service B must fail, and the client should handle this and possibly retry.
2. Service B accepts the entity and, over time, expects the relation to be fulfilled when the expected event is received. Service B gives the entity a state that makes clear it cannot be trusted until the relation has been verified.
3. It is poor design that the client can/has to make these two calls in the same transaction; the design should be different. How?
4. Other ways?
I know that event platforms like Kafka ensure very fast event transmission, but there will always be some delay, and since this is an asynchronous process there will be a kind of race condition.
What you're asking about falls under the general category of bridging the gap between Eventual Consistency and good User Experience which is a well-documented challenge with a distributed architecture. You have to choose between availability and consistency; typically you cannot have both.
Your example raises the question as to whether the service boundaries are appropriate. It's a common mistake to define microservice boundaries around entities, but that's an anti-pattern. Microservice boundaries should be consistent with domain boundaries related to the business use case, not with how entities are modeled within those boundaries. Here's a good article that discusses decomposition, but the TL;DR is:
Microservices should be verbs, not nouns.
So, for example, you could have a CreateNewBusinessThing microservice that handles this specific case. But, for now, we'll assume you have good and valid reasons to have the services divided as they are.
The "right" solution in your case depends on the needs of the consuming service/application. If the consumer is an application or User Interface of some sort, responsiveness is required and that becomes your overriding need. If the consumer is another microservice, it may well be that it cares more about getting good "finalized" data rather than being responsive.
In either of those cases, one good option is a facade (aka gateway) service that lives between your client and the highly-dependent services. This service can receive and persist the request, then respond however you'd like. It can give the consumer a 200 - OK response with an endpoint to call back to check status of the request - very responsive. Or, it could receive a URL to use as a webhook when the response is completed from both back-end services, so it could notify the client directly. Or it could publish events of its own (it likely should). Essentially, you can tailor the facade service to provide to as many consumers as needed in the way each consumer wants to talk.
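As a purely illustrative sketch of that "accept now, finish later" idea (the endpoint paths, the 202 response, and the request shape are assumptions, not prescriptions):

    import java.util.Map;
    import java.util.UUID;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.*;

    @RestController
    @RequestMapping("/business-things")
    public class CreateBusinessThingFacade {

        @PostMapping
        public ResponseEntity<Void> create(@RequestBody Map<String, Object> request) {
            // Persist the request and trigger the calls to services A and B
            // asynchronously (e.g. by publishing an event); do NOT wait for them here.
            String requestId = UUID.randomUUID().toString();
            // ... save request + publish event ...

            // Respond immediately with a URL the client can poll for the final outcome.
            return ResponseEntity.accepted()
                    .header("Location", "/business-things/status/" + requestId)
                    .build();
        }

        @GetMapping("/status/{requestId}")
        public ResponseEntity<String> status(@PathVariable String requestId) {
            // Look up whether both A and B have confirmed; placeholder response here.
            return ResponseEntity.ok("PENDING");
        }
    }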
There are other options too. You can look into Task-Based UI, the Saga pattern, or even just Faking It.
I think you would like to leverage the flexibility of a broker together with the confirmation of a synchronous call. Both can be achieved with RabbitMQ's RPC (request/reply) pattern:
https://www.rabbitmq.com/tutorials/tutorial-six-dotnet.html
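A hedged sketch of that request/reply idea with Spring AMQP (the queue name and payload are assumptions; the linked tutorial shows the raw-client version):

    import org.springframework.amqp.rabbit.core.RabbitTemplate;
    import org.springframework.stereotype.Service;

    @Service
    public class EntityCreationRpcClient {

        private final RabbitTemplate rabbitTemplate;

        public EntityCreationRpcClient(RabbitTemplate rabbitTemplate) {
            this.rabbitTemplate = rabbitTemplate;
        }

        // Sends the request over the broker and blocks until the owning service replies;
        // Spring AMQP handles the replyTo queue and correlation ID for us.
        public String createEntityA(String payload) {
            return (String) rabbitTemplate.convertSendAndReceive("create-entity-a", payload);
        }
    }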

Is hystrix with circuit breaker good for call forwarding and high TPS case?

I have three services A, B, and C. A receives calls from two sources and forwards most calls to service B, some to C, and handles a few itself based on URIs. Before forwarding calls to B or C, A does a little trivial work. The peak load handled by service A is about 60 requests per second; out of those 60, about 55 API calls are forwarded to service B. We know the two or three high-frequency APIs of service B. Please note that all calls are synchronous in nature.
I am using Spring Boot 1.4.1 and Spring Cloud Camden.RELEASE. In my experiments using JMeter on a local Windows machine, I see the services are able to handle the expected requests per second. Once I make service A a circuit breaker and wrap the high-frequency API calls with @HystrixCommand, performance becomes poorer than before: many API calls are failed by Hystrix and the fallbacks are called. After increasing the execution.isolation.thread.timeoutInMilliseconds command property to 30000 and the coreSize thread pool property to 50, all calls pass. I have observed that with Hystrix enabled, service A needs ~50 more threads than before. With more load, API execution time becomes higher, hence I increased the timeout property value.
I am wondering:
1. whether making service A a circuit breaker and wrapping the high-frequency calls (or all calls) to service B inside service A with a Hystrix command are good decisions;
2. if yes, isn't it bad to change/increase the thread counts manually through configuration in the Hystrix pool whenever the TPS requirement grows in the future? Without Hystrix the situation is simple, as Spring Boot automatically manages the thread pools serving the load;
3. since I had to increase the timeout property, when service B is stopped, A (or Hystrix) now takes some seconds to detect that service B is unreachable, so the real advantage of using Hystrix to stop cascading exhaustion or to stop calling the service is not that big. Is Hystrix still recommended?
4. Netflix recommends a core size of 10 by default and has used up to 25, not beyond; in my case the need is 50.
Your suggestions will be helpful here, especially to know whether Hystrix with a circuit breaker is useful in my case, how we can make it useful, or where else it is more suitable (where the TPS to any service is low).
Based on the config doc:
Generally the only time you should use semaphore isolation (SEMAPHORE) is when the call is so high volume (hundreds per second, per instance) that the overhead of separate threads is too high;
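So for a high-volume, mostly pass-through service like A, semaphore isolation may fit better than the default thread-pool isolation, since it avoids the extra thread hand-off per call. A hedged sketch only (the command name, method, and limit are assumptions):

    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

    public class ServiceBForwarder {

        // SEMAPHORE isolation skips the per-call thread-pool hand-off, which matters
        // when call volume is high and the wrapped call itself is cheap.
        @HystrixCommand(
            fallbackMethod = "forwardFallback",
            commandProperties = {
                @HystrixProperty(name = "execution.isolation.strategy", value = "SEMAPHORE"),
                @HystrixProperty(name = "execution.isolation.semaphore.maxConcurrentRequests", value = "60")
            })
        public String forward(String request) {
            // ... synchronous call to service B goes here ...
            return "OK";
        }

        public String forwardFallback(String request) {
            return "SERVICE_B_UNAVAILABLE";
        }
    }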
