Microservices: detecting a failed service (root of all problems)

I would like to understand how to detect the failed service (in a fast and reliable manner), i.e. the service that is the root of all the 5xx responses.
Let me try to elaborate. Let's assume we have 300+ microservices and they have only synchronous HTTP interaction via GET requests without any data modification (we assume this for simplicity). Each customer request may translate into calls to ~10 different microservices; moreover, there can be a 'calling chain' of requests, i.e. the API Gateway calls 3 different microservices, each of them calls 1-5 more, each of those 1-5 calls 1-5 more, etc.
We closely monitor 5xx errors on each microservice and react to them.
Now one of the microservices fails. It happens to sit near the end of a 'calling chain', which means that the other microservices that depend on it start to return 5xx as well.
Yes, there are circuit breakers; yes, they trip open and, instead of calling the downstream service, immediately return an error themselves (in most cases we cannot return a good fallback such as an empty response).
So we see a relatively large number of microservices returning 5xx: 30-40 microservices return 5xx, and we see 30-40 tripped/open circuit breakers.
How can we quickly detect the failed microservice, the root of all evil?
Has anybody encountered this issue?
Regards

You will need to implement a distributed tracing solution that tracks the originating transaction with a global ID. This global identifier is typically called a Correlation ID; it is generated by the very first service that receives the request and is propagated to all the other microservices that work together to fulfill the request.
Take a look at OpenTracing for your implementation needs. It provides libraries for you to add the instrumentation required for identifying faulty microservices in a distributed environment.
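For illustration, here is a minimal sketch of the Correlation-ID propagation described above, assuming plain HTTP services. The header name X-Correlation-ID, the example URL, and the logging suggestion in the comments are common conventions and assumptions, not something mandated by OpenTracing:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

public class CorrelationId {
    public static final String HEADER = "X-Correlation-ID";

    // Reuse the incoming ID if present; the very first service generates it.
    public static String resolve(String incomingHeaderValue) {
        return incomingHeaderValue != null ? incomingHeaderValue
                                           : UUID.randomUUID().toString();
    }

    // Attach the ID to every downstream call so traces can be stitched together.
    public static HttpRequest downstreamRequest(String url, String correlationId) {
        return HttpRequest.newBuilder(URI.create(url))
                .header(HEADER, correlationId)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        String id = resolve(null); // first service in the chain: no incoming header
        HttpRequest req = downstreamRequest("http://inventory-service/items", id);
        System.out.println(id + " -> " + req.headers().firstValue(HEADER).orElse("?"));
        // If every service logs this ID alongside its 5xx responses, the deepest
        // failing service can be found by grouping errors by correlation ID.
    }
}
```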
However, if you really do have 300 microservices all using synchronous calls... maybe it is time to consider asynchronous communication, to eliminate the temporal coupling inherent in synchronous calls.

Related

What would be the right ZMQ Pattern?

I am trying to build a ZeroMQ pattern where:
There can be many clients connecting to a single server endpoint
Server will distribute incoming client tasks to available workers (will be mapped to the number of cores on the server)
These tasks are long running (in hours) and need to perform a lot of local I/O
During each task execution (iteration) there will be data/messages (potentially on the order of GBs) sent back and forth between the client and the server worker
Client and server workers need to know if there are failures/errors on the peer side, so that they can recover (retry) or shut down gracefully and try again later
Based on the above, I presume that the ROUTER/DEALER pattern would be useful. PUB/SUB is ruled out, as I need to know if a peer fails.
I tried various combinations of the ROUTER/DEALER pattern, but I am unable to ensure that multiple messages from a client reach the same worker within an iteration. I understand that I need to implement a broker/forwarder/device that routes incoming messages to the right recipient/handler/worker, but I am unable to map the frontend and backend sockets in the broker. I am looking at the Majordomo pattern, but I suspect there must be a simpler broker model that could just route messages to the assigned worker (without really getting into services).
I am looking for some examples, if there are any, or guidance on what I may be missing. I am trying to build this in Golang.
Q : "What would be the right ZMQ Pattern?"
Based on the complex composition of all the requirements posted under items 1-5, I dare say that the right approach would be NOT to use any single one of the standard, built-in, trivial-primitive ZeroMQ communication archetype patterns, but rather to create a multi-layered, application-specific composition: an (M + N + 1 hot-standby, robust-enough? self-resilient?) signalling and messaging infrastructure that covers all your current (and possibly future) application-level requirements, as depicted here for a far simpler distributed-computing use case, where only a trivial remote SIGKILL was implemented.
Yes, the best option would be to create (and maintain) your own formalised signalling that the application level can handle and interact across -- for example, heart-beating for detecting dead workers, plus permitting failed jobs to be re-instated as soon as the failure is detected (most probably re-located and/or re-scheduled: resources are not statically pre-mapped but are assigned wherever is physically most feasible at the moment of re-instating, so even more telemetry signalling will help you decide about re-instating such failed micro-jobs).
ZeroMQ is a fabulous framework for exactly this kind of complex signalling and messaging hierarchy; your System Architect's imagination is the only ceiling in this concept.
ZeroMQ will take care of the rest and do all the hard work nicely and easily.
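To make the "same worker within an iteration" part concrete, here is a minimal sticky-routing broker sketch (illustrated in Java with JeroMQ; the socket semantics map directly to the Go bindings). The port numbers, the single-frame READY handshake, and the drop-when-no-worker behaviour are assumptions, and all the heart-beating/recovery signalling discussed above is deliberately omitted:

```java
import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;
import org.zeromq.ZMsg;
import org.zeromq.ZFrame;

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class StickyBroker {
    public static void main(String[] args) {
        try (ZContext ctx = new ZContext()) {
            ZMQ.Socket frontend = ctx.createSocket(SocketType.ROUTER); // clients connect here
            ZMQ.Socket backend = ctx.createSocket(SocketType.ROUTER);  // workers connect here
            frontend.bind("tcp://*:5555");
            backend.bind("tcp://*:5556");

            Deque<byte[]> idleWorkers = new ArrayDeque<>();  // workers that sent READY
            Map<String, byte[]> pinned = new HashMap<>();    // client id -> assigned worker id

            ZMQ.Poller poller = ctx.createPoller(2);
            poller.register(frontend, ZMQ.Poller.POLLIN);
            poller.register(backend, ZMQ.Poller.POLLIN);

            while (!Thread.currentThread().isInterrupted()) {
                poller.poll();

                if (poller.pollin(1)) {                       // worker -> broker
                    ZMsg msg = ZMsg.recvMsg(backend);
                    byte[] workerId = msg.pop().getData();
                    ZFrame first = msg.pop();
                    if ("READY".equals(first.getString(ZMQ.CHARSET))) {
                        idleWorkers.add(workerId);            // worker announcing itself
                    } else {                                  // reply: first frame = client id
                        ZMsg reply = new ZMsg();
                        reply.add(first);
                        while (!msg.isEmpty()) reply.add(msg.pop());
                        reply.send(frontend);
                    }
                }
                if (poller.pollin(0)) {                       // client -> broker
                    ZMsg msg = ZMsg.recvMsg(frontend);
                    byte[] clientId = msg.pop().getData();
                    byte[] worker = pinned.computeIfAbsent(   // same client -> same worker
                            new String(clientId, ZMQ.CHARSET),
                            k -> idleWorkers.poll());
                    if (worker == null) continue;             // no free worker: drop (or queue)
                    ZMsg out = new ZMsg();
                    out.add(worker);                          // routing frame for backend ROUTER
                    out.add(clientId);                        // lets worker address its reply
                    while (!msg.isEmpty()) out.add(msg.pop());
                    out.send(backend);
                }
            }
        }
    }
}
```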

Message-Based Microservices - API Gateway Performance

I'm in the process of designing a micro-service architecture and I have a performance-related question. This is what I am trying out with my design:
I have several micro-services which perform distinct actions and store the results in their own data-stores.
The micro-services receive work via a message queue where they receive requests to run their process for the specific data given. The micro-services do NOT communicate with each other.
I have an API gateway which effectively has three journeys:
1) Receives a request to process data, which it then translates into several messages that it puts on the queue for the micro-services to process in their own time. The processing time can be minutes or longer (not instant)
2) Receives a request for the status of the process, where it returns the progress of the overall process.
3) Receives a request for combined data, which is some combination of all the results from the services.
My problem lies in #3 above and the performance of this process.
Whenever this request is received, the API gateway has to put a message request onto the queue asking every service for its information; it then has to wait for all the services to reply with the latest state of their data, combine that data, and return it to the caller.
This process is obviously rather slow, as it has to wait for every service to respond. What is a good way of speeding this up?
The only way I thought of solving this is having another aggregate service/data-store where duplicate data is stored and queried by my api gateway. I really don't like this approach as it duplicates data and is extra work/code.
What is the 'correct' and performant way of querying up-to-date data from my micro-services?
You can use one of these approaches for querying data across microservices (reference):
Selective data replication
With this approach, we replicate the data needed from other microservices into the database of our microservice. The only coupling between microservices is in the data replication configuration.
Composite service layer
With this approach, you introduce composite services that aggregate data from lower-level microservices.
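As a sketch of the composite-service idea applied to journey #3: instead of serial request/reply over the queue, the aggregator can query a read endpoint on each service and fan out all calls in parallel, so total latency is roughly that of the slowest service rather than the sum of all of them. The service URLs and the naive string concatenation below are placeholder assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class CompositeQuery {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Hypothetical read endpoints exposed by each microservice.
    private static final List<URI> SERVICES = List.of(
            URI.create("http://service-a/results"),
            URI.create("http://service-b/results"),
            URI.create("http://service-c/results"));

    public static String combinedView() {
        // Issue all requests at once; we pay for the slowest service,
        // not for every service in turn.
        List<CompletableFuture<String>> calls = SERVICES.stream()
                .map(uri -> CLIENT.sendAsync(
                                HttpRequest.newBuilder(uri).GET().build(),
                                HttpResponse.BodyHandlers.ofString())
                        .thenApply(HttpResponse::body))
                .collect(Collectors.toList());

        // Wait for all responses, then combine (here: naive concatenation).
        return calls.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.joining(","));
    }
}
```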

How to handle event-processing time between services

Let's say we have two services A and B. B has a relation to A so it needs to know about the existing entities of A.
Service A publishes events every time an entity is created or updated. Service B subscribes to the events published by A and therefore knows about the entities existing in service A.
Problem: The client (a UI or another microservice) creates a new entity 'a' and right away creates a new entity 'b' with a reference to 'a'. This is done without much delay, so what happens if service B has not yet received/handled the event from A when it gets the create request for 'b' with a reference to 'a'?
How should this be handled?
Service B must fail, and the client should handle this and possibly retry.
Service B accepts the entity and over time expects the relation to be fulfilled when the expected event is received. Service B gives the entity a state that ensures it cannot be trusted before the relation has been verified.
It is poor design that the client can/has to do these two calls in the same transaction. The design should be different. How?
Other ways?
I know that event platforms like Kafka ensure very fast event transmission, but there will always be a delay, and since this is an asynchronous process there will be a kind of race condition.
What you're asking about falls under the general category of bridging the gap between Eventual Consistency and good User Experience which is a well-documented challenge with a distributed architecture. You have to choose between availability and consistency; typically you cannot have both.
Your example raises the question as to whether the service boundaries are appropriate. It's a common mistake to define microservice boundaries around entities, but that's an anti-pattern. Microservice boundaries should be consistent with domain boundaries related to the business use case, not with how entities are modeled within those boundaries. Here's a good article that discusses decomposition, but the TL;DR is:
Microservices should be verbs, not nouns.
So, for example, you could have a CreateNewBusinessThing microservice that handles this specific case. But, for now, we'll assume you have good and valid reasons to have the services divided as they are.
The "right" solution in your case depends on the needs of the consuming service/application. If the consumer is an application or User Interface of some sort, responsiveness is required and that becomes your overriding need. If the consumer is another microservice, it may well be that it cares more about getting good "finalized" data rather than being responsive.
In either of those cases, one good option is a facade (aka gateway) service that lives between your client and the highly-dependent services. This service can receive and persist the request, then respond however you'd like. It can give the consumer a 200 OK response with an endpoint to call back to check the status of the request - very responsive. Or it could accept a URL to use as a webhook once the response from both back-end services is complete, so it can notify the client directly. Or it could publish events of its own (it likely should). Essentially, you can tailor the facade service to serve as many consumers as needed, in the way each consumer wants to talk.
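A minimal sketch of that facade's "respond immediately, let the client poll" journey, using only the JDK's built-in HTTP server. The paths, the in-memory status map, and the use of 202 Accepted (a common variant of the 200-plus-status-endpoint idea above) are illustrative assumptions:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class FacadeService {
    // requestId -> status ("PENDING", "DONE", ...); a real facade would persist this.
    private static final ConcurrentMap<String, String> STATUS = new ConcurrentHashMap<>();

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // POST /requests: persist the request, kick off async work, reply at once.
        server.createContext("/requests", exchange -> {
            String id = UUID.randomUUID().toString();
            STATUS.put(id, "PENDING");
            // ... here the facade would publish events / call the back-end services ...
            byte[] body = ("{\"id\":\"" + id + "\",\"status\":\"/status/" + id + "\"}")
                    .getBytes();
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(202, body.length); // accepted, not yet done
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });

        // GET /status/{id}: the endpoint the client polls.
        server.createContext("/status/", exchange -> {
            String id = exchange.getRequestURI().getPath().substring("/status/".length());
            byte[] body = ("{\"status\":\"" + STATUS.getOrDefault(id, "UNKNOWN") + "\"}")
                    .getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });

        server.start();
    }
}
```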
There are other options too. You can look into Task-Based UI, the Saga pattern, or even just Faking It.
I think you would like to leverage the flexibility of a broker and the confirmation of a synchronous call. Both can be achieved with this:
https://www.rabbitmq.com/tutorials/tutorial-six-dotnet.html

How to handle microservice interaction when one of the microservices is down

I am new to microservice architecture. Currently I am using Spring Boot for my microservices. If one of the microservices is down, how should the failover mechanism work?
For example, if we have 3 microservices M1, M2 and M3, where M1 interacts with M2 and M2 interacts with M3: if the M2 microservice cluster is down, how should we handle this situation?
When any one of the microservices is down, interaction between services becomes critical, as isolation of failure, resilience and fault tolerance are key characteristics of any microservice-based architecture.
Totally agree with what #jayant answered. In your case, implementing a proper fallback mechanism makes the most sense, and you can implement whatever logic you need based on the use case and the dependencies between M1, M2 and M3.
You can also raise events in your fallback if needed.
Since you are new to microservices, you need to know the common techniques and architecture patterns below for resilience and fault tolerance against the situation you raised in your question. And since you are using Spring Boot, you can easily add Netflix OSS to your microservices.
Netflix has released Hystrix, a library designed to control points of access to remote systems, services and 3rd-party libraries, providing greater tolerance of latency and failure.
It includes the following important characteristics:
Importance of Circuit Breaker and Fallback Mechanism:
Hystrix implements the circuit breaker pattern, which is useful when a service failure can cause cascading failure all the way up to the user.
When calls to a particular service exceed circuitBreaker.requestVolumeThreshold (default: 20 requests) and the failure percentage is greater than circuitBreaker.errorThresholdPercentage (default: >50%) in a rolling window defined by metrics.rollingStats.timeInMilliseconds (default: 10 seconds), the circuit opens and further calls are not made.
In cases of error and an open circuit, a fallback can be provided by the developer. Fallbacks may be chained so that the first fallback makes some other business call. Check out the Fallback Implementation of Hystrix.
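For concreteness, here is a minimal HystrixCommand sketch wrapping the M1 -> M2 call with a fallback; the command body and the fallback value are illustrative assumptions, not your actual business logic:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class M2Command extends HystrixCommand<String> {

    public M2Command() {
        super(HystrixCommandGroupKey.Factory.asKey("M2"));
    }

    @Override
    protected String run() throws Exception {
        // The actual remote call to M2 would go here (e.g. via RestTemplate).
        return fetchFromM2();
    }

    @Override
    protected String getFallback() {
        // Invoked on failure, timeout, or an open circuit.
        return "m2-default-response";
    }

    private String fetchFromM2() throws Exception {
        throw new RuntimeException("M2 is down"); // placeholder for the real call
    }

    public static void main(String[] args) {
        // execute() runs the command; because run() throws, the fallback is used.
        System.out.println(new M2Command().execute());
    }
}
```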
Retry:
When a request fails, you may want to have the request retried automatically. Ribbon does this job for us.
Note, however, that in a distributed microservices system a retry can trigger multiple other requests or retries and start a cascading effect.
Here are some Ribbon properties to look at:
Max number of retries on the same server (excluding the first try):
sample-client.ribbon.MaxAutoRetries=1
Max number of next servers to retry (excluding the first server):
sample-client.ribbon.MaxAutoRetriesNextServer=1
Whether all operations can be retried for this client:
sample-client.ribbon.OkToRetryOnAllOperations=true
Interval to refresh the server list from the source:
sample-client.ribbon.ServerListRefreshInterval=2000
More details: Ribbon properties
Bulkhead Pattern:
In general, the goal of the bulkhead pattern is to prevent faults in one part of a system from taking the entire system down. See: bulkhead pattern.
The bulkhead implementation in Hystrix limits the number of concurrent calls to a component. This way, the number of resources (typically threads) waiting for a reply from the component is limited.
Assume you have a request-based, multi-threaded application (for example a typical web application) that uses three different components, M1, M2, and M3. If requests to component M3 start to hang, eventually all request-handling threads will hang waiting for an answer from M3. This would make the application entirely non-responsive. If requests to M3 are handled slowly, we have a similar problem if the load is high enough.
Implementation details can be found here
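A short sketch of that bulkhead in Hystrix: the command below runs on a dedicated, bounded thread pool, so at most five callers can be stuck waiting on M3 at once. The pool size and the placeholder call are assumptions:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class M3Command extends HystrixCommand<String> {

    public M3Command() {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("M3"))
                // At most 5 concurrent calls to M3; excess requests are rejected
                // immediately instead of tying up the caller's threads.
                .andThreadPoolPropertiesDefaults(
                        HystrixThreadPoolProperties.Setter().withCoreSize(5)));
    }

    @Override
    protected String run() throws Exception {
        return callM3(); // the real remote call would go here
    }

    @Override
    protected String getFallback() {
        return "m3-unavailable"; // also used when the pool is saturated
    }

    private String callM3() {
        return "m3-response"; // placeholder
    }
}
```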
So, these are some factors you need to consider when handling microservice interaction while one of the microservices is down.
As mentioned in the comment, there are many ways you can go about it:
Case 1: all services are independent. This is the trivial case; no need to do anything special. Call all the services in a blocking or non-blocking way; calling service 2 will in both cases result in a timeout.
Case 2: services are dependent: M2 depends on M1 and M3 depends on M2.
Option a) M1 can wait for service M2 to come back up, doing periodic pings or checking a registry or naming server to see whether M2 is up.
Option b) Use Hystrix as a circuit breaker implementation and handle the fallback gracefully in M3 or in your orchestrator (whoever is calling these services, i.e. M1, M2, M3 in order).

How to limit the rate of outgoing HTTP calls in a scaled microservice?

I have a scenario in which my microservice is scaled to 3 instances. Each instance makes HTTP calls to a third-party service. However, the third-party service has a rate limit: it cannot accept more than 1000 requests per second. Now that I have 3 instances of the same service running, it's hard to keep track of the count. Are there any solutions that could help me implement this?
You can use the Circuit Breaker pattern and tools like Hystrix in such a scenario.
My answer is based on the assumption that the instances are independent, don't interact with each other, and can be scaled up or down.
Use a Redis data-cache service and introduce a counter there. Each instance can read that counter and update it whenever it makes an API call; add a condition so that no instance is allowed to make a call once the counter reaches 1000 for that specific second.
Hence they will not be able to make more than 1000 calls per second.
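A minimal sketch of that shared-counter idea using Redis via the Jedis client; the key naming, the 2-second expiry, and the localhost connection are illustrative assumptions:

```java
import redis.clients.jedis.Jedis;

public class DistributedRateLimiter {
    private static final int LIMIT_PER_SECOND = 1000;
    // synchronized below because a single Jedis connection is not thread-safe;
    // in practice you would use a JedisPool.
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Returns true if this instance may make one outgoing call right now.
    public synchronized boolean tryAcquire() {
        String key = "thirdparty:rate:" + (System.currentTimeMillis() / 1000);
        long count = jedis.incr(key);     // atomic across all instances
        if (count == 1) {
            jedis.expire(key, 2);         // let stale per-second keys disappear
        }
        return count <= LIMIT_PER_SECOND; // over the limit: caller must wait/skip
    }
}
```

Callers that get false simply wait for the next second (or skip the call); because INCR is atomic in Redis, the three instances share one counter and cannot jointly exceed the limit.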
