Is it recommended, or a good idea, to have two different (or the same) circuit breakers (resilience4j) per message/API call? - circuit-breaker

In my microservice, the circuit breaker sits at the layer where the external API call happens, and my recorded exceptions are those that can occur during this external API call.
But my service timeout is configured at a layer well above the one where the circuit breaker is configured. I can't move the circuit breaker up to the layer where the timeout is configured, or vice versa.
Basically, I want to record the timeout exception, which occurs at a different layer.
Is it recommended, or a good idea, to have two different (or the same) circuit breakers per message/API call?

See this answer on the GitHub repo by the creator himself: https://github.com/resilience4j/resilience4j/issues/1060
Yes, you can do that.
But you have to think about how exceptions are propagated through two CircuitBreakers.
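To make the propagation point concrete, here is a minimal sketch of two nested breakers. SimpleBreaker is a hypothetical, hand-rolled class written only for this illustration (it is not the resilience4j API, and it has no half-open state). The breaker names, the threshold of 2 and the choice of IOException/TimeoutException as recorded exceptions are likewise assumptions.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Hypothetical, simplified breaker for illustration only; NOT the resilience4j API.
class SimpleBreaker {
    /** Thrown instead of running the protected call while the breaker is open. */
    static class OpenException extends RuntimeException {
        OpenException(String name) { super("breaker '" + name + "' is open"); }
    }

    private final String name;
    private final Class<? extends Exception> recorded; // only this type counts as a failure
    private final int threshold;                       // consecutive failures before opening
    private int failures = 0;

    SimpleBreaker(String name, Class<? extends Exception> recorded, int threshold) {
        this.name = name;
        this.recorded = recorded;
        this.threshold = threshold;
    }

    synchronized boolean isOpen() { return failures >= threshold; }

    private synchronized void onSuccess() { failures = 0; }

    private synchronized void onError(Exception e) {
        if (recorded.isInstance(e)) failures++;
    }

    <T> T call(Callable<T> action) throws Exception {
        if (isOpen()) throw new OpenException(name);
        try {
            T result = action.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onError(e);
            throw e; // rethrow unchanged so the outer layer sees the original exception
        }
    }
}

class NestedBreakerDemo {
    public static void main(String[] args) {
        SimpleBreaker inner = new SimpleBreaker("external-api", IOException.class, 2);
        SimpleBreaker outer = new SimpleBreaker("service-timeout", TimeoutException.class, 2);

        for (int i = 1; i <= 3; i++) {
            try {
                // The outer layer wraps the inner one; the simulated API call always fails.
                outer.<String>call(() -> inner.<String>call(() -> {
                    throw new IOException("external API unavailable");
                }));
            } catch (Exception e) {
                System.out.println("call " + i + ": " + e.getClass().getSimpleName()
                        + ", inner open=" + inner.isOpen() + ", outer open=" + outer.isOpen());
            }
        }
    }
}
```

In this sketch the inner breaker opens after two IOExceptions and then rejects the third call with its own OpenException. The outer breaker sees both exception types pass through but counts neither, because it only records TimeoutException. With two real breakers you would have to make the same decision explicitly: should the outer breaker record, ignore, or translate the exceptions thrown by the inner one (including its "open" rejection)?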

Related

Saga Pattern on hardware failure and inter services communication

I am building a Spring Boot microservice application. I am planning on adopting the Saga pattern to tackle the distributed transaction problem. Below is the list of questions and problems that I am facing.
Here is the context for ease of explanation.
Client -> Service A -> Service B
Handling of non-alive microservices due to failure
Assuming that Service B is not alive due to hardware / software failure, how should A react?
Async communication
It is recommended that we use async communication for the saga pattern. Assuming that the time for client -> A is less than the time for A -> B, how does the client receive the data that A gets from B at a later time? Does A have to return an async object to the client, something like a CompletableFuture?
Service requesting resources from other services.
Assuming that Service A has to request some resources from Service B, how should A go about doing this? All I can think of is using HTTP / gRPC (ruling out communication through a message broker).
If you happened to have some experience / advice, please share :)
Any help or advice on Saga pattern is appreciated!
A saga is used for distributed transactions. It can be implemented using either orchestration or choreography. It is mostly (and preferably) implemented with asynchronous communication, where a message broker plays an important role.
There are a lot of questions here. Let me try to answer them.
If one service is down - You can set up a monitoring system for the saga. If any service is down, or the saga has not been processed within some threshold time, you can raise an alert.
Async communication - It is mostly used to process commands (not queries). Whenever the client calls service A, A initiates the saga and replies with the current status. It also returns an ID (a job ID, so to speak). There are then two ways for the client to get the updated status: poll (the client asks for a status update every N seconds) or push (the server pushes changes whenever the state changes).
Service requesting resources from another - Yes, the preferred way is REST or gRPC. Also, if the data is effectively constant, you can use a cache.
Suggestion - SRE (monitoring etc.) plays an important role in a microservice architecture. If you have set that up well, you can handle the other challenges of microservices more easily.
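The job-ID-plus-polling flow described above could look roughly like the following sketch. SagaJobTracker, its Status enum and the Supplier-based saga step are hypothetical names chosen for illustration. A real service would expose start and status as HTTP endpoints and persist the job state rather than keep it in memory.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical in-memory tracker; a sketch of "return a job id now, poll for status later".
class SagaJobTracker {
    enum Status { PENDING, COMPLETED, FAILED }

    private final Map<String, Status> jobs = new ConcurrentHashMap<>();

    /** Starts the saga step asynchronously and immediately returns a job id to the client. */
    String start(Supplier<Boolean> sagaStep) {
        String jobId = UUID.randomUUID().toString();
        jobs.put(jobId, Status.PENDING);
        CompletableFuture.supplyAsync(sagaStep)
                .whenComplete((ok, error) -> jobs.put(jobId,
                        error == null && Boolean.TRUE.equals(ok) ? Status.COMPLETED : Status.FAILED));
        return jobId;
    }

    /** Poll operation: returns the current status, or null for an unknown job id. */
    Status status(String jobId) {
        return jobs.get(jobId);
    }

    public static void main(String[] args) throws InterruptedException {
        SagaJobTracker tracker = new SagaJobTracker();
        String jobId = tracker.start(() -> true); // stand-in for the real A -> B saga step
        // Client-side polling loop (every 100 ms here instead of every N seconds).
        while (tracker.status(jobId) == Status.PENDING) {
            Thread.sleep(100);
        }
        System.out.println(jobId + " -> " + tracker.status(jobId));
    }
}
```

The push variant would replace the polling loop with a callback or a message published when the status changes.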

How to limit rate of outgoing REST API request in scaled microservice environment

I have a scenario in which my Spring Boot microservice is scaled to 8 instances. Each instance consumes messages from MQ and makes an HTTP call to a third-party service. However, the third-party service has a rate limit, i.e. it cannot accept more than 20 requests per second. Now that I have 8 instances of the same service running, it's hard to keep track of the count. Are there any solutions that could help me implement this in an autoscaled environment?
I wouldn't advise keeping track of that state, because that's virtually impossible.
Have a look at circuit breakers, which are included in Spring Cloud. They can add behaviour to outgoing calls, including retry and backoff settings, or return stubs if all else fails. Implementations include Spring Retry and Netflix Hystrix, among others.
I guess it's still not 100% fault-tolerant but as you already use messaging that won't be an issue because if all retries fail you can nack the message.
There's also this introduction from Martin Fowler to Circuit Breakers which is really nice.
Hope this gives you something new to consider.
This blog post comes to my mind. They use a Token Bucket to control the flow. The use case sounds similar to yours.
Our connection to our SMS aggregator requires us to limit the rate that we send messages to their system.
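For reference, a token bucket can be sketched in a few lines. The TokenBucket class below is a hypothetical, per-instance limiter, not the implementation from the blog post. The rate of 2.5 requests per second assumes the 20 requests/second budget is split statically across exactly 8 instances. That assumption does not hold when the instance count changes, in which case the bucket state would have to live in a shared store instead.

```java
import java.util.function.LongSupplier;

// Hypothetical per-instance token bucket; the clock is injectable so it can be tested.
class TokenBucket {
    private final double capacity;       // maximum burst size
    private final double refillPerNano;  // tokens added per nanosecond
    private final LongSupplier nanoClock;
    private double tokens;
    private long lastRefill;

    TokenBucket(double capacity, double tokensPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.nanoClock = nanoClock;
        this.tokens = capacity;                 // start with a full bucket
        this.lastRefill = nanoClock.getAsLong();
    }

    /** Takes one token if available; returns false when the caller should wait or requeue. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed static share: 20 requests/second divided by 8 instances.
        TokenBucket bucket = new TokenBucket(3, 2.5, System::nanoTime);
        for (int i = 1; i <= 5; i++) {
            System.out.println("request " + i + (bucket.tryAcquire() ? " sent" : " deferred"));
        }
    }
}
```

When tryAcquire returns false, the consumer can delay or nack the message instead of calling the third-party service.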

Microservices: how to track fallen down services?

Problem:
Suppose there are two services A and B. Service A makes an API call to service B.
After a while, service A goes down or is lost due to network errors.
How will other services know that an outbound call from service A was lost or never happened? I need another concurrent app that will automatically react (run emergency code) if service A's outbound call is lost.
What cutting-edge solutions exist?
My thoughts, for example:
service A registers a call event in some middleware (event info, "running" status, timestamp, etc).
If this call is not completed after N seconds, some "call timeout" event in the middleware automatically starts the emergency code.
If the call is completed at the proper time service A marks the call status as "completed" in the same middleware and the emergency code will not be run.
P.S. I'm on Java stack.
Thanks!
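The register / timeout / complete idea from the question can be sketched as follows. CallWatchdog is a hypothetical class used only for illustration. It runs inside a single JVM, so to survive a crash of service A the same logic would have to live in a separate process or middleware, as the question suggests.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical in-process watchdog; a sketch of the "call timeout" middleware idea.
class CallWatchdog {
    // Daemon thread so the watchdog never keeps the JVM alive on its own.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "call-watchdog");
                t.setDaemon(true);
                return t;
            });
    private final Set<String> running = ConcurrentHashMap.newKeySet();

    /** Registers a call as "running"; the emergency code runs if it is not completed in time. */
    void register(String callId, long timeoutMillis, Runnable emergency) {
        running.add(callId);
        scheduler.schedule(() -> {
            // remove() returns true only if the call was still marked as running.
            if (running.remove(callId)) {
                emergency.run();
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Marks the call as completed; returns false if it already timed out or is unknown. */
    boolean complete(String callId) {
        return running.remove(callId);
    }

    public static void main(String[] args) throws InterruptedException {
        CallWatchdog watchdog = new CallWatchdog();
        watchdog.register("call-1", 100, () -> System.out.println("call-1 lost, running emergency code"));
        watchdog.register("call-2", 100, () -> System.out.println("call-2 lost, running emergency code"));
        watchdog.complete("call-2"); // call-2 finished in time
        Thread.sleep(300);           // give the watchdog time to fire for call-1
    }
}
```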
I recommend looking into patterns such as Retry, Timeout, Circuit Breaker, Fallback and Healthcheck. You can also look into the Bulkhead pattern if concurrent calls and fault isolation are your concern.
There are many resources where these well-known patterns are explained, for instance:
https://www.infoworld.com/article/3310946/how-to-build-resilient-microservices.html
https://blog.codecentric.de/en/2019/06/resilience-design-patterns-retry-fallback-timeout-circuit-breaker/
I don't know which technology stack you are on, but usually some functionality for these concerns is already provided that you can incorporate into your solution. There are libraries that take care of this resilience functionality, and you can, for instance, set them up so that your custom code is executed when events such as failed retries, timeouts or activated circuit breakers occur.
E.g. for the Java stack Hystrix is widely used; for .NET you can look into Polly to make use of retry, timeout, circuit breaker, bulkhead or fallback functionality.
Concerning health checks, you can look into Actuator for Java, and .NET Core already provides a health check middleware that more or less provides that functionality out of the box.
But before using any libraries, I suggest first getting familiar with the purpose and concepts of the listed patterns, so you can choose and integrate those that best fit your use cases and major concerns.
Update
We have to differentiate between two well-known problems here:
1.) How can service A robustly handle temporary outages of service B (or the network connection between service A and B which comes down to the same problem)?
To address the related problems the above mentioned patterns will help.
2.) How to make sure that the request that should be sent to service B will not get lost if service A itself goes down?
To address this kind of problem there are different options at hand.
2a.) The component that sent the request to service A (which then triggers service B) also applies the resilience patterns mentioned and retries its request until service A successfully answers that it has performed its tasks (which also includes the successful request to service B).
There can also be several instances of each service with some kind of load balancer in front of them, which distributes and directs the requests to an available instance of the specific service (based on regularly performed health checks). Or you can use a service registry (see https://microservices.io/patterns/service-registry.html).
You can of course chain several API calls one after another, but this can lead to cascading failures. So I would rather go with an asynchronous communication approach as described in the next option.
2b.) Let's consider that it is of utmost importance that some instance of service A will reliably perform the request to service B.
You can use message queues in this case as follows:
Let's say you have a queue where jobs to be performed by service A are collected.
Then you have several instances of service A running (see horizontal scaling) where each instance will consume the same queue.
You use the message locking features of the message queue service, which make sure that as soon as one instance of service A reads a message from the queue, the other instances won't see it. If service A was able to complete its job (i.e. call service B, save some state in service A's persistence and whatever other tasks need to be included for successful processing), it deletes the message from the queue afterwards so that no other instance of service A processes the same message.
If service A goes down during processing, the queue service automatically unlocks the message for you, and another instance of service A (or the same instance after it has restarted) will read the message (i.e. the job) from the queue and try to perform all the tasks (call service B, etc.).
You can combine several queues e.g. also to send a message to service B asynchronously instead of directly performing some kind of API call to it.
The key point is that the queue service is a highly available and redundant service, which already makes sure that no message gets lost once it has been published to a queue.
Of course you could also handle the jobs to be performed in service A's own database, but consider that when service A receives a request there is always a chance that it goes down before it can save the status of the job to its persistent storage for later processing. Queue services already address that problem for you if chosen thoughtfully and used correctly.
For instance, if you consider Kafka as the messaging service, you can look into this Stack Overflow answer, which covers the solution to this problem with that specific technology: https://stackoverflow.com/a/44589842/7730554
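The message-locking flow from option 2b can be illustrated with a small in-memory model. LockingQueue is a hypothetical class, not the API of any real queue service. It assumes unique message strings and takes the current time as a parameter so that the lock expiry is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory model of a queue with message locking (visibility timeout).
class LockingQueue {
    private final Deque<String> visible = new ArrayDeque<>();
    private final Map<String, Long> lockedUntil = new HashMap<>(); // message -> lock expiry (ms)
    private final long lockMillis;

    LockingQueue(long lockMillis) {
        this.lockMillis = lockMillis;
    }

    synchronized void publish(String message) {
        visible.addLast(message);
    }

    /** Locks and returns the next message; other consumers cannot see it until the lock expires. */
    synchronized Optional<String> receive(long nowMillis) {
        // Unlock messages whose consumer did not delete them in time (e.g. it crashed).
        lockedUntil.entrySet().removeIf(entry -> {
            if (entry.getValue() <= nowMillis) {
                visible.addLast(entry.getKey());
                return true;
            }
            return false;
        });
        String message = visible.pollFirst();
        if (message != null) {
            lockedUntil.put(message, nowMillis + lockMillis);
        }
        return Optional.ofNullable(message);
    }

    /** Called after successful processing; returns false if the lock had already expired. */
    synchronized boolean delete(String message) {
        return lockedUntil.remove(message) != null;
    }

    public static void main(String[] args) {
        LockingQueue queue = new LockingQueue(1000);
        queue.publish("job-1");
        System.out.println("instance 1 reads: " + queue.receive(0));
        System.out.println("instance 2 reads: " + queue.receive(500));
        // Instance 1 crashes and never deletes the message; the lock expires.
        System.out.println("instance 2 reads: " + queue.receive(1000));
        System.out.println("deleted: " + queue.delete("job-1"));
    }
}
```

Because a message can be delivered again after a crash, the processing in service A (including the call to service B) should be idempotent.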
There are many ways to solve your problem.
I guess you are talking about two topics: design patterns in microservices and the circuit breaker.
https://dzone.com/articles/design-patterns-for-microservices
To solve your problem, I normally put a message queue between services and use service discovery to detect which services are alive. If a service dies or is overloaded, use the circuit breaker pattern.

Microservice Circuit Breaker and Discovery Service patterns

I'm new to microservices and have a question that Google hasn't really helped me with.
I know that a microservice has to be independent, so even if one of its counterparts goes offline, it should keep working normally.
With that in mind, I can't really understand the circuit breaker or even service discovery: where should I put it? Since every call I make to any microservice goes through the circuit breaker, if my circuit breaker service's server goes offline, my whole application is doomed until I fix it. How do I get around that?
Most importantly, WHERE should I put the Circuit Breaker, in a microservice as well?
You should use the Circuit breaker pattern whenever you have remote calls.
If you don't use it, then in some circumstances (e.g. when some microservices are down) your system would behave as if it were under a self-inflicted DoS attack. This situation arises when you have chained synchronous calls. For example, take A -> B -> C (A calls B, which calls C). If C is not responding and A keeps calling, B could be overwhelmed with managing waiting calls from A and be unable to respond to legitimate calls from other services that would normally succeed.
The most common place to use the circuit breaker is in the API gateway, where most of the remote calls are made (this is its primary responsibility). You could also use the pattern in clients, to force them to stop repeatedly calling a dead microservice.
Although microservices are independent with regard to resilience (they can function even when others fail), this does not mean that they don't communicate with one another. They may communicate, but in an asynchronous manner, e.g. when one microservice wants to update its own local cache with data from another microservice in a background process.

How to manage microservice failure?

Let's say I have several microservices (REST APIs). The problem is: if one service is not accessible (let's call it service "A"), the data that was being sent to service "A" should be saved in a temporary database, and once the service is working again, the data should be sent again.
Question:
1. Should I create a service that pings service "A" every 10 seconds to find out whether it works or not? Or is it possible to do this with a task queue? Any suggestions?
Polling is a waste of bandwidth. You want to use a transactional queue.
Throw all your outbound messages in the queue, and have some other process to handle the messages.
How this will work is - after your process reads from the queue, and tries to send to the REST service:
If it works, commit the transaction (for the queue)
If it doesn't work, don't commit. Start a delay (minutes, seconds - you know best) until you read from the queue again.
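The read / send / commit loop above can be sketched like this. OutboundQueue is a hypothetical in-memory stand-in. A real transactional queue would persist the messages and provide the commit itself, and the delay between attempts is left to the caller.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for a transactional outbound queue.
class OutboundQueue {
    private final Deque<String> pending = new ArrayDeque<>();

    synchronized void enqueue(String message) {
        pending.addLast(message);
    }

    synchronized int size() {
        return pending.size();
    }

    /**
     * Tries to send messages in order. A message is removed ("committed") only after
     * send reports success; on the first failure the message stays at the head and
     * the method returns, so the caller can wait before trying again.
     */
    synchronized int drain(Predicate<String> send) {
        int committed = 0;
        while (!pending.isEmpty()) {
            String head = pending.peekFirst(); // read without removing
            if (!send.test(head)) {
                break;                         // no commit: the message stays queued
            }
            pending.removeFirst();             // commit
            committed++;
        }
        return committed;
    }

    public static void main(String[] args) {
        OutboundQueue queue = new OutboundQueue();
        queue.enqueue("order-1");
        boolean[] serviceUp = {false};
        System.out.println("committed: " + queue.drain(m -> serviceUp[0]));
        serviceUp[0] = true; // after the delay, service "A" is reachable again
        System.out.println("committed: " + queue.drain(m -> serviceUp[0]));
    }
}
```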
You can use the circuit breaker pattern, e.g. the Hystrix circuit breaker from Netflix.
It is possible to open the circuit breaker based on a timeout, or when the service call fails or the service is inaccessible.
There are multiple dimensions to your question. First, you want to consider using an infrastructure that provides resilience and self-healing. That means deploying a cluster of containers, all containing your service A. You then put a load balancer or API gateway in front of your service to distribute calls/load. It will also periodically check the health of your service. When it detects that a container does not respond correctly, it can kill the container and start another one. This can be provided by a container infrastructure such as Kubernetes / Docker Swarm etc.
Now this does not protect you from losing requests. In the event that a container malfunctions, there will still be a short time between the failure and the next health check during which requests may not be served. In many applications this is acceptable, and the client side will just re-request and hit another (healthy) container. If your application absolutely must not lose requests, you will have to hold each request in, for example, an API gateway and make sure it is kept until a service has completed it. An example technology would be Netflix Zuul with Hystrix (a circuit breaker). Using such a gatekeeper with built-in fault tolerance can increase resiliency even further. As a side note, using an API gateway can also solve issues with central authentication/authorization, routing and monitoring.
Another approach to add resilience / decouple is to use a fast streaming / message queue, such as Apache Kafka, for recording all incoming messages and have a message processor process them whenever ready. The trick then is to only mark the messages as processed when your request was served fully. This can also help in scenarios where faults can occur due to large number of requests that cannot be handled in real time by the Service (Asynchronous Decoupling with Cache).
Service "A" should fire a "ready" event when it becomes available. Just listen to that and resend your request.
