Send a message from one microservice to another in Azure Service Fabric (APIs)

What is the best architecture, using Service Fabric, to guarantee that a message I need to send from Service 1 (mostly API) to Service 2 (mostly API) never gets lost (black arrow)?
Ideas:
1
1.a. Make service 1 and 2 stateful services. Is it a bad call to have a stateful Web API?
1.b. Use Reliable Collections to send the message from API code to Service 2.
2
2.a. Make Service 1 and 2 stateless services
2.b. Add a third service
2.c. Send the message over a queuing system (e.g. Service Bus) from service 1
2.d. To be picked up by the third service. Notice: this third service would also have access to the DB that service 2 (API) has access to. Not an ideal solution for a microservice architecture, right?
3
3.a. Any other ideas?
Keep in mind that the goal is to never lose the message, not even when service 2 is completely down or temporarily removed… so no direct calls.
Thanks

I'd introduce a third (Stateful) service that holds a queue, 'service 3'.
Service 1 would enqueue the message. Service 3 would run an infinite loop, trying to deliver the message to service 2.
You could use the pub/sub package for this. Service 1 is the publisher, Service 2 is the subscriber.
(If you rely on an external queue system like Service Bus, you'll lower the overall availability of the system. Service Bus downtime would lead to messages being undeliverable.)
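As a rough illustration of the queue-and-retry idea in this answer, here is a minimal sketch in plain Python (not Service Fabric code). The in-memory deque is only a stand-in for a replicated reliable queue, and `deliver` is a hypothetical callable representing the call into Service 2; neither name comes from the Service Fabric SDK or the pub/sub package.

```python
from collections import deque
from typing import Callable, Deque


class ForwardingQueue:
    """Store-and-forward buffer: a message leaves the queue only after delivery succeeds."""

    def __init__(self, deliver: Callable[[str], None]) -> None:
        self._pending: Deque[str] = deque()  # stand-in for a replicated, durable queue
        self._deliver = deliver              # hypothetical call into Service 2

    def enqueue(self, message: str) -> None:
        """Called by Service 1; returns as soon as the message is stored."""
        self._pending.append(message)

    def pump_once(self) -> int:
        """Try to deliver pending messages in order; stop at the first failure.

        Returns the number of messages delivered in this pass.
        """
        delivered = 0
        while self._pending:
            message = self._pending[0]       # peek, do not remove yet
            try:
                self._deliver(message)
            except ConnectionError:
                break                        # Service 2 is down; keep the message for later
            self._pending.popleft()          # remove only after a successful delivery
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)
```

Service 3 would call `pump_once` from its long-running loop, with a delay between passes. Because a crash between the delivery and the removal can cause a message to be sent twice, Service 2 should tolerate duplicates.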

I don't think any solution is 100% certain never to lose a message between two parties. Even with a service bus between the two services, there is always a chance (possibly very small, but never zero) that the service bus goes down, or that the connection to the service bus goes down. That said, there are of course designs that make losing a message very unlikely, but you can't get around the fact that you still have to handle errors in the client.
In fact, Service Fabric fault handling is mainly designed around clients retrying communication, rather than having the service or an intermediary do that. There are many reasons for this (I guess), but one is the nature of distributed, replicated, reliable services. If a service primary goes down, a replica picks up the responsibility, but it won't know what the primary was doing at the moment it died (unless the primary had replicated its state, but it might have died even before that). The only one that really knows what it wants to do in this scenario is the client. The client knows what it is doing and can react to different fault scenarios in the service. In Fabric Transport, most known exceptions that could "naturally" occur, such as the service dying or the network cable being cut by the janitor, are actually retried automatically. This includes re-resolving the address in case the service primary was replaced by a secondary.
The same actually goes for a scenario where you introduce a third service or a service bus. What if the network goes down before the message has completely reached the service? In this case only the client knows that something went wrong and what it intended to send. What if it goes down after the message reached the service but before the response was sent? In this case the client has to assume the message never arrived and try to resend it. This is also why service methods are recommended to be idempotent: the same call can safely be made a number of times by the same client.
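One common way to make a receiving method idempotent is to have the client attach an id to each message and have the service remember which ids it has already handled. The sketch below assumes exactly that; `IdempotentReceiver`, `handle` and the `processed:` result format are made-up names for illustration, and a real service would keep the id table in durable storage rather than in memory.

```python
from typing import Dict


class IdempotentReceiver:
    """Applies each message at most once, keyed by a client-chosen message id."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}  # message id -> result already produced
        self.applied = 0                    # counts real side effects, for illustration

    def handle(self, message_id: str, payload: str) -> str:
        # A retry of an already-processed message returns the stored result
        # instead of repeating the side effect.
        if message_id in self._results:
            return self._results[message_id]
        self.applied += 1
        result = f"processed:{payload}"
        self._results[message_id] = result
        return result
```

With this in place the client can resend after a lost response without worrying that the operation runs twice.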
Even if you were to introduce a secondary part, like the service bus, there is still the same risk that the service bus goes down or, more likely, that the network connecting to the service bus goes down. So the client needs to retry, and when it has retried a number of times, all it can do is put the message in a queue of failed messages, simply log it, or throw an exception back to the original caller (in your scenario, the browser).
OK, that was me being pessimistic. But it could happen. All of the things above might happen; it's just that some of them are not very likely.
On to your questions:
1) The problem with making a stateless service stateful is that you now have to handle partitions in your caller. You can put up HTTP listeners for stateful services, but you have to include the partition and replica information in the URI, and that won't work with the load balancer, so in this case the browser has to select the partition when calling the API. Not an ideal solution.
2) Yes, you could do this, i.e. introduce something in between that queues messages for you. Nothing says that a Service Bus or a database is more reliable than a stateful service with a reliable queue; it's just up to you to go with what you are most comfortable with. I would go for a stateful service, just so I can easily keep everything within my SF application. But again, this is not 100% protection from a disgruntled janitor with scissors; for that you still need clients that can handle faults.
3) Make sure you have a way of handling errors (retry) and of logging or storing the messages that fail (after retries) in the client (Service 1).
3.a) One way would be to have it store them locally on the node it is running on and periodically (in RunAsync, for instance) try to re-run those failed messages. This might be dangerous if the node it is running on is completely nuked and loses its data, though, because that data won't be replicated.
3.b) Another would be to use semantic logging with ETW and include enough data in the events to be able to re-create the message from the log, then build some feature, a manual UI perhaps, where you can re-run it from the logged information. Much like you would retry a failed message on an error queue in a service bus.
3.c) Store the failed messages in anything else (database, service bus, queue) that doesn't fail for the same reasons as your communication with Service 2.
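The retry-then-store strategy from point 3 could look roughly like the sketch below. It is plain Python rather than Service Fabric code; `send_with_retry`, the `failed_store` list and the backoff values are assumptions chosen for illustration, and the list stands in for whichever store (3.a to 3.c) you pick.

```python
import time
from typing import Callable, List


def send_with_retry(
    send: Callable[[str], None],
    message: str,
    failed_store: List[str],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send; after `attempts` failures, park the message for later replay."""
    for attempt in range(attempts):
        try:
            send(message)
            return True
        except ConnectionError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
    failed_store.append(message)  # stand-in for a database, queue, or log
    return False
```

A background task can later walk `failed_store` and call `send_with_retry` again for each parked message.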
My main point here (and I could maybe have started with it) is that there are plenty of scenarios where only the client knows enough to handle the situation. So, make sure you have a strategy for handling faults in your clients.

Related

Why are people using a message Bus in their code - when to message vs call code

When building an application, before scaling to multiple microservices, you can have a codebase consisting of services that are decoupled, i.e. a service no longer depends on another service, not even loosely via an interface. It receives input from another service via a message bus. It has a method receivePaymentRequest, but its caller is not the Order service. It's invoked via the message bus, perhaps in the future on another server. But imagine there's no need to run multiple servers at this point.
the order service posts a payment-request event to the message bus
the payment service picks up this message
payment is completed
the payment service sends a payment-complete event message to the message bus
the order service picks up this message
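The flow above can be sketched with a tiny in-process bus. This is only an illustration in Python: `MessageBus`, `place_order` and `on_payment_complete` are hypothetical names, the topic names are the two events from the list, and the bus is synchronous and in-memory, so it shows the decoupling but none of the fault tolerance.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, object]], None]


class MessageBus:
    """Tiny synchronous in-process bus: publishers and subscribers share only topic names."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Dict[str, object]) -> None:
        for handler in list(self._handlers[topic]):
            handler(event)


class PaymentService:
    def __init__(self, bus: MessageBus) -> None:
        self._bus = bus
        bus.subscribe("payment-request", self.receive_payment_request)

    def receive_payment_request(self, event: Dict[str, object]) -> None:
        # The payment itself would happen here; then the result is announced on the bus.
        self._bus.publish("payment-complete", {"order_id": event["order_id"]})


class OrderService:
    def __init__(self, bus: MessageBus) -> None:
        self._bus = bus
        self.paid_orders: List[object] = []
        bus.subscribe("payment-complete", self.on_payment_complete)

    def place_order(self, order_id: int) -> None:
        self._bus.publish("payment-request", {"order_id": order_id})

    def on_payment_complete(self, event: Dict[str, object]) -> None:
        self.paid_orders.append(event["order_id"])
```

Neither service imports or references the other; each knows only the bus and the event names.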
I'm not thinking about the patterns that enable this to be fault tolerant, but rather about when to use this approach, since it adds a lot of complexity. So please ignore what I've left out in that regard.
Correct? Is it stupid to implement it like this before scaling to microservices? How does this. Is SOA the step before actual microservices?
When should a class receive/publish on the message bus, and when should it depend on a service as a class (even injected via an interface)?

Saga Pattern on hardware failure and inter services communication

I am building a Spring Boot microservice application. I am planning on adopting the Saga pattern to tackle the distributed transaction problem. Below is the list of questions and problems that I am facing.
Here is the context for ease of explanation.
Client -> Service A -> Service B
Handling of non-alive microservices due to failure
Assuming that Service B is not alive due to hardware / software failure, how should A react?
Async communication
It is recommended that we have async communication for saga pattern. Assuming that time for client -> A < A -> B, how does the Client receive the data that A receives from B at a later time? Is it that A has to return an Async object back to client? Something like CompletableFuture class?
Service requesting resources from other services.
Assuming that Service A has to request some resources from Service B, how should A go about doing this? All I can think of is using HTTP / gRPC (eliminated communication from message broker).
If you happened to have some experience / advice, please share :)
Any help or advice on Saga pattern is appreciated!
Saga is used for distributed transactions. It can be implemented using either orchestration or choreography. It is mostly (and preferably) implemented with asynchronous communication, so a message broker plays an important role here.
There are a lot of questions here. Let me try to answer them.
If one service is down - You can set up a monitoring system for the saga. If any service is down, or the saga has not been processed within some threshold time, you can raise an alert.
Async communication - It is mostly used to process commands (not queries). Whenever the client calls service A, it initiates the saga and replies with the current status. It also returns an id (you could call it a job id). There are two ways for the client to get the updated status. One is poll (the client asks for a status update every N seconds) and the other is push (the server pushes the change when the state changes).
Service requesting resources from another - Yes, the preferred way is REST or gRPC. Also, if the data is effectively constant, you can use a cache.
Suggestion - SRE (monitoring etc.) plays an important role in a microservice architecture. If you have set that up well, you can handle the other challenges of microservices more easily.
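The job id and poll option described under async communication might be sketched as follows. This is a minimal in-memory illustration in Python, not Spring Boot code; `SagaStarter` and the status strings `PENDING`, `COMPLETED` and `COMPENSATED` are assumptions made up for the example.

```python
import uuid
from typing import Dict


class SagaStarter:
    """Service A side: start a saga, hand back a job id, let the client poll for status."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}

    def start(self) -> str:
        job_id = str(uuid.uuid4())
        self._status[job_id] = "PENDING"   # the work continues asynchronously
        return job_id                      # returned to the client immediately

    def complete(self, job_id: str, ok: bool) -> None:
        """Called when Service B's reply arrives, e.g. from a message broker consumer."""
        self._status[job_id] = "COMPLETED" if ok else "COMPENSATED"

    def status(self, job_id: str) -> str:
        """Polled by the client every N seconds."""
        return self._status.get(job_id, "UNKNOWN")
```

The client never blocks on Service B: it gets the id at once and asks for the status until it is no longer `PENDING`.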

Microservices: how to track fallen down services?

Problem:
Suppose there are two services A and B. Service A makes an API call to service B.
After a while service A goes down or is lost due to network errors.
How will other services know that an outbound call from service A was lost / never happened? I need some other concurrent app that will automatically react (run emergency code) if service A's outbound call is lost.
What cutting-edge solutions exist?
My thoughts, for example:
service A registers a call event in some middleware (event info, "running" status, timestamp, etc).
If this call is not completed after N seconds, some "call timeout" event in the middleware automatically starts the emergency code.
If the call is completed at the proper time service A marks the call status as "completed" in the same middleware and the emergency code will not be run.
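The three steps above amount to a watchdog over in-flight calls. A minimal sketch, in Python for brevity (the structure maps directly onto Java): `CallWatchdog`, `register`, `complete` and `sweep` are hypothetical names, and the in-memory dictionary stands in for the middleware, which in practice would have to live outside service A so that it survives A going down.

```python
import time
from typing import Callable, Dict, List


class CallWatchdog:
    """Tracks in-flight calls and fires a handler for calls that exceed their timeout."""

    def __init__(self, timeout: float, on_timeout: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout
        self._on_timeout = on_timeout          # the "emergency code"
        self._clock = clock
        self._running: Dict[str, float] = {}   # call id -> start time

    def register(self, call_id: str) -> None:
        """Service A records that a call has started."""
        self._running[call_id] = self._clock()

    def complete(self, call_id: str) -> None:
        """Service A marks the call as completed, so no emergency code will run."""
        self._running.pop(call_id, None)

    def sweep(self) -> List[str]:
        """Run periodically; returns the ids that timed out in this pass."""
        now = self._clock()
        expired = [cid for cid, started in self._running.items()
                   if now - started >= self._timeout]
        for cid in expired:
            del self._running[cid]
            self._on_timeout(cid)
        return expired
```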
P.S. I'm on Java stack.
Thanks!
I recommend looking into patterns such as Retry, Timeout, Circuit Breaker, Fallback and Healthcheck. You can also look into the Bulkhead pattern if concurrent calls and fault isolation are your concern.
There are many resources where these well-known patterns are explained, for instance:
https://www.infoworld.com/article/3310946/how-to-build-resilient-microservices.html
https://blog.codecentric.de/en/2019/06/resilience-design-patterns-retry-fallback-timeout-circuit-breaker/
I don't know which technology stack you are on, but usually some functionality for these concerns is already provided that you can incorporate into your solution. There are libraries that take care of this resilience functionality, and you can, for instance, set them up so that your custom code is executed when events such as failed retries, timeouts, activated circuit breakers, etc. occur.
E.g. for the Java stack Hystrix is widely used; for .NET you can look into Polly to make use of retry, timeout, circuit breaker, bulkhead or fallback functionality.
Concerning health checks, you can look into Actuator for Java, and .NET Core already provides health check middleware that more or less provides that functionality out of the box.
But before using any libraries I suggest first getting familiar with the purpose and concepts of the listed patterns, so you can choose and integrate those that best fit your use cases and major concerns.
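To make one of the listed patterns concrete, here is a minimal circuit breaker sketch in plain Python. It is not how Hystrix or Polly implement it; `CircuitBreaker`, `CircuitOpenError`, `max_failures` and `reset_after` are names invented for the example, and it ignores thread safety.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("failing fast; downstream presumed unavailable")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

After `max_failures` consecutive failures, calls are rejected immediately for `reset_after` seconds, which is where custom fallback code would hook in.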
Update
We have to differentiate between two well-known problems here:
1.) How can service A robustly handle temporary outages of service B (or the network connection between service A and B which comes down to the same problem)?
To address the related problems the above mentioned patterns will help.
2.) How to make sure that the request that should be sent to service B will not get lost if service A itself goes down?
To address this kind of problem there are different options at hand.
2a.) The component that performed the request to service A (which then triggers service B) also applies the resilience patterns mentioned and will retry its request until service A successfully answers that it has performed its tasks (which also includes the successful request to service B).
There can also be several instances of each service and some kind of load balancer in front of these instances, which will distribute and direct the requests to an available instance (based on regularly performed health checks) of the specific service. Or you can use a service registry (see https://microservices.io/patterns/service-registry.html).
You can of course chain several API calls one after another, but this can lead to cascading failures. So I would rather go with an asynchronous communication approach as described in the next option.
2b.) Let's consider that it is of utmost importance that some instance of service A will reliably perform the request to service B.
You can use message queues in this case as follows:
Let's say you have a queue where jobs to be performed by service A are collected.
Then you have several instances of service A running (see horizontal scaling) where each instance will consume the same queue.
You will use the message locking feature of the message queue service, which makes sure that as soon as one instance of service A reads a message from the queue the other instances won't see it. If service A was able to complete its job (i.e. call service B, save some state in service A's persistence and whatever other tasks you need to include for successful processing), it will delete the message from the queue afterwards so no other instance of service A will process the same message.
If service A goes down during processing, the queue service will automatically unlock the message for you, and another instance of service A (or the same instance after it has restarted) will read the message (i.e. the job) from the queue and try to perform all the tasks (call service B, etc.).
You can combine several queues e.g. also to send a message to service B asynchronously instead of directly performing some kind of API call to it.
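The locking behaviour described above (read locks the message, delete removes it, an expired lock makes it visible again) can be modelled in a few lines. This is an in-memory sketch in Python, not the API of any particular queue service; `LockingQueue` and `lock_seconds` are assumed names.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LockingQueue:
    """Queue where a received message is hidden until it is deleted or its lock expires."""

    def __init__(self, lock_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lock_seconds = lock_seconds
        self._clock = clock
        self._messages: Dict[int, str] = {}        # message id -> body, in arrival order
        self._locked_until: Dict[int, float] = {}  # message id -> lock expiry time
        self._next_id = 0

    def send(self, body: str) -> int:
        self._next_id += 1
        self._messages[self._next_id] = body
        return self._next_id

    def receive(self) -> Optional[Tuple[int, str]]:
        """Return the oldest visible message and lock it, or None if nothing is visible."""
        now = self._clock()
        for msg_id, body in self._messages.items():
            if self._locked_until.get(msg_id, 0.0) <= now:
                self._locked_until[msg_id] = now + self._lock_seconds
                return msg_id, body
        return None

    def delete(self, msg_id: int) -> None:
        """Called by a worker after it has finished all of its tasks for the message."""
        self._messages.pop(msg_id, None)
        self._locked_until.pop(msg_id, None)
```

An instance of service A that crashes simply never calls `delete`, so the job reappears for another instance once the lock expires. That also means a job can run more than once, so the tasks should be safe to repeat.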
The point is that the queue service is a highly available and redundant service that already makes sure no message gets lost once it has been published to a queue.
Of course you could also handle the jobs to be performed in service A's own database, but consider that when service A receives a request there is always a chance that it goes down before it can save the status of the job to its persistent storage for later processing. Queue services already address that problem for you if chosen thoughtfully and used correctly.
For instance, if you look into Kafka as a messaging service, this Stack Overflow answer covers the solution to the problem with that specific technology: https://stackoverflow.com/a/44589842/7730554
There are many ways to solve your problem.
I guess you are talking about two topics: design patterns in microservices and the Circuit Breaker
https://dzone.com/articles/design-patterns-for-microservices
To solve your problem, I normally put a message queue between services and use service discovery to detect which services are alive. If a service dies or is overloaded, I use Circuit Breaker methods.

How to manage microservice failure?

Let's say I have several microservices (REST APIs). The problem is: if one service is not accessible (let's call it service "A"), the data that was being sent to service "A" will be saved in a temporary database. Once the service is working again, the data will be sent again.
Question:
1. Should I create a service that pings service "A" every 10 seconds to check whether it is working? Or is it possible to do this with a task queue? Any suggestions?
Polling is a waste of bandwidth. You want to use a transactional queue.
Throw all your outbound messages in the queue, and have some other process to handle the messages.
How this will work is - after your process reads from the queue, and tries to send to the REST service:
If it works, commit the transaction (for the queue)
If it doesn't work, don't commit. Start a delay (minutes, seconds - you know best) until you read from the queue again.
You can use the Circuit Breaker pattern, e.g. the Hystrix circuit breaker from Netflix.
It is possible to open the circuit breaker based on a timeout, or when a service call fails or the service is inaccessible.
There are multiple dimensions to your question. First, you want to consider using an infrastructure that provides resilience and self-healing. That means deploying a cluster of containers, all containing your Service A. You then put a load balancer or API gateway in front of your service to distribute calls/load. It will also periodically check the health of your service. When it detects that a container does not respond correctly, it can kill the container and start another one. This can be provided by a container infrastructure such as Kubernetes / Docker Swarm etc.
Now, this does not protect you from losing requests. In the event that a container malfunctions, there will still be a short time between the failure and the next health check in which requests may not be served. In many applications this is acceptable, and the client side will just re-request and hit another (healthy) container. If your application requires absolutely not losing requests, you will have to cache the request in, for example, an API gateway and make sure it is kept until a service has completed it (also called Circuit Breaker). An example technology would be Netflix Zuul with Hystrix. Using such a gatekeeper with built-in fault tolerance can increase the resiliency even further. As a side note, using an API gateway can also solve issues with central authentication/authorization, routing and monitoring.
Another approach to add resilience / decouple is to use a fast streaming / message queue, such as Apache Kafka, for recording all incoming messages and have a message processor process them whenever ready. The trick then is to only mark the messages as processed when your request was served fully. This can also help in scenarios where faults can occur due to large number of requests that cannot be handled in real time by the Service (Asynchronous Decoupling with Cache).
Service "A" should fire a "ready" event when it becomes available. Just listen to that and resend your request.

Web server and ZeroMQ patterns

I am running an Apache server that receives HTTP requests and connects to a daemon script over ZeroMQ. The script implements the Multithreaded Server pattern (http://zguide.zeromq.org/page:all#header-73), it successfully receives the request and dispatches it to one of its worker threads, performs the action, responds back to the server, and the server responds back to the client. Everything is done synchronously as the client needs to receive a success or failure response to its request.
As the number of users is growing into a few thousand, I am looking into potentially improving this. The first thing I looked at is the different patterns of ZeroMQ, and whether what I am using is optimal for my scenario. I've read the guide, but I find it challenging to understand all the details and differences between patterns. For example, I was looking at the Load Balancing Message Broker pattern (http://zguide.zeromq.org/page:all#header-73). It seems quite a bit more complicated to implement than what I am currently using, and if I understand things correctly, its advantages are:
Actual load balancing vs the round-robin task distribution that I currently have
Asynchronous requests/replies
Is that everything? Am I missing something? Given the description of my problem, and the synchronous requirement of it, what would you say is the best pattern to use? Lastly, how would the answer change, if I want to make my setup distributed (i.e. having the Apache server load balance the requests across different machines). I was thinking of doing that by simply creating yet another layer, based on the Multithreaded Server pattern, and have that layer bridge the communication between the web server and my workers.
Some thoughts about the subject...
Keep it simple
I would try to keep things simple and "plain" ZeroMQ for as long as possible. To increase performance, I would simply change your backend script to send requests out from a dealer socket and move the request handling code to its own program. Then you could just run multiple worker servers on different machines to get more requests handled.
I assume this was the approach you took:
I was thinking of doing that by simply creating yet another layer, based on the Multithreaded Server pattern, and have that layer bridge the communication between the web server and my workers.
The only problem here is that there is no request retry in the backend. If a worker fails to handle a given task, it is lost forever. However, one could write the worker servers so that they handle all the requests they have received before shutting down. With this kind of setup it is possible to update backend workers without clients noticing any outage. This will not save requests that get lost if the server crashes.
I have the feeling that in common scenarios this kind of approach would be more than enough.
Mongrel2
Mongrel2 seems to handle quite a few of the things you have already implemented. It might be worthwhile to check it out. It probably does not completely solve your problems, but it provides tested infrastructure for distributing the workload. This could be used to deliver the requests to multithreaded servers running on different machines.
Broker
One solution to increase the robustness of the setup is a broker. In this scenario the broker's main role would be to provide robustness by implementing a queue for the requests. I understood that all the requests the workers handle are basically of the same type. If requests had different types, the broker could also do lookups to find the correct server for each request.
Using the queue provides a way to ensure that every request is handled by some worker even if worker servers crash. This does not come without a price. The broker is itself a single point of failure. If it crashes or is restarted, all messages could be lost.
These problems can be avoided, but it requires quite a lot of work: the requests could be persisted to disk, and servers could be clustered. The need has to be weighed against the payoff. Does one want to spend time writing a message broker or the actual system?
If a message broker seems like a good idea, the time required to implement one can be reduced by using an existing product (like RabbitMQ). The downside is that there could be a lot of unwanted features, and adding new things is not as straightforward as with a self-made broker.
Writing your own broker could amount to reinventing the wheel. Many brokers provide similar things: security, logging, a management interface and so on. It seems likely that these will eventually be needed in a home-made solution too. But if not, a single home-made broker that does one thing and does it well can be a good choice.
Even if a broker product is chosen, I think it is a good idea to hide the broker behind a ZeroMQ proxy, a dedicated piece of code that sends/receives messages to and from the broker. Then no other part of the system has to know anything about the broker, and it can be easily replaced.
Using a broker is somewhat heavy on developer time. You either need time to implement the broker or time to get used to some product. I would avoid this route until it is clearly needed.
Some links
Comparison between broker and brokerless
RabbitMQ
Mongrel2

Resources