Database failure in microservice architecture - microservices

Suppose we are in a microservice architecture with:
2 microservices with API interfaces for synchronous calls
1 RDBMS with 1 DB per microservice
1 queue system for asynchronous calls
User A makes a request to an endpoint of microservice 1 using its API.
The endpoint's task is, for example, to calculate something and then put the result in a table of the microservice's DB.
How should a failure of the database during the user's request be handled?
Example:
During the request, the database crashes.
What to do then?
Return an error? But which error? 500?
But isn't the microservice architecture supposed to avoid this type of coupling?
Shall we make the system more loosely coupled?
Shall the microservice save the data in a local file or queue and retry the insert into the DB?
But what about the user? It will be impossible for them to retrieve or update the data they just created, unless the system returns the result from the local data... but that is very complex, no?
How can we achieve that?
I have the same doubt about the use of queue systems.
In an event-driven design, we have a consumer and a producer alongside the microservice.
The consumer listens to a topic on the event bus and can then insert data into its DB.
The producer is called when an action is triggered by a DB insert of its own data, and sends it to the event bus on a topic.
Imagine the event bus crashes...
If so, the consumer in the microservice will crash too.
If a DB insert occurs, the producer cannot emit the event to the event bus...
So is the data lost?
So shall the producer keep the data in its local storage for retrying?
I have turned this question over in my head many times and have not arrived at a resilient system.

Return an error? But which error? 500?
Yes, return an error. There is nothing wrong with returning an error from your micro-service. Usually "500 Internal Server Error" is the right response when you have a failure in your database. This is standard behavior for a REST API.
But isn't the microservice architecture supposed to avoid this type of coupling? Shall we make the system more loosely coupled?
I think there is a confusion here. A micro-service communicating with its own database is not considered coupling. Micro-service-A, which uses its own database micro-service-A-database, is considered one logical unit or vertical. Micro-service-A would not be of much use without its database, and vice versa. This is totally OK, and you can look at it in a similar way to a standard web application with its frontend, backend (similar to your service) and database. Coupling should be avoided across different micro-services. For example, micro-service-A should not be tightly coupled with micro-service-B or micro-service-C. Each micro-service should be atomic and as independent as possible, but not from its own database, cache or similar. You can consider the database a logical part of it.
Shall the micro-service save the data in a local file or queue and retry to insert in db?
No. It is expected that the database can fail, or at least you have to be able to deal with that possibility. From the user's perspective, you would just return error code 500. At least for most cases this would be the expected behavior. There are some special cases where you want to save the data at any cost and not lose it for that request (there are ways to deal with this as well).
But what about the user?
If it is a standard web user, they would retry a couple of times and, if the problem persists, probably come back later and try again (there is nothing wrong with returning error 500 here). If by user you mean another micro-service doing the call, then that calling micro-service has to expect that failures can happen. What to do then depends on the kind of call. For example, consider micro-service-A calling micro-service-B with an HTTP request:
GET call: here you can build a retry policy into micro-service-A, and if micro-service-B still responds with 500 after the retries, you can return an error to the user who called micro-service-A.
POST/PUT/PATCH call: here you can try something similar to the GET calls, but only if one service is involved. If micro-service-A calls micro-service-B and then micro-service-C, and one call was successful (and saved some data) while the other failed, you have to consider Sagas (if the operation should be transactional).
Imagine the event bus crashes... If so, the consumer will crash too in the microservice.
If your local micro-service database crashes, then all the other channels should fail as well. If you cannot save your entity in your local micro-service DB, why would you go further with the operation, like publishing a message to a queue? The source of truth for your data/entities is the database (at least in most cases). So if the database fails, you should throw an exception and return an error to the caller/user.
If db insert occur, the producer could not emit the event to the event bus... So data is lost? So shall the producer keep data in its local storage for retrying?
So in the case where you save your data/entity in the database and the queue is not available, you could simply save the message/event to a table in the DB and then publish it when the queue is up and running again. This is actually a very common pattern in these situations (often called the transactional outbox). For example, your implementation could be transactional:
Save the entity to its table and
save the event/message to an Event table.
This way, if one fails, the operation will be rolled back.
You can have a background worker (depending on the tech you are using for your back end) publish the messages to the queue asynchronously.
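To make the idea concrete, here is a minimal in-memory sketch. It is not a real implementation: two lists stand in for the entity table and the Event table, a lock stands in for the database transaction, and `OutboxSketch`, `saveWithEvent` and `relay` are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** In-memory stand-in for a database with an entity table and an outbox (event) table. */
final class OutboxSketch {
    private final List<String> entities = new ArrayList<>();
    private final List<String> outbox = new ArrayList<>();   // events not yet published

    /**
     * Stands in for one local DB transaction: either both rows are written or neither is.
     * A real implementation would use the database's own transaction instead of a lock.
     */
    synchronized void saveWithEvent(String entity, String event) {
        if (entity == null || event == null) {
            throw new IllegalArgumentException("nothing is written");   // "rollback"
        }
        entities.add(entity);
        outbox.add(event);
    }

    /**
     * Called periodically by a background worker. Publishes pending events in order and
     * removes each one only after the broker accepted it. Returns how many were published.
     */
    synchronized int relay(Consumer<String> broker) {
        int published = 0;
        while (!outbox.isEmpty()) {
            try {
                broker.accept(outbox.get(0));
            } catch (RuntimeException brokerDown) {
                break;              // keep the event; the next run retries it
            }
            outbox.remove(0);
            published++;
        }
        return published;
    }

    synchronized int pendingEvents() { return outbox.size(); }
    synchronized int entityCount() { return entities.size(); }

    public static void main(String[] args) {
        OutboxSketch db = new OutboxSketch();
        db.saveWithEvent("order-1", "OrderCreated:order-1");

        // Queue is down: the event stays in the outbox, the entity is still saved.
        db.relay(e -> { throw new IllegalStateException("queue unavailable"); });
        System.out.println("pending while queue is down: " + db.pendingEvents());

        // Queue is back: the pending event is delivered.
        List<String> delivered = new ArrayList<>();
        db.relay(delivered::add);
        System.out.println("delivered: " + delivered + ", pending: " + db.pendingEvents());
    }
}
```

Note that a worker like this can publish the same event twice if it stops between publishing and removing the row, so consumers should be able to tolerate duplicates.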

Related

saga pattern: what about if compensation action fails

We're trying to understand how to compensate for a "saga compensation failure".
We have two microservices and two databases, one per microservice:
Customer microservice
Contract microservice
Use case: Customer alias modification.
1. A request is sent to the "Customer microservice".
a. The customer alias is modified in the customer table, but its state is pending.
b. A customer modified event is sent.
2. The customer modified event is received by the "Contract microservice".
a. The received customer is updated on all contracts (we're using MongoDB), since customer information is embedded in each contract.
b. A contract updated event is sent.
3. The contract updated event is received by the "Customer microservice".
a. The customer's state is set to confirmed.
If 3.a fails, a compensation action is performed, but what if that fails too?
This can be handled with a combination of the approaches below:
1. Implement the Retry pattern for the compensation action.
2. Exception handling: the exception can be saved and then resolved by an automated process, such as a retry mechanism run by a separate application.
3. As an extension of approach #2: if the automated process is unable to resolve it, generate an exception report that can be reviewed manually, and take action based on the issue.
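The approaches above can be combined in a few lines. The sketch below is only an assumption of how it might look: `CompensationSketch` and its methods are hypothetical, and a plain list stands in for wherever the exception report would really be stored.

```java
import java.util.ArrayList;
import java.util.List;

final class CompensationSketch {
    /** A compensation step that may fail, e.g. "revert the customer alias". */
    interface Compensation { void run() throws Exception; }

    /** Failures that automated retries could not resolve; input for a manual report. */
    private final List<String> manualReview = new ArrayList<>();

    /** Retries the compensation; records it for manual review if every attempt fails. */
    boolean compensate(String sagaId, Compensation action, int maxAttempts) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return true;
            } catch (Exception e) {
                last = e;   // a real system would also wait/back off between attempts
            }
        }
        manualReview.add(sagaId + ": " + (last == null ? "not attempted" : last.getMessage()));
        return false;
    }

    List<String> manualReviewReport() { return new ArrayList<>(manualReview); }

    public static void main(String[] args) {
        CompensationSketch saga = new CompensationSketch();
        int[] calls = {0};
        // Fails once, then succeeds: resolved by the retry alone.
        boolean ok = saga.compensate("saga-1", () -> {
            if (++calls[0] < 2) throw new Exception("contract service unavailable");
        }, 3);
        // Always fails: ends up in the manual report.
        boolean stuck = saga.compensate("saga-2", () -> {
            throw new Exception("customer row locked");
        }, 3);
        System.out.println(ok + " " + stuck + " " + saga.manualReviewReport());
    }
}
```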
It looks like you are using the term saga but you really mean you want a transaction. If you really need a transaction, do that (you can look at solutions like https://docs.temporal.io/ for providing that).
[Personally I think transactions between services are bad, and if I need a transaction between services I try to rethink my design, but your mileage may vary.]
You didn't specify why contracts would reject the change. If there are business rules, that's one thing, but if these are "technical reasons" like availability, then the thing to do is to make sure the event is persisted and sent (e.g. with the outbox pattern on the sending side) and have the consuming service(s) handle it when they can.
If there are business rules involved then maybe it is a bad example, but I'd expect a person can still change their alias regardless, and the compensation would be keeping some of the contracts with the old alias or something along these lines.
By the way, it seems you have a design issue that causes needless temporal coupling between your services.
If the alias is important in contracts but owned by the customers service, the alias stored in the contracts should be considered as cached.
In this case the customers service can complete the update regardless of what other services do. It can fire the event, and the contracts service can complete the process when it is able to. When a contract is read, you can check whether there's a newer version of the customer and, if so, fetch it. Depending on the business requirements, you may also specify that the data is correct as of the last update.
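A rough sketch of treating the alias as cached data with a version check on read. Everything here is hypothetical (class names, the `version` counter, the `available` flag), and in practice the call to the customers service would be a remote call rather than a method call.

```java
import java.util.HashMap;
import java.util.Map;

final class CachedAliasSketch {
    /** Snapshot of customer data as of a given version. */
    static final class CustomerSnapshot {
        final String alias;
        final long version;
        CustomerSnapshot(String alias, long version) { this.alias = alias; this.version = version; }
    }

    /** Stand-in for the customers service, which owns the alias. */
    static final class CustomerService {
        private final Map<String, CustomerSnapshot> customers = new HashMap<>();
        boolean available = true;

        /** The update completes locally; no other service has to confirm it. */
        void changeAlias(String customerId, String alias) {
            CustomerSnapshot old = customers.get(customerId);
            long next = (old == null) ? 1 : old.version + 1;
            customers.put(customerId, new CustomerSnapshot(alias, next));
        }

        CustomerSnapshot latest(String customerId) {
            if (!available) throw new IllegalStateException("customers service unavailable");
            return customers.get(customerId);
        }
    }

    /** Stand-in for the contracts service, which only caches the alias. */
    static final class ContractService {
        private final Map<String, CustomerSnapshot> cached = new HashMap<>();
        private final CustomerService customers;
        ContractService(CustomerService customers) { this.customers = customers; }

        /** On read, refresh the cached copy if the owner has a newer version. */
        String aliasForContract(String customerId) {
            CustomerSnapshot mine = cached.get(customerId);
            try {
                CustomerSnapshot latest = customers.latest(customerId);
                if (latest != null && (mine == null || latest.version > mine.version)) {
                    cached.put(customerId, latest);
                    mine = latest;
                }
            } catch (IllegalStateException unavailable) {
                // Owner is down: serve the cached copy, correct as of the last update.
            }
            return mine == null ? null : mine.alias;
        }
    }

    public static void main(String[] args) {
        CustomerService customers = new CustomerService();
        ContractService contracts = new ContractService(customers);
        customers.changeAlias("c1", "Bob");
        System.out.println(contracts.aliasForContract("c1"));   // Bob
        customers.changeAlias("c1", "Robert");
        customers.available = false;
        System.out.println(contracts.aliasForContract("c1"));   // still Bob (cached)
        customers.available = true;
        System.out.println(contracts.aliasForContract("c1"));   // Robert
    }
}
```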
BASE VS ACID :
ISOLATION: As local transactions are committed while the Saga is running, their changes are already visible to other concurrent transactions, despite the possibility that the Saga will fail eventually, causing all previously applied transactions to be compensated. I.e., from the perspective of the overall Saga, the isolation level is comparable to “read uncommitted.”
Eventually other services will read those inconsistent events and make wrong decisions based on them, which increases the number of events that should never have happened at all.
In the end there will be tons of events to roll back. (How is that even possible if your system lets users do more than is allowed in the real world? Can you take back an ice cream that was sold to a kid 5 minutes ago?)

Message Based Microservices - Api Gateway Performance

I'm in the process of designing a micro-service architecture and I have a performance related question. This is what I am trying out with my design:
I have several micro-services which perform distinct actions and store those results in their own data-store.
The micro-services receive work via a message queue where they receive requests to run their process for the specific data given. The micro-services do NOT communicate with each other.
I have an API gateway which effectively has three journeys:
1) Receive a request to process data which it then translates into several messages which it puts on the queue for the micro-services to process in their own time. The processing time can be in minutes or longer (not-instant)
2) Receives a request for the status of the process, where it returns the progress of the overall process.
3) Receives a request for combined data, which is some combination of all the results from the services.
My problem lies in #3 above and the performance of this process.
Whenever this request is received, the API gateway has to put a request message onto the queue asking all the services for information; it then has to wait for all the services to reply with the latest state of their data, and then it combines this data and returns it to the caller.
This process is obviously rather slow as it has to wait for every service to respond. What is the way of speeding this up?
The only way I thought of solving this is having another aggregate service/data-store where duplicate data is stored and queried by my api gateway. I really don't like this approach as it duplicates data and is extra work/code.
What is the 'correct' and performant way of querying up-to-date data from my micro-services?
You can use these approaches for querying data across microservices:
Selective data replication
With this approach, we replicate the data needed from other microservices into the database of our microservice. The only coupling between microservices is in the data replication configuration.
Composite service layer
With this approach, you introduce composite services that aggregate data from lower-level microservices.
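As an illustration of the composite service option, the sketch below fans out to every backing service in parallel and merges the answers. The `CompositeSketch` class and the service names are invented for the example, and each `Supplier` stands in for a remote call to one microservice.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class CompositeSketch {
    /**
     * Queries every backing service in parallel and merges the answers.
     * Total latency is roughly that of the slowest service, not the sum of all.
     */
    static Map<String, String> combined(Map<String, Supplier<String>> services) {
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<String>> s : services.entrySet()) {
            pending.put(s.getKey(), CompletableFuture.supplyAsync(s.getValue()));
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<String>> p : pending.entrySet()) {
            result.put(p.getKey(), p.getValue().join());   // join() waits for that service
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Supplier<String>> services = new LinkedHashMap<>();
        services.put("orders", () -> "3 orders");       // stand-ins for remote calls
        services.put("billing", () -> "2 invoices");
        System.out.println(combined(services));
    }
}
```

A failing service makes `join()` throw, so a real composite would also decide whether to return partial data or an error.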

How to handle events processing time between services

Let's say we have two services A and B. B has a relation to A so it needs to know about the existing entities of A.
Service A publishes events every time an entity is created or updated. Service B subscribes to the events published by A and therefore knows about the entities existing in service A.
Problem: The client (UI or other microservices) creates a new entity 'a' and right away creates a new entity 'b' with a reference to 'a'. This is done without much delay, so what happens if service B has not received/handled the event from A before getting the create request with a reference to 'a'?
How should this be handled?
Service B must fail and the client should handle this and possibly do retry.
Service B accepts the entity and expects the relation to be fulfilled over time, when the expected event is received. Service B provides a state for the entity that ensures it cannot be trusted before the relation has been verified.
It is poor design that the client can/has to do these two calls in the same transaction. The design should be different. How?
Other ways?
I know that event platforms like Kafka ensure very fast event transmission, but there will always be a delay, and since this is an asynchronous process there will be a kind of race condition.
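For reference, the second option (accept the entity and verify the relation later) could look roughly like the sketch below. It is a simplified stand-in for service B; the class, the `State` values and the method names are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Stand-in for service B; ids and method names are hypothetical. */
final class PendingRelationSketch {
    enum State { PENDING, VERIFIED }

    private final Set<String> knownA = new HashSet<>();               // filled from A's events
    private final Map<String, String> referencedA = new HashMap<>();  // b id -> a id

    /** Accepts 'b' even if the event for the referenced 'a' has not arrived yet. */
    State createB(String bId, String aId) {
        referencedA.put(bId, aId);
        return stateOf(bId);
    }

    /** Handler for the "entity created" event published by service A. */
    void onACreated(String aId) {
        knownA.add(aId);
    }

    /** A 'b' can only be trusted once the 'a' it points to is known. */
    State stateOf(String bId) {
        String aId = referencedA.get(bId);
        if (aId == null) throw new IllegalArgumentException("unknown b: " + bId);
        return knownA.contains(aId) ? State.VERIFIED : State.PENDING;
    }

    public static void main(String[] args) {
        PendingRelationSketch b = new PendingRelationSketch();
        System.out.println(b.createB("b1", "a1"));   // PENDING: event for a1 not handled yet
        b.onACreated("a1");                          // the event arrives later
        System.out.println(b.stateOf("b1"));         // VERIFIED
    }
}
```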
What you're asking about falls under the general category of bridging the gap between Eventual Consistency and good User Experience which is a well-documented challenge with a distributed architecture. You have to choose between availability and consistency; typically you cannot have both.
Your example raises the question as to whether the service boundaries are appropriate. It's a common mistake to define microservice boundaries around Entities, but that's an anti-pattern. Microservice boundaries should be consistent with domain boundaries related to the business use case, not with how entities are modeled within those boundaries. Here's a good article that discusses decomposition, but the TL;DR is:
Microservices should be verbs, not nouns.
So, for example, you could have a CreateNewBusinessThing microservice that handles this specific case. But, for now, we'll assume you have good and valid reasons to have the services divided as they are.
The "right" solution in your case depends on the needs of the consuming service/application. If the consumer is an application or User Interface of some sort, responsiveness is required and that becomes your overriding need. If the consumer is another microservice, it may well be that it cares more about getting good "finalized" data rather than being responsive.
In either of those cases, one good option is a facade (aka gateway) service that lives between your client and the highly-dependent services. This service can receive and persist the request, then respond however you'd like. It can give the consumer a 200 OK (or, more conventionally for work that is still in progress, 202 Accepted) response with an endpoint to call back to check the status of the request, which is very responsive. Or it could receive a URL to use as a webhook when the response from both back-end services is complete, so it could notify the client directly. Or it could publish events of its own (it likely should). Essentially, you can tailor the facade service to serve as many consumers as needed, in the way each consumer wants to talk.
There are other options too. You can look into Task-Based UI, the Saga pattern, or even just Faking It.
I think you would like to combine the flexibility of a broker with the confirmation of a synchronous call. Both can be achieved with the RPC pattern described here:
https://www.rabbitmq.com/tutorials/tutorial-six-dotnet.html

How to rollback distributed transactions?

I have three different Spring Boot projects with separate databases: account-rest, payment-rest and gateway-rest.
account-rest : create a new account
payment-rest : create a new payment
gateway-rest : calls other endpoints
In gateway-rest there is an endpoint which calls the other two endpoints.
@GetMapping("/gateway-api")
@org.springframework.transaction.annotation.Transactional(rollbackFor = RuntimeException.class)
public String getApi()
{
    // Create the account in account-rest.
    String accountId = restTemplate.getForObject("http://localhost:8686/account", String.class);
    // Create the payment for that account in payment-rest.
    restTemplate.getForObject("http://localhost:8585/payment?accid=" + accountId, String.class);
    throw new RuntimeException("rollback everything");
}
I want to roll back the transactions and revert everything when I throw an exception in the gateway or in any other endpoint.
How can I do that?
It is impossible to roll back external dependencies that are accessed via REST or something similar.
The only thing you can do is compensate for errors; you can use a pattern like Saga.
I hope this helps.
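As a rough sketch of what compensation means for the gateway: each remote step is paired with an action that undoes it, and on failure the completed steps are undone in reverse order. `GatewaySagaSketch`, `Step` and the demo steps are hypothetical; in the real services `execute` and `compensate` would be REST calls (for example creating and then deleting the account).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class GatewaySagaSketch {
    /** One remote step plus the action that undoes it. */
    interface Step {
        void execute() throws Exception;
        void compensate();
    }

    /** Runs steps in order; on failure, compensates the completed ones in reverse order. */
    static boolean run(Step... steps) {
        Deque<Step> done = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                done.push(step);
            } catch (Exception failure) {
                while (!done.isEmpty()) {
                    done.pop().compensate();
                }
                return false;
            }
        }
        return true;
    }

    /** Builds a demo step that records what happened in the given log. */
    static Step step(final String name, final boolean fails, final List<String> log) {
        return new Step() {
            public void execute() throws Exception {
                if (fails) throw new Exception(name + " failed");
                log.add("create " + name);
            }
            public void compensate() { log.add("undo " + name); }
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = run(step("account", false, log), step("payment", true, log));
        System.out.println(ok + " " + log);   // false [create account, undo account]
    }
}
```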
You are basically doing dual persistence. That's not ideal, for two reasons:
It increases latency and thus has a direct impact on user experience.
What if one of them fails?
As the other answer pointed out, the Saga pattern is an option for issuing compensating transactions.
The other option, and the better one if you can manage it, is to avoid dual persistence by writing to only one service synchronously and then using Change Data Capture (CDC) to asynchronously update the other service. If we can design it this way, we can ensure atomicity (all or nothing), and the rollback scenario itself will probably not arise.
Refer to these two answers also, if they help:
https://stackoverflow.com/a/54676222/1235935
https://stackoverflow.com/a/54527066/1235935
By all means avoid distributed transactions or two-phase commit. It's not a good solution and creates a lot of operational overhead, locking, etc. when the transaction coordinator fails after the prepare phase and before the commit phase. Worse things happen when the transaction coordinator's data gets corrupted.
For that purpose you need an external transaction management system. It will handle the distributed transactions and commit or roll back once they have finished on all services.
Possible flow example:
A request comes in.
gateway-rest starts a distributed transaction and a local transaction, and sends a request (with the transaction id) to payment-rest. The thread holding the transaction lives until all local transactions are finished.
payment-rest knows about the global transaction and starts its own local transaction.
When all local transactions are marked as committed, the TM (transaction manager) sends a request to each service to close its local transaction, and then closes the global transaction.
In your case you can use sagas, as mentioned by many others, but they require events and are async in nature.
If you want a sync kind of API, you can do something similar to this.
First, let's take an Amazon-like example: creating an order, taking the balance out of your wallet and completing the order:
create Order in Pending state
reserveBalance in Account service for the order id
if the balance is reserved, change the Order state to Confirmed (also storing the transaction id of the reservation) and send reserveBalanceConsumed to the Account service
else change the Order state to Cancelled with the reason "not enough balance"
Now there are cases where, let's say, the balance is reserved in the account service but for some reason the order is not confirmed.
Then a periodic job could check whether there is a reserved balance for some order that is older than, let's say, 30 minutes. If the order is marked as confirmed with that transaction id, call reserveBalanceConsumed; otherwise cancel the order with the reason "some error, please try again" and mark the balance as free.
Now, these types of systems are complex to build. Use the Saga pattern in general for a simpler structure.
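The flow above can be sketched as follows. This is a deliberately simplified illustration: the order side and the account side are collapsed into one hypothetical class, the 30-minute age check is left out, and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/** Order and account sides collapsed into one class for brevity; all names are hypothetical. */
final class ReservationSketch {
    enum OrderState { PENDING, CONFIRMED, CANCELLED }

    private final Map<String, OrderState> orders = new HashMap<>();
    private final Map<String, Long> reserved = new HashMap<>();   // order id -> reserved amount
    private long freeBalance;

    ReservationSketch(long freeBalance) { this.freeBalance = freeBalance; }

    /** Create the order as PENDING and try to reserve the balance for it. */
    boolean createAndReserve(String orderId, long amount) {
        orders.put(orderId, OrderState.PENDING);
        if (amount > freeBalance) {
            orders.put(orderId, OrderState.CANCELLED);   // "not enough balance"
            return false;
        }
        freeBalance -= amount;
        reserved.put(orderId, amount);
        return true;
    }

    /** Confirm the order and mark the reservation as consumed. */
    void confirm(String orderId) {
        orders.put(orderId, OrderState.CONFIRMED);
        reserved.remove(orderId);
    }

    /** Periodic job: settle reservations that were never consumed. */
    void reconcile() {
        for (String orderId : new ArrayList<>(reserved.keySet())) {
            long amount = reserved.remove(orderId);
            if (orders.get(orderId) != OrderState.CONFIRMED) {
                orders.put(orderId, OrderState.CANCELLED);   // "some error, please try again"
                freeBalance += amount;                       // mark the balance as free
            }
        }
    }

    OrderState stateOf(String orderId) { return orders.get(orderId); }
    long freeBalance() { return freeBalance; }

    public static void main(String[] args) {
        ReservationSketch s = new ReservationSketch(100);
        if (s.createAndReserve("o1", 30)) s.confirm("o1");   // happy path
        s.createAndReserve("o2", 50);                        // "crash" before confirm
        s.reconcile();
        System.out.println(s.stateOf("o1") + " " + s.stateOf("o2") + " " + s.freeBalance());
        // CONFIRMED CANCELLED 70
    }
}
```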

How to solve two generals issue between event store and persistence layer?

Two Generals' Problem: event store and persistence layer?
I would like to understand how the industry actually deals with this problem!
Suppose microservice 1 persists object X into Database A. At the same time, so that microservice 2 can feed on the data from microservice 1, microservice 1 writes the same object X to an event store B.
Now, the question I have is: where do I write object X first?
1. Database A first and then event store B: is it fair to roll back the thread at the app level if Database A is down? Also, what would be the ideal error handling if Database A is online and has persisted object X but event store B is down?
2. What should the error handling look like if we do the reverse of point 1?
I do understand that in today's world of distributed, highly available systems, a system going down is rare. But it can happen. I want to understand what needs to be done when either the database or the event store system/cluster is down.
In general you want to avoid relying on a two-phase commit of the kind you describe.
In general (presuming an event-sourced system; I'm not sure whether that's implicit in your question or an option for you, but perhaps SqlStreamStore might be relevant in your context?), this is typically managed by having something project from a single authoritative set of events on a pull basis: each projection that has to perform an associated action against some downstream system maintains a pointer to how far it has got in projecting events from the base stream, and restarts from there if interrupted.
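A minimal sketch of such a pull-based projection with a checkpoint. The names are hypothetical, the list stands in for the authoritative event stream, and the checkpoint is kept in memory here although it would normally be persisted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class CheckpointSketch {
    private final List<String> stream;   // the single authoritative event stream
    private int checkpoint = 0;          // index of the next event to project

    CheckpointSketch(List<String> stream) { this.stream = stream; }

    /** Pulls events from the checkpoint onward; stops at the first failure so it can resume later. */
    int catchUp(Consumer<String> downstream) {
        int handled = 0;
        while (checkpoint < stream.size()) {
            try {
                downstream.accept(stream.get(checkpoint));
            } catch (RuntimeException interrupted) {
                break;   // checkpoint is not advanced, so this event is retried next time
            }
            checkpoint++;
            handled++;
        }
        return handled;
    }

    int checkpoint() { return checkpoint; }

    public static void main(String[] args) {
        CheckpointSketch projector = new CheckpointSketch(Arrays.asList("e1", "e2", "e3"));
        List<String> seen = new ArrayList<>();
        boolean[] down = {true};
        Consumer<String> downstream = e -> {
            if (down[0] && e.equals("e2")) throw new IllegalStateException("downstream unavailable");
            seen.add(e);
        };
        projector.catchUp(downstream);   // handles e1, stops at e2
        down[0] = false;
        projector.catchUp(downstream);   // resumes at e2 without replaying e1
        System.out.println(seen + " checkpoint=" + projector.checkpoint());
        // [e1, e2, e3] checkpoint=3
    }
}
```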
First of all, an Event store is a type of persistence, which stores the application's state as a series of events, as opposed to a flat persistence that stores the last projected state.
Suppose microservice 1 persists object X into Database A. At the same time, so that microservice 2 can feed on the data from microservice 1, microservice 1 writes the same object X to an event store B.
You are trying to have two sources of truth that must be kept in sync by some sort of distributed transaction, which is not very scalable.
This is an unusual mode of using an Event store. In general an Event store is the canonical source of information, the single source of truth. You are trying to use it as a communication channel. The Event store is the persistence of an event-sourced Aggregate (see Domain Driven Design).
I see two options:
you could refactor your architecture and make the object X an event-sourced entity that has the Event store as its persistence. Then have a Read-model subscribe to the Event store and build a flat representation of the object X that is persisted in Database A. In other words, write first to the Event store and then to Database A (but in an eventually consistent manner!). This is a big jump, and you should really think about whether you want to go event-sourced.
you could use CQRS without Event sourcing. This means that after every modification, the object X emits one or more Domain events, that are persisted in the Database A in the same local transaction as the object X itself. The microservice 2 could subscribe to the Database A to get the emitted events. The actual subscribing depends on the type of database.
I have a feeling you are using the event store as a channel of communication instead of using it as a database. If you want micro-service 2 to feed on the data from micro-service 1, then the services should communicate over REST.
Of course, relying on REST services might make you less resilient to outages. In that case, using a piece of technology dedicated to communication would be the right way to go. (I'm thinking MQ/Topics, such as RabbitMQ, Kafka, etc.)
Then, once your services are talking to each other, you will still need to persist your data... but only at one single location.
Therefore, you will need to define where you want to store the data.
Ask yourself:
Who will have governance of the data persistence?
Is it Microservice1? If so, then every time Microservice2 needs to read the data, it will make a REST call to Microservice1.
Is it the other way around? Microservice2 has governance of the data, and Microservice1 consumes it?
It could be a third microservice that you haven't even created yet. It depends how you applied your separation of concerns.
Let's take an example:
Microservice1's responsibility is to process our data to export them in PDF and other formats
Microservice2's responsibility is to expose a service for a legacy partner, that requires our data to be returned in a very proprietary representation.
Who is going to store the data here?
Microservice1 should not be the one to persist the data: its job is only to convert the data to other formats. If it requires some data, it will fetch them from the one having governance of the data.
Microservice2 should not be the one to persist the data. After all, maybe we have a number of other Microservices similar to this one, but for other partners, with different proprietary formats.
If there is a service where you can do CRUD operations, this is your guy. If you don't have such a service, maybe you can find an existing Microservice who wouldn't have conflicting responsibilities.
For instance: suppose I have a Microservice3 that makes sure that every time my ObjectX is changed, it sends a PDF representation of it to some address and notifies all my partners that the data are out of date. In that scenario, this Microservice looks like a good candidate to become the "governor of the data" for this part of the domain, and to be the one-stop shop for writing/reading in the database.

Resources