I have a doubt related to microservices. Suppose there are 5 microservices, let's say M1, M2, M3, M4 and M5. There are 4 databases which are connected to/accessed by 4 of the microservices.
For example, M2 is connected to MySQL, M3 to Cassandra, M4 to MongoDB and M5 to Oracle.
Now
Step-1: M1 makes a call to M2 to update some user data in MySQL; the update succeeds and M1 gets a success response from M2.
Step-2: M1 makes a call to M3 to update some data in Cassandra; the update succeeds and M1 gets a success response from M3.
Step-3: M1 makes a call to M4 to update some data in MongoDB, and it fails due to a DB server problem or some other problem.
Here my requirement is: I want to roll back the DB changes that were made by the previous microservices (M2 and M3).
What do we need to do to achieve this kind of rollback scenario?
This is a typical case of a distributed transaction. Regardless of whether you use a different database technology for each service or the same one on different servers, you are performing an operation that is transactional.
To handle a rollback in that type of transaction you cannot rely on the databases' own transaction and rollback mechanisms. You have to do it on your own.
Saga Pattern
A common solution for distributed transaction scenarios in a microservice architecture is the Saga pattern.
Distributed sagas are a pattern for managing failures in scenarios like the one you have described.
Sagas are created based on a business process, for example "Buy a Product in an online shop". This process can involve multiple actions on multiple microservices. The saga controls and manages the execution of this process, and if one of the steps fails it triggers actions to revert the actions done before the failing one.
There are multiple ways to implement sagas. It depends on your architecture and the way your micro-services communicate with each other. Do you use Commands and/or Events?
Example
"Buy a Product in online shop" business process. Lets say this business process has 3 simple steps done by 3 different micro-services:
Action 1 - Reserve Product in products-inventory-micro-service
Action 2 - Validate payment in payment-micro-service
Action 3 - Order a product in orders-micro-service
Using Events:
You can publish events to perform some action (or actions), and if one of the actions fails you can publish a revert (or delete) event for it. For the above business process, let's say Action 1 succeeded and Action 2 failed. In order to roll back Action 1 you would publish an event like "RemoveReservationFromProduct", which removes the reservation and reverts the state back to what it was before the transaction for that business process started. This event would be picked up by an event handler which would go and revert that state in your database. Since it is an event, you can implement a retry mechanism for failures or just reapply it later if there is some bug in the code.
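To make this concrete, here is a minimal Java sketch of event-based compensation. The EventBus, event records and repository below are made-up placeholders for illustration, not a specific framework's API:

// All types here (EventBus, the event records, ReservationRepository) are
// hypothetical stand-ins; wire them to your actual messaging infrastructure.
interface EventBus { void publish(Object event); }
interface ReservationRepository { void deleteByOrderAndProduct(String orderId, String productId); }

record PaymentFailedEvent(String orderId, String productId) {}
record RemoveReservationFromProduct(String orderId, String productId) {}

class ProductReservationSaga {
    private final EventBus eventBus;
    ProductReservationSaga(EventBus eventBus) { this.eventBus = eventBus; }

    // Called when Action 2 (payment validation) reports a failure.
    void onPaymentFailed(PaymentFailedEvent failure) {
        // Publish the compensating event; a handler in the
        // products-inventory-micro-service reverts the reservation.
        eventBus.publish(new RemoveReservationFromProduct(failure.orderId(), failure.productId()));
    }
}

// Handler on the products-inventory side.
class RemoveReservationHandler {
    private final ReservationRepository reservations;
    RemoveReservationHandler(ReservationRepository reservations) { this.reservations = reservations; }

    // Idempotent: deleting an already-removed reservation is a no-op,
    // so the event can safely be retried after transient failures.
    void handle(RemoveReservationFromProduct event) {
        reservations.deleteByOrderAndProduct(event.orderId(), event.productId());
    }
}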
Using commands:
If you make direct calls to your microservices as commands, using some kind of REST API, you can call delete or update endpoints to revert the changes you have made. For the above business process, let's say Action 1 succeeded and Action 2 failed. In order to roll back Action 1 you would call the delete API to delete the reservation for the particular product, removing the reservation and reverting the state back to what it was before the transaction for that business process started.
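A rough sketch of the command-style rollback, assuming a hypothetical /reservations endpoint on the inventory service (the URL, host and path are illustrative, not a real API):

import org.springframework.web.client.RestTemplate;

// Compensation by direct REST call; the endpoint and host are assumptions.
class ReservationCompensator {
    private final RestTemplate restTemplate = new RestTemplate();

    // Called when Action 2 (payment validation) fails: undo Action 1 by
    // deleting the reservation that was created for this order.
    void rollbackReservation(String productId, String orderId) {
        restTemplate.delete(
            "http://products-inventory-service/reservations/{productId}/{orderId}",
            productId, orderId);
    }
}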
You can take a look at this example of how to implement the Saga pattern.
From what I understand, a Saga is what you are looking for.
The idea is to provide, for every state-altering operation, an undo operation that has to be called if things go wrong downstream.
You can make sure that you have @Transactional enabled on this entire sequence of invocations.
Consider the invocation of all the microservices from M1 as a single transaction.
Expose a rollback in the following way:
While updating the DB in M2, M3 and M4, place the values in a Spring cache as well as in the DB.
Upon invoking /rollback in M2, M3 or M4, get the values from the Spring cache and undo them in the DB.
In the fallbackMethod of the Hystrix command, when M1 gets an error or some default output as a reply, invoke /rollback on the other services.
This may not be a perfect solution, as it introduces another point of failure in the /rollback handling, but it is the fastest one that can be implemented.
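A minimal sketch of what such a /rollback endpoint could look like, assuming the previous value was put into a Spring cache named "pre-update" before the update; the User type, cache name and repository are illustrative assumptions:

import org.springframework.cache.Cache;
import org.springframework.cache.CacheManager;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

record User(String id, String name) {}
interface UserRepository { void save(User user); }

@RestController
class RollbackController {
    private final CacheManager cacheManager;
    private final UserRepository users;

    RollbackController(CacheManager cacheManager, UserRepository users) {
        this.cacheManager = cacheManager;
        this.users = users;
    }

    // The normal update path is assumed to store the old state first:
    //   cacheManager.getCache("pre-update").put(userId, oldUser);

    @PostMapping("/rollback/{userId}")
    void rollback(@PathVariable String userId) {
        Cache cache = cacheManager.getCache("pre-update");
        User old = cache != null ? cache.get(userId, User.class) : null;
        if (old != null) {
            users.save(old);     // restore the pre-update state in the DB
            cache.evict(userId); // rollback handled, drop the cached copy
        }
    }
}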
To answer your question, let's add some business requirements.
Case 1. M1 is doing all the interaction with the other microservices based on a received event, like "Order Placed".
Now in this case, for the M2 ... M5 updates:
Requirement 1: all of them are independent of each other.
First create 5 events from the one event, and then:
In such a case you could add each event to a table, mark it as unprocessed, and have a timer read the unprocessed events and try to do all the tasks in an idempotent way; you could also have reporting when such tasks keep failing, so your team can check them and resolve them manually (a sketch follows after this case).
(You could implement similar logic by using a failover queue, which sends the same event back to the original queue after some time.)
Requirement 2: they are not all independent.
Use a single event, and still the same solution.
The main benefit of the above solution is that even if your system restarts in between the transactions, you will always eventually have a consistent system.
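Here is the sketch mentioned above for Case 1: a hypothetical event table plus a timer that retries unprocessed events idempotently. The table layout and interfaces are assumptions, and scheduling requires @EnableScheduling in a Spring app:

import java.util.List;
import org.springframework.scheduling.annotation.Scheduled;

record EventRecord(long id, String payload) {}

// Hypothetical persistence for the event table.
interface EventTable {
    List<EventRecord> findUnprocessed();
    void markProcessed(long id);
}

class EventRetryJob {
    private final EventTable events;
    EventRetryJob(EventTable events) { this.events = events; }

    // The timer re-reads unprocessed events; every task must be idempotent
    // so a retry after a crash or restart cannot apply an update twice.
    @Scheduled(fixedDelay = 30_000)
    void processPending() {
        for (EventRecord event : events.findUnprocessed()) {
            try {
                dispatchToMicroservices(event);  // the M2..M5 updates
                events.markProcessed(event.id());
            } catch (Exception e) {
                // leave it unprocessed: it will be retried on the next run,
                // and repeated failures can be surfaced in a report
            }
        }
    }

    private void dispatchToMicroservices(EventRecord event) { /* calls to M2..M5 */ }
}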
Case 2. The M1 API is invoked, and M1 needs to do tasks in multiple microservices and then give a response to the user.
We could create a "started" event in the M1 microservice DB (sync_event_table).
Try to do the update in all the microservices.
After all of them complete, update the sync event table row as completed.
For those cases which have not completed, run a timer which checks for jobs that have not completed for more than X minutes and then does the undo actions or whatever is required.
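A sketch of that Case 2 timer: jobs started more than X minutes ago that never reached the completed state are rolled back. The sync_event_table access and the undo call are illustrative names, not a prescribed schema:

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import org.springframework.scheduling.annotation.Scheduled;

record SyncEvent(long id, Instant startedAt) {}

// Hypothetical access to sync_event_table.
interface SyncEventTable {
    List<SyncEvent> findIncompleteStartedBefore(Instant cutoff);
    void markRolledBack(long id);
}

class StaleJobSweeper {
    private static final Duration MAX_AGE = Duration.ofMinutes(10); // the "X min"
    private final SyncEventTable table;

    StaleJobSweeper(SyncEventTable table) { this.table = table; }

    @Scheduled(fixedDelay = 60_000)
    void sweep() {
        Instant cutoff = Instant.now().minus(MAX_AGE);
        for (SyncEvent stale : table.findIncompleteStartedBefore(cutoff)) {
            undoMicroserviceUpdates(stale); // revert whatever was applied
            table.markRolledBack(stale.id());
        }
    }

    private void undoMicroserviceUpdates(SyncEvent event) { /* undo calls */ }
}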
Essence:
So, as you can see, all the solutions suggest turning all the different microservice updates into:
creating a job
checking the job status
writing an undo/redo job feature
Related
We're trying to understand how to compensate for a "saga compensation failure".
We have two microservices, and two databases, one per microservice.
Customer microservice
Contract microservice
Use case: Customer alias modification.
1. A request is sent to the "Customer microservice".
a. The customer alias is modified in the customer table, but its state is pending.
b. A "customer modified" event is sent.
2. The "customer modified" event is received by the "Contract microservice".
a. The received customer is updated on all contracts (we're using MongoDB), since customer information is embedded in each contract.
b. A "contract updated" event is sent.
3. The "contract updated" event is received by the "Customer microservice".
a. The customer's state is set to confirmed.
If 3.a fails, a compensation action is performed; but what if the compensation itself fails?
This can be handled with a combination of the approaches below:
1. Implement the Retry pattern for the compensating action.
2. Exception handling: the exception can be saved and then resolved through an automated process, like a retry mechanism running as a separate application.
3. As an extension of approach 2, if the automated process is unable to resolve it, generate an exception report which can be reviewed manually and acted on based on the issue.
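To make approaches 1 to 3 concrete, here is a minimal Java sketch; the failure store, the attempt count and the method names are assumptions, not a prescribed design:

// Hypothetical store that feeds the automated resolver / manual report.
interface FailedCompensationStore { void save(String actionId, Exception cause); }

class CompensationRetrier {
    private static final int MAX_ATTEMPTS = 3;
    private final FailedCompensationStore failures;

    CompensationRetrier(FailedCompensationStore failures) { this.failures = failures; }

    // Approach 1: retry the compensating action a few times.
    // Approaches 2/3: on exhaustion, persist the failure so a separate
    // application can retry it, and unresolved cases end up in a report.
    void compensateWithRetry(String actionId, Runnable compensation) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                compensation.run();
                return; // compensation succeeded
            } catch (Exception e) {
                if (attempt == MAX_ATTEMPTS) {
                    failures.save(actionId, e);
                }
            }
        }
    }
}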
It looks like you are using the term saga but you really mean you want a transaction. If you really need a transaction, do that (you can look at solutions like https://docs.temporal.io/ for providing that).
[Personally I think transactions between services are bad, and if I need a transaction between services I try to rethink my design, but your mileage may vary.]
You didn't specify the reason why contracts would reject the change. If there are business rules, that's one thing; but if these are "technical reasons" like availability etc., then the thing to do is to make sure the event is persistent and was sent (e.g. with the outbox pattern on the sending side) and have the consuming service(s) handle it when they can.
If there are business rules involved then maybe it is a bad example, but I'd expect a person can still change their alias regardless, and the compensation would be keeping some of the contracts with the old alias or something along these lines.
By the way, it seems you have a design issue that causes needless temporal coupling between your services.
If the alias is important in contracts but owned by the customers service, the alias stored in the contracts should be considered a cache.
In this case the customers service can close the update regardless of what other services do. It can fire the event, and the contracts service can complete the process when it can. When a contract is read, you can check if there's a newer version of the customer and, if so, get it. You may also (depending on the business requirements) specify that the data is correct as of the last update.
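A small sketch of that read-time check, with made-up Customer/Contract shapes and a version field standing in for whatever freshness marker you use:

// Illustrative types; the version field is the assumed freshness marker.
record Customer(String id, long version, String alias) {}
record Contract(String id, String customerId, long customerVersion, String customerAlias) {}

interface CustomerClient { Customer fetch(String customerId); }

class ContractReadService {
    private final CustomerClient customers;
    ContractReadService(CustomerClient customers) { this.customers = customers; }

    // On read, refresh the cached alias if the customers service has moved on.
    Contract read(Contract stored) {
        Customer current = customers.fetch(stored.customerId());
        if (current.version() > stored.customerVersion()) {
            // the embedded alias is stale; return (and optionally persist) the fresh one
            return new Contract(stored.id(), stored.customerId(),
                    current.version(), current.alias());
        }
        return stored;
    }
}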
BASE vs ACID:
ISOLATION: As local transactions are committed while the Saga is running, their changes are already visible to other concurrent transactions, despite the possibility that the Saga will fail eventually, causing all previously applied transactions to be compensated. I.e., from the perspective of the overall Saga, the isolation level is comparable to “read uncommitted.”
Eventually, other services will read those inconsistent events; they will also make wrong decisions based on them, and they will produce more events that should never have happened at all.
In the end there will be tons of events to roll back. (How is that possible, if your system lets users do more than is allowed in the real world? Can you take back an ice cream from a kid when it was sold five minutes ago?)
I have three different Spring Boot projects with separated databases, e.g. account-rest, payment-rest and gateway-rest.
account-rest : create a new account
payment-rest : create a new payment
gateway-rest : calls other endpoints
At gateway-rest there is an endpoint which calls the other two endpoints.
@GetMapping("/gateway-api")
@org.springframework.transaction.annotation.Transactional(rollbackFor = RuntimeException.class)
public String getApi() {
    String accountId = restTemplate.getForObject("http://localhost:8686/account", String.class);
    restTemplate.getForObject("http://localhost:8585/payment?accid=" + accountId, String.class);
    throw new RuntimeException("rollback everything");
}
I want to roll back the transactions and revert everything when I throw an exception at the gateway or at any other endpoint.
How can I do that?
It is impossible to roll back external dependencies that are accessed via REST or anything like that.
The only thing that you can do is compensate for the errors; you can use a pattern like Saga.
I hope this can help you.
You are basically doing dual persistence. That's ideally not a good thing, for 2 reasons:
It increases the latency and thus has a direct impact on user experience.
What if one of the writes fails?
As the other answer pointed out, the Saga pattern is an option for posting compensating transactions.
The other option, and it's better to go with this by all means, is to avoid dual persistence by writing to only one service synchronously and then using Change Data Capture (CDC) to asynchronously update the other service. If we design it this way, we can ensure atomicity (all or nothing), and thus the rollback scenario itself will probably never surface.
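One common way to get that single synchronous write is the outbox variant of CDC: the entity and an outbox row are committed in one local transaction, and a CDC pipeline (or a simple poller) publishes the outbox row to the other service afterwards. A sketch, with the repositories as made-up interfaces:

import org.springframework.transaction.annotation.Transactional;

record OutboxMessage(String aggregateId, String type, String payload) {}
interface AccountRepository { void save(String accountId, String data); }
interface OutboxRepository { void save(OutboxMessage message); }

class AccountService {
    private final AccountRepository accounts;
    private final OutboxRepository outbox;

    AccountService(AccountRepository accounts, OutboxRepository outbox) {
        this.accounts = accounts;
        this.outbox = outbox;
    }

    // One local ACID transaction: either both rows commit or neither does,
    // so the downstream update can never be lost, only delayed.
    @Transactional
    public void createAccount(String accountId, String data) {
        accounts.save(accountId, data);
        outbox.save(new OutboxMessage(accountId, "AccountCreated", data));
    }
}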
Refer to these two answers also, if they help:
https://stackoverflow.com/a/54676222/1235935
https://stackoverflow.com/a/54527066/1235935
By all means avoid distributed transactions and 2-phase commit. It's not a good solution and creates a lot of operational overhead, locking etc. when the transaction coordinator fails after the prepare phase and before the commit phase. Worse things happen when the transaction coordinator gets its data corrupted.
For that purpose you need an external transaction management system. It will handle distributed transactions and commit/rollback when they are finished on all services.
Possible flow example:
A request comes in.
gateway-rest starts a distributed transaction plus a local transaction and sends a request (with the transaction id) to payment-rest. The thread holding the transaction lives until all local transactions are finished.
payment-rest learns about the global transaction and starts its own local transaction.
When all local transactions are marked as committed, the TM (transaction manager) sends a request to each service to close its local transaction, and closes the global transaction.
In your case you can use sagas, as mentioned by many others, but they require events and are async in nature.
If you want a sync kind of API, you can do something similar to this.
First, let's take an example like Amazon's: creating an order, taking the balance out of your wallet and completing the order:
Create the Order in a Pending state.
Call reserveBalance in the Account service for the order id.
If the balance was reserved, change the Order state to Confirmed (also keeping the transaction id for the reservation) and report reserveBalanceConsumed to the Account service.
Else, change the Order state to Cancelled with the reason "not enough balance".
Now there are cases where, let's say, the account service balance is reserved but for some reason the order is never confirmed.
Then something could periodically check: if there is a reserved balance for some order and it is older than, say, 30 minutes, check whether that order is marked as confirmed with that transaction id; if it is, call reserveBalanceConsumed, else cancel that order with the reason "some error, please try again" and mark the balance as free.
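A sketch of that flow; the repository and account-service client are illustrative interfaces, and the state names mirror the steps above:

// Hypothetical interfaces mirroring the steps above.
interface OrderRepository {
    void createPending(String orderId);
    void confirm(String orderId, String reserveTxId);
    void cancel(String orderId, String reason);
}
interface AccountClient {
    String reserveBalance(String orderId, long amount); // returns tx id, or null if insufficient
    void reserveBalanceConsumed(String txId);
}

class OrderService {
    private final OrderRepository orders;
    private final AccountClient account;

    OrderService(OrderRepository orders, AccountClient account) {
        this.orders = orders;
        this.account = account;
    }

    void placeOrder(String orderId, long amount) {
        orders.createPending(orderId);                          // 1. order in pending state
        String txId = account.reserveBalance(orderId, amount);  // 2. reserve balance
        if (txId != null) {
            orders.confirm(orderId, txId);                      // 3. confirm, keep the tx id
            account.reserveBalanceConsumed(txId);
        } else {
            orders.cancel(orderId, "not enough balance");       // 4. cancel with reason
        }
    }
}

The periodic reservation check described above would then clean up any order that crashed between the reserve and confirm steps.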
NOW, THESE TYPES OF SYSTEMS ARE COMPLEX TO BUILD. Use the Saga pattern in general for a simpler structure.
As I understand it, in event sourcing, events are recorded. However, that would also mean a state change first happened and thereafter we record the event. For example, assuming:
1. A client sends a command to a server to "Create user".
2. The server validates the command and creates the user, i.e. stores the new user in a database.
3. The server then logs/stores a "Created User" event, i.e. event sourcing.
4. The "Created User" event is propagated to subscribers.
In the scenario above, how do we handle cases where step (2) succeeded but step (3) failed due to, say, a network failure, the database being offline, etc.? The whole system would be in an indeterminate state, now that a new user was created but the event was never logged. How do we mitigate these types of failures? Or are the steps I've listed above not the way to do event sourcing?
Thanks!
This is not exactly what happens in event sourcing, not even in plain CQRS.
In event sourcing, after the command is validated, the domain events are generated by the source (the Aggregate in DDD) and are appended to the event store as the first step. Only after that do the subscribers (read models, projections, sagas, external systems) receive and process the new domain events.
In CQRS, after the domain events are generated, they are applied to the Aggregate, and then the Aggregate's state and the new events are persisted in the same local transaction as the first step. Only after that do the subscribers receive the events.
So you see, your situation cannot happen: steps 2 and 3 are persisted atomically; they succeed or fail together.
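In code, the ordering this answer describes looks roughly like the sketch below; the event store and publisher interfaces are placeholders, not a specific event-sourcing library:

import java.util.List;

record UserCreated(String userId, String name) {}
interface EventStore { void append(String streamId, List<Object> events); }
interface Publisher { void publish(List<Object> events); }

class CreateUserHandler {
    private final EventStore eventStore;
    private final Publisher publisher;

    CreateUserHandler(EventStore eventStore, Publisher publisher) {
        this.eventStore = eventStore;
        this.publisher = publisher;
    }

    void handle(String userId, String name) {
        // 1. validate the command, then let the aggregate emit its events
        List<Object> events = List.of(new UserCreated(userId, name));
        // 2. appending to the event store IS the state change; there is no
        //    separate "create user row" write that could succeed on its own
        eventStore.append("user-" + userId, events);
        // 3. only afterwards do subscribers (projections, other services) see them
        publisher.publish(events);
    }
}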
Let's say that we have microservice A (MS A) and microservice B (MS B).
MS B has data about products. MS A needs the product names from MS B.
Each time a product is added, updated or deleted, MS B puts a message on a message queue.
MS A is subscribed to that queue, so it can update its own internal state.
Now my question:
How do we fill the internal state of MS A when we deploy it to production the first time?
I couldn't find any documentation about the pros and cons of the possible solutions.
I could think of:
Export/import on database level.
Pros: not much work.
Cons: can miss data if changes are made to the data during the export/import.
Implement calls for GetData and GetDataChangedSince
Pros: failsafe
Cons: a lot of work
Are there any other options? Are there any other pros/cons?
You could use the following workflow:
Prepare microservice B to stop pushing events to the queue (or stop it, if it is already pushing); instead, it pushes to a circular buffer (a buffer that is overwritten when full) and waits for a signal from microservice A's side.
Deploy microservice A to the production servers, but don't reference it from anywhere; it just runs, waiting for events in the queue.
Run a script that gets all the product names from microservice B and pushes them into the queue as simulated events; when it finishes with the product names, it signals microservice B (optionally telling it the date, sequence number or whatever de-duplication technique you have for detecting duplicate events).
Microservice B then copies into the queue the events from the buffer that are newer than the last one pushed by the script (or it finds out itself from the queue which is the last one), then ignores the buffer and continues to work as normal.
It sounds like there is a service/API call missing from your architecture. Moving a service into production should be no different from recovering from a failure and should not require any additional steps. Perhaps the messages should be consumed from the queue by another service that can then be queried for the complete list of products.
We are using microservices, CQRS and an event store (with the Node.js cqrs-domain package); everything works like a charm, and the typical flow goes like:
REST->2. Service->3. Command validation->4. Command->5. aggregate->6. event->7. eventstore(transactional Data)->8. returns aggregate with aggregate ID-> 9. store in microservice local DB(essentially the read DB)-> 10. Publish Event to the Queue
The problem with the flow above is that the transactional-data save (i.e. persistence to the event store) and the write to the microservice's read DB happen in different transaction contexts. If there is any failure at step 9, how should I handle the event, which has already been persisted to the event store, and the aggregate, which has already been updated?
Any suggestions would be highly appreciated.
You retry it later.
The "book of record" is the event store. The downstream views (the "published events", the read models) are derived from the book of record. They are typically behind the book of record in time (eventual consistency) and are not typically synchronized with each other.
So you might have, at some point in time, 105 events written to the book of record, but only 100 published to the queue, and a representation in your service database constructed from only 98.
Updating a view is typically done in one of two ways. You can, of course, start with a brand new representation and replay all of the events into it as part of each update. Alternatively, you track in the metadata of the view how far along in the event history you have already gotten, and use that information to determine where the next read of the event history begins.
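The second option, sketched in Java (the question's stack is Node.js, so treat this purely as typed pseudocode; the interfaces are made up):

import java.util.List;

record StoredEvent(long position, Object payload) {}
interface EventLog { List<StoredEvent> readFrom(long position, int batchSize); }
interface ViewCheckpoint { long get(); void set(long position); }

class ViewUpdater {
    private final EventLog log;
    private final ViewCheckpoint checkpoint;

    ViewUpdater(EventLog log, ViewCheckpoint checkpoint) {
        this.log = log;
        this.checkpoint = checkpoint;
    }

    // Reads only the events the view has not seen yet, e.g. events 99..105
    // when the store holds 105 and the view's checkpoint is at 98.
    void update() {
        for (StoredEvent event : log.readFrom(checkpoint.get(), 100)) {
            applyToView(event.payload());     // idempotent projection logic
            checkpoint.set(event.position()); // remember how far we got
        }
    }

    private void applyToView(Object payload) { /* update the read model */ }
}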
Inside your event store, you could track whether read-side replication was successful.
As soon as step 9 succeeds, you can flag the event as 'replicated'.
That way, you could introduce a component watching for unreplicated events and triggering step 9 again. You could also track whether the replication failed multiple times.
Updating the read side (step 9) and flagging an event as replicated should happen consistently. You could use a saga pattern here.
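A sketch of such a watcher, again in Java with made-up interfaces (the real implementation would live wherever your event store's metadata lives):

import java.util.List;
import org.springframework.scheduling.annotation.Scheduled;

// Hypothetical metadata store for the 'replicated' flag.
interface EventFlagStore {
    List<Long> findUnreplicated();
    void markReplicated(long eventId);
    void incrementFailureCount(long eventId);
}

class ReplicationWatcher {
    private final EventFlagStore flags;
    ReplicationWatcher(EventFlagStore flags) { this.flags = flags; }

    @Scheduled(fixedDelay = 15_000)
    void retryUnreplicated() {
        for (long eventId : flags.findUnreplicated()) {
            try {
                replicateToReadDb(eventId);    // re-run step 9 for this event
                flags.markReplicated(eventId);
            } catch (Exception e) {
                flags.incrementFailureCount(eventId); // track repeated failures
            }
        }
    }

    private void replicateToReadDb(long eventId) { /* update the read-side DB */ }
}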
I think I have now understood it to a better extent.
The aggregate would still be created. The answer is that all the validations for any type of consistency should happen before my aggregate is constructed; what needs handling is a failure beyond the purview of the code, i.e. one that occurs while updating the read-side DB of the microservice.
So in the ideal case the aggregate would be created; however, the associated event would remain undispatched until all the read dependencies are updated. If they are not, it remains undispatched, and that can be handled separately.
The event store will still have all the events, and eventual consistency is maintained this way.