I have three different Spring Boot projects with separate databases, e.g. account-rest, payment-rest, and gateway-rest.
account-rest: creates a new account
payment-rest: creates a new payment
gateway-rest: calls the other endpoints
In gateway-rest there is an endpoint that calls the other two endpoints.
@GetMapping("/gateway-api")
@org.springframework.transaction.annotation.Transactional(rollbackFor = RuntimeException.class)
public String getApi() {
    // Calls account-rest, then payment-rest with the returned account id
    String accountId = restTemplate.getForObject("http://localhost:8686/account", String.class);
    restTemplate.getForObject("http://localhost:8585/payment?accid=" + accountId, String.class);
    // Intended to roll back everything, including the two remote calls
    throw new RuntimeException("rollback everything");
}
I want to roll back the transactions and revert everything when I throw an exception at the gateway or any other endpoint.
How can I do that?
It is impossible to roll back external dependencies accessed via REST or anything like that.
The only thing you can do is compensate for errors, using a pattern like Saga.
I hope that helps.
You are basically doing dual persistence. That's generally not a good thing, for two reasons:
It increases latency and thus has a direct impact on user experience.
What if one of them fails?
As the other answer pointed out, the Saga pattern is an option for issuing compensating transactions.
The other option, and it's better to go with this by all means, is to avoid dual persistence: write to only one service synchronously and then use Change Data Capture (CDC) to asynchronously update the other service. If we can design it this way, we get atomicity (all or nothing), and the rollback scenario itself will probably never surface.
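For illustration, here is a minimal sketch of the single-synchronous-write side in Spring; Account and AccountRepository are hypothetical names, and the CDC pipeline (e.g. Debezium tailing the accounts table) lives outside the application and is not shown.

import org.springframework.transaction.annotation.Transactional;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class GatewayController {

    private final AccountRepository accountRepository; // hypothetical JPA repository

    public GatewayController(AccountRepository accountRepository) {
        this.accountRepository = accountRepository;
    }

    @PostMapping("/gateway-api")
    @Transactional
    public String createAccount() {
        // Single synchronous write: if this fails, nothing else has happened yet.
        Account account = accountRepository.save(new Account());
        // payment-rest is updated asynchronously by the CDC pipeline, so there is
        // no second synchronous write here that could need a rollback.
        return account.getId();
    }
}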
Refer to these two answers also, if they help:
https://stackoverflow.com/a/54676222/1235935
https://stackoverflow.com/a/54527066/1235935
By all means avoid distributed transactions and two-phase commit. They are not a good solution and create a lot of operational overhead, locking, etc., for example when the transaction coordinator fails after the prepare phase and before the commit phase. Worse things happen when the transaction coordinator's data gets corrupted.
For that purpose you need an external transaction management system. It will handle distributed transactions and commit or roll back once the work is finished on all services.
Possible flow example:
A request comes in.
gateway-rest starts a distributed (global) transaction and a local transaction and sends a request (with the transaction id) to payment-rest; see the sketch after this list. The thread holding the transaction lives until all local transactions are finished.
payment-rest learns about the global transaction and starts its own local transaction.
When all local transactions are marked as committed, the TM (transaction manager) sends a request to each service to close its local transaction and then closes the global transaction.
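A hypothetical sketch of propagating the global transaction id; the "X-Global-Tx-Id" header name is illustrative, not part of any real transaction manager.

import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.web.client.RestTemplate;

public class GatewayService {

    private final RestTemplate restTemplate = new RestTemplate();

    public void callPaymentWithinGlobalTx(String globalTxId) {
        // Pass the global transaction id so payment-rest can enlist its local
        // transaction under it; the header name is purely illustrative.
        HttpHeaders headers = new HttpHeaders();
        headers.set("X-Global-Tx-Id", globalTxId);
        restTemplate.exchange("http://localhost:8585/payment",
                HttpMethod.POST, new HttpEntity<>(null, headers), String.class);
    }
}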
In your case you can use sagas, as mentioned by many others, but they rely on events and are asynchronous in nature.
If you want a synchronous kind of API, you can do something similar to this.
First, let's take an Amazon-style example: creating an order, taking the balance out of your wallet, and completing the order:
Create the Order in Pending state.
Call reserveBalance in the Account service for the order id.
If the balance was reserved, change the Order state to Confirmed (also storing the transaction id for the reserve) and report reserveBalanceConsumed to the Account service.
Otherwise, change the Order state to Cancelled with the reason "not enough balance".
Now there are cases where, let's say, the Account service balance is reserved but for some reason the order is never confirmed.
Then a periodic job could check for balance reservations older than, say, 30 minutes: if the corresponding order is marked as Confirmed with that transaction id, call reserveBalanceConsumed; otherwise cancel the order with the reason "some error, please try again" and mark the balance as free.
Note that these types of systems are complex to build; in general, use the Saga pattern for a simpler structure. A rough sketch of the flow above follows.
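This is a minimal sketch of the reserve/confirm flow described above; OrderRepository, AccountClient, ReserveResult, and the state-changing methods are all hypothetical names.

public class OrderService {

    private final OrderRepository orders;      // hypothetical persistence
    private final AccountClient accountClient; // hypothetical client for the Account service

    public OrderService(OrderRepository orders, AccountClient accountClient) {
        this.orders = orders;
        this.accountClient = accountClient;
    }

    public Order placeOrder(String accountId, long amount) {
        // 1. Create the order in Pending state.
        Order order = orders.save(Order.pending(accountId, amount));
        // 2. Reserve the balance in the Account service for this order id.
        ReserveResult reserve = accountClient.reserveBalance(order.getId(), accountId, amount);
        if (reserve.isReserved()) {
            // 3. Confirm, keeping the reserve transaction id, and consume the reservation.
            order.confirm(reserve.getTransactionId());
            accountClient.reserveBalanceConsumed(reserve.getTransactionId());
        } else {
            // 4. Cancel with a reason.
            order.cancel("not enough balance");
        }
        return orders.save(order);
    }
}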
Related
We're trying to understand how to compensate a "saga compensation failure".
We have two microservices, and two databases, one per microservice.
Customer microservice
Contract microservice
Use case: Customer alias modification.
1. A request is sent to the "Customer microservice".
a. The customer alias is modified in the customer table, but its state is pending.
b. A "customer modified" event is sent.
2. The "customer modified" event is received by the "Contract microservice".
a. The received customer data is updated in all contracts (we're using MongoDB), since customer information is embedded in each contract.
b. A "contract updated" event is sent.
3. The "contract updated" event is received by the "Customer microservice".
a. The customer's state is set to confirmed.
If 3.a fails, a compensating action is performed; but what if the compensation itself fails?
This can be handled with a combination of the approaches below:
Implement the Retry pattern for the compensating action (a sketch follows this list).
Exception handling: the exception can be saved and later resolved through an automated process, such as a retry mechanism running in a separate application.
As an extension of approach #2, if the automated process is unable to resolve it, generate an exception report that can be reviewed manually so action can be taken based on the issue.
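Here is a minimal sketch of such a retry wrapper; FailureLog is a hypothetical store for unresolved failures, and a real system would usually persist the failure and drive retries from a scheduler rather than an in-process loop.

public class CompensationRetrier {

    // Retry a compensating action a few times with backoff, then record the
    // failure for the manual-review path (approach #3 above).
    public void compensateWithRetry(Runnable compensation, FailureLog failureLog) {
        int maxAttempts = 5;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                compensation.run();
                return; // compensation succeeded
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    failureLog.save(e); // automated process gave up
                    return;
                }
                try {
                    Thread.sleep(1000L * attempt); // simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}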
It looks like you are using the term saga, but what you really want is a transaction. If you really need a transaction, use one (you can look at solutions like https://docs.temporal.io/ for providing that).
[Personally I think transactions between services are bad, and if I need a transaction between services I try to rethink my design, but your mileage may vary.]
You didn't specify why contracts would reject the change. Business rules are one thing, but if these are "technical reasons" like availability, then the thing to do is make sure the event is persistent and actually sent (e.g. with the outbox pattern on the sending side) and have the consuming service(s) handle it when they can.
If there are business rules involved, then maybe it is a bad example, but I'd expect a person can still change their alias regardless, and the compensation would be keeping some of the contracts with the old alias, or something along these lines.
By the way, it seems you have a design issue that causes needless temporal coupling between your services.
If the alias is important in contracts but owned by the customers service, the alias stored in the contracts should be considered a cache.
In this case the customers service can complete the update regardless of what other services do. It can fire the event, and the contracts service can finish its part of the process when it is able to. When a contract is read, you can check whether there is a newer version of the customer and, if so, fetch it. You may also (depending on the business requirements) specify that the data is correct as of the last update.
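A hypothetical sketch of that read-time freshness check; ContractRepository, CustomerClient, and the version accessors are illustrative names.

public class ContractReader {

    private final ContractRepository contracts;  // illustrative persistence
    private final CustomerClient customerClient; // illustrative client for the customers service

    public ContractReader(ContractRepository contracts, CustomerClient customerClient) {
        this.contracts = contracts;
        this.customerClient = customerClient;
    }

    public Contract read(String contractId) {
        Contract contract = contracts.findById(contractId);
        // Treat the embedded customer data as a cache: refresh it lazily on read.
        Customer latest = customerClient.get(contract.getCustomerId());
        if (latest.getVersion() > contract.getCustomerVersion()) {
            contract.refreshCustomerData(latest); // embedded copy was stale
            contracts.save(contract);
        }
        return contract;
    }
}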
BASE vs. ACID:
ISOLATION: As local transactions are committed while the Saga is running, their changes are already visible to other concurrent transactions, despite the possibility that the Saga will fail eventually, causing all previously applied transactions to be compensated. I.e., from the perspective of the overall Saga, the isolation level is comparable to “read uncommitted.”
Eventually other services will read those inconsistent events, make wrong decisions based on them, and produce even more events that should never have happened at all.
In the end there will be tons of events to roll back (and how is that even possible if your system lets users do more than is allowed in the real world? Can you take back an ice cream from a kid it was sold to five minutes ago?)
I'm currently working on the initial phase of a microservice architecture for a product. It is evident that many operations require a distributed transaction; that is, a couple of operations across different microservices are needed in order to complete one business process.
For this purpose I found that Saga is useful. In the ideal case, where everything goes correctly, it works fine, but when something goes wrong or some activity fails, we may have to roll back those operations. For this, a "compensating" transaction or operation is required. However, it is entirely possible that after an operation completed successfully, other transactions were also performed on that service, so the DB might be in a different state than when the operation actually ran.
What is the solution for this? One idea is that the state somehow needs to be preserved so it can be revisited, but something like a stock level may have been changed by some other transaction in the meantime, so I feel the compensating transaction would be a problem.
Suppose we are in a microservice architecture with:
2 microservices with API interfaces for synchronous calls
1 RDBMS with one DB per microservice
1 queue system for asynchronous calls
User A makes a request to an endpoint of microservice 1 using its API.
The endpoint's task is, for example, to calculate something and then put the result in a table of the microservice's DB.
How do we handle a failure of the database during the user's request?
Example:
During the request, the database crashes.
What do we do then?
Return an error? But which error? 500?
But isn't the microservice architecture supposed to avoid this type of coupling?
Shall we make the system more loosely coupled?
Shall the microservice save the data in a local file or queue and retry the DB insert?
But what about the user? It would be impossible for them to retrieve or update the data they just created, unless the system returns the result from the local data... but that's very complex, no?
How can we achieve that?
I have the same doubt about the use of queue systems.
In an event-driven design, each microservice has a consumer and a producer alongside it.
The consumer listens to a topic on the event bus and can then insert data into its DB.
The producer is invoked when an action is triggered by a DB insert of the service's own data, and sends it to the event bus on a topic.
Imagine the event bus crashes...
If so, the consumer in the microservice will fail too.
And if a DB insert occurs, the producer cannot emit the event to the event bus...
So is the data lost?
Shall the producer keep the data in its local storage for retrying?
I have turned this question over in my head many times and still don't have a resilient system.
Return an error? But which error? 500?
Yes, return an error. There is nothing wrong with returning an error from your microservice. Usually "500 Internal Server Error" is the right error when you have a failure in your database. This is standard behavior for a REST API.
But isn't the microservice architecture supposed to avoid this type of coupling? Shall we make the system more loosely coupled?
I think there is a confusion here. A microservice communicating with its own database is not considered coupling. Micro-service-A using its own micro-service-A-database is considered one logical unit or vertical. Micro-service-A would not be of much use without its database, and vice versa. This is totally fine, and you can look at it in a similar way as a standard web application with its front end, back end (similar to your service), and database. Coupling should be avoided across different microservices: micro-service-A should not be tightly coupled with micro-service-B or micro-service-C. Each microservice should be atomic and as independent as possible, but not from its database, cache, or similar; you can consider the database a logical part of it.
Shall the microservice save the data in a local file or queue and retry the DB insert?
No. It is expected that the database can fail, or at least you have to deal with that possibility. From the user's perspective you would just return error code 500; for most cases this is the expected behavior. There are some special cases where you want to save the data at any cost and not lose it for that request (there are ways to deal with this as well).
But what about the user?
If it is a standard web user, they would retry a couple of times and, if the problem persists, probably come back later and try again (there is nothing wrong with returning error 500 here). If by "user" you mean another microservice doing the call, then that calling microservice has to expect that failures can happen. What to do next depends on the situation. For example, consider micro-service-A calling micro-service-B with an HTTP request:
GET call: here you can build a retry policy in micro-service-A, and if micro-service-B still responds with 500 after the retries, you can return an error to the user who called micro-service-A (a sketch follows this list).
POST/PUT/PATCH call: here you can also try something similar to the GET calls, but only if one service is involved. If you have micro-service-A calling micro-service-B and then micro-service-C, and one call succeeded (saving some data) while the other failed, you have to consider sagas (if the operation should be transactional).
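A minimal sketch of such a retry policy using a plain RestTemplate; the attempt count and delays are illustrative, and libraries like Spring Retry or Resilience4j would normally handle this for you.

import org.springframework.web.client.RestClientException;
import org.springframework.web.client.RestTemplate;

public class RetryingClient {

    private final RestTemplate restTemplate = new RestTemplate();

    // Try the GET a few times, then give up and let the caller translate the
    // failure into an error for its own user.
    public String getWithRetry(String url) {
        int maxAttempts = 3;
        RestClientException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return restTemplate.getForObject(url, String.class);
            } catch (RestClientException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(200L * attempt); // simple backoff between attempts
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        throw last; // exhausted retries: surface the failure
    }
}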
Imagine the event bus crashes... If so, the consumer in the microservice will fail too.
If your local microservice database crashes, then the rest of the operation should fail as well. If you cannot save your entity in your local microservice DB, why would you go further with the operation, e.g. publishing a message to a queue? The source of truth for your data/entities is the database (at least in most cases). So if the database fails, you should throw an exception and return an error to the caller/user.
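For illustration, a minimal Spring sketch of turning a database failure into a 500 response; it assumes the persistence layer throws Spring's DataAccessException.

import org.springframework.dao.DataAccessException;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

@RestControllerAdvice
public class DatabaseFailureHandler {

    // Map any database failure to a plain 500 for the caller.
    @ExceptionHandler(DataAccessException.class)
    public ResponseEntity<String> handleDbFailure(DataAccessException e) {
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                .body("Internal Server Error");
    }
}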
If a DB insert occurs, the producer cannot emit the event to the event bus... So is the data lost? Shall the producer keep the data in its local storage for retrying?
So in the case where you save your data/entity in the database but the queue is not available, you could simply save the message/event to a table in the DB and then publish it when the queue is up and running again. This is actually a very common pattern in these situations. For example, your implementation could be transactional:
Save the entity to its table, and
save the event/message to an Event table. This way, if one fails, the whole operation is rolled back.
You can then have a background worker (depending on the tech you are using for your back end) publish the messages to the queue asynchronously.
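A minimal sketch of that transactional write, with hypothetical OrderRepository/OutboxEventRepository names; the background worker that polls the event table and publishes to the queue is not shown.

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderWriter {

    private final OrderRepository orders;       // hypothetical JPA repository
    private final OutboxEventRepository outbox; // hypothetical event-table repository

    public OrderWriter(OrderRepository orders, OutboxEventRepository outbox) {
        this.orders = orders;
        this.outbox = outbox;
    }

    @Transactional // entity and event are saved in one local transaction:
    public void saveOrder(Order order) { // if either write fails, both roll back
        orders.save(order);
        outbox.save(OutboxEvent.orderCreated(order)); // published later by a background worker
    }
}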
OK, so one of our team members has suggested that at the beginning of every HTTP request we begin a DB transaction (we are using Entity Framework Core), do the work of the request, and then complete the transaction if the response is 200 OK, or roll back if it is anything else.
This means we would only commit on successful requests.
That is all well and good when we perform reads and writes to the DB.
However, I am wondering: does this come at a cost if we don't actually make any reads or writes to the DB?
If you use TransactionScope for this then the transaction is only physically opened on the first database access. The cost for an unused scope is extremely low.
If you use normal EF transactions then an empty transaction will hit the database three times:
BEGIN TRAN
COMMIT
Reset connection for connection pooling
Each of these is extremely low cost. You can test the cost of this by simply running this 100000 times in a loop. It might very well be the case that you don't care about this small cost.
I would still advise against this. In my experience web applications require more flexibility than a 1:1 correspondence of web request and transaction. Also, the rule of using the HTTP status code to decide the transaction's outcome will turn out to be inflexible.
Also, you must pick an isolation level (and possibly timeout) for each transaction. At the beginning of an HTTP request it is not known what the right values are. Only the action knows.
I had good experiences with using one EF context per HTTP request and then manually using transactions inside of each action. The overhead in terms of LOC is very small. There is no pressing need to centralize this.
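Translated into this thread's Java/Spring terms (the question is about EF Core, so this is only an analogue): Spring's TransactionTemplate lets each action open exactly the transaction it needs, with its own isolation level and timeout; the class and method names below are illustrative.

import org.springframework.stereotype.Service;
import org.springframework.transaction.support.TransactionTemplate;

@Service
public class AccountActions {

    // Configured with the isolation level and timeout this particular action needs.
    private final TransactionTemplate tx;

    public AccountActions(TransactionTemplate tx) {
        this.tx = tx;
    }

    public void transfer(String from, String to, long amount) {
        // The action, not a global per-request filter, decides the transaction boundary.
        tx.executeWithoutResult(status -> {
            // ... debit 'from' and credit 'to'; any exception here rolls back
        });
    }
}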
Don't blindly put BEGIN...COMMIT around everything. There are cases where this is just wrong.
What if the web page records the presence of the user, or the loading of the particular page? Having a ROLLBACK destroys that information.
What if there are two actions on the page, and they are independent of each other? That is, a ROLLBACK for one is OK, but you want to COMMIT the other.
What if there are no writes on the page? Then there is no need for BEGIN...COMMIT.
I have a WCF service that uses ODP.NET to read data from an Oracle database. The service also writes to the database, but indirectly, as all updates and inserts are achieved through an older layer of business logic that I access via COM+, which I wrap in a TransactionScope. The older layer connects to Oracle via ODBC, not ODP.NET.
The problem I have is that because Oracle uses a two-phase commit, and because the older business layer is using ODBC and not ODP.NET, TransactionScope.Commit() sometimes returns before the data is actually available for reads from the service layer.
I see a similar post about a Java user having trouble like this as well on Stack Overflow.
A representative from Oracle posted that there isn't much I can do about this problem:
This may be due to the way the OLETx ITransaction::Commit() method behaves. After phase 1 of the 2PC (i.e. the prepare phase), if all is successful, commit can return even if the resource managers haven't actually committed. After all, a successful "prepare" is a guarantee that the resource managers cannot arbitrarily abort after this point. Thus even though a resource manager couldn't commit because it didn't receive a "commit" notification from the MSDTC (due to, say, a communication failure), the component's commit request returns successfully. If you select rows from the table(s) immediately, you may sometimes see the actual commit occur in the database after you have already executed your select. Your select will therefore not see the new rows, due to consistent read semantics. There is nothing we can do about this in Oracle, as the "commit success after successful phase 1" optimization is part of the MSDTC's implementation.
So, my question is this:
How should I go about dealing with the possible delay ("async" per the title) in figuring out when the second part of the 2PC actually occurs, so I can be sure that data I inserted (indirectly) is actually available to be selected after the Commit() call returns?
How do big systems deal with the fact that the data might not be ready for reading immediately?
I assume that the whole transaction has prepared and a commit outcome has been decided by the transaction manager; therefore, eventually (barring heuristic damage) the resource managers will receive their commit message and complete. However, there are no guarantees as to how long that might take: it could be days. No timeouts apply; having voted "commit" in the prepare phase, a resource manager must wait to hear the collective outcome.
Under these conditions, the simplest approach is an "understood, we're working on it" response: the request has been understood, but you don't actually know the outcome yet, and that's what you tell the user. Yes, in all sane circumstances the request will complete, but under some conditions operators could actually choose to intervene in the transaction manually (and maybe cause heuristic damage in doing so).
To go one step further, you could start a new transaction and perform some queries to see if the data is there. If you are populating a result screen you will naturally be doing such a query anyway. The question is what to do if the expected results are not there. So again, tell the user "your recent request is being processed, hit refresh to see if it's complete". Or retry automatically (I don't much like auto-retry; I prefer to educate the user that it's effectively an asynchronous operation).
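A rough sketch of that verification step, polling a few times before falling back to the "still processing" message; OrderDao and the limits are illustrative names.

public class CommitVisibilityChecker {

    private final OrderDao orderDao; // illustrative data-access object; each
                                     // exists() call runs in a fresh transaction

    public CommitVisibilityChecker(OrderDao orderDao) {
        this.orderDao = orderDao;
    }

    // Poll a few times; if the row still isn't visible, report "still
    // processing" to the user rather than failing outright.
    public boolean isVisible(String orderId) throws InterruptedException {
        for (int attempt = 0; attempt < 5; attempt++) {
            if (orderDao.exists(orderId)) {
                return true;
            }
            Thread.sleep(200); // wait for the delayed commit to land
        }
        return false;
    }
}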