How to handle transactions in Cosmos DB - "all or nothing" concept - Spring Boot

I am trying to save multiple documents to multiple collections at once in one transaction, so that if one of the saves fails, all the saved documents are rolled back.
I am using Spring Boot and Spring Data, and I connect to Cosmos DB in Azure through the MongoDB API. I have read in the portal that this can be done by writing a stored procedure, but is there a way to do it from code, like Spring's @Transactional annotation?
Any help is really appreciated.

The only way you can write transactionally is with a stored procedure, or via Transactional Batch operations (SDK-based, in a subset of the language SDKs, currently .NET and Java). But that won't help in your case:
Stored procedures are specific to the core (SQL) API; they're not for the MongoDB API.
Transactions are scoped to a single partition within a single collection. You cannot transactionally write across collections, regardless of whether you use the MongoDB API or the core (SQL) API.
You'll really need to ask whether transactions are absolutely necessary. Maybe you can use some type of durable-messaging approach to manage your content updates. But there's simply nothing built in that will allow you to do this natively in Cosmos DB.
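For reference, here is a hedged sketch of what a Transactional Batch looks like with the Azure Cosmos DB Java SDK v4 (a sketch only; exact method names can vary between SDK versions, and, as noted above, it is limited to a single container and a single partition key, so it does not cover the cross-collection case asked about here):

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosBatch;
import com.azure.cosmos.models.CosmosBatchResponse;
import com.azure.cosmos.models.PartitionKey;

public class CosmosBatchSketch {

    // Hypothetical item type, just for the sketch
    public static class Order {
        public String id;
        public String customerId;
        public Order() { }
        public Order(String id, String customerId) { this.id = id; this.customerId = customerId; }
    }

    public static void saveAtomically(CosmosContainer container) {
        // Every operation in the batch must target the same container and the same partition key
        CosmosBatch batch = CosmosBatch.createCosmosBatch(new PartitionKey("customer-1"));
        batch.createItemOperation(new Order("order-1", "customer-1"));
        batch.createItemOperation(new Order("order-2", "customer-1"));

        CosmosBatchResponse response = container.executeCosmosBatch(batch);
        if (!response.isSuccessStatusCode()) {
            // Nothing was written; per-operation results are available via response.getResults()
            throw new IllegalStateException("Batch failed with status " + response.getStatusCode());
        }
    }
}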

Transactions across partitions and collections are indeed not supported. If you really need a rollback mechanism, it might be worthwhile to check the event sourcing pattern, as you might then be able to capture events instead of updating master entities. These events you could then easily delete, but other processes might still have executed using the incorrect events.
We created a sort of unit of work. We register all changes to the data model, including events and messages that are being sent. Only when we call a commit are the changes persisted to the database, in the following order:
Commit updates
Commit deletes
Commit inserts
Send messages
Send events
Still, it's not watertight, but it avoids sending out messages/events/modifications to the data model as long as the calling process is not ready to do so (e.g. due to an error). This UnitOfWork is passed through our domain services to allow all operations of our command to be handled in one batch. It's then up to the developer to decide whether a certain operation can be committed as part of a bigger operation (same UoW) or independently (new UoW).
We then wrapped our command handlers in a Polly policy to retry in case of update conflicts. Theoretically, though, we could get an update conflict on the 2nd update, which could cause an inconsistent data model, but we keep this in mind when using the UoW.
It's not watertight, but hopefully it helps!
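A minimal Java sketch of that register-then-commit idea (all names here are illustrative, not taken from any particular library); nothing touches the database or the bus until commit() is called, and then the changes are applied in the order listed above:

import java.util.ArrayList;
import java.util.List;

public class UnitOfWork {

    private final List<Runnable> updates = new ArrayList<>();
    private final List<Runnable> deletes = new ArrayList<>();
    private final List<Runnable> inserts = new ArrayList<>();
    private final List<Runnable> messages = new ArrayList<>();
    private final List<Runnable> events = new ArrayList<>();

    public void registerUpdate(Runnable action)  { updates.add(action); }
    public void registerDelete(Runnable action)  { deletes.add(action); }
    public void registerInsert(Runnable action)  { inserts.add(action); }
    public void registerMessage(Runnable action) { messages.add(action); }
    public void registerEvent(Runnable action)   { events.add(action); }

    // Called only when the whole command has succeeded; an error before this point
    // simply means nothing was persisted or sent.
    public void commit() {
        updates.forEach(Runnable::run);   // 1. commit updates
        deletes.forEach(Runnable::run);   // 2. commit deletes
        inserts.forEach(Runnable::run);   // 3. commit inserts
        messages.forEach(Runnable::run);  // 4. send messages
        events.forEach(Runnable::run);    // 5. send events
    }
}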

Yes, transactions are supported in Cosmos DB with the Mongo API. I believe it's a fairly new addition, but it's simple enough and described here.
I don't know how well it's supported in Spring Boot, but at least it's doable.
// start transaction
var session = db.getMongo().startSession();
var friendsCollection = session.getDatabase("users").friends;
session.startTransaction();
// operations in transaction
try {
    friendsCollection.updateOne({ name: "Tom" }, { $set: { friendOf: "Mike" } });
    friendsCollection.updateOne({ name: "Mike" }, { $set: { friendOf: "Tom" } });
} catch (error) {
    // abort transaction on error
    session.abortTransaction();
    throw error;
}
// commit transaction
session.commitTransaction();
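On the Spring Boot side, a hedged sketch of the same thing with Spring Data MongoDB: registering a MongoTransactionManager lets @Transactional drive the session/startTransaction/commit dance shown above. Class and collection names below are illustrative, the factory type is MongoDbFactory on older Spring Data versions, and whether this actually works against the Cosmos DB Mongo API depends on the account's server version and feature support:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.MongoDatabaseFactory;
import org.springframework.data.mongodb.MongoTransactionManager;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Configuration
class MongoTxConfig {

    @Bean
    MongoTransactionManager transactionManager(MongoDatabaseFactory dbFactory) {
        return new MongoTransactionManager(dbFactory);
    }
}

@Service
class FriendService {

    private final MongoTemplate mongoTemplate;

    FriendService(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    @Transactional // both updates commit together or roll back together
    public void makeFriends() {
        mongoTemplate.updateFirst(
                Query.query(Criteria.where("name").is("Tom")),
                Update.update("friendOf", "Mike"), "friends");
        mongoTemplate.updateFirst(
                Query.query(Criteria.where("name").is("Mike")),
                Update.update("friendOf", "Tom"), "friends");
    }
}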

Related

Batching stores transparently

We are using the following frameworks and versions:
jOOQ 3.11.1
Spring Boot 2.3.1.RELEASE
Spring 5.2.7.RELEASE
I have an issue where some of our business logic is divided into logical units that look as follows:
Request containing a user transaction is received
This request contains various information, such as the type of transaction, which products are part of this transaction, what kind of payments were done, etc.
These attributes are then stored individually in the database.
In code, this looks approximately as follows:
TransactionRecord transaction = transactionRepository.create();
transaction.create(creationCommand);
In Transaction#create (which runs transactionally), something like the following occurs:
storeTransaction();
storePayments();
storeProducts();
// ... other relevant information
A given transaction can have many different types of products and attributes, all of which are stored. Many of these attributes result in UPDATE statements, while some may result in INSERT statements - it is difficult to fully know in advance.
For example, the storeProducts method looks approximately as follows:
products.forEach(product -> {
    ProductRecord record = productRepository.findProductByX(...);
    if (record == null) {
        record = productRepository.create();
        record.setX(...);
        record.store();
    } else {
        // do something else
    }
});
If the products are new, they are INSERTed. Otherwise, other calculations may take place. Depending on the size of the transaction, this single user transaction could obviously result in up to O(n) database calls/roundtrips, and even more depending on what other attributes are present. In transactions where a large number of attributes are present, this may result in upwards of hundreds of database calls for a single request (!). I would like to bring this down as close as possible to O(1) so as to have more predictable load on our database.
Naturally, batch and bulk inserts/updates come to mind here. What I would like to do is to batch all of these statements into a single batch using jOOQ, and execute after successful method invocation prior to commit. I have found several (SO Post, jOOQ API, jOOQ GitHub Feature Request) posts where this topic is implicitly mentioned, and one user groups post that seemed explicitly related to my issue.
Since I am using Spring together with jOOQ, I believe my ideal solution (preferably declarative) would look something like the following:
@Batched(100) // batch size as parameter, potentially
@Transactional
public void createTransaction(CreationCommand creationCommand) {
    // all inserts/updates above are added to a batch and executed on successful invocation
}
For this to work, I imagine I'd need to manage a scoped (ThreadLocal/transactional/session-scoped) resource which can keep track of the current batch such that:
Prior to entering the method, an empty batch is created if the method is @Batched,
A custom DSLContext (perhaps extending DefaultDSLContext) that is made available via DI has a ThreadLocal flag which keeps track of whether any current statements should be batched or not, and if so,
Intercept the calls and add them to the current batch instead of executing them immediately.
However, step 3 would necessitate having to rewrite a large portion of our code from the (IMO) relatively readable:
records.forEach(record -> {
    record.setX(...);
    // ...
    record.store();
});
to:
userObjects.forEach(userObject -> {
    dslContext.insertInto(...).values(userObject.getX(), ...).execute();
});
which would defeat the purpose of having this abstraction in the first place, since the second form can also be rewritten using DSLContext#batchStore or DSLContext#batchInsert. IMO however, batching and bulk insertion should not be up to the individual developer and should be able to be handled transparently at a higher level (e.g. by the framework).
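(For reference, the explicit alternative mentioned above looks roughly like this with the stock jOOQ API, assuming records is a collection of attached UpdatableRecords:)

// Collects all stores into one JDBC batch, but must be called explicitly by the developer
ctx.batchStore(records).execute();
// or, for records that are known to be new:
ctx.batchInsert(records).execute();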
I find the readability of the jOOQ API to be an amazing benefit of using it, however it seems that it does not lend itself (as far as I can tell) to interception/extension very well for cases such as these. Is it possible, with the jOOQ 3.11.1 (or even current) API, to get behaviour similar to the former with transparent batch/bulk handling? What would this entail?
EDIT:
One possible but extremely hacky solution that comes to mind for enabling transparent batching of stores would be something like the following:
Create a RecordListener and add it as a default to the Configuration whenever batching is enabled.
In RecordListener#storeStart, add the query to the current Transaction's batch (e.g. in a ThreadLocal<List>)
The AbstractRecord has a changed flag which is checked (org.jooq.impl.UpdatableRecordImpl#store0, org.jooq.impl.TableRecordImpl#addChangedValues) prior to storing. Resetting this (and saving it for later use) makes the store operation a no-op.
Lastly, upon successful method invocation but prior to commit:
Reset the changes flags of the respective records to the correct values
Invoke org.jooq.UpdatableRecord#store, this time without the RecordListener or while skipping the storeStart method (perhaps using another ThreadLocal flag to check whether batching has already been performed).
As far as I can tell, this approach should work, in theory. Obviously, it's extremely hacky and prone to breaking as the library internals may change at any time if the code depends on Reflection to work.
Does anyone know of a better way, using only the public jOOQ API?
jOOQ 3.14 solution
You've already discovered the relevant feature request #3419, which will solve this on the JDBC level starting from jOOQ 3.14. You can either use the BatchedConnection directly, wrapping your own connection to implement the below, or use this API:
ctx.batched(c -> {
    // Make sure all records are attached to c, not ctx, e.g. by fetching from c.dsl()
    records.forEach(record -> {
        record.setX(...);
        // ...
        record.store();
    });
});
jOOQ 3.13 and before solution
For the time being, until #3419 is implemented (it will be, in jOOQ 3.14), you can implement this yourself as a workaround. You'd have to proxy a JDBC Connection and PreparedStatement and ...
... intercept all:
Calls to Connection.prepareStatement(String), returning a cached proxy statement if the SQL string is the same as for the last prepared statement, or batch execute the last prepared statement and create a new one.
Calls to PreparedStatement.executeUpdate() and execute(), and replace those by calls to PreparedStatement.addBatch()
... delegate all:
Calls to other API, such as e.g. Connection.createStatement(), which should flush the above buffered batches, and then call the delegate API instead.
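Below is a rough, hedged sketch of that workaround using java.lang.reflect.Proxy (all class names here are illustrative; jOOQ 3.14's BatchedConnection is the supported version of this). It buffers consecutive executions of the same SQL into one JDBC batch and flushes whenever a different statement or any other Connection call comes along:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public final class BatchingConnectionProxy {

    public static Connection wrap(Connection delegate) {
        return (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[] { Connection.class },
                new Handler(delegate));
    }

    private static final class Handler implements InvocationHandler {
        private final Connection delegate;
        private PreparedStatement current;  // statement currently collecting batch entries
        private String currentSql;

        private Handler(Connection delegate) {
            this.delegate = delegate;
        }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            if ("prepareStatement".equals(method.getName())
                    && args != null && args.length > 0 && args[0] instanceof String) {
                String sql = (String) args[0];
                if (!sql.equals(currentSql)) {
                    flush();  // different SQL: execute what has been buffered so far
                    current = (PreparedStatement) method.invoke(delegate, args);
                    currentSql = sql;
                }
                return proxyStatement(current);
            }
            flush();  // createStatement(), commit(), etc. must see the buffered writes
            return method.invoke(delegate, args);
        }

        private PreparedStatement proxyStatement(PreparedStatement stmt) {
            return (PreparedStatement) Proxy.newProxyInstance(
                    PreparedStatement.class.getClassLoader(),
                    new Class<?>[] { PreparedStatement.class },
                    (p, m, a) -> {
                        String name = m.getName();
                        // Turn executeUpdate()/execute() into addBatch(); everything else delegates
                        if (("executeUpdate".equals(name) || "execute".equals(name))
                                && (a == null || a.length == 0)) {
                            stmt.addBatch();
                            return "execute".equals(name) ? Boolean.FALSE : Integer.valueOf(0);
                        }
                        if ("close".equals(name)) {
                            return null;  // keep the statement open until the batch is flushed
                        }
                        return m.invoke(stmt, a);
                    });
        }

        private void flush() throws SQLException {
            if (current != null) {
                current.executeBatch();
                current.close();
                current = null;
                currentSql = null;
            }
        }
    }
}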
I wouldn't recommend hacking your way around jOOQ's RecordListener and other SPIs, I think that's the wrong abstraction level to buffer database interactions. Also, you will want to batch other statement types as well.
Do note that by default, jOOQ's UpdatableRecord tries to fetch generated identity values (see Settings.returnIdentityOnUpdatableRecord), which is something that prevents batching. Such store() calls must be executed immediately, because you might expect the identity value to be available.
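If batching matters more than reading back identities, that setting can be turned off; a small hedged sketch (the setter follows the usual generated Settings naming, and the dialect and connection are placeholders):

Settings settings = new Settings()
    .withReturnIdentityOnUpdatableRecord(false); // allow store() calls to participate in batches

DSLContext ctx = DSL.using(connection, SQLDialect.POSTGRES, settings);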

Strategies for handling and invalidating cached data on subscriptions in a moderately complex use case

Let's take a chat application for example.
A user has access to multiple chat threads, each with multiple messages.
The user interface consists of a list of threads on the left side (listThreads) that contains the name of the other party, the last message, the unread message count and the date & time of the last message, and the actual messages (viewThread) and a reply box on the right hand side (think facebook messenger).
When the user selects a message thread, the viewThread component subscribes to a query something along the lines of:
query thread {
  threads(id: "xxxx") {
    id
    other_party { id name }
    unread_messages
    messages {
      sent_by { id }
      timestamp
      text
    }
  }
}
To make updates live, it calls q.subscribeToMore with a subscription along the lines of:
subscription newMessages {
  newMessage(thread_id: "xxx") {
    sent_by { id }
    timestamp
    text
  }
}
This works perfectly, the new messages show up as they should.
To list the available message threads, a less detailed view of all threads is queried:
query listThreads {
  threads {
    id
    other_party { id name }
    unread_messages
    last_updated_at
  }
}
To keep the list in sync, the same subscription is used, without filtering on the thread_id, and the thread list data is updated manually.
This also works fine.
However, if thread A is selected, the messages of thread A are cached.
If thread B is selected afterwards, the subscription to the query getting the detailed info of thread A is destroyed, since the observable is destroyed when the router exchanges the viewThread component.
If a message then arrives in thread A while the user is viewing thread B, the threadList is updated (since that subscription is live), but if the user switches back to thread A, the messages are loaded from the cache and are now outdated, since there was no subscription for that particular message thread that would have updated or invalidated the cache.
In other circumstances, where the user navigates to an entirely different page and the thread list is not in view, the problem is even more obvious: there is nothing related to the chat messages that is actively subscribed to, so nothing invalidates the cached data when a new message arrives, although the server theoretically provides a way to do that by offering new-message subscription events.
My question is the following:
What are the best practices for keeping data in sync / invalidating data that has been cached by Apollo but is not actively "in use"?
What are the best practices for keeping nested data in sync (messages of threads of an event [see below])? I don't feel that having to implement the logic for subscribing to and updating message data in the event query is a clean solution.
Using .subscribeToMore works for keeping data that is actively used in sync, but once that query is no longer in use, the data remains in the cache, where it may or may not become outdated over time. Is there a way to remove cached data when an observable goes out of scope? As in: keep this data cached as long as there is at least one query using it, because I trust that such a query also implements the logic that keeps it in sync based on the server push events.
Should a service be used that subscribes (through the whole lifecycle of the SPA) to all subscription events and contains the knowledge of how to update each type of cached data, if present in the cache? (This service could be notified about which data needs to be kept in sync, to avoid using more resources than necessary; i.e. a service that subscribes to all newMessage events and pokes the cache based on them.) Would that automatically emit new values for queries that returned objects referencing such data (would updating message:1 make a thread query that returned the same message:1 in its messages field emit a new value automatically), or do those queries also have to be updated manually?
This starts to become very cumbersome when extending this model with, say, Events that also have their own chat thread, so querying event { thread { messages { ... } } } now needs to subscribe to the newMessage subscription, which breaks encapsulation and the single responsibility principle.
It is also problematic that to subscribe to newMessage data one would need to provide the id of the message thread associated with the event, but that is not known before the query returns. Because of this, .subscribeToMore cannot be used, since at that point I don't have the thread_id available yet.
If the intended behavior is "every time I open a thread, show the latest messages and not just what's cached", then you just need to set the fetchPolicy for your thread query to network-only, which will ensure that the request is always sent to the server rather than being fulfilled from the cache. The docs for apollo-angular are missing information about this option, but here's the description from the React docs:
Valid fetchPolicy values are:
cache-first: This is the default value where we always try reading data from your cache first. If all the data needed to fulfill your query is in the cache then that data will be returned. Apollo will only fetch from the network if a cached result is not available. This fetch policy aims to minimize the number of network requests sent when rendering your component.
cache-and-network: This fetch policy will have Apollo first trying to read data from your cache. If all the data needed to fulfill your query is in the cache then that data will be returned. However, regardless of whether or not the full data is in your cache this fetchPolicy will always execute query with the network interface unlike cache-first which will only execute your query if the query data is not in your cache. This fetch policy optimizes for users getting a quick response while also trying to keep cached data consistent with your server data at the cost of extra network requests.
network-only: This fetch policy will never return you initial data from the cache. Instead it will always make a request using your network interface to the server. This fetch policy optimizes for data consistency with the server, but at the cost of an instant response to the user when one is available.
cache-only: This fetch policy will never execute a query using your network interface. Instead it will always try reading from the cache. If the data for your query does not exist in the cache then an error will be thrown. This fetch policy allows you to only interact with data in your local client cache without making any network requests which keeps your component fast, but means your local data might not be consistent with what is on the server. If you are interested in only interacting with data in your Apollo Client cache also be sure to look at the readQuery() and readFragment() methods available to you on your ApolloClient instance.
no-cache: This fetch policy will never return your initial data from the cache. Instead it will always make a request using your network interface to the server. Unlike the network-only policy, it also will not write any data to the cache after the query completes.

How to rollback distributed transactions?

I have three different Spring Boot projects with separate databases, e.g. account-rest, payment-rest, gateway-rest.
account-rest : create a new account
payment-rest : create a new payment
gateway-rest : calls other endpoints
At gateway-rest there is an endpoint which calls the other two endpoints.
@GetMapping("/gateway-api")
@org.springframework.transaction.annotation.Transactional(rollbackFor = RuntimeException.class)
public String getApi()
{
    String accountId = restTemplate.getForObject("http://localhost:8686/account", String.class);
    restTemplate.getForObject("http://localhost:8585/payment?accid=" + accountId, String.class);
    throw new RuntimeException("rollback everything");
}
I want to roll back the transactions and revert everything when I throw an exception at the gateway or any other endpoint.
How can I do that?
It is impossible to roll back external dependencies accessed via REST or something like that.
The only thing you can do is compensate for errors; you can use a pattern like Saga.
I hope that can help you.
You are basically doing dual persistence. That's not ideal, for two reasons:
It increases the latency and thus has a direct impact on the user experience.
What if one of them fails?
As the other answer pointed out, the Saga pattern is an option for posting compensation transactions.
The other option, and it's better to go with this by all means, is to avoid dual persistence by writing to only one service synchronously and then using Change Data Capture (CDC) to asynchronously update the other service. If we can design it this way, we can ensure atomicity (all or nothing), and thus the rollback scenario itself will probably never surface.
Refer to these two answers also, if they help:
https://stackoverflow.com/a/54676222/1235935
https://stackoverflow.com/a/54527066/1235935
By all means avoid distributed transactions or two-phase commit. That's not a good solution and creates a lot of operational overhead, locking, etc. when the transaction coordinator fails after the prepare phase and before the commit phase. Worse things happen when the transaction coordinator gets its data corrupted.
For that purpose you need an external transaction management system. It will handle distributed transactions and commit/roll back when they are finished on all services.
Possible flow example:
A request comes in.
gateway-rest starts a distributed transaction and a local transaction and sends a request (with the transaction id) to payment-rest. The thread with the transaction lives until all local transactions are finished.
payment-rest knows about the global transaction and starts its own local transaction.
When all local transactions are marked as committed, the TM (transaction manager) sends a request to each service to close its local transaction and to close the global transaction.
In your case you can use sagas, as mentioned by many others, but they require events and are async in nature.
If you want a synchronous kind of API, you can do something similar to this.
First, let's take an example like Amazon's: creating an order, taking the balance out of your wallet, and completing the order:
Create the Order in a Pending state.
Reserve the balance in the Account service for the order id.
If the balance was reserved, change the Order state to Confirmed (also keeping the transaction id for the reservation) and report reserveBalanceConsumed to the Account service.
Else change the Order state to Cancelled with the reason "not enough balance".
Now there are cases where, let's say, the account service balance is reserved but for some reason the order is never confirmed.
Then something could periodically check whether there is a reserved balance for some order older than, say, 30 minutes; if that order is marked as confirmed with that transaction id, call reserveBalanceConsumed, otherwise cancel the order with the reason "some error, please try again" and mark the balance as free.
NOW THESE TYPES OF SYSTEMS ARE COMPLEX TO BUILD. Use the Saga pattern in general for a simpler structure.
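A rough, self-contained Java sketch of that synchronous flow; everything here (OrderState, AccountClient, the method names) is illustrative rather than taken from any framework:

public class OrderFlow {

    enum OrderState { PENDING, CONFIRMED, CANCELLED }

    static class Order {
        final String id;
        OrderState state = OrderState.PENDING;
        String reservationId;
        String cancelReason;
        Order(String id) { this.id = id; }
    }

    /** Client for the remote account service; reserveBalance returns a reservation id, or null if funds are short. */
    interface AccountClient {
        String reserveBalance(String orderId, long amount);
        void consumeReservation(String reservationId);
    }

    private final AccountClient accountClient;

    public OrderFlow(AccountClient accountClient) {
        this.accountClient = accountClient;
    }

    public Order placeOrder(String orderId, long amount) {
        Order order = new Order(orderId);                                      // 1. create order in PENDING state

        String reservationId = accountClient.reserveBalance(orderId, amount);  // 2. reserve balance for the order id
        if (reservationId != null) {
            order.state = OrderState.CONFIRMED;                                // 3. confirm, remembering the reservation
            order.reservationId = reservationId;
            accountClient.consumeReservation(reservationId);
        } else {
            order.state = OrderState.CANCELLED;                                // 4. compensate: cancel with a reason
            order.cancelReason = "not enough balance";
        }
        return order;
    }

    // A periodic job (not shown) would look for reservations older than ~30 minutes whose order never
    // reached CONFIRMED, and either consume or release them, as described above.
}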

Multiple Publish inside a transactionScope fails

I modeled an integration architecture between different subsystems. All notifications from a subsystem are sent to the subscribed subsystems using the Publish primitive. These notifications are sent in a for loop inside a Handler method, so they are all in the same TransactionScope. I made a simple example to explain this: a client sends a message to a server, which then sends a variable number of messages using the Publish primitive. This is the server handler:
public void Handle(MyMessage message)
{
    for (int i = 0; i < message.numberOfNotifications; i++)
    {
        Bus.Publish<NotificationMessage>(m =>
        {
            m.myPersonalCount = i;
        });
    }
}
What I'm seeing, and can't figure out, is that when the number of notifications is 30 or less, everything is OK. From 31 or more I get this error message:
could not execute query
[ SELECT this_.SubscriberEndpoint as y0_ FROM "Subscription" this_ WHERE this_.MessageType in (?) ]
And looking in the inner exception I get Unable to enlist in a distributed transaction.
I tried the same thing using the Send primitive and everything was fine (tried with 10k messages), so this problem seems specific to the Publish primitive.
I use Oracle 10g for the DBMS and the Oracle 11g client.
If the endpoint is not transactional I don't have any problem, so the issue seems to be related only to the TransactionScope.
Any help is appreciated, Thanks
I am not an Oracle expert by any stretch of the imagination. I probably wouldn't even qualify as novice.
However, I know that NServiceBus queries the subscription storage for every single publish, in case there are changes to subscriptions between publishes.
Is it possible that the Oracle client has some sort of limitation to how many queries can enlist in a distributed transaction? Perhaps as a way to prevent N+1 types of performance problems?
That said, it seems very odd that you would want to publish more than 30 of the same type of event. I wonder what your business use case is. Events are generally supposed to announce that something irrevocable has happened. Why would there be 30 things that have happened?
If the business case is solid, it might be a good idea to implement your own subscription storage engine that uses some limited caching (5 seconds even) so that you aren't returning to querying the database on every single publish.

Spring, Hibernate - Batch processing of large amounts of data with good performance

Imagine you have a large amount of data in a database, approx. ~100 MB. We need to process all the data somehow (update it or export it somewhere else). How can this task be implemented with good performance? How should transaction propagation be set up?
Example #1 (with bad performance):
@Singleton
public class ServiceBean {

    public void processAllData() {
        List<Entity> entityList = dao.findAll();
        for (Entity entity : entityList) {
            process(entity);
        }
    }

    private void process(Entity ent) {
        // data processing
        // saves data back (UPDATE operation) or exports to somewhere else (just READs from DB)
    }
}
What could be improved here?
In my opinion:
I would set the Hibernate batch size (see the Hibernate documentation for batch processing).
I would separate ServiceBean into two Spring beans with different transaction settings. The processAllData() method should run outside of a transaction, because it operates on large amounts of data and a potential rollback wouldn't be 'quick' (I guess). The process(Entity entity) method would run in a transaction - rolling back a single data entity is no big deal.
Do you agree ? Any tips ?
Here are 2 basic strategies:
JDBC batching: set the JDBC batch size, usually somewhere between 20 and 50 (hibernate.jdbc.batch_size). If you are mixing and matching object C/U/D operations, make sure you have Hibernate configured to order inserts and updates, otherwise it won't batch (hibernate.order_inserts and hibernate.order_updates). And when doing batching, it is imperative to make sure you clear() your Session so that you don't run into memory issues during a large transaction.
Concatenated SQL statements: implement the Hibernate Work interface and use your implementation class (or anonymous inner class) to run native SQL against the JDBC connection. Concatenate hand-coded SQL via semicolons (works in most DBs) and then process that SQL via doWork. This strategy allows you to use the Hibernate transaction coordinator while being able to harness the full power of native SQL.
You will generally find that no matter how fast you can get your OO code, using DB tricks like concatenating SQL statements will be faster.
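A minimal sketch of strategy 1, assuming plain JPA/Hibernate (javax.persistence era); MyEntity and its processed flag are hypothetical stand-ins for the question's Entity, and the properties in the leading comment go into your Hibernate/JPA configuration:

// hibernate.jdbc.batch_size=50
// hibernate.order_inserts=true
// hibernate.order_updates=true

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;

public class BatchProcessingSketch {

    private static final int BATCH_SIZE = 50; // keep in sync with hibernate.jdbc.batch_size

    @Entity
    static class MyEntity { // nested here only to keep the sketch compact
        @Id Long id;
        boolean processed;
    }

    // Assumes the caller runs this inside a single transaction
    public void processAll(EntityManager em, List<Long> ids) {
        for (int i = 0; i < ids.size(); i++) {
            MyEntity entity = em.find(MyEntity.class, ids.get(i));
            entity.processed = true; // stand-in for the real processing

            if (i > 0 && i % BATCH_SIZE == 0) {
                em.flush(); // push the pending batched statements to the database
                em.clear(); // detach entities so dirty checking stays cheap
            }
        }
        em.flush();
        em.clear();
    }
}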
There are a few things to keep in mind here:
Loading all entities into memory with a findAll method can lead to OOM exceptions.
You need to avoid attaching all of the entities to a session - since every time Hibernate executes a flush it will need to dirty-check every attached entity. This will quickly grind your processing to a halt.
Hibernate provides a stateless session which you can use with a scrollable results set to scroll through entities one by one - docs here. You can then use this session to update the entity without ever attaching it to a session.
The other alternative is to use a stateful session but clear the session at regular intervals as shown here.
I hope this is useful advice.
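A hedged sketch of the stateless-session approach described above, using plain Hibernate APIs; the entity name in the HQL string matches the question's Entity and the processing step is just a placeholder comment:

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

public class StatelessProcessingSketch {

    public void processAll(SessionFactory sessionFactory) {
        StatelessSession session = sessionFactory.openStatelessSession();
        Transaction tx = session.beginTransaction();
        try {
            ScrollableResults results = session.createQuery("from Entity")
                    .scroll(ScrollMode.FORWARD_ONLY);
            while (results.next()) {
                Object entity = results.get(0); // the mapped entity; never attached to a persistence context
                // ... modify the entity here (the actual processing) ...
                session.update(entity);         // issues the UPDATE immediately, no dirty checking involved
            }
            results.close();
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}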
