Batching stores transparently - spring

We are using the following frameworks and versions:
jOOQ 3.11.1
Spring Boot 2.3.1.RELEASE
Spring 5.2.7.RELEASE
I have an issue where some of our business logic is divided into logical units that look as follows:
Request containing a user transaction is received
This request contains various information, such as the type of transaction, which products are part of this transaction, what kind of payments were done, etc.
These attributes are then stored individually in the database.
In code, this looks approximately as follows:
TransactionRecord transaction = transactionRepository.create();
transaction.create(creationCommand);`
In Transaction#create (which runs transactionally), something like the following occurs:
storeTransaction();
storePayments();
storeProducts();
// ... other relevant information
A given transaction can have many different types of products and attributes, all of which are stored. Many of these attributes result in UPDATE statements, while some may result in INSERT statements - it is difficult to fully know in advance.
For example, the storeProducts method looks approximately as follows:
products.forEach(product -> {
ProductRecord record = productRepository.findProductByX(...);
if (record == null) {
record = productRepository.create();
record.setX(...);
record.store();
} else {
// do something else
}
});
If the products are new, they are INSERTed. Otherwise, other calculations may take place. Depending on the size of the transaction, this single user transaction could obviously result in up to O(n) database calls/roundtrips, and even more depending on what other attributes are present. In transactions where a large number of attributes are present, this may result in upwards of hundreds of database calls for a single request (!). I would like to bring this down as close as possible to O(1) so as to have more predictable load on our database.
Naturally, batch and bulk inserts/updates come to mind here. What I would like to do is to batch all of these statements into a single batch using jOOQ, and execute after successful method invocation prior to commit. I have found several (SO Post, jOOQ API, jOOQ GitHub Feature Request) posts where this topic is implicitly mentioned, and one user groups post that seemed explicitly related to my issue.
Since I am using Spring together with jOOQ, I believe my ideal solution (preferably declarative) would look something like the following:
#Batched(100) // batch size as parameter, potentially
#Transactional
public void createTransaction(CreationCommand creationCommand) {
// all inserts/updates above are added to a batch and executed on successful invocation
}
For this to work, I imagine I'd need to manage a scoped (ThreadLocal/Transactional/Session scope) resource which can keep track of the current batch such that:
Prior to entering the method, an empty batch is created if the method is #Batched,
A custom DSLContext (perhaps extending DefaultDSLContext) that is made available via DI has a ThreadLocal flag which keeps track of whether any current statements should be batched or not, and if so
Intercept the calls and add them to the current batch instead of executing them immediatelly.
However, step 3 would necessitate having to rewrite a large portion of our code from the (IMO) relatively readable:
records.forEach(record -> {
record.setX(...);
// ...
record.store();
}
to:
userObjects.forEach(userObject -> {
dslContext.insertInto(...).values(userObject.getX(), ...).execute();
}
which would defeat the purpose of having this abstraction in the first place, since the second form can also be rewritten using DSLContext#batchStore or DSLContext#batchInsert. IMO however, batching and bulk insertion should not be up to the individual developer and should be able to be handled transparently at a higher level (e.g. by the framework).
I find the readability of the jOOQ API to be an amazing benefit of using it, however it seems that it does not lend itself (as far as I can tell) to interception/extension very well for cases such as these. Is it possible, with the jOOQ 3.11.1 (or even current) API, to get behaviour similar to the former with transparent batch/bulk handling? What would this entail?
EDIT:
One possible but extremely hacky solution that comes to mind for enabling transparent batching of stores would be something like the following:
Create a RecordListener and add it as a default to the Configuration whenever batching is enabled.
In RecordListener#storeStart, add the query to the current Transaction's batch (e.g. in a ThreadLocal<List>)
The AbstractRecord has a changed flag which is checked (org.jooq.impl.UpdatableRecordImpl#store0, org.jooq.impl.TableRecordImpl#addChangedValues) prior to storing. Resetting this (and saving it for later use) makes the store operation a no-op.
Lastly, upon successful method invocation but prior to commit:
Reset the changes flags of the respective records to the correct values
Invoke org.jooq.UpdatableRecord#store, this time without the RecordListener or while skipping the storeStart method (perhaps using another ThreadLocal flag to check whether batching has already been performed).
As far as I can tell, this approach should work, in theory. Obviously, it's extremely hacky and prone to breaking as the library internals may change at any time if the code depends on Reflection to work.
Does anyone know of a better way, using only the public jOOQ API?

jOOQ 3.14 solution
You've already discovered the relevant feature request #3419, which will solve this on the JDBC level starting from jOOQ 3.14. You can either use the BatchedConnection directly, wrapping your own connection to implement the below, or use this API:
ctx.batched(c -> {
// Make sure all records are attached to c, not ctx, e.g. by fetching from c.dsl()
records.forEach(record -> {
record.setX(...);
// ...
record.store();
}
});
jOOQ 3.13 and before solution
For the time being, until #3419 is implemented (it will be, in jOOQ 3.14), you can implement this yourself as a workaround. You'd have to proxy a JDBC Connection and PreparedStatement and ...
... intercept all:
Calls to Connection.prepareStatement(String), returning a cached proxy statement if the SQL string is the same as for the last prepared statement, or batch execute the last prepared statement and create a new one.
Calls to PreparedStatement.executeUpdate() and execute(), and replace those by calls to PreparedStatement.addBatch()
... delegate all:
Calls to other API, such as e.g. Connection.createStatement(), which should flush the above buffered batches, and then call the delegate API instead.
I wouldn't recommend hacking your way around jOOQ's RecordListener and other SPIs, I think that's the wrong abstraction level to buffer database interactions. Also, you will want to batch other statement types as well.
Do note that by default, jOOQ's UpdatableRecord tries to fetch generated identity values (see Settings.returnIdentityOnUpdatableRecord), which is something that prevents batching. Such store() calls must be executed immediately, because you might expect the identity value to be available.

Related

What is the best way to provide transaction operation if I use graphql

If I need to mutate multiple variables as part of a transaction, what is the best approach?
As far as I understand, grapQL will issue individual mutation command sequently behind the scene. What I need to do to make it transactional?
Thanks!
Is this a single mutation that performs multiple DB operations? If so, your mutation resolver is expected to perform all the operations, so it can manage the transaction itself: open, perform multiple operations, close.
If you're performing multiple mutations in a single request, they're always executed serially. In this case, you can simply open the transaction when you start GraphQL processing (e.g. in the controller or servlet or an instrumentation) and close it once you receive the result (or an exception). So the transaction management is around GraphQL execution, but outside of it.
If you want only certain operations from the request to be executed transactionally but not the others, you're already in the territory best left to frameworks. Use something like Spring's transaction management and let that deal with the correct lifecycle.
If you're, on the other hand, trying to implement a transaction spanning multiple distinct requests, this is really not something you should be doing, for many reasons, extremely convoluted transaction management being only one.

Kentico transactions and rollback of data

I am performing an import of data wrapped in a CMSTransactionScope.
What would be the most efficient and practical way to import data in parallel and rollback if any errors? The problem I see is that, with it being parallel, I don't know if I can have the inserted objects be part of the transaction if they are apart of a new thread.
Is there any way to do this or should it be handled differently?
If you're running the code in parallel in order to achieve better performance and you are basically inserting rows one by one then it's unlikely that it'll perform any better than it would while running in a single thread.
In this case I'd recommend using one thread in combination with CMSTransactionScope, and potentially ConnectionHelper.BulkInsert.
Anyway, if you still want to run your queries in parallel then you need to implement some kind of synchronization (locking, for instance) to ensure that all statements are executed before the code hits CMSTransactionScope.Commit() (this basically means a performance loss). Otherwise, queries would get executed in separate transactions. Moreover, you have to make sure that the CMSTransactionScope object always gets instantiated with the same IDataConnection (this should happen by default when you don't pass a connection to the constructor).
The second approach seems error prone to me and I'd rather take a look at different ways of optimizing the code (using async, etc.)

Real life scenarios for Transaction propagations

There are various transaction propagations like
REQUIRED - This is the case for DML operation.
SUPPORTS - This is the case for querying database.
MANDATORY - ?
REQUIRES_NEW - ?
NOT_SUPPORTED - ?
NEVER - ?
NESTED - ?
What are some real life scenarios for these transaction propagations? Why these are perfect fit into that situation?
There are various usages and there is no simple answer but I'll try to be the most explainatory
MANDATORY (expecting transaction): Typically used when you expect that any "higher-context" layer started the transaction and you only want to continue in it. For example if you want one transaction for whole operation from HTTP request to response. Then you start transaction on JAX-RS resource level and lower layers (services) require transaction to run within (otherwise exception is thrown).
REQUIRES_NEW (always create new transaction): This creates a new transaction and suspends the current one if any exists. From above example, this is the level you set on your JAX-RS resource for example. Or generally if your flow somehow changes and you want to split your logic into multiple transactions (so your code have mutliple logic operations that should be separated).
REQUIRED (continue in transaction or create if needed): Some kind of mix between MANDATORY and REQUIRES_NEW. In MANDATORY you expect that the transaction exists, for this level you hope that it exists and if not, you create it. Typically used in DAO-like services (from my experience), but it really depends on logic of your app.
SUPPORTS (caller decides whether to not/run in transaction): Used if want to use the same context as caller (higher context), if your caller was running in transaction, then you run in too. If it didn't, you are also non-transactional. May also be used in DAO-like services if you want higher context to decide.
NESTED (sub-transaction): I must admit I didn't use this one in real code but generally you create a real sub-transaction that works as some kind of checkpoint. So it runs in the context of "parent" transaction but if it fails it returns to that checkpoint (start of nested transaction). This may be useful when you require this kind of logic in your application, for example if you want to insert large number of items to the database, commiting valid ones and keeping track of invalid ones (so you can catch exception when nested transaction fails but can still commit the whole transaction for valid ones)
NEVER (expecting there is no transaction): This is the case when you want to be sure that no transaction exists at all. If some code that runs in transaction reaches this code, exception is thrown. So typically this is for cases completely opposite to MANDARTORY. E.g. when you know that no transaction should be affected by this code - very likely because the transaction should not exist.
NOT_SUPPORTED (continue non-transactionally): This is weaker than NEVER, you want the code to be run non-transactionally. If somehow you enter this code from context where transaction is, you suspend this transaction and continue non-transactionally.
From my experience, you very often want one business action to be atomic. Thus you want only one transaction per request/... For example simple REST call via HTTP that does some DB operations all in one HTTP-like transaction. So my typical usage is REQUIRES_NEW on the top level (JAX-RS Resource) and MANDATORY on all lower level services that are injected to this resource (or even lower).
This may be useful for you. It describes how code behave with given propagation (caller->method)
REQUIRED: NONE->T1, T1->T1
REQUIRES_NEW: NONE->T1, T1->T2
MANDATORY: NONE->Exception, T1->T1
NOT_SUPPORTED: NONE->NONE, T1->NONE
SUPPORTS: NONE->NONE, T1->T1
NEVER: NONE->NONE, T1->Exception

Spring,Hibernate - Batch processing of large amounts of data with good performance

Imagine you have large amount of data in database approx. ~100Mb. We need to process all data somehow (update or export to somewhere else). How to implement this task with good performance ? How to setup transaction propagation ?
Example 1# (with bad performance) :
#Singleton
public ServiceBean {
procesAllData(){
List<Entity> entityList = dao.findAll();
for(...){
process(entity);
}
}
private void process(Entity ent){
//data processing
//saves data back (UPDATE operation) or exports to somewhere else (just READs from DB)
}
}
What could be improved here ?
In my opinion :
I would set hibernate batch size (see hibernate documentation for batch processing).
I would separated ServiceBean into two Spring beans with different transactions settings. Method processAllData() should run out of transaction, because it operates with large amounts of data and potentional rollback wouldnt be 'quick' (i guess). Method process(Entity entity) would run in transaction - no big thing to make rollback in the case of one data entity.
Do you agree ? Any tips ?
Here are 2 basic strategies:
JDBC batching: set the JDBC batch size, usually somewhere between 20 and 50 (hibernate.jdbc.batch_size). If you are mixing and matching object C/U/D operations, make sure you have Hibernate configured to order inserts and updates, otherwise it won't batch (hibernate.order_inserts and hibernate.order_updates). And when doing batching, it is imperative to make sure you clear() your Session so that you don't run into memory issues during a large transaction.
Concatenated SQL statements: implement the Hibernate Work interface and use your implementation class (or anonymous inner class) to run native SQL against the JDBC connection. Concatenate hand-coded SQL via semicolons (works in most DBs) and then process that SQL via doWork. This strategy allows you to use the Hibernate transaction coordinator while being able to harness the full power of native SQL.
You will generally find that no matter how fast you can get your OO code, using DB tricks like concatenating SQL statements will be faster.
There are a few things to keep in mind here:
Loading all entites into memory with a findAll method can lead to OOM exceptions.
You need to avoid attaching all of the entities to a session - since everytime hibernate executes a flush it will need to dirty check every attached entity. This will quickly grind your processing to a halt.
Hibernate provides a stateless session which you can use with a scrollable results set to scroll through entities one by one - docs here. You can then use this session to update the entity without ever attaching it to a session.
The other alternative is to use a stateful session but clear the session at regular intervals as shown here.
I hope this is useful advice.

NHibernate IsUpdateNecessary takes enormous amount of time

My C# 3.5 application uses SQL Server 2008 R2, NHibernate and CastleProject ActiveRecord. The application imports emails to database along with their attachments. Saving of emails and attachments is performed by 50 emails in new session and transaction scope to make sure they are not stored in memory (there can be 100K of emails in some mailbox).
Initially emails are saved very quickly. However, closer to 20K emails performance degrades dramatically. Using dotTrace I got the following picture:
Obviously, when I save an attachment, NHibernate tries to see if it really should save it and probably compares with another attachments in the session. To do so, it compares them byte by byte what takes almost 500 seconds (for the snapshot on the picture) and 600M enumerator operations.
All this looks crazy, especially when I know for sure that SaveAndFlush indeed should save the attachment without any checks: I know for sure that it is new and should be saved.
However, I cannot figure out, how to instruct NHibernate to avoid this check (IsUpdateNecessary). Please advise.
P.S. I am not sure but it might appear that degradation of performance closer to 20K is not connected with having some older mails in memory: I noticed that in mailbox I am working with, larger emails are stored later than smaller so the problem may be only in attachments comparison.
Update:
Looks like I need something like StatelessSessionScope, but there is no documentation on it even at CastleProject site! If I do something like
using (TransactionScope txScope = new TransactionScope())
using (StatelessSessionScope scope = new StatelessSessionScope())
{
mail.Save();
}
it fails with exception that Save is not supported by stateless session. I am supposed to insert objects into session, but I do not have any Session (only SessionScope, which adds up to SessionScope only single OpenSession method which accepts strange paramenters).
May be I missed it in that long text, but are you using stateless session for importing data? Using that prevents a lot of checks and also bypasses first level cache, thus using minimal resources.
Looks like I've found an easy solution: for my class Attachment, causing biggest performance penalty, I overridden the following method:
protected override int[] FindDirty(
object id,
System.Collections.IDictionary previousState,
System.Collections.IDictionary currentState, NHibernate.Type.IType[] types)
{
return new int[0];
}
Thus, dirty check always consider it dirty and does not do that crazy per-byte comparison.

Resources