Spring, Hibernate - Batch processing of large amounts of data with good performance

Imagine you have a large amount of data in a database, approx. ~100 MB. We need to process all of the data somehow (update it, or export it somewhere else). How can this task be implemented with good performance? How should transaction propagation be set up?
Example 1 (with bad performance):

@Singleton
public class ServiceBean {

    public void processAllData() {
        List<Entity> entityList = dao.findAll();
        for (Entity entity : entityList) {
            process(entity);
        }
    }

    private void process(Entity entity) {
        // data processing
        // saves data back (UPDATE operation) or exports it somewhere else (just READs from the DB)
    }
}
What could be improved here?
In my opinion:
I would set the Hibernate batch size (see the Hibernate documentation on batch processing).
I would separate ServiceBean into two Spring beans with different transaction settings. The processAllData() method should run outside of a transaction, because it operates on large amounts of data and a potential rollback wouldn't be 'quick' (I guess). The process(Entity entity) method would run in a transaction - rolling back a single entity is no big deal. A sketch of that split follows below.
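Roughly, that split could look like this (a sketch; using REQUIRES_NEW on the per-entity method is one way to get the described behaviour, and the bean and method names are made up):

@Service
public class DataProcessor {

    @Autowired
    private EntityProcessor entityProcessor;

    // deliberately not @Transactional: runs outside of any transaction
    public void processAllData() {
        for (Entity entity : dao.findAll()) {
            entityProcessor.process(entity);
        }
    }
}

@Service
public class EntityProcessor {

    // each entity gets its own small transaction; a rollback affects one entity only
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void process(Entity entity) {
        // data processing, then UPDATE or export
    }
}

Note that the processing method lives in a separate bean on purpose: with Spring's proxy-based AOP, a self-invocation of process() from within the same bean would bypass the transactional proxy.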
Do you agree? Any tips?

Here are 2 basic strategies:
JDBC batching: set the JDBC batch size, usually somewhere between 20 and 50 (hibernate.jdbc.batch_size). If you are mixing and matching object C/U/D operations, make sure you have Hibernate configured to order inserts and updates, otherwise it won't batch (hibernate.order_inserts and hibernate.order_updates). And when doing batching, it is imperative to make sure you clear() your Session so that you don't run into memory issues during a large transaction.
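A minimal sketch of this strategy (assuming Hibernate 5-era APIs, a configured SessionFactory, and a mapped Entity class; the property values are illustrative):

# configuration properties
hibernate.jdbc.batch_size=30
hibernate.order_inserts=true
hibernate.order_updates=true

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
ScrollableResults results = session.createQuery("from Entity")
        .scroll(ScrollMode.FORWARD_ONLY);
int count = 0;
while (results.next()) {
    Entity entity = (Entity) results.get(0);
    process(entity); // your update logic
    if (++count % 30 == 0) { // match hibernate.jdbc.batch_size
        session.flush();     // push the current batch of statements to the DB
        session.clear();     // detach processed entities to keep memory flat
    }
}
tx.commit();
session.close();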
Concatenated SQL statements: implement the Hibernate Work interface and use your implementation class (or anonymous inner class) to run native SQL against the JDBC connection. Concatenate hand-coded SQL via semicolons (works in most DBs) and then process that SQL via doWork. This strategy allows you to use the Hibernate transaction coordinator while being able to harness the full power of native SQL.
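A sketch of the second strategy via the org.hibernate.jdbc.Work interface (the table and column names are made up for illustration; verify that your DB/driver accepts semicolon-concatenated statements):

session.doWork(new Work() {
    @Override
    public void execute(Connection connection) throws SQLException {
        try (Statement stmt = connection.createStatement()) {
            stmt.execute(
                "UPDATE entity SET status = 'PROCESSED' WHERE id = 1;"
              + "UPDATE entity SET status = 'PROCESSED' WHERE id = 2;");
        }
    }
});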
You will generally find that no matter how fast you can get your OO code, using DB tricks like concatenating SQL statements will be faster.

There are a few things to keep in mind here:
Loading all entities into memory with a findAll method can lead to OOM exceptions.
You need to avoid attaching all of the entities to a session - since every time Hibernate executes a flush it will need to dirty-check every attached entity. This will quickly grind your processing to a halt.
Hibernate provides a stateless session, which you can use with a scrollable result set to scroll through entities one by one - docs here. You can then use this session to update each entity without ever attaching it to a session.
The other alternative is to use a stateful session but clear the session at regular intervals, as shown here.
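A sketch of the stateless-session variant (assuming a SessionFactory and a mapped Entity class):

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();
ScrollableResults results = session.createQuery("from Entity")
        .scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    Entity entity = (Entity) results.get(0);
    process(entity);        // modify the entity
    session.update(entity); // issues an immediate UPDATE; nothing stays attached
}
results.close();
tx.commit();
session.close();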
I hope this is useful advice.

Related

Optimizing findAll in Spring Data JPA

I have a table which holds a list of lookup values, max 50 rows.
Currently, I am querying this table every time to look for a particular value, which is not efficient.
So, I am planning to optimize this by loading all the values at once as a List from the repository using findAll:
List<CardedList> findAll();
My question here is:
Class A -> Class B, where Class B holds this repository. Will it query findAll() every time Class A calls Class B?
class A {
    // for each item in the list, call Class B
    b.someMethod();
}

class B {
    @Autowired
    CardedListRepository cardRepo;

    void someMethod() {
        cardRepo.findAll();
    }
}
What is the best way to achieve this?
If it is just 50 rows you could cache them in an instance variable of a service and check like this:

class B {
    @Autowired
    CardedListRepository cardRepo;

    List<CardedList> cardedList = new ArrayList<>();

    void someMethod() {
        if (cardedList.isEmpty()) {
            cardedList = cardRepo.findAll();
        }
        // do the rest of someMethod
    }
}
The proposed "solution" by @Juliyanage Silva (to "cache" the findAll query result as a simple instance variable of service B) can be very dangerous and should not be implemented before checking very carefully that it works under all circumstances.
Just imagine the same service instance being called from a subsequent transaction - you would end up with a (probably outdated) list of detached entities.
(e.g. leading to LazyInitializationExceptions when accessing not initialized properties, etc.)
Hibernate already provides several caching mechanisms, as e.g. a standard first level cache, which avoids unnecessary DB round trips when looking for an already loaded entity by ID within the same transaction.
However, query results (as from findAll) are not cached by default, as explained in the documentation:
Caching of query results introduces some overhead in terms of your applications normal transactional processing. For example, if you cache results of a query against Person, Hibernate will need to keep track of when those results should be invalidated because changes have been committed against any Person entity.
That, coupled with the fact that most applications simply gain no benefit from caching query results, leads Hibernate to disable caching of query results by default.
To enable the Hibernate query cache, the second level cache needs to be configured. To prevent ending up with stale entries when having multiple application instances, this calls for a distributed cache (like Hazelcast or EhCache).
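For illustration, enabling it takes roughly the following configuration, and each query must additionally be marked as cacheable (a sketch; the region factory class depends on your Hibernate version and cache provider):

hibernate.cache.use_second_level_cache=true
hibernate.cache.use_query_cache=true
hibernate.cache.region.factory_class=org.hibernate.cache.ehcache.EhCacheRegionFactory

// in a Spring Data JPA repository, the cacheable hint can be set like this:
@QueryHints(@QueryHint(name = "org.hibernate.cacheable", value = "true"))
List<CardedList> findAll();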
There are also various discussions on using Spring's caching mechanisms for this purpose. However, there are various pitfalls when it comes to caching collections. And when running multiple application instances you may need a distributed cache or another global invalidation mechanism, too:
How to add cache feature in Spring Data JPA CRUDRepository
Spring Cache with collection of items/entities
Spring Caching not working for findAll method
So depending on your use case, it may be easiest to just avoid unnecessary calls to service B by storing the result in a local variable within the calling method of service A.
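That could look roughly like this (a sketch; the method and field names are made up):

class A {
    @Autowired
    B b;

    void processItems(List<Item> items) {
        List<CardedList> cardedList = b.findAllCarded(); // one repository call per request
        for (Item item : items) {
            b.someMethod(item, cardedList); // pass the already-loaded list along
        }
    }
}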

Spring JPA: keeping the persistence context small

How can I keep the persistence context small in a Spring JPA environment?
Why: I know that by keeping the persistence context small, there will be a significant performance boost!
The main problem area is:
@Transactional
void methodA() {
    WHILE retrieving next object of 51M (via a stateless session connection) DO
        get some further (readonly) data
        IF condition holds THEN
            assessment = retrieve assessment object (= record from database)
            change assessment data
            save the assessment to the database
}
From experiments in this problem domain I know that cleaning the persistence context every 250 iterations makes the performance a lot better.
When I add these lines to the code, so that every 250 iterations:

@PersistenceContext
private EntityManager em;

WHILE ...
    ...
    IF counter++ % 250 == 0 THEN
        em.flush();
        em.clear();
}
Then I get errors like "cannot reliably perform the flush operation".
I tried to make the main transaction read-only and the assessment-save part 'Transaction-requires-new', but then I get errors like 'operating on a detached entity'. Very strange, because I never revisit an open entity.
So, how can I keep the persistence context small?
I have tried dozens of ways. Help is really appreciated.
I would suggest you move all the condition logic into your query so that you don't even have to load that many rows/objects. Or even better, write an update query that does all of that in a single transaction so you don't need to transfer any data at all between your application and database.
I don't think that flushing is necessary with a stateless session, as it doesn't keep state, i.e. it flushes and clears the persistence context after every operation. Apart from that, I also think this might not be what you really want, as it could lead to re-fetching of data.
If you don't want the persistence context to fill up, then use DTOs for fetching the data and execute update statements to flush the changes.
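A sketch of that approach (the entity, DTO, and helper names are hypothetical):

// fetch only what is needed into a DTO, so no entity enters the persistence context
List<AssessmentDto> rows = em.createQuery(
        "select new com.example.AssessmentDto(a.id, a.score) from Assessment a",
        AssessmentDto.class)
    .getResultList();

for (AssessmentDto dto : rows) {
    if (conditionHolds(dto)) {
        em.createQuery("update Assessment a set a.score = :score where a.id = :id")
            .setParameter("score", recompute(dto))
            .setParameter("id", dto.getId())
            .executeUpdate(); // bulk-style update that bypasses the persistence context
    }
}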

Batching stores transparently

We are using the following frameworks and versions:
jOOQ 3.11.1
Spring Boot 2.3.1.RELEASE
Spring 5.2.7.RELEASE
I have an issue where some of our business logic is divided into logical units that look as follows:
Request containing a user transaction is received
This request contains various information, such as the type of transaction, which products are part of this transaction, what kind of payments were done, etc.
These attributes are then stored individually in the database.
In code, this looks approximately as follows:
TransactionRecord transaction = transactionRepository.create();
transaction.create(creationCommand);
In Transaction#create (which runs transactionally), something like the following occurs:
storeTransaction();
storePayments();
storeProducts();
// ... other relevant information
A given transaction can have many different types of products and attributes, all of which are stored. Many of these attributes result in UPDATE statements, while some may result in INSERT statements - it is difficult to fully know in advance.
For example, the storeProducts method looks approximately as follows:
products.forEach(product -> {
    ProductRecord record = productRepository.findProductByX(...);
    if (record == null) {
        record = productRepository.create();
        record.setX(...);
        record.store();
    } else {
        // do something else
    }
});
If the products are new, they are INSERTed. Otherwise, other calculations may take place. Depending on the size of the transaction, this single user transaction could obviously result in up to O(n) database calls/roundtrips, and even more depending on what other attributes are present. In transactions where a large number of attributes are present, this may result in upwards of hundreds of database calls for a single request (!). I would like to bring this down as close as possible to O(1) so as to have more predictable load on our database.
Naturally, batch and bulk inserts/updates come to mind here. What I would like to do is to batch all of these statements into a single batch using jOOQ, and execute after successful method invocation prior to commit. I have found several (SO Post, jOOQ API, jOOQ GitHub Feature Request) posts where this topic is implicitly mentioned, and one user groups post that seemed explicitly related to my issue.
Since I am using Spring together with jOOQ, I believe my ideal solution (preferably declarative) would look something like the following:
@Batched(100) // batch size as parameter, potentially
@Transactional
public void createTransaction(CreationCommand creationCommand) {
// all inserts/updates above are added to a batch and executed on successful invocation
}
For this to work, I imagine I'd need to manage a scoped (ThreadLocal/Transactional/Session scope) resource which can keep track of the current batch such that:
Prior to entering the method, an empty batch is created if the method is #Batched,
A custom DSLContext (perhaps extending DefaultDSLContext) that is made available via DI has a ThreadLocal flag which keeps track of whether any current statements should be batched or not, and if so
Intercept the calls and add them to the current batch instead of executing them immediately.
However, step 3 would necessitate having to rewrite a large portion of our code from the (IMO) relatively readable:
records.forEach(record -> {
    record.setX(...);
    // ...
    record.store();
});
to:
userObjects.forEach(userObject -> {
    dslContext.insertInto(...).values(userObject.getX(), ...).execute();
});
which would defeat the purpose of having this abstraction in the first place, since the second form can also be rewritten using DSLContext#batchStore or DSLContext#batchInsert. IMO however, batching and bulk insertion should not be up to the individual developer and should be able to be handled transparently at a higher level (e.g. by the framework).
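For reference, the explicit (non-transparent) route alluded to above would look something like this sketch, collecting the records first and letting jOOQ batch the stores in one round trip (PRODUCT stands in for a generated table reference):

List<ProductRecord> batch = new ArrayList<>();
userObjects.forEach(userObject -> {
    ProductRecord record = dslContext.newRecord(PRODUCT);
    record.setX(...);
    batch.add(record);
});
dslContext.batchStore(batch).execute();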
I find the readability of the jOOQ API to be an amazing benefit of using it, however it seems that it does not lend itself (as far as I can tell) to interception/extension very well for cases such as these. Is it possible, with the jOOQ 3.11.1 (or even current) API, to get behaviour similar to the former with transparent batch/bulk handling? What would this entail?
EDIT:
One possible but extremely hacky solution that comes to mind for enabling transparent batching of stores would be something like the following:
Create a RecordListener and add it as a default to the Configuration whenever batching is enabled.
In RecordListener#storeStart, add the query to the current Transaction's batch (e.g. in a ThreadLocal<List>)
The AbstractRecord has a changed flag which is checked (org.jooq.impl.UpdatableRecordImpl#store0, org.jooq.impl.TableRecordImpl#addChangedValues) prior to storing. Resetting this (and saving it for later use) makes the store operation a no-op.
Lastly, upon successful method invocation but prior to commit:
Reset the changes flags of the respective records to the correct values
Invoke org.jooq.UpdatableRecord#store, this time without the RecordListener or while skipping the storeStart method (perhaps using another ThreadLocal flag to check whether batching has already been performed).
As far as I can tell, this approach should work, in theory. Obviously, it's extremely hacky and prone to breaking as the library internals may change at any time if the code depends on Reflection to work.
Does anyone know of a better way, using only the public jOOQ API?
jOOQ 3.14 solution
You've already discovered the relevant feature request #3419, which will solve this on the JDBC level starting from jOOQ 3.14. You can either use the BatchedConnection directly, wrapping your own connection to implement the below, or use this API:
ctx.batched(c -> {
    // Make sure all records are attached to c, not ctx, e.g. by fetching from c.dsl()
    records.forEach(record -> {
        record.setX(...);
        // ...
        record.store();
    });
});
jOOQ 3.13 and before solution
For the time being, until #3419 is implemented (it will be, in jOOQ 3.14), you can implement this yourself as a workaround. You'd have to proxy a JDBC Connection and PreparedStatement and ...
... intercept all:
Calls to Connection.prepareStatement(String), returning a cached proxy statement if the SQL string is the same as for the last prepared statement, or batch execute the last prepared statement and create a new one.
Calls to PreparedStatement.executeUpdate() and execute(), and replace those by calls to PreparedStatement.addBatch()
... delegate all:
Calls to other API, such as e.g. Connection.createStatement(), which should flush the above buffered batches, and then call the delegate API instead.
I wouldn't recommend hacking your way around jOOQ's RecordListener and other SPIs, I think that's the wrong abstraction level to buffer database interactions. Also, you will want to batch other statement types as well.
Do note that by default, jOOQ's UpdatableRecord tries to fetch generated identity values (see Settings.returnIdentityOnUpdatableRecord), which is something that prevents batching. Such store() calls must be executed immediately, because you might expect the identity value to be available.
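If reading back identities is not needed, that behaviour can presumably be turned off so that store() calls can participate in a batch; a sketch (check the Settings API of your jOOQ version):

Settings settings = new Settings()
    .withReturnIdentityOnUpdatableRecord(false); // don't fetch generated IDs on store()

DSLContext ctx = DSL.using(connection, SQLDialect.POSTGRES, settings);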

EJB weblogic.ejb20.cache.CacheFullException

I am working on an application using EJB 1.2. It was previously running fine, but for the past few days I have been getting the following exception:
Exception in ejbLoad: weblogic.ejb20.cache.CacheFullException: size=85783, target=5000, incr=1
    at weblogic.ejb20.cache.EntityCache$SizeTracker.shrinkNext(EntityCache.java:438)
    at weblogic.ejb20.cache.EntityCache.put(EntityCache.java:141)
    at weblogic.ejb20.manager.DBManager.getReadyBean(DBManager.java:332)
    at weblogic.ejb20.manager.DBManager.preInvoke(DBManager.java:249)
    at weblogic.ejb20.internal.BaseEJBLocalObject.preInvoke(BaseEJBLocalObject.java:228)
    at weblogic.ejb20.internal.EntityEJBLocalObject.preInvoke(EntityEJBLocalObject.java:72)
    at com.nextjet.enterprise.locationcode.locationcode.LocationCode_v2epgs_ELOImpl.getLocationCodeData(LocationCode_v2epgs_ELOImpl.java:28)
    at com.nextjet.enterprise.locationcode.locationcodemanager.LocationCodeManagerBean.loadShippingAddress(LocationCodeManagerBean.java:538)
    at com.nextjet.enterprise.locationcode.locationcodemanager.LocationCodeManagerBean.doSearchShippingAddresses(LocationCodeManagerBean.java:514)
    at com.nextjet.enterprise.locationcode.locationcodemanager.LocationCodeManagerBean.lookupAccountShipping.....
For now I am changing the value of <max-beans-in-cache> in weblogic-ejb-jar.xml.
I am changing it to <max-beans-in-cache>100000</max-beans-in-cache>.
Is this the only solution for this kind of exception, or could there be a data-related issue coming from the database?
100000 is quite a high value for max-beans-in-cache, and from the log it seems the application tried to load up to 85783 instances of the EJB.
I would suggest some refactoring in your code.
Your code is calling:
com.nextjet.enterprise.locationcode.locationcode.LocationCode_v2epgs_ELOImpl
    .getLocationCodeData()
Is this mainly a read operation? Or are you doing simultaneous writes and reads?
You could refactor this in two ways to reduce the EJB overhead if it is mainly doing read operations.
1) Read Oracle's recommendation on tuning the EJB settings and database options for concurrency, especially the Read-Mostly pattern
http://docs.oracle.com/cd/E13222_01/wls/docs81/ejb/entity.html#ChoosingaConcurrencyStrategy
2) If you are mainly doing reads - then don't use Entity EJBs at all. Use the FastLaneReader pattern, which fetches the data for SELECTs via a direct JDBC call, while writes keep going through the EJBs as at present. That way, max-beans-in-cache can be reduced (see the sketch after the link below).
A very detailed example is given on the Sun Design Patterns site
http://java.sun.com/blueprints/patterns/FastLaneReader.html
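A minimal FastLaneReader-style sketch (modern Java syntax for brevity; the DataSource JNDI name, table, and column names are made up):

DataSource ds = (DataSource) new InitialContext().lookup("jdbc/MyDataSource");
try (Connection con = ds.getConnection();
     PreparedStatement ps = con.prepareStatement(
         "SELECT location_code, address FROM location_code WHERE account_id = ?")) {
    ps.setLong(1, accountId);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // build lightweight read-only view objects instead of Entity EJBs
            results.add(new LocationCodeView(rs.getString(1), rs.getString(2)));
        }
    }
}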

Entity Framework and ObjectContext n-tier architecture

I have an n-tier application based on pretty classic layers: User Interface, Services (WCF), Business Logic and Data Access.
The database (SQL Server) is obviously queried through Entity Framework. The problem is basically that every call starts from the user interface and goes through all the layers, but in doing so I need to create a new ObjectContext each time for every operation, and that makes performance very bad, because every time I need to reload the metadata and recompile the query.
The most suggested pattern would be the one below, and it is what I'm actually doing: creating and passing a new context through the business layer methods each time the service receives a call.
public BusinessObject GetQuery() {
    using (MyObjectContext context = new MyObjectContext()) {
        // ..do something
    }
}
For simple queries I don't see any particular delay and it works fine, but a complex and heavy query that takes 2 seconds on its own ends up taking about 15 seconds per call.
I could make the ObjectContext static and that would solve the performance issue, but this seems to be recommended by no one, also because I wouldn't be able to access the context from different threads at the same time, and multiple concurrent calls raise an exception. I could make it thread-safe, but keeping the same ObjectContext alive for a long time makes it bigger and bigger (and slower), because of the references it accumulates with every query it executes.
I think the architecture I have is a very common one, so what is the best-known way to implement and use the ObjectContext?
Thank you,
Marco
In a Web context, it's best to use a stateless approach and create an ObjectContext for each request.
The cost of ObjectContext construction is minimal. The metadata is loaded from a global cache, so only the first call will have to load it.
Static is definitely not a good idea. The ObjectContext is not thread-safe, and this will lead to problems when using it in a WCF service with multiple calls. Making it thread-safe will cost performance, and it can cause subtle errors when it is reused across multiple requests.
Check this info: How to decide on a lifetime for your ObjectContext
Working with a static object context is not a good idea. A static context will be shared by all users of the web application, meaning that when one user modifies the context, for example by calling SaveChanges, all other users using the context will be affected (this would be a problem supposing they have added or updated data in the context but have not yet called SaveChanges). The best practice while working with an object context is to keep it alive for the duration of the request and use it to perform any atomic business operations. You would want to check out the Unit of Work pattern and the repository pattern:
uow
uow and repository in EF
If you feel you are having performance issues with your queries and there is a possibility that you will reuse a query, I would recommend you use precompiled LINQ queries. You can check out the links below for more info:
precompiled LINQ - Julie Lerman
precompiled LINQ
What you show is the typical pattern for using a context - per request, similar to using a database connection.
What makes you think the bad performance is related to recreating the context? This is very, very likely not the case. How did you measure this impact?
If you have such performance-critical code that this overhead truly matters, you should not use Entity Framework, since there will always be some overhead, even if it should be very little in the general case. I would start focusing on your data model and the underlying data store instead, since those will have a much larger impact on your query performance. Have you optimized your queries? Did you put indexes everywhere you need them? Can you de-normalize the data to remove joins?
