We have a parent object with a collection of 500,000 child objects. We are using Hibernate for mapping, with Ehcache as the cache provider. Using the 2nd level cache for entities and collections works fine, as we can avoid requests to the database.
But loading 500,000 objects through the 2nd level cache still produces a lot of CPU load and memory garbage, and results in a response time of a few seconds. As the child objects are not immutable, we can't enable the hibernate.cache.use_reference_entries property.
With an application-layer cache of DAO objects on top of the Hibernate 2nd level cache, there is no CPU and no garbage memory overhead. The response time is a few milliseconds instead of seconds.
But the big disadvantage of this solution is that we have to manage this cache ourselves, including invalidation and synchronization in a clustered, multithreaded system.
My question is: is there a better solution that keeps the advantages of low CPU and garbage overhead? Does anyone have experience in handling large collections?
Do you really need that 500k at once?
You could remove the collection from the Parent and query the objects from Child by parent: SELECT c FROM Child c WHERE c.parent = :parent, and add pagination or filtering when you don't need all 500k at once.
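A minimal sketch of such a paginated query, assuming a plain JPA EntityManager and the Parent/Child entities from the question (the import is javax.persistence for Hibernate 5, jakarta.persistence for newer versions):

```java
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

public class ChildRepository {

    private final EntityManager em;

    public ChildRepository(EntityManager em) {
        this.em = em;
    }

    /** Loads one page of children for the given parent instead of the whole collection. */
    public List<Child> findByParent(Parent parent, int page, int pageSize) {
        TypedQuery<Child> query = em.createQuery(
                "SELECT c FROM Child c WHERE c.parent = :parent ORDER BY c.id", Child.class);
        query.setParameter("parent", parent);
        query.setFirstResult(page * pageSize); // offset of the page
        query.setMaxResults(pageSize);         // page size
        return query.getResultList();
    }
}
```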
You could also load the Child entities as DTOs, which would improve memory performance because Hibernate would not consider these DTOs for dirty checking. I guess this would cut the memory footprint roughly in half, although I never benchmarked it. A DTO also allows you to omit attributes which you don't need in this particular use case, saving memory and CPU.
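As an illustration (not from the original answer), a JPQL constructor expression is one way to get such DTOs; the ChildView class, its package and the selected fields are made-up examples:

```java
import java.util.List;
import javax.persistence.EntityManager;

// Plain DTO: never attached to the persistence context, so no dirty checking
// and no 1st level cache bookkeeping for these objects.
public class ChildView {

    private final Long id;
    private final String name;

    public ChildView(Long id, String name) {
        this.id = id;
        this.name = name;
    }

    public Long getId() { return id; }
    public String getName() { return name; }

    /** Example query: the DTO constructor is referenced by its fully qualified name. */
    public static List<ChildView> loadForParent(EntityManager em, Parent parent) {
        return em.createQuery(
                "SELECT new com.example.ChildView(c.id, c.name) "
                        + "FROM Child c WHERE c.parent = :parent", ChildView.class)
                .setParameter("parent", parent)
                .getResultList();
    }
}
```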
You could also take a look at enableDirtyTracking in Hibernate 5.
Related
So my understanding of the Hibernate first level cache was that it is scoped to sessions and transactions. Items remain in the cache during a transaction, but once the transaction is closed, i.e. the request is fulfilled, the items are cleaned up/evicted.
But I wondered if that is wrong: does the first level cache keep items after a request has been fulfilled, so that subsequent GET API requests go to the cache? Is there a time limit after which it evicts objects from the cache?
This is in Spring boot.
Your description of the first level cache is correct. It's per session/transaction. After the transaction is finished, the objects are left to be garbage collected.
To cache entities across sessions one needs to use the second level cache.
Using this can become a bit tricky for applications with multiple instances; depending on how the application is built, one might need to use a distributed cache to keep the cache in sync across instances of the application.
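Not from the original answer, but a rough sketch of what enabling the second level cache for an entity typically looks like, assuming Hibernate with a provider such as Ehcache via JCache on the classpath; the Product entity is a made-up example:

```java
// Typical configuration properties (standard Hibernate settings, the factory class depends on the provider):
//   hibernate.cache.use_second_level_cache=true
//   hibernate.cache.region.factory_class=org.hibernate.cache.jcache.JCacheRegionFactory
import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE) // READ_ONLY if the entity is immutable
public class Product {

    @Id
    private Long id;

    private String name;

    // getters and setters omitted for brevity
}
```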
I have a table with millions of rows (98% reads, maybe 1-2% writes) which has references to a couple of other config tables (with maybe 20 entries each). What are the best practices for caching the tables in this case? I cannot cache the table with millions of rows, but at the same time, I also don't want to hit the DB for the config tables. Is there a workaround for this? I'm using Spring Boot, and the data is in Postgres.
Thanks.
First of all, let me refer to this:
What are the best practices for caching the tables in this case
I don't think you should "cache tables" as you say. In the application, you work with objects, and those are what should be cached. This means the object that you cache should already be in a structure that includes these relations. Of course, in order to fetch the whole object from the database, you can use JOINs, but once the object gets cached, that no longer matters; the translation from the relational model to the object model has already been done.
Now, the question is too broad because the actual answer can vary depending on the technologies you use, the nature of the data, and so forth.
You should answer the following questions before you design the cache (the list is off the top of my head, but hopefully you'll get the idea):
What is the cache invalidation strategy? You say there are 2% writes; if the data gets updated, the data in the cache may become stale. Is that OK?
A kind of generalization of the previous question: If you have multiple instances (JVMs) of the same application, and one of them triggered the update to the DB data, what should happen to other apps' caches?
How long can stale/invalid data reside in the cache?
Do the use cases of your application access all the data in the tables with the same frequency, or is some data more "interesting" (for example, the oldest data is never read, but the latest data is always "hot")? If it's millions of rows, the JVM probably can't hold all these objects in the heap at the same time, so there should be some "slice" of this data...
What are the performance implications of having the cache? How does it affect the GC behavior?
What technologies can be used in your case? (Maybe due to regulations/licensing, some technologies are simply not available; this is more often the case in large organizations.)
Based on these observations you can go with:
In-memory cache:
Spring integrates with various in-memory cache technologies; you can also use them without Spring at all. To name a few:
Google Guava cache (for older Spring cache implementations)
Caffeine (for newer Spring cache implementations; see the sketch after this list)
An in-memory map of keys/values
In memory but in another process:
Redis
Infinispan
Now, these caches are slower than those listed in the previous category, but can still be significantly faster than the DB.
Data Grids:
Hazelcast
Off-heap memory-based caches (this means that you store the data off-heap, so it's not eligible for garbage collection)
Postgres-related solutions. For example, you can still go to the DB, but if you opt to keep the index in memory, the queries will be significantly faster.
ORM-specific caches (Hibernate, for example, has its own cache as well).
Some kind of mix of all of the above.
Implement your own solution - well, this is something you probably shouldn't do as a first attempt to address the issue, because caching can be tricky.
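Not from the original answer, but a minimal sketch of the in-memory option above using Spring's cache abstraction with Caffeine; the cache name, the size/TTL values and the ConfigEntry types are illustrative assumptions:

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;

@Configuration
@EnableCaching
class CacheConfig {

    @Bean
    CacheManager cacheManager() {
        CaffeineCacheManager manager = new CaffeineCacheManager("configEntries");
        manager.setCaffeine(Caffeine.newBuilder()
                .maximumSize(10_000)                       // bound the heap footprint
                .expireAfterWrite(10, TimeUnit.MINUTES));  // accept some staleness
        return manager;
    }
}

class ConfigEntry {
    // placeholder for the small, frequently read config data
}

@Service
class ConfigEntryService {

    // Results are cached by id, so repeated reads of the small config tables
    // skip the database entirely; only a cache miss hits the repository.
    @Cacheable(cacheNames = "configEntries", key = "#id")
    public ConfigEntry findById(long id) {
        // ... load from the repository/database here
        return new ConfigEntry();
    }
}
```

Here the TTL is the simple answer to the invalidation questions above: an update becomes visible, at the latest, when the cached entry expires.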
In the end, let me provide a link to a very interesting session given by Michael Plod about caching. I believe it will help you find the solution that works best for you.
I am having performance problems where an aggregate has a bag containing a large number of entities (1000+). Usually it contains at most 50 entities, but sometimes a lot more.
Using NHibernate Profiler I see that the duration to fetch the 1123 records of this bag from the database is 18 ms, but it takes NHibernate 1079 ms to process them. The problem here is that each of those 1123 records has one or two additional records. I fetch these using fetch="subselect"; fetching these additional records takes 16 ms in the database and 2527 ms of processing by NHibernate. So this action alone takes 3.5 seconds, which is way too expensive.
I read that this is due to the fact that updating the 1st level cache is the problem here, as its performance gets slow when loading a lot of entities. But what is a lot? NHibernate Profiler says that I have 1145 entities loaded by 31 queries (which is in my case the absolute minimum). This number of loaded entities does not seem like a lot to me.
In the current project we are using NHibernate v3.1.0.4000
I agree, 1000 entities aren't too many. Are you sure that the time isn't spent in one of the constructors or property setters? You may pause the debugger during the load to take a random sample of where it spends the time.
Also make sure that you use the reflection optimizer (I think it's turned on by default).
I assume that you measure the time of the query itself. If you measure the whole transaction, it almost certainly spends the time flushing the session. Avoid flushing by setting the FlushMode to Never (only if there aren't any changes in the session to be stored) or by using a StatelessSession.
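Not from the original answer: the same two ideas, expressed here with Hibernate's Java API (5.2+) purely for illustration, since NHibernate's API mirrors it closely with FlushMode.Never and IStatelessSession:

```java
import org.hibernate.FlushMode;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;

public class ReadOnlyLoadExamples {

    // Option 1: a regular session that never flushes automatically, so no
    // dirty-check flush happens for the loaded entities.
    static void loadWithManualFlush(SessionFactory sessionFactory) {
        try (Session session = sessionFactory.openSession()) {
            session.setHibernateFlushMode(FlushMode.MANUAL); // "Never" in older APIs
            // ... run the query and read the entities here
        }
    }

    // Option 2: a stateless session, which has no 1st level cache at all.
    static void loadWithStatelessSession(SessionFactory sessionFactory) {
        try (StatelessSession session = sessionFactory.openStatelessSession()) {
            // ... run the query and read the entities here
        }
    }
}
```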
A wild guess: Removing the batch-size setting may even make it faster because it doesn't need to assign the entities to the corresponding collections.
We are using a PostgreSQL database and AppFabric Server, running a moderately busy ASP.NET MVC e-commerce site.
Following the cache-aside pattern we request data from our cache, and if it is not available, we query the database.
This approach results in 'query storms', where the database receives multiple queries for the same data in a short space of time while a given object in the cache is being refreshed. This issue is exacerbated by longer-running queries, and obviously multiple requests for the same data can cause the query to run longer, forming an unpleasant feedback loop.
One solution to this problem is to use read-locking on the cache. However, this can itself cause performance issues in a web farm situation (or even on a single busy web server), as web servers are blocked on reads unnecessarily whenever a database query happens to be taking place.
Another solution is to drop the cache-aside pattern and seed the cache independently. This is the approach we have taken to mitigate the immediate issues we are seeing with this problem, however it is not possible with all data.
Am I missing something here? And what other approaches have people taken to avoid this behaviour?
Depending on the number of servers you have and your current cache architecture it may be worthwhile to evaluate adding a server-level (or in-process) cache as well. In effect you use this as a fallback cache, and it's especially helpful where hitting the primary storage (database) is either very resource intensive or slow.
When I've used this I've used the cache-aside pattern for the primary cache and a read-through design for the secondary--in which the secondary is locking and ensures the database isn't over-saturated by the same request. With this architecture a primary cache-miss results in at most one query per entity per server (or process) to the database.
So the basic workflow is:
1) Try to retrieve from primary / shared cache pool
* If successful, return
* If unsuccessful, continue
2) Check in-process cache for value
* If successful, return (optionally seeding primary cache)
* If unsuccessful, continue
3) Get lock by cache key (and double-check in-process cache, in case it's been added by another thread)
4) Retrieve object from primary persistence (db)
5) Seed in-process cache and return
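Not from the original answer: a rough Java-flavored sketch of steps 2-5, with one lock per cache key so that on a miss only one thread per process hits the database for a given key (class and method names are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// In-process, locking fallback cache. Other threads asking for the same key
// wait on that key's lock instead of issuing duplicate database queries.
public class LockingReadThroughCache<K, V> {

    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Map<K, Object> locks = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. the database query

    public LockingReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        V value = local.get(key);                  // step 2: check the in-process cache
        if (value != null) {
            return value;
        }
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {                      // step 3: lock by cache key
            value = local.get(key);                // double-check under the lock
            if (value == null) {
                value = loader.apply(key);         // step 4: hit the database once
                local.put(key, value);             // step 5: seed the in-process cache
            }
        }
        return value;
    }
}
```

In practice, ConcurrentHashMap.computeIfAbsent or a caching library gives you this per-key behavior (plus eviction) out of the box; the sketch just makes the locking explicit.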
I've done this using injectable wrappers, my cache layers all implement the relevant IRepository interface, and StructureMap injects the correct stack of caches. This keeps the actual cache behaviors flexible, focused, and easy to maintain despite being fairly complex.
We've used AppFabric successfully with the seeding strategy you mention above. We actually do use both solutions:
Seed known data where possible (we have a limited set, so this is actually easy for us to figure out)
Within each cache access method, make sure to do look-aside as necessary, and populate cache on retrieval from data store.
The look-aside is necessary, as items may be evicted due to memory pressure, or simply because they were missed in the seeding operation. We have a "warming" service that pulses on an interval (an hour) and keeps the cache populated with the necessary data. We keep analysis on cache misses, and use that to tweak our warming strategy if we see frequent misses during the warming interval.
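The warming service itself is not shown in the original post; as a language-neutral illustration (written here in Java, with placeholder names), the interval-based re-seeding boils down to something like this:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically re-seeds the shared cache so evicted or missed items are
// repopulated between pulses.
public class CacheWarmer {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable seedCache; // loads the known data set into the cache

    public CacheWarmer(Runnable seedCache) {
        this.seedCache = seedCache;
    }

    public void start() {
        // Run immediately, then once per hour, matching the warming interval above.
        scheduler.scheduleAtFixedRate(seedCache, 0, 1, TimeUnit.HOURS);
    }
}
```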
I am working on a basic Struts-based application that is experiencing major spikes in memory. We have a monitoring tool that notices one request per user adding 3 MB to the JVM heap. Are there any tips to encourage earlier garbage collection, free up memory, or improve performance?
The application is a basic Struts application, but there are a lot of rows in the JSP report, so a lot of objects may be created. It isn't anything you haven't seen before:
Perform a set of database queries.
Create a serializable POJO bean. This represents a row.
Add a row to an array list.
Set the array list to the form object when the action is invoked.
The JSP logic will iterate through the list from the ActionForm and the data is displayed to the user.
Notes:
1. The form is in session scope and possibly that array list of data (maybe this is an issue).
2. The POJO bean contains 20 or so fields, a mix of String or BigDecimal data.
The report can have 300 to 1200 or so rows. So there are at least that many objects created.
Given the information you provide, I'd estimate that you're typically loading 1 to 2 megabytes of data for a result: 750 rows * 20 fields * 100 bytes per field ≈ 1.4 MB. Now consider all of the temporary objects needed between the database and the final markup. 3 MB isn't surprising.
I'd only be concerned if that memory seems to have leaked; i.e., the next garbage collection of the young generation space doesn't collect all of those objects.
When designing reports to be rendered in a web application, consider the number of records fetched from the database.
If the number of records is high and the overall recordset takes a lot of memory, then consider paginating the report.
As far as possible, do not invoke the garbage collector explicitly, for two reasons:
Garbage collection is a costly process, as it scans the whole memory.
Most production servers are tuned at the JVM level to avoid explicit garbage collection.
I believe the problem is the ArrayList in the ActionForm, which needs to allocate a huge chunk of memory. I would write the query results directly to the response: read a row from the result set, write it to the response, read the next row, write it, and so on. Maybe it's not MVC, but it would be better for your heap :-)
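As a rough sketch of that streaming approach (not from the original answer; the query, column names and HTML layout are made up, and output escaping is omitted):

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.servlet.http.HttpServletResponse;

public class ReportStreamer {

    // Writes one HTML table row per database row; nothing is accumulated
    // in an ArrayList or kept in the session.
    public void streamReport(Connection connection, HttpServletResponse response) throws Exception {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<table>");
        try (PreparedStatement stmt = connection.prepareStatement(
                     "SELECT name, amount FROM report_rows");
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                out.print("<tr><td>");
                out.print(rs.getString("name"));
                out.print("</td><td>");
                out.print(rs.getBigDecimal("amount"));
                out.println("</td></tr>");
            }
        }
        out.println("</table>");
    }
}
```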
ActionForms are fine for CRUD operations, but for reports ... I don't think so.
Note: if the ActionForm has scope=session, the instance (along with the huge ArrayList) stays alive until the session expires. If scope=request, the instance becomes available for GC after the request.