Using ehcache as an in-memory data store - High memory usage - ehcache

We are trying to solve a problem with a legacy UI and a legacy schema: a complex query with multiple joins is called every 5 minutes, in a multi-user environment, against roughly a million records, and performance degrades drastically. To address this, we tried keeping the whole result set from the query in memory using Ehcache on a remote server (a VM with 8 GB RAM running a WebLogic application). Using RMI and Ehcache's built-in querying facilities, we are able to fetch the required records with good performance. We have also built an API that monitors and updates each record when there is a change or an insert. The problem we are facing is that caching the whole result set (1 million records) in Ehcache and querying it frequently leads to 6 GB+ of heap usage; query performance degrades gradually until a GC appears to run, after which performance returns to normal (observed in VisualVM). Each record from the result set is stored as an object whose key is its primary key value.
What would be a good way to work around this problem, considering that the objects are set to never expire and unwanted records are removed manually in code whenever they are updated?
Is there a better way to look at this problem as a whole?

Quick list of options:
Split your cache across different JVMs if you can easily shard your data. With multiple caches in different JVMs, each one automatically gets a lower memory footprint, although overall memory usage will grow because of the extra JVMs.
Consider upgrading to Ehcache with BigMemory Go support, which lets you use off-heap memory and thus reduce GC pressure. This is a commercial product requiring a license.
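To make the second option concrete, here is a rough sketch of what a capped-heap, off-heap-overflow cache could look like with the Ehcache 2.x programmatic API. This assumes BigMemory Go is on the classpath and licensed; the cache name, sizes, and exact fluent method names are illustrative and may differ between versions.
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.config.CacheConfiguration;
import net.sf.ehcache.config.MemoryUnit;

public class OffHeapCacheSetup {
    public static Cache buildRecordCache(CacheManager manager) {
        CacheConfiguration config = new CacheConfiguration()
                .name("recordCache")                          // hypothetical cache name
                .eternal(true)                                // matches the "never expire" requirement
                .maxBytesLocalHeap(1, MemoryUnit.GIGABYTES)   // keep only a hot subset on heap
                .overflowToOffHeap(true)                      // overflow the rest to off-heap (BigMemory)
                .maxBytesLocalOffHeap(4, MemoryUnit.GIGABYTES);
        Cache cache = new Cache(config);
        manager.addCache(cache);
        return cache;
    }
}
The idea is that only a bounded slice of the 1 million records lives on the heap, so the garbage collector has far less to scan, while the bulk of the data sits off-heap and is still queryable.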

Related

Hibernate Search: Reindex 50 million rows from a single table into Elasticsearch

We are currently using the default settings of the MassIndexer (10 objects loaded per query, per thread) with 7 threads to reindex data from one table (8-10 fields) into Elasticsearch. The table currently holds 25 million rows and will grow to a few hundred million.
MassIndexer indexer = searchSession.massIndexer(Entity.class)
        .threadsToLoadObjects(7);
indexer.start()
        .thenRun(() -> log.info("Mass Indexing Entity Complete"))
        .exceptionally(throwable -> {
            log.error("Mass Indexing Entity Failed", throwable);
            return null;
        });
The database is PostgreSQL on RDS, and we are using AWS Elasticsearch. The Hibernate Search version is 6.
Recently we hit a bottleneck during reindexing: it ran for hours with 20 million rows in the table. One of the reasons was that we had a connection pool with a maximum of 10 connections. With the current mass indexer setup (7 threads), 8 connections were taken (1 for ID lookup + 7 for entity loading), leaving only 2 for other operations and causing timeouts while waiting for a connection. We will increase the pool size to 20 and test.
What is the best strategy to reindex very large datasets? Can the MassIndexer scale to this volume with some configuration settings, or should we look at other strategies? What has worked in the past for someone with similar requirements?
UPDATE: Also it looks like the IDLoader thread is not batched, so for 50 million rows, it will load all 50 million IDs in memory in 1 query?
And, what is the use of idFetchSize? Looks like it is not used in the indexing process.
What is the best strategy to reindex very large datasets? Can MassIndexer scale to this high volume with some configuration settings?
With that many entities, things are definitely going to take more than just a few minutes.
Whether it can scale... the thing is, the mass indexer is just a middleman between your database and Elasticsearch. Assuming your database scales, and Elasticsearch scales, then the only thing required for the mass indexer to scale is to do more work in parallel. And you can control that.
Now, you probably meant "can it reindex in a satisfying amount of time", and that of course will depend on what your expectations are, as well as how much effort you put into tuning it.
The performance of mass indexing will be affected by the configuration you pass to the mass indexer, of course, but also by the schema and data of your entities, your RDBMS and its configuration, your Elasticsearch cluster and its configuration, the machines they run on, ... Really, no one knows what's possible: the only way to know is to try, assess the results, tune, and iterate.
I'd advise concentrating first on addressing lazy-loading issues, since those will have a tremendous impact on performance: be sure to set hibernate.default_batch_fetch_size to reduce the impact of lazy loading.
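For reference, this is an ORM-level property, so it goes in your persistence configuration rather than in the mass indexer. A minimal sketch, assuming a JPA persistence unit named "my-persistence-unit" (the name and the value 100 are just placeholders to tune for your data):
import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

Map<String, Object> props = new HashMap<>();
// Fetch lazy associations in batches instead of one query per entity.
props.put("hibernate.default_batch_fetch_size", "100");
EntityManagerFactory emf =
        Persistence.createEntityManagerFactory("my-persistence-unit", props);
The same property can equally be set in persistence.xml or hibernate.properties.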
Then, I can't do much more than repeating what the reference documentation says:
The MassIndexer was designed to finish the re-indexing task as quickly as possible, but there is no one-size-fits-all solution, so some configuration is required to get the best of it.
Performance optimization can get quite complex, so keep the following in mind while you attempt to configure the MassIndexer:
Always test your changes to assess their actual effect: advice provided in this section is true in general, but each application and environment is different, and some options, when combined, may produce unexpected results.
Take baby steps: before tuning mass indexing with 40 indexed entity types with two million instances each, try a more reasonable scenario with only one entity type, optionally limiting the number of entities to index to assess performance more quickly.
Tune your entity types individually before you try to tune a mass indexing operation that indexes multiple entity types in parallel.
Beyond tuning the mass indexer, remember that it only loads data from the database to push it to Elasticsearch. So sure, the mass indexer might be the bottleneck, but so could be the database or Elasticsearch, if they are under-dimensioned. Make sure that both can provide satisfying throughput as well: decent machines, clustering if necessary, server-side configuration, ...
Anyway, there are many things you can do: before you do, try to find out what the bottleneck is. Is your database always at 100% CPU? Then tune your database: change settings, use a beefier machine, ... Is Elasticsearch I/O clearly reaching its limits? Then tune Elasticsearch: change settings, add more nodes, ... Are both PostgreSQL and Elasticsearch doing just fine? Then maybe you need even more DB connections, or more ES connections, or more threads in your mass indexer. Or maybe it's something else; performance is hard.
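As a starting point for that tuning loop, here is what a more aggressive MassIndexer setup could look like, building on your snippet. The parameters (threadsToLoadObjects, batchSizeToLoadObjects, idFetchSize) are the ones already mentioned in this thread; the values are purely illustrative and need to be tested against your own pool size and hardware:
// Illustrative values only; keep threadsToLoadObjects + 1 (ID loading) below your pool limit.
MassIndexer indexer = searchSession.massIndexer(Entity.class)
        .threadsToLoadObjects(12)        // entity-loading threads (each needs a DB connection)
        .batchSizeToLoadObjects(50)      // entities loaded per batch/query
        .idFetchSize(1000);              // JDBC fetch size for the ID scroll
indexer.start()
        .thenRun(() -> log.info("Mass Indexing Entity Complete"))
        .exceptionally(throwable -> {
            log.error("Mass Indexing Entity Failed", throwable);
            return null;
        });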
Or should we look at other strategies?
I would leave that as a last resort. If you don't understand what is wrong exactly with the performance of the mass indexer, then you're unlikely to find a better solution.
If you don't trust the MassIndexer to do a good job, you can try doing it yourself: set up a thread that loads IDs and other threads that load the corresponding entities, then index them manually. That's not exactly simple to get right, but it's possible.
If you do just that, I doubt you will improve anything. But, assuming entity loading is the bottleneck, and not indexing (you must check that first!), I imagine that you could get better throughput by leveraging the specifics of your database:
If lazy loading seems to be the problem, you could use entity graphs to make sure all parts of your entity that are indexed will be loaded eagerly. The MassIndexer cannot currently do that, though hopefully it will someday (HSEARCH-521).
If there are some JDBC query hints that improve performance in your case, you could try setting them.
If the database is more than capable of handling the load, and the bottleneck seems to be the processing of entities into documents, then you can try to partition the IDs and run your "custom indexing process" on multiple machines, e.g. reindex IDs 1 to 25,000,000 on one machine and IDs 25,000,001 to 50,000,000 on another. You couldn't do that with the mass indexer, as it does not allow filtering the IDs (at least not in Hibernate Search 6.0, but it will in 6.1: HSEARCH-499).
UPDATE: Also it looks like the IDLoader thread is not batched, so for 50 million rows, it will load all 50 million IDs in memory in 1 query?
No, ids are loaded in batches. Then each batch is pushed to an internal queue, and consumed by a loading thread. The size of batches is controlled by batchSizeToLoadObjects.
The one exception is MySQL, whose default configuration is to load the whole result set in memory (don't ask me why), but that doesn't affect PostgreSQL. And anyway, that can be fixed (see below).
More information about the parameters here.
And, what is the use of idFetchSize? Looks like it is not used in the indexing process.
This is the JDBC fetch size. IDs are retrieved using a scroll (cursor), and the JDBC fetch size is the size of result pages (~ low-level buffers) for this scroll in your JDBC driver.
To be honest, it's mostly useful for MySQL (and perhaps MariaDB?), whose JDBC driver will load all results in memory even if we're using a cursor, unless the fetch size is set to Integer#MIN_VALUE. I know, it's weird.
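So on PostgreSQL you can usually leave idFetchSize alone or pick a moderate value; the MySQL workaround mentioned above would look roughly like this (not needed in your case, shown only for completeness):
// Only relevant for MySQL's Connector/J: stream the ID result set instead of buffering it all.
searchSession.massIndexer(Entity.class)
        .idFetchSize(Integer.MIN_VALUE)
        .start();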

Azure Redis cache latency

I am working on an application that has a web job and an Azure Function app. The web job populates the Redis cache for the Function app to consume. The cache size is around 10 MB. I am using lazy loading and following the other recommendations. I still find that the overall cache operation is slow. Depending on the size of the file I am processing, I may end up calling the Redis cache up to 100,000 times. I'm wondering if I should hold the cache data in a local variable instead of reading it from Redis every time. Has anyone experienced latency when accessing Redis? Does it make sense to create a singleton object in the C# Function app and refresh it based on a timer or other logic?
Could you consider these points in your usage? These are some good practices for Azure Redis Cache:
Redis works best with smaller values, so consider chopping up bigger data into multiple keys. In this Redis discussion, 100kb is considered "large". Read this article for an example problem that can be caused by large values.
Use Standard or Premium Tier for Production systems. The Basic Tier is a single node system with no data replication and no SLA. Also, use at least a C1 cache. C0 caches are really meant for simple dev/test scenarios since they have a shared CPU core, very little memory, are prone to "noisy neighbor", etc.
Remember that Redis is an In-Memory data store. so that you are aware of scenarios where data loss can occur.
Reuse connections - Creating new connections is expensive and increases latency, so reuse connections as much as possible. If you choose to create new connections, make sure to close the old connections before you release them (even in managed memory languages like .NET or Java).
Locate your cache instance and your application in the same region. Connecting to a cache in a different region can significantly increase latency and reduce reliability. Connecting from outside of Azure is supported, but not recommended especially when using Redis as a cache (as opposed to a key/value store where latency may not be the primary concern).
Configure your maxmemory-reserved setting to improve system responsiveness under memory pressure conditions, especially for write-heavy workloads or if you are storing larger values (100KB or more) in Redis. I would recommend starting with 10% of the size of your cache, then increase if you have write-heavy loads. See some considerations when selecting a value.
Avoid Expensive Commands - Some redis operations, like the "KEYS" command, are VERY expensive and should be avoided.
Configure your client library to use a "connect timeout" of at least 10 to 15 seconds, giving the system time to connect even under higher CPU conditions. If your client or server tend to be under high load, use an even larger value. If you use a large number of connections in a single application, consider adding some type of staggered reconnect logic to prevent a flood of connections hitting the server at the same time.
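The "reuse connections" point above is usually the biggest win. The question's app is C#, where this means sharing one ConnectionMultiplexer from StackExchange.Redis, but to stay with the Java examples on this page, here is a sketch of the same idea with Jedis; the host name and timeout are placeholders, and Azure Cache for Redis would additionally need SSL and an access key, omitted here for brevity:
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class RedisClientHolder {
    // One pool per process, created once and reused; 15 s connect timeout as suggested above.
    private static final JedisPool POOL =
            new JedisPool(new JedisPoolConfig(), "my-cache.example.net", 6379, 15_000);

    public static String get(String key) {
        try (Jedis jedis = POOL.getResource()) {   // borrow a connection, auto-returned on close
            return jedis.get(key);
        }
    }
}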

ServiceStack Redis - caching expensive queries

We have a number of really expensive queries, which involve multiple joins, which I would like to cache using Redis (using the ultimate ServiceStack.Redis framework).
How many rows/items should I be storing in Redis before memory becomes an issue?
E.g. can I store 10,000+ rows in Redis without worrying about memory issues? (Our server, which also hosts our web app, has 8 GB RAM.)
Secondly, what is the best way of storing them (as List or Hash?).
For the number of rows it depends on the row size. The best approach would be to start saving and see the memory usage on the Redis server. 10k doesn't sound like too much data.
On how to store them: I would use a Hash only if I needed to retrieve specific rows, for example if I were doing the filtering and sorting in Redis, which is theoretically possible. But most likely the filtering and sorting of results is done in your app, so you can keep all that data under one key. What we did in our app was to serialize all the results to JSON, compress (archive) them in code, and then save them to a single Redis key; this gave the smallest memory consumption.
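To illustrate that single-key approach: the answer used ServiceStack.Redis (.NET), but the same idea sketched in Java, assuming Jackson and Jedis on the classpath, would look roughly like this (key name and row shape are placeholders):
import java.io.ByteArrayOutputStream;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPOutputStream;
import com.fasterxml.jackson.databind.ObjectMapper;
import redis.clients.jedis.Jedis;

public class ResultSetCache {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void store(Jedis jedis, String key, List<Map<String, Object>> rows) throws Exception {
        byte[] json = MAPPER.writeValueAsBytes(rows);           // serialize the whole result set once
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(json);                                   // compress before storing
        }
        jedis.set(key.getBytes(), buffer.toByteArray());        // one key holds the compressed blob
    }
}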

Oracle Coherence Caching and application server's CPU usage

We started implementing Coherence in our application to improve performance and reduce load on the DB server and reduce web service calls.
We usually experience high CPU usage (on the WebLogic app server's JVM) during high load; the DB servers are usually not an issue.
Other than response-time improvement, how would Oracle Coherence improve the application server's CPU and heap usage during high load?
1) Reduce XML processing, as we will start retrieving Java objects from the cache that are ready to be used rather than having to unmarshal the XML.
2) Reduce ORM mapping overhead, as we won't be mapping table rows to objects for cached data.
3) What else?
Thanks a lot
Disclaimer - I work for Oracle on Coherence.
Assuming that the CPU load is going to XML marshaling, you should see reduced CPU consumption by placing the resulting objects in a cache. You'll still pay CPU for serialization, but object serialization takes much less CPU - and if you use POF for serialization you'll see even better performance.
If there is any affinity to where the objects are being used, you can take advantage of near caching to avoid going to the network to retrieve cached objects. This will only help if you do more reads than writes.
With Coherence you don't need to throw away your ORM - you can write a CacheStore (or use the JPA CacheStore we ship with OOTB) to transparently read from the database upon cache misses and update the database when the cache is updated. This works best if you're retrieving your ORM objects via primary key.
Without more details on what exactly is taking up CPU (thread dumps are a good low tech way to diagnose this) it's hard to say how much caching will help.
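Since POF is mentioned above as the cheaper serialization option, here is a minimal sketch of what a POF-serializable cached object could look like; the class name and POF indices are made up, and in practice the type also needs to be registered in a pof-config.xml:
import java.io.IOException;
import com.tangosol.io.pof.PofReader;
import com.tangosol.io.pof.PofWriter;
import com.tangosol.io.pof.PortableObject;

public class CachedOrder implements PortableObject {
    private long id;
    private String status;

    public CachedOrder() { }                       // no-arg constructor required for deserialization

    @Override
    public void readExternal(PofReader in) throws IOException {
        id = in.readLong(0);
        status = in.readString(1);
    }

    @Override
    public void writeExternal(PofWriter out) throws IOException {
        out.writeLong(0, id);
        out.writeString(1, status);
    }
}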

How to index records in Solr faster (and not impact ColdFusion web server)? Two JVM?

I have a 64 bit server, 8 GB RAM, dual quad CPU. No resources are ever hitting 100% (except, I guess, the JVM -- right?).
I need to index several million records for Solr, but the machine is in production. I recognize having a second machine for indexing would be helpful.
Should I dedicate a second instance of the JVM, dedicated to Solr?
Right now, when I run an index, pages which are normally served in 200 milliseconds will serve up in about 1.5 seconds, sometimes more... hitting, even, the dreaded "Service is Unavailable" error.
I adjusted my JVM Heap as follows:
-Xmx1024m
-XX:MaxPermSize=256m
In case I'm chasing the wrong solution, allow me to broaden the landscape a bit. It seems that I can't affect the indexing speed of Solr. I had previously been indexing about 150,000 records per hour on a dev server virtualized on a workstation. In a production environment with much more hardware available, I'm indexing at the exact same speed.
Without data to prove it, I think that my JVM adjustments did not speed up the indexing, although it may have allowed the CF server to continue serving pages. I must say, the indexing speed is terribly slow, but I do know that it's not a function of the data access layer. I rewrote it from pure ORM to objects backed by SQL Stored Procedures thinking that was the slowdown (no effect).
Use a separate instance for indexing. The only trick is getting the running search instance to re-read the updated index; for that, you set up a master (the indexer) and a slave (the searcher) and do replication. This both keeps the searcher from being interrupted and lets the indexer use its own JVM, including its own share of the resources.
Have you tried these optimization tips?
http://bloggeraroundthecorner.blogspot.com/2009/08/tuning-coldfusion-solr-part-1.html
http://bloggeraroundthecorner.blogspot.com/2009/08/tuning-coldfusion-solr-part-2.html
http://bytestopshere.com/post.cfm/lessons-learned-moving-from-verity-to-solr-part-1
