Hibernate Search: Reindex 50 million rows from a single table into Elasticsearch

We are currently using the default settings (10 objects to load per query, per thread) of the mass indexer, with 7 threads, to reindex data from one table (8-10 fields) into Elasticsearch. The table currently holds 25 million rows and will grow to a few hundred million.
MassIndexer indexer = searchSession.massIndexer(Entity.class)
        .threadsToLoadObjects(7);
indexer.start()
        .thenRun(() -> log.info("Mass Indexing Entity Complete"))
        .exceptionally(throwable -> {
            log.error("Mass Indexing Entity Failed", throwable);
            return null;
        });
The database is PostgreSQL on RDS, and we are using the AWS Elasticsearch service. The Hibernate Search version is 6.
Recently we hit a bottleneck during the reindexing process: it ran for hours with 20 million rows in the table. One of the reasons was that we had a connection pool with a maximum of 10 connections. With the current mass indexer setup (7 threads), 8 connections were taken by indexing (1 for the ID lookup + 7 for entity loading), leaving only 2 connections for other operations and causing timeouts while waiting for a connection. We will increase the pool size to 20 and test.
What is the best strategy to reindex very large datasets? Can the MassIndexer scale to this volume with some configuration settings? Or should we look at other strategies? What has worked in the past for someone with the same requirements?
UPDATE: Also, it looks like the ID loader thread is not batched, so for 50 million rows, will it load all 50 million IDs into memory in one query?
And what is the use of idFetchSize? It looks like it is not used in the indexing process.

What is the best strategy to reindex very large datasets? Can MassIndexer scale to this high volume with some configuration settings?
With that many entities, things are definitely going to take more than just a few minutes.
Whether it can scale... the thing is, the mass indexer is just a middleman between your database and Elasticsearch. Assuming your database scales, and Elasticsearch scales, then the only thing required for the mass indexer to scale is to do more work in parallel. And you can control that.
Now, you probably meant "can it reindex in a satisfying amount of time", and that of course will depend on what your expectations are, as well as how much effort you put into tuning it.
The performance of mass indexing will be affected by the configuration you pass to the mass indexer, of course, but also by the schema and data of your entities, your RDBMS and its configuration, your Elasticsearch cluster and its configuration, the machines they run on, ... Really, no one knows what's possible: the only way to know is to try, assess the results, tune, and iterate.
I'd advise concentrating first on addressing lazy-loading issues, since those have a tremendous impact on performance; be sure to set hibernate.default_batch_fetch_size to reduce that impact.
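For example, here is a minimal sketch of setting that property when bootstrapping JPA programmatically; the value 100 and the persistence unit name are placeholders, and you can just as well set the same property in persistence.xml or your framework's configuration:

import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class JpaBootstrap {
    public static EntityManagerFactory create() {
        Map<String, String> settings = new HashMap<>();
        // Batch the lazy loading of associations: up to 100 entities per SELECT
        // instead of one query per uninitialized association (tune the value).
        settings.put("hibernate.default_batch_fetch_size", "100");
        return Persistence.createEntityManagerFactory("my-persistence-unit", settings);
    }
}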
Then, I can't do much more than repeating what the reference documentation says:
The MassIndexer was designed to finish the re-indexing task as quickly as possible, but there is no one-size-fits-all solution, so some configuration is required to get the best of it.
Performance optimization can get quite complex, so keep the following in mind while you attempt to configure the MassIndexer:
Always test your changes to assess their actual effect: advice provided in this section is true in general, but each application and environment is different, and some options, when combined, may produce unexpected results.
Take baby steps: before tuning mass indexing with 40 indexed entity types with two million instances each, try a more reasonable scenario with only one entity type, optionally limiting the number of entities to index to assess performance more quickly.
Tune your entity types individually before you try to tune a mass indexing operation that indexes multiple entity types in parallel.
Beyond tuning the mass indexer, remember that it only loads data from the database to push it to Elasticsearch. So sure, the mass indexer might be the bottleneck, but so could be the database or Elasticsearch, if they are under-dimensioned. Make sure that both can provide satisfying throughput as well: decent machines, clustering if necessary, server-side configuration, ...
Anyway, there are many things you can do: before you do, try to find out what the bottleneck is. Is your database always at 100% CPU? Then tune your database: change settings, use a beefier machine, ... Is Elasticsearch I/O clearly reaching its limits? Then tune Elasticsearch: change settings, add more nodes, ... Are both PostgreSQL and Elasticsearch doing just fine? Then maybe you should have even more DB connections, or more ES connections, or more threads in your mass indexer. Or maybe it's something else; performance is hard.
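To give an idea of the knobs involved on the mass indexer side, here is a sketch of a more heavily tuned call. The methods shown are standard MassIndexer options in Hibernate Search 6, but every value below is a placeholder to be validated against your own measurements:

MassIndexer indexer = searchSession.massIndexer(Entity.class)
        // Threads loading entities in parallel; each holds a DB connection,
        // so size your connection pool accordingly (threads + 1 for the ID scroll).
        .threadsToLoadObjects(12)
        // Entities loaded (and indexed) per batch on each loading thread.
        .batchSizeToLoadObjects(50)
        // JDBC fetch size for the scroll that loads the IDs.
        .idFetchSize(1000)
        // Skip second-level cache interaction during mass indexing.
        .cacheMode(CacheMode.IGNORE);
indexer.start()
        .thenRun(() -> log.info("Mass Indexing Entity Complete"))
        .exceptionally(throwable -> {
            log.error("Mass Indexing Entity Failed", throwable);
            return null;
        });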
Or should we look at other strategies?
I would leave that as a last resort. If you don't understand what is wrong exactly with the performance of the mass indexer, then you're unlikely to find a better solution.
If you don't trust the MassIndexer to do a good job, you can try doing it yourself: set up a thread that loads IDs, and other threads that load the corresponding entities and index them manually (a rough sketch follows the list below). That's not exactly simple to get right, but it's possible.
If you do just that, I doubt you will improve anything. But, assuming entity loading is the bottleneck, and not indexing (you must check that first!), I imagine that you could get better throughput by leveraging the specifics of your database:
If lazy loading seems to be the problem, you could use entity graphs to make sure all parts of your entity that are indexed will be loaded eagerly. The MassIndexer cannot currently do that, though hopefully it will someday (HSEARCH-521).
If there are some JDBC query hints that improve performance in your case, you could try setting them.
If the database is more than capable of handling the load, and the bottleneck seems to be the processing of entities into documents, then you can try to partition the IDs and run your "custom indexing process" on multiple machines. E.g. reindex IDs 1 to 25,000,000 on one machine, and IDs 25,000,001 to 50,000,000 on another. You couldn't do that with the mass indexer, as it does not allow filtering the IDs (at least not in Hibernate Search 6.0, but it will in 6.1: HSEARCH-499)
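For illustration only, here is a very rough, single-threaded sketch of that "do it yourself" approach, scrolling IDs with keyset pagination and indexing manually through Hibernate Search 6's SearchIndexingPlan. The entity, query strings and batch size are hypothetical, transaction and error handling are omitted, and a real implementation would parallelize the entity-loading part:

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;
import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;
import org.hibernate.search.mapper.orm.work.SearchIndexingPlan;

public class ManualReindexer {
    public static void reindex(EntityManager em) {
        SearchSession searchSession = Search.session(em);
        SearchIndexingPlan indexingPlan = searchSession.indexingPlan();

        int pageSize = 500; // placeholder; tune it
        Long lastId = null;
        List<Long> ids;
        do {
            // Keyset pagination over the primary key avoids huge OFFSETs.
            TypedQuery<Long> idQuery = em.createQuery(
                    "select e.id from Entity e"
                            + (lastId == null ? "" : " where e.id > :lastId")
                            + " order by e.id", Long.class)
                    .setMaxResults(pageSize);
            if (lastId != null) {
                idQuery.setParameter("lastId", lastId);
            }
            ids = idQuery.getResultList();
            if (ids.isEmpty()) {
                break;
            }
            List<Entity> entities = em.createQuery(
                            "select e from Entity e where e.id in :ids", Entity.class)
                    .setParameter("ids", ids)
                    .getResultList();
            for (Entity entity : entities) {
                indexingPlan.addOrUpdate(entity); // queue the document
            }
            indexingPlan.execute(); // push this batch to Elasticsearch now
            em.clear();             // keep the persistence context small
            lastId = ids.get(ids.size() - 1);
        } while (ids.size() == pageSize);
    }
}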
UPDATE: Also, it looks like the ID loader thread is not batched, so for 50 million rows, will it load all 50 million IDs into memory in one query?
No, IDs are loaded in batches. Each batch is then pushed to an internal queue and consumed by a loading thread. The size of the batches is controlled by batchSizeToLoadObjects.
The one exception is MySQL, whose default configuration is to load the whole result set in memory (don't ask me why), but that doesn't affect PostgreSQL. And anyway, that can be fixed (see below).
More information about the parameters here.
And what is the use of idFetchSize? It looks like it is not used in the indexing process.
This is the JDBC fetch size. IDs are retrieved using a scroll (cursor), and the JDBC fetch size is the size of result pages (~ low-level buffers) for this scroll in your JDBC driver.
To be honest, it's mostly useful for MySQL (and perhaps MariaDB?), whose JDBC driver will load all results in memory even if we're using a cursor, unless the fetch size is set to Integer#MIN_VALUE. I know, it's weird.
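So on PostgreSQL you can mostly leave it alone. If you ever run this against MySQL, the workaround would look like this (a sketch reusing the mass indexer from above; only the idFetchSize call matters here):

searchSession.massIndexer(Entity.class)
        // Only needed for MySQL: makes the JDBC driver stream the ID result set
        // instead of loading it entirely into memory.
        .idFetchSize(Integer.MIN_VALUE)
        .start();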

Related

Bulk insert performance in MongoDB for large collections

I'm using the BulkWriteOperation (java driver) to store data in large chunks. At first it seems to be working fine, but when the collection grows in size, the inserts can take quite a lot of time.
Currently for a collection of 20M documents, bulk insert of 1000 documents could take about 10 seconds.
Is there a way to make inserts independent of collection size?
I don't have any updates or upserts, it's always new data I'm inserting.
Judging from the log, there doesn't seem to be any issue with locks.
Each document has a time field which is indexed, but it's linearly growing so I don't see any need for mongo to take the time to reorganize the indexes.
I'd love to hear some ideas for improving the performance
Thanks
You believe that the indexing does not require any document reorganisation, and the way you described the index suggests that a right-handed index is OK. So indexing seems to be ruled out as an issue. You could of course, as suggested above, definitively rule this out by dropping the index and re-running your bulk writes.
Aside from indexing, I would …
Consider whether your disk can keep up with the volume of data you are persisting. More details on this in the Mongo docs
Use profiling to understand what’s happening with your writes
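For instance, here is a rough sketch of turning on the profiler and pulling out the slowest inserts, assuming the modern com.mongodb.client Java driver API (database name, slowms threshold and limit are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class ProfileSlowInserts {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("mydb");
            // Profile operations slower than 100 ms (level 1 = slow ops only).
            db.runCommand(new Document("profile", 1).append("slowms", 100));
            // ... run the bulk inserts, then inspect the profiler collection:
            for (Document op : db.getCollection("system.profile")
                    .find(new Document("op", "insert"))
                    .sort(new Document("millis", -1))
                    .limit(5)) {
                System.out.println(op.toJson());
            }
        }
    }
}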
Do you have any index in your collection?
If yes, it takes time to build the index tree.
Is the data time-series?
If yes, use updates more than inserts. Please read this blog; it suggests in-place updates are more efficient than inserts (https://www.mongodb.com/blog/post/schema-design-for-time-series-data-in-mongodb)
Do you have the capability to set up sharded collections?
If yes, it would reduce the time (tested on 3 sharded servers with 15 million IP geo records)
Disk utilization & CPU: Check the disk utilization and CPU and see if any of these are maxing out.
Most likely it is the disk that is causing this issue for you.
Mongo log:
Also, if a bulk insert of 1000 documents is taking 10 seconds, check the mongo log to see whether a few of the inserts within that bulk are taking most of the time. If there are such queries, you can narrow down your analysis.
Another thing that's not clear is the order of queries that happen on your Mongo instance. Are inserts the only operation, or are there other find queries running too? If so, you should look at scaling up whatever resource is maxing out.

Is it appropriate to use a search engine as a caching layer?

We're talking about a normalized dataset, with several different entities that must often be accessed along with related records. We want to be able to search across all of this data. We also want to use a caching layer to store view-ready denormalized data.
Since search engines like Elasticsearch and Solr are fast, and since it seems appropriate in many cases to put the same data into both a search engine and a caching layer, I've read at least anecdotal accounts of people combining the two roles. This makes sense on a surface level, at least, but I haven't found much written about the pros and cons of this architecture. So: is it appropriate to use a search engine as a cache, or is using one layer for two roles a case of being penny wise but pound foolish?
These guys have done this...
http://www.artirix.com/elasticsearch-as-a-smart-cache/
The problem I see is not in the read speed, but in the write speed. You are incurring a pretty hefty cost for adding things to the cache (forcing spool to disk and index merge).
Things like memcached, or ElastiCache if you are on AWS, are much more efficient at both inserts and reads.
"Elasticsearch and Solr are fast" is relative, caching infrastructure is often measured in single-digit millisecond range, same for inserts. These search engines are at least measured in 10's of milliseconds for reads, and much higher for writes.
I've heard of setups where ES was used for what it is really good for - full-text search - in parallel with a secondary storage. In these setups the data was not stored in ES (though it can be) - "store": "no" - and after searching ES's indices, the actual records were retrieved from the second storage level, usually an RDBMS, given that ES was holding a reference to the actual record in the RDBMS (an ID of some sort). If you're not happy with what the secondary storage gives you in terms of speed and "search" in general, I don't see why you couldn't set up an ES cluster to provide the missing piece.
The disadvantage here is the time spent architecting the ES data structure, because ES is not as good as an RDBMS at representing relationships. It doesn't really need to be: its main job and purpose are different, and it is actually happier with a denormalized set of data to search over.
Another disadvantage is the complexity of keeping the two storage systems in sync, which will require some thinking ahead. But once the initial setup and architecture are in place, it should be easy afterwards.
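To make that pattern concrete, here is a rough sketch of "search in ES, then fetch the authoritative rows from the RDBMS by ID", assuming the Elasticsearch high-level REST client and plain JDBC; the index name, field names and the listings table are all hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchThenFetch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200)))) {
            // 1) Full-text search in ES; we only need the document IDs, not the source.
            SearchRequest request = new SearchRequest("listings")
                    .source(new SearchSourceBuilder()
                            .query(QueryBuilders.matchQuery("description", "sea view"))
                            .fetchSource(false)
                            .size(20));
            List<Long> ids = new ArrayList<>();
            for (SearchHit hit : es.search(request, RequestOptions.DEFAULT).getHits()) {
                ids.add(Long.valueOf(hit.getId())); // assumes the ES _id is the DB primary key
            }
            if (ids.isEmpty()) {
                return;
            }
            // 2) Load the authoritative records from the RDBMS by primary key.
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/app")) {
                String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
                try (PreparedStatement ps = conn.prepareStatement(
                        "select * from listings where id in (" + placeholders + ")")) {
                    for (int i = 0; i < ids.size(); i++) {
                        ps.setLong(i + 1, ids.get(i));
                    }
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                        }
                    }
                }
            }
        }
    }
}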
The only recommended way of using a search engine is to create indices that match your most frequently accessed, denormalised data access patterns. You can call it a cache if you want. For searching it's perfect, as it's fast enough.
What I would recommend adding a cache for is statistics for "aggregated" queries - "Top 100 hotels in Europe" is a good example.
Maybe you can consider in-memory Lucene indexes instead of Solr or Elasticsearch. Here is an example

Strategy for "user data" in couchbase

I know that a big part of Couchbase's performance comes from serving in-memory documents, and for many of my data types that seems like an entirely reasonable aspiration. But considering how user data scales and is used, I'm wondering if it's reasonable to plan for only a small percentage of the user documents to be in memory at any time - I'm thinking maybe only 10-15% at any given time. Is this a reasonable assumption considering:
At any given time, only a fraction of users will be using the system.
In this case, users only access their own data (or predominantly so)
Recently entered data is exponentially more likely to be viewed than historical user documents
UPDATE:
Some additional context:
Let's assume there's a user base of 1 million customers, where 20% rarely if ever access the site, 40% access it once a week, and 40% access it every day.
At any given moment, only 5-10% of the user population would be logged in
When a user logs in, they are likely to re-query certain documents in a single session (although the client does do some object caching to minimise this)
For any user, the most recent records are very active, the very old records very inactive
In summary, I would say the majority of user-triggered transactional documents are queried quite infrequently, but there is a core set -- records produced in the last 24-48 hours and relevant to the currently "logged in" group -- that would benefit significantly from being in memory.
Two sub-questions are:
Is there a way to set a timestamp on a per-document basis to indicate its need to be kept in memory?
How does Couchbase cope with the growing list of document IDs in memory? It is my understanding that all IDs must always be in memory - isn't this too memory-intensive for some apps?
First, one of the major benefits of CB is the fact that it is spread across multiple nodes. This also means your queries are spread across multiple nodes, and you get a performance gain as a result (I know several other similar NoSQL stores are also spread across nodes, so maybe this is not relevant for your comparison).
Next, I believe this question is a little too broad, as the answer will really depend on your usage. Does a given user only query their data one time, at random? If so, then according to you there will only be an in-memory benefit 10-15% of the time. If instead, once a user is on the site, they query their data multiple times, there is a definite performance benefit.
Regardless, Couchbase has pretty fast disk-access performance, particularly on SSDs, so it probably doesn't make much difference either way, but again without specifics there is no way to be sure. If it's a relatively small document size, and if it involves a user waiting for one of them to load, then the user certainly will not notice a difference whether the document is loaded from RAM or disk.
Here is an interesting article on benchmarks for CB against similar nosql platforms.
Edit:
After reading your additional context, I think your scenario lines up pretty much exactly how Couchbase was designed to operate. From an eviction standpoint, CB keeps the newest and most-frequently accessed items in RAM. As RAM fills up with new and/or old items, oldest and least-frequently accessed are "evicted" to disk. This link from the Couchbase Manual explains more about how this works.
I think you are on the right track with Couchbase - in any regard, its flexibility with scaling will easily allow you to tune the database to your application. I really don't think you can go wrong here.
Regarding your two questions:
Not in Couchbase 2.2
You should use relatively small document IDs. It is true that they are stored in RAM, so your deployment is not "right-sized" if you are using a significant percentage of the available cluster RAM just to store keys. This link talks about keys and gives details relevant to key size (e.g. the 250-byte limit on size, metadata, etc.).
Basically, the decision you are making is how to size the Couchbase cluster's bucket RAM, allowing a reduced residency ratio (the % of document values in RAM) and using cache misses to pull from disk.
However, there are caveats in this scenario as well. You will basically also have relatively constant "cache eviction" where "not recently used" values are being removed from RAM cache as you pull cache missed documents from disk into RAM. This is because you will always be floating at the high water mark for the Bucket RAM quota. If you also simultaneously have a high write velocity (new/updated data) they will also need to be persisted. These two processes can compete for Disk I/O if the write velocity exceeds your capacity to evict/retrieve, and your SDK client will receive a Temporary OOM error if you actually cannot evict fast enough to open up RAM for new writes. As you scale horizontally, this becomes less likely as you have more Disk I/O capacity spread across more machines all simultaneously doing this process.
If by "queried" you mean querying indexes (i.e. Views), those are a separate data structure on disk, and getting results back from them is not subject to eviction/NRU; but if you follow the view query with a multi-get, the above still applies. (Don't emit entire documents into your index!)

What are the performance considerations when adding a large number of documents to a large Solr core?

If I have a Solr core with a half-dozen small fields that's loaded with 100 million documents, will adding a batch of 1 million documents run in a reasonable amount of time? How about 10 million? By reasonable, I'm thinking hours, rather than days. I've been told that this will take a long time to run. Is this really an issue? What are known strategies to improve performance? The fields are typically small, that is, 5-50 characters.
Two suggestions for improving performance, on top of those already mentioned in other answers (the first has been tried, the second is still to be tried):
1) Decrease logging while updating: at INFO level Solr appends one log entry per document. See here for how we did it: http://dmitrykan.blogspot.fi/2011/01/solr-speed-up-batch-posting.html Some people reported a "3x speed increase".
2) Set the number of segments in solrconfig.xml to something very large for indexing, like 10000. Once the batch indexing is complete, change the parameter value back to something reasonably low, like 10.
This is a very "tricky" question whose answer differs from schema to schema.
Your solr installation has a half-dozen fields. But, how many are actually indexed? If only one field is indexed, then adding 1 million documents will be faster than adding 1 million docs when 6 fields are indexed.
I think the type of fields that are indexed also matters. A field that is of the type "text_general" is broken down into tokens while indexing whereas a field that is of the type "string" is not. "String" type is not analyzed and is stored as one complete token.
I have got some very long fields which are indexed and adding 2 million docs take a few minutes (although my installation does not contain 100 million documents). So, I do not think that it will take days to add 10 million records to your installation.
I am not sure about this, but maybe the configuration of the machine running the Solr instance also matters, so you might need to check whether your CPU and memory can handle this much load.
It's up to you to decide whether a long-running data post is an issue or not. If your application is user-intensive, I suggest some kind of master-slave configuration so that users are not impacted by the high CPU usage when you post the data. One strategy I know of for improving performance is "sharding".
http://carsabi.com/car-news/2012/03/23/step-by-step-solr-sharding/
Or, if it is possible to demarcate the records by some field, you could put the different sets of documents onto different servers.
100 million records is a fairly large index for Solr. But adding 10 million records on a good machine should be hours, not days. You may find the following email thread interesting, as it includes both in-depth questions and some final advice on tuning a 10M-record indexing process.
Also, you did not say if you 'store' the fields as well as index them. If you do, you may also look forward to Solr 4.1 field compression.
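For a sense of what a client-side bulk load typically looks like, documents are usually buffered and sent in chunks, with a single commit at the end rather than per chunk. A minimal sketch with a recent SolrJ client (URL, chunk size and field names are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/listings").build()) {
            List<SolrInputDocument> buffer = new ArrayList<>();
            for (long id = 0; id < 1_000_000; id++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Long.toString(id));
                doc.addField("name", "document " + id); // placeholder field
                buffer.add(doc);
                if (buffer.size() == 10_000) { // send in chunks, don't commit per chunk
                    solr.add(buffer);
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                solr.add(buffer);
            }
            solr.commit(); // one commit at the end of the batch
        }
    }
}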
An important factor that impacts indexing performance (in terms of time) is the way you have defined your data-config.xml file.
If your fields come from multiple tables in a Database, you can configure it in 2 ways:
Entities within entities
A single entity with a join query
The second method is faster than the first by a large margin, because the number of queries fired against the database is much lower.

Should I keep the size of stored fields in Solr to a minimum?

I am looking to introduce Solr to power the search for a business listing website. The site has around 2 million records.
There is a search results page which will display some key data for each result. I believe the data needed for this summary information is around 1KB per result.
I could simply index the fields needed for the search within Solr - but this means a separate database call for each result to populate the summary information. If Solr could return all of this data I would expect it to yield greater performance than ~40 database round-trips.
The concern is that Solr's memory usage would be too large (how might I calculate this?) and that indexing might take too long with the extra data.
You would benefit greatly from storing those fields in Solr compared to 40 DB round-trips. Just make sure you mark the fields as "not indexed" (indexed="false") in your schema config, and maybe also compressed (compressed="true") (although this will of course use some CPU when indexing and retrieving).
When a field is marked as "not indexed", no analyzers process it at indexing time, which makes it much faster to store than an indexed field.
It's a trade off, and you will have to analyze this yourself.
Solr's performance greatly depends on caching, not only of queries, but also of the documents themselves. Those caches depend on memory, and the bigger your documents are, the less you can fit in a fixed amount of memory.
Document size also affects index size and replication times. For large indices with master slave configurations, this can impact the rate at which you can update the index.
Ideally you should measure cache hit rates at different cache sizes, with and without the fields. If you can spend the memory to get a high enough cache hit rate with the fields, then by all means go for it. If you cannot, you may have to fetch the document content from another system.
There is a third alternative you didn't mention, which is to store the documents outside of the DB, but not in Solr. They should be stored in a format which is as close as possible to what you deliver with search results. The code which creates/updates the indices could create/update these documents as well. This is a lot of work, but like everything it comes down to how much performance you need and what you are willing to do to get it.
EDIT: For measuring cache hit rates and throughput, I've found the best test source is your current query logs. Take a day or two worth of live queries and run them against different indexes and configurations to see how well they work.
