What is a reasonable setting for Hibernate Search MassIndexer? - performance

In my application, I use Hibernate Search to manage a Lucene index of some of my mapped model classes (10 classes, partly associated with each other; I use @IndexedEmbedded quite often in the index definitions). There are approx. 1,500,000 documents to index.
For rebuilding the whole index, I use a mass indexer as proposed in the documentation:
http://docs.jboss.org/hibernate/search/3.3/reference/en-US/html/manual-index-changes.html
fullTextSession
    .createIndexer()
    .batchSizeToLoadObjects(200)
    .cacheMode(CacheMode.IGNORE)
    .purgeAllOnStart(true)
    .threadsToLoadObjects(10)
    .threadsForIndexWriter(10)
    .threadsForSubsequentFetching(5)
    .startAndWait();
My database connection pool has a size of 50.
I observe that the indexing procedure starts promisingly fast, until it reaches about 25% of all documents. After that, performance declines drastically (the next 5% take twice as long as the first 25%), and I am wondering why this happens.
Do I have a wrong ratio of object-loading threads and indexing threads?
Or is it simply due to the growing size of the index, and does that explain the decline in performance?
How to improve the performance? How to achieve a constant progress in time?
Because I make use of projections rather than letting Hibernate Search fetch search results from the DB, many of my indexed fields are stored in the index (Store.YES). Does this affect the performance significantly?
-- Edit:
My Hibernate search configuration:
properties.setProperty("hibernate.search.default.directory_provider", "filesystem");
properties.setProperty("hibernate.search.default.indexBase", searchIndexPath);
properties.setProperty("hibernate.search.indexing_strategy", "manual");
properties.setProperty("hibernate.default_batch_fetch_size", "200");

Have you profiled your application? It is hard to give general recommendations in this case.
Also, what configuration settings do you use? There are several properties which can influence the indexing behavior; see http://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#search-batchindex-massindexer for more details. What about memory consumption during indexing? Have you monitored that as well?
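For example, Hibernate Search 3.x exposes the Lucene IndexWriter settings used during mass indexing as configuration properties; a minimal sketch in the style of the configuration shown above (the values are illustrative starting points, not recommendations):
properties.setProperty("hibernate.search.default.indexwriter.batch.merge_factor", "20");
properties.setProperty("hibernate.search.default.indexwriter.batch.ram_buffer_size", "256");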
Because I make use of projections rather than letting Hibernate Search fetch search results from the DB, many of my indexed fields are stored in the index (Store.YES). Does this affect the performance significantly?
I would expect it mainly to influence the index size, not so much the indexing performance.
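For reference, a minimal sketch of the kind of mapping under discussion, using Hibernate Search 3.x annotations (the class and field names are hypothetical):

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Index;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.Store;

@Indexed
public class Listing {

    // Stored in the index so that projections can return the value
    // without a database round-trip; this grows the index size.
    @Field(index = Index.TOKENIZED, store = Store.YES)
    private String title;
}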

Related

Lucene index segment lifecycle and performance impact

For a project which maps files in a directory structure to Lucene documents (1:1), I'd like to know the impact of using multiple index segments. When a file on disk changes, the indexing process basically removes the corresponding document and adds a new one.
In the project, at the end of indexing, the forceMerge() method of IndexWriter is used to reduce the number of segments to 1. This practice has been present in the code for a very long time, likely since early Lucene versions. As the Lucene documentation notes, this is an expensive operation:
This is a horribly costly operation, especially when you pass a small maxNumSegments; usually you should only call this if the index is static (will no longer be changed).
Based on this, I am considering removing this step altogether; it's just unclear what the performance impact will be.
In one answer, a claim is made that multi-segment performance has improved over time; however, this is a pretty vague statement. Is there some benchmark and/or explanatory article that would shed more light on performance with multiple segments? What if the segment count grows to thousands, or millions? Is this even possible? How much will search/indexing performance degrade?
Also, when experimenting with disabling the forceMerge() step, I noticed that after adding a bunch of documents to the index, the segment count grows the next time the indexer is run, but sometimes decreases after subsequent runs of the indexer (according to the segmentInfos field in the IndexReader object). Is there some automatic segment merge process?
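(For context: yes, Lucene merges segments automatically in the background, governed by the MergePolicy configured on the IndexWriter. A minimal sketch of where this is configured, assuming a recent Lucene API; the path and values are illustrative only:)

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setSegmentsPerTier(10.0);        // segments tolerated per tier before merging
mergePolicy.setMaxMergedSegmentMB(5 * 1024); // cap on the size of a merged segment

IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
        .setMergePolicy(mergePolicy);
try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/index")), config)) {
    // Background merges are scheduled automatically as documents are
    // added and deleted; no explicit forceMerge() call is required.
    writer.commit();
}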

HibernateSearch: Reindex 50 million rows from a single table into Elastic Search

We are currently using the default settings of the Mass Indexer (10 objects to load per query, per thread) with 7 threads to reindex data from 1 table (8-10 fields) into Elasticsearch. The table currently holds 25 million rows and will grow to a few hundred million.
MassIndexer indexer = searchSession.massIndexer(Entity.class)
        .threadsToLoadObjects(7);
indexer.start()
        .thenRun(() -> log.info("Mass Indexing Entity Complete"))
        .exceptionally(throwable -> {
            log.error("Mass Indexing Entity Failed", throwable);
            return null;
        });
The database is a Postgres on RDS, and we are using AWS Elastic Search. Hibernate Search version is 6.
Recently we hit a bottleneck during the reindexing process, as it ran for hours with 20 million rows in the table. One of the reasons was that we had a connection pool with a maximum of 10 connections. With the current mass indexer setup (7 threads), 8 connections were in use (1 for ID lookup + 7 for entity lookup), leaving only 2 for other operations and causing timeouts while waiting for a connection. We will increase the pool size to 20 and test.
What is the best strategy to reindex very large datasets? Can MassIndexer scale to this high volume with some configuration settings? Or should we look at other strategies? What has worked in the past for someone with the same requirements?
UPDATE: Also, it looks like the IDLoader thread is not batched, so for 50 million rows, will it load all 50 million IDs into memory in one query?
And, what is the use of idFetchSize? Looks like it is not used in the indexing process.
What is the best strategy to reindex very large datasets? Can MassIndexer scale to this high volume with some configuration settings?
With that many entities, things are definitely going to take more than just a few minutes.
Whether it can scale... the thing is, the mass indexer is just a middleman between your database and Elasticsearch. Assuming your database scales, and Elasticsearch scales, then the only thing required for the mass indexer to scale is to do more work in parallel. And you can control that.
Now, you probably meant "can it reindex in a satisfying amount of time", and that of course will depend on what your expectations are, as well as how much effort you put into tuning it.
The performance of mass indexing will be affected by the configuration you pass to the mass indexer, of course, but also by the schema and data of your entities, your RDBMS and its configuration, your Elasticsearch cluster and its configuration, the machines they run on, ... Really, no one knows what's possible: the only way to know is to try, assess the results, tune, and iterate.
I'd advise you to first concentrate on addressing lazy loading issues, since those will have a tremendous impact on performance: be sure to set hibernate.default_batch_fetch_size in order to reduce the impact of lazy loading.
Then, I can't do much more than repeat what the reference documentation says:
The MassIndexer was designed to finish the re-indexing task as quickly as possible, but there is no one-size-fits-all solution, so some configuration is required to get the best of it.
Performance optimization can get quite complex, so keep the following in mind while you attempt to configure the MassIndexer:
Always test your changes to assess their actual effect: advice provided in this section is true in general, but each application and environment is different, and some options, when combined, may produce unexpected results.
Take baby steps: before tuning mass indexing with 40 indexed entity types with two million instances each, try a more reasonable scenario with only one entity type, optionally limiting the number of entities to index to assess performance more quickly.
Tune your entity types individually before you try to tune a mass indexing operation that indexes multiple entity types in parallel.
Beyond tuning the mass indexer, remember that it only loads data from the database to push it to Elasticsearch. So sure, the mass indexer might be the bottleneck, but so could be the database or Elasticsearch, if they are under-dimensioned. Make sure that both can provide satisfying throughput as well: decent machines, clustering if necessary, server-side configuration, ...
Anyway, there are many things you can do; before you do, try to find out what the bottleneck is. Is your database always at 100% CPU? Then tune your database: change settings, use a beefier machine, ... Is Elasticsearch I/O clearly reaching its limits? Then tune Elasticsearch: change settings, add more nodes, ... Are both PostgreSQL and Elasticsearch doing just fine? Then maybe you need even more DB connections, more ES connections, or more threads in your mass indexer. Or maybe it's something else; performance is hard.
Or should we look at other strategies?
I would leave that as a last resort. If you don't understand what is wrong exactly with the performance of the mass indexer, then you're unlikely to find a better solution.
If you don't trust the MassIndexer to do a good job, you can try doing it yourself: set up a thread that loads IDs, and other threads that load the corresponding entities and index them manually. That's not exactly simple to get right, but it's possible.
If you do just that, I doubt you will improve anything. But, assuming entity loading is the bottleneck, and not indexing (you must check that first!), I imagine that you could get better throughput by leveraging the specifics of your database:
If lazy loading seems to be the problem, you could use entity graphs (see the sketch after this list) to make sure all the parts of your entity that are indexed will be loaded eagerly. The MassIndexer cannot currently do that, though hopefully it will someday (HSEARCH-521).
If there are some JDBC query hints that improve performance in your case, you could try setting them.
If the database is more than capable of handling the load, and the bottleneck seems to be the processing of entities into documents, you can try to partition the IDs and run your "custom indexing process" on multiple machines, e.g. reindex IDs 1 to 25,000,000 on one machine and IDs 25,000,001 to 50,000,000 on another. You couldn't do that with the mass indexer, as it does not allow filtering the IDs (at least not in Hibernate Search 6.0, but it will in 6.1: HSEARCH-499).
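As a minimal sketch of the entity-graph idea for such a hand-rolled loading thread, using plain JPA and assuming an open EntityManager (the entity name, attribute name, and idBatch variable are hypothetical):

import java.util.List;
import javax.persistence.EntityGraph;
import javax.persistence.EntityManager;

// Load one batch of entities with the indexed association fetched eagerly,
// instead of triggering one lazy-loading query per entity.
EntityGraph<MyEntity> graph = entityManager.createEntityGraph(MyEntity.class);
graph.addAttributeNodes("indexedAssociation"); // hypothetical indexed association

List<MyEntity> batch = entityManager
        .createQuery("select e from MyEntity e where e.id in :ids", MyEntity.class)
        .setParameter("ids", idBatch)
        .setHint("javax.persistence.fetchgraph", graph)
        .getResultList();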
UPDATE: Also, it looks like the IDLoader thread is not batched, so for 50 million rows, will it load all 50 million IDs into memory in one query?
No, IDs are loaded in batches. Each batch is then pushed to an internal queue and consumed by a loading thread. The size of the batches is controlled by batchSizeToLoadObjects.
The one exception is MySQL, whose default configuration is to load the whole result set in memory (don't ask me why), but that doesn't affect PostgreSQL. And anyway, that can be fixed (see below).
More information about these parameters is available in the reference documentation.
And, what is the use of idFetchSize? Looks like it is not used in the indexing process.
This is the JDBC fetch size. IDs are retrieved using a scroll (cursor), and the JDBC fetch size is the size of result pages (~ low-level buffers) for this scroll in your JDBC driver.
To be honest, it's mostly useful for MySQL (and perhaps MariaDB?), whose JDBC driver will load all results in memory even if we're using a cursor, unless the fetch size is set to Integer#MIN_VALUE. I know, it's weird.
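Putting the knobs mentioned in this answer together, a hypothetical Hibernate Search 6 tuning sketch (the values are illustrative starting points only and must be validated against your own bottlenecks):

MassIndexer indexer = searchSession.massIndexer(Entity.class)
        .threadsToLoadObjects(12)    // each entity-loading thread holds a DB connection
        .batchSizeToLoadObjects(100) // entities loaded per query, per thread
        .idFetchSize(1000);          // JDBC fetch size for the ID scroll
// on MySQL, .idFetchSize(Integer.MIN_VALUE) enables result streaming instead
indexer.startAndWait(); // blocks until indexing completes; throws InterruptedException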

What's the difference between fielddata enabled vs eager global ordinals for optimising initial search query latency in Elasticsearch

I have an elasticsearch (7.10) cluster running that is primarily meant for powering search on text documents. The index that I'm working with does not need to be updated often, and there is no great necessity for speed during index time. Performance in this system is really needed for search time. The number of documents will likely always be in the range of 50-70 million and the store size is ~300GB once it's all built.
The mapping for the index and field I'm concerned with looks something like this:
"mappings": {
"properties": {
"document_text": {
"type": "text"
}
}
}
The document_text is a string of text anywhere in the region of 50-500 words. The typical queries being sent to this index are match queries chained together inside a boolean should query. Usually, the number of clauses is in the range of 5-15.
The issue I've been running into is that the initial latency for search queries to the index is very high, usually in the range of 4-6 s; after the first search the document is cached, so the latency drops to under 1 s. The cluster has 3 data nodes, 3 master nodes and 2 ingest/client nodes, and is backed by fast SSDs. I noticed that the heap on the data nodes is never really under much pressure, nor is the RAM, which led me to realize that the documents weren't being cached in advance the way I wanted them to be. From what I've researched, one option is enabling fielddata=true to build the field data object in memory at index time rather than constructing it at search time. I understand this will increase pressure on the JVM heap, so I may do some frequency filtering to only place certain documents in memory. The other option I've come across is setting eager_global_ordinals=true, which in some ways seems similar to enabling fielddata, as it also builds mappings in memory at index time. I'm a bit new to ES, and the terminology between the two is somewhat confusing to me. What I'd love to know is: what is the difference between the two, and does enabling one or both of them seem reasonable to solve the latency issues I'm having, or have I completely misunderstood the docs? Thanks!
Enabling eager_global_ordinals won't have any effect on your queries. Global ordinals only help with aggregations: the doc values would be loaded at index refresh time instead of at query time.
Enabling fielddata would also not have any real effect on your queries. Its primary purpose is sorting and aggregations, which you don't really want to do on a text field.
There's probably not much you can do about the first ES queries being slower. Better to focus on optimal index mappings, settings, shards, and document sizes.

Bulk insert performance in MongoDB for large collections

I'm using the BulkWriteOperation (java driver) to store data in large chunks. At first it seems to be working fine, but when the collection grows in size, the inserts can take quite a lot of time.
Currently, for a collection of 20M documents, a bulk insert of 1,000 documents can take about 10 seconds.
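For reference, a minimal sketch of this kind of chunked insert with the legacy 2.x Java driver's BulkWriteOperation (database, collection, and field names are hypothetical; an unordered bulk is usually faster for pure inserts because the server does not have to apply them in order):

import com.mongodb.BasicDBObject;
import com.mongodb.BulkWriteOperation;
import com.mongodb.BulkWriteResult;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

MongoClient client = new MongoClient("localhost");
DBCollection events = client.getDB("mydb").getCollection("events"); // hypothetical names

// Unordered bulk: the server may apply inserts in parallel and does not
// stop at the first error.
BulkWriteOperation bulk = events.initializeUnorderedBulkOperation();
for (int i = 0; i < 1000; i++) {
    bulk.insert(new BasicDBObject("time", System.currentTimeMillis())
            .append("value", i));
}
BulkWriteResult result = bulk.execute(); // one round-trip per 1,000 documents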
Is there a way to make inserts independent of collection size?
I don't have any updates or upserts, it's always new data I'm inserting.
Judging from the log, there doesn't seem to be any issue with locks.
Each document has a time field which is indexed, but it grows linearly, so I don't see any need for Mongo to take the time to reorganize the indexes.
I'd love to hear some ideas for improving the performance.
Thanks
You believe that the indexing does not require any document reorganisation, and the way you described the index suggests that a right-handed (append-only) index is fine. So indexing seems to be ruled out as an issue. You could of course, as suggested above, definitively rule this out by dropping the index and re-running your bulk writes.
Aside from indexing, I would …
Consider whether your disk can keep up with the volume of data you are persisting. More details on this in the Mongo docs
Use profiling to understand what’s happening with your writes
Do you have any index on your collection?
If yes, the inserts have to take time to update the index tree.
Is the data time-series?
If yes, use updates rather than inserts; see the sketch after this list. This blog post suggests that in-place updates are more efficient than inserts for time-series data: https://www.mongodb.com/blog/post/schema-design-for-time-series-data-in-mongodb
Do you have the capability to set up sharded collections?
If yes, sharding would reduce insert time (tested with 3 sharded servers and 15 million IP geo entry records).
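A rough sketch of that in-place update pattern with the legacy 2.x Java driver, assuming one preallocated document per hour with one slot per minute (all names are hypothetical):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

// Upsert a reading into a per-hour bucket document instead of inserting
// a new document per reading: one document per hour, one slot per minute.
void recordReading(DBCollection metrics, String hourId, int minute, double reading) {
    BasicDBObject query = new BasicDBObject("_id", hourId);
    BasicDBObject update = new BasicDBObject("$set",
            new BasicDBObject("values." + minute, reading));
    metrics.update(query, update, true /* upsert */, false /* multi */);
}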
Disk utilization & CPU: Check the disk utilization and CPU and see if any of these are maxing out.
Most likely it is the disk that is causing this issue for you.
Mongo log:
Also, if a bulk insert of 1,000 documents is taking 10 seconds, check the mongo log to see whether a few individual inserts within the bulk are taking most of the time. If there are any such queries, you can narrow down your analysis.
Another thing that's not clear is the mix of operations hitting your Mongo instance. Are inserts the only operation, or are there other find queries running too? If so, you should look at scaling up whichever resource is maxing out.

Should I keep the size of stored fields in Solr to a minimum?

I am looking to introduce Solr to power the search for a business listing website. The site has around 2 million records.
There is a search results page which will display some key data for each result. I believe the data needed for this summary information is around 1KB per result.
I could simply index the fields needed for the search within Solr - but this means a separate database call for each result to populate the summary information. If Solr could return all of this data I would expect it to yield greater performance than ~40 database round-trips.
The concern is that Solr's memory usage would be too large (how might I calculate this?) and that indexing might take too long with the extra data.
You would benefit greatly from storing those fields in Solr compared to the ~40 DB round-trips. Just make sure that you mark the field as not indexed (indexed="false") in your schema config, and maybe also as compressed (compressed="true"), though this will of course use some CPU when indexing and retrieving.
When a field is marked as not indexed, no analyzers process it during indexing, which makes it much faster to store than an indexed field.
It's a trade off, and you will have to analyze this yourself.
Solr's performance greatly depends on caching, not only of queries, but also of the documents themselves. Those caches depend on memory, and the bigger your documents are, the less you can fit in a fixed amount of memory.
Document size also affects index size and replication times. For large indices with master-slave configurations, this can impact the rate at which you can update the index.
Ideally you should measure cache hit rates at different cache sizes, with and without the fields. If you can spend the memory to get a high enough cache hit rate with the fields, then by all means go for it. If you cannot, you may have to fetch the document content from another system.
There is a third alternative you didn't mention, which is to store the documents outside of the DB, but not in Solr. They should be stored in a format which is as close as possible to what you deliver with search results. The code which creates/updates the indices could create/update these documents as well. This is a lot of work, but like everything it comes down to how much performance you need and what you are willing to do to get it.
EDIT: For measuring cache hit rates and throughput, I've found the best test source is your current query logs. Take a day or two worth of live queries and run them against different indexes and configurations to see how well they work.
