We are currently using the default settings (10 objects to load per query, per thread) of the Mass Indexer, with 7 threads, to reindex data from 1 table (8-10 fields) into Elasticsearch. The table currently holds 25 million rows and will grow to a few hundred million.
MassIndexer indexer = searchSession.massIndexer(Entity.class)
        .threadsToLoadObjects(7);
indexer.start()
        .thenRun(() -> log.info("Mass Indexing Entity Complete"))
        .exceptionally(throwable -> {
            log.error("Mass Indexing Entity Failed", throwable);
            return null;
        });
The database is Postgres on RDS, and we are using AWS Elasticsearch. The Hibernate Search version is 6.
Recently we hit a bottleneck during the reindexing process: it ran for hours with 20 million rows in the table. One of the reasons was that we had a connection pool of 10 max connections. With the current mass indexer setup (7 threads), it used 8 connections (1 for the ID lookup + 7 for the entity lookup), leaving only 2 for other operations and causing timeouts while waiting for a connection. We will increase the pool size to 20 and test.
What is the best strategy to reindex very large datasets? Can MassIndexer scale to this high volume with some configuration settings? Or should we look at other strategies? What has worked in the past for someone with the same requirements?
UPDATE: Also, it looks like the ID loader thread is not batched, so for 50 million rows, will it load all 50 million IDs into memory in one query?
And, what is the use of idFetchSize? It looks like it is not used in the indexing process.
What is the best strategy to reindex very large datasets? Can MassIndexer scale to this high volume with some configuration settings?
With that many entities, things are definitely going to take more than just a few minutes.
Whether it can scale... the thing is, the mass indexer is just a middleman between your database and Elasticsearch. Assuming your database scales, and Elasticsearch scales, then the only thing required for the mass indexer to scale is to do more work in parallel. And you can control that.
Now, you probably meant "can it reindex in a satisfying amount of time", and that of course will depend on what your expectations are, as well as how much effort you put into tuning it.
The performance of mass indexing will be affected by the configuration you pass to the mass indexer, of course, but also by the schema and data of your entities, your RDBMS and its configuration, your Elasticsearch cluster and its configuration, the machines they run on, ... Really, no one knows what's possible: the only way to know is to try, assess the results, tune, and iterate.
I'd advise concentrating first on addressing lazy loading issues, since those will have a tremendous impact on performance: be sure to set hibernate.default_batch_fetch_size so that lazy associations are loaded in batches rather than one at a time.
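For example (a minimal sketch, assuming a programmatic JPA bootstrap; in most applications this property would go in persistence.xml or application.properties instead, and both the persistence unit name and the value of 100 are placeholders to tune):

    import java.util.Map;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;

    // "my-persistence-unit" is a placeholder for your persistence unit name.
    EntityManagerFactory emf = Persistence.createEntityManagerFactory(
            "my-persistence-unit",
            Map.of("hibernate.default_batch_fetch_size", "100"));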
Then, I can't do much more than repeating what the reference documentation says:
The MassIndexer was designed to finish the re-indexing task as quickly as possible, but there is no one-size-fits-all solution, so some configuration is required to get the best of it.
Performance optimization can get quite complex, so keep the following in mind while you attempt to configure the MassIndexer:
Always test your changes to assess their actual effect: advice provided in this section is true in general, but each application and environment is different, and some options, when combined, may produce unexpected results.
Take baby steps: before tuning mass indexing with 40 indexed entity types with two million instances each, try a more reasonable scenario with only one entity type, optionally limiting the number of entities to index to assess performance more quickly.
Tune your entity types individually before you try to tune a mass indexing operation that indexes multiple entity types in parallel.
Beyond tuning the mass indexer, remember that it only loads data from the database to push it to Elasticsearch. So sure, the mass indexer might be the bottleneck, but so could be the database or Elasticsearch, if they are under-dimensioned. Make sure that both can provide satisfying throughput as well: decent machines, clustering if necessary, server-side configuration, ...
Anyway, there are many things you can do: before you do, try to find out what the bottleneck is. Is your database always at 100% CPU? Then tune your database: change settings, use a beefier machine, ... Is Elasticsearch I/O clearly reaching its limits? Then tune Elasticsearch: change settings, add more nodes, ... Are both PostgreSQL and Elasticsearch doing just fine? Then maybe you should have even more DB connections, or more ES connections, or more threads in your mass indexer. Or maybe it's something else; performance is hard.
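To give an idea of the knobs involved, here is a sketch of a more aggressively tuned mass indexer; the methods are the ones discussed in this answer, but every value below is illustrative and has to be validated against your own bottlenecks (in particular, each loading thread needs its own DB connection):

    searchSession.massIndexer(Entity.class)
            .threadsToLoadObjects(12)   // entity-loading threads; needs 12 + 1 DB connections
            .batchSizeToLoadObjects(25) // entities loaded per query, per thread
            .idFetchSize(150)           // JDBC fetch size for the ID scroll
            .start()
            .thenRun(() -> log.info("Mass Indexing Entity Complete"))
            .exceptionally(throwable -> {
                log.error("Mass Indexing Entity Failed", throwable);
                return null;
            });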
Or should we look at other strategies?
I would leave that as a last resort. If you don't understand what is wrong exactly with the performance of the mass indexer, then you're unlikely to find a better solution.
If you don't trust the MassIndexer to do a good job, you can try doing it yourself: set up one thread that loads IDs, and other threads that load the corresponding entities, then index them manually (there is a sketch after the list below). That's not exactly simple to get right, but it's possible.
If you do just that, I doubt you will improve anything. But, assuming entity loading is the bottleneck, and not indexing (you must check that first!), I imagine that you could get better throughput by leveraging the specifics of your database:
If lazy loading seems to be the problem, you could use entity graphs to make sure all parts of your entity that are indexed will be loaded eagerly. The MassIndexer cannot currently do that, though hopefully it will someday (HSEARCH-521).
If there are some JDBC query hints that improve performance in your case, you could try setting them.
If the database is more than capable of handling the load, and the bottleneck seems to be the processing of entities into documents, then you can try to partition the IDs and run your "custom indexing process" on multiple machines. E.g. reindex IDs 1 to 25,000,000 on one machine, and IDs 25,000,001 to 50,000,000 on another. You couldn't do that with the mass indexer, as it does not allow filtering the IDs (at least not in Hibernate Search 6.0, but it will in 6.1: HSEARCH-499)
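As for the sketch mentioned above: here is a rough, deliberately single-threaded outline of the "do it yourself" approach with Hibernate Search 6 and JPA. The batch size, the queries and the resource-local transaction are all illustrative; a real implementation would page through the IDs instead of loading them all, and spread the entity-loading loop across several threads:

    import java.util.List;
    import javax.persistence.EntityManager;
    import org.hibernate.search.mapper.orm.Search;
    import org.hibernate.search.mapper.orm.work.SearchIndexingPlan;

    // Assumes an open EntityManager called entityManager.
    int batchSize = 25; // illustrative; mirrors batchSizeToLoadObjects

    List<Long> ids = entityManager
            .createQuery("select e.id from Entity e order by e.id", Long.class)
            .getResultList(); // for very large tables, scroll or page this query instead

    for (int i = 0; i < ids.size(); i += batchSize) {
        List<Long> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
        entityManager.getTransaction().begin();
        List<Entity> entities = entityManager
                .createQuery("select e from Entity e where e.id in :ids", Entity.class)
                .setParameter("ids", batch)
                .getResultList();
        SearchIndexingPlan plan = Search.session(entityManager).indexingPlan();
        for (Entity entity : entities) {
            plan.addOrUpdate(entity);
        }
        entityManager.getTransaction().commit(); // indexing work is executed on commit
        entityManager.clear(); // keep the persistence context from growing
    }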
UPDATE: Also, it looks like the ID loader thread is not batched, so for 50 million rows, will it load all 50 million IDs into memory in one query?
No, IDs are loaded in batches. Each batch is then pushed to an internal queue and consumed by a loading thread. The size of the batches is controlled by batchSizeToLoadObjects.
The one exception is MySQL, whose default configuration is to load the whole result set in memory (don't ask me why), but that doesn't affect PostgreSQL. And anyway, that can be fixed (see below).
More information about the parameters here.
And, what is the use of idFetchSize? It looks like it is not used in the indexing process.
This is the JDBC fetch size. IDs are retrieved using a scroll (cursor), and the JDBC fetch size is the size of result pages (~ low-level buffers) for this scroll in your JDBC driver.
To be honest, it's mostly useful for MySQL (and perhaps MariaDB?), whose JDBC driver will load all results into memory even when using a cursor, unless the fetch size is set to Integer#MIN_VALUE. I know, it's weird.
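On MySQL that would look like the following sketch (it is not needed on PostgreSQL, where, as mentioned above, the default behavior is already fine):

    searchSession.massIndexer(Entity.class)
            .idFetchSize(Integer.MIN_VALUE) // MySQL Connector/J: stream the ID result set
            .threadsToLoadObjects(7)
            .start();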
I'm performing some select queries with SQLite. The columns are already indexed.
Queries are in the format:
SELECT stuff
FROM table
WHERE haystack LIKE "%needle%"
I'm also loading the db into memory by running RESTORE FROM my_db.db, and this seems to be working (used memory goes up).
The problem is - queries like the above still take a long time (close to 100ms), and I need to run a lot of them (thousands at a time), which means the results take up to several minutes to arrive.
However, while the queries are running, I see very low CPU usage - around 25-35% on average across all cores.
I think SQLite might be throttling its own performance somehow and not using all of the CPU.
Is there a way to get it to not throttle, and to utilize the full system resources so queries are faster?
Thanks.
I want to disable SPARQL query caching on the Fuseki server. Can I disable it, and if so, how? I'm considering the following ways:
Using a command line argument - there doesn't seem to be one
Using a settings file (*.ttl) - I couldn't find a directive to disable caching
Editing the server code - basically I won't do it :(
Please tell me how I can disable caching.
What caching are you talking about?
As discussed in JENA-388 the current default behaviour is actually to add headers that disable caching so there is not any HTTP level caching.
If you are using the TDB backend then there are caches used to improve query performance, and those are not configurable AFAIK. Also, even if you could, turning them off would likely drastically worsen performance, so it would not be a good idea.
Edit
The --mem option uses a pure in-memory dataset so there is no caching. Be aware that this will actually be much slower than using TDB as you scale up your data and is only faster at small dataset sizes.
If you are looking to benchmark then there are much better ways to eliminate the effect of caches than turning them off since disabling caches (even when you can) won't give you realistic performance numbers. There are several real world ways to eliminate cache effects:
Run warmups - either some fixed number or until you see the system reach a steady state.
Eliminate outliers in your statistics: discard the best and worst N results and compute your statistics over the remainder.
Use query parameterisation: take a query template and substitute different constants into it each run, thus ensuring you aren't issuing an identical query each time. Query plan caching may still come into effect, but as Jena doesn't do this anyway, it won't matter for your tests.
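For example, with Jena's ParameterizedSparqlString (a minimal sketch; the query, the variable and the IRIs are illustrative, and the package names are those of recent Jena releases):

    import org.apache.jena.query.ParameterizedSparqlString;
    import org.apache.jena.query.Query;

    ParameterizedSparqlString pss = new ParameterizedSparqlString(
            "SELECT ?p ?o WHERE { ?s ?p ?o }");

    for (int i = 0; i < 100; i++) {
        // Substitute a different constant on each iteration,
        // so that no two issued queries are identical.
        pss.setIri("s", "http://example.org/resource/" + i);
        Query query = pss.asQuery();
        // ... execute the query against the endpoint under test and record timings ...
    }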
You may want to take a look at my 2012 SemTech talk Practical SPARQL Benchmarking and the associated SPARQL Query Benchmarker tool. We've been working on a heavily revised version of the tool lately which has a lot of new features such as support for query parameterisation.
I ran a search the first time and it took 3-4 seconds.
I ran the same search second time and it took less than 100 ms (as expected as it used the cache)
Then I cleared the cache by calling "http://host:port/index/_cache/clear"
Next I ran the same search and was expecting it to take 3-4 seconds but it took less than 100 ms
So the clearing of the cache didn't work?
What exactly got cleared by that url?
How do I make ES do the raw search (i.e. no caching) every time?
I am doing this as part of some load testing.
Clearing the cache will empty:
Field data (used by facets, sorting, geo, etc)
Filter cache
Parent/child cache
Bloom filters for posting lists
The effect you are seeing is probably due to the OS file system cache. Elasticsearch and Lucene leverage the OS file system cache heavily due to the immutable nature of lucene segments. This means that small indices tend to be cached entirely in memory by your OS and become diskless.
As an aside, it doesn't really make sense to benchmark Elasticsearch in a "cacheless" state. It is designed and built to operate in a cached environment - much of the performance that Elasticsearch is known for is due to its excellent use of caching.
To be completely accurate, your benchmark should really look at a system that has fully warmed the JVM (to properly size the new-eden space, optimize JIT output, etc.) and use real, production-like data to simulate "real world" cache filling and eviction at both the ES and OS levels.
Synthetic tests such as "no-cache environment" make little sense.
I don't know if this is what you're experiencing, but the cache isn't cleared immediately when you call clear cache. It is scheduled to be deleted in the next 60 seconds.
source: https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-clearcache.html
I have a CouchDB application, and for most of the views I notice that the time taken by the server to return a response varies from 10 ms to 100 ms. I do not have any concurrent write operations on the server, and there are at most 10 concurrent read requests.
How should I diagnose the problem? Where should I look?
I am running it on a rackspace cloud machine with 1GB RAM.
From the CouchDB Guide:
If you read carefully over the last few paragraphs, one part stands out: “When you query your view, CouchDB takes the source code and runs it for you on every document in the database.” If you have a lot of documents, that takes quite a bit of time and you might wonder if it is not horribly inefficient to do this. Yes, it would be, but CouchDB is designed to avoid any extra costs: it only runs through all documents once, when you first query your view. If a document is changed, the map function is only run once, to recompute the keys and values for that single document.
Most likely you are seeing the views being regenerated and recached.