Solr indexing of a large data set - performance

I have a content set that is about 50 TB in size, containing roughly 250 million documents. The daily increment is not very large: maybe about 10,000 documents of varying sizes, totaling under 50 MB.
The current indexing effort is taking way too long and is guesstimated to complete in 100+ days!!!
So ... is this really that large of a data set? To me, 50 TB of content (in this day and age) is not very large. Do you have content of this size? If you do, how did you improve time taken for one-time indexing? Also, how did you improve time taken by real-time indexing?
If you can answer ... great. If you can point me in the right direction ... I appreciate that as well.
Thanks in advance.
rd

There are a number of factors to consider.
You can start with the client you use to index. Which client are you using: SolrJ, a framework that listens to a database (like Oracle or HBase), or the REST API?
This can make a difference. Solr is good at handling the load, but the client framework and the data preparation on the client side also need to be optimized. For example, if you use the HBase Indexer (which reads from HBase tables and writes to Solr), you can expect a few million documents to be indexed per hour or so; at that rate, 250 million should not take long to complete.
After the client, you enter the Solr environment. How many fields are you indexing per document? Do you have stored fields or any other overhead in your field types?
Config parameters like autoCommit (based on the number of records or RAM size), softCommit (as mentioned in the comment above), parallel threads to index data, and hardware are some of the points to consider.
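To make the batching, parallelism, and commit points concrete, here is a minimal SolrJ sketch, not a drop-in solution: the URL, collection name, field names, queue size, thread count, batch size, and commitWithin value are all assumptions you would tune for your own setup.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrBatchIndexSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical collection URL; ConcurrentUpdateSolrClient buffers docs and sends them from parallel threads.
        SolrClient client = new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycollection")
                .withQueueSize(10_000)   // documents buffered before being flushed to Solr
                .withThreadCount(8)      // parallel indexing threads
                .build();

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {           // stand-in for your real document source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_s", "document " + i);
            batch.add(doc);
            if (batch.size() == 1000) {                 // send in batches, never one document at a time
                client.add(batch, 60_000);              // commitWithin 60s; let autoCommit handle the rest
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch, 60_000);
        }
        client.commit();                                // one final hard commit at the end of the run
        client.close();
    }
}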
You can find a comprehensive checklist here and verify each point. Happy designing!

Related

HibernateSearch : Reindex 50 million rows from a single table into Elastic Search

We are currently using the default settings of the Mass Indexer (10 objects to load per query, per thread) with 7 threads to reindex data from one table (8-10 fields) into Elasticsearch. The table currently holds 25 million rows and will grow to a few hundred million.
MassIndexer indexer = searchSession.massIndexer(Entity.class)
        .threadsToLoadObjects(7);
indexer.start()
        .thenRun(() -> log.info("Mass Indexing Entity Complete"))
        .exceptionally(throwable -> {
            log.error("Mass Indexing Entity Failed", throwable);
            return null;
        });
The database is PostgreSQL on RDS, and we are using AWS Elasticsearch. The Hibernate Search version is 6.
Recently we hit a bottleneck during the reindexing process: it ran for hours with 20 million rows in the table. One of the reasons was that we had a connection pool with a maximum of 10 connections. With the current mass indexer setup (1 connection for ID lookup + 7 for entity loading), only 2 connections were left for other operations, causing timeouts while waiting for a connection. We will increase the pool size to 20 and test.
What is the best strategy to reindex very large datasets? Can the MassIndexer scale to this volume with some configuration settings, or should we look at other strategies? What has worked in the past for someone with the same requirements?
UPDATE: Also it looks like the IDLoader thread is not batched, so for 50 million rows, it will load all 50 million IDs in memory in 1 query?
And, what is the use of idFetchSize? Looks like it is not used in the indexing process.
What is the best strategy to reindex very large datasets? Can MassIndexer scale to this high volume with some configuration settings?
With that many entities, things are definitely going to take more than just a few minutes.
Whether it can scale... the thing is, the mass indexer is just a middleman between your database and Elasticsearch. Assuming your database scales, and Elasticsearch scales, then the only thing required for the mass indexer to scale is to do more work in parallel. And you can control that.
Now, you probably meant "can it reindex in a satisfying amount of time", and that of course will depend on what your expectations are, as well as how much effort you put into tuning it.
The performance of mass indexing will be affected by the configuration you pass to the mass indexer, of course, but also by the schema and data of your entities, your RDBMS and its configuration, your Elasticsearch cluster and its configuration, the machines they run on, ... Really, no one knows what's possible: the only way to know is to try, assess the results, tune, and iterate.
I'd advise concentrating first on addressing lazy-loading issues, since those will have a tremendous impact on performance: be sure to set hibernate.default_batch_fetch_size to reduce the impact of lazy loading.
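For instance, a minimal sketch of setting that property programmatically when bootstrapping JPA; the persistence-unit name "my-pu" and the batch size of 100 are placeholders, and the same property can just as well go in persistence.xml or hibernate.properties:

import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class BatchFetchConfig {
    public static EntityManagerFactory createEntityManagerFactory() {
        Map<String, Object> props = new HashMap<>();
        // 100 is an assumed value; tune it to the size of the associations you index.
        props.put("hibernate.default_batch_fetch_size", "100");
        return Persistence.createEntityManagerFactory("my-pu", props);
    }
}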
Then, I can't do much more than repeating what the reference documentation says:
The MassIndexer was designed to finish the re-indexing task as quickly as possible, but there is no one-size-fits-all solution, so some configuration is required to get the best of it.
Performance optimization can get quite complex, so keep the following in mind while you attempt to configure the MassIndexer:
Always test your changes to assess their actual effect: advice provided in this section is true in general, but each application and environment is different, and some options, when combined, may produce unexpected results.
Take baby steps: before tuning mass indexing with 40 indexed entity types with two million instances each, try a more reasonable scenario with only one entity type, optionally limiting the number of entities to index to assess performance more quickly.
Tune your entity types individually before you try to tune a mass indexing operation that indexes multiple entity types in parallel.
Beyond tuning the mass indexer, remember that it only loads data from the database to push it to Elasticsearch. So sure, the mass indexer might be the bottleneck, but so could be the database or Elasticsearch, if they are under-dimensioned. Make sure that both can provide satisfying throughput as well: decent machines, clustering if necessary, server-side configuration, ...
Anyway, there are many things you can do; before you do, try to find out what the bottleneck is. Is your database always at 100% CPU? Then tune your database: change settings, use a beefier machine, ... Is Elasticsearch I/O clearly reaching its limits? Then tune Elasticsearch: change settings, add more nodes, ... Are both PostgreSQL and Elasticsearch doing just fine? Then maybe you need even more DB connections, more ES connections, or more threads in your mass indexer. Or maybe it's something else; performance is hard.
Or should we look at other strategies?
I would leave that as a last resort. If you don't understand what is wrong exactly with the performance of the mass indexer, then you're unlikely to find a better solution.
If you don't trust the MassIndexer to do a good job, you can try doing it yourself: set up a thread that loads IDs, and other threads that load the corresponding entities, then index them manually. That's not exactly simple to get right, but it's possible.
If you do just that, I doubt you will improve anything. But, assuming entity loading is the bottleneck, and not indexing (you must check that first!), I imagine that you could get better throughput by leveraging the specifics of your database:
If lazy loading seems to be the problem, you could use entity graphs to make sure all parts of your entity that are indexed will be loaded eagerly. The MassIndexer cannot currently do that, though hopefully it will someday (HSEARCH-521).
If there are some JDBC query hints that improve performance in your case, you could try setting them.
If your database is more than capable of handling the load, and the bottleneck seems to be the processing of entities into documents, then you can try to partition the IDs and run your "custom indexing process" on multiple machines, e.g. reindex IDs 1 to 25,000,000 on one machine and IDs 25,000,001 to 50,000,000 on another. You couldn't do that with the mass indexer, as it does not allow filtering the IDs (at least not in Hibernate Search 6.0, but it will in 6.1: HSEARCH-499).
UPDATE: Also it looks like the IDLoader thread is not batched, so for 50 million rows, it will load all 50 million IDs in memory in 1 query?
No, IDs are loaded in batches. Each batch is pushed to an internal queue and consumed by a loading thread. The size of the batches is controlled by batchSizeToLoadObjects.
The one exception is MySQL, whose default configuration is to load the whole result set in memory (don't ask me why), but that doesn't affect PostgreSQL. And anyway, that can be fixed (see below).
More information about the parameters here.
And, what is the use of idFetchSize? Looks like it is not used in the indexing process.
This is the JDBC fetch size. IDs are retrieved using a scroll (cursor), and the JDBC fetch size is the size of result pages (~ low-level buffers) for this scroll in your JDBC driver.
To be honest, it's mostly useful for MySQL (and perhaps MariaDB?), whose JDBC driver will load all results in memory even if we're using a cursor, unless the fetch size is set to Integer#MIN_VALUE. I know, it's weird.
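For reference, a hedged sketch of where those knobs sit on the mass indexer, reusing the searchSession, Entity, and log from the snippet in the question; the values are placeholders rather than recommendations:

searchSession.massIndexer(Entity.class)
        .threadsToLoadObjects(7)         // parallel entity-loading threads (each holds a DB connection)
        .batchSizeToLoadObjects(100)     // entities loaded per batch by each thread
        .idFetchSize(1000)               // JDBC fetch size for the ID scroll; use Integer.MIN_VALUE on MySQL
        .start()
        .thenRun(() -> log.info("Mass indexing complete"))
        .exceptionally(t -> {
            log.error("Mass indexing failed", t);
            return null;
        });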

Duplicate Key Filtering

I am looking for a distributed solution to screen/filter a large volume of keys in real-time. My application generates over 100 billion records per day, and I need a way to filter duplicates out of the stream. I am looking for a system to store a rolling 10 days’ worth of keys, at approximately 100 bytes per key. I was wondering how this type of large scale problem has been solved before using Hadoop. Would HBase be the correct solution to use? Has anyone ever tried a partially in-memory solution like Zookeeper?
I can see a number of solutions to your problem, but the real-time requirement really narrows it down. By real-time, do you mean you want to see whether a key is a duplicate as it's being created?
Let's talk about queries per second. You say 100B/day (that's a lot, congratulations!). That's 1.15 Million queries per second (100,000,000,000 / 24 / 60 / 60). I'm not sure if HBase can handle that. You may want to think about something like Redis (sharded perhaps) or Membase/memcached or something of that sort.
If you were to do it in HBase, I'd simply push upwards of a trillion keys (10 days x 100B keys/day) as the row keys in the table, and put some value in there to store (because you have to). Then you can just do a get to figure out whether the key is already there. This is kind of hokey and doesn't fully utilize HBase, since it only exercises the keyspace; effectively, HBase becomes a b-tree lookup service in this case. I don't think this is a good idea.
If you relax the constraint and don't need real time, you could use MapReduce in batch to dedup. That's pretty easy: it's just Word Count without the counting. You group by the key, and in the reducer you see the duplicates whenever multiple values come back for one key. With enough nodes and enough latency, you can solve this problem efficiently. Here is some example code for this from the MapReduce Design Patterns book: https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch3/DistinctUserDriver.java
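To give a rough shape to that batch approach (this is a hedged sketch, not the linked example verbatim; it assumes one key per input line), a minimal Hadoop MapReduce dedup job could look like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupJob {
    // Emit each key once with a null value; the shuffle groups identical keys together.
    public static class DedupMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object offset, Text line, Context ctx) throws IOException, InterruptedException {
            ctx.write(line, NullWritable.get());   // assumes the whole line is the key
        }
    }

    // The reducer sees each distinct key exactly once, so writing it out deduplicates the stream.
    public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context ctx) throws IOException, InterruptedException {
            ctx.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dedup");
        job.setJarByClass(DedupJob.class);
        job.setMapperClass(DedupMapper.class);
        job.setCombinerClass(DedupReducer.class);   // combiner cuts shuffle volume for heavily duplicated keys
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}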
ZooKeeper is for distributed process communication and synchronization. You don't want to be storing trillions of records in zookeeper.
So, in my opinion, you're better served by an in-memory key/value store such as Redis, but you'll be hard pressed to store that much data in memory.
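If you do experiment with Redis despite the memory concern, a hedged sketch of the duplicate check using Jedis and the SET NX EX pattern might look like this; the host, key prefix, and 10-day TTL are assumptions, and in practice you would shard or cluster:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class DuplicateFilter {
    private static final int TEN_DAYS_SECONDS = 10 * 24 * 60 * 60;
    private final Jedis jedis = new Jedis("localhost", 6379);   // single node here; shard/cluster for real volumes

    /** Returns true the first time a key is seen, false if it is a duplicate within the 10-day window. */
    public boolean firstTimeSeen(String key) {
        // SET key 1 NX EX <ttl>: only succeeds if the key does not already exist, and expires it after 10 days.
        String reply = jedis.set("seen:" + key, "1", SetParams.setParams().nx().ex(TEN_DAYS_SECONDS));
        return "OK".equals(reply);
    }
}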
I am afraid that it is impossible with traditional systems :|
Here is what you have mentioned:
100 billion per day means approximately 1 million per second.
The size of each key is 100 bytes.
You want to check for duplicates against a 10-day working set, which means about 1 trillion items.
These assumptions mean looking up keys in a set of 1 trillion objects totaling about 100 TB (1 trillion keys x 100 bytes, roughly 90 TiB)!
Any solution to this real-time problem must provide a system that can look up 1 million items per second in this volume of data.
I have some experience with HBase, Cassandra, Redis, and Memcached. I am sure that you cannot achieve this performance on any disk-based storage like HBase, Cassandra, or HyperTable (and add any RDBMS like MySQL or PostgreSQL to that list). The best Redis and Memcached performance I have heard of in practice is around 100k operations per second on a single machine. That means you would need around 90 machines, each with about 1 terabyte of RAM!
Even a batch-processing system like Hadoop cannot do this job in less than an hour; I would guess it will take hours or days even on a big cluster of 100 machines.
You are talking about very, very big numbers (around 90-100 TB, 1M lookups per second). Are you sure about this?

What is the ideal bulk size formula in ElasticSearch?

I believe there should be a formula to calculate the bulk indexing size in Elasticsearch. The following are probably the variables of such a formula:
Number of nodes
Number of shards/index
Document size
RAM
Disk write speed
LAN speed
I wonder if anyone knows or uses a mathematical formula. If not, how do people decide their bulk size? By trial and error?
Read ES bulk API doc carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
Try with 1 KiB, then with 20 KiB, then with 10 KiB, ... bisect your way (dichotomy) to the sweet spot
Measure bulk size in KiB (or equivalent), not in document count!
Send the data in bulk (no streaming); put information that is the same for every document (such as the index name) in the API URL rather than repeating it per action, if you can
Remove superfluous whitespace in your data if possible
Disable index refresh while loading, re-enable it afterwards (see the sketch after this list)
Round-robin across all your data nodes
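As an example of the "disable refresh while loading" tip above, here is a small sketch with the Java high-level REST client; the index name "my-index" and the restored interval of 1s are assumptions, and exact package names can differ slightly between client versions:

import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class RefreshToggle {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Turn refresh off on the target index before the bulk load.
            client.indices().putSettings(
                    new UpdateSettingsRequest("my-index").settings(
                            Settings.builder().put("index.refresh_interval", "-1").build()),
                    RequestOptions.DEFAULT);

            // ... run the bulk indexing here ...

            // Restore a normal refresh interval once the load is done.
            client.indices().putSettings(
                    new UpdateSettingsRequest("my-index").settings(
                            Settings.builder().put("index.refresh_interval", "1s").build()),
                    RequestOptions.DEFAULT);
        }
    }
}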
There is no golden rule for this. Extracted from the doc:
There is no “correct” number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
I derived this information from the Java API's BulkProcessor class. It defaults to 1,000 actions or 5 MB; it also allows you to set a flush interval, but that is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
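For what it's worth, a hedged sketch of wiring up BulkProcessor against the Java high-level REST client, with those defaults spelled out explicitly; the host, index name, and document source are placeholders, and the exact builder signature varies a little across client versions:

import java.util.concurrent.TimeUnit;
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class EsBulkProcessorSketch {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest request) { }
            @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    System.err.println("Bulk had failures: " + response.buildFailureMessage());
                }
            }
            @Override public void afterBulk(long id, BulkRequest request, Throwable failure) {
                System.err.println("Bulk failed entirely: " + failure);
            }
        };

        BulkProcessor processor = BulkProcessor.builder(
                        (request, bulkListener) -> client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                        listener)
                .setBulkActions(1000)                                 // flush every 1,000 actions ...
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))   // ... or every 5 MB ...
                .setFlushInterval(TimeValue.timeValueSeconds(5))      // ... or every 5 seconds, whichever comes first
                .setConcurrentRequests(1)                             // one bulk in flight while the next fills up
                .build();

        for (int i = 0; i < 100_000; i++) {                           // stand-in for your real document source
            processor.add(new IndexRequest("my-index").id(String.valueOf(i)).source("field", "value " + i));
        }

        processor.awaitClose(30, TimeUnit.MINUTES);                   // flush what's left and wait for completion
        client.close();
    }
}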
I was searching for this and found your question. :)
I found the following in the Elastic documentation, so I will investigate the size of my documents:
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size
In my case, I could not get more than 100,000 records to insert at a time. I started with 13 million, went down to 500,000, and after no success started from the other side: 1,000, then 10,000, then 100,000, which was my maximum.
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware influencing indexing speed: the structure/complexity of your index (complex mappings, filters or analyzers), data types, whether your workload is I/O or CPU bound, and so on.
In any case, to demonstrate how variable it can be, I can share my experience, as it seems different from most posted here:
Elastic 5.6 with 10GB heap running on a single vServer with 16GB RAM, 4 vCPU and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the http bulk api (curl) using a batch size of 10k documents (20k lines, file sizes between 25MB and 79MB), each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did, all other configurations are the default. I guess this is mostly due to the fact that the index itself is not too complex.
The vServer is at about 50% CPU, SSD averaging at 40 MB/s and 4GB RAM free, so I could probably make it faster by sending two files in parallel (I've tried simply increasing the batch size by 50% but started getting errors), but after that point it probably makes more sense to consider a different API or simply spreading the load over a cluster.
Actually, there is no clear way to find the exact upper limit for a bulk update. An important factor to consider is the request data volume, not just the number of documents.
An excerpt from the link:
How Big Is Too Big?
      The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
      Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
      It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
Actually, I'm facing some problems related to the bulk API. There is one parameter that impacts it: the number of indices addressed inside a single bulk request.

What are the performance considerations when adding a large number of documents to a large Solr core?

If I have a Solr core with a half-dozen small fields that's loaded with 100 million documents, will adding a batch of 1 million documents run in a reasonable amount of time? How about 10 million? By reasonable, I'm thinking hours, rather than days. I've been told that this will take a long time to run. Is this really an issue? What are known strategies to improve performance? The fields are typically small, that is, 5-50 characters.
Two suggestions, on top of those already mentioned in other answers, for improving performance (I have tried the first; the second is still to be tried):
1) Decrease logging while updating: at INFO level, Solr appends one log entry per document. See here for how we did it: http://dmitrykan.blogspot.fi/2011/01/solr-speed-up-batch-posting.html. Some people reported a "3x speed increase".
2) Set the allowed number of segments in solrconfig.xml to something very large for indexing, like 10000. Once the batch indexing is complete, change the value back to something reasonably low, like 10.
This is a very "tricky" question whose answer differs from schema to schema.
Your solr installation has a half-dozen fields. But, how many are actually indexed? If only one field is indexed, then adding 1 million documents will be faster than adding 1 million docs when 6 fields are indexed.
I think the type of fields that are indexed also matters. A field that is of the type "text_general" is broken down into tokens while indexing whereas a field that is of the type "string" is not. "String" type is not analyzed and is stored as one complete token.
I have some very long fields which are indexed, and adding 2 million docs takes a few minutes (although my installation does not contain 100 million documents). So I do not think it will take days to add 10 million records to your installation.
I am not sure about this, but maybe the configuration of the machine running your Solr instance also matters, so you might want to check whether its CPU and memory can handle this much load.
It's up to you to decide whether a long-running data post is an issue or not. If your application is user-intensive, I suggest some kind of master-slave configuration so that users are not impacted by the high CPU usage while you post the data. One strategy I know of for improving performance is sharding:
http://carsabi.com/car-news/2012/03/23/step-by-step-solr-sharding/
Another option, if it is possible to demarcate the records by some field, is to put the different sets of documents onto different servers.
100 million records is a fairly large index for Solr, but adding 10 million records on a good machine should take hours, not days. You may find the following email thread interesting, as it includes both in-depth questions and some final advice on tuning a 10M-record indexing process.
Also, you did not say if you 'store' the fields as well as index them. If you do, you may also look forward to Solr 4.1 field compression.
An important factor that impacts indexing performance (in terms of time) is the way you have defined your data-config.xml file (the DataImportHandler configuration).
If your fields come from multiple tables in a database, you can configure it in two ways:
Entities within entities
A single entity with a join query
The second method is much faster than the first, because far fewer queries are fired against the database.

Should I keep the size of stored fields in Solr to a minimum?

I am looking to introduce Solr to power the search for a business listing website. The site has around 2 million records.
There is a search results page which will display some key data for each result. I believe the data needed for this summary information is around 1KB per result.
I could simply index the fields needed for the search within Solr - but this means a separate database call for each result to populate the summary information. If Solr could return all of this data I would expect it to yield greater performance than ~40 database round-trips.
The concern is that Solr's memory usage would be too large (how might I calculate this?) and that indexing might take too long with the extra data.
You would benefit greatly from storing those fields in Solr compared to making ~40 DB round-trips. Just make sure you mark the fields as not indexed (indexed = false) in your schema config, and maybe also compressed (compressed = true), although that will of course use some CPU when indexing and retrieving.
When a field is marked as not indexed, no analyzers process it at index time, which makes it much faster to store than an indexed field.
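On the retrieval side, here is a minimal SolrJ sketch of pulling the stored summary fields straight out of the search results, so no per-result database call is needed; the URL, collection, query, and field names are all placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ListingSearchSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/listings").build();

        SolrQuery query = new SolrQuery("name:plumber");
        query.setRows(40);                                    // one page of results
        query.setFields("id", "name", "address", "phone");    // stored (not necessarily indexed) summary fields

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            // All the summary data arrives with the search result; no extra round-trip per listing.
            System.out.println(doc.getFieldValue("name") + " - " + doc.getFieldValue("address"));
        }
        solr.close();
    }
}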
It's a trade off, and you will have to analyze this yourself.
Solr's performance greatly depends on caching, not only of queries, but also of the documents themselves. Those caches depend on memory, and the bigger your documents are, the less you can fit in a fixed amount of memory.
Document size also affects index size and replication times. For large indices with master slave configurations, this can impact the rate at which you can update the index.
Ideally you should measure cache hit rates at different cache sizes, with and without the fields. If you can spend the memory to get a high enough cache hit rate with the fields, then by all means go for it. If you cannot, you may have to fetch the document content from another system.
There is a third alternative you didn't mention, which is to store the documents outside of the DB, but not in Solr. They should be stored in a format which is as close as possible to what you deliver with search results. The code which creates/updates the indices could create/update these documents as well. This is a lot of work, but like everything it comes down to how much performance you need and what you are willing to do to get it.
EDIT: For measuring cache hit rates and throughput, I've found the best test source is your current query logs. Take a day or two worth of live queries and run them against different indexes and configurations to see how well they work.
