Is there a way to improve memory performance when using an elasticsearch percolator index?
I have created a separate index for my percolator. I have roughly 1 000 000 user-created saved searches (for email alerts). After creating this percolator index, heap usage spiked to 100% and the server became unresponsive to any queries. I have somewhat limited resources and am not able to simply throw more RAM at the problem. The only solution was to delete the index containing my saved searches.
From what I have read, the percolator index resides in memory permanently. Is this entirely necessary? Is there a way to throttle this behaviour but still preserve the functionality? Is there a way to optimize my data/queries/index structure to circumvent this behaviour while still achieving the desired result?
There is no resolution to this issue from an Elasticsearch point of view, nor is one likely. I have chatted with the Elasticsearch guys directly and their answer is: "throw more hardware at it".
I have, however, found a way to solve this problem by mitigating my usage of this feature. When I analyzed my saved-search data, I discovered that my searches consisted of around 100 000 unique keyword searches combined with various filter permutations, creating over 1 000 000 saved searches.
If I look at the filters they are things like:
Location - 300+
Industry - 50+
etc...
Giving a solution space of:
100 000 * >300 * >50 * ... ~= > 1 500 000 000
However, if I were to decompose the searches and index the keyword searches and filters separately in the percolator index, I would end up with far fewer searches:
100 000 + >300 + >50 + ... ~= > 100 350
And those searches themselves are smaller and less complicated than the original searches.
Now I create a second (non-percolator) index listing all 1 000 000 saved searches and including the ids of the search components from the percolator index.
I then percolate a document and run a second query, filtering the saved searches against the keyword and filter results from the percolator.
I'm even able to preserve the relevance score as this is returned purely from the keyword searches.
This approach will significantly reduce my percolator index memory footprint while serving the same purpose.
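To make the two-phase lookup concrete, here is a minimal sketch. The index names (search-components, saved-searches), the field names, and the one-filter-component-per-search simplification are all hypothetical, and it uses the modern percolate query rather than the older _percolate API this question predates:

```typescript
// Sketch only: all index and field names are assumptions, not a real schema.
const ES = "http://localhost:9200";

async function matchSavedSearches(doc: object) {
  // Phase 1: percolate the document against the decomposed components.
  // Each hit is one keyword search or one filter, not a full saved search.
  const percolateRes = await fetch(`${ES}/search-components/_search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      size: 10000, // large enough to collect every matching component
      query: { percolate: { field: "query", document: doc } },
    }),
  });
  const matchedIds: string[] = (await percolateRes.json()).hits.hits.map(
    (h: { _id: string }) => h._id
  );

  // Phase 2: find saved searches whose keyword AND filter components
  // both appear in the matched set (one filter per search for simplicity).
  const searchRes = await fetch(`${ES}/saved-searches/_search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: {
        bool: {
          filter: [
            { terms: { keyword_component_id: matchedIds } },
            { terms: { filter_component_id: matchedIds } },
          ],
        },
      },
    }),
  });
  return (await searchRes.json()).hits.hits;
}
```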
I would like to invite feedback on this approach (I haven't tried it yet but I will keep you posted).
Likewise, if my approach is successful, do you think it is worth a feature request?
Related
I am in real pain with Elastic App Search: the limit for bulk indexing is 100 documents, according to the docs:
https://www.elastic.co/guide/en/app-search/current/limits.html
I tried creating all the promises and then doing Promise.all(allPromises), but it fails to index everything. The response still returns 200 when this fails, and you have to loop over res.data (the array of up to 100 documents) and check whether each one has an error field.
Is there any solution to index a lot of documents fast? Indexing 1 million documents in a loop that awaits between every 100-document batch is extremely slow.
Unfortunately the limit of 100 makes it a slow operation. We are indexing 1.1 million documents, and our solution was to slice that into ten segments and run ten processes in parallel to decrease the time it takes.
Since reading from the index is very quick, we have a separate job that validates that the information in App Search matches our source data and flags anything out of order. So I don't check for errors during a full import or update, which lets that part go as fast as possible. We only see around a 1% failure rate on bulk imports.
I should note that part of the reason for the second piece is that we have on occasion found the search index can get out of alignment with the documents and fields we have, so validation seemed like a good idea.
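A minimal sketch of this sliced, parallel import, assuming the App Search documents endpoint and a per-document errors array in the response; the base URL, engine name, and API key are placeholders you would replace with your own:

```typescript
const BASE = "http://localhost:3002"; // placeholder deployment URL
const API_KEY = "private-xxxxxxxx";   // placeholder private key
const ENGINE = "my-engine";           // placeholder engine name
const BATCH = 100;  // App Search bulk-indexing limit
const WORKERS = 10; // parallel slices, as described above

async function indexBatch(docs: object[]) {
  const res = await fetch(`${BASE}/api/as/v1/engines/${ENGINE}/documents`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify(docs),
  });
  // The HTTP status is 200 even when individual documents fail, so
  // inspect each result and collect the ones reporting errors.
  const results: { id: string; errors: string[] }[] = await res.json();
  return results.filter((r) => r.errors && r.errors.length > 0);
}

async function indexAll(docs: object[]) {
  // Slice the corpus into WORKERS contiguous segments and run them
  // concurrently; each worker sends its slice in batches of 100.
  const per = Math.ceil(docs.length / WORKERS);
  await Promise.all(
    Array.from({ length: WORKERS }, async (_, w) => {
      const slice = docs.slice(w * per, (w + 1) * per);
      for (let i = 0; i < slice.length; i += BATCH) {
        const failed = await indexBatch(slice.slice(i, i + BATCH));
        if (failed.length) console.warn(`worker ${w} failures:`, failed);
      }
    })
  );
}
```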
Let's suppose I have a big index consisting of 500 million docs. By default, ES creates 5 primary shards for the reasons below, and I go with the same setting.
Performance: it takes less time to search a shard with fewer documents (100 million in my use case) than one shard with a huge number of documents (500 million). It also allows ES to distribute and parallelize operations across shards.
Horizontal scalability (HS): horizontally split/scale your content volume.
But when we search, by default it just goes to 1 shard and gives the result. In this case relevance isn't accurate (as IDF is majorly impacted), and it might not even return a result if my matched document is on another shard. This is called the Sharding Effect.
The above issue is explained in detail here, and there are 2 options below to avoid it, but I think both solutions have some cons:
1. Document routing: in this case all the documents for a given routing value will be on the same shard, which loses the whole purpose of sharding.
2. The dfs_query_then_fetch search type: there is a performance cost associated with it.
I am interested to know the following:
What does ES do by default? Or is there any config by which this can be controlled?
Is there any other out-of-the-box solution which ES provides to avoid the sharding effect?
First of all, this part of your question is not accurate:
"But when we search, by default it just goes to 1 shard and gives the result. In this case relevance isn't accurate (as IDF is majorly impacted), and it might not even return a result if my matched document is on another shard. This is called the Sharding Effect."
The claim that the search goes to only 1 shard is false. The search request is sent to all shards (of course, or no one would use Elasticsearch!), but the score is computed on a per-shard basis. So yes, you can have an accuracy problem with multiple shards, but only if you have very few documents. With 500 million documents, accuracy will not be a problem (unless you make bad use of document routing; see here for more information).
So when you search for 10 results for a query, each shard returns its 10 best matches for the query; the results from the shards are then aggregated by the coordinating node to give the best 10 results for the whole index.
You can use 5 shards without fearing any relevancy problem. But don't try to avoid the sharding effect! It is what makes Elasticsearch so cool :D
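For reference, here is a minimal sketch of opting into dfs_query_then_fetch (option 2 in the question), which gathers global term statistics from all shards before scoring, trading an extra round-trip for shard-independent relevance; the index name and query are hypothetical:

```typescript
// Hypothetical index and query; only the search_type parameter matters here.
async function searchWithGlobalIdf() {
  const res = await fetch(
    "http://localhost:9200/my-index/_search?search_type=dfs_query_then_fetch",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        size: 10,
        query: { match: { title: "sharding effect" } },
      }),
    }
  );
  return (await res.json()).hits.hits;
}
```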
The Usecase
I have an index of potentially millions of documents. I want to run around 20'000 searches on a subset of these documents (around 25'000 documents). These 25'000 documents could take up around 100 MB stored in Solr (consisting of stored and indexed text fields).
The Problem
As the number of indexed documents increases, the performance of the queries decreases a lot. For example, running 20'000 searches that hit 25'000 documents on a 100'000-document index takes around 4 minutes. Running the same searches on a 200'000-document index takes around 20 minutes.
So is there any way to cache these 25'000 documents in RAM before hitting them with searches?
UPDATE
Some things that really helped:
reducing the returned row count (In almost all cases I had to iterate through the returned results, and in almost all cases there were no more than 100 matching results, but I had set rows to a very large value. Reducing the row count improved performance around 2x. This seemed counter-intuitive: if there are only 79 matches and I set the returned row count to 100, it performs better than when there are 79 matches and I set the row count to 1000. In the first case Solr already returns the found item count, and does it fast. Why should there be a performance difference?)
reducing multithreading (I had added multiple threads for querying because more resources were available on the development box. On the resource-constrained production box it was slowing things down. Using only one or two threads got me around a 2x speed improvement.)
Some things that did not really help:
splitting up filter queries (I was already using filter queries everywhere possible, but I was combining them into one fq per query: fq=name:a AND type:b. Splitting them up as fq=name:a&fq=type:b caches them separately (see the Apache Solr documentation, and the sketch after this list). It could improve performance, but it did not make a huge difference in this case.)
changing caching settings (in this case filterCache seemed to have the most potential; however, increasing it or changing its settings did not make a huge difference)
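For reference, a minimal sketch of the split-fq request mentioned above; the host, core, and field names are hypothetical, and wt=json is passed explicitly for older Solr versions:

```typescript
// Each fq is sent as its own parameter, so Solr caches name:a and type:b
// as separate filterCache entries that later queries can reuse independently.
async function querySolr() {
  const params = new URLSearchParams({
    q: "some keywords",
    rows: "100", // keep rows close to the expected match count
    wt: "json",  // explicit JSON response format for older Solr versions
  });
  params.append("fq", "name:a");
  params.append("fq", "type:b");
  const res = await fetch(`http://localhost:8983/solr/mycore/select?${params}`);
  const { response } = await res.json();
  return response.docs; // response.numFound holds the total match count
}
```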
A few things that are recommended for performance:
Have enough spare RAM on the box so index files can be in OS cache
Try to play around with Solr caching settings in solrconfig.xml (see the sketch after this list)
Play around with autowarming after commits
Try to develop your queries to limit the result set. Large result sets, specifically if using grouping and faceting, will kill performance. Now, a 200,000-document index is really quite small, so you should not have any problems, but I thought I'd mention this for when you scale.
Try to use filter queries (fq) whenever possible. They are much faster than doing field:val in q, plus they are cached.
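As an illustration of the caching and autowarming knobs above, a sketch of the relevant solrconfig.xml section; the sizes and the warming query are placeholders, not recommendations, and older Solr versions use solr.FastLRUCache instead of solr.CaffeineCache:

```xml
<query>
  <!-- One entry per unique fq; autowarmCount copies the most recently
       used entries into the cache of a newly opened searcher. -->
  <filterCache class="solr.CaffeineCache"
               size="512"
               initialSize="512"
               autowarmCount="128"/>
  <!-- Autowarming after commits: replay representative queries so the
       first user request after a commit does not pay the full cost. -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">common query here</str>
        <str name="fq">type:b</str>
      </lst>
    </arr>
  </listener>
</query>
```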
I am deploying an approach which uses the from parameter a lot. I wish to understand how 'skip' works in Elasticsearch, or other such systems in general, to judge what performance loss it incurs.
It depends on the search type. If you use the default, i.e. query then fetch, then to fetch page 20 with size 10 (from: 190, size: 10), Elasticsearch will:
ask each primary shard for the ids and relevance scores of its top 200 documents (which are selected from all docs matching the query, so this means searching the whole index, but the same is true when fetching only the first page)
merge the results, sorting by relevance, and skip the top 190 hits of the merged list, taking the 10 that follow
fetch the actual docs (i.e. 10 of them) from the relevant shards
It means that if you have e.g. 3 primary shards, the Elasticsearch nodes need to exchange information about 3 * 200 = 600 docs. There are some optimizations that make obtaining particularly 'distant' pages more efficient, but in a nutshell, you need to process more and more documents each time you fetch the next page.
If your use case involves going through a result set sequentially, consider scrolling.
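A minimal scrolling sketch, assuming a hypothetical index and query:

```typescript
const ES = "http://localhost:9200";

// Walks the whole result set page by page; unlike growing `from` offsets,
// each scroll call only transfers the next page. Contexts expire after the
// keep-alive (1m here) unless refreshed by each subsequent call.
async function* scrollAll(index: string, query: object) {
  // Open the scroll context with the first page.
  let res = await fetch(`${ES}/${index}/_search?scroll=1m`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ size: 1000, query }),
  });
  let page = await res.json();
  while (page.hits.hits.length > 0) {
    yield* page.hits.hits;
    // Fetch the next page using the scroll id.
    res = await fetch(`${ES}/_search/scroll`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ scroll: "1m", scroll_id: page._scroll_id }),
    });
    page = await res.json();
  }
}
```

Note that recent Elasticsearch versions recommend search_after with a point-in-time over scroll for deep sequential reads.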
If I have a Solr core with a half-dozen small fields that's loaded with 100 million documents, will adding a batch of 1 million documents run in a reasonable amount of time? How about 10 million? By reasonable, I'm thinking hours, rather than days. I've been told that this will take a long time to run. Is this really an issue? What are known strategies to improve performance? The fields are typically small, that is, 5-50 characters.
Two suggestions for improving performance, on top of those already mentioned in other answers (the first I have tried, the second is yet to be tried):
1) Decrease logging while updating: at INFO level Solr appends one log entry per document. See here for how we did it: http://dmitrykan.blogspot.fi/2011/01/solr-speed-up-batch-posting.html Some people reported a "3x speed increase".
2) Set the number of segments in solrconfig.xml to something very large for indexing, like 10000. Once the batch indexing is complete, change the parameter value back to something reasonably low, like 10.
This is a very "tricky" question whose answer differs from schema to schema.
Your Solr installation has a half-dozen fields. But how many are actually indexed? If only one field is indexed, then adding 1 million documents will be faster than adding 1 million docs when 6 fields are indexed.
I think the type of fields that are indexed also matters. A field that is of the type "text_general" is broken down into tokens while indexing whereas a field that is of the type "string" is not. "String" type is not analyzed and is stored as one complete token.
I have some very long fields which are indexed, and adding 2 million docs takes a few minutes (although my installation does not contain 100 million documents). So I do not think that it will take days to add 10 million records to your installation.
I am not sure about this, but maybe the configuration of the CPU which is running the Solr instance also matters. So you might need to check whether your CPU and memory can handle this much load.
It's up to you to decide if a long-running data post is an issue or not. If your application is user-intensive, then I suggest that you follow some kind of master-slave configuration, so that users are not impacted by the high CPU usage when you post the data. One strategy I know of for improving performance is "sharding":
http://carsabi.com/car-news/2012/03/23/step-by-step-solr-sharding/
or, if it is possible, demarcate the records by some field and put those different documents onto different servers.
100 million records is a fairly large index for Solr, but adding 10 million records on a good machine should take hours, not days. You may find the following email thread interesting, as it includes both in-depth questions and some final advice on tuning a 10M-record indexing process.
Also, you did not say if you 'store' the fields as well as index them. If you do, you may also look forward to Solr 4.1 field compression.
An important factor that impacts indexing performance (in terms of time) is the way in which you have defined your data-config.xml file.
If your fields come from multiple tables in a database, you can configure it in 2 ways:
Entities within entities
A single entity with a join query
The second method is faster than the first by a large degree, because far fewer queries are fired against the database.
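To illustrate, a sketch of the two data-config.xml styles for the DataImportHandler; the table and column names are hypothetical:

```xml
<!-- 1) Entities within entities: the child query is fired once per
     parent row, so importing N items costs N+1 database queries. -->
<entity name="item" query="SELECT id, name FROM item">
  <entity name="feature"
          query="SELECT description FROM feature
                 WHERE item_id = '${item.id}'"/>
</entity>

<!-- 2) A single entity with a join: one query streams everything. -->
<entity name="item"
        query="SELECT i.id, i.name, f.description
               FROM item i LEFT JOIN feature f ON f.item_id = i.id"/>
```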