I use Elasticsearch 6.2.4 and currently use the Scroll API to deal with queries that return more than 10,000 documents. Since most of my queries (90%) return fewer than that and involve real-time usage, opening a scroll context is inefficient, so I am considering switching to the Search After feature.
I noticed in the example in the Search After documentation that a sort is used, and if I understand correctly it has to be repeated in every query. Will Elasticsearch re-order the results every time, again and again? That could have a huge impact on performance.
I read about index sorting; can this solve my problem, since the shards will then always be stored pre-sorted?
Is there any reason to use search_after without index sorting?
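To make the question concrete, here is a minimal sketch of search_after paging (field names like timestamp and title are hypothetical). The same sort clause is repeated on every page, but the sort keys are read from doc_values, so each request only orders the matching hits, not the whole index:

```python
# First page: an ordinary query with a deterministic sort. The sort must
# produce a total order, so a unique tiebreaker field is included.
first_page = {
    "size": 100,
    "query": {"match": {"title": "elasticsearch"}},
    "sort": [{"timestamp": "asc"}, {"_id": "asc"}],
}

# Suppose the last hit of the first page carried these sort values
# (taken from hits.hits[-1].sort in the response):
last_sort_values = [1518332400000, "doc-4242"]

# Next page: identical query and sort, plus search_after with the last
# hit's sort values. No "from" parameter, so no deep-paging cost.
next_page = dict(first_page)
next_page["search_after"] = last_sort_values
```

Note that sorting on `_id` is discouraged as a tiebreaker in production (it has no doc_values by default); a dedicated keyword copy of the ID is the usual choice.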
I am trying to optimize search speed on my index containing text data. My query is very complex: there are lots of bool queries at many levels. Each query also has a filter query, which narrows the search area. My idea was to create a filtered alias (with the filter query) on my index and search with my query (without the filter part), but it didn't improve search speed.
My second idea is to reindex the filtered data (every day, for example) and then search on the new index (which reduced search time a lot), but data arrives in the index continuously in real time, so this solution is not appropriate for a production environment.
Could you advise any solution for how to deal with that?
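For reference, this is roughly what the filtered-alias approach described above looks like (index, alias, and field names are assumptions). A key caveat is that the alias filter is still executed as part of every search against the alias, which is why it does not necessarily make queries faster by itself:

```python
# Body for a POST to the _aliases endpoint: adds an alias whose filter
# narrows every search made through it to matching documents only.
alias_request = {
    "actions": [
        {
            "add": {
                "index": "my-index",
                "alias": "my-index-active",
                "filter": {"term": {"status": "active"}},
            }
        }
    ]
}
```

Because filters are cacheable, the second and later searches through the alias may benefit from the filter cache even when the first one does not.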
I am using Elasticsearch 6.2, and I have some queries that analyze a massive amount of documents. I am sorting on one field in the index. Elasticsearch examines 10,000 documents (the default configuration value) and then returns them paginated.
I tried to read the documentation, but I cannot find any information on whether the database applies the sorting before or after the analysis process of the documents in the index.
In other words, is the sort applied directly to the analyzed terms, or are the documents sorted once analysis is done? If the latter is correct, which kind of sort does Elasticsearch apply during the scan?
Thanks a lot.
Sorting, aggregations, and access to field values in scripts requires
a different data access pattern. Instead of looking up the term and
finding documents, we need to be able to look up the document and find
the terms that it has in a field.
This quote from the Elasticsearch reference documentation implies to me that sorting happens on the non-analyzed level, but I also decided to double-check and run some tests.
In Elasticsearch we can sort on non-analyzed fields, e.g. keyword. Those fields use doc_values for sorting, and after testing I can say that sorting uses the pre-analyzed values, ordered by the codes representing the characters (digits, then uppercase letters, then lowercase letters).
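The byte-wise ordering described above is the same one plain string comparison produces, since digits, uppercase letters, and lowercase letters occupy ascending code-point ranges. A quick illustration in Python:

```python
# Character-code ordering: '4' (52) < 'A' (65) < 'a' (97) < 'b' (98),
# so digits sort before uppercase, which sort before lowercase.
values = ["banana", "Apple", "42", "apple"]
print(sorted(values))  # ['42', 'Apple', 'apple', 'banana']
```

This is why a keyword sort can look "wrong" to users expecting case-insensitive alphabetical order; a normalizer (e.g. lowercase) on the keyword field is the usual fix.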
It's also possible to sort on text fields, with some caveats and tuning (e.g. you need to enable fielddata, since text fields do not support doc_values).
In this case the documents are sorted according to the analyzed values. Of course, a lot depends on the analysis pipeline, since it can transform the text in various ways. Also, just as a reminder:
Fielddata can consume a lot of heap space, especially when loading
high cardinality text fields. Once fielddata has been loaded into the
heap, it remains there for the lifetime of the segment. Also, loading
fielddata is an expensive process which can cause users to experience
latency hits. This is why fielddata is disabled by default.
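For completeness, this is the mapping change that enables sorting on a text field (the field name is hypothetical), subject to the heap-cost caveat quoted above:

```python
# Mapping fragment for PUT my-index/_mapping: fielddata must be enabled
# explicitly on a text field before it can be sorted or aggregated on.
mapping = {
    "properties": {
        "description": {
            "type": "text",
            # Loads the field's terms into heap on first sort/aggregation
            # and keeps them there for the lifetime of the segment.
            "fielddata": True,
        }
    }
}
```

A common alternative is a multi-field: keep the analyzed text field for search and add a `.keyword` sub-field (with doc_values) for sorting, avoiding fielddata entirely.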
I need to use machine learning algorithms in order to sort / rank query results.
Our queries are running on elasticsearch.
For that, we need to combine data from the document itself, from the explanation section (although the explanation should not be returned to the client), and from external sources.
This is pretty heavy computation, so I don't want to run the ranking algorithms on all documents, but only on top 1000, and return my top 100.
A scoring plugin would run on all documents; I didn't see any option to create a plugin for the rescoring phase.
So, it seems like I must create a sorting plugin.
My question is: how many documents go through the sorting phase? Is there any way to control it (like window_size in rescore)? What happens if I have pagination - does my sorting run again?
Is it possible to get 1000 docs with the explanation section into the sorting phase and return only 100 without the explanation?
Thanks!
-This is pretty heavy computation, so I don't want to run the ranking algorithms on all documents, but only on top 1000, and return my top 100.
Use rescoring combined with your scoring plugin; the rescore algorithm runs only on the top N results per shard.
-how many documents are running through the sorting phase?
All documents that match your query. If you ask for N docs, each shard sends its top N and then they are merged together.
-What happens if I have pagination - does my sorting runs again?
Yes, sorting runs again, and it is worse with deep pagination: if you ask for documents 100,000 to 100,010, sorting happens over 100,010 docs per shard.
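A sketch of the rescore request shape the answer suggests (field names and weights are hypothetical; the expensive scoring would come from your plugin or a script rather than the match_phrase used here as a stand-in). Only window_size hits per shard go through the rescore query, and only size hits are returned:

```python
request = {
    "size": 100,  # return only the top 100 after rescoring
    "query": {"match": {"body": "ranking"}},  # cheap first-pass query
    "rescore": {
        "window_size": 1000,  # rescore only the top 1000 hits per shard
        "query": {
            # Stand-in for the expensive ranking logic.
            "rescore_query": {"match_phrase": {"body": "machine learning"}},
            "query_weight": 0.7,
            "rescore_query_weight": 1.2,
        },
    },
}
```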
I'm working on a simple index containing one million docs with 30 fields each.
A q=*:* query with a very low start value (0, for instance) takes only a few milliseconds (~1, actually).
The higher the start value is, the slower Solr gets...
start=100000 => 171 ms
start=500000 => 844 ms
start=1000000 => 1274 ms
I'm a bit surprised by this performance degradation, and I'm concerned since the index is expected to grow to over a hundred million documents within a few months.
Did I do something wrong in the schema? Or is this normal behavior, given that slicing docs beyond the first few hundred should usually not happen? :)
EDIT
Thanks guys for those explanations. I was guessing something like that; however, I preferred to be sure that it was not related to the way the schema was described. So the question is solved for me.
Every time you make a search query to Solr, it collects all the documents matching the query. It then skips documents until the start value is reached and returns the results from there.
Another point to note is that every time you make the same search query with a higher start value, those documents are also not present in the cache, so it may churn the cache as well (depending on the size and type of cache you have configured).
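The cost described above can be modeled in a few lines: to return `rows` documents at offset `start`, the collector effectively maintains a priority queue of `start + rows` entries and pushes every matching document through it, so the work grows with the offset:

```python
import heapq

def page(scores, start, rows):
    """Toy model of offset pagination: keep the top start+rows entries,
    then discard the first start of them."""
    top = heapq.nlargest(start + rows, scores)  # heap of size start+rows
    return top[start:start + rows]

scores = list(range(1000))
print(page(scores, 0, 3))    # [999, 998, 997]
print(page(scores, 500, 3))  # [499, 498, 497] - same rows, 167x the heap
```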
Pagination naively works by retrieving all the documents up until the cut off point, throwing them away, then fetching enough documents to satisfy the number of documents requested and then returning.
If you're doing deep paging (going far into the result set), this becomes expensive, which is why CursorMark support was implemented (see "Fetching A Large Number of Sorted Results: Cursors") to support near-instant pagination into a large set of documents.
Yonik also has a good blog post about deep pagination in Solr.
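A sketch of what cursorMark paging looks like in practice (the sort fields are assumptions; the tiebreaker must be the collection's uniqueKey). You start with `cursorMark=*`, then feed each response's `nextCursorMark` into the following request until it stops changing:

```python
def cursor_params(query, cursor_mark, rows=100):
    """Build Solr request parameters for one cursorMark page."""
    return {
        "q": query,
        "rows": rows,
        # cursorMark requires a sort that includes the uniqueKey field
        # ("id" here) as a tiebreaker, so the ordering is total.
        "sort": "score desc, id asc",
        "cursorMark": cursor_mark,
    }

first = cursor_params("*:*", "*")  # "*" starts the cursor
# After each request: if response["nextCursorMark"] equals the cursorMark
# you just sent, you have reached the end; otherwise pass it to the next
# cursor_params(...) call.
```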
I have a number of documents stored in a Solr cluster and want to get a large number of them (about 20 million) with a particular query. I use the standard approach of reading batches of rows (say, 10,000) and moving to the next batch with the start parameter. However, after about 1,400,000 docs I start to get an OutOfMemoryError. I believe this is because of the way Solr sorts docs before sending them to the client. As far as I know, it uses a priority queue to get only the top N results, and thus need not load headers of all documents into memory. However, when I ask it to return results from, say, 1,000,000 to 1,010,000, it has to load headers for all previous 1,000,000 docs too.
I'm looking for a way to avoid this and just get all results satisfying query without sorting. Is there a way to do it? If not, what is appropriate way to get large number of results from Solr?
Your assumptions are correct. When you search for results from 1,000,000 to 1,010,000, Solr instantiates a priority queue of size 1,010,000.
This is really not a natural use-case for Solr which has been designed to return the top-k list of results, rather than an exhaustive list of results.
You could work around this by filtering on ranges of your primary key (q=yourquery&fq=ID:[1 TO 1000]&rows=1000, then q=yourquery&fq=ID:[1001 TO 2000]&rows=1000, ...), but this is an ugly hack. :-)
Why do you need to get all results? For example, if you need to compute facets or statistics, Solr has two components that can do that efficiently.
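If the goal really is statistics rather than the documents themselves, a single faceting/stats request avoids exporting 20 million rows entirely (field names here are assumptions):

```python
# Solr request parameters: count per category and numeric stats on price,
# returning zero documents (rows=0) - only the aggregates come back.
params = {
    "q": "your query",
    "rows": 0,
    "facet": "true",
    "facet.field": "category",
    "stats": "true",
    "stats.field": "price",
}
```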