Is it possible to get Elasticsearch to terminate its search early and just return the first N matches it finds?
I have a large data set and have noticed that when I issue a query that hits all the records, it takes much longer to return the top 10 results than if the query hits only a small number of results. I don't really need the full result count, and I don't care whether the 10 results returned are the "best" matches.
In addition to setting the size as in Richa's answer, you might also want to check the two following request parameters, namely:
timeout: allows you to specify a maximum execution time (in milliseconds). ES will respond as soon as that timeout is reached and return the results it got so far.
terminate_after: the maximum number of docs to collect on each shard; once that number is reached, query execution on that shard terminates early.
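As a rough illustration of how these parameters fit together, here is a minimal Python sketch that only builds the JSON request body; it does not talk to a cluster. The helper name build_early_exit_body and the default values are made up for this example, and it assumes the parameters are sent in the body of a _search request, where timeout is written as a time value such as "100ms".

```python
import json
from typing import Any, Dict


def build_early_exit_body(size: int = 10,
                          timeout_ms: int = 100,
                          terminate_after: int = 1000) -> Dict[str, Any]:
    """Build a _search request body that favours a fast answer over a complete one.

    Hypothetical helper for illustration only; the defaults are arbitrary.
    """
    return {
        "query": {"match_all": {}},          # placeholder query
        "size": size,                        # number of hits to return
        "timeout": f"{timeout_ms}ms",        # give up and return partial results after this long
        "terminate_after": terminate_after,  # stop collecting on each shard after this many docs
    }


if __name__ == "__main__":
    print(json.dumps(build_early_exit_body(), indent=2))
```

The resulting dictionary can be passed as the request body by whichever client you use. Keep in mind that terminate_after applies per shard, so the number of documents examined overall grows with the number of shards.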
You can use size
GET /index/type/_search?size=5
Environment
.Net 5
Elasticsearch.Net.Aws 7.1.0
Problem
Even with pagination, Elasticsearch's query API does not support more than 10_000 records by default. That is, if from + size exceeds 10_000, the API throws an error.
Potential solutions
Increase size
I can increase the index's max_result_window as described here. However I am expecting a large dataset in production - probably less than 10_000_000 records at one time, but for obvious reasons I don't believe that simply increasing the window size is a good idea. My use-case does not require over-the-top performance, but it has to be reasonable for both the end-user and the AWS bill.
What do you think? How much leeway do I have with the max_result_window setting?
Track total hits
I've read about the track_total_hits parameter. It only returns the correct number of total hits on each request, but still does not allow records after the 10_000th to be fetched.
Scroll API
I've read about the Scroll-API - it's being deprecated currently, so I'd like to avoid it.
Search after
I've read about the search_after parameter. The concept is to define a consistent sort criterion and issue the exact same query for each page, the only difference being the value of search_after, which for every subsequent search should be the sort value of the last hit returned by the previous search.
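To make that mechanism concrete, here is a small Python sketch of the paging loop. It is only an illustration: page_body and iterate_pages are hypothetical helper names, and search stands for any callable that takes a request body and returns the list of hits (each carrying its "sort" values), so that no particular client library is assumed.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Hit = Dict[str, Any]


def page_body(query: Dict[str, Any], sort: List[Any], size: int,
              search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build the body for one page; only search_after changes between pages."""
    body: Dict[str, Any] = {"query": query, "sort": sort, "size": size}
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iterate_pages(search: Callable[[Dict[str, Any]], List[Hit]],
                  query: Dict[str, Any], sort: List[Any],
                  size: int) -> Iterator[List[Hit]]:
    """Walk forward through the result set one page at a time."""
    after: Optional[List[Any]] = None
    while True:
        hits = search(page_body(query, sort, size, after))
        if not hits:
            return
        yield hits
        # The sort values of the last hit become the cursor for the next page.
        after = hits[-1]["sort"]
```

Note that this loop can only move forward from a known cursor; it does not by itself provide a way to jump straight to an arbitrary page. The sort should also include a unique tiebreaker field so that the cursor is unambiguous.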
As far as I can tell this is the recommended solution, but while it may work for large page sizes, I'm having difficulty understanding how it solves the basic paging case:
Let's say we have 20_000 records total and the page size is 10, hence 2_000 pages. How can I return the last page, containing records 19_990-20_000? Unless I misunderstand, search_after does not help, because I've skipped pages and I don't have the sort value of record number 19_989.
Furthermore, per the docs:
If provided, the from argument must be 0 (default) or -1
This means that I cannot use a combination of both:
Perform one search with "from": "990"
Use the last record's sort value to perform a second search, again using a "from": "990"
Return the results of the second search.
Beyond that I cannot figure out another way to use it. Could you tell me where I'm getting it wrong?
I recently upgraded from Elasticsearch 6 to 7 and stumbled across the 10000 hits limit.
I read the Changelog and the Documentation, and I also found a single blog post from a company that tried this new feature and measured their performance gains.
But I'm still not sure how and why this feature works. Or does it only improve performance under special circumstances?
Especially when sorting is involved, I can't get my head around it. Because (at least in my world) when sorting a collection you have to visit every document, and that's exactly what they are trying to avoid according to the Documentation: "Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents."
Hopefully someone can explain how things work under the hood and which important point I am missing.
There are at least two different contexts in which not all documents need to be sorted:
A. When index sorting is configured, the documents are already stored in sorted order within the index segment files. So whenever a query specifies the same sort as the one in which the index was pre-sorted, only the top N documents of each segment need to be visited and returned. In this case, if you are only interested in the top N results and you don't care about the total number of hits, you can simply set track_total_hits to false. That's a big optimization since there's no need to visit all the documents of the index.
B. When querying in the filter context (i.e. bool/filter), because no scores will be calculated. The index is simply checked for documents that match a yes/no question, and that process is usually very fast. Since there is no scoring, only the top N matching documents are returned per shard.
If track_total_hits is set to false (because you don't care about the exact number of matching docs), then there's no need to count the docs at all, hence no need to visit all documents.
If track_total_hits is set to N (because you only care to know whether there are at least N matching documents), then the counting will stop after N documents per shard.
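The three settings of track_total_hits can be sketched in Python as follows. This assumes the Elasticsearch 7 response shape, in which hits.total is an object with a value and a relation ("eq" or "gte") and is omitted when tracking is disabled; the helper names body_with_total_tracking and describe_total are hypothetical.

```python
from typing import Any, Dict, Optional, Union


def body_with_total_tracking(query: Dict[str, Any],
                             track_total_hits: Union[bool, int]) -> Dict[str, Any]:
    """Build a request body.

    track_total_hits: False = do not count, True = count exactly,
    an integer N = count accurately only up to N.
    """
    return {"query": query, "size": 10, "track_total_hits": track_total_hits}


def describe_total(total: Optional[Dict[str, Any]]) -> str:
    """Interpret the hits.total object of a response (None if it was omitted)."""
    if total is None:
        return "total not tracked"
    if total["relation"] == "gte":
        # Counting stopped at the threshold, so the value is a lower bound.
        return f"at least {total['value']}"
    return f"exactly {total['value']}"
```

For example, describe_total({"value": 10000, "relation": "gte"}) gives "at least 10000", which is what a response looks like once the tracking threshold has been reached.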
Relevant links:
https://github.com/elastic/elasticsearch/pull/24864
https://github.com/elastic/elasticsearch/issues/33028
https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
I have the intention to use the Terminate After feature of elasticsearch in order to reduce the result set.
The question is, the documents retrieved when using Terminate After, are ranked among the complete set of documents, or just among the reduced returned set?
Terminate after limits the number of search hits collected per shard. Any document that would have been hit later could have had a higher ranking (higher score) than the highest-ranked document returned, since the score used for ranking is independent of the other hits.
So yes, the documents will be ranked only among the result set returned, but this does not affect how the actual score was calculated, which takes into account all the documents.
Wanting a reduced result set and wanting it to be ranked depending on all the hits that may have occurred is a contradiction in itself.
Terminate after is generally used for filter type queries where the score of all returned docs is the same so that ranking doesn't matter.
For match-type queries, ES uses pagination, so it's already quite efficient and you don't really need to restrict the document set anyway.
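The effect described above can be reproduced with a toy in-memory model in Python. This is not how Elasticsearch is implemented; it only assumes that a shard collects matching documents in index order and stops after terminate_after of them. The function search_shard and the sample scores are made up for the illustration.

```python
from typing import Any, Dict, List, Optional

Doc = Dict[str, Any]


def search_shard(docs: List[Doc], size: int,
                 terminate_after: Optional[int] = None) -> List[Doc]:
    """Toy model of one shard: collect in index order, then rank what was collected."""
    # Stop collecting after `terminate_after` matching docs (None = collect everything).
    collected = docs[:terminate_after]
    # Ranking only sees the collected docs, although each score was computed independently.
    return sorted(collected, key=lambda d: d["score"], reverse=True)[:size]


shard = [
    {"id": 1, "score": 0.2},
    {"id": 2, "score": 0.5},
    {"id": 3, "score": 0.9},  # best match, but stored last on the shard
]

full = search_shard(shard, size=2)
early = search_shard(shard, size=2, terminate_after=2)
```

Here full contains documents 3 and 2, while early contains documents 2 and 1: the best-scoring document is never collected, yet the scores of the returned documents are unchanged.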
From the documentation here: https://msdn.microsoft.com/en-us/library/dn760793.aspx
It says:
totalEstimatedMatches:
The estimated number of news articles that are relevant to the query. Use this number along with the count and offset query parameters to page the results.
However, there are some serious issues.
1. The returned number of results is ALWAYS less than the requested number in the "count" variable. For example, setting count=100 results in only 75 results.
2. What's more, even skipping the difference and sending another query to the API with an offset (in this example, offset=100), the API returns a new totalEstimatedMatches! (The first query reported 70k results, the second 138.)
What is going on here? How do we fully get the totalEstimatedMatches returned from the first query? Or is that a bogus inflated number?
We did some investigation on this issue. Basically, a search engine index does not support an accurate estimate of the total number of matches; the same behavior can be observed on Bing.com. The 217M results in the screenshot provided in the image tab above are not very accurate either.
Also, News has a backend mechanism that keeps the output of any query below 100 results, so the total estimated matches number is not used properly in this example. Normally we do not allow users to download too many results per query in News. The number of documents you can get from a given query is capped at a certain number; in most cases it is around 100.
I am always getting a high value for the doc_count_error_upper_bound attribute on an aggregation query in Elasticsearch. It's sometimes as high as 8000 or 9000 for an ES cluster with almost a billion documents indexed. When I run the query on an index of about 5M docs, I get a value of about 300 to 500.
The question is: how incorrect are my results? (I am running a top 20 count query based on the JSON below.)
"aggs":{ "group_by_creator":{ "terms":{ "field":"creator" } } } }
This is pretty well explained in the official documentation.
When running a terms aggregation, each shard will figure out its own top-20 list of terms and will then return their 20 top terms. The coordinating node will gather all those terms and reorder them to get the overall top-20 terms for all the shards.
If you have more than one shard, it's no surprise that there might be a non-zero error count as shown in the official doc example and there's a way to compute the doc count error.
With one shard per index, the doc count error will always be zero, but that might not always be feasible depending on your index topology, especially if you have almost one billion documents. For your index with 5M docs, though, if the documents are not too big, they could well be stored in a single shard. Of course, it depends a lot on your hardware, but if your shard size doesn't exceed 15-20 GB, you should be fine. You should try to create a new index with a single shard and see how it goes.
I created this visualisation to try and understand it myself.
There are two levels of aggregation errors:
Whole Aggregation - shows you the potential value of a missing term
Term Level - indicates the potential inaccuracy in a returned term
The first gives a value for the aggregation as a whole which
represents the maximum potential document count for a term which did
not make it into the final list of terms.
and
The second shows an error value for each term returned by the
aggregation which represents the worst case error in the document
count and can be useful when deciding on a value for the shard_size
parameter. This is calculated by summing the document counts for the
last term returned by all shards which did not return the term.
You can see the term level error by setting:
"show_term_doc_count_error": true
While the Whole Aggregation Error is shown by default
Quotes from official docs
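The two error values can be illustrated with a simplified Python model of a sharded terms aggregation. It follows the description quoted above (the error contributed by a shard is the document count of the last term it returned), but it is a sketch rather than the actual Elasticsearch code, and the function names and sample data are invented for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def shard_top_terms(counts: Dict[str, int], shard_size: int) -> List[Tuple[str, int]]:
    """Top `shard_size` terms of one shard, by descending count (term as tiebreaker)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:shard_size]


def merge_terms(shards: List[Dict[str, int]], size: int,
                shard_size: int) -> Tuple[List[Tuple[str, int, int]], int]:
    """Return ([(term, doc_count, term_error)], whole_aggregation_error)."""
    per_shard = [shard_top_terms(c, shard_size) for c in shards]
    # A shard that returned a full list may have omitted terms with counts up to
    # that of its last returned term; a shorter list means nothing was cut off.
    last_counts = [top[-1][1] if len(top) == shard_size else 0 for top in per_shard]

    merged: Counter = Counter()
    for top in per_shard:
        for term, count in top:
            merged[term] += count

    results = []
    for term, count in sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:size]:
        # Term-level error: sum the last counts of shards that did not return this term.
        error = sum(last for top, last in zip(per_shard, last_counts)
                    if term not in dict(top))
        results.append((term, count, error))
    return results, sum(last_counts)


shard_a = {"a": 10, "b": 8, "c": 7}
shard_b = {"c": 9, "a": 6, "b": 1}
top_terms, upper_bound = merge_terms([shard_a, shard_b], size=2, shard_size=2)
```

With shard_size=2, shard A returns only "a" and "b", so the merged count for "c" is 9 with a term-level error of 8, even though its true count is 16. Raising shard_size to 3 makes every shard return all of its terms and brings both errors to zero in this example.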
Set shardSize to int.MaxValue; this will reduce errors in the count.