I intend to use the terminate_after feature of Elasticsearch in order to reduce the result set.
The question is: are the documents retrieved when using terminate_after ranked among the complete set of matching documents, or only among the reduced set that is returned?
terminate_after limits the number of search hits collected per shard, so any document that would have matched later could also have ranked higher (scored higher) than the highest-ranked document returned, since the score used for ranking is computed independently of the other hits.
So yes, the documents are ranked only within the returned result set, but this does not affect how each document's score was calculated, which takes all the documents into account.
Wanting a reduced result set while also wanting it ranked against all the hits that might have occurred is a contradiction in itself.
terminate_after is generally used for filter-type queries, where all returned documents have the same score, so ranking doesn't matter.
For match-type queries, ES uses pagination, so it's already quite efficient and you don't really need to restrict the document set anyway.
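For illustration, here's a minimal sketch of terminate_after on a filter-context query (the index and field names are hypothetical):

GET /products/_search
{
  "terminate_after": 1000,
  "size": 100,
  "query": {
    "bool": {
      "filter": {
        "term": { "color": "red" }
      }
    }
  }
}

Each shard stops collecting after 1000 matching documents; since filter-context matches all score the same, any 100 of them are as good as any other, and the response indicates terminated_early: true.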
Related
Based on this article (link), there are serious performance implications to having the track_total_hits property set to true.
We currently use it to get the number of matching documents after a user searches. The user can then paginate through the results. The number of documents for such a search usually ranges from 10k to 5M.
Example of a user workflow:
The user performs a search which matches 150,000 documents.
We show them the first 200 results, which they can scroll through, but we also show the total number of documents found by the search.
Since we always show the number of matching documents, and those numbers can often be quite high, we need some way to get that count. I'm not sure, but since we almost always perform paginated searches, I would assume a lot of the data would already be in memory? Maybe this actually affects us less than the article suggests?
An approximation rather than an exact count would be fine for us if it improved performance.
Is there an option in Elasticsearch to get an approximate count on search requests?
There is no option to get an approximate count, but you may want to consider assigning track_total_hits a lower bound instead of true, which is a good compromise from a performance standpoint (https://www.elastic.co/guide/en/elasticsearch/reference/master/search-your-data.html#track-total-hits).
That way, you can show users that there are at least k results, but there could be more.
Also, try using search_after (if you are not using it already) for pagination.
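A minimal sketch combining both suggestions (the index, field names, and values are hypothetical). First page, with the total capped at a lower bound of 10,000:

GET /my-index/_search
{
  "track_total_hits": 10000,
  "size": 200,
  "query": { "match": { "title": "example" } },
  "sort": [
    { "created_at": "desc" },
    { "id": "asc" }
  ]
}

If more than 10,000 documents match, the response reports the total as {"value": 10000, "relation": "gte"}, which you can render as "10,000+ results". For the next page, pass the sort values of the last hit via search_after (a unique tiebreaker field such as the assumed id is required for stable paging):

GET /my-index/_search
{
  "track_total_hits": false,
  "size": 200,
  "query": { "match": { "title": "example" } },
  "sort": [
    { "created_at": "desc" },
    { "id": "asc" }
  ],
  "search_after": ["2024-01-01T00:00:00Z", 199]
}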
I recently upgraded from Elasticsearch 6 to 7 and stumbled across the 10,000 hits limit.
I read the changelog and the documentation, and I also found a single blog post from a company that tried this new feature and measured their performance gains.
But I'm still not sure how and why this feature works. Or does it only improve performance under special circumstances?
Especially when sorting is involved, I can't get my head around it, because (at least in my world) sorting a collection requires visiting every document, and that's exactly what they are trying to avoid according to the documentation: "Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents."
Hopefully someone can explain how things work under the hood and which important point I am missing.
There are at least two different contexts in which not all documents need to be visited:
A. When index sorting is configured, the documents are already stored in sorted order within the index segment files. So whenever a query specifies the same sort as the one the index was pre-sorted by, only the top N documents of each segment file need to be visited and returned. In that case, if you are only interested in the top N results and you don't care about the total number of hits, you can simply set track_total_hits to false. That's a big optimization, since there's no need to visit all the documents of the index.
B. When querying in the filter context (i.e. bool/filter), no scores are calculated. The index is simply checked for documents that match a yes/no question, and that process is usually very fast. Since there is no scoring, only the top N matching documents are returned per shard.
If track_total_hits is set to false (because you don't care about the exact number of matching docs), then there's no need to count the docs at all, hence no need to visit all documents.
If track_total_hits is set to N (because you only care to know whether there are at least N matching documents), then the counting will stop after N documents per shard.
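To make context A concrete, here's a minimal sketch (the index and field names are hypothetical). The index is pre-sorted by timestamp descending, so a query with the same sort and track_total_hits set to false can return the top 10 documents after visiting only the first few documents of each segment:

PUT /events
{
  "settings": {
    "index.sort.field": "timestamp",
    "index.sort.order": "desc"
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}

GET /events/_search
{
  "track_total_hits": false,
  "size": 10,
  "sort": [ { "timestamp": "desc" } ]
}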
Relevant links:
https://github.com/elastic/elasticsearch/pull/24864
https://github.com/elastic/elasticsearch/issues/33028
https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
I'm not sure if I've understood the Term Vectors API correctly.
The document starts by saying:
Returns information and statistics on terms in the fields of a particular document. The document could be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting realtime parameter to false.
I'm guessing that term here refers to what some other people would call a token? Or was term defined by this point in the documentation and I missed it?
Then the documentation continues by saying there are three sections to the return value: term information, term statistics, and field statistics. I guess that means term information and statistics are not the only things this API returns, correct?
Then term information includes a field called payloads, which is not defined, and I have no idea what it means.
Then in field statistics, there are sum of document frequencies and sum of total term frequencies, with a rather confusing explanation:
Setting field_statistics to false (default is true) will omit:
document count (how many documents contain this field)
sum of document frequencies (the sum of document frequencies for all terms in this field)
sum of total term frequencies (the sum of total term frequencies of each term in this field)
I guess they are simply the sums of their corresponding values reported in the term statistics?
Then in the section Behavior it says:
The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected. Use routing only to hit a particular shard.
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
I'm guessing that term here refers to what some other people would call a token? Or was term defined by this point in the documentation and I missed it?
term and token are synonyms here and simply mean whatever came out of the analysis process and was indexed in the Lucene inverted index.
Then the documentation continues by saying there are three sections to the return value: term information, term statistics, and field statistics. I guess that means term information and statistics are not the only things this API returns, correct?
By default, the call returns term information and field statistics, but term statistics have to be requested explicitly with &term_statistics=true.
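For example (the index name, document id, and field name are hypothetical):

GET /my-index/_termvectors/1?fields=text&term_statistics=true

Without term_statistics=true, the response contains only the term information and the field statistics for the text field of document 1.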
Then term information includes a field called payloads, which is not defined, and I have no idea what it means.
payload is a Lucene concept, which is pretty well explained here. Term payloads are not available unless you have a custom analyzer that makes use of a delimited_payload token filter to extract them.
Then in field statistics, there are sum of document frequencies and sum of total term frequencies, with a rather confusing explanation:
[...]
I guess they are simply the sum over their corresponding values reported in term statistics?
The sum of "document frequencies" is the number of times each term present in the field appears in the same document. So if the field contains "big brown fox", it will count the number of times "big" appears in the same document, the number of times "brown" appears in the same document and the same for "fox".
The sum of "total term frequencies" is the number of times each term present in this field appears in all documents present in the Lucene index (which is located on a single shard of an ES index). So if the field contains "big brown fox", it will count the number of times "big" appears in all documents, the number of times "brown" appears in all documents and the same for "fox".
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
It is realtime by default, which means a refresh is performed when issuing the _termvectors call, in order to get fresh information from the Lucene index. However, the statistics are gathered from a single shard only, which does not give an overall view of the statistics of the whole ES index (potentially made of several shards, hence several Lucene indexes).
Imagine I have two kinds of records: a bucket and an item, where an item is contained in a bucket, and a bucket may have a relatively small number of items (normally no more than 4, never more than 10). Those records are squashed into one (an item with extra bucket information) and placed inside Elasticsearch.
The task I am trying to solve is to find up to 500 buckets with all their related items at once, using a filtered query on the items' attributes, and I'm stuck on limiting/offsetting aggregations. How do I perform such a task? I see the top_hits aggregation, which lets me control the number of related items returned, but I can't find a clue as to how to control the number of returned buckets.
Update: okay, I'm terribly stupid. The size parameter of the terms aggregation gives me the limiting. Is there any way to perform the offset part? I don't need 100% precision and probably won't ever page these results, but I'd still like to see this functionality.
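For reference, a minimal sketch of the terms + top_hits combination described above (the index, field names, and values are hypothetical):

GET /items/_search
{
  "size": 0,
  "query": { "term": { "item_attribute": "some-value" } },
  "aggs": {
    "buckets": {
      "terms": { "field": "bucket_id", "size": 500 },
      "aggs": {
        "related_items": {
          "top_hits": { "size": 10 }
        }
      }
    }
  }
}

The terms size caps the number of buckets at 500, and top_hits returns up to 10 items per bucket; there is no offset parameter at the aggregation level.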
I don't think we'll be seeing this feature any time soon; see the relevant discussion on GitHub:
Paging is tricky to implement because document counts for terms aggregations are not exact when shard_size is less than the field cardinality and sorting on count desc. So weird things may happen, like the first term of the 2nd page having a higher count than the last element of the first page, etc.
An interesting approach is also mentioned there: you could request, say, the top 20 results on the 1st page; then on the 2nd page you run the same aggregation but exclude the 20 terms you already saw on the previous page, and so forth (see the sketch after the next quote). But this doesn't allow "random" access to an arbitrary page; you must go through the pages in order.
...if you only have a limited number of unique values compared to the number of matched documents, doing the paging on the client side would be more efficient. On the other hand, on high-cardinality fields, your first approach based on an exclude would probably be better.
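A minimal sketch of the exclude-based approach for the second page (the field name and the already-seen keys are hypothetical):

GET /items/_search
{
  "size": 0,
  "aggs": {
    "buckets_page_2": {
      "terms": {
        "field": "bucket_id",
        "size": 20,
        "exclude": ["bucket-03", "bucket-17", "bucket-42"]
      }
    }
  }
}

The exclude array would contain all 20 keys returned on page 1, so the next 20 terms are returned instead.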
Is it possible to implement reliable paging of Elasticsearch search results if multiple documents have equal scores?
I'm experimenting with custom scoring in Elasticsearch. Many of the scoring expressions I try yield result sets where many documents have equal scores. They seem to come back in the same order each time I try, but can that be guaranteed?
AFAIU it can't, especially not if there is more than one shard in the cluster. Documents with equal scores with respect to a given Elasticsearch query are returned in an arbitrary, non-deterministic order that can change between invocations of the same query, even if the underlying index does not change (and therefore paging is unreliable), unless one of the following holds:
I use function_score to guarantee that the score is unique for each document (e.g. by using a unique number field).
I use sort and guarantee that the sorting defines a total order (e.g. by using a unique field as fallback if everything else is equal).
Can anyone confirm (and maybe point at some reference)?
Does this change if I know that there is only one primary shard without any replicas (see this other, similar question: Inconsistent ordering of results across primary/replica for documents with equivalent score)? E.g., if I guarantee that there is one shard AND there is no change in the database between two invocations of the same query, will that query return results in the same order?
What are other alternatives (if any)?
I ended up using an additional sort in cases where equal scores are likely to happen, for example when searching by product category. This additional sort key could be the id, the creation date, or similar. The setup is 2 servers, 3 shards, and 1 replica.
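A minimal sketch of such a query with a unique tiebreaker (the index and field names are hypothetical):

GET /products/_search
{
  "query": { "match": { "category": "shoes" } },
  "sort": [
    "_score",
    { "created_at": "desc" },
    { "product_id": "asc" }
  ]
}

Because product_id is assumed unique, the sort defines a total order, so paging with from/size or search_after returns stable results as long as the index does not change between requests.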