I have an elasticsearch query that returns me the correct results in sorted order (the highest relevancy is at the top and is accurate). However, the query also returns me a lot of results and beyond the top 4 or 5, the results seem less relevant.
My question is: how can I set a threshold so that only the most relevant results are returned by the query?
You can use the size parameter in your Elasticsearch query to cap the number of results returned. In your example, if you think only the top 5 results are relevant, set size to 5.
Note that Elasticsearch already sorts results by score, so size 5 returns the 5 most relevant documents.
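For illustration, here is a minimal sketch of such a request body built as a Python dict. The field name "title" and the helper name top_n_query are made up for the example; only the size key comes from the answer above.

```python
import json


def top_n_query(text, field="title", n=5):
    """Build a search body that returns only the n highest-scoring hits.

    `field` is a hypothetical field name; replace it with your own.
    """
    return {
        "size": n,  # cap on the number of hits returned
        "query": {"match": {field: text}},
    }


body = top_n_query("quick brown fox")
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the JSON body of a _search request with whatever client you already use.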
I recently upgraded from Elasticsearch 6 to 7 and stumbled across the 10000 hits limit.
I read the changelog and the documentation, and I also found a single blog post from a company that tried this new feature and measured their performance gains.
But I'm still not sure how and why this feature works. Or does it only improve performance under special circumstances?
Especially when sorting is involved, I can't get my head around it. Because (at least in my world) when sorting a collection you have to visit every document, and that's exactly what they are trying to avoid according to the Documentation: "Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents."
Hopefully someone can explain how things work under the hood and which important point I am missing.
There are at least two different contexts in which not all documents need to be sorted:
A. When index sorting is configured, the documents are already stored in sorted order within the index segment files. So whenever a query specifies the same sort as the one the index was pre-sorted with, only the top N documents of each segment file need to be visited and returned. In this case, if you are only interested in the top N results and you don't care about the total number of hits, you can simply set track_total_hits to false. That's a big optimization since there's no need to visit all the documents of the index.
B. When querying in the filter context (i.e. bool/filter) because no scores will be calculated. The index is simply checked for documents that match a yes/no question and that process is usually very fast. Since there is no scoring, only the top N matching documents are returned per shard.
If track_total_hits is set to false (because you don't care about the exact number of matching docs), then there's no need to count the docs at all, hence no need to visit all documents.
If track_total_hits is set to N (because you only care to know whether there are at least N matching documents), then the counting will stop after N documents per shard.
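As a rough sketch of the settings discussed above, the dicts below show an index sorted at creation time plus two search bodies, one with counting disabled and one with a threshold. The field names date_added and status and the helper search_body are assumptions for the example; index.sort.field, index.sort.order and track_total_hits are the actual setting and parameter names.

```python
# Index creation body: documents are stored pre-sorted by date_added (case A).
index_settings = {
    "settings": {
        "index": {
            "sort.field": "date_added",  # hypothetical field name
            "sort.order": "desc",
        }
    }
}


def search_body(query, size=10, track_total_hits=True):
    """track_total_hits may be True (exact count), False (no count),
    or an integer N (stop counting once N matches are found)."""
    return {"size": size, "track_total_hits": track_total_hits, "query": query}


# Case B: a pure filter-context query, so no scores are computed.
filter_query = {"bool": {"filter": [{"term": {"status": "active"}}]}}

no_count = search_body(filter_query, track_total_hits=False)
at_least_1000 = search_body(filter_query, track_total_hits=1000)

# Case A: same sort as the index sort, total hit count not needed.
sorted_top_n = dict(
    search_body({"match_all": {}}, track_total_hits=False),
    sort=[{"date_added": {"order": "desc"}}],
)
```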
Relevant links:
https://github.com/elastic/elasticsearch/pull/24864
https://github.com/elastic/elasticsearch/issues/33028
https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
I am always getting a high value for the doc_count_error_upper_bound attribute on an aggregation query in Elasticsearch. It's sometimes as high as 8000 or 9000 for an ES cluster with almost a billion documents indexed. When I run the query on an index of about 5M docs, I get a value of about 300 to 500.
The question is: how incorrect are my results? (I am running a top 20 count query based on the JSON below.)
{ "aggs":{ "group_by_creator":{ "terms":{ "field":"creator" } } } }
This is pretty well explained in the official documentation.
When running a terms aggregation, each shard will figure out its own top-20 list of terms and will then return their 20 top terms. The coordinating node will gather all those terms and reorder them to get the overall top-20 terms for all the shards.
If you have more than one shard, it's no surprise that there might be a non-zero error count as shown in the official doc example and there's a way to compute the doc count error.
With one shard per index, the doc count error will always be zero, but that might not always be feasible depending on your index topology, especially if you have almost one billion documents. For your index with 5M docs, though, if the documents are not too big, they could well be stored in a single shard. Of course, it depends a lot on your hardware, but if your shard size doesn't exceed 15-20 GB, you should be fine. You should try to create a new index with a single shard and see how it goes.
I created this visualisation to try and understand it myself.
There are two levels of aggregation errors:
Whole Aggregation - shows you the potential value of a missing term
Term Level - indicates the potential inaccuracy in a returned term
The first gives a value for the aggregation as a whole which represents the maximum potential document count for a term which did not make it into the final list of terms.
and
The second shows an error value for each term returned by the aggregation which represents the worst case error in the document count and can be useful when deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term.
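To make the two definitions concrete, here is a small pure-Python simulation. It is a simplified model of what the quotes describe, not Elasticsearch's actual code: each "shard" is just a dict of term counts, and the function name terms_agg and the sample data are invented for the example.

```python
from collections import Counter


def terms_agg(shards, size, shard_size):
    """Simulate a terms aggregation over several shards.

    shards: list of {term: doc_count} dicts, one per shard.
    Each shard returns only its top `shard_size` terms; the
    coordinating step merges them and keeps the top `size`.
    """
    merged = Counter()
    returned = []  # set of terms each shard returned
    last = []      # doc count of the last term each shard returned
    for counts in shards:
        top = Counter(counts).most_common(shard_size)
        returned.append({term for term, _ in top})
        # A shard that returned all of its terms cannot hide anything.
        last.append(top[-1][1] if len(counts) > shard_size else 0)
        merged.update(dict(top))

    buckets = []
    for term, count in merged.most_common(size):
        # Term-level error: sum of the last returned count on every
        # shard that did not return this term.
        error = sum(l for seen, l in zip(returned, last) if term not in seen)
        buckets.append(
            {"key": term, "doc_count": count, "doc_count_error_upper_bound": error}
        )
    # Whole-aggregation error: worst case for a term no shard returned.
    return {"doc_count_error_upper_bound": sum(last), "buckets": buckets}


shards = [
    {"a": 10, "b": 8, "c": 2, "d": 1},
    {"a": 9, "c": 7, "b": 3, "d": 1},
]
result = terms_agg(shards, size=2, shard_size=2)
exact = terms_agg(shards, size=2, shard_size=4)
```

With shard_size=2, term "b" is reported with 8 docs and an error bound of 7, because the second shard did not return it and its last returned term had 7 docs; the true count is 11. With shard_size=4 every shard returns all of its terms, so all error values drop to 0.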
You can see the term level error by setting:
"show_term_doc_count_error": true
While the Whole Aggregation Error is shown by default
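As a sketch, the aggregation from the question could be extended like this. The shard_size value of 200 is only an illustrative guess, not a recommendation, and the helper name creator_terms_agg is made up.

```python
def creator_terms_agg(size=20, shard_size=200):
    """Top-`size` creators, with the per-term error shown in the response.

    shard_size controls how many terms each shard sends back; a larger
    value lowers the error at the cost of more work per shard.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "group_by_creator": {
                "terms": {
                    "field": "creator",
                    "size": size,
                    "shard_size": shard_size,  # illustrative value
                    "show_term_doc_count_error": True,
                }
            }
        },
    }


agg_body = creator_terms_agg()
```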
Quotes from official docs
Setting shardSize to int.MaxValue will reduce the errors in the count.
Suppose I have an index for cars on a dealer's car lot. Each document resembles the following:
{
  "color": "red",
  "model_year": "2015",
  "date_added": "2015-07-20"
}
Suppose I have a million cars.
Suppose I want to present a view of the most recently added 1000 cars, along with facets over those 1000 cars.
I could just use from and size to paginate the results up to a fixed limit of 1000, but in doing so the totals and facets on model_year and color (i.e. aggregations) I get back from Elasticsearch aren't right--they're over the entire matched set.
How do I limit my search to the most recently added 1000 documents for pagination and aggregation?
As you probably saw in the documentation, the aggregations are performed on the scope of the query itself. If no query is given, the aggregations are performed on a match_all list of results. Even if you used size at the query level, it would still not give you what you need, because size only controls how many of the matched documents are returned. Aggregations operate on everything the query matches.
This feature request is not new; it has been asked for before.
In 1.7 there is no straightforward solution. Maybe you can use the limit filter or the terminate_after in-body request parameter, but neither returns the top documents after sorting. You get the first terminate_after docs that matched the query, and this number is per shard. The cut-off is not applied after the sort.
In ES 2.0 there is also the sampler aggregation, which works more or less the same way as terminate_after, except that it takes the score of the documents into account when choosing which ones to consider from each shard. If you just sort on date_added and the query is just a match_all, all the documents will have the same score and the sampler will return an irrelevant set of documents.
In conclusion:
there is no good solution for this; there are only workarounds based on a number of docs per shard. So, if you want 1000 cars, you need to divide this number by the number of primary shards, use the result in a sampler aggregation or with terminate_after, and get a set of documents
my suggestion is to use a query to limit the number of documents (cars) by a different criterion instead. For example, show (and aggregate) the cars added in the last 30 days or something similar. In other words, the criterion should be part of the query itself, so that the resulting set of documents is exactly the one you want aggregated. Applying aggregations to a certain number of documents, after they have been sorted, is not easy.
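The "last 30 days" suggestion could look roughly like the sketch below. The field names come from the example document in the question; the aggregation names, the paging defaults and the helper name recent_cars_body are assumptions.

```python
def recent_cars_body(days=30, page=0, page_size=20):
    """Cars added in the last `days` days, newest first, with facets.

    Because the date restriction is part of the query, the aggregations
    are computed over exactly the same set that is being paginated.
    """
    return {
        "from": page * page_size,
        "size": page_size,
        "query": {"range": {"date_added": {"gte": f"now-{days}d/d"}}},
        "sort": [{"date_added": {"order": "desc"}}],
        "aggs": {
            "colors": {"terms": {"field": "color"}},
            "model_years": {"terms": {"field": "model_year"}},
        },
    }


first_page = recent_cars_body()
third_page = recent_cars_body(page=2)
```

Only from changes between pages; the query and aggs stay the same, so the facet counts stay consistent while paging.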
The following query https://gist.github.com/anonymous/be27203a578494566a35 gives the following result set https://gist.github.com/anonymous/6935100dbf76b9a8f3e3. The documents have been indexed with these settings https://gist.github.com/anonymous/ca42a7f67c7281935950.
As you can see, the queryNorm value for the documents in the result set varies. But according to the documentation (taken from http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/practical-scoring-function.html#query-norm):
The same query normalization factor is applied to every document and you have no way of changing it. For all intents and purposes, it can be ignored.
Unfortunately, since this does not seem to be true (or maybe I have misunderstood something), I do not get the desired result set for the query above. More specifically, I would expect the second document to have a higher relevance than the first, since there is a higher boosting factor if the query matches the "name" field compared to the "subtype" field. But, because the queryNorm factor is lower for the second document, the relevance score gets in total lower.
Why does the queryNorm behave this way?
Is there really no way of disabling it (i.e. setting the factor to 1)?
I am running version 1.4.0 of Elasticsearch.
ElasticSearch builds the aggregation results based on all the hits of the query, independently of the from and size parameters. This is what we want in most cases, but I have a particular case in which I need to limit the aggregation to the top N hits. The limit filter is not suitable, as it does not fetch the best N items but only the first X matching the query (per shard), independently of their score.
Is there any way to build a query whose hit count has an upper limit N in order to be able to build an aggregation limited to those top N results? And if so how?
Subsidiary question: Limiting the score of matching documents could be an alternative even though in my case I would require a fixed bound. Does the min_score parameter affect aggregation?
You are looking for Sampler Aggregation.
I have a similar answer explained here
Optionally, you can use the field or script and max_docs_per_value settings to control the maximum number of documents collected on any one shard which share a common value.
If you are using an ElasticSearch cluster with version > 1.3, you can use top_hits aggregation by nesting it in your aggregation, ordering on the field you want and set the size parameter to X.
The related documentation can be found here.
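A minimal sketch of that nesting, with a terms aggregation as the parent and top_hits inside it, is shown below. The aggregation names by_group and top_docs, the field names in the usage line and the helper name are all placeholders.

```python
def top_hits_per_bucket(group_field, sort_field, x=3):
    """For each bucket of group_field, return the x best docs by sort_field."""
    return {
        "size": 0,  # the hits come back inside the aggregation instead
        "aggs": {
            "by_group": {
                "terms": {"field": group_field},
                "aggs": {
                    "top_docs": {
                        "top_hits": {
                            "size": x,
                            "sort": [{sort_field: {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


top_hits_body = top_hits_per_bucket("color", "date_added", x=5)
```

Note that top_hits returns the top X documents per bucket; it does not restrict other aggregations to those documents.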
I need to limit the aggregation to the top N hits
With nested aggregations, your top bucket can represent those N hits, with nested aggregations operating on that bucket. I would try a filter aggregation for the top level aggregation.
The tricky part is to make use of the _score somehow in the filter and to limit it to exactly N entries... There is a limit filter that works per shard, but I don't think it would work in this context.
It looks like Sampler Aggregation can now be used for this purpose. Note that it is only available as of Elastic 2.0.
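A hedged sketch of what a sampler-based request might look like follows. The query, the field name tags and the aggregation names are placeholders. Keep in mind that shard_size is applied per shard, so the total sample is at most shard_size times the number of shards rather than an exact N.

```python
def sampled_terms_body(query, field, shard_size=200):
    """Run a terms aggregation only over the top-scoring docs of each shard.

    shard_size is the number of top-scoring documents sampled per shard.
    """
    return {
        "size": 0,
        "query": query,
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "top_values": {"terms": {"field": field}},
                },
            }
        },
    }


sampler_body = sampled_terms_body(
    {"match": {"title": "elasticsearch"}},  # hypothetical query
    field="tags",                           # hypothetical field
    shard_size=100,
)
```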