Elasticsearch fuzzy query - max_expansions - elasticsearch

I am using Elasticsearch 5+ and I have run some queries using fuzzy.
I understand the following fuzzy parameters: fuzziness and prefix_length.
But I cannot understand "max_expansions". I have read many articles, but it is hard for me because there are few examples of it.
Can you explain this parameter with examples? How does it work together with the fuzziness parameter?
Here is an example. I ran this query:
GET my-index/my-type/_search
{
  "query": {
    "fuzzy": {
      "my-field": {
        "value": "house",
        "fuzziness": 1,
        "prefix_length": 0,
        "max_expansions": 1
      }
    }
  }
}
I have 4 shards, and my query found 6 results, because there are 6 documents with "hous" in "my-field".
If max_expansions works like a limit in a database, shouldn't the maximum be 4 results (because I have 4 shards)? Why does it return 6 results?

A quote from an Elasticsearch blog post:
The max_expansions setting, which defines the maximum number of terms the fuzzy query will match before halting the search, can also have dramatic effects on the performance of a fuzzy query. Cutting down the query terms has a negative effect, however, in that some valid results may not be found due to early termination of the query. It is important to understand that the max_expansions query limit works at the shard level, meaning that even if set to 1, multiple terms may match, all coming from different shards. This behavior can make it seem as if max_expansions is not in effect, so beware that counting the unique terms that are returned is not a valid way to determine if max_expansions is working.
Basically, it means that under the hood, when Elasticsearch runs a fuzzy query, it limits the number of terms considered in the search to max_expansions. As the quote says, this is not as obvious as, for example, a limit in a database, because in Elasticsearch the limit is applied per shard. You will probably see results closer to what you expect if you set up Elasticsearch locally with only one shard and test the behavior there.
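To check this locally, here is a minimal sketch, assuming a throwaway single-shard test index (my-index-test is a hypothetical name):
PUT my-index-test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}

GET my-index-test/_search
{
  "query": {
    "fuzzy": {
      "my-field": {
        "value": "house",
        "fuzziness": 1,
        "prefix_length": 0,
        "max_expansions": 1
      }
    }
  }
}
With a single shard, at most one expanded term is considered for the whole index, so the hit count reflects only the documents containing that one term, which is much closer to the database-style limit you had in mind.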

Related

How to boost most popular (top quartile) in elasticsearch query results (outliers)

I have an elasticsearch query that includes bool - must / should sections that I have refined to match search terms and boost for terms in priority fields, phrase match, etc.
I would like to boost documents that are the most popular. The documents include a field "popularity" that indicates the number of times the document was viewed.
Preferably, I would like to boost any documents in the result set that are outliers - meaning that the popularity score is perhaps 2 standard deviations from the average in the result set.
I see aggregations but I'm interested in boosting results in a query, not a report/dashboard.
I also noted the new rank_feature query in ES 7 (I am still on 6.8 but could upgrade). It looks like the rank_feature query looks across all documents, not the result set.
Is there a way to do this?
I think that you want to use a rank_feature or a range query in a "rescore" query.
If your need is too specific for classical queries, you can use a "function_score" query in your rescore and use a script to write your own score calculation:
https://www.elastic.co/guide/en/elasticsearch/reference/7.9/filter-search-results.html
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-rescore.html
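For illustration, a minimal sketch of such a rescore using function_score with field_value_factor on the popularity field (the index name, base query, window_size, and weights are placeholders to adapt to your setup):
POST my-index/_search
{
  "query": {
    "match": { "title": "search terms" }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "functions": [
            {
              "field_value_factor": {
                "field": "popularity",
                "modifier": "log1p",
                "missing": 0
              }
            }
          ]
        }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 1.5
    }
  }
}
If you specifically want the two-standard-deviations cutoff, one option is to first run an extended_stats aggregation on popularity to get the mean and standard deviation, then plug the resulting threshold into a range query or a script inside the rescore.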

Speeding up elasticsearch more_like_this query

I am interested in fetching similar documents for a given input document (similar to kNN). Vectorizing documents of very different sizes (using doc2vec) would result in inconsistent document vectors, and computing a vector for the user's input (which may be just a few terms or sentences, compared to the documents the doc2vec model was trained on, each consisting of hundreds or thousands of words) would make a k-nearest-neighbours search produce incorrect results due to a lack of features.
Hence, I went ahead with the more_like_this query, which does a similar job to kNN irrespective of the size of the user's input, since I'm only interested in analyzing text fields.
But I was concerned about performance when I have millions of documents indexed in Elasticsearch. The documentation says that using term_vector to store the term vectors at index time can speed up the analysis.
What I don't understand is which type of term vector the documentation refers to in this context, as there are three different types: term information, term statistics, and field statistics.
Since term statistics and field statistics compute the frequency of terms with respect to other documents in the index, wouldn't these vectors become outdated when I introduce new documents into the index?
Hence I presume that the more_like_this documentation refers to term information (the information about the terms in one particular document, irrespective of the others).
Can anyone let me know if computing only the term information vector at index time is sufficient to speed up more_like_this?
There shouldn't be any worries about term vectors becoming outdated, since they are stored per document, so they are updated accordingly.
For More Like This it is enough to have term_vector: yes; you don't need offsets and positions. So, if you don't plan on using highlighting, plain yes is all you need.
So, for your text field, you would need a mapping like this, and it will be enough to speed up MLT execution:
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "yes"
      }
    }
  }
}
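As a usage sketch, a more_like_this query against that field could then look like this (the index name, input text, and frequency thresholds are assumptions to tune for your data):
GET my-index/_search
{
  "query": {
    "more_like_this": {
      "fields": ["text"],
      "like": "a few sentences of user input to find similar documents for",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}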

How to find if a document is a good match for a query, e.g., normalize elasticsearch score?

The score computed by Elasticsearch provides a ranking between the documents, but it does not tell whether the documents are a good match for the request. Currently, the first document can either match on all fields or just one. The only information the score provides is that it is the best match.
Would it be possible to get a score normalized with respect to the query? For example, a score of 1 would be a document matching the query perfectly and a score of 0.1 a document matching it poorly.
In short, no, it is not possible to get a real normalized score for a query, but it is possible to get a good enough score normalization that works in many cases.
The problem with getting a score that tells whether a document is a good match for a query is finding what the best possible document for that query would be, and consequently the maximum score. With Elasticsearch and most (if not all) similarity metrics, the maximum score is not bounded.
Even with a simple match query, you can technically reach an arbitrarily high score with a document that repeats the queried term an arbitrary number of times. Without a bound on the score, it is not possible to get a true normalized score.
But all hope is not lost. Instead of normalizing against the best possible score, you can normalize against a fake ideal document which is supposed to get the maximum score. For example, if you are querying two fields, name and occupation, with the queried terms Jane Doe and Cook, your ideal document can be:
{
  "name": "Jane Doe",
  "occupation": "Cook"
}
If the index contains a document with, for example, the name Jane Jane Doe, then the ideal document may not get the maximum score. If the queried fields are relatively short, you probably do not have to worry about term duplication. If you have fields with many terms, you may decide to duplicate some frequent terms in the ideal document. If the objective is just to find whether a document is a good match or not, it is usually not a problem to have a document scored higher than the ideal document.
The good news is that if you are using at least Elasticsearch 6.4, you do not have to index the fake document to get its score for a query. You can use the _scripts/painless/_execute endpoint to obtain the score of the ideal document:
GET _scripts/painless/_execute
{
  "script": {
    "source": "_score"
  },
  "context": "score",
  "context_setup": {
    "index": <INDEX>,
    "document": <THE_IDEAL_DOCUMENT>,
    "query": <YOUR_QUERY>
  }
}
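For example, with the Jane Doe ideal document from above, the request could look like this (the index name and the multi_match query are stand-ins for your real index and query):
GET _scripts/painless/_execute
{
  "script": {
    "source": "_score"
  },
  "context": "score",
  "context_setup": {
    "index": "people",
    "document": {
      "name": "Jane Doe",
      "occupation": "Cook"
    },
    "query": {
      "multi_match": {
        "query": "Jane Doe Cook",
        "fields": ["name", "occupation"]
      }
    }
  }
}
Dividing the score of each real hit by the score returned here gives the approximate normalization described above.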
Please note that the field statistics of the fake document, such as the number of documents containing the field and the number of documents containing the queried term, are taken into account when computing the score. If you have many documents this should not be a problem, but for a very infrequent field or term (say, appearing in fewer than 20 documents) you may notice a lower score for the ideal document compared to a previously indexed document.

What is the significance of doc_count_error_upper_bound in elasticsearch and how can it be minimized?

I am always getting a high value for the doc_count_error_upper_bound attribute on an aggregation query in Elasticsearch. It is sometimes as high as 8000 or 9000 for an ES cluster with almost a billion documents indexed. When I run the query on an index of about 5M docs, the value is about 300 to 500.
The question is: how incorrect are my results? (I am running a top-20 count query based on the JSON below.)
"aggs":{ "group_by_creator":{ "terms":{ "field":"creator" } } } }
This is pretty well explained in the official documentation.
When running a terms aggregation, each shard figures out its own top-20 list of terms and then returns those 20 top terms. The coordinating node gathers all those terms and reorders them to get the overall top-20 terms across all the shards.
If you have more than one shard, it's no surprise that there might be a non-zero error count as shown in the official doc example and there's a way to compute the doc count error.
With one shard per index, the doc error count will always be zero, but that might not always be feasible depending on your index topology, especially if you have almost one billion documents. For your index with 5M docs, though, if they are not too big, they could well be stored in a single shard. Of course, it depends a lot on your hardware, but as long as your shard size doesn't exceed 15-20 GB, you should be fine. You should try to create a new index with a single shard and see how it goes.
I created this visualisation to try and understand it myself.
There are two levels of aggregation errors:
Whole Aggregation - shows you the potential value of a missing term
Term Level - indicates the potential inaccuracy in a returned term
The first gives a value for the aggregation as a whole which represents the maximum potential document count for a term which did not make it into the final list of terms.
and
The second shows an error value for each term returned by the aggregation which represents the worst case error in the document count and can be useful when deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term.
You can see the term level error by setting:
"show_term_doc_count_error": true
while the whole-aggregation error is shown by default.
Quotes from official docs
Setting shard_size to a very large value (e.g. Integer.MAX_VALUE) will reduce the errors in the counts.
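As a sketch combining both suggestions, the following request exposes the per-term error and raises shard_size to shrink it (the shard_size of 500 is only an illustrative trade-off between accuracy and cost):
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "group_by_creator": {
      "terms": {
        "field": "creator",
        "size": 20,
        "shard_size": 500,
        "show_term_doc_count_error": true
      }
    }
  }
}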

Terms aggregation performance high cardinality

I'm hoping for some clarification on query performance I observed recently in Elasticsearch 1.5.2.
I have a string field with high cardinality (approximately 200,000,000 distinct values).
I observed that if I use a simple terms aggregation with the execution hint global_ordinals_low_cardinality, two things happen:
1. The query returns the same results as with global_ordinals or global_ordinals_hash.
2. The query performs significantly faster (about twice as fast as global_ordinals, and 4 times as fast as global_ordinals_hash).
Here's the query:
{
  "aggs": {
    "myTerms": {
      "terms": {
        "field": "myField",
        "size": 1000,
        "shard_size": 1000,
        "execution_hint": "global_ordinals_low_cardinality"
      }
    }
  }
}
I don't understand why it's even legitimate to use global_ordinals_low_cardinality in this case, because my field has a high cardinality. So perhaps I don't understand what exactly global_ordinals_low_cardinality means?
Secondly, I have another numerical field (long) with roughly the same cardinality.
The values of the long field are actually precomputed hash values (murmur3) of the same string field from above, which I use to greatly speed up the cardinality aggregation.
Running the same terms aggregation on the numerical field performs as badly as global_ordinals_hash.
In fact, it doesn't matter which execution hint I use; the execution time remains the same.
So why is global_ordinals_low_cardinality applicable for string types but not for long types? Is it because numerical fields do not require global ordinals at all?
Thanks
I think both the official documentation and the source code shed some light on this. First off, one thing that needs to be mentioned is that execution_hint is exactly what its name says, i.e. just a hint that ES will try to honor, but might not in all cases if it deems it inappropriate.
So, the mere fact that you have a high-cardinality field precludes the use of global_ordinals_low_cardinality, since:
global_ordinals_low_cardinality is only enabled by default on low-cardinality fields.
As for global_ordinals_hash, it is mainly used in inner terms aggregations (not the case here), and map is only used when running an aggregation on scripts and few documents match the query (not the case here either).
That leaves you with one choice, i.e. global_ordinals, which is the default hint used on top-level terms aggregations.
As mentioned earlier, execution_hint is just a hint you specify, but ES will try its best to pick the right execution mode for your data no matter what. Looking into the source code sheds some light on a few things:
Starting here, you'll see that:
on line 201, your hint is read, and then might get overridden if the field doesn't support global ordinals (lines 205-241)
if your field is numeric, execution_hint is completely ignored, and this should probably answer your question.
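So, as a practical sketch, you can simply drop the hint on the string field and let ES use its default global_ordinals; for the numeric murmur3 field the hint is ignored anyway:
{
  "aggs": {
    "myTerms": {
      "terms": {
        "field": "myField",
        "size": 1000,
        "shard_size": 1000
      }
    }
  }
}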
