How to normalize ElasticSearch scores? - elasticsearch

For my project I need to find out which results of a search are considered "good" matches. Currently, the scores vary wildly depending on the query, hence the need to normalize them somehow. Normalizing the scores would allow selecting the results above a given threshold.
I found a couple of solutions for Lucene:
how do I normalise a solr/lucene score?
http://wiki.apache.org/lucene-java/ScoresAsPercentages
How would I go ahead and apply the same technique to ElasticSearch? Or perhaps there is already a solution that works with ES for score normalization?

As far as I have searched, there is no way to get a normalized score out of Elasticsearch. You will have to work around it by making two queries. The first is a pilot query (preferably with size 1, but all other attributes the same); it fetches you the max_score. Then you run your actual query and use function_score to normalize the scores: pass the max_score you got from the pilot query in params to function_score and use it to normalize every score. Refer: This article snippet
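A minimal sketch of the second query in this two-pass approach, assuming a hypothetical index my_index, a hypothetical match query, and a max_score of 12.34 obtained from the pilot query; script_score and boost_mode are standard function_score options:

```json
POST /my_index/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "my search terms" } },
      "script_score": {
        "script": {
          "source": "_score / params.max_score",
          "params": { "max_score": 12.34 }
        }
      },
      "boost_mode": "replace"
    }
  }
}
```

With boost_mode set to "replace", each hit's final score is just the script result, so all scores land in the 0..1 range relative to the pilot query's maximum.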

It's a bit late.
We needed to normalise the ES score for one of our use cases, so we wrote a plugin that overrides the ES Rescorer feature.
It supports min-max and z-score normalization.
Github: https://github.com/bkatwal/elasticsearch-score-normalizer
Usage:
Min-max
{
  "query": {
    ... some query
  },
  "from" : 0,
  "size" : 50,
  "rescore" : {
    "score_normalizer" : {
      "normalizer_type" : "min_max",
      "min_score" : 1,
      "max_score" : 10
    }
  }
}
Usage z-score:
{
  "query": {
    ... some query
  },
  "from" : 0,
  "size" : 50,
  "rescore" : {
    "score_normalizer" : {
      "normalizer_type" : "z_score",
      "min_score" : 1,
      "factor" : 0.6,
      "factor_mode" : "increase_by_percent"
    }
  }
}
For complete documentation check the Github repository.

Related

Is there a difference between using search terms and should when querying Elasticsearch

I am performing a refactor of the code to query an ES index, and I was wondering if there is any difference between the two snippets below:
"bool" : {
  "should" : [
    { "terms" : { "myType" : [ 1 ] } },
    { "terms" : { "myType" : [ 2 ] } },
    { "terms" : { "myType" : [ 4 ] } }
  ]
}
and
"terms" : {
  "myType" : [ 1, 2, 4 ]
}
Please check this blog post from the Elastic discuss page, which answers your question. Copying it here for quick reference:
There are a few differences.
The simplest to see is the verbosity: a terms query just lists an array, while multiple term queries require more JSON.
terms queries do not score matches based on the IDF (rareness) of the matched terms; the term query does.
term queries can only have up to 1024 values, due to the bool query's maximum clause count.
terms queries can have more terms. By default, Elasticsearch limits the terms query to a maximum of 65,536 terms. You can change this limit using the index.max_terms_count setting.
Which of them is going to be faster? Is speed also related to the number of terms?
It depends; they execute differently. term queries do more expensive scoring, but do so lazily. They may "skip" over docs during execution, because other, more selective criteria may advance the stream of matching docs considered.
The terms query doesn't do expensive scoring, but is more eager: it creates the equivalent of a single bitset, with a one or zero for every doc, by ORing all the potential matching docs up front. Many terms can share the same bitset, which is what provides the scalability in term numbers.
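As a side note on the 65,536-term limit mentioned above: index.max_terms_count is a dynamic index setting, so (if I'm not mistaken) it can be raised on an existing index without reindexing; the index name here is hypothetical:

```json
PUT /my_index/_settings
{
  "index": { "max_terms_count": 100000 }
}
```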

What is the difference between `constant_score + filter` and `term` query?

I have two queries in Elasticsearch:
{
  "term" : {
    "price" : 20
  }
}
and
"constant_score" : {
  "filter" : {
    "term" : {
      "price" : 20
    }
  }
}
They return the same query results, and I wonder what the main difference between them is. I have read some articles about scoring documents, and I believe both queries score documents: constant_score will use the default score of 1.0 for each matching document. So I don't see much difference between the two.
The results would be exactly the same.
However, the biggest difference is that the constant_score/filter version will cache the results of the term query, since it is run in a filter context; all future executions will leverage that cache. Also, one feature of the constant_score query is that the returned score is always equal to the given boost value (which defaults to 1).
The first query is run outside of the filter context and hence does not benefit from the filter cache.
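To see the constant scoring directly, you can set an explicit boost on the constant_score query; every matching document is then returned with exactly that score (field and value as in the question):

```json
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "price": 20 }
      },
      "boost": 1.2
    }
  }
}
```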

How is Elastic Search sorting when no sort option specified and no search query specified

I wonder how Elasticsearch sorts (on what field) when no search query is specified (I just filter on documents) and no sort option is specified. It looks like the sorting is then random... The default sort order is _score, but the score is always 1 when you do not specify a search query...
You got it right. It's then more or less random, with the score being 1. You still get consistent results, as far as I remember. It's the same as getting results in SQL without specifying ORDER BY.
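If you need a deterministic order when you only filter, add an explicit sort instead of relying on _score; a sketch, assuming a filter like the one discussed in this thread and a hypothetical numeric price field:

```json
{
  "query": {
    "bool": {
      "filter": { "term": { "maker_name": "nike" } }
    }
  },
  "sort": [
    { "price": { "order": "asc" } }
  ]
}
```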
Just in case someone sees this post, even though it was posted over 6 years ago...
If you want to know how Elasticsearch calculates its score, known as _score, you can use the explain option.
I suppose your query (with a filter and without a search) might look more or less like this (the point is setting the explain option to true):
POST /goods/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "maker_name": "nike"
        }
      }
    }
  }
}
As you run this, you will notice that the _explanation of each hit reads as below:
"_explanation" : {
  "value" : 1.0,
  "description" : "ConstantScore(maker_name:nike)",
  "details" : [ ]
}
which means ES gave a constant score to all of the hits.
So, to answer the question: "yes".
The results are sorted more or less randomly, because all the filtered results have the same (constant) score when there is no search query.
By the way, enabling the explain option is even more helpful when you use search queries: you will see how ES calculates the score and understand why it returns results in that order.
The score is mainly used for sorting. It is calculated by Lucene using several constraints; for more info, refer here .

How can I get fuzzy results sorted by relevance using ElasticSearch?

Basically, if I search for something like
neckla
I want to get results for necklace. When I did a fuzzy search, I got results for food and other items (basically it wasn't filtering properly), although the necklace result was there.
Any advice?
Have you adjusted the min_similarity parameter in your fuzzy query? Increasing min_similarity may filter out the food results and return only the necklace.
{
  "fuzzy" : {
    "jewelry" : {
      "value" : "neckla",
      "boost" : 1.0,
      "min_similarity" : 0.8,
      "prefix_length" : 0
    }
  }
}
As mentioned in the answer to this question, an edgeNGram filter may be your best bet here.
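Note that in later Elasticsearch versions the min_similarity parameter of the fuzzy query was replaced by fuzziness (often set to "AUTO"), so the query above would nowadays be written roughly as:

```json
{
  "query": {
    "fuzzy": {
      "jewelry": {
        "value": "neckla",
        "boost": 1.0,
        "fuzziness": "AUTO",
        "prefix_length": 0
      }
    }
  }
}
```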

ElasticSearch Custom Scoring with Arrays

Could anyone advise me on how to do custom scoring in ElasticSearch when searching for an array of keywords from an array of keywords?
For example, let's say there is an array of keywords in each document, like so:
{ // doc 1
  "keywords" : {
    "red" :    { "weight" : 1 },
    "green" :  { "weight" : 2.0 },
    "blue" :   { "weight" : 3.0 },
    "yellow" : { "weight" : 4.3 }
  }
},
{ // doc 2
  "keywords" : {
    "red" :   { "weight" : 1.9 },
    "pink" :  { "weight" : 7.2 },
    "white" : { "weight" : 3.1 }
  }
},
...
And I want to get scores for each documents based on a search that matches keywords against this array:
{
  "keywords" : {
    "red" :  { "weight" : 2.2 },
    "blue" : { "weight" : 3.3 }
  }
}
But instead of just determining whether they match, I want to use a very specific scoring algorithm:
Scoring a single field is easy enough, but I don't know how to manage it with arrays. Any thoughts?
Ah an interesting question! (And one I think we can solve with some communication)
Firstly, have you looked at custom script scoring? I'm pretty sure you can do this slowly with that. If you were to do this I would consider doing a rescore phase where scoring is only calculated after the doc is known to be a hit.
However, I think you can do this with Elasticsearch machinery. As far as I can work out, you are doing a dot-product between docs (where the weights are actually halfway between what you are specifying and 1).
So, my first suggestion: remove the x/2n term from your "custom scoring" (dot product) and put your weights halfway between 1 and the custom weight (e.g. 1.9 => 1.45).
... I'm sorry, I will have to come back and edit this answer. I was thinking about using nested docs with a field-defined boost level, but alas, the _boost mapping parameter is only available for the root doc.
p.s. Just had a thought: you could have fields with defined boost levels and store the terms there. Then you can do this easily, but you lose precision. A doc would then look like:
{
  "boost_1": ["aquamarine"],
  "boost_2": null, // don't need to send this, just showing for clarity
  ...
  "boost_5": ["burgundy", "fuchsia"]
  ...
}
You could then define these boosts in your mapping. One thing to note is that a field's boost value carries over to the _all field, so you would then have a bag of weighted terms in your _all field; you could then construct a bool should query with lots of term queries at different boosts (for the weights of the second doc).
Let me know what you think! A very, very interesting question.
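The bool should construction suggested above might be sketched as follows, querying the _all field (which collects the boosted terms) with one boosted term query per keyword of the query doc; the weights are taken from the example query doc, and everything here is a sketch under the boost_N field scheme, not a tested implementation:

```json
{
  "query": {
    "bool": {
      "should": [
        { "term": { "_all": { "value": "red",  "boost": 2.2 } } },
        { "term": { "_all": { "value": "blue", "boost": 3.3 } } }
      ]
    }
  }
}
```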
