Could anyone advise me on how to do custom scoring in Elasticsearch when matching an array of keywords against an array of keywords?
For example, let's say there is an array of keywords in each document, like so:
{ // doc 1
  "keywords" : {
    "red" :    { "weight" : 1.0 },
    "green" :  { "weight" : 2.0 },
    "blue" :   { "weight" : 3.0 },
    "yellow" : { "weight" : 4.3 }
  }
},
{ // doc 2
  "keywords" : {
    "red" :   { "weight" : 1.9 },
    "pink" :  { "weight" : 7.2 },
    "white" : { "weight" : 3.1 }
  }
},
...
And I want to get scores for each document based on a search that matches keywords against this set:
{
  "keywords" : {
    "red" :  { "weight" : 2.2 },
    "blue" : { "weight" : 3.3 }
  }
}
But instead of just determining whether they match, I want to use a very specific scoring algorithm:
Scoring a single field is easy enough, but I don't know how to manage it with arrays. Any thoughts?
Ah an interesting question! (And one I think we can solve with some communication)
Firstly, have you looked at custom script scoring? I'm pretty sure you can do this (slowly) with that. If you were to go that route, I would consider a rescore phase so the score is only calculated after the doc is known to be a hit.
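For illustration only, a rescore with a script might look something like the sketch below. Everything here is an assumption: it presumes each keyword's weight is indexed as its own numeric field (e.g. weight_red), that the query-side weights are passed in as params, and that a newer ES version with function_score/Painless is in use:

```json
POST /docs/_search
{
  "query": { "match_all": {} },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "double s = 0; for (e in params.q.entrySet()) { String f = 'weight_' + e.getKey(); if (doc.containsKey(f) && doc[f].size() > 0) { s += e.getValue() * doc[f].value; } } return s;",
              "params": { "q": { "red": 2.2, "blue": 3.3 } }
            }
          }
        }
      }
    }
  }
}
```

The script is just a dot product between the query-side weights and the stored per-keyword weights, skipping keywords a document doesn't have.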
However, I think you can do this with built-in Elasticsearch machinery. As far as I can work out, you are doing a dot product between docs (where the weights are actually halfway between what you are specifying and 1).
So, my first suggestion: remove the x/2n term from your "custom scoring" (dot product) and put your weights halfway between 1 and the custom weight (e.g. 1.9 => 1.45).
... I'm sorry, I will have to come back and edit this answer. I was thinking about using nested docs with a field-defined boost level, but alas, the _boost mapping parameter is only available for the root doc.
p.s. Just had a thought: you could have fields with defined boost levels and store the terms there. Then you can do this easily, but you lose precision. A doc would then look like:
{
"boost_1": ["aquamarine"],
"boost_2": null, //don't need to send this, just showing for clarity
...
"boost_5": ["burgundy", "fuschia"]
...
}
You could then define these boosts in your mapping. One thing to note: a field's boost value carries over to the _all field, so you would then have a bag of weighted terms in your _all field. You could then construct a bool should query with lots of term queries at different boosts (for the weights of the second doc).
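With that layout, the search side could be a bool should of boosted term queries, one per keyword in the "query doc". The weights below are made up for illustration, and note that the _all field this relies on has been removed in newer ES versions:

```json
{
  "query": {
    "bool": {
      "should": [
        { "term": { "_all": { "value": "red",  "boost": 2.2 } } },
        { "term": { "_all": { "value": "blue", "boost": 3.3 } } }
      ]
    }
  }
}
```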
Let me know what you think! A very, very interesting question.
search_after in Elasticsearch must match its sort parameters in count and order, so I was wondering how to get the score from the previous result (e.g. page 1) to use as the search_after for the next page.
I faced an issue when using the score of the last document in the previous search: the score was 1.0, and since all documents have a 1.0 score, the result for the next page turned out to be null (empty).
That actually makes sense, since I am asking Elasticsearch for results with a lower rank (score) than 1.0, of which there are none. So which score do I use to get the next page?
Note:
I am sorting by score then by TieBreakerID, so one possible solution is using a high value (say 1000) for the score.
What you're doing sounds like it should work, as explained by an Elastic team member. It works for me (in ES 7.7) even with tied scores when using the document ID (copied into another indexed field) as a tiebreaker. It's true that indexing additional documents while paginating will make your scores slightly unstable, but not likely enough to cause a significant problem for an end user. If you need it to be reliable for a batch job, the Scroll API is the better choice.
{
"query": {
...
},
"search_after": [
12.276552,
14173
],
"sort": [
{ "_score": "desc" },
{ "id": "asc" }
]
}
Suppose I have 3 Documents:
A, B, C
All terms are very similar, and the internal score is practically identical.
People search for B more frequently than A and C, and for C more frequently than A.
Can I get the score to order them like B, C, A?
In the use case you described, you cannot change the behaviour of the scoring/relevancy computation. There is the possibility of boosting (e.g. in a match query) to affect scoring when searching for values, but that wouldn't be appropriate since you only want to sort the documents.
So the information about the search frequency has to be part of the documents themselves, meaning it has to be its own field. Then you can simply add a sort clause like the following:
{
"query": {
// your awesome query...
},
"sort": [
{
"search_frequency": {
"order": "desc"
}
},
"_score"
]
}
The challenge in this solution would be to keep the value of the field search_frequency up to date. You can do that via the Update API.
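As a sketch (the index name, document ID, and increment are hypothetical), a scripted update that bumps the counter each time the document is clicked or searched could look like:

```json
POST /my_index/_update/my_doc_id
{
  "script": {
    "source": "ctx._source.search_frequency += params.inc",
    "params": { "inc": 1 }
  },
  "upsert": { "search_frequency": 1 }
}
```

The upsert clause initializes the field for documents that don't have it yet.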
For my project I need to find out which results of the searches are considered "good" matches. Currently, the scores vary wildly depending on the query, hence the need to normalize them somehow. Normalizing the scores would allow to select the results above a given threshold.
I found couple solutions for Lucene:
how do I normalise a solr/lucene score?
http://wiki.apache.org/lucene-java/ScoresAsPercentages
How would I go ahead and apply the same technique to ElasticSearch? Or perhaps there is already a solution that works with ES for score normalization?
As far as I have searched, there is no way to get a normalized score out of Elasticsearch. You will have to hack it by making two queries. The first is a pilot query (preferably with size 1, but all other attributes the same) that fetches the max_score. Then you can run your actual query and use function_score to normalize the score: pass the max_score from the pilot query in params to function_score and use it to normalize every score.
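A sketch of the second query, assuming the pilot query returned a max_score of 12.34 (the field name and value here are purely illustrative):

```json
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "my search terms" } },
      "script_score": {
        "script": {
          "source": "_score / params.max_score",
          "params": { "max_score": 12.34 }
        }
      },
      "boost_mode": "replace"
    }
  }
}
```

With boost_mode set to replace, the script's output (a value in roughly [0, 1]) is used as the final score instead of being multiplied into the original one.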
It's a bit late.
We needed to normalise the ES score for one of our use cases. So, we wrote a plugin that overrides the ES Rescorer feature.
It supports min-max and z-score normalization.
Github: https://github.com/bkatwal/elasticsearch-score-normalizer
Usage:
Min-max
{
"query": {
... some query
},
"from" : 0,
"size" : 50,
"rescore" : {
"score_normalizer" : {
"normalizer_type" : "min_max",
"min_score" : 1,
"max_score" : 10
}
}
}
Usage z-score:
{
"query": {
... some query
},
"from" : 0,
"size" : 50,
"rescore" : {
"score_normalizer" : {
"normalizer_type" : "z_score",
"min_score" : 1,
"factor" : 0.6,
"factor_mode" : "increase_by_percent"
}
}
}
For complete documentation check the Github repository.
My use case is like this:
For a query like iphone charger, I am getting higher relevance for results named iphone charger coupons than for those named iphone charger, possibly because of a better match in the description and other fields. Boosting the name field isn't helping much unless I skew the importance drastically. What I really need is a tf/idf boost within the name field.
To quote the Elasticsearch blog:
the frequency of a term in a field is offset by the length of the field. However, the practical scoring function treats all fields in the same way. It will treat all title fields (because they are short) as more important than all body fields (because they are long).
I need to boost this more important value for a particular field. Can we do this with function score or any other way?
A one-term difference in length is not much of a difference to the scoring algorithm (and, in fact, can vanish entirely due to imprecision in the length norm). If there are hits on other fields, you have a lot of scoring elements to fight against.
A dis_max would probably be a reasonable approach to this. Instead of all the additive scores and coords and such you are trying to overcome, it will simply select the score of the best-matching subquery. If you boost the query against the name field, you can ensure matches there are strongly preferred.
You can then assign a tie_breaker, so that the score of the description subquery is factored in only when the name scores are tied.
{
"dis_max" : {
"tie_breaker" : 0.2,
"queries" : [
{
"terms" : {
"name" : ["iphone", "charger"],
"boost" : 10
}
},
{
"terms" : {
"description" : ["iphone", "charger"]
}
}
]
}
}
Another approach to this sort of thing, if you absolutely know when you have an exact match against the entire field, is to separately index an untokenized version of that field and query that field as well. Any match against the untokenized version of the field will be an exact match against the entire field contents. This would prevent you from needing to rely on the length norm to make that determination.
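A sketch of that approach (index and field names are hypothetical): index a keyword sub-field alongside the analyzed field, then boost exact matches on it:

```json
PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": { "raw": { "type": "keyword" } }
      }
    }
  }
}

POST /products/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "iphone charger" } },
        { "term": { "name.raw": { "value": "iphone charger", "boost": 10 } } }
      ]
    }
  }
}
```

A document whose name is exactly "iphone charger" matches both clauses, so it outscores documents like "iphone charger coupons" that only match the analyzed clause.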
I wonder how Elasticsearch sorts (on what field) when no search query is specified (I just filter documents) and no sort option is given. It looks like the sorting is then random... The default sort order is _score, but the score is always 1 when you do not specify a search query...
You got it right. It's then more or less random, with the score being 1. You still get consistent results, as far as I remember. It's the same as when you get results in SQL but don't specify ORDER BY.
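If all you need for a filter-only search is a cheap, deterministic order, one option is to sort on _doc explicitly (index order; it's also the most efficient sort). The filter below is just a placeholder:

```json
{
  "query": { "bool": { "filter": { "term": { "status": "active" } } } },
  "sort": [ "_doc" ]
}
```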
Just in case someone sees this post, even though it was posted over 6 years ago...
When you want to know how Elasticsearch calculates its score (known as _score), you can use the explain option.
I suppose your query (with a filter & without search) might look more or less like this (the point is setting the explain option to true):
POST /goods/_search
{
"explain": true,
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": {
"term": {
"maker_name": "nike"
}
}
}
}
}
Running this, you will notice that the _explanation of each hit reads as below:
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(maker_name:nike)",
"details" : [ ]
}
which means ES gave a constant score to all of the hits.
So to answer the question: yes.
The results are sorted somewhat randomly, because without a search query all the filtered results have the same (constant) score.
By the way, the explain option is even more helpful when you use search queries: you will see how ES calculates the score and understand why it returns results in that order.
The score is mainly used for sorting. It is calculated by Lucene's scoring formula, which combines several factors; see the Lucene scoring documentation for more info.