Manipulate score in Elasticsearch

I would like to manipulate the score I get when I run a search on Elasticsearch.
I already use the boost option, but it does not give me the results I want. After some reading, I think the function_score query is the solution to my problem.
I understand how it works, but I can't figure out how to change my current query to use function_score.
"query": {
"filtered": {
"query": {
"bool": {
"should": [{
"multi_match": {
"type": "most_fields",
"query": "paus",
"operator": "and",
"boost": 2,
"fields": [
"fullname^2",
"fullname.folded",
"alias^2",
"name^2"
],
"fuzziness": 0
}
}, {
"multi_match": {
"type": "most_fields",
"query": "paus",
"operator": "and",
"boost": 1.9,
"fields": [
"taggings.tag.name^1.9",
"function",
"relations.master.name^1.9",
"relations.master.first_name^1.9",
"relations.master.last_name^1.9",
"relations.slave.name^1.9",
"relations.slave.first_name^1.9",
"relations.slave.last_name^1.9"
],
"fuzziness": 0
}
}, {
"multi_match": {
"type": "most_fields",
"query": "paus",
"operator": "and",
"fields": [
"fullname",
"alias",
"name"
],
"boost": 0.2,
"fuzziness": 1
}
}, {
"match": {
"extra": {
"query": "paus",
"fuzziness": 0,
"boost": 0.1
}
}
}]
}
},
"filter": {
"bool": {
"must": [
{
"terms": {
"type": ["Person"]
}
},
{
"term": {
"deleted": false
}
}
]
}
}
}
As you can see, we have four kinds of matches:
Boost 2: exact matches on the name
Boost 1.9: exact matches on the taggings
Boost 0.2: matches on the name with one character written wrong
Boost 0.1: matches in the extra (description) field
The problem I am facing is that matches with one character written wrong and no tagging score higher than matches with the right tagging and the whole word written wrong. It should be the other way around.
Any help would be appreciated :)

There is no clear answer to this. Your best friend is the Explain API: it will tell you how each document's score is calculated.
The most important thing to remember is that boost is simply one of the factors considered when calculating the score. From the docs:
Practically, there is no simple formula for deciding on the “correct” boost value for a particular query clause. It’s a matter of try-it-and-see. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors
It would help a lot to read through the theory behind relevance scoring and Lucene's Practical Scoring Function. This is the formula used by Lucene:
score(q,d) =
queryNorm(q)
· coord(q,d)
· ∑ (
tf(t in d)
· idf(t)²
· t.getBoost()
· norm(t,d)
) (t in q)
One of several reasons you are not getting the expected results could be norm(t,d) and idf(t)². For example, if the extra field contains paus me while the other fields contain something like my name is some paus something, the shorter extra field gets a higher field-length norm, i.e. norm(t,d). Also, if you have, say, 10000 documents and only one of them has paus in the extra field, the inverse document frequency becomes quite high, because it is calculated as idf(t) = 1 + log(numDocs / (docFreq + 1)), here with numDocs=10000 and docFreq=1, and this value is then squared. I had exactly this problem in my dataset.
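To get a feel for the numbers, here is a minimal sketch of the two factors mentioned above. It assumes the classic TF/IDF definitions (natural log for idf, 1/sqrt(number of terms) for the field-length norm) and ignores the lossy one-byte encoding Lucene applies to norms, so real explain output may differ slightly. The function names are made up for illustration.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """Field-length norm: shorter fields get a higher value."""
    return 1.0 / math.sqrt(num_terms)


# A term that appears in only 1 of 10000 documents
rare_idf = idf(10000, 1)
print("idf:", round(rare_idf, 3), "idf squared:", round(rare_idf ** 2, 1))

# "paus me" has 2 terms, "my name is some paus something" has 6
print("norm, short field:", round(field_length_norm(2), 3))
print("norm, long field:", round(field_length_norm(6), 3))
```

With these inputs the squared idf is roughly 90, which can easily outweigh a boost of 0.1 versus 2.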
The fuzzy query scoring higher could be related to this issue, which is basically a Lucene issue. It is fixed in the latest version.
One approach that might work is to wrap the last two clauses in constant_score and give the first two clauses a boost of, say, 5. This would make the scores easier to reason about.
Try to solve this step by step: start with two clauses and look at the output of the Explain API, then try three, and finally all four. Also remove the field boosts and try with query boosts only. Gradually you will figure it out.
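The constant_score suggestion above could look roughly like the following sketch, which builds the bool query as a Python dict and prints it as JSON. It assumes the older constant_score syntax that accepts a "query" key (matching the filtered query used in the question); newer Elasticsearch versions expect "filter" instead. The helper name and the boost values 5, 0.2 and 0.1 are illustrative only, and field boosts are left out as recommended.

```python
import json


def exact_clause(fields, boost):
    # Exact (non-fuzzy) multi_match clause carrying a query-level boost
    return {"multi_match": {"type": "most_fields", "query": "paus",
                            "operator": "and", "boost": boost,
                            "fields": fields, "fuzziness": 0}}


should = [
    exact_clause(["fullname", "fullname.folded", "alias", "name"], 5),
    exact_clause(["taggings.tag.name", "function"], 5),
    # Fuzzy name match: constant_score replaces the TF/IDF score with the boost
    {"constant_score": {"boost": 0.2, "query": {"multi_match": {
        "query": "paus", "operator": "and",
        "fields": ["fullname", "alias", "name"], "fuzziness": 1}}}},
    # Match on the extra field, also flattened to a constant score
    {"constant_score": {"boost": 0.1, "query": {"match": {
        "extra": {"query": "paus", "fuzziness": 0}}}}},
]

query = {"query": {"bool": {"should": should}}}
print(json.dumps(query, indent=2))
```

Because the last two clauses now contribute a fixed score, idf and field length can no longer push them above the exact matches.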
I hope this helps!!

Related

Elasticsearch - Impact of adding Boost to query

I have a very simple Elastic query mentioned below.
{
"query": {
"bool": {
"must": [
{
"bool": {
"minimum_should_match": 1,
"should": [
{
"match": {
"tag": {
"query": "Audience: PRO Brand: Samsung",
"boost": 3,
"operator": "and"
}
}
},
{
"match": {
"tag": {
"query": "audience: PRO brand samsung",
"boost": 2,
"operator": "or"
}
}
}
]
}
}
]
}
}
}
I want to know whether adding a boost to the query has any performance impact, and whether boosting helps on a very large data set where the search word occurs frequently.
Elasticsearch adds the boost param with a default value anyway. IMO giving it a different value won't make much difference to performance, but you should be able to measure it yourself.
Regarding your second question, adding a boost definitely makes sense when your search words are common; it helps you find the relevant documents. For example, suppose you are searching for query in an index containing Elasticsearch posts (query will be very common in Elasticsearch posts), but you want to give more weight to documents that have the tag elasticsearch-query. Adding boosts in this case will give you more relevant results.

elasticsearch giving more weight to different fields and scenarios

I have this ES query:
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "test",
"fields": [
"name^-1.0",
"id^-1.0",
"address.city^-1.0",
"address.street^-1.0"
],
"type": "phrase_prefix",
"lenient": "true"
}
}
],
"boost": 1.0,
"minimum_should_match": "1"
}
},
"from": 0,
"size": 20
}
Currently, when I search for a person with the name john, I get a bunch of results where id, address.city or address.street contain john, which is fine, but I want name to be more important. Also, if the index contains two people, john and someone with two names like george john, I would want the plain john to come up first.
Can I do that? :)
To make a field more important than the others, you can set its boost to a higher value. So fieldA^4 and fieldB^1 implies that fieldA is 4 times more important than fieldB. You can therefore give the name field a higher boost to make it count more in scoring.
For the second point, a document whose name field is john will score higher than a document whose name field is george john (assuming the other fields hold the same data in both documents). The reason you are getting the second doc (george john) higher in the results is that you have boosted all the fields with a negative value.
So, to cater to both of your points:
give a higher boost to name
make the boost for all fields a positive value
So the query should look as below:
{
//"explain": true,
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "john",
"fields": [
"name^4.0",
"id^1.0",
"address.city^1.0",
"address.street^1.0"
],
"type": "phrase_prefix",
"lenient": "true"
}
}
],
"boost": 1,
"minimum_should_match": "1"
}
},
"from": 0,
"size": 20
}
To understand more about how Elasticsearch calculates the score for a matching document, you can set "explain": true in your query. The response will then include the detailed steps taken to calculate the score.
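As a rough illustration of why the field boost matters, here is a toy model in Python. It assumes the multi_match query combines fields the way best_fields does (take the best boosted field score, plus tie_breaker times the rest). The raw per-field scores are invented for the example and are not what Elasticsearch would actually compute.

```python
def best_fields_score(field_scores, boosts, tie_breaker=0.0):
    """Toy model: best boosted field score plus tie_breaker times the others."""
    boosted = [score * boosts.get(field, 1.0)
               for field, score in field_scores.items()]
    best = max(boosted)
    return best + tie_breaker * (sum(boosted) - best)


boosts = {"name": 4.0, "id": 1.0, "address.city": 1.0, "address.street": 1.0}

# Hypothetical raw scores: doc A matches on name, doc B on address.city
doc_a = best_fields_score({"name": 1.0}, boosts)
doc_b = best_fields_score({"address.city": 1.5}, boosts)

print("doc A:", doc_a, "doc B:", doc_b)
```

Even though doc B has the higher raw score in this made-up case, the boost of 4 on name puts doc A first. The explain output is the place to check the real numbers.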

How to limit the results in a multi match query?

I use multiple match phrase queries when I search. However, I need to limit the results of each match phrase separately. I mean, I want to take only 2 results for each match phrase. I can't find any limit/size attribute. Do you know of a solution?
Example code:
"query": {
"bool": {
"should": [
{
"match_phrase": {
"text": {
"query": " Home is clear and big ",
"slop": 2
}
}
},
{
"match_phrase": {
"text": {
"query": "365 different company use our system in test",
"slop": 2
}
}
}
]}}
Use the size parameter:
{"size" : 3, "from":0, "query": ...}
The simplest solution is to make two individual searches, one for each condition. The size parameter can be set to retrieve only the first 2 results for each query.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
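One way to send the individual searches in a single round trip is the multi search (_msearch) API, which takes alternating header and body lines as newline-delimited JSON. The sketch below only builds that request body; the index name my_index and the helper name are placeholders, not something from the question.

```python
import json


def build_msearch_body(index, phrases, size=2, slop=2):
    """Build an _msearch payload with one size-limited search per phrase."""
    lines = []
    for phrase in phrases:
        # Header line: which index this search runs against
        lines.append(json.dumps({"index": index}))
        # Body line: the actual search, limited to `size` hits
        lines.append(json.dumps({
            "size": size,
            "query": {"match_phrase": {"text": {"query": phrase, "slop": slop}}},
        }))
    # The _msearch body must end with a newline
    return "\n".join(lines) + "\n"


body = build_msearch_body("my_index", [
    "Home is clear and big",
    "365 different company use our system in test",
])
print(body)
```

The response contains one result set per search, so it stays clear which phrase each hit belongs to.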
The boolean should query will not distinguish which condition has been satisfied: it returns documents for which at least one of the two conditions holds. The scores for the two matches will be combined into a single score but it will be impossible to tell which s

Elasticsearch rescore all results ignoring base score

I'm trying to rescore my results with the following query:
POST /archive/item/_search
{
"query": {
"multi_match": {
"fields": ["title", "description"],
"query": "1 złoty",
"operator": "and"
}
},
"rescore": {
"window_size": 50,
"query": {
"rescore_query": {
"multi_match": {
"type": "phrase",
"fields": ["title", "description"],
"query": "1 złoty",
"slop": 10
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
}
I'm doing this because I want to score mainly by proximity.
Also, I want to ignore the impact of source field length on the score.
Am I doing this right? If not, what's the best practice here?
And a second question: why is window_size needed anyway?
I don't want top results only.
The main query acts like a filter, so all the results it returns are relevant.
I guess something like "window_size": "all" would be perfect, but I couldn't find anything in the docs.
To answer your second question, the reason it's needed is because it's designed to be for top results only. Basically it's a cost issue - the assumption is that the secondary algorithm is more expensive so it was only designed to be run on the top results. There's more discussion about this here:
https://github.com/elasticsearch/elasticsearch/issues/2640
and here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-rescore.html
Personally I think the "all" option is a great idea, maybe you should open an issue on github?
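To make the role of window_size concrete, here is a simplified Python model of what rescoring does. It assumes the default linear combination (query_weight times the original score plus rescore_query_weight times the rescore score) and that hits not matching the rescore query keep only the weighted original score. The function and the sample scores are illustrative, not Elasticsearch code.

```python
def rescore(hits, window_size, query_weight, rescore_query_weight, rescore_scores):
    """hits: list of (doc_id, score) sorted by score, best first."""
    top, rest = hits[:window_size], hits[window_size:]
    rescored = []
    for doc_id, score in top:
        new_score = query_weight * score
        if doc_id in rescore_scores:
            # Only hits matching the rescore query get the second component
            new_score += rescore_query_weight * rescore_scores[doc_id]
        rescored.append((doc_id, new_score))
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    # Hits outside the window are left untouched
    return rescored + rest


hits = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
phrase_scores = {"b": 5.0, "c": 9.0}  # hypothetical rescore_query scores

result = rescore(hits, window_size=2, query_weight=0,
                 rescore_query_weight=1, rescore_scores=phrase_scores)
print(result)
```

In this model, document c would have the best proximity score but is never rescored because it falls outside the window, which is why a small window_size does not suit the use case in the question.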
If you want to score all results returned by some other filter by proximity match, this should do:
{
"query": {
"filtered" : {
"query" : {
"multi_match": {
"type": "phrase",
"fields": ["title", "description"],
"query": "1 złoty",
"slop": 10
}
},
"filter" : {
"query": {
"multi_match": {
"fields": ["title", "description"],
"query": "1 złoty",
"operator": "and"
}
}
}
}
}
}
According to this, the filter is run before the query, so performance shouldn't be bad either. What's more, you don't score twice, because filters don't calculate scores. Another advantage is that filters can be cached, which should speed things up significantly.
Keep in mind that I only ran short tests, mostly focusing on syntax, not results. You might want to double-check it.

Adjusting Elasticsearch _score based on field value, relative to other matching document's field value

We're updating our search system from Solr to Elasticsearch. We've already improved lots of things, but something we haven't got right yet is boosting a document's (product's) score by the popularity of the product (it's an ecommerce website).
This is what we have currently (with lots of irrelevant bits stripped out):
{
"query": {
"function_score": {
"query": {
"multi_match" : {
"query": "renal dog food",
"fields": [ "family_name^20", "parent_categories^2", "description^0.2", "product_suffixes^8", "facet_values^5" ],
"operator": "and",
"type": "best_fields",
"tie_breaker": 0.3
}
},
"functions": [{
"script_score": {
"script": "_score * log1p(1 + doc['popularity_score'].value)"
}
}],
"score_mode": "sum"
}
},
"sort": [
{ "_score": "desc" }
]
}
The popularity_score field contains the total number of orders containing this item in the last 6 weeks. Some items will never have been ordered and some will have had up to 30,000 (with potentially a lot more as we continue to grow the business). It's quite a big range.
The problem we have is that a document (product) might be a really good match text-wise but not very popular. We then have another, not very relevant product that just about matches the query, but because it is very popular it jumps up the list. What we are looking for is something that allows the popularity_score to be taken relative to the popularity_score of the other matching results, with some form of normalisation, rather than just being taken as is (log1p doesn't seem to be enough sometimes). Does anyone have any suggestions or ideas?
Thank you!
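For reference, here is a small Python sketch of how large the popularity multiplier from the script above gets across the range described. It assumes log1p in the script is the natural-log log1p (as in Python's math.log1p), so log1p(1 + x) equals ln(2 + x). The function name is made up for illustration.

```python
import math


def popularity_multiplier(popularity_score):
    """Mirrors the script's log1p(1 + popularity) factor."""
    return math.log1p(1 + popularity_score)


low = popularity_multiplier(0)        # never ordered
high = popularity_multiplier(30000)   # most popular item mentioned

print("multiplier for 0 orders:", round(low, 3))
print("multiplier for 30000 orders:", round(high, 3))
print("ratio:", round(high / low, 1))
```

Under this assumption, the most popular item gets a multiplier about 15 times that of an item with no orders, which is enough to lift a weak text match over a strong one.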
