Elasticsearch: giving more weight to different fields and scenarios

I have this ES query:
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "test",
"fields": [
"name^-1.0",
"id^-1.0",
"address.city^-1.0",
"address.street^-1.0"
],
"type": "phrase_prefix",
"lenient": "true"
}
}
],
"boost": 1.0,
"minimum_should_match": "1"
}
},
"from": 0,
"size": 20
}
Currently, when I search for a person with the name john, I get a bunch of results whose id, address.city or address.street contain john, which is fine, but I want name to be more important. Also, if the index contains two people, john and someone with two names like george john, I want the plain john to come up first.
Can I do that? :)

To make one field more important than the others, set its boost to a higher value. With fieldA^4 and fieldB^1, fieldA is 4 times more important than fieldB. So give the name field a higher boost to make it count more in scoring.
For the second point, a document whose name field is john will score higher than a document whose name field is george john (assuming the other fields hold the same data in both documents), since field-length normalization favors a match in a shorter field. The reason you are getting the second document (george john) higher in the results is that you have boosted all the fields with a negative value.
So, to address both of your points:
- give a higher boost to name
- make the boost for all fields a positive value
So the query should look like this:
{
//"explain": true,
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "john",
"fields": [
"name^4.0",
"id^1.0",
"address.city^1.0",
"address.street^1.0"
],
"type": "phrase_prefix",
"lenient": "true"
}
}
],
"boost": 1,
"minimum_should_match": "1"
}
},
"from": 0,
"size": 20
}
To understand how Elasticsearch calculates the score of a matching document, add "explain": true to your query (it is commented out in the query above). The response will then include a detailed breakdown of the steps taken to calculate each score.
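As a minimal sketch of the advice above, the following Python helpers build the same request body from a field-to-boost mapping and refuse non-positive boosts. The names `boosted_fields` and `build_query` are made up for this illustration and are not part of any Elasticsearch client; the code only constructs the dictionary and does not send it anywhere.

```python
import json


def boosted_fields(boosts):
    """Turn {"name": 4.0} into ["name^4.0"], rejecting non-positive boosts."""
    fields = []
    for field, boost in boosts.items():
        if boost <= 0:
            raise ValueError(f"boost for {field!r} must be positive, got {boost}")
        fields.append(f"{field}^{boost}")
    return fields


def build_query(text, boosts, explain=False):
    """Build the multi_match request body shown in the answer above."""
    body = {
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": boosted_fields(boosts),
                            "type": "phrase_prefix",
                            "lenient": True,
                        }
                    }
                ],
                "minimum_should_match": 1,
            }
        },
        "from": 0,
        "size": 20,
    }
    if explain:
        # Ask Elasticsearch to return a score explanation for every hit.
        body["explain"] = True
    return body


body = build_query(
    "john",
    {"name": 4.0, "id": 1.0, "address.city": 1.0, "address.street": 1.0},
    explain=True,
)
print(json.dumps(body, indent=2))
```

Keeping the boosts in one mapping makes it easy to try different weights for name while checking the explain output after each change.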

Related

Search for two fields but only score once in Elasticsearch

Let's say I have these documents in Elasticsearch:
{
"display_name": "Jose Cummings",
"username": "josecummings"
},
{
"display_name": "Jose Ramirez",
"username": "elite_gamer"
},
{
"display_name": "Lance Abrams",
"username": "abrams1"
},
{
"display_name": "Steve Smith",
"username": "josesmose"
}
I want to run an "as you type" search for Jose that searches against both the display_name and the username fields, which I can do with this:
{
"query": {
"bool": {
"must": {
"multi_match": {
"fields": [
"display_name",
"username"
],
"query": "Jose",
"type": "bool_prefix",
"fuzziness": "AUTO",
"boost": 50
}
}
}
}
}
The issue here is that when I search for Jose, Jose Cummings gets 100 points while Jose Ramirez and Steve Smith only get 50 points, because it seems to sum the scores for the two fields. This essentially rewards a user for having the same display_name as username, which we do not want to happen.
Is there a way to only take the max score from the two fields? I've tried dozens of different combinations now using function_score, boost_mode/score_mode, constant_score, trying to do a should match with multiple match_bool_prefix queries, etc. Nothing I've tried seems to achieve this.
Try this:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"fields": [
"display_name^50",
"username^50"
],
"query": "Jose",
"type": "bool_prefix",
"fuzziness": "AUTO",
"tie_breaker": 0.3
}
}
]
}
}
}
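Notice the effect of setting tie_breaker to 0.0 as opposed to a value between 0 and 1, or to 1: with 0.0 only the best-matching field's score counts, with 1 the scores of all matching fields are summed, and anything in between adds that fraction of the other fields' scores to the best one.
Also note that bool_prefix scoring behaves like most_fields, but using a match_bool_prefix query instead of a match query.
Perhaps you do want the fields to be prefixed with jose. But if the username is, say, cool_jose, it is going to be left out (unless, for example, you apply an analyzer other than the standard one).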
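To make the tie_breaker arithmetic concrete, here is a small Python sketch of how per-field scores are combined: the best score plus tie_breaker times the sum of the remaining scores. The function name and the per-field scores of 50 are assumptions taken from the numbers in the question; real scores come from Elasticsearch and will differ.

```python
def combine_field_scores(field_scores, tie_breaker):
    """Best field score plus tie_breaker times the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others


# Hypothetical per-field scores for two of the example documents.
jose_cummings = [50.0, 50.0]  # matches on display_name and on username
jose_ramirez = [50.0]         # matches on display_name only

for tie_breaker in (0.0, 0.3, 1.0):
    print(
        tie_breaker,
        combine_field_scores(jose_cummings, tie_breaker),
        combine_field_scores(jose_ramirez, tie_breaker),
    )
```

With tie_breaker at 0.0 both documents end up with the same combined score, which is the "max of the two fields" behaviour the question asks for.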

ElasticSearch: obtaining individual scores from each query inside of a bool query

Assume I have a compound bool query with various "must" and "should" clauses, each of which may include different leaf queries such as "multi_match" and "match_phrase", as in the snippet below.
How can I get the score from individual queries packed into a single query?
I know one way would be to break it down into multiple queries, execute each one, and then aggregate the results in code (not at query level). However, I suppose that is less efficient, and I would lose sorting, pagination and similar features of Elasticsearch.
I think the Explain API is also not useful for me, since it provides very low-level scoring details (inefficient and hard to parse), while I just need the score of each specific leaf query (which I have already named).
If I'm wrong on any terminology (e.g. compound, leaf), please correct me. The big picture is how to obtain individual scores from each sub-query inside of a bool query.
PS: I came across "Different score functions in bool query". However, it does not return the scores. If I wrap my queries in "function_score", I want the scoring to stay at the default while still getting the individual scores in the response.
Please see the snippet below:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "...",
"fields": [
"field1^3",
"field2^5"
],
"_name": "must1_mm",
"boost": 3
}
}
],
"should": [
{
"multi_match": {
"query": "...",
"fields": [
"field3^2",
"field4^5"
],
"_name": "should1_mm",
"boost": 2
}
},
{
"match_phrase": {
"field5": {
"_name": "phrase1",
"boost": 1.5,
"query": "..."
}
}
},
{
"match_phrase": {
"field6": {
"_name": "phrase2",
"boost": 1,
"query": "..."
}
}
}
]
}
}
}

How to force certain fields in multi_match to have exact match

I am trying to match the title of a product listing to a database of known products. My first idea was to put the known products and their metadata into elasticsearch and try to find the best match with multi_match. My current query is something like:
{
"query": {
"multi_match" : {
"query": "Men's small blue cotton pants SKU123",
"fields": ["sku^2","title","gender","color", "material","size"],
"type" : "cross_fields"
}
}
}
The problem is that sometimes it returns products with the wrong color. Is there a way I could modify the above query to only score items in my index that have a color field equal to a word that exists in the query string? I am using Elasticsearch 5.1.
If you want Elasticsearch to score only items that meet certain criteria, you need to use the terms query in a filter context.
Since the terms query does not analyze your query, you'll have to do that yourself. Something simple would be to tokenize by whitespace, lowercase the tokens, and generate a query that looks like this:
{
"query": {
"bool": {
"filter": {
"terms": {
"color": ["men's", "small", "blue", "cotton", "pants", "sku123"]
}
},
"must": {
"multi_match": {
"query": "Men's small blue cotton pants SKU123",
"fields": [
"sku^2",
"title",
"gender",
"material",
"size"
],
"type": "cross_fields"
}
}
}
}
}
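The tokenize-and-lowercase step described above can be sketched in Python as follows. The function name `build_color_filtered_query` is hypothetical, and the sketch assumes the color field is indexed as lowercase single-word terms; with a different mapping the tokens would have to be normalized to match.

```python
import json


def build_color_filtered_query(title, color_field="color"):
    """Build a bool query that filters on color terms taken from the title."""
    # Tokenize by whitespace and lowercase, because the terms query
    # does not analyze its input.
    tokens = title.lower().split()
    return {
        "query": {
            "bool": {
                # Filter context: restricts the hits without affecting the score.
                "filter": {"terms": {color_field: tokens}},
                "must": {
                    "multi_match": {
                        "query": title,
                        "fields": ["sku^2", "title", "gender", "material", "size"],
                        "type": "cross_fields",
                    }
                },
            }
        }
    }


query = build_color_filtered_query("Men's small blue cotton pants SKU123")
print(json.dumps(query, indent=2))
```

Only documents whose color equals one of the tokens pass the filter, and the multi_match clause then scores them on the remaining fields.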

Manipulate score in elasticsearch

I would like to manipulate the score I get when I do a search on elasticsearch.
I already use the boost option, but it does not give me the results I would like to have. After some reading I think the function_score query is the solution to my problem.
I understand how it works, but I can’t figure out how I can change my current query to use it with the function_score query.
"query": {
"filtered": {
"query": {
"bool": {
"should": [{
"multi_match": {
"type": "most_fields",
"query": "paus",
"operator": "and",
"boost": 2,
"fields": [
"fullname^2",
"fullname.folded",
"alias^2",
"name^2"
],
"fuzziness": 0
}
}, {
"multi_match": {
"type": "most_fields",
"query": "paus",
"operator": "and",
"boost": 1.9,
"fields": [
"taggings.tag.name^1.9",
"function",
"relations.master.name^1.9",
"relations.master.first_name^1.9",
"relations.master.last_name^1.9",
"relations.slave.name^1.9",
"relations.slave.first_name^1.9",
"relations.slave.last_name^1.9"
],
"fuzziness": 0
}
}, {
"multi_match": {
"type": "most_fields",
"query": "paus",
"operator": "and",
"fields": [
"fullname",
"alias",
"name"
],
"boost": 0.2,
"fuzziness": 1
}
}, {
"match": {
"extra": {
"query": "paus",
"fuzziness": 0,
"boost": 0.1
}
}
}]
}
},
"filter": {
"bool": {
"must": [
{
"terms": {
"type": ["Person"]
}
},
{
"term": {
"deleted": false
}
}
]
}
}
}
As you can see, we have four kinds of matches:
Boost 2: when there are exact matches on the name
Boost 1.9: when there are exact matches on the taggings
Boost 0.2: when there are matches on the name but with one character written wrong
Boost 0.1: when there are matches in the extra (description) field
The problem I am facing is that matches with one character written wrong and no tagging score higher than matches with the right tagging and the whole word written wrong. It should be the other way around...
Any help would be appreciated :)
There is no clear answer to this. Your best friend is the Explain API: it will tell you how each document's score is calculated.
The most important thing to remember is that boost is simply one of the factors considered when calculating the score. From the docs:
Practically, there is no simple formula for deciding on the “correct” boost value for a particular query clause. It’s a matter of try-it-and-see. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors
It would help you a lot to go through Theory and Lucene's Practical Scoring Function. This is the formula used by Lucene:
score(q,d) =
queryNorm(q)
· coord(q,d)
· ∑ (
tf(t in d)
· idf(t)²
· t.getBoost()
· norm(t,d)
) (t in q)
Now, one of the several reasons you are not getting the expected results could be norm(t,d) and idf(t)². For example, if one document has the extra field set to paus me while other fields contain something like my name is some paus something, the shorter extra field gets a higher field-length norm, i.e. norm(t,d). Also, if you have, say, 10000 documents and only one of them has paus in the extra field, the inverse document frequency becomes pretty high, because it is calculated as idf(t) = 1 + log(numDocs / (docFreq + 1)), here with numDocs=10000 and docFreq=1, and this value is then squared. I had exactly this problem in my dataset.
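The numbers in this example can be worked through with a short Python sketch. It uses the idf formula quoted above with a natural logarithm and the classic field-length norm 1 / sqrt(numTerms); index-time boosts and the lossy encoding Lucene applies when storing norms are ignored, and the function names are made up for this illustration.

```python
import math


def classic_idf(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1)), using the natural logarithm."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """Classic field-length norm: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


rare = classic_idf(10000, 1)       # paus appears in the extra field of one document
common = classic_idf(10000, 5000)  # a term found in half of the documents

print("idf rare:", rare, "squared:", rare ** 2)
print("idf common:", common, "squared:", common ** 2)

# "paus me" has 2 terms, "my name is some paus something" has 6.
print("norm short field:", field_length_norm(2))
print("norm long field:", field_length_norm(6))
```

Because the idf is squared, a rare term in a short field can outweigh a clause boost of 0.1 by a wide margin.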
Fuzzy queries scoring higher could be related to this issue, which is basically a Lucene issue. It is fixed in the latest version.
One way that might work is to wrap the last two clauses in constant_score and give a boost of, say, 5 to the first two clauses. This would help in understanding what is going on.
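A rough sketch of the constant_score idea, written as Python dictionaries: the helper `with_constant_score` is hypothetical, and the sketch assumes an Elasticsearch version (2.0 or later) in which constant_score accepts a query clause under its filter key. The inner boosts are dropped because the wrapper fixes the score of every match.

```python
import json


def with_constant_score(clause, boost):
    """Wrap a query clause so that every document it matches gets the same score."""
    return {"constant_score": {"filter": clause, "boost": boost}}


# The two low-priority clauses from the question, without their own boosts.
fuzzy_name = {
    "multi_match": {
        "type": "most_fields",
        "query": "paus",
        "operator": "and",
        "fields": ["fullname", "alias", "name"],
        "fuzziness": 1,
    }
}
extra = {"match": {"extra": {"query": "paus", "fuzziness": 0}}}

low_priority = [
    with_constant_score(fuzzy_name, 0.2),
    with_constant_score(extra, 0.1),
]
print(json.dumps(low_priority, indent=2))
```

These two entries would replace the last two items of the should array, so that tf, idf and field-length norms no longer influence the fuzzy and extra matches.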
Try to solve this issue step by step: start with two clauses and look at the output of the Explain API, then try with three, and finally with all four. Also remove the field boosting and try with the query boost only. Gradually you will figure it out.
I hope this helps!!

Adjusting Elasticsearch _score based on field value, relative to other matching document's field value

We're updating our search system from Solr to Elasticsearch. We've already improved lots of things, but something we haven't got right yet is boosting a document's (product's) score by the popularity of the product (it's an ecommerce website).
This is what we have currently (with lots of irrelevant bits stripped out):
{
"query": {
"function_score": {
"query": {
"multi_match" : {
"query": "renal dog food",
"fields": [ "family_name^20", "parent_categories^2", "description^0.2", "product_suffixes^8", "facet_values^5" ],
"operator": "and",
"type": "best_fields",
"tie_breaker": 0.3
}
},
"functions": [{
"script_score": {
"script": "_score * log1p(1 + doc['popularity_score'].value)"
}
}],
"score_mode": "sum"
}
},
"sort": [
{ "_score": "desc" }
]
}
The popularity_score field contains the total number of orders containing this item in the last 6 weeks. Some items will never have been ordered and some will have had up to 30,000 orders (with potentially a lot more as we continue to grow the business). It's quite a big range.
The problem we have is that a document (product) might be a really good match text-wise but not very popular. We then have another, not very relevant product that does just about match the query, but because it is very popular it jumps up the list. What we are looking for is something that will allow the popularity_score to be taken relative to the popularity_score of the other matching results, giving some form of normalisation, rather than just being taken as is (log1p doesn't seem to be enough sometimes). Does anyone have any suggestions or ideas?
Thank you!
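To illustrate the range described above, this short Python sketch evaluates the multiplier from the script, log1p(1 + popularity_score), for a few order counts. The function name and the sample order counts other than 0 and 30,000 are made up; the sketch only shows how far apart the multipliers are and does not propose a normalisation.

```python
import math


def popularity_multiplier(popularity_score):
    """Same expression as in the script: log1p(1 + x), i.e. ln(2 + x)."""
    return math.log1p(1 + popularity_score)


for orders in (0, 10, 1000, 30000):
    print(orders, round(popularity_multiplier(orders), 3))

# Ratio between the most popular and a never-ordered product.
print(popularity_multiplier(30000) / popularity_multiplier(0))
```

Even after the logarithm, the most popular product gets a multiplier many times larger than a product that has never been ordered, which is consistent with popular but weakly matching products jumping up the list.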
