Elasticsearch `function_score` with `score_mode` confusion when used with nested objects

Background:
I have the following mapping for curriculum_posts documents. Notice the nested skills property.
{
"curriculum_posts" : {
"mappings" : {
"dynamic" : "false",
"properties" : {
"title" : {
"type" : "text",
"analyzer" : "english"
},
"skills" : {
"type" : "nested",
"properties" : {
"slug" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
},
"text" : {
"type" : "text"
}
}
},
"start_skill_level" : {
"type" : "keyword"
},
"start_skill_level_value" : {
"type" : "integer"
}
}
}
}
}
}
}
A sample record looks like this:
{
"_source" : {
"skills" : [
{
"start_skill_level_value" : 1,
"slug" : "infrastructure-as-code-iac"
},
{
"start_skill_level_value" : 1,
"slug" : "devops"
}
],
"title" : "Terraform: Infrastructure as code"
}
}
I wanted to run a query that returns all documents, with each score equal to the number of skills.slug values that matched. My query looks like this:
{
"query": {
"nested": {
"path": "skills",
"query": {
"function_score": {
"query": { "match_all": {} },
"functions": [
{ "script_score": { "script": "0" } },
{
"filter": {
"term": { "skills.slug.raw": { "value": "devops" } }
},
"weight": 2
},
{
"filter": {
"term": { "skills.slug.raw": { "value": "infrastructure-as-code-iac" } }
},
"weight": 2
}
],
"score_mode": "sum",
"boost_mode": "replace"
}
}
}
}
}
I decided to use function_score with boost_mode: replace so that the scores from the inner query are ignored and only the function scores are used. I set score_mode: sum so that the scores from the matching functions are summed up.
The problem
So, for the above query on the example document above, I was expecting the score to be 4.0, because the document matches skills.slug for both infrastructure-as-code-iac and devops. However, the score in the result is only 2.0 for the document.
Question
I suppose I'm not understanding how function_score takes the scores from the functions or how my functions are affecting the score. Could someone help me understand the scoring here?
Some debugging
I looked at the explanation but I'm unable to decode much information from it. Nevertheless, here is the explanation:
{
"_index" : "curriculum_posts",
"_type" : "_doc",
"_id" : "18",
"matched" : true,
"explanation" : {
"value" : 2.0,
"description" : "Score based on 2 child docs in range from 83 to 93, best match:",
"details" : [
{
"value" : 2.0,
"description" : "sum of:",
"details" : [
{
"value" : 2.0,
"description" : "min of:",
"details" : [
{
"value" : 2.0,
"description" : "function score, score mode [sum]",
"details" : [
{
"value" : 0.0,
"description" : "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='0', options={}, params={}}\"",
"details" : [
{
"value" : 1.0,
"description" : "_score: ",
"details" : [
{
"value" : 1.0,
"description" : "*:*",
"details" : [ ]
}
]
}
]
},
{
"value" : 2.0,
"description" : "function score, product of:",
"details" : [
{
"value" : 1.0,
"description" : "match filter: skills.slug.raw:infrastructure-as-code-iac",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "product of:",
"details" : [
{
"value" : 1.0,
"description" : "constant score 1.0 - no function provided",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "weight",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 3.4028235E38,
"description" : "maxBoost",
"details" : [ ]
}
]
},
{
"value" : 0.0,
"description" : "match on required clause, product of:",
"details" : [
{
"value" : 0.0,
"description" : "# clause",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "_type:__skills",
"details" : [ ]
}
]
}
]
}
]
}
}
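One reading of the explanation above, offered as a sketch rather than a confirmed answer: the description "Score based on 2 child docs ..." suggests that the function_score runs once per nested skills object. A single object has only one slug, so it can match only one of the two term filters and scores 0 + 2 = 2. The nested query then combines the child scores with its own score_mode, which defaults to avg, and avg(2, 2) is 2.0. The Python below only mimics that arithmetic; the helper names are made up for illustration and nothing here calls Elasticsearch.

```python
from statistics import mean

# Weights from the query: each slug filter contributes weight 2 when it matches.
SLUG_WEIGHTS = {"devops": 2.0, "infrastructure-as-code-iac": 2.0}


def child_function_score(slug):
    """Mimic function_score (score_mode=sum, boost_mode=replace) for ONE nested object."""
    script_score = 0.0  # the script_score function with script "0"
    filter_score = SLUG_WEIGHTS.get(slug, 0.0)  # one object has one slug, so at most one filter matches
    return script_score + filter_score


def nested_score(slugs, score_mode="avg"):
    """Combine per-child scores the way the nested query's own score_mode would."""
    child_scores = [child_function_score(s) for s in slugs]
    combine = {"avg": mean, "sum": sum, "max": max, "min": min}[score_mode]
    return float(combine(child_scores))


doc_slugs = ["infrastructure-as-code-iac", "devops"]
print(nested_score(doc_slugs))         # nested default (avg)
print(nested_score(doc_slugs, "sum"))  # nested score_mode set to sum
```

If this reading is right, adding "score_mode": "sum" to the nested query itself (next to "path") would be the change that produces 4.0; the score_mode inside function_score only sums the functions within one nested object.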

Related

What's the difference between Id's query and Term Query when finding documents by "_id"?

I want to get a document by "_id". I have 3 choices:
GET the document by "_id":
GET order/_doc/001
Use an ids query:
GET order/_search { "query": { "ids" : { "values" : ["001"] } } }
The ids query takes an array of ids, but I will use it to get only one document at a time, so I pass just one id in "values" : ["001"].
Use a term query:
GET order/_search { "query": {"term": {"_id" : "001"}}}
I want to know the difference between the ids query and the term query, performance-wise and in any other respect I should be aware of.
Which one should I choose (between the ids query and the term query)?
Any help is much appreciated :)
The first option is not a search and simply gets the document by id.
If you look at the execution plan of the second and third queries, you'll notice that they are identical:
Ids query:
GET order/_search
{
"explain": true,
"query": {
"ids": {
"values": ["001"]
}
}
}
Execution plan:
"_explanation" : {
"value" : 1.0,
"description" : "sum of:",
"details" : [
{
"value" : 1.0,
"description" : "ConstantScore(_id:[fe 0 1f])",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "match on required clause, product of:",
"details" : [
{
"value" : 0.0,
"description" : "# clause",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "DocValuesFieldExistsQuery [field=_primary_term]",
"details" : [ ]
}
]
}
]
}
Term query:
GET order/_search
{
"explain": true,
"query": {
"term": {
"_id": "001"
}
}
}
Execution plan:
"_explanation" : {
"value" : 1.0,
"description" : "sum of:",
"details" : [
{
"value" : 1.0,
"description" : "ConstantScore(_id:[fe 0 1f])",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "match on required clause, product of:",
"details" : [
{
"value" : 0.0,
"description" : "# clause",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "DocValuesFieldExistsQuery [field=_primary_term]",
"details" : [ ]
}
]
}
]
}
Any difference? None!

How to multi_match search in Elasticsearch on the main object and a nested array of objects?

I'm using Elasticsearch v7 and I have a mapped object like the one below.
items is a nested array of objects.
My problem is that when I search the items fields with multi_match, it does not work as I expect: the result is empty. But when I search with a nested query and bool, it finds my document.
I don't fully understand the difference. As I understand it, the "query search" below is for exact matches, used for filtering and aggregating data, and multi_match is for full-text search and autocomplete. Is that right?
And how do I find documents by searching in both root fields and nested fields?
{
"orders" : {
"aliases" : { },
"mappings" : {
"properties" : {
"amazonOrderId" : {
"type" : "keyword"
},
"carrierCode" : {
"type" : "text"
},
"carrierName" : {
"type" : "text"
},
"id" : {
"type" : "keyword"
},
"items" : {
"type" : "nested",
"properties" : {
"amazonItemId" : {
"type" : "keyword"
},
"amazonPrice" : {
"type" : "integer"
},
"amazonQuantity" : {
"type" : "integer"
},
"amazonSku" : {
"type" : "keyword"
},
"graingerItem" : {
"type" : "nested"
},
"graingerOrderId" : {
"type" : "keyword"
},
"graingerPrice" : {
"type" : "integer"
},
"graingerShipDate" : {
"type" : "date"
},
"graingerShipMethod" : {
"type" : "short"
},
"graingerTrackingNumber" : {
"type" : "keyword"
},
"graingerWebNumber" : {
"type" : "keyword"
},
"id" : {
"type" : "keyword"
}
}
}
}
}
}
}
multi_match request
GET orders/_search
{
"query":{
"multi_match" : {
"query": "4.48 - 1 pack - 4.48",
"fields": [
"items.amazonSku",
"carrierCode",
"recipientName"
]
}
}
}
Debugging with the _explain API returns this description:
"explanation" : {
"value" : 0.0,
"description" : "Failure to meet condition(s) of required/prohibited clause(s)",
"details" : [
{
"value" : 0.0,
"description" : "no match on required clause (items.amazonSku:4.48 - 1 pack - 4.48)",
"details" : [
{
"value" : 0.0,
"description" : "no matching term",
"details" : [ ]
}
]
},
{
"value" : 0.0,
"description" : "match on required clause, product of:",
"details" : [
{
"value" : 0.0,
"description" : "# clause",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "DocValuesFieldExistsQuery [field=_primary_term]",
"details" : [ ]
}
]
}
]
}
Query search
GET orders/_search
{
"query": {
"nested": {
"path": "items",
"query": {
"bool": {
"must": [
{ "match": { "items.amazonSku": "4.48 - 1 pack - 4.48"}}
]
}
}
}
}
}
Since you are querying the nested field items, you need to wrap the query in a nested query so that it searches the nested objects.
Modify your search as follows:
{
"query": {
"nested": {
"path": "items",
"query": {
"multi_match": {
"query": "4.48 - 1 pack - 4.48",
"fields": [
"items.amazonSku"
]
}
}
}
}
}
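For the remaining part of the question (searching root fields and nested fields together), one common pattern is a bool query whose should clauses hold a root-level multi_match and a nested multi_match. The sketch below builds such a request body as a Python dict. The helper name build_order_query is hypothetical, and recipientName is taken from the question even though it does not appear in the mapping shown.

```python
import json


def build_order_query(text, root_fields, nested_path, nested_fields):
    """Build a search body that matches `text` in root fields OR in nested fields."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Root-level fields can be searched directly.
                    {"multi_match": {"query": text, "fields": root_fields}},
                    # Nested fields must be wrapped in a nested query with their path.
                    {
                        "nested": {
                            "path": nested_path,
                            "query": {
                                "multi_match": {"query": text, "fields": nested_fields}
                            },
                        }
                    },
                ],
                # At least one of the two clauses has to match.
                "minimum_should_match": 1,
            }
        }
    }


body = build_order_query(
    "4.48 - 1 pack - 4.48",
    root_fields=["carrierCode", "recipientName"],
    nested_path="items",
    nested_fields=["items.amazonSku"],
)
print(json.dumps(body, indent=2))  # usable as the body of GET orders/_search
```

Keeping the nested clause separate matters because fields under a nested mapping are indexed as hidden child documents, so a root-level multi_match never sees them.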

Elastic search different query norm across shards

I'm rather new to ES and have been studying scoring in an attempt to improve the quality of search results. I have come across a situation in which queryNorm is very different (5X as large) across shards. I can see the dependency on the idf of the query terms, which can differ across shards. However, in my case I have a single search term, and the idf values across shards are close to each other (definitely not enough to cause a 5X difference). I will briefly describe my setup, including my query and the result from the explain endpoint.
Setup
I have an index with ~6500 docs distributed across 5 shards. There are no index-time boosts on the fields that appear in the query below. My setup uses ES 2.4 with "query_then_fetch". My query:
{
"query" : {
"bool" : {
"must" : [ {
"bool" : {
"must" : [ ],
"must_not" : [ ],
"should" : [ {
"multi_match" : {
"query" : "pds",
"fields" : [ "field1" ],
"lenient" : true,
"fuzziness" : "0"
}
}, {
"multi_match" : {
"query" : "pds",
"fields" : [ "field2" ],
"lenient" : true,
"fuzziness" : "0",
"boost" : 1000.0
}
}, {
"multi_match" : {
"query" : "pds",
"fields" : [ "field3" ],
"lenient" : true,
"fuzziness" : "0",
"boost" : 500.0
}
}, {
"multi_match" : {
"query" : "pds",
"fields" : [ "field4" ],
"lenient" : true,
"fuzziness" : "0",
"boost": 100.0
}
} ]
}
} ],
"must_not" : [ ],
"should" : [ ],
"filter" : [ ]
}
},
"size" : 1000,
"min_score" : 0.0
}
Explain output for 2 of the documents (one having query norm 5X times as large as the other one):
{
"_shard" : 4,
"_explanation" : {
"value" : 2.046937,
"description" : "product of:",
"details" : [ {
"value" : 4.093874,
"description" : "sum of:",
"details" : [ {
"value" : 0.112607226,
"description" : "weight(field1:pds in 93) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.112607226,
"description" : "score(doc=93,freq=1.0), product of:",
"details" : [ {
"value" : 0.019996,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 2.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 0.0017753748,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 5.6314874,
"description" : "fieldWeight in 93, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
} ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 1.0,
"description" : "fieldNorm(doc=93)",
"details" : [ ]
} ]
} ]
} ]
}, {
"value" : 3.9812667,
"description" : "weight(field4:pds in 93) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 3.9812667,
"description" : "score(doc=93,freq=2.0), product of:",
"details" : [ {
"value" : 0.9998001,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 100.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 0.0017753748,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 3.9820628,
"description" : "fieldWeight in 93, product of:",
"details" : [ {
"value" : 1.4142135,
"description" : "tf(freq=2.0), with freq of:",
"details" : [ {
"value" : 2.0,
"description" : "termFreq=2.0",
"details" : [ ]
} ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 0.5,
"description" : "fieldNorm(doc=93)",
"details" : [ ]
} ]
} ]
} ]
} ]
}, {
"value" : 0.5,
"description" : "coord(2/4)",
"details" : [ ]
} ]
}
},
{
"_shard" : 2,
"_explanation" : {
"value" : 0.4143453,
"description" : "product of:",
"details" : [ {
"value" : 0.8286906,
"description" : "sum of:",
"details" : [ {
"value" : 0.018336227,
"description" : "weight(field1:pds in 58) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.018336227,
"description" : "score(doc=58,freq=1.0), product of:",
"details" : [ {
"value" : 0.0030464241,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 2.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 2.5307006E-4,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 6.0189342,
"description" : "fieldWeight in 58, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
} ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 1.0,
"description" : "fieldNorm(doc=58)",
"details" : [ ]
} ]
} ]
} ]
}, {
"value" : 0.81035435,
"description" : "weight(field4:pds in 58) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.81035435,
"description" : "score(doc=58,freq=2.0), product of:",
"details" : [ {
"value" : 0.1523212,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 100.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 2.5307006E-4,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 5.3200364,
"description" : "fieldWeight in 58, product of:",
"details" : [ {
"value" : 1.4142135,
"description" : "tf(freq=2.0), with freq of:",
"details" : [ {
"value" : 2.0,
"description" : "termFreq=2.0",
"details" : [ ]
} ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 0.625,
"description" : "fieldNorm(doc=58)",
"details" : [ ]
} ]
} ]
} ]
} ]
}, {
"value" : 0.5,
"description" : "coord(2/4)",
"details" : [ ]
} ]
}
}
Notice how the queryNorm on field1 from the document in shard 4 is 0.0017753748 (with idf 5.6314874), while the queryNorm for the same field for the doc in shard 2 is 2.5307006E-4 (with idf 6.0189342). I've tried to follow the calculation for queryNorm by hand using the formula on http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html , but failed to arrive at the same values.
I haven't seen many threads or posts about calculating queryNorm. One which I've found useful is http://www.openjems.com/tag/querynorm/ (this is actually Solr, but since the search type is "query_then_fetch", the Lucene calculations should be the only thing that matters, so I expect they behave similarly). However, I couldn't derive the right queryNorm values using the same approach (as far as I understand, t.getBoost() should be 1 in my case, since there are no index-time field boosts and no special field boost in the query above).
Does anyone have any suggestion as to what might be going on here?
You can set search_type to dfs_query_then_fetch. It is passed as a URL parameter on the search request (for example, ?search_type=dfs_query_then_fetch) rather than inside the request body:
{
"query": {
"bool": {
"must": [
{
"bool": {
"must": [],
"must_not": [],
"should": [
{
"multi_match": {
"query": "pds",
"fields": [
"field1"
],
"lenient": true,
"fuzziness": "0"
}
},
{
"multi_match": {
"query": "pds",
"fields": [
"field2"
],
"lenient": true,
"fuzziness": "0",
"boost": 1000.0
}
}
]
}
},
{
"multi_match": {
"query": "pds",
"fields": [
"field3"
],
"lenient": true,
"fuzziness": "0",
"boost": 500.0
}
},
{
"multi_match": {
"query": "pds",
"fields": [
"field4"
],
"lenient": true,
"fuzziness": "0",
"boost": 100.0
}
}
],
"must_not": [],
"should": [],
"filter": []
}
},
"size": 1000,
"min_score": 0.0
}
In this case all norm values will be global, but it may impact query performance. If your index is small, you can also create it with a single shard. With many more documents, these values should not be that different.
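As a check on the hand calculation from the question: the classic Lucene TF-IDF similarity uses idf = 1 + ln(maxDocs / (docFreq + 1)) and queryNorm = 1 / sqrt(sum of (idf * boost)^2 over the query's term clauses). The Python sketch below plugs in the numbers from the explain output. For shard 4 the two visible clauses alone reproduce the reported queryNorm. For shard 2 they do not; the leftover weight happens to equal what a boost-500 clause on a term with docFreq=1 would add. That last part is only a hypothesis (that field3 contains "pds" in one document on that shard), not something the output shows directly.

```python
import math


def idf(doc_freq, max_docs):
    """Classic Lucene TF-IDF inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def query_norm(clauses):
    """clauses: iterable of (boost, idf) pairs for the term clauses seen by one shard."""
    sum_of_squares = sum((boost * term_idf) ** 2 for boost, term_idf in clauses)
    return 1.0 / math.sqrt(sum_of_squares)


# Shard 4: boosts 2 and 100, docFreq=11, maxDocs=1232 (values from the explain output).
idf_shard4 = idf(11, 1232)
norm_shard4 = query_norm([(2.0, idf_shard4), (100.0, idf_shard4)])
print(idf_shard4, norm_shard4)  # compare with 5.6314874 and 0.0017753748

# Shard 2: the same two clauses give a much larger norm than the reported 2.5307006E-4.
idf_shard2 = idf(11, 1815)
visible = [(2.0, idf_shard2), (100.0, idf_shard2)]
print(query_norm(visible))

# Hypothesis (not shown in the output): a third clause with boost 500 and docFreq=1.
norm_shard2 = query_norm(visible + [(500.0, idf(1, 1815))])
print(norm_shard2)  # compare with 2.5307006E-4
```

If the hypothesis holds, the gap comes from which boosted clauses have matching terms on each shard, not from the small idf difference on field1. That would also fit the answer above, since global term statistics make every shard see the same set of clauses.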

Elasticsearch Array (Label/Tag) Querying

I really think that what I'm trying to do is fairly simple: I'm trying to query for N tags. A clear example of this was asked and answered over at "Elasticsearch: How to use two different multiple matching fields?". Yet that solution doesn't seem to work for the latest version of ES (more likely, I'm simply doing it wrong).
To show the current data and to demonstrate a working query, see below:
{
"query": {
"filtered": {
"filter": {
"terms": {
"Price": [10,5]
}
}
}
}
}
Here are the results for this. As you can see, 5 and 10 are showing up (this demonstrates that basic queries do work):
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 1.0,
"hits" : [ {
"_index" : "labelsample",
"_type" : "entry",
"_id" : "AVLGnGMYXB5vRcKBZaDw",
"_score" : 1.0,
"_source" : {
"Category" : [ "Medium Signs" ],
"Code" : "a",
"Name" : "Sample 1",
"Timestamp" : 1.455031083799152E9,
"Price" : "10",
"IsEnabled" : true
}
}, {
"_index" : "labelsample",
"_type" : "entry",
"_id" : "AVLGnGHHXB5vRcKBZaDF",
"_score" : 1.0,
"_source" : {
"Category" : [ "Small Signs" ],
"Code" : "b",
"Name" : "Sample 2",
"Timestamp" : 1.45503108346191E9,
"Price" : "5",
"IsEnabled" : true
}
}, {
"_index" : "labelsample",
"_type" : "entry",
"_id" : "AVLGnGILXB5vRcKBZaDO",
"_score" : 1.0,
"_source" : {
"Category" : [ "Medium Signs" ],
"Code" : "c",
"Name" : "Sample 3",
"Timestamp" : 1.455031083530215E9,
"Price" : "10",
"IsEnabled" : true
}
}, {
"_index" : "labelsample",
"_type" : "entry",
"_id" : "AVLGnGGgXB5vRcKBZaDA",
"_score" : 1.0,
"_source" : {
"Category" : [ "Medium Signs" ],
"Code" : "d",
"Name" : "Sample 4",
"Timestamp" : 1.4550310834233E9,
"Price" : "10",
"IsEnabled" : true
}
}]
}
}
As a side note: the following bool query gives the exact same results:
{
"query": {
"bool": {
"must": [{
"terms": {
"Price": [10,5]
}
}]
}
}
}
Notice Category...
Let's simply copy/paste Category into a query:
{
"query": {
"filtered": {
"filter": {
"terms": {
"Category" : [ "Medium Signs" ]
}
}
}
}
}
This gives the following gem:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
Again, here's the bool query version that gives the same 0-hit result:
{
"query": {
"bool": {
"must": [{
"terms": {
"Category" : [ "Medium Signs" ]
}
}]
}
}
}
In the end, I definitely need something similar to "Category" : [ "Medium Signs", "Small Signs" ] working (in concert with other label queries and minimum_should_match as well-- but I can't even get this bare-bones query to work).
I have zero clue why this is. I pored over the docs for hours, trying everything I could see. Do I need to look into debugging various encodings? Is my syntax archaic?
The problem here is that Elasticsearch is analyzing and tokenizing the Category field, while the terms filter expects an exact match. One solution is to add a raw field to Category inside your entry mapping:
PUT labelsample
{
"mappings": {
"entry": {
"properties": {
"Category": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"Code": {
"type": "string"
},
"Name": {
"type": "string"
},
"Timestamp": {
"type": "date",
"format": "epoch_millis"
},
"Price": {
"type": "string"
},
"IsEnabled": {
"type": "boolean"
}
}
}
}
}
...and filter on the raw field:
GET labelsample/entry/_search
{
"query": {
"filtered": {
"filter": {
"terms": {
"Category.raw" : [ "Medium Signs" ]
}
}
}
}
}
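To see why the exact value fails against the analyzed field, here is a rough Python approximation of what the default analyzer does at index time. It lowercases and splits on non-alphanumeric characters; the real standard analyzer follows Unicode segmentation rules, so treat this as a simplification. The terms filter compares its values against the indexed terms without analyzing them, which is why "Medium Signs" finds nothing in the analyzed field but matches the not_analyzed Category.raw.

```python
import re


def approx_standard_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def terms_filter_matches(indexed_terms, wanted):
    """A terms filter matches when any wanted value equals an indexed term exactly."""
    return any(value in indexed_terms for value in wanted)


category = "Medium Signs"
analyzed_terms = set(approx_standard_analyze(category))  # what Category holds
raw_terms = {category}                                   # what Category.raw holds

print(sorted(analyzed_terms))
print(terms_filter_matches(analyzed_terms, ["Medium Signs"]))  # analyzed field
print(terms_filter_matches(raw_terms, ["Medium Signs"]))       # not_analyzed field
```

The Price filter in the question works for the same reason in reverse: single values such as "10" and "5" survive analysis unchanged, so the indexed term still equals the filter value.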

Elasticsearch gives different scores for same documents

I have some documents with the same content, but when I query for them I get different scores even though the queried field contains the same text. I have run the query with explain, but I am not able to analyse the output and find the reason for the different scores.
My query is
curl 'localhost:9200/acqindex/_search?pretty=1' -d '{
"explain" : true,
"query" : {
"query_string" : {
"query" : "text:shimla"
}
}
}'
Search response :
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 31208,
"max_score" : 268.85962,
"hits" : [ {
"_shard" : 0,
"_node" : "KOebAnGhSJKUHLPNxndcpQ",
"_index" : "acqindex",
"_type" : "autocomplete_questions",
"_id" : "50efec6c38cc6fdabd8653a3",
"_score" : 268.85962, "_source" : {"_class":"com.ixigo.next.cms.model.AutoCompleteObject","_id":"50efec6c38cc6fdabd8653a3","ad":"rajasthan,IN","category":["Destination"],"ctype":"destination","eid":"503b2a65e4b032e338f0d24b","po":8.772307692307692,"text":"shimla","url":"/travel-guide/shimla"},
"_explanation" : {
"value" : 268.85962,
"description" : "sum of:",
"details" : [ {
"value" : 38.438133,
"description" : "weight(text:shi in 5860), product of:",
"details" : [ {
"value" : 0.37811017,
"description" : "queryWeight(text:shi), product of:",
"details" : [ {
"value" : 5.0829277,
"description" : "idf(docFreq=7503, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 101.658554,
"description" : "fieldWeight(text:shi in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shi)=1)"
}, {
"value" : 5.0829277,
"description" : "idf(docFreq=7503, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
}, {
"value" : 66.8446,
"description" : "weight(text:shim in 5860), product of:",
"details" : [ {
"value" : 0.49862078,
"description" : "queryWeight(text:shim), product of:",
"details" : [ {
"value" : 6.7029495,
"description" : "idf(docFreq=1484, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 134.05899,
"description" : "fieldWeight(text:shim in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shim)=1)"
}, {
"value" : 6.7029495,
"description" : "idf(docFreq=1484, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
}, {
"value" : 81.75818,
"description" : "weight(text:shiml in 5860), product of:",
"details" : [ {
"value" : 0.5514458,
"description" : "queryWeight(text:shiml), product of:",
"details" : [ {
"value" : 7.413075,
"description" : "idf(docFreq=729, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 148.2615,
"description" : "fieldWeight(text:shiml in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shiml)=1)"
}, {
"value" : 7.413075,
"description" : "idf(docFreq=729, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
}, {
"value" : 81.8187,
"description" : "weight(text:shimla in 5860), product of:",
"details" : [ {
"value" : 0.55164987,
"description" : "queryWeight(text:shimla), product of:",
"details" : [ {
"value" : 7.415818,
"description" : "idf(docFreq=727, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 148.31636,
"description" : "fieldWeight(text:shimla in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shimla)=1)"
}, {
"value" : 7.415818,
"description" : "idf(docFreq=727, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
} ]
}
}, {
"_shard" : 1,
"_node" : "KOebAnGhSJKUHLPNxndcpQ",
"_index" : "acqindex",
"_type" : "autocomplete_questions",
"_id" : "50efed1c38cc6fdabd8b8d2f",
"_score" : 268.29953, "_source" : {"_id":"50efed1c38cc6fdabd8b8d2f","ad":"himachal pradesh,IN","category":["Hill","See and Do","Destination","Mountain","Nature and Wildlife"],"ctype":"destination","eid":"503b2a64e4b032e338f0d0af","po":8.781970310391364,"text":"shimla","url":"/travel-guide/shimla"},
"_explanation" : {
"value" : 268.29953,
"description" : "sum of:",
"details" : [ {
"value" : 38.52957,
"description" : "weight(text:shi in 14769), product of:",
"details" : [ {
"value" : 0.37895453,
"description" : "queryWeight(text:shi), product of:",
"details" : [ {
"value" : 5.083667,
"description" : "idf(docFreq=7263, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 101.67334,
"description" : "fieldWeight(text:shi in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shi)=1)"
}, {
"value" : 5.083667,
"description" : "idf(docFreq=7263, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
}, {
"value" : 66.67524,
"description" : "weight(text:shim in 14769), product of:",
"details" : [ {
"value" : 0.49850821,
"description" : "queryWeight(text:shim), product of:",
"details" : [ {
"value" : 6.6874766,
"description" : "idf(docFreq=1460, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 133.74953,
"description" : "fieldWeight(text:shim in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shim)=1)"
}, {
"value" : 6.6874766,
"description" : "idf(docFreq=1460, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
}, {
"value" : 81.53204,
"description" : "weight(text:shiml in 14769), product of:",
"details" : [ {
"value" : 0.5512571,
"description" : "queryWeight(text:shiml), product of:",
"details" : [ {
"value" : 7.3951015,
"description" : "idf(docFreq=719, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 147.90204,
"description" : "fieldWeight(text:shiml in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shiml)=1)"
}, {
"value" : 7.3951015,
"description" : "idf(docFreq=719, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
}, {
"value" : 81.56268,
"description" : "weight(text:shimla in 14769), product of:",
"details" : [ {
"value" : 0.55136067,
"description" : "queryWeight(text:shimla), product of:",
"details" : [ {
"value" : 7.3964915,
"description" : "idf(docFreq=718, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 147.92982,
"description" : "fieldWeight(text:shimla in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shimla)=1)"
}, {
"value" : 7.3964915,
"description" : "idf(docFreq=718, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
} ]
}
} ]
}
}
The documents are :
{"_class":"com.ixigo.next.cms.model.AutoCompleteObject","_id":"50efec6c38cc6fdabd8653a3","ad":"rajasthan,IN","category":["Destination"],"ctype":"destination","eid":"503b2a65e4b032e338f0d24b","po":8.772307692307692,"text":"shimla","url":"/travel-guide/shimla"}
{"_id":"50efed1c38cc6fdabd8b8d2f","ad":"himachal pradesh,IN","category":["Hill","See and Do","Destination","Mountain","Nature and Wildlife"],"ctype":"destination","eid":"503b2a64e4b032e338f0d0af","po":8.781970310391364,"text":"shimla","url":"/travel-guide/shimla"}
Please guide me in understanding the reason for the difference in scores.
The Lucene score depends on several factors. With the TF-IDF similarity (the default one), it mainly depends on:
Term frequency: how often the matched terms occur within the document.
Inverse document frequency: how often the matched terms occur across the documents in the index.
Field norms (including index-time boosting): shorter fields get a higher score than longer ones.
In your case you have to take into account that your two documents come from different shards, thus the score is computed separately on each of those, since every shard is in fact a separate lucene index.
You might want to have a look at the more expensive DFS, Query then Fetch search type that elasticsearch provides for more accurate scoring. The default one is the simple Query then Fetch.
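To make the per-shard effect concrete, the two scores can be rebuilt from the numbers in the explain output. The Python sketch below follows the layout shown there: each term contributes queryWeight (idf * queryNorm) times fieldWeight (tf * idf * fieldNorm), and the document score is the sum over the four terms. The function name classic_score is made up for illustration; all input values are copied from the two _explanation blocks.

```python
def classic_score(query_norm, field_norm, idfs, tf=1.0):
    """Sum of queryWeight * fieldWeight per term, as laid out in the explain output."""
    total = 0.0
    for term_idf in idfs:
        query_weight = term_idf * query_norm
        field_weight = tf * term_idf * field_norm
        total += query_weight * field_weight
    return total


# Per-shard values for the terms shi, shim, shiml, shimla.
shard0 = classic_score(0.074388266, 20.0, [5.0829277, 6.7029495, 7.413075, 7.415818])
shard1 = classic_score(0.07454354, 20.0, [5.083667, 6.6874766, 7.3951015, 7.3964915])

print(round(shard0, 3), round(shard1, 3))  # compare with 268.85962 and 268.29953
```

tf and fieldNorm are identical for both documents, so the whole gap comes from the idf and queryNorm values, which each shard derives from its own docFreq and maxDocs.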
javanna clearly pointed out the problem: the difference in scores comes from the fact that scoring happens on multiple shards. Those shards may hold different numbers of documents, which affects the scoring algorithm.
However, authors of Elasticsearch: The Definitive Guide inform:
The differences between local and global IDF [inverse document frequency] diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.
You should not use dfs_query_then_fetch in production. For testing, put your index on one primary shard or specify ?search_type=dfs_query_then_fetch.

Resources