Elasticsearch gives different scores for same documents

I have some documents with identical content, but when I query for them I get different scores even though the queried field contains the same text. I have run the query with explain enabled, but I am not able to analyse the output and find the reason for the different scores.
My query is:
curl 'localhost:9200/acqindex/_search?pretty=1' -d '{
"explain" : true,
"query" : {
"query_string" : {
"query" : "text:shimla"
}
}
}'
Search response:
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 31208,
"max_score" : 268.85962,
"hits" : [ {
"_shard" : 0,
"_node" : "KOebAnGhSJKUHLPNxndcpQ",
"_index" : "acqindex",
"_type" : "autocomplete_questions",
"_id" : "50efec6c38cc6fdabd8653a3",
"_score" : 268.85962, "_source" : {"_class":"com.ixigo.next.cms.model.AutoCompleteObject","_id":"50efec6c38cc6fdabd8653a3","ad":"rajasthan,IN","category":["Destination"],"ctype":"destination","eid":"503b2a65e4b032e338f0d24b","po":8.772307692307692,"text":"shimla","url":"/travel-guide/shimla"},
"_explanation" : {
"value" : 268.85962,
"description" : "sum of:",
"details" : [ {
"value" : 38.438133,
"description" : "weight(text:shi in 5860), product of:",
"details" : [ {
"value" : 0.37811017,
"description" : "queryWeight(text:shi), product of:",
"details" : [ {
"value" : 5.0829277,
"description" : "idf(docFreq=7503, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 101.658554,
"description" : "fieldWeight(text:shi in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shi)=1)"
}, {
"value" : 5.0829277,
"description" : "idf(docFreq=7503, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
}, {
"value" : 66.8446,
"description" : "weight(text:shim in 5860), product of:",
"details" : [ {
"value" : 0.49862078,
"description" : "queryWeight(text:shim), product of:",
"details" : [ {
"value" : 6.7029495,
"description" : "idf(docFreq=1484, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 134.05899,
"description" : "fieldWeight(text:shim in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shim)=1)"
}, {
"value" : 6.7029495,
"description" : "idf(docFreq=1484, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
}, {
"value" : 81.75818,
"description" : "weight(text:shiml in 5860), product of:",
"details" : [ {
"value" : 0.5514458,
"description" : "queryWeight(text:shiml), product of:",
"details" : [ {
"value" : 7.413075,
"description" : "idf(docFreq=729, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 148.2615,
"description" : "fieldWeight(text:shiml in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shiml)=1)"
}, {
"value" : 7.413075,
"description" : "idf(docFreq=729, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
}, {
"value" : 81.8187,
"description" : "weight(text:shimla in 5860), product of:",
"details" : [ {
"value" : 0.55164987,
"description" : "queryWeight(text:shimla), product of:",
"details" : [ {
"value" : 7.415818,
"description" : "idf(docFreq=727, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 148.31636,
"description" : "fieldWeight(text:shimla in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shimla)=1)"
}, {
"value" : 7.415818,
"description" : "idf(docFreq=727, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
} ]
}
}, {
"_shard" : 1,
"_node" : "KOebAnGhSJKUHLPNxndcpQ",
"_index" : "acqindex",
"_type" : "autocomplete_questions",
"_id" : "50efed1c38cc6fdabd8b8d2f",
"_score" : 268.29953, "_source" : {"_id":"50efed1c38cc6fdabd8b8d2f","ad":"himachal pradesh,IN","category":["Hill","See and Do","Destination","Mountain","Nature and Wildlife"],"ctype":"destination","eid":"503b2a64e4b032e338f0d0af","po":8.781970310391364,"text":"shimla","url":"/travel-guide/shimla"},
"_explanation" : {
"value" : 268.29953,
"description" : "sum of:",
"details" : [ {
"value" : 38.52957,
"description" : "weight(text:shi in 14769), product of:",
"details" : [ {
"value" : 0.37895453,
"description" : "queryWeight(text:shi), product of:",
"details" : [ {
"value" : 5.083667,
"description" : "idf(docFreq=7263, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 101.67334,
"description" : "fieldWeight(text:shi in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shi)=1)"
}, {
"value" : 5.083667,
"description" : "idf(docFreq=7263, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
}, {
"value" : 66.67524,
"description" : "weight(text:shim in 14769), product of:",
"details" : [ {
"value" : 0.49850821,
"description" : "queryWeight(text:shim), product of:",
"details" : [ {
"value" : 6.6874766,
"description" : "idf(docFreq=1460, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 133.74953,
"description" : "fieldWeight(text:shim in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shim)=1)"
}, {
"value" : 6.6874766,
"description" : "idf(docFreq=1460, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
}, {
"value" : 81.53204,
"description" : "weight(text:shiml in 14769), product of:",
"details" : [ {
"value" : 0.5512571,
"description" : "queryWeight(text:shiml), product of:",
"details" : [ {
"value" : 7.3951015,
"description" : "idf(docFreq=719, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 147.90204,
"description" : "fieldWeight(text:shiml in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shiml)=1)"
}, {
"value" : 7.3951015,
"description" : "idf(docFreq=719, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
}, {
"value" : 81.56268,
"description" : "weight(text:shimla in 14769), product of:",
"details" : [ {
"value" : 0.55136067,
"description" : "queryWeight(text:shimla), product of:",
"details" : [ {
"value" : 7.3964915,
"description" : "idf(docFreq=718, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 147.92982,
"description" : "fieldWeight(text:shimla in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shimla)=1)"
}, {
"value" : 7.3964915,
"description" : "idf(docFreq=718, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
} ]
}
}
}
}
The documents are:
{"_class":"com.ixigo.next.cms.model.AutoCompleteObject","_id":"50efec6c38cc6fdabd8653a3","ad":"rajasthan,IN","category":["Destination"],"ctype":"destination","eid":"503b2a65e4b032e338f0d24b","po":8.772307692307692,"text":"shimla","url":"/travel-guide/shimla"}
{"_id":"50efed1c38cc6fdabd8b8d2f","ad":"himachal
pradesh,IN","category":["Hill","See and
Do","Destination","Mountain","Nature and Wildlife"],"ctype":"destination","eid":"503b2a64e4b032e338f0d0af","po":8.781970310391364,"text":"shimla","url":"/travel-guide/shimla"}
Please guide me in understanding the reason for the difference in scores.

The Lucene score depends on several factors. With the default TF-IDF similarity it mainly depends on:
Term frequency: how frequently the terms appear within the document
Inverse document frequency: how rare the terms are across all documents in the index (rarer terms weigh more)
Field norms, including index-time boosting; shorter fields get a higher score than longer ones
In your case you have to take into account that your two documents come from different shards, so the score is computed separately on each of them, since every shard is in fact a separate Lucene index.
You might want to have a look at the more expensive DFS Query Then Fetch search type that Elasticsearch provides for more accurate scoring. The default is the simpler Query Then Fetch.
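To make the per-shard arithmetic concrete, here is the first hit's weight for the term text:shi rebuilt in Python from the numbers in its explanation above (pure arithmetic, nothing Elasticsearch-specific):
# weight = queryWeight * fieldWeight = (idf * queryNorm) * (tf * idf * fieldNorm)
# Values copied from the first hit's _explanation (shard 0).
idf, query_norm = 5.0829277, 0.074388266
tf, field_norm = 1.0, 20.0

query_weight = idf * query_norm       # 0.37811017
field_weight = tf * idf * field_norm  # 101.658554
print(query_weight * field_weight)    # 38.438133, the reported weight(text:shi)

# The second hit lives on shard 1, where the same term has idf 5.083667
# (docFreq=7263, maxDocs=431211 instead of 7503/445129), so every per-term
# weight, and therefore the final sum, comes out slightly different
# (268.29953 vs 268.85962).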

javanna clearly pointed out the problem: the difference in scores comes from the fact that scoring happens separately in multiple shards, and those shards may hold different numbers of documents, which affects the scoring algorithm.
However, the authors of Elasticsearch: The Definitive Guide note:
The differences between local and global IDF [inverse document frequency] diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.
The takeaway: don't use dfs_query_then_fetch in production. For testing, either put your index on a single primary shard or specify ?search_type=dfs_query_then_fetch in your requests.
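For testing, here is a minimal sketch of the same query issued with the DFS search type, using Python's requests library (host, port, and index name taken from the question):
import requests

# dfs_query_then_fetch first collects global term statistics from all shards,
# so idf is computed over the whole index rather than per shard.
body = {"explain": True, "query": {"query_string": {"query": "text:shimla"}}}
resp = requests.post(
    "http://localhost:9200/acqindex/_search",
    params={"search_type": "dfs_query_then_fetch", "pretty": "1"},
    json=body,
)
print(resp.json()["hits"]["max_score"])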

Related

How to calculate gauss value?

I am just curious to know how this value was computed. I applied the formula but I think I am missing something; can anybody tell me, please? I am running a single-node ELK stack, version 7.16.
POST sneaker/_search
{
"query": {
"function_score": {
"functions": [
{
"gauss": {
"price": {
"origin": "300",
"scale": "200"
}
}
}
]
}
}
, "explain": true
}
Query result
"max_score" : 1.0,
"hits" : [
{
"_shard" : "[sneaker][0]",
"_node" : "29ds_f0VSM6_-eDNhdQPLw",
"_index" : "sneaker",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.0,
"_source" : {
"brand" : "flite",
"price" : 300,
"rating" : 2,
"release_date" : "2020-12-21"
},
"_explanation" : {
"value" : 1.0,
"description" : "function score, product of:",
"details" : [
{
"value" : 1.0,
"description" : "*:*",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "min of:",
"details" : [
{
"value" : 1.0,
"description" : "Function for field price:",
"details" : [
{
"value" : 1.0,
"description" : "exp(-0.5*pow(MIN[Math.max(Math.abs(300.0(=doc value) - 300.0(=origin))) - 0.0(=offset), 0)],2.0)/28853.900817779268)",
"details" : [ ]
}
]
},
I looked up the Gaussian distribution, but it is different from this.
I want to know where the value 28853.900817779268 comes from.
If you look at the official documentation for the gauss decay function, you'll find the following formula for computing sigma:
sigma^2 = -scale^2 / (2 * ln(decay))
Using scale = 200 and decay = 0.5 (the default value if unspecified), we get:
-200^2 / (2 * ln(0.5)) = 28853.90081
which is the value you're seeing in the explanation of the query.
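A quick way to verify both the constant and the final score is to replay the arithmetic in Python (pure math, no Elasticsearch involved; the origin, scale, and price values come from the question above):
import math

# sigma^2 = -scale^2 / (2 * ln(decay))
scale, decay = 200.0, 0.5
sigma_sq = -scale**2 / (2 * math.log(decay))
print(sigma_sq)  # 28853.900817779268, the constant in the explanation

# gauss decay score: exp(-0.5 * max(|value - origin| - offset, 0)^2 / sigma^2)
origin, offset, value = 300.0, 0.0, 300.0  # the document's price equals origin
distance = max(abs(value - origin) - offset, 0.0)
print(math.exp(-0.5 * distance ** 2 / sigma_sq))  # 1.0, the reported _score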

Elasticsearch `function_score` with `score_mode` confusion when used with nested objects

Background:
I have the following mapping for curriculum_posts documents. Notice the nested skills property.
{
"curriculum_posts" : {
"mappings" : {
"dynamic" : "false",
"properties" : {
"title" : {
"type" : "text",
"analyzer" : "english"
},
"skills" : {
"type" : "nested",
"properties" : {
"slug" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
},
"text" : {
"type" : "text"
}
}
},
"start_skill_level" : {
"type" : "keyword"
},
"start_skill_level_value" : {
"type" : "integer"
}
}
}
}
}
}
}
A sample record looks like this:
{
"_source" : {
"skills" : [
{
"start_skill_level_value" : 1,
"slug" : "infrastructure-as-code-iac"
},
{
"start_skill_level_value" : 1,
"slug" : "devops"
}
],
"title" : "Terraform: Infrastructure as code"
}
}
I wanted to run a query that returns all documents, but with scores reflecting the number of skills.slug values that matched. My query looks like this:
{
"query": {
"nested": {
"path": "skills",
"query": {
"function_score": {
"query": { "match_all": {} },
"functions": [
{ "script_score": { "script": "0" } },
{
"filter": {
"term": { "skills.slug.raw": { "value": "devops" } }
},
"weight": 2
},
{
"filter": {
"term": { "skills.slug.raw": { "value": "infrastructure-as-code-iac" } }
},
"weight": 2
}
],
"score_mode": "sum",
"boost_mode": "replace"
}
}
}
}
}
I decided to use function_score with boost_mode: replace so that the scores from the documents are ignored and only the function scores are taken, and score_mode: sum so that the scores from the matching functions are summed up.
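As a sanity check, the expectation can be written out as plain arithmetic in Python; note the assumption that all three functions are evaluated against one and the same Lucene document:
# Expected combination under score_mode: "sum" and boost_mode: "replace",
# assuming every function sees the same Lucene document.
function_outputs = [
    0.0,  # script_score returning "0"
    2.0,  # filter on skills.slug.raw == "devops", weight 2
    2.0,  # filter on skills.slug.raw == "infrastructure-as-code-iac", weight 2
]
print(sum(function_outputs))  # 4.0 expected; the query returns 2.0 instead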
The problem
So, for the above query on the example document, I was expecting the score to be 4.0, because it matches skills.slug for both infrastructure-as-code-iac and devops. However, the score in the result is only 2.0 for the document.
Question
I suppose I'm not understanding how function_score takes the scores from the functions, or how my functions are affecting the score. Could someone help me understand the scoring here?
Some debugging
I looked at the explanation but I'm unable to decode much information from it. Nevertheless, here is the explanation:
{
"_index" : "curriculum_posts",
"_type" : "_doc",
"_id" : "18",
"matched" : true,
"explanation" : {
"value" : 2.0,
"description" : "Score based on 2 child docs in range from 83 to 93, best match:",
"details" : [
{
"value" : 2.0,
"description" : "sum of:",
"details" : [
{
"value" : 2.0,
"description" : "min of:",
"details" : [
{
"value" : 2.0,
"description" : "function score, score mode [sum]",
"details" : [
{
"value" : 0.0,
"description" : "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='0', options={}, params={}}\"",
"details" : [
{
"value" : 1.0,
"description" : "_score: ",
"details" : [
{
"value" : 1.0,
"description" : "*:*",
"details" : [ ]
}
]
}
]
},
{
"value" : 2.0,
"description" : "function score, product of:",
"details" : [
{
"value" : 1.0,
"description" : "match filter: skills.slug.raw:infrastructure-as-code-iac",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "product of:",
"details" : [
{
"value" : 1.0,
"description" : "constant score 1.0 - no function provided",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "weight",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 3.4028235E38,
"description" : "maxBoost",
"details" : [ ]
}
]
},
{
"value" : 0.0,
"description" : "match on required clause, product of:",
"details" : [
{
"value" : 0.0,
"description" : "# clause",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "_type:__skills",
"details" : [ ]
}
]
}
]
}
]
}
}

Elasticsearch: different queryNorm across shards

I'm rather new to ES and I have been studying scoring in ES in an attempt to improve the quality of search results. I have come across a situation in which the queryNorm value is very different (5X as large) across shards. I can see the dependency on the idf of the terms in the query, which can differ across shards. However, in my case I have a single search term, and the idf measures across shards are close to each other (definitely not close enough to explain the 5X difference). I will briefly describe my setup below, including my query and the result from the explain endpoint.
Setup
I have an index with ~6,500 docs distributed across 5 shards. There are no index-time boosts on the fields that appear in the query below. My setup uses ES 2.4 with query_then_fetch. My query:
{
"query" : {
"bool" : {
"must" : [ {
"bool" : {
"must" : [ ],
"must_not" : [ ],
"should" : [ {
"multi_match" : {
"query" : "pds",
"fields" : [ "field1" ],
"lenient" : true,
"fuzziness" : "0"
}
}, {
"multi_match" : {
"query" : "pds",
"fields" : [ "field2" ],
"lenient" : true,
"fuzziness" : "0",
"boost" : 1000.0
}
}, {
"multi_match" : {
"query" : "pds",
"fields" : [ "field3" ],
"lenient" : true,
"fuzziness" : "0",
"boost" : 500.0
}
}, {
"multi_match" : {
"query" : "pds",
"fields" : [ "field4" ],
"lenient" : true,
"fuzziness" : "0",
"boost": 100.0
}
} ],
"must_not" : [ ],
"should" : [ ],
"filter" : [ ]
}
},
"size" : 1000,
"min_score" : 0.0
}
Explain output for 2 of the documents (one having a queryNorm 5X as large as the other):
{
"_shard" : 4,
"_explanation" : {
"value" : 2.046937,
"description" : "product of:",
"details" : [ {
"value" : 4.093874,
"description" : "sum of:",
"details" : [ {
"value" : 0.112607226,
"description" : "weight(field1:pds in 93) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.112607226,
"description" : "score(doc=93,freq=1.0), product of:",
"details" : [ {
"value" : 0.019996,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 2.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 0.0017753748,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 5.6314874,
"description" : "fieldWeight in 93, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
} ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 1.0,
"description" : "fieldNorm(doc=93)",
"details" : [ ]
} ]
} ]
} ]
}, {
"value" : 3.9812667,
"description" : "weight(field4:pds in 93) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 3.9812667,
"description" : "score(doc=93,freq=2.0), product of:",
"details" : [ {
"value" : 0.9998001,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 100.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 0.0017753748,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 3.9820628,
"description" : "fieldWeight in 93, product of:",
"details" : [ {
"value" : 1.4142135,
"description" : "tf(freq=2.0), with freq of:",
"details" : [ {
"value" : 2.0,
"description" : "termFreq=2.0",
"details" : [ ]
} ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 0.5,
"description" : "fieldNorm(doc=93)",
"details" : [ ]
} ]
} ]
} ]
} ]
}, {
"value" : 0.5,
"description" : "coord(2/4)",
"details" : [ ]
} ]
}
},
{
"_shard" : 2,
"_explanation" : {
"value" : 0.4143453,
"description" : "product of:",
"details" : [ {
"value" : 0.8286906,
"description" : "sum of:",
"details" : [ {
"value" : 0.018336227,
"description" : "weight(field1:pds in 58) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.018336227,
"description" : "score(doc=58,freq=1.0), product of:",
"details" : [ {
"value" : 0.0030464241,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 2.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 2.5307006E-4,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 6.0189342,
"description" : "fieldWeight in 58, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
} ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 1.0,
"description" : "fieldNorm(doc=58)",
"details" : [ ]
} ]
} ]
} ]
}, {
"value" : 0.81035435,
"description" : "weight(field4:pds in 58) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.81035435,
"description" : "score(doc=58,freq=2.0), product of:",
"details" : [ {
"value" : 0.1523212,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 100.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 2.5307006E-4,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 5.3200364,
"description" : "fieldWeight in 58, product of:",
"details" : [ {
"value" : 1.4142135,
"description" : "tf(freq=2.0), with freq of:",
"details" : [ {
"value" : 2.0,
"description" : "termFreq=2.0",
"details" : [ ]
} ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 0.625,
"description" : "fieldNorm(doc=58)",
"details" : [ ]
} ]
} ]
} ]
} ]
}, {
"value" : 0.5,
"description" : "coord(2/4)",
"details" : [ ]
} ]
}
}
Notice how the queryNorm on field1 for the document in shard 4 is "0.0017753748" (with idf 5.6314874), while the queryNorm for the same field for the doc in shard 2 is "2.5307006E-4" (with idf 6.0189342). I've tried to follow the queryNorm calculation by hand using the formula at http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html , but failed to arrive at the same answers.
I haven't seen too many threads/posts about calculating queryNorm; one I've found useful is http://www.openjems.com/tag/querynorm/ (this is actually Solr, but since the search type is query_then_fetch, the Lucene calculations should be the only thing that matters, so I expect them to behave similarly). However, I couldn't derive the right queryNorm values using the same approach (as far as I understand, t.getBoost() should be 1 in my case, since there are no index-time field boosts and no special field boost in the query above).
Does anyone have any suggestion as to what might be going on here?
You can set search_type to be equal dfs_query_then_fetch:
{
"search_type": "dfs_query_then_fetch",
"query": {
"bool": {
"must": [
{
"bool": {
"must": [],
"must_not": [],
"should": [
{
"multi_match": {
"query": "pds",
"fields": [
"field1"
],
"lenient": true,
"fuzziness": "0"
}
},
{
"multi_match": {
"query": "pds",
"fields": [
"field2"
],
"lenient": true,
"fuzziness": "0",
"boost": 1000.0
}
}
]
}
},
{
"multi_match": {
"query": "pds",
"fields": [
"field3"
],
"lenient": true,
"fuzziness": "0",
"boost": 500.0
}
},
{
"multi_match": {
"query": "pds",
"fields": [
"field4"
],
"lenient": true,
"fuzziness": "0",
"boost": 100.0
}
}
],
"must_not": [],
"should": [],
"filter": []
}
},
"size": 1000,
"min_score": 0.0
}
In this case all term statistics used for the norm values will be global. But it may impact query performance. If your index is small, you can also create an index with a single shard. And if you have many more documents, these values shouldn't be that different in the first place.
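To see why the large boosts matter here, note that per the TFIDFSimilarity javadoc linked in the question, queryNorm = 1 / sqrt(sumOfSquaredWeights), where every query clause contributes (boost * idf)^2. A rough Python sketch of the amplification effect follows; the idf values for field2 and field3 are hypothetical placeholders (those clauses don't match the documents, so their idf never shows up in the explain output), but once multiplied by boosts of 1000 and 500 and squared, they dominate the sum:
import math

def query_norm(clauses):
    # Lucene classic similarity: queryNorm = 1 / sqrt(sum((boost * idf)^2))
    return 1.0 / math.sqrt(sum((boost * idf) ** 2 for boost, idf in clauses))

# (boost, idf) pairs per clause. The field1/field4 idfs come from the explain
# output above; the field2/field3 idfs are made up for illustration.
shard_4 = [(2.0, 5.6314874), (1000.0, 2.0), (500.0, 2.0), (100.0, 5.6314874)]
shard_2 = [(2.0, 6.0189342), (1000.0, 8.0), (500.0, 8.0), (100.0, 6.0189342)]
print(query_norm(shard_4))  # ~4.3e-4
print(query_norm(shard_2))  # ~1.1e-4, roughly 4x smaller despite similar field1 idf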

Why is queryWeight included for some result scores, but not others, in the same query?

I'm executing a query_string query with one term on multiple fields, _all and tags.name, and trying to understand the scoring. Query: {"query":{"query_string":{"query":"animal","fields":["_all","tags.name"]}}}. Here are the documents returned by the query:
Document 1 has an exact match on tags.name, but not on _all.
Document 8 has an exact match on both tags.name and on _all.
Document 8 should win, and it does, but I'm confused by how the scoring works out. It seems like Document 1 is getting penalized by having its tags.name score multiplied by the IDF twice, whereas Document 8's tags.name score is only multiplied by the IDF once. In short:
They both have a component weight(tags.name:animal in 0) [PerFieldSimilarity].
In Document 1, we have weight = score = queryWeight x fieldWeight.
In Document 8, we have weight = fieldWeight!
Since queryWeight contains idf, this results in Document 1 getting penalized by its idf twice.
Can anyone make sense of this?
Additional information
If I remove _all from the fields of the query, queryWeight is completely gone from the explain.
Adding "use_dis_max":true as an option has no effect.
However, additionally adding "tie_breaker":0.7 (or any value) does affect Document 8 by giving it the more-complicated formula we see in Document 1.
Thoughts: It's plausible that a boolean query (which this is) might do this on purpose to give more weight to queries that match more than one sub-query. However, this doesn't make any sense for a dis_max query, which is supposed to just return the maximum of the sub-queries.
Here are the relevant explain requests. Look for embedded comments.
Document 1 (match only on tags.name):
curl -XGET 'http://localhost:9200/questions/question/1/_explain?pretty' -d '{"query":{"query_string":{"query":"animal","fields":["_all","tags.name"]}}}':
{
"ok" : true,
"_index" : "questions_1390104463",
"_type" : "question",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 0.058849156,
"description" : "max of:",
"details" : [ {
"value" : 0.058849156,
"description" : "weight(tags.name:animal in 0) [PerFieldSimilarity], result of:",
// weight = score = queryWeight x fieldWeight
"details" : [ {
// score and queryWeight are NOT a part of the other explain!
"value" : 0.058849156,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [ {
"value" : 0.30685282,
"description" : "queryWeight, product of:",
"details" : [ {
// This idf is NOT a part of the other explain!
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}, {
"value" : 0.19178301,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0"
} ]
}, {
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 0.625,
"description" : "fieldNorm(doc=0)"
} ]
} ]
} ]
} ]
}
Document 8 (match on both _all and tags.name):
curl -XGET 'http://localhost:9200/questions/question/8/_explain?pretty' -d '{"query":{"query_string":{"query":"animal","fields":["_all","tags.name"]}}}':
{
"ok" : true,
"_index" : "questions_1390104463",
"_type" : "question",
"_id" : "8",
"matched" : true,
"explanation" : {
"value" : 0.15342641,
"description" : "max of:",
"details" : [ {
"value" : 0.033902764,
"description" : "btq, product of:",
"details" : [ {
"value" : 0.033902764,
"description" : "weight(_all:anim in 0) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.033902764,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 0.70710677,
"description" : "tf(freq=0.5), with freq of:",
"details" : [ {
"value" : 0.5,
"description" : "phraseFreq=0.5"
} ]
}, {
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 0.15625,
"description" : "fieldNorm(doc=0)"
} ]
} ]
}, {
"value" : 1.0,
"description" : "allPayload(...)"
} ]
}, {
"value" : 0.15342641,
"description" : "weight(tags.name:animal in 0) [PerFieldSimilarity], result of:",
// weight = fieldWeight
// No score or queryWeight in sight!
"details" : [ {
"value" : 0.15342641,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0"
} ]
}, {
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 0.5,
"description" : "fieldNorm(doc=0)"
} ]
} ]
} ]
}
}
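Putting the two explains side by side, the asymmetry checks out numerically: Document 1's weight expands to queryWeight x fieldWeight, and since queryWeight = idf x queryNorm, the idf enters twice; Document 8's weight is the fieldWeight alone. A quick check in Python with the values from the explains above:
idf = 0.30685282   # idf(docFreq=1, maxDocs=1), identical in both explains
query_norm = 1.0

# Document 1: weight = queryWeight * fieldWeight, so idf is squared
doc1 = (idf * query_norm) * (1.0 * idf * 0.625)
print(doc1)  # 0.058849156, the reported score for Document 1

# Document 8: weight = fieldWeight only, idf enters once
doc8 = 1.0 * idf * 0.5
print(doc8)  # 0.15342641, the reported max score for Document 8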
I have no answer yet. Just want to mention that I posted the question to the Elasticsearch forum: https://groups.google.com/forum/#!topic/elasticsearch/xBKlFkq0SP0
I'll post an update here when I get an answer.

Elasticsearch not returning an exact match first

I have an Elasticsearch index with a field for exact matches, and somehow I get both a lot of similar results (which I don't mind) and those similar results end up sorted before the exact match (which I do mind).
Can someone explain what's going on and how to fix it?
My mapping is like this:
"exact":{
"type":"string",
"boost":10.0,
"analyzer":"keyword"
},
My query that searches for "AAPL P JAN 2014 885,00" is like this:
{
"size" : 21,
"query" : {
"field" : {
"exact" : "AAPL P JAN 2014 885,00"
}
},
"explain" : true,
"sort" : [ {
"_score" : {
"order" : "desc"
}
} ],
"facets" : {
"category" : {
"terms" : {
"field" : "category",
"size" : 10
}
}
}
}
And the returned documents end up in this order:
{"exact":["APPLE INC","US0378331005","AAPL","73773"],"id-compound":"AAPL"}
{"exact":["AAPL","73773","AAPL P JAN 2014 675,00"],"id-compound":"AAPL*PUT*20140118*675"}
{"exact":["AAPL","73773","AAPL C JAN 2014 500,00"],"id-compound":"AAPL*CALL*20140118*500"}
etc., with the exact match a bunch of results down the line.
Can someone explain to me why the exact match doesn't end up on top?
The search results with the full explain output are below, if it helps make sense of things.
"hits" : [ {
"_shard" : 0,
"_node" : "1",
"_index" : "instruments",
"_type" : "instrument",
"_id" : "AAPL",
"_score" : 1306.8339, "_source" : {"exact":["APPLE INC","US0378331005","AAPL","73773"],"id-compound":"AAPL"},
"_explanation" : {
"value" : 1306.8339,
"description" : "product of:",
"details" : [ {
"value" : 6534.169,
"description" : "sum of:",
"details" : [ {
"value" : 6534.169,
"description" : "weight(exact:AAPL in 9096), product of:",
"details" : [ {
"value" : 0.25854474,
"description" : "queryWeight(exact:AAPL), product of:",
"details" : [ {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 0.0419026,
"description" : "queryNorm"
} ]
}, {
"value" : 25272.875,
"description" : "fieldWeight(exact:AAPL in 9096), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(exact:AAPL)=1)"
}, {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 4096.0,
"description" : "fieldNorm(field=exact, doc=9096)"
} ]
} ]
} ]
}, {
"value" : 0.2,
"description" : "coord(1/5)"
} ]
}
}, {
"_shard" : 0,
"_node" : "1",
"_index" : "instruments",
"_type" : "instrument",
"_id" : "AAPL*PUT*20140118*675",
"_score" : 163.35423, "_source" : {"exact":["AAPL","73773","AAPL P JAN 2014 675,00"],"id-compound":"AAPL*PUT*20140118*675"},
"_explanation" : {
"value" : 163.35423,
"description" : "product of:",
"details" : [ {
"value" : 816.7711,
"description" : "sum of:",
"details" : [ {
"value" : 816.7711,
"description" : "weight(exact:AAPL in 18), product of:",
"details" : [ {
"value" : 0.25854474,
"description" : "queryWeight(exact:AAPL), product of:",
"details" : [ {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 0.0419026,
"description" : "queryNorm"
} ]
}, {
"value" : 3159.1094,
"description" : "fieldWeight(exact:AAPL in 18), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(exact:AAPL)=1)"
}, {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 512.0,
"description" : "fieldNorm(field=exact, doc=18)"
} ]
} ]
} ]
}, {
"value" : 0.2,
"description" : "coord(1/5)"
} ]
}
}, {
"_shard" : 0,
"_node" : "1",
"_index" : "instruments",
"_type" : "instrument",
"_id" : "AAPL*CALL*20140118*500",
"_score" : 163.35423, "_source" : {"exact":["AAPL","73773","AAPL C JAN 2014 500,00"],"id-compound":"AAPL*CALL*20140118*500"},
"_explanation" : {
"value" : 163.35423,
"description" : "product of:",
"details" : [ {
"value" : 816.7711,
"description" : "sum of:",
"details" : [ {
"value" : 816.7711,
"description" : "weight(exact:AAPL in 383), product of:",
"details" : [ {
"value" : 0.25854474,
"description" : "queryWeight(exact:AAPL), product of:",
"details" : [ {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 0.0419026,
"description" : "queryNorm"
} ]
}, {
"value" : 3159.1094,
"description" : "fieldWeight(exact:AAPL in 383), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(exact:AAPL)=1)"
}, {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 512.0,
"description" : "fieldNorm(field=exact, doc=383)"
} ]
} ]
} ]
}, {
"value" : 0.2,
"description" : "coord(1/5)"
} ]
}
}, {
"_id" : "AAPL*PUT*20140118*940",
"_score" : 163.35423, "_source" : {"exact":["AAPL","73773","AAPL P JAN 2014 940,00"],"id-compound":"AAPL*PUT*20140118*940"},
"_explanation" : {
"value" : 163.35423,
"description" : "product of:",
"details" : [ {
"value" : 816.7711,
"description" : "sum of:",
"details" : [ {
"value" : 816.7711,
"description" : "weight(exact:AAPL in 794), product of:",
"details" : [ {
"value" : 0.25854474,
"description" : "queryWeight(exact:AAPL), product of:",
"details" : [ {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 0.0419026,
"description" : "queryNorm"
} ]
}, {
"value" : 3159.1094,
"description" : "fieldWeight(exact:AAPL in 794), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(exact:AAPL)=1)"
}, {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 512.0,
"description" : "fieldNorm(field=exact, doc=794)"
} ]
} ]
} ]
}, {
"value" : 0.2,
"description" : "coord(1/5)"
} ]
}
}
And just in case, here's what happens if I analyze the data I'm trying to store:
curl -XGET 'localhost:9200/instruments/_analyze?field=exact&pretty=true' -d 'ING P JUN 2013 6.00'
{
"tokens" : [ {
"token" : "ING P JUN 2013 6.00",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 1
} ]
I'm not sure if it's technically the best thing, but if you're just after a single specific answer from Elasticsearch, you could use a filter with a script that looks for an exact match.
{
"from" : 0,
"size" : 1,
"query" : {
"text_phrase" : {
"title" : "AAPL P JAN 2014 885,00"
}
},
"filter" : {
"script" : {
"script" : "_source.exact.contains(x)",
"params" : {
"x" : "AAPL P JAN 2014 885,00"
}
}
}
}
I've used this to obtain a single known entry from Elasticsearch and it worked well for me.
I think you have found your answer; I just wanted to give a bit more info for others with the same problem.
You used a field query which, from the Elasticsearch documentation:
Field Query:
A query that executes a query string against a specific field. It is a simplified version of query_string query (by setting the default_field to the field this query executed against).
I believe a query_string query is meant for free text, i.e. it does a lot to the query: analyzing it, making it fuzzy, and so on.
What you want to use (and I think you found this out) is a term query, which will not do anything to the search phrase and so only gives you exact matches; a sketch follows below.
NOTE: Analysis happens at 2 distinct times, index time and query time. Setting "analyzer": "keyword" seems to only affect search-time queries "when searching using a query string" (from the Elasticsearch docs). I must admit I don't know exactly what that means (I would guess query_string, but it could also mean searches like http://../_search?q=exact:{query here}).
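For illustration, here is a sketch of such a term query sent with Python's requests library (index name, field name, and host taken from the question; a term query skips query-time analysis entirely, so the value must match a stored token verbatim):
import requests

# term queries are not analyzed: the value must equal an indexed token exactly.
body = {"query": {"term": {"exact": "AAPL P JAN 2014 885,00"}}}
resp = requests.post("http://localhost:9200/instruments/_search", json=body)
print(resp.json()["hits"]["total"])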
All three option documents get exactly the same score, as you can see from the explain output: they all match on "AAPL". The term always appears once in the documents (tf=1) and it appears in 211 out of 37299 documents (idf=6.1701355). The fieldNorm is much higher than usual since you are using index-time boosting (the boost: 10 part in your mapping); anyway, no big deal, since the match is always on the same field. It's just that if you had a match on other fields, exact would pretty much always win, which might make sense in your case.
But the problem is that AAPL P JAN 2014 885,00 is not an exact match if I look at your documents. What I do see is that out of the 5 terms in your query only one matches, which is confirmed by the coord in your explain output: coord(1/5).
The keyword analyzer seems to be applied, but as you can see from the returned documents, you are not sending the content of the exact field as a single value, but as an array of values. Each of its items won't be tokenized, since you are using the keyword analyzer, but you still end up with multiple tokens. I guess you have to check how you're indexing documents.
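As a cross-check, the top hit's score can be rebuilt in Python from its explain output:
# _score of the APPLE INC document = queryWeight * fieldWeight * coord
idf = 6.1701355
query_norm = 0.0419026
query_weight = idf * query_norm        # 0.25854474
field_weight = 1.0 * idf * 4096.0      # tf * idf * fieldNorm = 25272.875
coord = 0.2                            # coord(1/5): one of five query terms matched
print(query_weight * field_weight * coord)  # ~1306.8339, the reported _score

The option documents share the same queryWeight but have fieldNorm 512 instead of 4096, which is exactly the 8x gap between 1306.83 and 163.35.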
The reason why your keyword analyzer seems to be ignored in the search query is that ES tokenizes this string twice: first it runs its DSL tokenizer, and then it runs the tokenizer specified in the mapping on the result. This is explained in more detail in this article: http://paulsabou.com/blog/2012/03/25/advanced-exact-matching-with-elastic-search/
You should NOT ANALYZE your id field.
Define your field as:
"exact":{
"type":"string",
"index":"not_analyzed"
}
Have a look at Finding Exact Values
