I recently started working with elasticsearch (version 7.17.2) and there is something related to term frequency normalization and boosting that I don't quite understand.
To keep it simple, suppose I just create an index with
PUT test
and add a couple of documents
POST test/_doc/1
{
"firstname": "foo",
"lastname": "bar"
}
POST test/_doc/2
{
"firstname": "foo",
"lastname": "baz"
}
Now I want to perform the following search
POST test/_search
{
"explain": true,
"query": {
"bool": {
"should": {
"multi_match": {
"fields": [
"firstname^3",
"lastname^5"
],
"query": "foo bar"
}
}
}
}
}
which returns
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 3.465736,
"hits" : [
{
"_shard" : "[test][0]",
"_node" : "Or9Q1aPLTi-liJvA8NJW6g",
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.465736,
"_source" : {
"firstname" : "foo",
"lastname" : "bar"
},
"_explanation" : {
"value" : 3.465736,
"description" : "max of:",
"details" : [
{
"value" : 0.5469647,
"description" : "sum of:",
"details" : [
{
"value" : 0.5469647,
"description" : "weight(firstname:foo in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.5469647,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 6.6000004,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
},
{
"value" : 3.465736,
"description" : "sum of:",
"details" : [
{
"value" : 3.465736,
"description" : "weight(lastname:bar in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 3.465736,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 11.0,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.6931472,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
},
{
"_shard" : "[test][0]",
"_node" : "Or9Q1aPLTi-liJvA8NJW6g",
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.5469647,
"_source" : {
"firstname" : "foo",
"lastname" : "baz"
},
"_explanation" : {
"value" : 0.5469647,
"description" : "max of:",
"details" : [
{
"value" : 0.5469647,
"description" : "sum of:",
"details" : [
{
"value" : 0.5469647,
"description" : "weight(firstname:foo in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.5469647,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 6.6000004,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
]
}
}
I purposely gave lastname more weight than firstname (5 vs. 3). In the explanation, for instance for the contribution of firstname:foo, the score is computed as boost * idf * tf.
While I gave the field firstname a relevance boost of 3, its actual boost according to the explanation is 6.6. After some investigation, I figured out that this value corresponds to 3 * (1.2 + 1), that is, my boost of 3 multiplied by (k_1 + 1), where k_1 is the parameter of the default BM25 similarity function, whose default value is 1.2.
I know this might be related to some normalization that elasticsearch performs behind the scenes (whose documentation is rather poor), but I have seen this happening in two ways:
Exactly as in this example, with tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)).
Like they do it on wikipedia, with tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)). Notice already that the value is called tfNorm instead of just tf, and that the (k1 + 1) factor appears explicitly in the tfNorm and not "hidden" in the boost. Here are the wikipedia elasticsearch settings and mappings, in case they help.
What I would like to clarify is what is the difference between these two behaviors and how to switch between them, perhaps by updating the mapping.
BONUS QUESTION: Actually, there is a third option, that we can see in the same wikipedia example, searching for the field all_near_match. There, tfNorm = (freq * (k1 + 1)) / (freq + k1), and there is an annotation saying that the b parameter in the BM25 similarity function is 0 because norms omitted for field. How does this other approach relate with the other two I described above?
Thank you very much!
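The arithmetic behind the three forms described above can be checked with a small sketch. It only reproduces the formulas as printed in the explain outputs, using the values from the firstname:foo clause; the function names are made up for illustration, and it says nothing about which version or setting selects each form.

```python
import math

def idf(n, N):
    # idf as printed in the explain output
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def tf(freq, k1=1.2, b=0.75, dl=1.0, avgdl=1.0):
    # First form: no (k1 + 1) in the numerator
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def tf_norm(freq, k1=1.2, b=0.75, dl=1.0, avgdl=1.0):
    # Second form: (k1 + 1) appears explicitly in the numerator
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * dl / avgdl))

k1 = 1.2
user_boost = 3  # firstname^3

# Form 1: (k1 + 1) folded into the reported boost (3 * 2.2 = 6.6)
score_form1 = user_boost * (k1 + 1) * idf(2, 2) * tf(1.0)

# Form 2: boost stays at 3, (k1 + 1) lives inside tfNorm
score_form2 = user_boost * idf(2, 2) * tf_norm(1.0)

# Third form: with b = 0 the length term dl / avgdl drops out entirely
tf_norm_no_norms = tf_norm(1.0, b=0.0)

print(score_form1, score_form2, tf_norm_no_norms)
```

Under these assumptions the first two forms produce the same score; they differ only in where the constant (k1 + 1) factor is reported.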
I'm probably missing something trivial here, but I'm having issues with the relevancy score of the search results when it comes to optional fields in documents. Consider the following example:
Test data:
DELETE /my-index
PUT /my-index
POST /my-index/_bulk
{"index":{"_id":"1"}}
{"required_field":"RareWord"}
{"index":{"_id":"2"}}
{"required_field":"RareWord"}
{"index":{"_id":"3"}}
{"required_field":"CommonWord"}
{"index":{"_id":"4"}}
{"required_field":"CommonWord"}
{"index":{"_id":"5"}}
{"required_field":"CommonWord"}
{"index":{"_id":"6"}}
{"required_field":"CommonWord"}
{"index":{"_id":"7"}}
{"required_field":"CommonWord"}
{"index":{"_id":"8"}}
{"required_field":"CommonWord"}
{"index":{"_id":"9"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
{"index":{"_id":"10"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
Search Query:
If I run a search query similar to one below:
GET /my-index/_search
{"query":{"multi_match":{"query":"RareWord AnotherRareWord","fields":["required_field","optional_field"]}}}
Expectation
The end-user would expect Document #9 and #10 to score higher than others, because they contain the exact two words of the search query in their optional_field
Reality
Document #1 scores better than #10, even though it contains only one of the two words of the search query, which is the opposite of what end-users most likely expect.
A closer look at _explain
Here are the _explain results of running the same search query for Document #1:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 1.4816045,
"description" : "max of:",
"details" : [
{
"value" : 1.4816045,
"description" : "sum of:",
"details" : [
{
"value" : 1.4816045,
"description" : "weight(required_field:rareword in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.4816045,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 1.4816046,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 10,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
And here are the _explain results of running the same search query for Document #10:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "10",
"matched" : true,
"explanation" : {
"value" : 0.36464313,
"description" : "max of:",
"details" : [
{
"value" : 0.36464313,
"description" : "sum of:",
"details" : [
{
"value" : 0.18232156,
"description" : "weight(optional_field:rareword in 9) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.18232156,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.18232156,
"description" : "weight(optional_field:anotherrareword in 9) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.18232156,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
As you can see, Document #10 scores worse, mainly due to the lower IDF value (0.18232156). Looking closely, it's because IDF uses N, total number of documents with field: 2 instead of simply considering the total number of the documents in the index: 10.
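To make the effect of N concrete, here is a minimal sketch that recomputes both scores from the values in the explain outputs. The last line is purely hypothetical: it shows what Document #10 would score if N for optional_field were the 10 documents in the index, which is not something the outputs above demonstrate Elasticsearch can do.

```python
import math

def idf(n, N):
    # idf as printed in the explain output
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf as printed in the explain output
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

BOOST = 2.2  # boost reported in both explain outputs

# Document #1: one term matches in required_field (n = 2, N = 10)
doc1 = BOOST * idf(2, 10) * tf(1.0, dl=1.0, avgdl=1.0)

# Document #10: two terms match in optional_field (n = 2, N = 2 for each)
doc10 = 2 * BOOST * idf(2, 2) * tf(1.0, dl=2.0, avgdl=2.0)

# Hypothetical: same two matches, but with N = 10 instead of N = 2
doc10_if_global_n = 2 * BOOST * idf(2, 10) * tf(1.0, dl=2.0, avgdl=2.0)

print(doc1, doc10, doc10_if_global_n)
```

With N = 2 both optional_field terms look like they occur in every document, so their idf collapses; with N = 10 the two matches would outweigh the single match in Document #1.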
Question
Is there any way to force the multi_match query to consider all the documents (instead of only those that contain the field) when computing the IDF value for an optional field, so that the relevance score is closer to what end-users expect?
Or alternatively, is there a better way to write the search query, so I get the expected results?
Any help would be greatly appreciated. Thanks.
Your situation seems to be similar to the one described in the cross_fields query type so you should probably try it:
{
"multi_match": {
"query": "RareWord AnotherRareWord",
"fields": ["required_field","optional_field"],
"type": "cross_fields",
"operator": "and"
}
}
I want to get a document by "_id", and I have 3 choices:
GET the document by "_id":
GET order/_doc/001
Use an ids query:
GET order/_search
{ "query": { "ids" : { "values" : ["001"] } } }
The ids query takes an array of ids, but I will only fetch one document at a time, so I pass a single id in "values" : ["001"].
Use a term query:
GET order/_search
{ "query": {"term": {"_id" : "001"}}}
I want to know the difference between the ids query and the term query, performance-wise, and any other points I should be aware of.
Which one should I choose (between the ids query and the term query)?
Any help is much appreciated:)
The first option is not a search and simply gets the document by id.
If you look at the execution plan of the second and third queries, you'll notice that they are identical:
Ids query:
GET order/_search
{
"explain": true,
"query": {
"ids": {
"values": ["001"]
}
}
}
Execution plan:
"_explanation" : {
"value" : 1.0,
"description" : "sum of:",
"details" : [
{
"value" : 1.0,
"description" : "ConstantScore(_id:[fe 0 1f])",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "match on required clause, product of:",
"details" : [
{
"value" : 0.0,
"description" : "# clause",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "DocValuesFieldExistsQuery [field=_primary_term]",
"details" : [ ]
}
]
}
]
}
Term query:
GET order/_search
{
"explain": true,
"query": {
"term": {
"_id": "001"
}
}
}
Execution plan:
"_explanation" : {
"value" : 1.0,
"description" : "sum of:",
"details" : [
{
"value" : 1.0,
"description" : "ConstantScore(_id:[fe 0 1f])",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "match on required clause, product of:",
"details" : [
{
"value" : 0.0,
"description" : "# clause",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "DocValuesFieldExistsQuery [field=_primary_term]",
"details" : [ ]
}
]
}
]
}
Any difference? None!
I have an Elasticsearch index with a field for exact matches, and somehow I get a lot of similar results (which I don't mind), and those similar results end up sorted before the exact match (which I do mind).
Can someone explain what's going on and how to fix it?
My mapping is like this
"exact":{
"type":"string",
"boost":10.0,
"analyzer":"keyword"
},
My query that searches for "AAPL P JAN 2014 885,00" is like this:
{
"size" : 21,
"query" : {
"field" : {
"exact" : "AAPL P JAN 2014 885,00"
}
},
"explain" : true,
"sort" : [ {
"_score" : {
"order" : "desc"
}
} ],
"facets" : {
"category" : {
"terms" : {
"field" : "category",
"size" : 10
}
}
}
}
And the returned documents end up in this order:
{"exact":["APPLE INC","US0378331005","AAPL","73773"],"id-compound":"AAPL"}
{"exact":["AAPL","73773","AAPL P JAN 2014 675,00"],"id-compound":"AAPL*PUT*20140118*675"}
{"exact":["AAPL","73773","AAPL C JAN 2014 500,00"],"id-compound":"AAPL*CALL*20140118*500"}
etc, with the exact match a bunch of results down the line.
Can someone explain to me why the exact match doesn't end on top?
The search results with full explain is below if it helps make sense of things.
"hits" : [ {
"_shard" : 0,
"_node" : "1",
"_index" : "instruments",
"_type" : "instrument",
"_id" : "AAPL",
"_score" : 1306.8339, "_source" : {"exact":["APPLE INC","US0378331005","AAPL","73773"],"id-compound":"AAPL"},
"_explanation" : {
"value" : 1306.8339,
"description" : "product of:",
"details" : [ {
"value" : 6534.169,
"description" : "sum of:",
"details" : [ {
"value" : 6534.169,
"description" : "weight(exact:AAPL in 9096), product of:",
"details" : [ {
"value" : 0.25854474,
"description" : "queryWeight(exact:AAPL), product of:",
"details" : [ {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 0.0419026,
"description" : "queryNorm"
} ]
}, {
"value" : 25272.875,
"description" : "fieldWeight(exact:AAPL in 9096), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(exact:AAPL)=1)"
}, {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 4096.0,
"description" : "fieldNorm(field=exact, doc=9096)"
} ]
} ]
} ]
}, {
"value" : 0.2,
"description" : "coord(1/5)"
} ]
}
}, {
"_shard" : 0,
"_node" : "1",
"_index" : "instruments",
"_type" : "instrument",
"_id" : "AAPL*PUT*20140118*675",
"_score" : 163.35423, "_source" : {"exact":["AAPL","73773","AAPL P JAN 2014 675,00"],"id-compound":"AAPL*PUT*20140118*675"},
"_explanation" : {
"value" : 163.35423,
"description" : "product of:",
"details" : [ {
"value" : 816.7711,
"description" : "sum of:",
"details" : [ {
"value" : 816.7711,
"description" : "weight(exact:AAPL in 18), product of:",
"details" : [ {
"value" : 0.25854474,
"description" : "queryWeight(exact:AAPL), product of:",
"details" : [ {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 0.0419026,
"description" : "queryNorm"
} ]
}, {
"value" : 3159.1094,
"description" : "fieldWeight(exact:AAPL in 18), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(exact:AAPL)=1)"
}, {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 512.0,
"description" : "fieldNorm(field=exact, doc=18)"
} ]
} ]
} ]
}, {
"value" : 0.2,
"description" : "coord(1/5)"
} ]
}
}, {
"_shard" : 0,
"_node" : "1",
"_index" : "instruments",
"_type" : "instrument",
"_id" : "AAPL*CALL*20140118*500",
"_score" : 163.35423, "_source" : {"exact":["AAPL","73773","AAPL C JAN 2014 500,00"],"id-compound":"AAPL*CALL*20140118*500"},
"_explanation" : {
"value" : 163.35423,
"description" : "product of:",
"details" : [ {
"value" : 816.7711,
"description" : "sum of:",
"details" : [ {
"value" : 816.7711,
"description" : "weight(exact:AAPL in 383), product of:",
"details" : [ {
"value" : 0.25854474,
"description" : "queryWeight(exact:AAPL), product of:",
"details" : [ {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 0.0419026,
"description" : "queryNorm"
} ]
}, {
"value" : 3159.1094,
"description" : "fieldWeight(exact:AAPL in 383), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(exact:AAPL)=1)"
}, {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 512.0,
"description" : "fieldNorm(field=exact, doc=383)"
} ]
} ]
} ]
}, {
"value" : 0.2,
"description" : "coord(1/5)"
} ]
}
}, {
"_id" : "AAPL*PUT*20140118*940",
"_score" : 163.35423, "_source" : {"exact":["AAPL","73773","AAPL P JAN 2014 940,00"],"id-compound":"AAPL*PUT*20140118*940"},
"_explanation" : {
"value" : 163.35423,
"description" : "product of:",
"details" : [ {
"value" : 816.7711,
"description" : "sum of:",
"details" : [ {
"value" : 816.7711,
"description" : "weight(exact:AAPL in 794), product of:",
"details" : [ {
"value" : 0.25854474,
"description" : "queryWeight(exact:AAPL), product of:",
"details" : [ {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 0.0419026,
"description" : "queryNorm"
} ]
}, {
"value" : 3159.1094,
"description" : "fieldWeight(exact:AAPL in 794), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(exact:AAPL)=1)"
}, {
"value" : 6.1701355,
"description" : "idf(docFreq=211, maxDocs=37299)"
}, {
"value" : 512.0,
"description" : "fieldNorm(field=exact, doc=794)"
} ]
} ]
} ]
}, {
"value" : 0.2,
"description" : "coord(1/5)"
} ]
}
}
And just in case, here's what happens if I analyze the data I'm trying to store:
curl -XGET 'localhost:9200/instruments/_analyze?field=exact&pretty=true' -d 'ING P JUN 2013 6.00'
{
"tokens" : [ {
"token" : "ING P JUN 2013 6.00",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 1
} ]
}
I'm not sure if it's technically the best thing but if you're just after a single specific answer from elastic search you could just use a filter with a script that looked for an exact match.
{
"from" : 0,
"size" : 1,
"query" : {
"text_phrase" : {
"title" : "AAPL P JAN 2014 885,00"
}
},
"filter" : {
"script" : {
"script" : "_source.exact.contains(x)",
"params" : {
"x" : "AAPL P JAN 2014 885,00"
}
}
}
}
I've used this to obtain a single known entry from elastic search and it worked well for me.
I think you have found your answer; I just wanted to give a bit more info for others with the same problem.
You are using a field query, which the Elasticsearch documentation describes as follows:
Field Query:
A query that executes a query string against a specific field. It is a simplified version of query_string query (by setting the default_field to the field this query executed against).
I believe a query_string query is for text, i.e.: it does a lot to the query, making it fuzzy, etc...
What you want to use (and I think you found this out) is a term query which will not do anything to the search phrase, and so only give you exact matches.
NOTE: Analysis happens at two distinct times: index time and query time. According to the Elasticsearch docs, setting "analyzer": "keyword" seems to only affect search-time queries "when searching using a query string". I must admit I don't know exactly what that means (I would guess query_string, but it could also mean searches like http://../_search?q=exact:{query here}).
As the explain output shows, every hit matches only on "AAPL". The term appears once in each document (tf=1) and in 211 out of 37299 documents (idf=6.1701355), so the option documents all get exactly the same score (163.35423); the first hit scores higher only because its fieldNorm is larger (4096 vs. 512). The field norms are this high because you are using index-time boosting (the boost part in your mapping, 10). That is no big deal here, since the match is always on the same field. It just means that if other fields matched as well, exact would pretty much always win, which might make sense in your case.
But the problem is that AAPL P JAN 2014 885,00 is not an exact match if I look at your documents. What I do see is that out of the 5 terms in your query only one matches, which is confirmed by the coord in your explain output: coord(1/5).
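As a rough check, the numbers in the explain output can be multiplied back together. The sketch below assumes the structure shown there (queryWeight times fieldWeight, then times coord); the function name is invented for illustration.

```python
import math

def classic_score(idf, query_norm, tf, field_norm, coord):
    # queryWeight = idf * queryNorm
    query_weight = idf * query_norm
    # fieldWeight = tf * idf * fieldNorm
    field_weight = tf * idf * field_norm
    # coord scales the sum by the fraction of query terms that matched
    return query_weight * field_weight * coord

# First hit: fieldNorm = 4096
top_hit = classic_score(6.1701355, 0.0419026, 1.0, 4096.0, 1 / 5)

# Remaining hits: fieldNorm = 512
other_hits = classic_score(6.1701355, 0.0419026, 1.0, 512.0, 1 / 5)

print(top_hit, other_hits, top_hit / other_hits)
```

Everything except fieldNorm is identical across the hits, so the ratio between the two scores is just 4096 / 512.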
The keyword analyzer seems to be applied, but as you can see from the returned documents, you are not sending the content of the exact field as a single value, but as an array of values. None of its items will be tokenized, since you are using the keyword analyzer, but you still end up with multiple tokens. I guess you have to check how you're indexing documents.
The reason your keyword analyzer seems to be ignored in the search query is that ES tokenizes the string twice: first it runs its DSL tokenizer, and then it runs the tokenizer specified in the mapping on the result. This is explained in more detail in this article: http://paulsabou.com/blog/2012/03/25/advanced-exact-matching-with-elastic-search/
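A toy model of that two-step behaviour, written in plain Python rather than against Elasticsearch itself, shows where coord(1/5) comes from. The helper names are hypothetical, and the whitespace split is a simplification of what the query parser really does.

```python
def keyword_analyze(value):
    # Simplified keyword analyzer: the whole value becomes a single token
    return [value]

def index_tokens(values):
    # An array field contributes one token per array element
    return {tok for v in values for tok in keyword_analyze(v)}

def query_terms(query):
    # Assumed model of the query parser: split on whitespace first,
    # then run the field's analyzer on each piece
    return [tok for part in query.split() for tok in keyword_analyze(part)]

doc_exact = ["AAPL", "73773", "AAPL P JAN 2014 675,00"]
terms = query_terms("AAPL P JAN 2014 885,00")
matched = [t for t in terms if t in index_tokens(doc_exact)]
coord = len(matched) / len(terms)

print(terms, matched, coord)
```

Only the piece "AAPL" coincides with an indexed token, so one of five query terms matches, which is consistent with the coord value in the explain output.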
You should NOT ANALYZE your id field.
Define your field as:
"exact":{
"type":"string",
"index":"not_analyzed"
}
Have a look at Finding Exact Values
I have some documents with the same content, but when I query for them I get different scores, even though the queried field contains the same text. I have looked at the explain output for the scores, but I cannot work out the reason for the difference.
My query is
curl 'localhost:9200/acqindex/_search?pretty=1' -d '{
"explain" : true,
"query" : {
"query_string" : {
"query" : "text:shimla"
}
}
}'
Search response :
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 31208,
"max_score" : 268.85962,
"hits" : [ {
"_shard" : 0,
"_node" : "KOebAnGhSJKUHLPNxndcpQ",
"_index" : "acqindex",
"_type" : "autocomplete_questions",
"_id" : "50efec6c38cc6fdabd8653a3",
"_score" : 268.85962, "_source" : {"_class":"com.ixigo.next.cms.model.AutoCompleteObject","_id":"50efec6c38cc6fdabd8653a3","ad":"rajasthan,IN","category":["Destination"],"ctype":"destination","eid":"503b2a65e4b032e338f0d24b","po":8.772307692307692,"text":"shimla","url":"/travel-guide/shimla"},
"_explanation" : {
"value" : 268.85962,
"description" : "sum of:",
"details" : [ {
"value" : 38.438133,
"description" : "weight(text:shi in 5860), product of:",
"details" : [ {
"value" : 0.37811017,
"description" : "queryWeight(text:shi), product of:",
"details" : [ {
"value" : 5.0829277,
"description" : "idf(docFreq=7503, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 101.658554,
"description" : "fieldWeight(text:shi in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shi)=1)"
}, {
"value" : 5.0829277,
"description" : "idf(docFreq=7503, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
}, {
"value" : 66.8446,
"description" : "weight(text:shim in 5860), product of:",
"details" : [ {
"value" : 0.49862078,
"description" : "queryWeight(text:shim), product of:",
"details" : [ {
"value" : 6.7029495,
"description" : "idf(docFreq=1484, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 134.05899,
"description" : "fieldWeight(text:shim in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shim)=1)"
}, {
"value" : 6.7029495,
"description" : "idf(docFreq=1484, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
}, {
"value" : 81.75818,
"description" : "weight(text:shiml in 5860), product of:",
"details" : [ {
"value" : 0.5514458,
"description" : "queryWeight(text:shiml), product of:",
"details" : [ {
"value" : 7.413075,
"description" : "idf(docFreq=729, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 148.2615,
"description" : "fieldWeight(text:shiml in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shiml)=1)"
}, {
"value" : 7.413075,
"description" : "idf(docFreq=729, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
}, {
"value" : 81.8187,
"description" : "weight(text:shimla in 5860), product of:",
"details" : [ {
"value" : 0.55164987,
"description" : "queryWeight(text:shimla), product of:",
"details" : [ {
"value" : 7.415818,
"description" : "idf(docFreq=727, maxDocs=445129)"
}, {
"value" : 0.074388266,
"description" : "queryNorm"
} ]
}, {
"value" : 148.31636,
"description" : "fieldWeight(text:shimla in 5860), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shimla)=1)"
}, {
"value" : 7.415818,
"description" : "idf(docFreq=727, maxDocs=445129)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=5860)"
} ]
} ]
} ]
}
}, {
"_shard" : 1,
"_node" : "KOebAnGhSJKUHLPNxndcpQ",
"_index" : "acqindex",
"_type" : "autocomplete_questions",
"_id" : "50efed1c38cc6fdabd8b8d2f",
"_score" : 268.29953, "_source" : {"_id":"50efed1c38cc6fdabd8b8d2f","ad":"himachal pradesh,IN","category":["Hill","See and Do","Destination","Mountain","Nature and Wildlife"],"ctype":"destination","eid":"503b2a64e4b032e338f0d0af","po":8.781970310391364,"text":"shimla","url":"/travel-guide/shimla"},
"_explanation" : {
"value" : 268.29953,
"description" : "sum of:",
"details" : [ {
"value" : 38.52957,
"description" : "weight(text:shi in 14769), product of:",
"details" : [ {
"value" : 0.37895453,
"description" : "queryWeight(text:shi), product of:",
"details" : [ {
"value" : 5.083667,
"description" : "idf(docFreq=7263, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 101.67334,
"description" : "fieldWeight(text:shi in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shi)=1)"
}, {
"value" : 5.083667,
"description" : "idf(docFreq=7263, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
}, {
"value" : 66.67524,
"description" : "weight(text:shim in 14769), product of:",
"details" : [ {
"value" : 0.49850821,
"description" : "queryWeight(text:shim), product of:",
"details" : [ {
"value" : 6.6874766,
"description" : "idf(docFreq=1460, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 133.74953,
"description" : "fieldWeight(text:shim in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shim)=1)"
}, {
"value" : 6.6874766,
"description" : "idf(docFreq=1460, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
}, {
"value" : 81.53204,
"description" : "weight(text:shiml in 14769), product of:",
"details" : [ {
"value" : 0.5512571,
"description" : "queryWeight(text:shiml), product of:",
"details" : [ {
"value" : 7.3951015,
"description" : "idf(docFreq=719, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 147.90204,
"description" : "fieldWeight(text:shiml in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shiml)=1)"
}, {
"value" : 7.3951015,
"description" : "idf(docFreq=719, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
}, {
"value" : 81.56268,
"description" : "weight(text:shimla in 14769), product of:",
"details" : [ {
"value" : 0.55136067,
"description" : "queryWeight(text:shimla), product of:",
"details" : [ {
"value" : 7.3964915,
"description" : "idf(docFreq=718, maxDocs=431211)"
}, {
"value" : 0.07454354,
"description" : "queryNorm"
} ]
}, {
"value" : 147.92982,
"description" : "fieldWeight(text:shimla in 14769), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(text:shimla)=1)"
}, {
"value" : 7.3964915,
"description" : "idf(docFreq=718, maxDocs=431211)"
}, {
"value" : 20.0,
"description" : "fieldNorm(field=text, doc=14769)"
} ]
} ]
} ]
}
}
}
}
The documents are :
{"_class":"com.ixigo.next.cms.model.AutoCompleteObject","_id":"50efec6c38cc6fdabd8653a3","ad":"rajasthan,IN","category":["Destination"],"ctype":"destination","eid":"503b2a65e4b032e338f0d24b","po":8.772307692307692,"text":"shimla","url":"/travel-guide/shimla"}
{"_id":"50efed1c38cc6fdabd8b8d2f","ad":"himachal pradesh,IN","category":["Hill","See and Do","Destination","Mountain","Nature and Wildlife"],"ctype":"destination","eid":"503b2a64e4b032e338f0d0af","po":8.781970310391364,"text":"shimla","url":"/travel-guide/shimla"}
Please guide me in understanding the reason for the difference in scores.
The Lucene score depends on several factors. With the TF-IDF similarity (the default one), it mainly depends on:
Term frequency: how often the matching terms occur within the document
Inverse document frequency: how rare the matching terms are across the documents in the index
Field norms (including index-time boosting): shorter fields get a higher score than longer ones.
In your case you have to take into account that your two documents come from different shards, thus the score is computed separately on each of those, since every shard is in fact a separate lucene index.
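The per-shard effect can be reproduced from the explain output. The sketch below assumes Lucene's classic idf formula, 1 + ln(maxDocs / (docFreq + 1)), which matches the printed values, and plugs in the docFreq and maxDocs reported for text:shimla on each shard.

```python
import math

def classic_idf(doc_freq, max_docs):
    # Classic TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1))
    return 1 + math.log(max_docs / (doc_freq + 1))

# Shard 0: idf(docFreq=727, maxDocs=445129)
idf_shard0 = classic_idf(727, 445129)

# Shard 1: idf(docFreq=718, maxDocs=431211)
idf_shard1 = classic_idf(718, 431211)

print(idf_shard0, idf_shard1)
```

The two documents have identical tf and fieldNorm, so the small gap in idf (and in queryNorm, which is derived from it) accounts for the different totals.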
You might want to have a look at the more expensive DFS, Query then Fetch search type that elasticsearch provides for more accurate scoring. The default one is the simple Query then Fetch.
javanna clearly pointed out the problem: the difference in scores comes from the fact that scoring happens separately on multiple shards. Those shards may hold different numbers of documents, which affects the scoring algorithm.
However, the authors of Elasticsearch: The Definitive Guide note:
The differences between local and global IDF [inverse document frequency] diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.
You should not use dfs_query_then_fetch in production. For testing, put your index on one primary shard or specify ?search_type=dfs_query_then_fetch.