Elasticsearch re-indexing the same document causes score changes

We created an index with a single document:
POST sample-index-test/_doc/1
{
"first_name": "James",
"last_name" : "Osaka"
}
With only this one document in the index, we run the _explain API with a match query:
GET sample-index-test/_explain/1
{
"query": {
"match": {
"first_name": "James"
}
}
}
The _explain API returns the details below:
score: 0.2876821
n, number of documents containing term: 1
N, total number of documents with field: 1
{
"_index" : "sample-index-test",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 0.2876821,
"description" : "weight(first_name:james in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 1,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
}
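As a sanity check, the numbers multiply out exactly. With n = 1 and N = 1:
idf = log(1 + (1 - 1 + 0.5) / (1 + 0.5)) = log(4/3) ≈ 0.2876821
score = boost * idf * tf = 2.2 * 0.2876821 * 0.45454544 ≈ 0.2876821
(Here 2.2 * 0.45454544 = 1, since tf = 1 / (1 + k1) = 1/2.2 when freq = dl = avgdl = 1, so the score equals the idf.)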
Now we run the same index request several more times within a few seconds:
POST sample-index-test/_doc/1
{
"first_name": "James",
"last_name" : "Cena"
}
Running the same _explain request again returns a different score, along with changed values for the number of documents containing the term and the total number of documents with the field:
score: 0.046520013
n, number of documents containing term: 10
N, total number of documents with field: 10
{
"_index" : "sample-index-test",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 0.046520013,
"description" : "weight(first_name:james in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.046520013,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.046520017,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 10,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 10,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
}
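Plugging the new statistics into the same formula reproduces the drop. With n = 10 and N = 10:
idf = log(1 + (10 - 10 + 0.5) / (10 + 0.5)) = log(22/21) ≈ 0.046520
score = 2.2 * 0.046520 * 0.45454544 ≈ 0.046520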
Why does Elasticsearch increase the "total number of documents with field" and "number of documents containing term" counts when the index still contains only a single document?

Elasticsearch is built on Lucene, which stores all documents in segments. Segments are immutable, so a document update is a two-step process: a new version of the document is written, and the old version is only marked as deleted. When you create the first document, the segments contain exactly one document. Each subsequent update of that same document adds one more deleted version, so after nine updates the segments hold 9 deleted versions plus the single live one. The deleted versions still count toward the index statistics, which is why "N, total number of documents with field" and "n, number of documents containing term" change.
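One way to observe this directly is the _cat segments API, which reports live and deleted document counts per segment (a sketch; the available columns vary slightly by version):
GET _cat/segments/sample-index-test?v&h=segment,docs.count,docs.deleted
## After the updates you would see docs.deleted adding up to 9 across the
## segments, even though only one document is live.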
You can test this with the _forcemerge endpoint. A force merge merges the segments and purges the deleted documents from them:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
## 1. Create the document
POST sample-index-test/_doc/1
{
"first_name": "James",
"last_name" : "Osaka"
}
## 2. Get the explain score
GET sample-index-test/_explain/1
{
"query": {
"match": {
"first_name": "James"
}
}
}
## "value": 0.2876821,
## n, number of documents containing term => 1
## N, total number of documents with field => 1
## 3.1. Execute this 10 times
POST sample-index-test/_doc/1
{
"first_name": "James",
"last_name" : "Cena"
}
## 3.2 Alternatively, you can use the _update endpoint
POST sample-index-test/_update/1
{
"script" : "ctx._source.first_name = 'James'; ctx._source.last_name = 'Cena';"
}
## 3.3 You can even use _update_by_query
POST sample-index-test/_update_by_query
{
"query": {
"match": {
"first_name": "James"
}
},
"script": {
"source": "ctx._source.first_name = 'James'; ctx._source.last_name = 'Cena';",
"lang": "painless"
}
}
## 4. Get the explain score
GET sample-index-test/_explain/1
{
"query": {
"match": {
"first_name": "James"
}
}
}
## "value": 0.046520013,
## n, number of documents containing term => 10
## N, total number of documents with field => 10
## 5. Execute the force merge.
POST sample-index-test/_forcemerge
## 6. The force merge runs in the background, so wait a couple of seconds before re-running _explain.
GET sample-index-test/_explain/1
{
"query": {
"match": {
"first_name": "James"
}
}
}
## "value": 0.2876821,
## n, number of documents containing term => 1
## N, total number of documents with field => 1
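If deleted documents still linger after a plain force merge, you can request merging down to a single segment; note this is a heavyweight operation best reserved for indices that are no longer being written to:
POST sample-index-test/_forcemerge?max_num_segments=1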

Related

Normalization of term frequency in elasticsearch

I recently started working with elasticsearch (version 7.17.2) and there is something related to term frequency normalization and boosting that I don't quite understand.
To keep it simple, suppose I just create an index with
PUT test
and add a couple of documents
POST test/_doc/1
{
"firstname": "foo",
"lastname": "bar"
}
POST test/_doc/2
{
"firstname": "foo",
"lastname": "baz"
}
Now I want to perform the following search
POST test/_search
{
"explain": true,
"query": {
"bool": {
"should": {
"multi_match": {
"fields": [
"firstname^3",
"lastname^5"
],
"query": "foo bar"
}
}
}
}
}
which returns
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 3.465736,
"hits" : [
{
"_shard" : "[test][0]",
"_node" : "Or9Q1aPLTi-liJvA8NJW6g",
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.465736,
"_source" : {
"firstname" : "foo",
"lastname" : "bar"
},
"_explanation" : {
"value" : 3.465736,
"description" : "max of:",
"details" : [
{
"value" : 0.5469647,
"description" : "sum of:",
"details" : [
{
"value" : 0.5469647,
"description" : "weight(firstname:foo in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.5469647,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 6.6000004,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
},
{
"value" : 3.465736,
"description" : "sum of:",
"details" : [
{
"value" : 3.465736,
"description" : "weight(lastname:bar in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 3.465736,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 11.0,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.6931472,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
},
{
"_shard" : "[test][0]",
"_node" : "Or9Q1aPLTi-liJvA8NJW6g",
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.5469647,
"_source" : {
"firstname" : "foo",
"lastname" : "baz"
},
"_explanation" : {
"value" : 0.5469647,
"description" : "max of:",
"details" : [
{
"value" : 0.5469647,
"description" : "sum of:",
"details" : [
{
"value" : 0.5469647,
"description" : "weight(firstname:foo in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.5469647,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 6.6000004,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
]
}
}
I purposely gave more relevance to lastname than to firstname (5 vs. 3). In the explanation, for instance for the contribution of firstname:foo, the score is computed as boost * idf * tf.
While I gave the field firstname a relevance boost of 3, its actual boost according to the explanation is 6.6. After some investigation, I figured out that this value corresponds to 3 * (1.2 + 1), that is, my boost of 3 multiplied by (k1 + 1), where k1 is a parameter of the default BM25 similarity function, whose default value is 1.2.
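The same factor shows up for lastname: its boost of 11.0 in the explanation is 5 * (1.2 + 1) = 5 * 2.2.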
I know this might be related to some normalization that elasticsearch performs behind the scenes (whose documentation is rather poor), but I have seen this happening in two ways:
1. Exactly as in this example, with tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)).
2. Like they do it on wikipedia, with tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)). Notice that the value is called tfNorm instead of just tf, and that the (k1 + 1) factor appears explicitly in tfNorm rather than being "hidden" in the boost. Here are the wikipedia elasticsearch settings and mappings, in case they help.
What I would like to clarify is the difference between these two behaviors and how to switch between them, perhaps by updating the mapping.
BONUS QUESTION: There is actually a third option, visible in the same wikipedia example when searching the field all_near_match. There, tfNorm = (freq * (k1 + 1)) / (freq + k1), with an annotation saying that the b parameter of the BM25 similarity function is 0 because norms are omitted for the field. How does this third approach relate to the two I described above?
Thank you very much!
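For reference, the BM25 parameters and the norms behavior mentioned above are configurable per field. A minimal sketch, assuming a hypothetical index test-bm25 (this shows the knobs involved, not a switch between the two tf formulations):
PUT test-bm25
{
"settings": {
"index": {
"similarity": {
"my_bm25": {
"type": "BM25",
"k1": 1.2,
"b": 0.75
}
}
}
},
"mappings": {
"properties": {
"firstname": {
"type": "text",
"similarity": "my_bm25"
},
"lastname": {
"type": "text",
"norms": false
}
}
}
}
## With "norms": false the field length is no longer stored, which yields the
## "b ... is 0 because norms omitted for field" annotation from the bonus question.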

Elasticsearch ngram boost substring matches beginning or near beginning of the string

I have this ngram setting:
"settings": {
"max_ngram_diff": 20,
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"filter": "lowercase",
"tokenizer": "ngram_tokenizer"
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"whitespace",
"custom"
],
"custom_token_chars": "-:."
}
}
}
}
I use it to analyze SSNs and randomly generated numbers:
"SSN": {
"type": "text",
"analyzer": "ngram_analyzer"
},
"RandomGenNumbers": {
"type": "text",
"analyzer": "ngram_analyzer"
}
When searching on both fields like this:
{
"match": {
"RandomGenNumbers": {
"analyzer": "standard",
"minimum_should_match": "100%",
"query": "199"
}
}
},
{
"match": {
"SSN": {
"analyzer": "standard",
"minimum_should_match": "100%",
"query": "199"
}
}
}
I was expecting the SSN 199012121234 to rank before RandomGenNumbers 23381990, but I'm getting RandomGenNumbers first with a score of 7.6, while the SSN scored 3.1.
When I explain the search result, it seems RandomGenNumbers scores higher because more documents contain the field (N) and fewer contain the term (n), as seen in the idf formula:
"value" : 7.617782,
"description" : "weight(RandomGenNumbers:199 in 6588) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 7.617782,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 6.359767,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 134,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 77755,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
}
Here is the SSN match with the lower score:
"value" : 3.146309,
"description" : "weight(SSN:199 in 6131) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 3.146309,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 2.2155435,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 6938,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 63600,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
}
The same happens when searching for "990".
Is there a way to boost substring closer to the beginning of the string?
The score is higher whenever the query makes up a more substantial portion of the term. In your case, 199 is 3/8 of 23381990 but only 3/12 of 199012121234, so the first is rated higher.
To boost matches at the beginning of the term, you can use a boosted query with the regex 199.* (slower searches) or an edge n-gram subfield (larger index, slower indexing). If the performance is acceptable, you could use regexes for "near-beginning" matches as well; otherwise you might need a few more edge n-gram fields, populated at indexing time with one or a few leading characters removed.
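A minimal sketch of the edge n-gram variant, assuming a hypothetical index ssn-index-test with a prefix subfield (names and the boost value are illustrative):
PUT ssn-index-test
{
"settings": {
"max_ngram_diff": 20,
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"filter": "lowercase",
"tokenizer": "ngram_tokenizer"
},
"edge_analyzer": {
"type": "custom",
"filter": "lowercase",
"tokenizer": "edge_tokenizer"
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": ["letter", "digit"]
},
"edge_tokenizer": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": ["letter", "digit"]
}
}
}
},
"mappings": {
"properties": {
"SSN": {
"type": "text",
"analyzer": "ngram_analyzer",
"fields": {
"prefix": {
"type": "text",
"analyzer": "edge_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
## Query both: the base field matches substrings anywhere, while a boosted match
## on the prefix subfield lifts documents where 199 starts the value.
GET ssn-index-test/_search
{
"query": {
"bool": {
"should": [
{ "match": { "SSN": { "query": "199", "analyzer": "standard" } } },
{ "match": { "SSN.prefix": { "query": "199", "boost": 3 } } }
]
}
}
}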

Elasticsearch: unexpected relevancy score for optional fields in documents

I'm probably missing something trivial here, but I'm having issues with the relevancy score of the search results when it comes to optional fields in documents. Consider the following example:
Test data:
DELETE /my-index
PUT /my-index
POST /my-index/_bulk
{"index":{"_id":"1"}}
{"required_field":"RareWord"}
{"index":{"_id":"2"}}
{"required_field":"RareWord"}
{"index":{"_id":"3"}}
{"required_field":"CommonWord"}
{"index":{"_id":"4"}}
{"required_field":"CommonWord"}
{"index":{"_id":"5"}}
{"required_field":"CommonWord"}
{"index":{"_id":"6"}}
{"required_field":"CommonWord"}
{"index":{"_id":"7"}}
{"required_field":"CommonWord"}
{"index":{"_id":"8"}}
{"required_field":"CommonWord"}
{"index":{"_id":"9"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
{"index":{"_id":"10"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
Search Query:
If I run a search query similar to one below:
GET /my-index/_search
{"query":{"multi_match":{"query":"RareWord AnotherRareWord","fields":["required_field","optional_field"]}}}
Expectation
The end-user would expect Documents #9 and #10 to score higher than the others, because they contain the exact two words of the search query in their optional_field.
Reality
Document #1 scores better than #10, even though it only contains one of the two words of the search query, which is the opposite of what end-users would most likely expect.
A closer look at _explain
Here is the _explain results of running the same search query for Document #1:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 1.4816045,
"description" : "max of:",
"details" : [
{
"value" : 1.4816045,
"description" : "sum of:",
"details" : [
{
"value" : 1.4816045,
"description" : "weight(required_field:rareword in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.4816045,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 1.4816046,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 10,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
And here is the _explain results of running the same search query for Document #10:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "10",
"matched" : true,
"explanation" : {
"value" : 0.36464313,
"description" : "max of:",
"details" : [
{
"value" : 0.36464313,
"description" : "sum of:",
"details" : [
{
"value" : 0.18232156,
"description" : "weight(optional_field:rareword in 9) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.18232156,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.18232156,
"description" : "weight(optional_field:anotherrareword in 9) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.18232156,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
As you can see, Document #10 scores worse, mainly due to the lower IDF value (0.18232156). Looking closely, this is because IDF uses "N, total number of documents with field": 2, instead of simply considering the total number of documents in the index: 10.
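The arithmetic bears this out. For required_field:rareword in Document #1, idf = log(1 + (10 - 2 + 0.5) / (2 + 0.5)) = log(4.4) ≈ 1.4816, while for optional_field:rareword in Document #10 only 2 documents have the field at all, so idf = log(1 + (2 - 2 + 0.5) / (2 + 0.5)) = log(1.2) ≈ 0.18232.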
Question
My question is: is there any way to force the multi_match query to consider all documents (instead of only those that contain the field) when computing the IDF value for an optional field, resulting in a relevance score that is closer to end-user expectations?
Or alternatively, is there a better way to write the search query, so I get the expected results?
Any help would be greatly appreciated. Thanks.
Your situation seems similar to the one described for the cross_fields query type, so you should probably try it:
{
"multi_match": {
"query": "RareWord AnotherRareWord",
"fields": ["required_field","optional_field"],
"type": "cross_fields",
"operator": "and"
}
}
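As a complete request against the example index above (a sketch):
GET /my-index/_search
{
"query": {
"multi_match": {
"query": "RareWord AnotherRareWord",
"fields": ["required_field", "optional_field"],
"type": "cross_fields",
"operator": "and"
}
}
}
cross_fields blends the term statistics across the listed fields, which sidesteps the per-field IDF skew described above.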

What's the difference between Id's query and Term Query when finding documents by "_id"?

I want to get a document by "_id", and I have 3 choices:
1. GET the document by "_id":
GET order/_doc/001
2. Use the ids query:
GET order/_search
{ "query": { "ids": { "values": ["001"] } } }
Though the ids query takes an array of ids, I will be using it to get only one document at a time, so I just pass one id in "values": ["001"].
3. Use a term query:
GET order/_search
{ "query": { "term": { "_id": "001" } } }
I want to know the difference between the ids query and the term query, performance-wise, and any other points I should be aware of.
Which one should I choose (ids query or term query)?
Any help is much appreciated:)
The first option is not a search and simply gets the document by id.
If you look at the execution plan of the second and third queries, you'll notice that they are identical:
Ids query:
GET order/_search
{
"explain": true,
"query": {
"ids": {
"values": ["001"]
}
}
}
Execution plan:
"_explanation" : {
"value" : 1.0,
"description" : "sum of:",
"details" : [
{
"value" : 1.0,
"description" : "ConstantScore(_id:[fe 0 1f])",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "match on required clause, product of:",
"details" : [
{
"value" : 0.0,
"description" : "# clause",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "DocValuesFieldExistsQuery [field=_primary_term]",
"details" : [ ]
}
]
}
]
}
Term query:
GET order/_search
{
"explain": true,
"query": {
"term": {
"_id": "001"
}
}
}
Execution plan:
"_explanation" : {
"value" : 1.0,
"description" : "sum of:",
"details" : [
{
"value" : 1.0,
"description" : "ConstantScore(_id:[fe 0 1f])",
"details" : [ ]
},
{
"value" : 0.0,
"description" : "match on required clause, product of:",
"details" : [
{
"value" : 0.0,
"description" : "# clause",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "DocValuesFieldExistsQuery [field=_primary_term]",
"details" : [ ]
}
]
}
]
}
Any difference? None!
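One practical difference between option 1 and the two search-based options is worth noting: document GETs are realtime, while searches only see documents after a refresh. A quick sketch (the "status" field is illustrative):
PUT order/_doc/001
{ "status": "new" }
## Realtime: returns the document immediately
GET order/_doc/001
## Search-based: may return no hits until the index refreshes
## (by default up to one second later)
GET order/_search
{ "query": { "ids": { "values": ["001"] } } }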

Why is queryWeight included for some result scores, but not others, in the same query?

I'm executing a query_string query with one term on multiple fields, _all and tags.name, and trying to understand the scoring. Query: {"query":{"query_string":{"query":"animal","fields":["_all","tags.name"]}}}. Here are the documents returned by the query:
Document 1 has an exact match on tags.name, but not on _all.
Document 8 has an exact match on both tags.name and on _all.
Document 8 should win, and it does, but I'm confused by how the scoring works out. It seems like Document 1 is getting penalized by having its tags.name score multiplied by the IDF twice, whereas Document 8's tags.name score is only multiplied by the IDF once. In short:
They both have a component weight(tags.name:animal in 0) [PerFieldSimilarity].
In Document 1, we have weight = score = queryWeight x fieldWeight.
In Document 8, we have weight = fieldWeight!
Since queryWeight contains idf, this results in Document 1 getting penalized by its idf twice.
Can anyone make sense of this?
Additional information
If I remove _all from the fields of the query, queryWeight is completely gone from the explain.
Adding "use_dis_max":true as an option has no effect.
However, additionally adding "tie_breaker":0.7 (or any value) does affect Document 8 by giving it the more-complicated formula we see in Document 1.
Thoughts: It's plausible that a boolean query (which this is) might do this on purpose to give more weight to queries that match more than one sub-query. However, this doesn't make any sense for a dis_max query, which is supposed to just return the maximum of the sub-queries.
Here are the relevant explain requests. Look for embedded comments.
Document 1 (match only on tags.name):
curl -XGET 'http://localhost:9200/questions/question/1/_explain?pretty' -d '{"query":{"query_string":{"query":"animal","fields":["_all","tags.name"]}}}':
{
"ok" : true,
"_index" : "questions_1390104463",
"_type" : "question",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 0.058849156,
"description" : "max of:",
"details" : [ {
"value" : 0.058849156,
"description" : "weight(tags.name:animal in 0) [PerFieldSimilarity], result of:",
// weight = score = queryWeight x fieldWeight
"details" : [ {
// score and queryWeight are NOT a part of the other explain!
"value" : 0.058849156,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [ {
"value" : 0.30685282,
"description" : "queryWeight, product of:",
"details" : [ {
// This idf is NOT a part of the other explain!
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 1.0,
"description" : "queryNorm"
} ]
}, {
"value" : 0.19178301,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0"
} ]
}, {
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 0.625,
"description" : "fieldNorm(doc=0)"
} ]
} ]
} ]
} ]
}
Document 8 (match on both _all and tags.name):
curl -XGET 'http://localhost:9200/questions/question/8/_explain?pretty' -d '{"query":{"query_string":{"query":"animal","fields":["_all","tags.name"]}}}':
{
"ok" : true,
"_index" : "questions_1390104463",
"_type" : "question",
"_id" : "8",
"matched" : true,
"explanation" : {
"value" : 0.15342641,
"description" : "max of:",
"details" : [ {
"value" : 0.033902764,
"description" : "btq, product of:",
"details" : [ {
"value" : 0.033902764,
"description" : "weight(_all:anim in 0) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.033902764,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 0.70710677,
"description" : "tf(freq=0.5), with freq of:",
"details" : [ {
"value" : 0.5,
"description" : "phraseFreq=0.5"
} ]
}, {
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 0.15625,
"description" : "fieldNorm(doc=0)"
} ]
} ]
}, {
"value" : 1.0,
"description" : "allPayload(...)"
} ]
}, {
"value" : 0.15342641,
"description" : "weight(tags.name:animal in 0) [PerFieldSimilarity], result of:",
// weight = fieldWeight
// No score or queryWeight in sight!
"details" : [ {
"value" : 0.15342641,
"description" : "fieldWeight in 0, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0"
} ]
}, {
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 0.5,
"description" : "fieldNorm(doc=0)"
} ]
} ]
} ]
}
}
I have no answer yet. I just want to mention that I posted the question to the Elasticsearch forum: https://groups.google.com/forum/#!topic/elasticsearch/xBKlFkq0SP0
I'll update here when I get an answer.
