Elasticsearch ngram: boost substring matches at or near the beginning of the string

I have this ngram setting:
"settings": {
"max_ngram_diff": 20,
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"filter": "lowercase",
"tokenizer": "ngram_tokenizer"
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"whitespace",
"custom"
],
"custom_token_chars": "-:."
}
}
}
}
It is used to analyze SSNs and randomly generated numbers, with the following field mappings:
"SSN": {
"type": "text",
"analyzer": "ngram_analyzer"
},
"RandomGenNumbers": {
"type": "text",
"analyzer": "ngram_analyzer"
}
When searching on both fields like this:
{
"match": {
"RandomGenNumbers": {
"analyzer": "standard",
"minimum_should_match": "100%",
"query": "199"
}
}
},
{
"match": {
"SSN": {
"analyzer": "standard",
"minimum_should_match": "100%",
"query": "199"
}
}
}
I was expecting the document with SSN: 199012121234 to come before RandomGenNumbers: 23381990, but I'm getting RandomGenNumbers first with a score of 7.6 while SSN scored 3.1.
When I run _explain on the search result, it seems RandomGenNumbers got the higher score because more documents contain that field (N) and fewer contain the term (n), if you look at the idf formula:
"value" : 7.617782,
"description" : "weight(RandomGenNumbers:199 in 6588) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 7.617782,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 6.359767,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 134,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 77755,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
}
Here is the explanation for SSN, with the lower score:
"value" : 3.146309,
"description" : "weight(SSN:199 in 6131) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 3.146309,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 2.2155435,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 6938,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 63600,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
}
The same happens when searching "990".
Is there a way to boost substring matches closer to the beginning of the string?

The score is higher whenever the query makes up a more substantial portion of the term. In your case, 199 is 3/8 of 23381990 but only 3/12 of 199012121234, so the first is rated higher.
To boost matches at the beginning of the term you can use a boosted query with the regex 199.* (slower searches) or an edge n-grams subfield (larger index, slower indexing). If the performance is acceptable, you could use regexes for "near-beginning" matches as well; otherwise you might need a few more edge-ngram fields, populated at index time by removing one or a few leading characters.
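For illustration, here is a minimal sketch of the edge n-grams variant. The analyzer, tokenizer and subfield names (edge_analyzer, edge_tokenizer, SSN.edge) are my own, and ngram_analyzer is the one from the question; search_analyzer is set to standard so the query string itself is not edge-ngrammed at search time. The subfield indexes only prefixes of each term, so a boosted match on it rewards terms that start with the query:
"settings": {
"analysis": {
"analyzer": {
"edge_analyzer": {
"type": "custom",
"filter": "lowercase",
"tokenizer": "edge_tokenizer"
}
},
"tokenizer": {
"edge_tokenizer": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": ["letter", "digit"]
}
}
}
}
## Mapping: keep the existing ngram field, add a prefix-only subfield
"SSN": {
"type": "text",
"analyzer": "ngram_analyzer",
"fields": {
"edge": {
"type": "text",
"analyzer": "edge_analyzer",
"search_analyzer": "standard"
}
}
}
## Query: substring matches everywhere, prefix matches get an extra boost
{
"bool": {
"should": [
{ "match": { "SSN": "199" } },
{ "match": { "SSN.edge": { "query": "199", "boost": 2 } } }
]
}
}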

Related

Elasticsearch re-indexing same document causing score changes

We have created an index with the following document:
POST sample-index-test/_doc/1
{
"first_name": "James",
"last_name" : "Osaka"
}
There is only one document in the index. When we run the _explain API with a match query on the index:
GET sample-index-test/_explain/1
{
"query": {
"match": {
"first_name": "James"
}
}
}
The _explain API returns the details below:
score : 0.2876821
number of documents containing term : 1
total number of documents with field : 1
{
"_index" : "sample-index-test",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 0.2876821,
"description" : "weight(first_name:james in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 1,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
}
Now, we run the same index request multiple times within the span of a few seconds:
POST sample-index-test/_doc/1
{
"first_name": "James",
"last_name" : "Cena"
}
Running the same _explain API again returns a different score, with changed values for "number of documents containing term" and "total number of documents with field":
score : 0.046520013
number of documents containing term : 10
total number of documents with field : 10
{
"_index" : "sample-index-test",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 0.046520013,
"description" : "weight(first_name:james in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.046520013,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.046520017,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 10,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 10,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
}
Why is Elasticsearch increasing the counts of "total number of documents with field" and "number of documents containing term" when the index only contains a single document?
Elasticsearch is built on Lucene, and all documents are stored in segments. Segments are immutable, so a document update is a two-step process: a new document is created and the old document is marked as deleted. When you create the first document, the segments contain just one document. If you then index the same document 10 times, 9 of them will be marked as deleted and 1 will be the latest version. That is why "total number of documents with field" and "number of documents containing term" change.
You can test this using the _forcemerge endpoint. Force merge merges the segments and clears the deleted documents from them.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
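As a side note (my addition, not part of the original answer), you can watch the deleted documents pile up with the cat segments API, which reports docs.count and docs.deleted per segment:
## Shows docs.count and docs.deleted for each segment of the index
GET _cat/segments/sample-index-test?v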
## 1. Create the document
POST sample-index-test/_doc/1
{
"first_name": "James",
"last_name" : "Osaka"
}
## 2. Get the explain score
GET sample-index-test/_explain/1
{
"query": {
"match": {
"first_name": "James"
}
}
}
## "value": 0.2876821,
## n, number of documents containing term => 1
## N, total number of documents with field => 1
## 3.1. Execute this 10 times
POST sample-index-test/_doc/1
{
"first_name": "James",
"last_name" : "Cena"
}
## 3.2 You can also execute this one
POST sample-index-test/_update/1
{
"script" : "ctx._source.first_name = 'James'; ctx._source.last_name = 'Cena';"
}
## 3.3 You can even use _update_by_query
POST sample-index-test/_update_by_query
{
"query": {
"match": {
"first_name": "James"
}
},
"script": {
"source": "ctx._source.first_name = 'James'; ctx._source.last_name = 'Cena';",
"lang": "painless"
}
}
## 4. Get the explain score
GET sample-index-test/_explain/1
{
"query": {
"match": {
"first_name": "James"
}
}
}
## "value": 0.046520013,
## n, number of documents containing term => 10
## N, total number of documents with field => 10
## 5. Execute the force merge.
POST sample-index-test/_forcemerge
## 6. The force merge runs in the background, so you may need to wait a couple of seconds.
GET sample-index-test/_explain/1
{
"query": {
"match": {
"first_name": "James"
}
}
}
## "value": 0.2876821,
## n, number of documents containing term => 1
## N, total number of documents with field => 1
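If you want to be sure the deleted documents are purged, the force merge API also accepts a max_num_segments parameter (documented behaviour; not part of the original answer): merging everything down to a single segment rewrites the segments and drops all deletes.
POST sample-index-test/_forcemerge?max_num_segments=1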

Elasticsearch: unexpected relevancy score for optional fields in documents

I'm probably missing something trivial here, but I'm having issues with the relevancy score of the search results when it comes to optional fields in documents. Consider the following example:
Test data:
DELETE /my-index
PUT /my-index
POST /my-index/_bulk
{"index":{"_id":"1"}}
{"required_field":"RareWord"}
{"index":{"_id":"2"}}
{"required_field":"RareWord"}
{"index":{"_id":"3"}}
{"required_field":"CommonWord"}
{"index":{"_id":"4"}}
{"required_field":"CommonWord"}
{"index":{"_id":"5"}}
{"required_field":"CommonWord"}
{"index":{"_id":"6"}}
{"required_field":"CommonWord"}
{"index":{"_id":"7"}}
{"required_field":"CommonWord"}
{"index":{"_id":"8"}}
{"required_field":"CommonWord"}
{"index":{"_id":"9"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
{"index":{"_id":"10"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
Search Query:
If I run a search query similar to the one below:
GET /my-index/_search
{"query":{"multi_match":{"query":"RareWord AnotherRareWord","fields":["required_field","optional_field"]}}}
Expectation
The end-user would expect Documents #9 and #10 to score higher than the others, because they contain the exact two words of the search query in their optional_field.
Reality
Document #1 scores better than #10, even though it only contains one of the two words of the search query, which is the opposite of what end-users would most likely expect.
A closer look at _explain
Here are the _explain results of running the same search query for Document #1:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 1.4816045,
"description" : "max of:",
"details" : [
{
"value" : 1.4816045,
"description" : "sum of:",
"details" : [
{
"value" : 1.4816045,
"description" : "weight(required_field:rareword in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.4816045,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 1.4816046,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 10,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
And here are the _explain results of running the same search query for Document #10:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "10",
"matched" : true,
"explanation" : {
"value" : 0.36464313,
"description" : "max of:",
"details" : [
{
"value" : 0.36464313,
"description" : "sum of:",
"details" : [
{
"value" : 0.18232156,
"description" : "weight(optional_field:rareword in 9) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.18232156,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.18232156,
"description" : "weight(optional_field:anotherrareword in 9) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.18232156,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
As you can see, Document #10 scores worse, mainly due to the lower IDF value (0.18232156). Looking closely, that's because IDF uses N, the total number of documents with the field (2), instead of simply the total number of documents in the index (10).
Question
My question is: is there any way to force the multi_match query to consider all the documents (instead of only those that contain the field) when computing the IDF value for an optional field, resulting in a relevance score closer to end-user expectations?
Or alternatively, is there a better way to write the search query so that I get the expected results?
Any help would be greatly appreciated. Thanks.
Your situation seems to be similar to the one described for the cross_fields query type, so you should probably try it:
{
"multi_match": {
"query": "RareWord AnotherRareWord",
"fields": ["required_field","optional_field"],
"type": "cross_fields",
"operator": "and"
}
}
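For context (my explanation, not the answerer's): cross_fields blends per-term statistics across the listed fields, so a term's IDF is not penalized for living in a sparsely populated field. A complete request against the example index would look like:
GET /my-index/_search
{
"query": {
"multi_match": {
"query": "RareWord AnotherRareWord",
"fields": ["required_field", "optional_field"],
"type": "cross_fields",
"operator": "and"
}
}
}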

Is it possible to set the TYPE parameter using the simple query string query in Elasticsearch

When using a query string query in ES and matching multiple fields, I can set a TYPE parameter to configure how ES combines/scores matches across multiple fields.
E.g. I want to match two fields in my index and combine the scores from both fields:
GET /_search
{
"query": {
"query_string" : {
"query" : "test",
"fields": ["titel", "content"],
"type": "most_fields"
}
}
}
This parameter seems to be missing from the simple query string query. What is the default mode for simple query string? How are scores chosen/combined? Is it possible to set the type?
The simple query string query doesn't have a type parameter. It does a sum of the scores from each field.
Consider the index below, and let's see how different queries calculate the score, using the explain API.
Mapping:
PUT testindex6
{
"mappings": {
"properties": {
"title":{
"type": "text"
},
"description":{
"type": "text"
}
}
}
}
Data:
POST testindex6/_doc
{
"title": "dog",
"description":"dog is brown"
}
1. Query_string best_fields(default)
Finds documents which match any field, but uses the _score from the best field.
GET testindex6/_search?explain=true
{
"query": {
"query_string": {
"default_field": "*",
"query": "dog brown",
"type":"best_fields"
}
}
}
Result:
"_explanation" : {
"value" : 0.5753642,
"description" : "max of:",
"details" : [
{
"value" : 0.5753642,
"description" : "sum of:",
},
{
"value" : 0.2876821,
"description" : "sum of:",
}
]
}
best_fields takes the max score across the matched fields.
2. Query_string most_fields
Sums the scores from the matched fields.
GET testindex6/_search?explain=true
{
"query": {
"query_string": {
"default_field": "*",
"query": "dog brown",
"type":"most_fields"
}
}
}
Result
"_explanation" : {
"value" : 0.8630463,
"description" : "sum of:",
"details" : [
{
"value" : 0.5753642,
"description" : "sum of:"
....
},
{
"value" : 0.2876821,
"description" : "sum of:"
....
}
]
}
}
3. Simple_Query_String
Query
GET testindex6/_search?explain=true
{
"query": {
"simple_query_string": {
"query": "dog brown",
"fields": ["*"]
}
}
}
Result:
"_explanation" : {
"value" : 0.8630463,
"description" : "sum of:",
"details" : [
{
"value" : 0.5753642,
"description" : "sum of:",
},
{
"value" : 0.2876821,
"description" : "sum of:"
}
]
}
}
So you can see the score is the same for most_fields and simple_query_string (both sum the scores). But there is a difference between them. Consider the index below:
I have created a field title of type text with a subfield shingles that uses a shingle analyzer.
PUT index_2
{
"settings": {
"analysis": {
"analyzer": {
"analyzer_shingle": {
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"shingles": {
"search_analyzer": "analyzer_shingle",
"analyzer": "analyzer_shingle",
"type": "text"
}
}
}
}
}
}
Data:
POST index_2/_doc
{
"title":"the brown fox"
}
1. Most_fields
Query:
GET index_2/_search?explain=true
{
"query": {
"query_string": {
"query": "brown fox",
"fields": ["*"],
"type":"most_fields"
}
}
}
Result:
"_explanation" : {
"value" : 1.3650365,
"description" : "sum of:",
"details" : [
{
"value" : 0.7896724,
"description" : "sum of:",
},
{
"value" : 0.5753642,
"description" : "sum of:",
}
]
}
2. Simple_Query_string
Query
GET index_2/_search?explain=true
{
"query": {
"simple_query_string": {
"query": "brown fox",
"fields": ["*"]
}
}
}
Result:
"_explanation" : {
"value" : 1.2632996,
"description" : "sum of:",
"details" : [
{
"value" : 0.6316498,
"description" : "sum of:",
},
{
"value" : 0.6316498,
"description" : "sum of:"
}
]
}
}
As you can see, the score differs between most_fields and simple_query_string even though both sum the scores.
The reason is that most_fields uses each field's own analyzer at query time (remember, title (standard) and title.shingles (analyzer_shingle) have different analyzers), while simple_query_string uses the index's default analyzer (standard) for all fields.
If we query with most_fields and force it to use the standard analyzer, you will see the score is the same:
Query:
GET index_2/_search?explain=true
{
"query": {
"query_string": {
"query": "brown fox",
"fields": ["*"],
"type":"most_fields",
"analyzer": "standard"-->instead of field analyzer respectively use standard for all
}
}
}
Result:
"_explanation" : {
"value" : 1.2632996,
"description" : "sum of:"
"details" : [
{
"value" : 0.6879354,
"description" : "sum of:"
},
{
"value" : 0.5753642,
"description" : "sum of:"
}
]
}
simple_query_string, I think, is meant for simple scenarios; if you are using different analyzers for different fields, use query_string (or bool with match queries, as sketched below) instead.
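As an illustration of the bool/match alternative (my sketch, not from the original answer), each match clause analyzes the query with its own field's analyzer, and the should clauses sum their scores:
GET index_2/_search?explain=true
{
"query": {
"bool": {
"should": [
{ "match": { "title": "brown fox" } },
{ "match": { "title.shingles": "brown fox" } }
]
}
}
}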

Score_mode avg returns 1 for all documents

I'm using function_score with score_mode: avg and boost_mode: replace.
According to the documentation, I expect the query score to be overridden by the scores of the function filters (because I'm using boost_mode: replace).
This works as expected for sum and multiply, but not for avg (I'm aware that the average in function_score is a weighted average).
When I apply this function_score, all the documents get a score of 1.
How can this happen?
GET kibana_sample_data_ecommerce/_search
{
"_source": {
"includes": ["customer_last_name", "customer_first_name", "customer_gender"]
},
"size": 10,
"query": {
"function_score": {
"functions": [
{
"filter": { "match": { "customer_last_name": "Cook" } },
"weight": 2
},
{
"filter": { "match": { "customer_first_name": "Jackson" } },
"weight": 4
},
{
"filter": { "match": { "customer_gender" : "MALE"} },
"weight": 8
}
],
"score_mode": "avg",
"boost_mode": "replace"
}
}
}
So this is a bit weird, but the link provided by @jzzfs is already pretty close. The average mode of the function score query computes a weighted average, which causes this effect:
In case score_mode is set to avg the individual scores will be combined by a weighted average. For example, if two functions return score 1 and 2 and their respective weights are 3 and 4, then their scores will be combined as (1*3+2*4)/(3+4) and not (1*3+2*4)/2.
In addition, it's important to note that because of this, function scores whose filters don't match the current document have no effect on the average, rather than reducing it. In your example, this means that if a document only matches by having a MALE customer, its raw function score will be 8, but since it's weighted it'll actually score (1*8)/8 = 1. If it's a MALE with the first name Jackson, the score again will be (1*8 + 1*4)/(8+4) = 1. This can easily be seen by using the explain API:
GET kibana_sample_data_ecommerce/_explain/ER5Bv3ABEiTwEf3FhKws
{
"query": {
"function_score": {
"functions": [
{
"filter": { "match": { "customer_last_name": "Cook" } },
"weight": 2
},
{
"filter": { "match": { "customer_first_name": "Jackson" } },
"weight": 4
},
{
"filter": { "match": { "customer_gender" : "MALE"} },
"weight": 8
}
],
"score_mode": "avg",
"boost_mode": "replace"
}
}
}
returns
{
"_index" : "kibana_sample_data_ecommerce",
"_type" : "_doc",
"_id" : "ER5Bv3ABEiTwEf3FhKws",
"matched" : true,
"explanation" : {
"value" : 1.0,
"description" : "min of:",
"details" : [
{
"value" : 1.0,
"description" : "function score, score mode [avg]",
"details" : [
{
"value" : 8.0,
"description" : "function score, product of:",
"details" : [
{
"value" : 1.0,
"description" : "match filter: customer_gender:MALE",
"details" : [ ]
},
{
"value" : 8.0,
"description" : "product of:",
"details" : [
{
"value" : 1.0,
"description" : "constant score 1.0 - no function provided",
"details" : [ ]
},
{
"value" : 8.0,
"description" : "weight",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 3.4028235E38,
"description" : "maxBoost",
"details" : [ ]
}
]
}
}
It's already answered here: since you used boost_mode: replace, only the function scores are used and the query score is ignored.
Based on that, because each matching function contributes its weight both as the score and as the averaging weight, the weights cancel each other out and the result is 1.
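If the goal is differentiated scores rather than a constant 1, one option (my suggestion, not from the answers above) is score_mode: sum, which would score a document matching all three filters at 2 + 4 + 8 = 14:
GET kibana_sample_data_ecommerce/_search
{
"query": {
"function_score": {
"functions": [
{ "filter": { "match": { "customer_last_name": "Cook" } }, "weight": 2 },
{ "filter": { "match": { "customer_first_name": "Jackson" } }, "weight": 4 },
{ "filter": { "match": { "customer_gender": "MALE" } }, "weight": 8 }
],
"score_mode": "sum",
"boost_mode": "replace"
}
}
}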

Elasticsearch is not returning a document I expect in the search results

I have a collection of customers with a first name, last name, email, description and owner id. I want to take a character string from the app and search all of the fields, in a priority order. I'm using boost to achieve that.
Currently I have a lot of test customers with the name Sean in various fields of the documents. I have 2 documents that contain the email sean.jones@email.com. One document also contains the same email in the description.
When I perform the following search, the document that does not contain the email in the description is missing from the search results.
Here is my query:
{
"query" : {
"bool" : {
"filter" : {
"match" : {
"ownerId" : "acct_123"
}
},
"must" : [
{
"bool" : {
"should" : [
{
"prefix" : {
"firstName" : {
"value" : "sean",
"boost" : 3
}
}
},
{
"prefix" : {
"lastName" : {
"value" : "sean",
"boost" : 3
}
}
},
{
"terms" : {
"boost" : 2,
"description" : [
"sean"
]
}
},
{
"prefix" : {
"email" : {
"value" : "sean",
"boost" : 1
}
}
}
]
}
}
]
}
}
}
Here is the document that I'm missing:
{
"_index" : "xxx",
"_id" : "cus_123",
"_version" : 1,
"_type" : "customers",
"_seq_no" : 9096,
"_primary_term" : 1,
"found" : true,
"_source" : {
"firstName" : null,
"id" : "cus_123",
"lastName" : null,
"email" : "sean.jones#email.com",
"ownerId" : "acct_123",
"description" : null
}
}
When I look at the current results, all of the documents have a score of 3.0. They have "Sean" in the name as well, so they score higher. When I run _explain on the document I'm missing, with the query above, I get the following:
{
"_index": "xxx",
"_type": "customers",
"_id": "cus_123",
"matched": true,
"explanation": {
"value": 1.0,
"description": "sum of:",
"details": [
{
"value": 1.0,
"description": "sum of:",
"details": [
{
"value": 1.0,
"description": "ConstantScore(email._index_prefix:sean)",
"details": []
}
]
},
{
"value": 0.0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0.0,
"description": "# clause",
"details": []
},
{
"value": 1.0,
"description": "ownerId:acct_123",
"details": []
}
]
}
]
}
}
Here are my mappings:
{
"properties": {
"firstName": {
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"email": {
"analyzer": "my_email_analyzer",
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"lastName": {
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"description": {
"type": "text"
},
"ownerId": {
"type": "text"
}
}
}
"my_email_analyzer": {
"type": "custom",
"tokenizer": "uax_url_email"
}
If I'm understanding this correctly, because this document only scores a 1, it's not meeting a particular threshold. I've tried adjusting min_score but had no luck. Any thoughts on how I can get this document included in the search results?
Thanks so much.
It depends on what you mean by "missing":
is it that the document does not make it into the number of hits (the "total")?
or is it that the document itself does not show up as a hit in the hits list?
If it's #2, you may want to increase the number of documents Elasticsearch fetches and returns by adding a size clause to your search request (the default size is 10):
Example
"size": 50
