Elasticsearch is not returning a document I expect in the search results - elasticsearch

I have a collection of customers that have a first name, last name, email, description and owner id. I want to take a character string from the app, and search on all the fields, with a priority order. Im using boost to achieve that.
Currently I have a lot of test customers with the name Sean in various fields within the documents. I have 2 documents that contain an email with sean.jones#email.com. One document contains the same email in the description.
When I perform the following search, im missing the document in the search results that does not contain the email in the description.
Here is my query:
{
"query" : {
"bool" : {
"filter" : {
"match" : {
"ownerId" : "acct_123"
}
},
"must" : [
{
"bool" : {
"should" : [
{
"prefix" : {
"firstName" : {
"value" : "sean",
"boost" : 3
}
}
},
{
"prefix" : {
"lastName" : {
"value" : "sean",
"boost" : 3
}
}
},
{
"terms" : {
"boost" : 2,
"description" : [
"sean"
]
}
},
{
"prefix" : {
"email" : {
"value" : "sean",
"boost" : 1
}
}
}
]
}
}
]
}
}
}
Here is the document that Im missing:
{
"_index" : "xxx",
"_id" : "cus_123",
"_version" : 1,
"_type" : "customers",
"_seq_no" : 9096,
"_primary_term" : 1,
"found" : true,
"_source" : {
"firstName" : null,
"id" : "cus_123",
"lastName" : null,
"email" : "sean.jones#email.com",
"ownerId" : "acct_123",
"description" : null
}
}
When I look at the current results, all of the documents have a score of 3.0. They have "Sean" in the name as well, so they score higher. When I do an _explain on the document im missing, with the query above, I get the following:
{
"_index": "xxx",
"_type": "customers",
"_id": "cus_123",
"matched": true,
"explanation": {
"value": 1.0,
"description": "sum of:",
"details": [
{
"value": 1.0,
"description": "sum of:",
"details": [
{
"value": 1.0,
"description": "ConstantScore(email._index_prefix:sean)",
"details": []
}
]
},
{
"value": 0.0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0.0,
"description": "# clause",
"details": []
},
{
"value": 1.0,
"description": "ownerId:acct_123",
"details": []
}
]
}
]
}
}
Here are my mappings:
{
"properties": {
"firstName": {
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"email": {
"analyzer": "my_email_analyzer",
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"lastName": {
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"description": {
"type": "text"
},
"ownerId": {
"type": "text"
}
}
}
"my_email_analyzer": {
"type": "custom",
"tokenizer": "uax_url_email"
}
If im understanding this correctly, because this document is only scoring a 1, its not meeting a particular threshold. Ive tried adjusting the min_score but I had no luck. Any thoughts on how I can get this document to be included in the search results?
thanks so much

It depends on what mean by "missing":
is it, that the document does not make it into the number of hits (the "total")?
or is it, that the document itself does not show up as a hit in the hits list?
If it's #2 you may want to increase the number of documents Elasticsearch fetches and returns, by adding a size-clause to your search request (default size is 10):
Example
"size": 50

Related

ElasticSearch Query fields based on conditions on another field

Mapping
PUT /employee
{
"mappings": {
"post": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"properties":{
"id" : { "type" : "integer"},
"value" : { "type" : "keyword"}
}
},
"primary_email_id":{
"type": "integer"
}
}
}
}
}
Data
POST employee/post/1
{
"name": "John",
"email_ids": [
{
"id" : 1,
"value" : "1#email.com"
},
{
"id" : 2,
"value" : "2#email.com"
}
],
"primary_email_id": 2 // Here 2 refers to the id field of email_ids.id (2#email.com).
}
I need help to form a query to check if an email id is already taken as a primary email?
eg: If I query for 1#email.com I should get result as No as 1#email.com is not a primary email id.
If I query for 2#email.com I should get result as Yes as 2#email.com is a primary email id for John.
As far as i know with this mapping you can not achive what you are expecting.
But, You can create email_ids field as nested type and add one more field like isPrimary and set value of it to true whenever email is primary email.
Index Mapping
PUT employee
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"type": "nested",
"properties": {
"id": {
"type": "integer"
},
"value": {
"type": "keyword"
},
"isPrimary":{
"type": "boolean"
}
}
},
"primary_email_id": {
"type": "integer"
}
}
}
}
Sample Document
POST employee/_doc/1
{
"name": "John",
"email_ids": [
{
"id": 1,
"value": "1#email.com"
},
{
"id": 2,
"value": "2#email.com",
"isPrimary": true
}
],
"primary_email_id": 2
}
Query
You need to keep below query as it is and only need to change email address when you want to see if email is primary or not.
POST employee/_search
{
"_source": false,
"query": {
"nested": {
"path": "email_ids",
"query": {
"bool": {
"must": [
{
"term": {
"email_ids.value": {
"value": "2#email.com"
}
}
},
{
"term": {
"email_ids.isPrimary": {
"value": "true"
}
}
}
]
}
}
}
}
}
Result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.98082924,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.98082924
}
]
}
}
Interpret Result:
Elasticsearch will not return result in boolean like true or false but you can implement it at application level. You can consider value of hits.total.value from result, if it is 0 then you can consider false otherwise true.
PS: Answer is based on ES version 7.10.

How to get per term statistics in Elasticsearch

I need to implement the following (on the backend): a user types a query and gets back hits as well as statistics for the hits. Below is a simplified example.
Suppose the query is Grif, then the user gets back (random words just for example)
Griffith
Griffin
Grif
Grift
Griffins
And frequency + number of documents a certain term occurs in, for example:
Griffith (freq 10, 3 docs)
Griffin (freq 17, 9 docs)
Grif (freq 6, 3 docs)
Grift (freq 9, 5 docs)
Griffins (freq 11, 4 docs)
I'm relatively new to Elasticsearch, so I'm not sure where to start to implement something like this. What type of query is the most suitable for this? What can I use to get that kind of statistics? Any other advice will be appreciated too.
There are multiple layers to this. You'd need:
n-gram / partial / search-as-you-type matching
a way to group the matched keywords by their original form
a mechanism to reversely look up the document & term frequencies.
I'm not aware of any way to achieve this in one go, but here's my take on it.
You could start off with a special, n-gram-powered analyzer, as explained in my other answer. There's the original content field, plus a multi-field mapping for the said analyzer, plus a keyword field to aggregate on down the line:
PUT my-index
{
"settings": {
"index": {
"max_ngram_diff": 20
},
"analysis": {
"tokenizer": {
"my_ngrams": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"my_ngrams_analyzer": {
"tokenizer": "my_ngrams",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"fields": {
"analyzed": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Next, bulk-insert some sample docs containing text inside the content field. Note that each doc has an _id too — you'll need those later on.
POST _bulk
{"index":{"_index":"my-index", "_id":1}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":2}}
{"content":"Griffin"}
{"index":{"_index":"my-index", "_id":3}}
{"content":"Grif"}
{"index":{"_index":"my-index", "_id":4}}
{"content":"Grift"}
{"index":{"_index":"my-index", "_id":5}}
{"content":"Griffins"}
{"index":{"_index":"my-index", "_id":6}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":7}}
{"content":"Griffins"}
Search for n-grams in the .analyzed field and group the matched documents by the original terms through the terms aggregation. At the same time, retrieve the _id of one of the bucketed documents through the top_hits aggregation. BTW — it doesn't matter which _id is returned in a given bucket — all will have contained the same bucketed term.
POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.doc_count,aggregations.*.buckets.*.hits.hits._id
{
"size": 0,
"query": {
"term": {
"content.analyzed": "grif"
}
},
"aggs": {
"full_terms": {
"terms": {
"field": "content.keyword",
"size": 10
},
"aggs": {
"top_doc": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
}
Observe the response. The filter_path URL parameter from the previous request reduces the response to just those attributes that we need — the untouched, original full_terms plus one of the underlying IDs:
{
"aggregations" : {
"full_terms" : {
"buckets" : [
{
"key" : "Griffins",
"doc_count" : 2,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "5"
}
]
}
}
},
{
"key" : "Griffith",
"doc_count" : 2,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "1"
}
]
}
}
},
{
"key" : "Grif",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "3"
}
]
}
}
},
{
"key" : "Griffin",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "2"
}
]
}
}
},
{
"key" : "Grift",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "4"
}
]
}
}
}
]
}
}
}
Time for the fun part.
There's a specialized Elasticsearch API called Term Vectors which does exactly what you're after — it retrieves field & term stats from the whole index. In order for it to hand these stats over to you, it needs the document IDs — which you'll have obtained from the above aggregation!
Finally, since you've got multiple term vectors to work with, you can use the Multi term vectors API like so — again condensing the response thru filter_path:
POST /my-index/_mtermvectors?filter_path=docs.term_vectors.*.*.*.doc_freq,docs.term_vectors.*.*.*.term_freq
{
"docs": [
{
"_id": "5", <--- guaranteeing
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "1", <--- the response
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "3", <--- order
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "2",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "4",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
}
]
}
The result can be post-processed in your backend to form your autocomplete response. You've got A) the full terms, B) the number of matching documents (doc_freq), and C), the term frequency:
{
"docs" : [
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffins" : { | term
"doc_freq" : 2, | <-- # of docs
"term_freq" : 1 | term frequency
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffith" : {
"doc_freq" : 2,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Grif" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffin" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Grift" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
}
]
}
Shameless plug: if you're new to Elasticsearch and, just like me, learn best from real-world examples, consider buying my Elasticsearch Handbook.

How to store real estate data in an elastic search?

I have Real Estate data. I am looking into storing it into elastic search to allow users to search the database real time.
I want to be able to let my users search by key fields like price, lot size, year-built, total bedrooms, etc. However, I also want to be able to let the user filter by keywords or amenities like "Has Pool", "Has Spa", "Parking Space", "Community"..
Additionally, I need to keep a distinct list of property type, property status, schools, community, etc so I can create drop down menu for my user to select from.
What should the stored data structure look like? How can I maintain a list of the distinct schools, community, type to use that to create drop down menu for the user to pick from?
The current data I have is basically a key/value pairs. I can clean it up and standardize it before storing it into Elastic Search but puzzled on what is considered a good approach to store this data?
Based on your question I will provide baseline mappings and a basic query with facets/filters for you to start working with.
Mappings
PUT test_jay
{
"mappings": {
"properties": {
"amenities": {
"type": "keyword"
},
"description": {
"type": "text"
},
"location": {
"type": "geo_point"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"status": {
"type": "keyword"
},
"type": {
"type": "keyword"
}
}
}
}
We will use "keyword" field type for that fields you will be always be doing exact matches like a drop down list.
For fields we want to do only full text search like description we use type "text". In some cases like titles I want to have both field types.
I created a location geo_type field in case you want to put your properties in a map or do distance based searches, like near houses.
For amenities a keyword field type is enough to store an array of amenities.
Ingest document
POST test_jay/_doc
{
"name": "Nice property",
"description": "nice located fancy property",
"location": {
"lat": 37.371623,
"lon": -122.003338
},
"amenities": [
"Pool",
"Parking",
"Community"
],
"type": "House",
"status": "On sale"
}
Remember keyword fields are case sensitive!
Search query
POST test_jay/_search
{
"query": {
"bool": {
"must": {
"multi_match": {
"query": "nice",
"fields": [
"name",
"description"
]
}
},
"filter": [
{
"term": {
"status": "On sale"
}
},
{
"term": {
"amenities":"Pool"
}
},
{
"term": {
"type": "House"
}
}
]
}
},
"aggs": {
"amenities": {
"terms": {
"field": "amenities",
"size": 10
}
},
"status": {
"terms": {
"field": "status",
"size": 10
}
},
"type": {
"terms": {
"field": "type",
"size": 10
}
}
}
}
The multi match part will do a full text search in the title and description fields. You are filling this one with the regular search box.
Then the filter part is filled by dropdown lists.
Query Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test_jay",
"_type" : "_doc",
"_id" : "zWysGHgBLiMtJ3pUuvZH",
"_score" : 0.2876821,
"_source" : {
"name" : "Nice property",
"description" : "nice located fancy property",
"location" : {
"lat" : 37.371623,
"lon" : -122.003338
},
"amenities" : [
"Pool",
"Parking",
"Community"
],
"type" : "House",
"status" : "On sale"
}
}
]
},
"aggregations" : {
"amenities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Community",
"doc_count" : 1
},
{
"key" : "Parking",
"doc_count" : 1
},
{
"key" : "Pool",
"doc_count" : 1
}
]
},
"type" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "House",
"doc_count" : 1
}
]
},
"status" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "On sale",
"doc_count" : 1
}
]
}
}
}
With the query response you can fill the facets for future filters.
I recommend you to play around with this and then come back with more specific questions.

Elastic Multimatch Query doesn't match document

I am querying elastic (v6.7) for items that match the phrase "x-ray" with the query below:
POST item/_search
{
"query": {
"bool": {
"must": {
"multi_match": {
"type": "phrase_prefix",
"query": "X-Ray",
"fields": [
"mpn",
"product_description"
"manufacturer_name"
],
"operator": "and",
"analyzer": "standard"
}
}
}
}
}
The result set is empty.
I have item documents that contain the phrase "x-ray". For example if I query:
GET items/_doc/3e4a2d80-9d5e-11e7-a6c5-6ddf18575461
It returns:
{
"_index": "items",
"_type": "_doc",
"_id": "3e4a2d80-9d5e-11e7-a6c5-6ddf18575461",
"_version": 1,
"_seq_no": 7605,
"_primary_term": 1,
"found": true,
"_source": {
"manufacturer_name": "GE",
"var_pricing": 0,
"on_hand": 1,
...
"product_description": "Portable X-Ray w/Fuji CR Reader", <----This should be a match!
"project_id": null,
"user_id": "12",
"quote_items": [],
"parentCategory": [
0
]
}
}
If I run a query on a freshly installed version of elastic (v7.3) where I add three documents like so:
POST product/_bulk
{"index":{"_id":1001}}
{"name":"x-ray Machine","price":152000,"in_stock":38,"sold":47,"tags":["Alcohol","Wine"],"description":"x-ray machine for x-rays","is_active":true,"created":"2004\/05\/13"}
{"index":{"_id":1002}}
{"name":"X-Ray film","price":99,"in_stock":10,"sold":430,"tags":[],"description":"just some x-ray film","is_active":true,"created":"2007\/10\/14"}
{"index":{"_id":1003}}
{"name":"Table","price":2500,"in_stock":24,"sold":215,"tags":[],"description":"could be used for an x-ray table","is_active":true,"created":"2000\/11\/17"}
Then query with:
POST product/_search
{
"query": {
"bool": {
"must": {
"multi_match": {
"type": "phrase_prefix",
"query": "X-Ray",
"fields": [
"name",
"description"
],
"operator": "and",
"analyzer": "standard"
}
}
}
}
}
All three items are returned:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 31.876595,
"hits" : [
{
"_index" : "product",
"_type" : "default",
"_id" : "1001",
"_score" : 31.876595,
"_source" : {
"name" : "x-ray Machine",
"price" : 152000,
"in_stock" : 38,
"sold" : 47,
"tags" : [
"Alcohol",
"Wine"
],
"description" : "x-ray machine for x-rays",
"is_active" : true,
"created" : "2004/05/13"
}
},
{
"_index" : "product",
"_type" : "default",
"_id" : "1002",
"_score" : 27.347116,
"_source" : {
"name" : "X-Ray film",
"price" : 99,
"in_stock" : 10,
"sold" : 430,
"tags" : [ ],
"description" : "just some x-ray film",
"is_active" : true,
"created" : "2007/10/14"
}
},
{
"_index" : "product",
"_type" : "default",
"_id" : "1003",
"_score" : 25.889376,
"_source" : {
"name" : "Table",
"price" : 2500,
"in_stock" : 24,
"sold" : 215,
"tags" : [ ],
"description" : "could be used for an x-ray table",
"is_active" : true,
"created" : "2000/11/17"
}
}
]
}
}
What gives?
I used the explain API to get some more insight but all it says is that there isn't a match:
POST items/_doc/3e4a2d80-9d5e-11e7-a6c5-6ddf18575461/_explain
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"type": "phrase_prefix",
"query": "X-Ray",
"fields": [
"product_description",
"mpn",
"manufacturer_name"
],
"operator": "and",
"analyzer": "standard"
}}
]
}
}
}
}
Returns:
{
"_index": "items",
"_type": "_doc",
"_id": "3e4a2d80-9d5e-11e7-a6c5-6ddf18575461",
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0,
"description": "no match on required clause (((+product_description:x +product_description:ray) | (+mpn:x +mpn:ray) | (+manufacturer_name:x +manufacturer_name:ray)))",
"details": [
{
"value": 0,
"description": "No matching clause",
"details": []
}
]
},
{
"value": 0,
"description": "no match on required clause (MatchNoDocsQuery(\"Type list does not contain the index type\"))",
"details": [
{
"value": 0,
"description": "MatchNoDocsQuery(\"Type list does not contain the index type\") doesn't match id 12556",
"details": []
}
]
}
]
}
}
Not much changes when I change the analyzer to whitespace or keyword either.
( this is not answer but I could not type all this up in a comment)
I am not sure you really needed to use analyzer with your query if you intended to match X-Ray as a whole.
look at this
POST _analyze
{
"analyzer": "standard",
"text":"X-Ray"
}
and the response is
{
"tokens" : [
{
"token" : "x",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "ray",
"start_offset" : 2,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
so your search term X-Ray became x and ray. Is this what you intended?
So I determined my problem was that the standard analyzer is being applied all the time because it was set in the mappings to use a custom analyzer (which used the standard analyzer).
shown here:
GET items/_mapping
shows
...
"manufacturer_name": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
}
},
"analyzer": "my_search_analyzer",
"search_analyzer": "standard"
},
...
This is the same for the other two index fields I was querying for.
The lesson here:
Check the mappings to assure no custom analyzers have been set for certain fields if you are having issues with search.

Elasticsearch: Influence scoring with custom score field in document pt.2

Having these documents:
{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
and
{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
I want to get the _score calculated based on the confidence values for each tag. For example if you search "mountain" it should return only doc with id 1 obviously, if you search "landscape", score of 2 should be higher then 1, as confidence of landscape in 2 is higher than 1 (48.36 vs 33.66). If you search for "coast landscape", this time score of 1 should be higher than 2, because doc 1 has both coast and landscape in the tags array. I also want to multiply the score with "boost_multiplier" to boost some documents against others.
I found this question in SO, Elasticsearch: Influence scoring with custom score field in document
But when I tried the accepted solution (i enabled scripting in my ES server), it returns both documents with having _score 1.0, regardless the search term. Here is my query that I tried:
{
"query": {
"nested": {
"path": "tags",
"score_mode": "sum",
"query": {
"function_score": {
"query": {
"match": {
"tags.tag": "coast landscape"
}
},
"script_score": {
"script": "doc[\"confidence\"].value"
}
}
}
}
}
}
I also tried what #yahermann suggested in the comments, replacing "script_score" with "field_value_factor" : { "field" : "confidence" }, still the same result. Any idea why it fails, or is there better way to do it?
Just to have complete picture, here is the mapping definition that I've used:
{
"mappings": {
"photo": {
"properties": {
"created_at": {
"type": "date"
},
"description": {
"type": "text"
},
"height": {
"type": "short"
},
"id": {
"type": "keyword"
},
"tags": {
"type": "nested",
"properties": {
"tag": { "type": "string" },
"confidence": { "type": "float"}
}
},
"width": {
"type": "short"
},
"color": {
"type": "string"
},
"boost_multiplier": {
"type": "float"
}
}
}
},
"settings": {
"number_of_shards": 1
}
}
UPDATE
Following the answer of #Joanna below, I tried the query, but in fact, whatever I put in match query, coast, foo, bar, it always return both documents with _score 1.0 for both of them, I tried it on elasticsearch 2.4.6, 5.3, 5.5.1 in Docker. Here is the response I get:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 1635
{"took":24,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index","_type":"my_type","_id":"2","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
},{"_index":"my_index","_type":"my_type","_id":"1","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
}]}}
UPDATE-2
I found this one on SO: Elasticsearch: "function_score" with "boost_mode":"replace" ignores function score
It basically says, if function doesn't match, it returns 1. That makes sense, but I'm running the query for the same docs. That's confusing.
FINAL UPDATE
Finally I found the problem, stupid me. ES101, if you send GET request to search api, it returns all documents with score 1.0 :) You should send POST request... Thx a lot #Joanna, it works perfectly!!!
You may try this query - it combines scoring with both: confidence and boost_multiplier fields:
{
"query": {
"function_score": {
"query": {
"bool": {
"should": [{
"nested": {
"path": "tags",
"score_mode": "sum",
"query": {
"function_score": {
"query": {
"match": {
"tags.tag": "landscape"
}
},
"field_value_factor": {
"field": "tags.confidence",
"factor": 1,
"missing": 0
}
}
}
}
}]
}
},
"field_value_factor": {
"field": "boost_multiplier",
"factor": 1,
"missing": 0
}
}
}
}
When I search with coast term - it returns:
document with id=1 as only this one has this term, and the scoring is "_score": 100.27469.
When I search with landscape term - it returns two documents:
document with id=2 and scoring "_score": 85.83046
document with id=1 and scoring "_score": 59.7339
As document with id=2 has higher value of confidence field, it gets higher scoring.
When I search with coast landscape term - it returns two documents:
document with id=1 and scoring "_score": 160.00859
document with id=2 and scoring "_score": 85.83046
Although document with id=2 has higher value of confidence field, document with id=1 has both matching words so it gets much higher scoring. By changing the value of "factor": 1 parameter, you can decide how much confidence should influence the results.
boost_muliplier field
More interesting thing happens when I index a new document: let's say it is almost the same as document with id=2 but I set "boost_multiplier" : 4 and "id": 3:
{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "3",
"tags" : [
...
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
...
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 4
}
Running the same query with coast landscape term returns three documents:
document with id=3 and scoring "_score": 360.02664
document with id=1 and scoring "_score": 182.09859
document with id=2 and scoring "_score": 90.00666
Although document with id=3 has only one matching word (landscape), its boost_multiplier value considerably increased the scoring. Here, with "factor": 1, you can also decide how much this value should increase scoring and with "missing": 0 decide what should happen if no such field is indexed.

Resources