Elasticsearch is not returning a document I expect in the search results

I have a collection of customers, each with a first name, last name, email, description and owner id. I want to take a character string from the app and search across all of those fields in priority order. I'm using boost to achieve that.
Currently I have a lot of test customers with the name Sean in various fields within the documents. I have 2 documents that contain an email with sean.jones#email.com. One of them also contains the same email in the description.
When I perform the following search, the document that does not contain the email in the description is missing from the search results.
Here is my query:
{
"query" : {
"bool" : {
"filter" : {
"match" : {
"ownerId" : "acct_123"
}
},
"must" : [
{
"bool" : {
"should" : [
{
"prefix" : {
"firstName" : {
"value" : "sean",
"boost" : 3
}
}
},
{
"prefix" : {
"lastName" : {
"value" : "sean",
"boost" : 3
}
}
},
{
"terms" : {
"boost" : 2,
"description" : [
"sean"
]
}
},
{
"prefix" : {
"email" : {
"value" : "sean",
"boost" : 1
}
}
}
]
}
}
]
}
}
}
Here is the document that I'm missing:
{
"_index" : "xxx",
"_id" : "cus_123",
"_version" : 1,
"_type" : "customers",
"_seq_no" : 9096,
"_primary_term" : 1,
"found" : true,
"_source" : {
"firstName" : null,
"id" : "cus_123",
"lastName" : null,
"email" : "sean.jones#email.com",
"ownerId" : "acct_123",
"description" : null
}
}
When I look at the current results, all of the documents have a score of 3.0. They have "Sean" in the name as well, so they score higher. When I run _explain on the missing document with the query above, I get the following:
{
"_index": "xxx",
"_type": "customers",
"_id": "cus_123",
"matched": true,
"explanation": {
"value": 1.0,
"description": "sum of:",
"details": [
{
"value": 1.0,
"description": "sum of:",
"details": [
{
"value": 1.0,
"description": "ConstantScore(email._index_prefix:sean)",
"details": []
}
]
},
{
"value": 0.0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0.0,
"description": "# clause",
"details": []
},
{
"value": 1.0,
"description": "ownerId:acct_123",
"details": []
}
]
}
]
}
}
Here are my mappings:
{
"properties": {
"firstName": {
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"email": {
"analyzer": "my_email_analyzer",
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"lastName": {
"type": "text",
"index_prefixes": {
"max_chars": 10,
"min_chars": 1
}
},
"description": {
"type": "text"
},
"ownerId": {
"type": "text"
}
}
}
"my_email_analyzer": {
"type": "custom",
"tokenizer": "uax_url_email"
}
If I'm understanding this correctly, because this document only scores 1.0, it's not meeting a particular threshold. I've tried adjusting min_score but had no luck. Any thoughts on how I can get this document included in the search results?
Thanks so much.

It depends on what you mean by "missing":
1. Is it that the document does not make it into the number of hits (the "total")?
2. Or is it that the document itself does not show up as a hit in the hits list?
If it's #2, you may want to increase the number of documents Elasticsearch fetches and returns by adding a size parameter to your search request (the default size is 10). Your _explain output shows "matched": true with a score of 1.0, so the document does match; it simply ranks below the documents that score 3.0.
Example
"size": 50

Related

ElasticSearch Query fields based on conditions on another field

Mapping
PUT /employee
{
"mappings": {
"post": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"properties":{
"id" : { "type" : "integer"},
"value" : { "type" : "keyword"}
}
},
"primary_email_id":{
"type": "integer"
}
}
}
}
}
Data
POST employee/post/1
{
"name": "John",
"email_ids": [
{
"id" : 1,
"value" : "1#email.com"
},
{
"id" : 2,
"value" : "2#email.com"
}
],
"primary_email_id": 2 // Here 2 refers to the id field of email_ids.id (2#email.com).
}
I need help forming a query that checks whether an email id is already taken as a primary email.
E.g. if I query for 1#email.com, I should get No as the result, because 1#email.com is not a primary email id.
If I query for 2#email.com, I should get Yes as the result, because 2#email.com is the primary email id for John.

As far as I know, you cannot achieve what you are expecting with this mapping.
However, you can map the email_ids field as a nested type, add one more field such as isPrimary, and set it to true whenever the email is the primary email.
Index Mapping
PUT employee
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"type": "nested",
"properties": {
"id": {
"type": "integer"
},
"value": {
"type": "keyword"
},
"isPrimary":{
"type": "boolean"
}
}
},
"primary_email_id": {
"type": "integer"
}
}
}
}
Sample Document
POST employee/_doc/1
{
"name": "John",
"email_ids": [
{
"id": 1,
"value": "1#email.com"
},
{
"id": 2,
"value": "2#email.com",
"isPrimary": true
}
],
"primary_email_id": 2
}
Query
Keep the query below as it is and change only the email address when you want to check whether an email is primary.
POST employee/_search
{
"_source": false,
"query": {
"nested": {
"path": "email_ids",
"query": {
"bool": {
"must": [
{
"term": {
"email_ids.value": {
"value": "2#email.com"
}
}
},
{
"term": {
"email_ids.isPrimary": {
"value": "true"
}
}
}
]
}
}
}
}
}
Result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.98082924,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.98082924
}
]
}
}
Interpret Result:
Elasticsearch will not return a boolean such as true or false, but you can implement that at the application level. Look at hits.total.value in the result: if it is 0, treat the answer as false; otherwise treat it as true.
PS: This answer is based on ES version 7.10.
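The application-level check described above could look roughly like this. It is a sketch, not the answerer's code: the function names primary_email_query and is_primary_email are hypothetical, and the response is assumed to be the parsed JSON of the search shown above.

```python
def primary_email_query(email):
    """Build the nested query from the answer for an arbitrary email address."""
    return {
        "_source": False,
        "query": {
            "nested": {
                "path": "email_ids",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"email_ids.value": {"value": email}}},
                            {"term": {"email_ids.isPrimary": {"value": True}}},
                        ]
                    }
                },
            }
        },
    }


def is_primary_email(response):
    """Return True when the nested search matched at least one document."""
    total = response["hits"]["total"]
    # ES 7.x reports {"value": n, "relation": ...}; older versions report a bare integer.
    count = total["value"] if isinstance(total, dict) else total
    return count > 0


# Response shaped like the one in the answer (trimmed to the relevant part).
print(is_primary_email({"hits": {"total": {"value": 1, "relation": "eq"}}}))
```

Both terms sit inside the same nested query, so they must match within the same email_ids entry, which is the reason for the nested mapping.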

How to get per term statistics in Elasticsearch

I need to implement the following (on the backend): a user types a query and gets back hits as well as statistics for the hits. Below is a simplified example.
Suppose the query is Grif, then the user gets back (random words just for example)
Griffith
Griffin
Grif
Grift
Griffins
And frequency + number of documents a certain term occurs in, for example:
Griffith (freq 10, 3 docs)
Griffin (freq 17, 9 docs)
Grif (freq 6, 3 docs)
Grift (freq 9, 5 docs)
Griffins (freq 11, 4 docs)
I'm relatively new to Elasticsearch, so I'm not sure where to start to implement something like this. What type of query is the most suitable for this? What can I use to get that kind of statistics? Any other advice will be appreciated too.

There are multiple layers to this. You'd need:
n-gram / partial / search-as-you-type matching
a way to group the matched keywords by their original form
a mechanism to reversely look up the document & term frequencies.
I'm not aware of any way to achieve this in one go, but here's my take on it.
You could start off with a special, n-gram-powered analyzer, as explained in my other answer. There's the original content field, plus a multi-field mapping for the said analyzer, plus a keyword field to aggregate on down the line:
PUT my-index
{
"settings": {
"index": {
"max_ngram_diff": 20
},
"analysis": {
"tokenizer": {
"my_ngrams": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"my_ngrams_analyzer": {
"tokenizer": "my_ngrams",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"fields": {
"analyzed": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Next, bulk-insert some sample docs containing text inside the content field. Note that each doc has an _id too — you'll need those later on.
POST _bulk
{"index":{"_index":"my-index", "_id":1}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":2}}
{"content":"Griffin"}
{"index":{"_index":"my-index", "_id":3}}
{"content":"Grif"}
{"index":{"_index":"my-index", "_id":4}}
{"content":"Grift"}
{"index":{"_index":"my-index", "_id":5}}
{"content":"Griffins"}
{"index":{"_index":"my-index", "_id":6}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":7}}
{"content":"Griffins"}
Search for n-grams in the .analyzed field and group the matched documents by the original terms through the terms aggregation. At the same time, retrieve the _id of one of the bucketed documents through the top_hits aggregation. BTW — it doesn't matter which _id is returned in a given bucket — all will have contained the same bucketed term.
POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.doc_count,aggregations.*.buckets.*.hits.hits._id
{
"size": 0,
"query": {
"term": {
"content.analyzed": "grif"
}
},
"aggs": {
"full_terms": {
"terms": {
"field": "content.keyword",
"size": 10
},
"aggs": {
"top_doc": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
}
Observe the response. The filter_path URL parameter from the previous request reduces the response to just those attributes that we need — the untouched, original full_terms plus one of the underlying IDs:
{
"aggregations" : {
"full_terms" : {
"buckets" : [
{
"key" : "Griffins",
"doc_count" : 2,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "5"
}
]
}
}
},
{
"key" : "Griffith",
"doc_count" : 2,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "1"
}
]
}
}
},
{
"key" : "Grif",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "3"
}
]
}
}
},
{
"key" : "Griffin",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "2"
}
]
}
}
},
{
"key" : "Grift",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "4"
}
]
}
}
}
]
}
}
}
Time for the fun part.
There's a specialized Elasticsearch API called Term Vectors which does exactly what you're after — it retrieves field & term stats from the whole index. In order for it to hand these stats over to you, it needs the document IDs — which you'll have obtained from the above aggregation!
Finally, since you've got multiple term vectors to work with, you can use the Multi term vectors API like so — again condensing the response thru filter_path:
POST /my-index/_mtermvectors?filter_path=docs.term_vectors.*.*.*.doc_freq,docs.term_vectors.*.*.*.term_freq
{
"docs": [
{
"_id": "5", <--- guaranteeing
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "1", <--- the response
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "3", <--- order
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "2",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "4",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
}
]
}
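Writing the five docs entries by hand gets repetitive, so the request body can be generated from the aggregation response instead. This is a sketch under assumptions: the function name term_vector_request is hypothetical, and the input is assumed to be the parsed JSON of the filtered aggregation response shown earlier (aggregation names full_terms and top_doc as in the answer).

```python
def term_vector_request(agg_response, field="content.keyword"):
    """Collect one _id per bucket and build an _mtermvectors request body."""
    buckets = agg_response["aggregations"]["full_terms"]["buckets"]
    # Each bucket carries exactly one hit because top_hits was asked for size 1.
    ids = [bucket["top_doc"]["hits"]["hits"][0]["_id"] for bucket in buckets]
    return {
        "docs": [
            {
                "_id": doc_id,
                "fields": [field],
                "payloads": False,
                "positions": False,
                "offsets": False,
                "field_statistics": False,
                "term_statistics": True,
            }
            for doc_id in ids
        ]
    }


sample = {"aggregations": {"full_terms": {"buckets": [
    {"key": "Griffins", "doc_count": 2, "top_doc": {"hits": {"hits": [{"_id": "5"}]}}},
    {"key": "Griffith", "doc_count": 2, "top_doc": {"hits": {"hits": [{"_id": "1"}]}}},
]}}}
body = term_vector_request(sample)
print([doc["_id"] for doc in body["docs"]])
```

The docs list keeps the bucket order, which is what the arrows in the request above rely on.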
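The result can be post-processed in your backend to form your autocomplete response. You've got A) the full terms, B) the number of matching documents (doc_freq), and C) the term frequency within the looked-up document (term_freq). If you need the total number of occurrences across the index instead, term_statistics also returns ttf (total term frequency), which you could add to filter_path: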
{
"docs" : [
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffins" : { | term
"doc_freq" : 2, | <-- # of docs
"term_freq" : 1 | term frequency
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffith" : {
"doc_freq" : 2,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Grif" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffin" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Grift" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
}
]
}
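The post-processing step could be as small as the following sketch. The function name autocomplete_stats and the output record shape are assumptions for illustration; the input is the parsed JSON of the filtered _mtermvectors response above.

```python
def autocomplete_stats(mtv_response, field="content.keyword"):
    """Flatten an _mtermvectors response into one record per term."""
    records = []
    for doc in mtv_response["docs"]:
        terms = doc["term_vectors"][field]["terms"]
        for term, stats in terms.items():
            records.append({
                "term": term,
                "doc_freq": stats["doc_freq"],
                "term_freq": stats["term_freq"],
            })
    return records


sample = {"docs": [
    {"term_vectors": {"content.keyword": {"terms": {"Griffins": {"doc_freq": 2, "term_freq": 1}}}}},
    {"term_vectors": {"content.keyword": {"terms": {"Grif": {"doc_freq": 1, "term_freq": 1}}}}},
]}
for record in autocomplete_stats(sample):
    print(record)
```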

How to store real estate data in an elastic search?

I have real estate data. I am looking into storing it in Elasticsearch to allow users to search the database in real time.
I want to let my users search by key fields like price, lot size, year built, total bedrooms, etc. However, I also want to let the user filter by keywords or amenities like "Has Pool", "Has Spa", "Parking Space", "Community".
Additionally, I need to keep a distinct list of property types, property statuses, schools, communities, etc. so I can create drop-down menus for my users to select from.
What should the stored data structure look like? How can I maintain a list of the distinct schools, communities and types to build those drop-down menus?
The data I currently have is basically key/value pairs. I can clean it up and standardize it before storing it in Elasticsearch, but I am puzzled about what is considered a good approach to storing this data.

Based on your question I will provide baseline mappings and a basic query with facets/filters for you to start working with.
Mappings
PUT test_jay
{
"mappings": {
"properties": {
"amenities": {
"type": "keyword"
},
"description": {
"type": "text"
},
"location": {
"type": "geo_point"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"status": {
"type": "keyword"
},
"type": {
"type": "keyword"
}
}
}
}
We use the "keyword" field type for fields where you will always be doing exact matches, like a drop-down list.
For fields where we only want full-text search, like description, we use the type "text". In some cases, like titles, I want to have both field types.
I created a location field of type geo_point in case you want to put your properties on a map or do distance-based searches, such as finding nearby houses.
For amenities, a keyword field type is enough to store an array of amenities.
Ingest document
POST test_jay/_doc
{
"name": "Nice property",
"description": "nice located fancy property",
"location": {
"lat": 37.371623,
"lon": -122.003338
},
"amenities": [
"Pool",
"Parking",
"Community"
],
"type": "House",
"status": "On sale"
}
Remember keyword fields are case sensitive!
Search query
POST test_jay/_search
{
"query": {
"bool": {
"must": {
"multi_match": {
"query": "nice",
"fields": [
"name",
"description"
]
}
},
"filter": [
{
"term": {
"status": "On sale"
}
},
{
"term": {
"amenities":"Pool"
}
},
{
"term": {
"type": "House"
}
}
]
}
},
"aggs": {
"amenities": {
"terms": {
"field": "amenities",
"size": 10
}
},
"status": {
"terms": {
"field": "status",
"size": 10
}
},
"type": {
"terms": {
"field": "type",
"size": 10
}
}
}
}
The multi_match part does a full-text search in the name and description fields. You fill this one from the regular search box.
The filter part is filled from the drop-down lists.
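To make the split between search box and drop-downs concrete, here is a sketch that assembles the same request body from user input. The function name build_search and its arguments are hypothetical; the field names come from the mapping above.

```python
def build_search(text, filters):
    """Combine the free-text box with exact-match drop-down selections."""
    return {
        "query": {
            "bool": {
                # Full-text part, scored.
                "must": {
                    "multi_match": {"query": text, "fields": ["name", "description"]}
                },
                # Exact-match part, not scored; one term filter per selection.
                "filter": [
                    {"term": {field: value}} for field, value in filters.items()
                ],
            }
        },
        # One terms aggregation per facet shown in the UI.
        "aggs": {
            name: {"terms": {"field": name, "size": 10}}
            for name in ("amenities", "status", "type")
        },
    }


body = build_search("nice", {"status": "On sale", "amenities": "Pool", "type": "House"})
print(body["query"]["bool"]["filter"])
```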
Query Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test_jay",
"_type" : "_doc",
"_id" : "zWysGHgBLiMtJ3pUuvZH",
"_score" : 0.2876821,
"_source" : {
"name" : "Nice property",
"description" : "nice located fancy property",
"location" : {
"lat" : 37.371623,
"lon" : -122.003338
},
"amenities" : [
"Pool",
"Parking",
"Community"
],
"type" : "House",
"status" : "On sale"
}
}
]
},
"aggregations" : {
"amenities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Community",
"doc_count" : 1
},
{
"key" : "Parking",
"doc_count" : 1
},
{
"key" : "Pool",
"doc_count" : 1
}
]
},
"type" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "House",
"doc_count" : 1
}
]
},
"status" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "On sale",
"doc_count" : 1
}
]
}
}
}
With the query response you can fill the facets for future filters.
I recommend that you play around with this and then come back with more specific questions.
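Turning the aggregations section into drop-down options might look like this sketch. The function name facet_options is hypothetical, and the input is assumed to be the parsed JSON response shown above.

```python
def facet_options(response):
    """Map each aggregation name to its (value, count) pairs for a drop-down."""
    return {
        name: [(bucket["key"], bucket["doc_count"]) for bucket in agg["buckets"]]
        for name, agg in response["aggregations"].items()
    }


sample = {"aggregations": {
    "amenities": {"buckets": [{"key": "Community", "doc_count": 1},
                              {"key": "Pool", "doc_count": 1}]},
    "type": {"buckets": [{"key": "House", "doc_count": 1}]},
}}
print(facet_options(sample))
```

Note that these buckets only cover documents matching the current query and filters. For the complete list of distinct values, one option is to run the same aggregations with a match_all query and "size": 0.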

Elastic Multimatch Query doesn't match document

I am querying elastic (v6.7) for items that match the phrase "x-ray" with the query below:
POST item/_search
{
"query": {
"bool": {
"must": {
"multi_match": {
"type": "phrase_prefix",
"query": "X-Ray",
"fields": [
"mpn",
"product_description"
"manufacturer_name"
],
"operator": "and",
"analyzer": "standard"
}
}
}
}
}
The result set is empty.
I have item documents that contain the phrase "x-ray". For example if I query:
GET items/_doc/3e4a2d80-9d5e-11e7-a6c5-6ddf18575461
It returns:
{
"_index": "items",
"_type": "_doc",
"_id": "3e4a2d80-9d5e-11e7-a6c5-6ddf18575461",
"_version": 1,
"_seq_no": 7605,
"_primary_term": 1,
"found": true,
"_source": {
"manufacturer_name": "GE",
"var_pricing": 0,
"on_hand": 1,
...
"product_description": "Portable X-Ray w/Fuji CR Reader", <----This should be a match!
"project_id": null,
"user_id": "12",
"quote_items": [],
"parentCategory": [
0
]
}
}
If I run a query on a freshly installed version of elastic (v7.3) where I add three documents like so:
POST product/_bulk
{"index":{"_id":1001}}
{"name":"x-ray Machine","price":152000,"in_stock":38,"sold":47,"tags":["Alcohol","Wine"],"description":"x-ray machine for x-rays","is_active":true,"created":"2004\/05\/13"}
{"index":{"_id":1002}}
{"name":"X-Ray film","price":99,"in_stock":10,"sold":430,"tags":[],"description":"just some x-ray film","is_active":true,"created":"2007\/10\/14"}
{"index":{"_id":1003}}
{"name":"Table","price":2500,"in_stock":24,"sold":215,"tags":[],"description":"could be used for an x-ray table","is_active":true,"created":"2000\/11\/17"}
Then query with:
POST product/_search
{
"query": {
"bool": {
"must": {
"multi_match": {
"type": "phrase_prefix",
"query": "X-Ray",
"fields": [
"name",
"description"
],
"operator": "and",
"analyzer": "standard"
}
}
}
}
}
All three items are returned:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 31.876595,
"hits" : [
{
"_index" : "product",
"_type" : "default",
"_id" : "1001",
"_score" : 31.876595,
"_source" : {
"name" : "x-ray Machine",
"price" : 152000,
"in_stock" : 38,
"sold" : 47,
"tags" : [
"Alcohol",
"Wine"
],
"description" : "x-ray machine for x-rays",
"is_active" : true,
"created" : "2004/05/13"
}
},
{
"_index" : "product",
"_type" : "default",
"_id" : "1002",
"_score" : 27.347116,
"_source" : {
"name" : "X-Ray film",
"price" : 99,
"in_stock" : 10,
"sold" : 430,
"tags" : [ ],
"description" : "just some x-ray film",
"is_active" : true,
"created" : "2007/10/14"
}
},
{
"_index" : "product",
"_type" : "default",
"_id" : "1003",
"_score" : 25.889376,
"_source" : {
"name" : "Table",
"price" : 2500,
"in_stock" : 24,
"sold" : 215,
"tags" : [ ],
"description" : "could be used for an x-ray table",
"is_active" : true,
"created" : "2000/11/17"
}
}
]
}
}
What gives?
I used the explain API to get some more insight but all it says is that there isn't a match:
POST items/_doc/3e4a2d80-9d5e-11e7-a6c5-6ddf18575461/_explain
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"type": "phrase_prefix",
"query": "X-Ray",
"fields": [
"product_description",
"mpn",
"manufacturer_name"
],
"operator": "and",
"analyzer": "standard"
}
}
]
}
}
}
Returns:
{
"_index": "items",
"_type": "_doc",
"_id": "3e4a2d80-9d5e-11e7-a6c5-6ddf18575461",
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0,
"description": "no match on required clause (((+product_description:x +product_description:ray) | (+mpn:x +mpn:ray) | (+manufacturer_name:x +manufacturer_name:ray)))",
"details": [
{
"value": 0,
"description": "No matching clause",
"details": []
}
]
},
{
"value": 0,
"description": "no match on required clause (MatchNoDocsQuery(\"Type list does not contain the index type\"))",
"details": [
{
"value": 0,
"description": "MatchNoDocsQuery(\"Type list does not contain the index type\") doesn't match id 12556",
"details": []
}
]
}
]
}
}
Not much changes when I change the analyzer to whitespace or keyword either.

(This is not an answer, but I could not fit all of this in a comment.)
I am not sure you really need to specify an analyzer in your query if you intend to match X-Ray as a whole.
Look at this:
POST _analyze
{
"analyzer": "standard",
"text":"X-Ray"
}
and the response is
{
"tokens" : [
{
"token" : "x",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "ray",
"start_offset" : 2,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
So your search term X-Ray became x and ray. Is this what you intended?
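For quick experiments without a cluster, the split can be imitated locally. This is only a rough stand-in for the standard analyzer on plain ASCII input (the real one follows Unicode text segmentation rules and treats some punctuation differently); the function name approx_standard_tokens is made up for this sketch, and the _analyze API remains the authoritative check.

```python
import re


def approx_standard_tokens(text):
    """Approximate the standard analyzer: split on non-alphanumerics, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


print(approx_standard_tokens("X-Ray"))
print(approx_standard_tokens("Portable X-Ray w/Fuji CR Reader"))
```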

So I determined that my problem came from the mappings: the fields were set to use a custom analyzer (my_search_analyzer) for indexing and the standard analyzer for searching, so the standard analyzer was being applied to the query all the time.
Shown here:
GET items/_mapping
shows
...
"manufacturer_name": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
}
},
"analyzer": "my_search_analyzer",
"search_analyzer": "standard"
},
...
This is the same for the other two index fields I was querying for.
The lesson here:
If you are having issues with search, check the mappings to make sure no custom analyzers have been set on the fields you are querying.

Elasticsearch: Influence scoring with custom score field in document pt.2

Having these documents:
{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
and
{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
I want the _score to be calculated from the confidence values of each tag. For example, if you search "mountain" it should obviously return only the doc with id 2; if you search "landscape", the score of 2 should be higher than that of 1, as the confidence of landscape in 2 is higher than in 1 (48.36 vs 33.66). If you search for "coast landscape", this time the score of 1 should be higher than that of 2, because doc 1 has both coast and landscape in the tags array. I also want to multiply the score by "boost_multiplier" to boost some documents over others.
I found this question on SO: Elasticsearch: Influence scoring with custom score field in document
But when I tried the accepted solution (I enabled scripting on my ES server), it returned both documents with a _score of 1.0, regardless of the search term. Here is the query that I tried:
{
"query": {
"nested": {
"path": "tags",
"score_mode": "sum",
"query": {
"function_score": {
"query": {
"match": {
"tags.tag": "coast landscape"
}
},
"script_score": {
"script": "doc[\"confidence\"].value"
}
}
}
}
}
}
I also tried what @yahermann suggested in the comments, replacing "script_score" with "field_value_factor" : { "field" : "confidence" }, with the same result. Any idea why it fails, or is there a better way to do it?
Just to give the complete picture, here is the mapping definition that I used:
{
"mappings": {
"photo": {
"properties": {
"created_at": {
"type": "date"
},
"description": {
"type": "text"
},
"height": {
"type": "short"
},
"id": {
"type": "keyword"
},
"tags": {
"type": "nested",
"properties": {
"tag": { "type": "string" },
"confidence": { "type": "float"}
}
},
"width": {
"type": "short"
},
"color": {
"type": "string"
},
"boost_multiplier": {
"type": "float"
}
}
}
},
"settings": {
"number_of_shards": 1
}
}
UPDATE
Following the answer from @Joanna below, I tried the query, but whatever I put in the match query (coast, foo, bar), it always returns both documents with a _score of 1.0. I tried it on Elasticsearch 2.4.6, 5.3 and 5.5.1 in Docker. Here is the response I get:
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 1635
{"took":24,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index","_type":"my_type","_id":"2","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "2",
"tags" : [
{
"confidence" : 84.09123410403951,
"tag" : "mountain"
},
{
"confidence" : 56.412795342449456,
"tag" : "valley"
},
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
{
"confidence" : 40.51100450186575,
"tag" : "mountains"
},
{
"confidence" : 33.14263528292239,
"tag" : "sky"
},
{
"confidence" : 31.064394646169404,
"tag" : "peak"
},
{
"confidence" : 29.372,
"tag" : "natural elevation"
}
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 1
}
},{"_index":"my_index","_type":"my_type","_id":"1","_score":1.0,"_source":{
"created_at" : "2017-07-31T20:30:14-04:00",
"description" : null,
"height" : 3213,
"id" : "1",
"tags" : [
{
"confidence" : 65.48948436785749,
"tag" : "beach"
},
{
"confidence" : 57.31950504425406,
"tag" : "sea"
},
{
"confidence" : 43.58207236617374,
"tag" : "coast"
},
{
"confidence" : 35.6857910950816,
"tag" : "sand"
},
{
"confidence" : 33.660057321079655,
"tag" : "landscape"
},
{
"confidence" : 32.53252312423727,
"tag" : "sky"
}
],
"width" : 5712,
"color" : "#0C0A07",
"boost_multiplier" : 1
}
}]}}
UPDATE-2
I found this one on SO: Elasticsearch: "function_score" with "boost_mode":"replace" ignores function score
It basically says that if the function doesn't match, it returns 1. That makes sense, but I'm running the query for the same docs. That's confusing.
FINAL UPDATE
Finally I found the problem, stupid me. ES101: if you send a GET request to the search API, it returns all documents with a score of 1.0 :) You should send a POST request... Thanks a lot @Joanna, it works perfectly!!!

You may try this query. It combines scoring from both the confidence and boost_multiplier fields:
{
"query": {
"function_score": {
"query": {
"bool": {
"should": [{
"nested": {
"path": "tags",
"score_mode": "sum",
"query": {
"function_score": {
"query": {
"match": {
"tags.tag": "landscape"
}
},
"field_value_factor": {
"field": "tags.confidence",
"factor": 1,
"missing": 0
}
}
}
}
}]
}
},
"field_value_factor": {
"field": "boost_multiplier",
"factor": 1,
"missing": 0
}
}
}
}
When I search with the coast term, it returns:
document with id=1, as only this one has the term, with "_score": 100.27469.
When I search with the landscape term, it returns two documents:
document with id=2 with "_score": 85.83046
document with id=1 with "_score": 59.7339
As the document with id=2 has a higher value in the confidence field, it gets a higher score.
When I search with the coast landscape term, it returns two documents:
document with id=1 with "_score": 160.00859
document with id=2 with "_score": 85.83046
Although the document with id=2 has a higher confidence value for landscape, the document with id=1 matches both words, so it gets a much higher score. By changing the value of the "factor": 1 parameter, you can decide how much confidence should influence the results.
boost_multiplier field
A more interesting thing happens when I index a new document: let's say it is almost the same as the document with id=2, but I set "boost_multiplier" : 4 and "id": 3:
{
"created_at" : "2017-07-31T20:43:17-04:00",
"description" : null,
"height" : 4934,
"id" : "3",
"tags" : [
...
{
"confidence" : 48.36547551196872,
"tag" : "landscape"
},
...
],
"width" : 4016,
"color" : "#FEEBF9",
"boost_multiplier" : 4
}
Running the same query with the coast landscape term returns three documents:
document with id=3 with "_score": 360.02664
document with id=1 with "_score": 182.09859
document with id=2 with "_score": 90.00666
Although the document with id=3 has only one matching word (landscape), its boost_multiplier value considerably increased its score. Here too, "factor": 1 lets you decide how much this value should increase the score, and "missing": 0 decides what happens if no such field is indexed.
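The arithmetic behind these numbers can be sketched as follows, assuming the default function_score behaviour (the function value is multiplied with the query score) and field_value_factor with no modifier. Each matching tag contributes its text score times its confidence, the nested query sums those contributions, and the outer function multiplies by boost_multiplier. The function name expected_score and the per-term text scores passed in are hypothetical; real text scores come from Elasticsearch and shift as documents are added.

```python
def expected_score(doc, term_scores):
    """Sum (text score * confidence) over matching tags, then apply boost_multiplier."""
    nested_sum = sum(
        term_scores[tag["tag"]] * tag["confidence"]
        for tag in doc["tags"]
        if tag["tag"] in term_scores
    )
    # "missing": 0 in the query means a document without the field scores 0.
    return nested_sum * doc.get("boost_multiplier", 0)


doc2 = {"tags": [{"tag": "landscape", "confidence": 48.0}], "boost_multiplier": 1}
doc3 = {"tags": [{"tag": "landscape", "confidence": 48.0}], "boost_multiplier": 4}
scores = {"landscape": 2.0}  # made-up text score for illustration
print(expected_score(doc2, scores), expected_score(doc3, scores))
```

This matches the pattern in the results above, where document 3 (360.02664) scores four times document 2 (90.00666).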
