I've prepared an Elasticsearch query in which I'm trying to fetch results from nested objects. The query looks something like this:
{
"from": 0,
"size": 100,
"_source": {
"excludes": [
"#version"
]
},
"query": {
"bool": {
"must": [
{
"term": {
"doc.workflow_id.keyword": "workflow1"
}
},
{
"nested": {
"path": "doc.attributes",
"query": {
"bool": {
"filter": [
{
"match": {
"doc.attributes.name": "color"
}
},
{
"bool": {
"should": [
{
"wildcard": {
"doc.attributes.value.rawold": "*green*"
}
}
]
}
}
]
}
}
}
},
{
"nested": {
"path": "doc.attributes",
"query": {
"bool": {
"filter": [
{
"match": {
"doc.attributes.name": "price"
}
},
{
"bool": {
"should": [
{
"wildcard": {
"doc.attributes.value.rawold": "*34*"
}
}
]
}
}
]
}
}
}
}
],
"must_not": []
}
}
}
Output:
"hits" : [
{
"_index" : "sample_index",
"_type" : "_doc",
"_id" : "mv1",
"_score" : null,
"_source" : {
"doc" : {
"workflow_id" : "workflow1",
"attributes" : [
{
"name" : "price",
"value" : "34"
},
{
"name" : "weight",
"value" : "10"
},
{
"name" : "color",
"value" : "green"
},
{
"name" : "city",
"value" : "#error"
}
]
}
}
},
{
"_index" : "sample_index",
"_type" : "_doc",
"_id" : "mv2",
"_score" : null,
"_source" : {
"doc" : {
"workflow_id" : "workflow1",
"attributes" : [
{
"name" : "price",
"value" : "34"
},
{
"name" : "color",
"value" : "green"
}
]
}
}
}
]
I've omitted a few trivial details in the query and output for simplicity. The attributes array in the response is of type nested and contains name and value fields of type string.
I've put filters on the color and price attributes, but as you can see, I'm getting the other attributes too in the attributes array. Can I somehow pass specific attribute names to the ES query and get the values of only those attributes?
I tried using inner_hits in both nested queries, but it returns the attribute value only for the passed attribute name in the nested query.
E.g.
{
"nested": {
"path": "doc.attributes",
"query": {
"bool": {
"filter": [
{
"match": {
"doc.attributes.name": "color"
}
},
{
"bool": {
"should": [
{
"wildcard": {
"doc.attributes.value.rawold": "*green*"
}
}
]
}
}
]
}
},
"inner_hits": {
"name": "two",
"_source": [
"doc.product_attributes.name",
"doc.product_attributes.value"
]
}
}
}
gives result
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "sample_index",
"_type": "_doc",
"_id": "mv1",
"_score": null,
"_source": {
"doc": {
"workflow_id": "workflow1",
"attributes": [
{
"name": "price",
"value": "34"
},
{
"name": "weight",
"value": "34"
},
{
"name": "color",
"value": "green"
},
{
"name": "city",
"value": "#ERROR"
}
]
}
},
"inner_hits": {
"two": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "sample_index",
"_type": "_doc",
"_id": "mv1",
"_nested": {
"field": "doc.attributes",
"offset": 1
},
"_score": 0.0,
"_source": {
"name": "color",
"value": "green"
}
}
]
}
}
}
},
{
"_index": "sample_index",
"_type": "_doc",
"_id": "mv2",
"_score": null,
"_source": {
"doc": {
"workflow_id": "workflow1",
"attributes": [
{
"name": "price",
"value": "34"
},
{
"name": "color",
"value": "green"
}
]
}
},
"inner_hits": {
"two": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "sample_index",
"_type": "_doc",
"_id": "mv1",
"_nested": {
"field": "doc.attributes",
"offset": 1
},
"_score": 0.0,
"_source": {
"name": "color",
"value": "green"
}
}
]
}
}
}
}
]
}
Note the attribute name and value received inside the inner_hits object.
I want to get the names and values of other attributes in the response as well, not only the ones I'm filtering on. For example, if I want to get the attribute names and values for weight, color & city only, how do I do that?
I've checked the thread "select matching objects from array in elasticsearch", but it doesn't solve my problem.
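For illustration, one pattern that is sometimes suggested for this (a sketch only, not verified against this mapping) is to add an extra nested clause whose only job is to collect the attributes to return. Placed in the should section of the same bool it does not change which documents match, while its inner_hits returns every attribute whose name is in the requested list:
{
  "query": {
    "bool": {
      "must": [
        ... the two filtering nested clauses shown above ...
      ],
      "should": [
        {
          "nested": {
            "path": "doc.attributes",
            "query": {
              "terms": {
                "doc.attributes.name": ["weight", "color", "city"]
              }
            },
            "inner_hits": {
              "name": "selected_attributes",
              "size": 10,
              "_source": ["doc.attributes.name", "doc.attributes.value"]
            }
          }
        }
      ]
    }
  }
}
The inner_hits size is raised above its default of 3 so that all requested attributes come back, and if doc.attributes.name has a keyword sub-field, a terms query on doc.attributes.name.keyword would avoid depending on the analyzer.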
We're newbies with Elasticsearch. We have docs indexed with the following structure:
{
"Id": 1246761,
"ContentTypeName": "Official Statement",
"Title": "Official statement Title",
"Categories": [
{
"Id": 3,
"Type": 1,
"Name": "Category A",
"ParentId": 0
},
{
"Id": 10,
"Type": 3,
"Name": "Category B",
"ParentId": 0
},
{
"Id": 426,
"Type": 7,
"Name": "Category C",
"ParentId": 0
}
]
}
The requirement is to get the aggregated list of categories + document count matching a keyword search.
So far our query looks like this:
GET _search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"my-agg-name": {
"terms": {
"field": "Categories.Id"
}
}
}
}
The result is:
{
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my-agg-name" : {
"doc_count_error_upper_bound" : 23845,
"sum_other_doc_count" : 1068245,
"buckets" : [
{
"key" : 426,
"doc_count" : 112651
},
{
"key" : 10,
"doc_count" : 91146
},
....
]
}
}
}
Is there a way to get back the entire Category object, not only the Id?
Or to serialize the category object into a string as the key?
You need to use a nested aggregation to achieve your required use case.
Adding a working example with index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"Categories": {
"type": "nested"
}
}
}
}
Search Query:
{
"query": {
"match_all": {}
},
"aggs": {
"resellers": {
"nested": {
"path": "Categories"
},
"aggs": {
"my-agg-name": {
"terms": {
"field": "Categories.Id"
},
"aggs": {
"categories-doc": {
"top_hits": {
"_source": {
"includes": [
"Categories.Id",
"Categories.Type",
"Categories.Name",
"Categories.ParentId"
]
},
"size": 1
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"resellers": {
"doc_count": 3,
"my-agg-name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3, // note this
"doc_count": 1,
"categories-doc": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "65847850",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "Categories",
"offset": 0
},
"_score": 1.0,
"_source": {
"ParentId": 0,
"Type": 1,
"Id": 3, // note this
"Name": "Category A"
}
}
]
}
}
},
{
"key": 10,
"doc_count": 1,
"categories-doc": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "65847850",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "Categories",
"offset": 1
},
"_score": 1.0,
"_source": {
"ParentId": 0,
"Type": 3,
"Id": 10,
"Name": "Category B"
}
}
]
}
}
},
{
"key": 426,
"doc_count": 1,
"categories-doc": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "65847850",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "Categories",
"offset": 2
},
"_score": 1.0,
"_source": {
"ParentId": 0,
"Type": 7,
"Id": 426,
"Name": "Category C"
}
}
]
}
}
}
]
}
}
}
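If you would rather have the whole category serialized into the bucket key itself (the second option in the question), a script-keyed terms aggregation inside the nested aggregation is one possibility. This is only a sketch; it assumes Categories.Name has the default .keyword sub-field so its doc values are available to the script:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "categories": {
      "nested": {
        "path": "Categories"
      },
      "aggs": {
        "by_category": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "doc['Categories.Id'].value + '|' + doc['Categories.Type'].value + '|' + doc['Categories.Name.keyword'].value"
            }
          }
        }
      }
    }
  }
}
Script-generated keys are computed for every nested document at query time, so the top_hits approach above is usually the cheaper option.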
I have a use case where I need to get all unique user IDs from Elasticsearch, sorted by timestamp.
What I'm currently using is a composite terms aggregation with a sub-aggregation that returns the latest timestamp.
(I can't sort on the client side, as that slows down the script.)
Sample data in Elasticsearch:
{
"_index": "logstash-2020.10.29",
"_type": "doc",
"_id": "L0Urc3UBttS_uoEtubDk",
"_version": 1,
"_score": null,
"_source": {
"#version": "1",
"#timestamp": "2020-10-29T06:56:00.000Z",
"timestamp_string": "1603954560",
"search_query": "example 3",
"user_uuid": "asdfrghcwehf",
"browsing_url": "https://www.google.com/search?q=example+3",
},
"fields": {
"#timestamp": [
"2020-10-29T06:56:00.000Z"
]
},
"sort": [
1603954560000
]
}
Expected Output:
[
{
"key" : "bjvexyducsls",
"doc_count" : 846,
"1" : {
"value" : 1.603948557E12,
"value_as_string" : "2020-10-29T05:15:57.000Z"
}
},
{
"key" : "lhmsbq2osski",
"doc_count" : 420,
"1" : {
"value" : 1.6039476E12,
"value_as_string" : "2020-10-29T05:00:00.000Z"
}
},
{
"key" : "m2wiaufcbvvi",
"doc_count" : 1,
"1" : {
"value" : 1.603893635E12,
"value_as_string" : "2020-10-28T14:00:35.000Z"
}
},
{
"key" : "rrm3vd5ovqwg",
"doc_count" : 1,
"1" : {
"value" : 1.60389362E12,
"value_as_string" : "2020-10-28T14:00:20.000Z"
}
},
{
"key" : "x42lk4t3frfc",
"doc_count" : 72,
"1" : {
"value" : 1.60389318E12,
"value_as_string" : "2020-10-28T13:53:00.000Z"
}
}
]
Adding a working example with index data, mapping, search query, and search result.
Index Mapping:
{
"mappings":{
"properties":{
"user":{
"type":"keyword"
},
"date":{
"type":"date"
}
}
}
}
Index Data:
{
"date": "2015-01-01",
"user": "user1"
}
{
"date": "2014-01-01",
"user": "user2"
}
{
"date": "2015-01-11",
"user": "user3"
}
Search Query:
{
"size": 0,
"aggs": {
"user_id": {
"terms": {
"field": "user",
"order": {
"sort_user": "asc"
}
},
"aggs": {
"sort_user": {
"min": {
"field": "date"
}
}
}
}
}
}
Search Result:
"aggregations": {
"user_id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "user2",
"doc_count": 1,
"sort_user": {
"value": 1.3885344E12,
"value_as_string": "2014-01-01T00:00:00.000Z"
}
},
{
"key": "user1",
"doc_count": 1,
"sort_user": {
"value": 1.4200704E12,
"value_as_string": "2015-01-01T00:00:00.000Z"
}
},
{
"key": "user3",
"doc_count": 1,
"sort_user": {
"value": 1.4209344E12,
"value_as_string": "2015-01-11T00:00:00.000Z"
}
}
]
}
}
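Adapted to the sample document from the question (a sketch, assuming user_uuid has a keyword mapping or a user_uuid.keyword sub-field, and keeping the #timestamp field name shown above), the same idea with the latest timestamp per user and descending order would look roughly like this:
{
  "size": 0,
  "aggs": {
    "user_id": {
      "terms": {
        "field": "user_uuid.keyword",
        "size": 10000,
        "order": {
          "latest": "desc"
        }
      },
      "aggs": {
        "latest": {
          "max": {
            "field": "#timestamp"
          }
        }
      }
    }
  }
}
Note that a composite aggregation cannot be ordered by a sub-aggregation, which is why a plain terms aggregation with a large size is used instead.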
My mapping is the following:
PUT places
{
"mappings": {
"test": {
"properties": {
"id_product": { "type": "keyword" },
"id_product_unique": { "type": "integer" },
"location": { "type": "geo_point" },
"suggest": {
"type": "text"
},
"active": {"type": "boolean"}
}
}
}
}
POST places/test
{
"id_product" : "A",
"id_product_unique": 1,
"location": {
"lat": 1.378446,
"lon": 103.763427
},
"suggest": ["coke","zero"],
"active": true
}
POST places/test
{
"id_product" : "A",
"id_product_unique": 2,
"location": {
"lat": 1.878446,
"lon": 108.763427
},
"suggest": ["coke","zero"],
"active": true
}
POST places/test
{
"id_product" : "B",
"id_product_unique": 3,
"location": {
"lat": 1.478446,
"lon": 104.763427
},
"suggest": ["coke"],
"active": true
}
POST places/test
{
"id_product" : "C",
"id_product_unique": 4,
"location": {
"lat": 1.218446,
"lon": 102.763427
},
"suggest": ["coke","light"],
"active": true
}
In my example there are two cans of coke zero ("id_product_unique" = 1 and 2), one can of coke ("id_product_unique" = 3) and one can of coke light ("id_product_unique" = 4).
All these cans are in different locations.
An "id_product" is not unique, as the exact same "can of coke" can be sold in two different locations (e.g. "id_product_unique" = 1 and 2).
Only "id_product_unique" and "location" change from one "can of coke" to another (two identical "cans of coke" have the same "suggest" and "id_product" fields but not the same "id_product_unique" and "location").
My goal is to search for a product from a given GPS location and display a unique result per id_product (the closest one):
POST /places/_search?size=0
{
"aggs" : {
"group-by-type" : {
"terms" : { "field" : "id_product"},
"aggs": {
"min-distance": {
"top_hits": {
"sort": {
"_script": {
"type": "number",
"script": {
"source": "def x = doc['location'].lat; def y = doc['location'].lon; return Math.abs(x-1.178446) + Math.abs(y-101.763427)",
"lang": "painless"
},
"order": "asc"
}
},
"size" : 1
}
}
}
}
}
}
From this list of results I'd now like to apply a should query and re-order the results by their computed score. I tried the following:
POST /places/_search?size=0
{
"query" : {
"bool": {
"filter": {"term" : { "active" : "true" }},
"should": [
{"match" : { "suggest" : "coke" }},
{"match" : { "suggest" : "light" }}
]
}
},
"aggs" : {
"group-by-type" : {
"terms" : { "field" : "id_product"},
"aggs": {
"min-distance": {
"top_hits": {
"sort": {
"_script": {
"type": "number",
"script": {
"source": "def x = doc['location'].lat; def y = doc['location'].lon; return Math.abs(x-1.178446) + Math.abs(y-101.763427)",
"lang": "painless"
},
"order": "asc"
}
},
"size" : 1
}
}
}
}
}
}
But I cannot figure out how to replace the distance sort score with the document score.
Any help would be great.
I managed to do it by adding a new aggregation "max_score":
"max_score": {
"max": {
"script": {
"lang": "painless",
"source": "_score"
}
}
}
and by ordering by max_score.value desc:
"order": {"max_score.value": "desc"}
My final query is the following:
POST /places/_search?size=0
{
"query" : {
"bool": {
"filter": {"term" : { "active" : "true" }},
"should": [
{"match" : { "suggest" : "coke" }},
{"match" : { "suggest" : "light" }}
]
}
},
"aggs" : {
"group-by-type" : {
"terms" : {
"field" : "id_product",
"order": {"max_score.value": "desc"}
},
"aggs": {
"min-distance": {
"top_hits": {
"sort": {
"_script": {
"type": "number",
"script": {
"source": "def x = doc['location'].lat; def y = doc['location'].lon; return Math.abs(x-1.178446) + Math.abs(y-101.763427)",
"lang": "painless"
},
"order": "asc"
}
},
"size" : 1
}
},
"max_score": {
"max": {
"script": {
"lang": "painless",
"inline": "_score"
}
}
}
}
}
}
}
Answer:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"group-by-type": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "C",
"doc_count": 1,
"max_score": {
"value": 1.0300811529159546
},
"min-distance": {
"hits": {
"total": 1,
"max_score": null,
"hits": [
{
"_index": "places",
"_type": "test",
"_id": "VhJdOmIBKhzTB9xcDvfk",
"_score": null,
"_source": {
"id_product": "C",
"id_product_unique": 4,
"location": {
"lat": 1.218446,
"lon": 102.763427
},
"suggest": [
"coke",
"light"
],
"active": true
},
"sort": [
1.0399999646503995
]
}
]
}
}
},
{
"key": "A",
"doc_count": 2,
"max_score": {
"value": 0.28768208622932434
},
"min-distance": {
"hits": {
"total": 2,
"max_score": null,
"hits": [
{
"_index": "places",
"_type": "test",
"_id": "UhJcOmIBKhzTB9xc6ve-",
"_score": null,
"_source": {
"id_product": "A",
"id_product_unique": 1,
"location": {
"lat": 1.378446,
"lon": 103.763427
},
"suggest": [
"coke",
"zero"
],
"active": true
},
"sort": [
2.1999999592114756
]
}
]
}
}
},
{
"key": "B",
"doc_count": 1,
"max_score": {
"value": 0.1596570909023285
},
"min-distance": {
"hits": {
"total": 1,
"max_score": null,
"hits": [
{
"_index": "places",
"_type": "test",
"_id": "VRJcOmIBKhzTB9xc_vc0",
"_score": null,
"_source": {
"id_product": "B",
"id_product_unique": 3,
"location": {
"lat": 1.478446,
"lon": 104.763427
},
"suggest": [
"coke"
],
"active": true
},
"sort": [
3.2999999020282695
]
}
]
}
}
}
]
}
}
}
From what I gather, your use case is one where you want to factor the value of a particular field in your document into the calculation of the relevance score.
This is typical in scenarios where you want to boost the relevance of a document based on the value of a field, like a price or, here, a query for a particular product.
If you are searching for product A, that is more important in this scenario than the distance of the products themselves. So if B is 2 miles away from the origin and A is 5 miles away, A is still the closest instance of the product you are searching for.
What you need is a function_score query using a decay function based on the distance. I think you want the gauss type to reflect the rate of decay, which operates like a bell curve.
Here is an example using a decay function of the exp (exponential) type. It does the same thing with a different field type (date) than yours, but the idea is the same; a sketch adapted to your geo_point field follows the example.
Suppose that instead of wanting to boost incrementally by the value of a field, you have an ideal value you want to target and you want the boost factor to decay the further away you move from the value. This is typically useful in boosts based on lat/long, numeric fields like price, or dates. In our contrived example, we are searching for books on “search engines” ideally published around June 2014.
POST /bookdb_index/book/_search
{
"query": {
"function_score": {
"query": {
"multi_match" : {
"query" : "search engine",
"fields": ["title", "summary"]
}
},
"functions": [
{
"exp": {
"publish_date" : {
"origin": "2014-06-15",
"offset": "7d",
"scale" : "30d"
}
}
}
],
"boost_mode" : "replace"
}
},
"_source": ["title", "summary", "publish_date", "num_reviews"]
}
Here are some useful references for this:
Elasticsearch 6.2 Function Score documentation
Elasticsearch Example Queries
The Closer the Better
This is an Elasticsearch 2.x decay function example and, even though it's a different version, I think it is very similar to your use case.
I need your help to understand the behaviour of Elasticsearch script-based sorting.
First of all, let me paste the mappings of my Elasticsearch types:
{
"nestedDateType" : {
"properties" : {
"message" : {
"properties" : {
"date" : {
"type" : "date",
"format" : "dateOptionalTime"
}
}
}
}
},
"nonNestedDateType" : {
"properties" : {
"date" : {
"type" : "date",
"format" : "dateOptionalTime"
}
}
}
}
Now what I want to do is query these two types and sort based on the date.
The problem is that in nestedDateType the date path is "message.date", whereas in nonNestedDateType the date path is "date".
I understand that I have to use a script-based sort to do this. However, the script that I wrote did not work as expected. This is the query that I tried:
POST http://localhost:9200/index/nonNestedDateType,nestedDateType/_search?size=5000
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"or": [
{
"range": {
"date": {
"gte": "2015-04-01"
}
}
},
{
"range": {
"message.date": {
"gte": "2015-04-01"
}
}
}
]
}
]
}
}
}
},
"sort": {
"_script": {
"script": "doc.containsKey('message') ? doc.message.date.value : doc.date.value",
"type": "number",
"order": "desc"
}
}
}
and this is the result that I got:
{
"took": 60,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 15,
"max_score": null,
"hits": [
{
"_index": "***",
"_type": "nonNestedDateType",
"_id": "***",
"_score": null,
"_source": {
"docId": "***",
"date": 1461634484557
},
"sort": [
1461634484557
]
},
{
"_index": "***",
"_type": "nonNestedDateType",
"_id": "***",
"_score": null,
"_source": {
"docId": "***",
"date": 1461634483528
},
"sort": [
1461634483528
]
},
{
"_index": "***",
"_type": "nestedDateType",
"_id": "***",
"_score": null,
"_source": {
"docId": "***",
"message": {
"date": 1461548078310
}
},
"sort": [
0
]
}
]
}
}
As you can see from the last result, of type nestedDateType, I was expecting sort = 1461548078310 instead of 0. Could anyone explain what I was doing wrong?
Note that some fields have been removed for confidentiality.
I finally made it work by changing
"script": "doc.containsKey('message') ? doc.message.date.value : doc.date.value"
into
"script": "doc.date.value == 0 ? doc['message.date'].value : doc.date.value"
I'm still curious, though, why doc.containsKey('message') never returns true.
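A likely explanation (based on how doc values are keyed, though not verified against this exact version): in sort scripts, doc is keyed by the full flattened field path, so the key is 'message.date' and never 'message' on its own, which is why doc.containsKey('message') is always false. A sketch of the original idea using the full path, with an emptiness check since both fields are mapped on the shared index, would be:
"sort": {
  "_script": {
    "type": "number",
    "order": "desc",
    "script": "doc.containsKey('message.date') && !doc['message.date'].empty ? doc['message.date'].value : doc['date'].value"
  }
}
The containsKey check can still return true whenever the field exists in the index mapping, so the emptiness check is what actually distinguishes the two document types.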