Select TOP + GROUP BY + SORT in Elasticsearch?

Assume the following stockInWarehouse schema:
{
  "product_db": {
    "mappings": {
      "stockInWarehouse": {
        "properties": {
          "sku": {
            "type": "string"
          },
          "arrivalTime": {
            "type": "date",
            "format": "dateOptionalTime"
          }
        }
      }
    }
  }
}
The data in stockInWarehouse looks like:
{
"hits": {
"total": 5,
"hits": [
{
"_index": "product_db",
"_type": "stockInWarehouse",
"_id": "1",
"_source": {
"sku": "item 1",
"arrivalTime": "2015-11-11T19:00:10.231Z"
}
},
{
"_index": "product_db",
"_type": "stockInWarehouse",
"_id": "2",
"_source": {
"sku": "item 2",
"arrivalTime": "2015-11-12T19:00:10.231Z"
}
},
{
"_index": "product_db",
"_type": "stockInWarehouse",
"_id": "3",
"_source": {
"sku": "item 1",
"arrivalTime": "2015-11-12T19:35:10.231Z"
}
},
{
"_index": "product_db",
"_type": "stockInWarehouse",
"_id": "4",
"_source": {
"sku": "item 1",
"arrivalTime": "2015-11-13T19:56:10.231Z"
}
},
{
"_index": "product_db",
"_type": "stockInWarehouse",
"_id": "5",
"_source": {
"sku": "item 3",
"arrivalTime": "2015-11-15T19:56:10.231Z"
}
}
]
}
}
What I am trying to do is fetch the TOP documents by arrivalTime (i.e. the most recent documents), but grouped by another field (sku) and limited to one document per available sku. The expected result would look like this:
{
"hits": {
"total": 3,
"hits": [
{
"_index": "product_db",
"_type": "stockInWarehouse",
"_id": "5",
"_source": {
"sku": "item 3",
"arrivalTime": "2015-11-15T19:56:10.231Z"
}
},
{
"_index": "product_db",
"_type": "stockInWarehouse",
"_id": "4",
"_source": {
"sku": "item 1",
"arrivalTime": "2015-11-13T19:56:10.231Z"
}
},
{
"_index": "product_db",
"_type": "stockInWarehouse",
"_id": "2",
"_source": {
"sku": "item 2",
"arrivalTime": "2015-11-12T19:00:10.231Z"
}
}
]
}
}
If I sort by arrivalTime, the resulting sku list will contain item 3, item 1, item 1, item 2, item 1 (duplicates). If I sort by sku, the result list will not reflect the correct arrivalTime order.
Is this type of query possible in Elasticsearch? How can I achieve this?

How about this one?
{
  "size": 0,
  "aggs": {
    "terms_agg": {
      "terms": {
        "field": "sku",
        "size": 100,
        "order": {
          "max_date_agg": "desc"
        }
      },
      "aggs": {
        "max_date_agg": {
          "max": {
            "field": "arrivalTime"
          }
        }
      }
    }
  }
}
I have set size: 100 in the terms aggregation, assuming you have a lot of products.
Note: you need to add index: not_analyzed to your mapping of sku, otherwise the terms aggregation will bucket the analyzed tokens ("item", "1", ...) instead of the whole sku value.
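A minimal sketch of such a mapping (using the Elasticsearch 1.x/2.x string syntax from your schema) would be:
{
  "mappings": {
    "stockInWarehouse": {
      "properties": {
        "sku": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}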
This is the result of the query:
"aggregations": {
"terms_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "item 3",
"doc_count": 1,
"max_date_agg": {
"value": 1447617370231,
"value_as_string": "2015-11-15T19:56:10.231Z"
}
},
{
"key": "item 1",
"doc_count": 3,
"max_date_agg": {
"value": 1447444570231,
"value_as_string": "2015-11-13T19:56:10.231Z"
}
},
{
"key": "item 2",
"doc_count": 1,
"max_date_agg": {
"value": 1447354810231,
"value_as_string": "2015-11-12T19:00:10.231Z"
}
}
]
}
}
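If you also want the most recent document itself for each sku (not just its max arrivalTime), one option is to add a top_hits sub-aggregation on top of the same query; a sketch (the latest_doc name is only illustrative):
{
  "size": 0,
  "aggs": {
    "terms_agg": {
      "terms": {
        "field": "sku",
        "size": 100,
        "order": {
          "max_date_agg": "desc"
        }
      },
      "aggs": {
        "max_date_agg": {
          "max": {
            "field": "arrivalTime"
          }
        },
        "latest_doc": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "arrivalTime": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
Each bucket then carries a latest_doc section with the full _source of the newest document for that sku.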
I hope it helps!!

Related

Sorting aggregated data in Elasticsearch

I am running a search that aggregates by the xyz field and gets the latest version. Now I need to sort the aggregated data based on the created field. Let me know how we can do that.
{
  "query": {
    "query_string": {
      "query": ""
    }
  },
  "aggs": {
    "uuid": {
      "terms": {
        "field": "xyz.keyword"
      },
      "aggs": {
        "top_trades_hits": {
          "top_hits": {
            "sort": [
              {
                "version": {
                  "order": "desc"
                }
              }
            ],
            "size": 1
          }
        }
      }
    }
  }
}
The above query returns:
{
"aggregations": {
"uuid": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "794a5b8f-3e22-4ff9-98bb-b8b54c85948e",
"doc_count": 3,
"agg": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "index",
"_type": "doc",
"_id": "7",
"_score": null,
"_source": {
"uuid": "794a5b8f-3e22-4ff9-98bb-b8b54c85948e",
"type": "qsdn",
"discontinued": false,
"minSupportedPlatformVersion": "11.5.3.3",
"version": 2,
"created": 1658428291346
},
"sort": [
2
]
}
]
}
}
},
{
"key": "03504029-a029-417d-bd67-fb1b5fc5055b",
"doc_count": 2,
"agg": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "index",
"_type": "doc",
"_id": "9",
"_score": null,
"_source": {
"uuid": "03504029-a029-417d-bd67-fb1b5fc5055b",
"type": "gdsg",
"discontinued": false,
"version": 1.1,
"created": 1554904300799
},
"sort": [
1.1
]
}
]
}
}
}
]
}
}
}
A document in the Elasticsearch index looks like this:
{
"_index": "index",
"_type": "doc",
"_id": "3",
"_version": 2,
"_seq_no": 1,
"_primary_term": 1,
"found": true,
"_source": {
"doc": {
"uuid": "abcd",
"type": "strifn",
"name": "default",
"version": 3.12,
"s3ObjectVersionId": "",
"created": 165842829134
}
}
}
Expected result
{
"aggregations": {
"uuid": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "03504029-a029-417d-bd67-fb1b5fc5055b",
"doc_count": 2,
"agg": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "index",
"_type": "doc",
"_id": "9",
"_score": null,
"_source": {
"uuid": "03504029-a029-417d-bd67-fb1b5fc5055b",
"type": "gdsg",
"discontinued": false,
"version": 1.1,
"created": 1554904300799
},
"sort": [
1.1
]
}
]
}
}
},
{
"key": "794a5b8f-3e22-4ff9-98bb-b8b54c85948e",
"doc_count": 3,
"agg": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "index",
"_type": "doc",
"_id": "7",
"_score": null,
"_source": {
"uuid": "794a5b8f-3e22-4ff9-98bb-b8b54c85948e",
"type": "qsdn",
"discontinued": false,
"minSupportedPlatformVersion": "11.5.3.3",
"version": 2,
"created": 1658428291346
},
"sort": [
2
]
}
]
}
}
}
]
}
}
}
I am using AWS OpenSearch for this.
Your query is correct; you just need to increase the size from 1 to see all the documents in each bucket, sorted according to the version field in your Elasticsearch index.
If the above doesn't help, can you share more info, like sample documents and the index mapping?
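If the goal is instead to order the uuid buckets themselves by the created field (which is what the expected result above suggests), one common approach, sketched here under the assumption that created is a numeric or date field as in the sample document, is to add a max sub-aggregation on created and order the terms aggregation by it (the latest_created name is only illustrative):
{
  "query": {
    "query_string": {
      "query": ""
    }
  },
  "aggs": {
    "uuid": {
      "terms": {
        "field": "xyz.keyword",
        "order": {
          "latest_created": "asc"
        }
      },
      "aggs": {
        "latest_created": {
          "max": {
            "field": "created"
          }
        },
        "top_trades_hits": {
          "top_hits": {
            "sort": [
              {
                "version": {
                  "order": "desc"
                }
              }
            ],
            "size": 1
          }
        }
      }
    }
  }
}
"asc" puts the bucket with the oldest created value first, matching the expected result; use "desc" for the opposite order.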

How to make a flattened sub-field in a nested field in Elasticsearch?

Here, I have an indexed document like:
doc = {
  "id": 1,
  "content": [
    {
      "txt": "I",
      "time": 0
    },
    {
      "txt": "have",
      "time": 1
    },
    {
      "txt": "a book",
      "time": 2
    },
    {
      "txt": "do not match this block",
      "time": 3
    }
  ]
}
And I want to match "I have a book" and return the matched times: 0, 1, 2. Does anyone know how to build the index and the query for this situation?
I think "content.txt" should be flattened but "content.time" should be nested?
want to match "I have a book", and return the matched time: 0,1,2.
Adding a working example with index mapping, search query, and search result.
Index Mapping:
{
  "mappings": {
    "properties": {
      "content": {
        "type": "nested"
      }
    }
  }
}
Search Query:
{
  "query": {
    "nested": {
      "path": "content",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "content.txt": "I have a book"
              }
            }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}
Search Result:
"inner_hits": {
"content": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 2.5226097,
"hits": [
{
"_index": "64752029",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "content",
"offset": 2
},
"_score": 2.5226097,
"_source": {
"txt": "a book",
"time": 2
}
},
{
"_index": "64752029",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "content",
"offset": 0
},
"_score": 1.5580825,
"_source": {
"txt": "I",
"time": 0
}
},
{
"_index": "64752029",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "content",
"offset": 1
},
"_score": 1.5580825,
"_source": {
"txt": "have",
"time": 1
}
}
]
}
}
}
}
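If only the time values are needed in the response, the inner_hits block should also accept a _source filter (a sketch under that assumption; the rest of the query stays the same):
{
  "query": {
    "nested": {
      "path": "content",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "content.txt": "I have a book"
              }
            }
          ]
        }
      },
      "inner_hits": {
        "_source": ["content.time"]
      }
    }
  }
}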

ElasticSearch 2.4 - inner_hits nested merge queries result

I'm using ElasticSearch 2.4
I need to get all Purchases that match all queries.
I'm currently using the inner_hits feature, but it doesn't work as expected, because each inner_hits block only shows the matches of its own nested query; the problem is how it combines with the query on the main document.
I have this mapping, and below I created an example with my comments:
PUT /example_contact_purchases
{
  "mappings": {
    "contact": {
      "dynamic": false,
      "properties": {
        "name": {
          "type": "string"
        },
        "country": {
          "type": "string"
        },
        "purchases": {
          "type": "nested",
          "properties": {
            "uuid": {
              "type": "string"
            },
            "brand": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
POST example_contact_purchases/contact
{
  "name": "Fran",
  "country": "ES",
  "purchases": [
    {
      "uuid": "23",
      "brand": "Sony"
    },
    {
      "uuid": "23",
      "brand": "Sony"
    }
  ]
}

POST example_contact_purchases/contact
{
  "name": "Jhon",
  "country": "UK",
  "purchases": [
    {
      "uuid": "45",
      "brand": "Lenovo"
    },
    {
      "uuid": "23",
      "brand": "Sony"
    },
    {
      "uuid": "77",
      "brand": "HP"
    }
  ]
}

POST example_contact_purchases/contact
{
  "name": "Lucas",
  "country": "ES",
  "purchases": [
    {
      "uuid": "45",
      "brand": "Lenovo"
    },
    {
      "uuid": "23",
      "brand": "Sony"
    },
    {
      "uuid": "77",
      "brand": "HP"
    }
  ]
}
GET example_contact_purchases/contact/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "query_string": {
                  "query": "country:ES"
                }
              },
              {
                "nested": {
                  "path": "purchases",
                  "inner_hits": {
                    "name": "0"
                  },
                  "filter": {
                    "query": {
                      "query_string": {
                        "query": "(purchases.brand:Sony)"
                      }
                    }
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "query_string": {
                  "query": "country:UK"
                }
              },
              {
                "nested": {
                  "path": "purchases",
                  "inner_hits": {
                    "name": "1"
                  },
                  "filter": {
                    "query": {
                      "query_string": {
                        "query": "(purchases.uuid:45)"
                      }
                    }
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
I am using a simple query like this:
"(country.raw:ES AND purchases.brand:Sony) OR (country:UK AND purchases.uuid:45)"
And the result of the search query is:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.5949223,
"hits": [
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJJdZXthyTIlmcERM",
"_score": 0.5949223,
"_source": {
"name": "Jhon",
"country": "UK",
"purchases": [
{
"uuid": "45",
"brand": "Lenovo"
},
{
"uuid": "23",
"brand": "Sony"
},
{
"uuid": "77",
"brand": "HP"
}
]
},
"inner_hits": {
"0": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJJdZXthyTIlmcERM",
"_nested": {
"field": "purchases",
"offset": 1
},
"_score": 1,
"_source": {
"uuid": "23",
"brand": "Sony"
}
}
]
}
},
"1": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJJdZXthyTIlmcERM",
"_nested": {
"field": "purchases",
"offset": 0
},
"_score": 1,
"_source": {
"uuid": "45",
"brand": "Lenovo"
}
}
]
}
}
}
},
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJKBHXthyTIlmcERN",
"_score": 0.5949223,
"_source": {
"name": "Lucas",
"country": "ES",
"purchases": [
{
"uuid": "45",
"brand": "Lenovo"
},
{
"uuid": "23",
"brand": "Sony"
},
{
"uuid": "77",
"brand": "HP"
}
]
},
"inner_hits": {
"0": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJKBHXthyTIlmcERN",
"_nested": {
"field": "purchases",
"offset": 1
},
"_score": 1,
"_source": {
"uuid": "23",
"brand": "Sony"
}
}
]
}
},
"1": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJKBHXthyTIlmcERN",
"_nested": {
"field": "purchases",
"offset": 0
},
"_score": 1,
"_source": {
"uuid": "45",
"brand": "Lenovo"
}
}
]
}
}
}
},
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJI1SXthyTIlmcERL",
"_score": 0.5139209,
"_source": {
"name": "Fran",
"country": "ES",
"purchases": [
{
"uuid": "23",
"brand": "Sony"
},
{
"uuid": "23",
"brand": "Sony"
}
]
},
"inner_hits": {
"0": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJI1SXthyTIlmcERL",
"_nested": {
"field": "purchases",
"offset": 1
},
"_score": 1,
"_source": {
"uuid": "23",
"brand": "Sony"
}
},
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJI1SXthyTIlmcERL",
"_nested": {
"field": "purchases",
"offset": 0
},
"_score": 1,
"_source": {
"uuid": "23",
"brand": "Sony"
}
}
]
}
},
"1": {
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
}
}
]
}
}
Unfortunately the first result is wrong:
"inner_hits": {
"0": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJJdZXthyTIlmcERM",
"_nested": {
"field": "purchases",
"offset": 1
},
"_score": 1,
"_source": {
"uuid": "23",
"brand": "Sony"
}
}
]
}
},
"1": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example_contact_purchases",
"_type": "contact",
"_id": "AXFfJJdZXthyTIlmcERM",
"_nested": {
"field": "purchases",
"offset": 0
},
"_score": 1,
"_source": {
"uuid": "45",
"brand": "Lenovo"
}
}
]
}
}
}
For Jhon (UK), it should only show the purchase with parameters:
{"uuid": "45", "brand": "Lenovo"} (the inner_hits named "1")
Thanks

Elasticsearch query to process log data

I have an event log of an e-commerce website in Elasticsearch.
Each event is a record in ES:
{
"_index": "event_log",
"_type": "log_type",
"_id": "3ud-kmoBazYRVz7KCgIy",
"_score": 1,
"_source": {
"user_id": 123,
"event": "click",
"category": "abc",
"product_id": 1112
}
},
{
"_index": "event_log",
"_type": "log_type",
"_id": "4Od-kmoBazYRVz7KCgLr",
"_score": 1,
"_source": {
"user_id": 123,
"event": "click",
"category": "abc",
"product_id": 1118
}
},
{
"_index": "event_log",
"_type": "log_type",
"_id": "4ud-kmoBazYRVz7KkwL2",
"_score": 1,
"_source": {
"user_id": 123,
"event": "cart",
"category": "xyz",
"product_id": 1
}
},
{
"_index": "event_log",
"_type": "log_type",
"_id": "2ud-kmoBazYRVz7KCALB",
"_score": 1,
"_source": {
"user_id": 123,
"event": "cart",
"category": "xyz",
"product_id": 11
}
},
I want a list of all the product_ids, grouped by event, category, and user.
Expected output:
{"click": {
"abc": {
"123": {
"product_id": [1112, 1118]
}
}
},
"cart": {
"xyz": {
"123": {
"product_id": [1, 11]
}
}
}
}
I will have millions of records in the index, and querying all the records and processing them is time-consuming. Is there a way to produce the output in a single query? I'm sure it is not possible to generate exactly the given format; something close to it would be very useful.
Hi, here is my suggestion (first try):
GET event_log/_search
{
  "size": 0,
  "aggs": {
    "event": {
      "terms": {
        "field": "event"
      },
      "aggs": {
        "category": {
          "terms": {
            "field": "category"
          },
          "aggs": {
            "product_id": {
              "terms": {
                "field": "product_id"
              }
            }
          }
        }
      }
    }
  }
}
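The expected output above also groups by user; if that per-user level is needed, a user_id terms aggregation could be nested between category and product_id (a sketch along the same lines):
GET event_log/_search
{
  "size": 0,
  "aggs": {
    "event": {
      "terms": {
        "field": "event"
      },
      "aggs": {
        "category": {
          "terms": {
            "field": "category"
          },
          "aggs": {
            "user_id": {
              "terms": {
                "field": "user_id"
              },
              "aggs": {
                "product_id": {
                  "terms": {
                    "field": "product_id"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}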

How to correctly aggregate when a field is a list in Elasticsearch

Currently the ES logs are indexed in a way that some fields have a list instead of a single value.
For example:
_source: {
  "field1": ["item1", "item2", "item3"],
  "field2": "something",
  "field3": "something_else"
}
Of course, the length of the list is not always the same. I'm trying to find a way to count, per item, the number of logs that contain that item (so some logs will be counted multiple times).
I know I have to use aggs, but how can I form the right query (the request body after -d)?
You can use the below query, which uses a terms aggregation and top_hits.
{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "script": "_source.field1.each{}"
      },
      "aggs": {
        "top_hits_log": {
          "top_hits": {}
        }
      }
    }
  }
}
Output will be:
"buckets": [
{
"key": "item1",
"doc_count": 3,
"top_hits_log": {
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "so",
"_type": "test",
"_id": "1",
"_score": 1,
"_source": {
"field1": [
"item1",
"item2",
"item3"
],
"field2": "something1"
}
},
{
"_index": "so",
"_type": "test",
"_id": "2",
"_score": 1,
"_source": {
"field1": [
"item1"
],
"field2": "something2"
}
},
{
"_index": "so",
"_type": "test",
"_id": "3",
"_score": 1,
"_source": {
"field1": [
"item1",
"item2"
],
"field2": "something3"
}
}
]
}
}
},
{
"key": "item2",
"doc_count": 2,
"top_hits_log": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "so",
"_type": "test",
"_id": "1",
"_score": 1,
"_source": {
"field1": [
"item1",
"item2",
"item3"
],
"field2": "something1"
}
},
{
"_index": "so",
"_type": "test",
"_id": "3",
"_score": 1,
"_source": {
"field1": [
"item1",
"item2"
],
"field2": "something3"
}
}
]
}
}
},
{
"key": "item3",
"doc_count": 1,
"top_hits_log": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "so",
"_type": "test",
"_id": "1",
"_score": 1,
"_source": {
"field1": [
"item1",
"item2",
"item3"
],
"field2": "something1"
}
}
]
}
}
}
]
Make sure to enable dynamic scripting by setting script.disable_dynamic: false.
Hope this helps.
There is no need to use scripting; it will be slow, especially the _source parsing. You also need to make sure your field1 is not_analyzed, or you will get weird results, as the terms aggregation is performed on the unique tokens in the inverted index.
{
  "size": 0,
  "aggs": {
    "unique_items": {
      "terms": {
        "field": "field1",
        "size": 100
      },
      "aggs": {
        "documents": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}
Here the size is 100 inside the terms aggregation; change this according to how many unique values you think you have (the default is 10).
Hope this helps!
