Stats Aggregation with Min Mode in ElasticSearch - elasticsearch

I have the following mapping in Elasticsearch:
{
"properties":{
"Costs":{
"type":"nested",
"properties":{
"price":{
"type":"integer"
}
}
}
}
}
Every document has an array field Costs containing several elements, each with a price. I want the min and max price, but only the element with the minimum price in each array should be considered. In other words, I want the min/max among the per-array minimum values.
Let's say I have 2 documents with the Costs field as
Costs: [
{
"price": 100
},
{
"price": 200
}
]
and
Costs: [
{
"price": 300
},
{
"price": 400
}
]
So I need to find the stats
This is the query I am currently using
{
"costs_stats":{
"nested":{
"path":"Costs"
},
"aggs":{
"price_stats_new":{
"stats":{
"field":"Costs.price"
}
}
}
}
}
And it gives me this:
"min" : 100,
"max" : 400
But I need the stats computed only over the minimum element of each array.
So this is what I need:
"min" : 100,
"max" : 300
Sort has a "mode" option; is there something similar for the stats aggregation, or another way to achieve this, perhaps with a script? Please suggest; I am really stuck here.
Let me know if any more information is required.
Update 1:
Query for finding min/max among minimums
{
"_source":false,
"timeout":"5s",
"from":0,
"size":0,
"aggs":{
"price_1":{
"terms":{
"field":"id"
},
"aggs":{
"price_2":{
"nested":{
"path":"Costs"
},
"aggs":{
"filtered":{
"aggs":{
"price_3":{
"min":{
"field":"Costs.price"
}
}
},
"filter":{
"bool":{
"filter":{
"range":{
"Costs.price":{
"gte":100
}
}
}
}
}
}
}
}
}
},
"minValue":{
"min_bucket":{
"buckets_path":"price_1>price_2>filtered>price_3"
}
}
}
}
Only a few buckets are returned, so the min/max is computed over just those buckets, which is not correct. Is there a size limit?

One way to achieve your use case is to add a field id to each document that uniquely identifies it. A terms aggregation on the id field then builds buckets dynamically, one per unique value.
Inside each bucket, a min aggregation returns the minimum of the numeric values extracted from the aggregated documents. Note that a terms aggregation returns only the top 10 buckets by default, so set size to at least the number of unique id values you expect.
Here is a working example with index mapping, index data, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"Costs": {
"type": "nested"
}
}
}
}
Index Data:
{
"id":1,
"Costs": [
{
"price": 100
},
{
"price": 200
}
]
}
{
"id":2,
"Costs": [
{
"price": 300
},
{
"price": 400
}
]
}
Search Query:
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id",
"size": 15 <-- note this
},
"aggs": {
"nested_entries": {
"nested": {
"path": "Costs"
},
"aggs": {
"min_position": {
"min": {
"field": "Costs.price"
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": 2,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 300.0
}
}
}
]
}
}
The same can be achieved with a stats aggregation (again assuming a field id that uniquely identifies each document):
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id",
"size": 15 <-- note this
},
"aggs": {
"costs_stats": {
"nested": {
"path": "Costs"
},
"aggs": {
"price_stats_new": {
"stats": {
"field": "Costs.price"
}
}
}
}
}
}
}
}
Update 1:
To find the maximum among those per-document minimums (computed by the query above), you can use a max_bucket pipeline aggregation:
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id",
"size": 15 <-- note this
},
"aggs": {
"nested_entries": {
"nested": {
"path": "Costs"
},
"aggs": {
"min_position": {
"min": {
"field": "Costs.price"
}
}
}
}
}
},
"maxValue": {
"max_bucket": {
"buckets_path": "id_terms>nested_entries>min_position"
}
}
}
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": 2,
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 300.0
}
}
}
]
},
"maxValue": {
"value": 300.0,
"keys": [
"2"
]
}
}
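As a sanity check on what the terms + nested min + min_bucket/max_bucket combination computes, here is a minimal Python sketch run on the two sample documents. It does not talk to Elasticsearch; the `docs` list and the `min_of_mins_stats` helper are illustrative names, not part of any API.

```python
# Plain-Python model of "min/max over the per-document minimum price".
# `docs` mirrors the two sample documents; nothing here calls Elasticsearch.
docs = [
    {"id": 1, "Costs": [{"price": 100}, {"price": 200}]},
    {"id": 2, "Costs": [{"price": 300}, {"price": 400}]},
]


def min_of_mins_stats(documents):
    """Return (min, max) over each document's cheapest Costs entry."""
    # One value per document: what the nested `min` yields in each terms bucket.
    per_doc_min = [
        min(cost["price"] for cost in doc["Costs"])
        for doc in documents
        if doc["Costs"]  # assumption of this sketch: skip empty arrays
    ]
    # min_bucket / max_bucket then reduce across those per-bucket values.
    return min(per_doc_min), max(per_doc_min)


print(min_of_mins_stats(docs))  # (100, 300)
```

The result matches the values asked for in the question: min 100 and max 300, rather than the 100 and 400 that a plain stats aggregation over all nested entries returns.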

Related

Elasticsearch aggregation with unique counting

My documents consist of a history of orders and their state, here a minimal example:
{
"orderNumber" : "xyz",
"state" : "shipping",
"day" : "2022-07-20",
"timestamp" : "2022-07-20T15:06:44.290Z"
}
The state can be a string such as shipping, processing, redo, ...
For every possible state, I need to count the number of orders that had this state at some point during a day, without counting a state twice for the same orderNumber that day (which can happen if there is a problem and it needs to start from the beginning that same day).
My aggregation looks like this:
GET order-history/_search
{
"aggs": {
"countDays": {
"terms": {
"field": "day",
"order": {
"_key": "desc"
},
"size": 20
},
"aggs": {
"countStates": {
"terms": {
"field": "state.keyword",
"size": 10
}
}
}
}
},
"size": 1
}
However, this will count a state for a given orderNumber twice if it reappears that same day. How would I prevent it from counting a state twice for each orderNumber, if it is on the same day?
TL;DR
I don't think there is a simple, flexible solution.
But if you know the set of possible states in advance (perhaps from a separate aggregation query that lists every state), you could do the following:
POST /_bulk
{"index":{"_index":"73138766"}}
{"orderNumber":"xyz","state":"shipping","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"xyz","state":"redo","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"xyz","state":"shipping","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"bbb","state":"processing","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"bbb","state":"shipping","day":"2022-07-20"}
GET 73138766/_search
{
"size": 0,
"aggs": {
"per_day": {
"date_histogram": {
"field": "day",
"calendar_interval": "day"
},
"aggs": {
"shipping": {
"filter": { "term": { "state.keyword": "shipping" }
},
"aggs": {
"orders": {
"cardinality": {
"field": "orderNumber.keyword"
}
}
}
},
"processing": {
"filter": { "term": { "state.keyword": "processing" }
},
"aggs": {
"orders": {
"cardinality": {
"field": "orderNumber.keyword"
}
}
}
},
"redo": {
"filter": { "term": { "state.keyword": "redo" }
},
"aggs": {
"orders": {
"cardinality": {
"field": "orderNumber.keyword"
}
}
}
}
}
}
}
}
You will obtain the following results
{
"aggregations": {
"per_day": {
"buckets": [
{
"key_as_string": "2022-07-20T00:00:00.000Z",
"key": 1658275200000,
"doc_count": 5,
"shipping": {
"doc_count": 3,
"orders": {
"value": 2
}
},
"processing": {
"doc_count": 1,
"orders": {
"value": 1
}
},
"redo": {
"doc_count": 1,
"orders": {
"value": 1
}
}
}
]
}
}
}
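To make the intended semantics concrete, the following Python sketch reproduces the same counts on the five bulk-indexed sample documents without calling Elasticsearch. The names `events` and `unique_orders_per_state` are made up for this illustration.

```python
from collections import defaultdict

# The five sample documents from the _bulk request above.
events = [
    {"orderNumber": "xyz", "state": "shipping", "day": "2022-07-20"},
    {"orderNumber": "xyz", "state": "redo", "day": "2022-07-20"},
    {"orderNumber": "xyz", "state": "shipping", "day": "2022-07-20"},
    {"orderNumber": "bbb", "state": "processing", "day": "2022-07-20"},
    {"orderNumber": "bbb", "state": "shipping", "day": "2022-07-20"},
]


def unique_orders_per_state(rows):
    """Count distinct order numbers for every (day, state) pair."""
    seen = defaultdict(set)
    for row in rows:
        # A set ignores repeats, so an order that re-enters a state on the
        # same day is counted once, like the cardinality sub-aggregation.
        seen[(row["day"], row["state"])].add(row["orderNumber"])
    return {key: len(orders) for key, orders in seen.items()}


counts = unique_orders_per_state(events)
print(counts[("2022-07-20", "shipping")])  # 2
```

Keep in mind that the cardinality aggregation is an approximate count on large sets, whereas this sketch is exact; for a handful of orders per bucket the two agree.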

Nested array of objects aggregation in Elasticsearch

Documents in Elasticsearch are indexed as follows:
Document 1
{
"task_completed": 10,
"tagged_object": [
{
"category": "cat",
"count": 10
},
{
"category": "cars",
"count": 20
}
]
}
Document 2
{
"task_completed": 50,
"tagged_object": [
{
"category": "cars",
"count": 100
},
{
"category": "dog",
"count": 5
}
]
}
As you can see, the value of the category key is dynamic. I want to perform an aggregation similar to a SQL GROUP BY on category and return the sum of count for each category.
In the above example, the aggregation should return
cat: 10,
cars: 120 and
dog: 5
I would like to know how to write this aggregation query in Elasticsearch, if it is possible. Thanks in advance.
You can achieve the required result using nested, terms, and sum aggregations.
Here is a working example with index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"tagged_object": {
"type": "nested"
}
}
}
}
Search Query:
{
"size": 0,
"aggs": {
"resellers": {
"nested": {
"path": "tagged_object"
},
"aggs": {
"books": {
"terms": {
"field": "tagged_object.category.keyword"
},
"aggs":{
"sum_of_count":{
"sum":{
"field":"tagged_object.count"
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"resellers": {
"doc_count": 4,
"books": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cars",
"doc_count": 2,
"sum_of_count": {
"value": 120.0
}
},
{
"key": "cat",
"doc_count": 1,
"sum_of_count": {
"value": 10.0
}
},
{
"key": "dog",
"doc_count": 1,
"sum_of_count": {
"value": 5.0
}
}
]
}
}
}
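For comparison, the group-by-and-sum that the nested + terms + sum aggregation performs can be sketched in plain Python on the two sample documents. `documents` and `sum_count_by_category` are hypothetical names used only for this example.

```python
from collections import Counter

# The two sample documents from the question.
documents = [
    {
        "task_completed": 10,
        "tagged_object": [
            {"category": "cat", "count": 10},
            {"category": "cars", "count": 20},
        ],
    },
    {
        "task_completed": 50,
        "tagged_object": [
            {"category": "cars", "count": 100},
            {"category": "dog", "count": 5},
        ],
    },
]


def sum_count_by_category(docs):
    """Sum `count` per `category` across every nested tagged_object entry."""
    totals = Counter()
    for doc in docs:
        # Each nested entry is visited on its own, as the nested aggregation does.
        for tag in doc.get("tagged_object", []):
            totals[tag["category"]] += tag["count"]
    return dict(totals)


print(sum_count_by_category(documents))  # {'cat': 10, 'cars': 120, 'dog': 5}
```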

ElasticSearch cardinality aggregation with multiple query

I have documents with a merchant and an item. A document looks like this:
{
"merchant": "M1",
"item": "I1"
}
For the given list of merchant names, I want to get number of unique items on each merchant.
I was able to get number of unique items on a given merchant by following query:
{
"size": 0,
"query": {
"match": {
"merchant": "M1"
}
},
"aggs": {
"count_unique_items": {
"cardinality": {
"field": "I1"
}
}
}
}
Is there a way to expand this query so instead of 1 merchant, I can do search for N merchants with one query?
You need to use a terms query to match multiple merchants and a multilevel aggregation to find the unique count per merchant: create a terms aggregation on merchant, then add a cardinality aggregation as a sub-aggregation of it. The query looks like this:
{
"size": 0,
"query": {
"terms": {
"merchant": [
"M1",
"M2"
]
}
},
"aggs": {
"merchent": {
"terms": {
"field": "merchant"
},
"aggs": {
"item_count": {
"cardinality": {
"field": "item"
}
}
}
}
}
}
As suggested by @Opster ES Ninja Nishant, you need to use a multilevel aggregation.
Here is a working example with index data, search query, and search result.
Index Data:
{
"merchant": "M3",
"item": ["I3","I2"]
}
{
"merchant": "M2",
"item": ["I2","I2"]
}
{
"merchant": "M1",
"item": "I1"
}
Search Query:
To count the unique items for a given merchant, the cardinality aggregation should use the item field instead of I1:
{
"size":0,
"query": {
"terms": {
"merchant.keyword": [
"M1",
"M2",
"M3"
]
}
},
"aggs": {
"merchent": {
"terms": {
"field": "merchant.keyword"
},
"aggs": {
"item_count": {
"cardinality": {
"field": "item.keyword" <-- note this
}
}
}
}
}
}
Search Result:
"aggregations": {
"merchent": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "M1",
"doc_count": 1,
"item_count": {
"value": 1
}
},
{
"key": "M2",
"doc_count": 1,
"item_count": {
"value": 1
}
},
{
"key": "M3",
"doc_count": 1,
"item_count": {
"value": 2
}
}
]
}
}
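The same per-merchant unique count can be checked with a short Python sketch over the three sample documents. It is only a model of the terms query plus terms/cardinality aggregation; `docs` and `unique_items_per_merchant` are names invented for the example.

```python
# The three sample documents; `item` may be a single value or a list.
docs = [
    {"merchant": "M3", "item": ["I3", "I2"]},
    {"merchant": "M2", "item": ["I2", "I2"]},
    {"merchant": "M1", "item": "I1"},
]


def unique_items_per_merchant(documents, merchants):
    """Count distinct items for each requested merchant."""
    wanted = set(merchants)  # plays the role of the terms query
    items = {}
    for doc in documents:
        if doc["merchant"] not in wanted:
            continue
        value = doc["item"]
        values = value if isinstance(value, list) else [value]
        # A set per merchant drops duplicates such as ["I2", "I2"].
        items.setdefault(doc["merchant"], set()).update(values)
    return {merchant: len(found) for merchant, found in items.items()}


print(unique_items_per_merchant(docs, ["M1", "M2", "M3"]))
# {'M3': 2, 'M2': 1, 'M1': 1}
```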

How to get hours between Min and Max date in Elasticsearch Aggregation?

How can I calculate hours between max and min dates (same tree level of max and min) in Elasticsearch?
My Query:-
{
"size": 0,
"query": {
"bool": {
"must": []
}
},
"aggs": {
"group_by_areaId": {
"terms": {
"size": 100000,
"field": "areaId.keyword"
},
"aggs": {
"4m": {
"date_histogram": {
"field": "timestamp",
"format": "yyyy-MM-dd'T'HH:mm:ssZZ",
"interval": "4m",
"order": {
"_key": "asc"
}
},
"aggs": {
"maxDate": {
"max": {
"field": "timestamp"
}
},
"minDate": {
"min": {
"field": "timestamp"
}
}
}
}
}
}
}
}
And the (shortened) response is:
"aggregations": {
"group_by_areaId": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "key1",
"doc_count": 15,
"4m": {
"buckets": [
{
"key_as_string": "2020-02-12T06:08:00+0000",
"key": 1581487680000,
"doc_count": 3,
"minDate": {
"value": 1.581487847E12,
"value_as_string": "2020-02-12T06:10:47Z"
},
"maxDate": {
"value": 1.58148791E12,
"value_as_string": "2020-02-12T06:11:50Z"
},
// Need hours between maxDate and minDate here
//{
// "hours" : "0.0175" (maxDate-minDate)
//}
}
]
}
}
]
}
}
Can anyone please help me find a solution?
Thanks in advance.
You can leverage the bucket_script pipeline aggregation in order to compute the difference between min and max for each bucket.
Simply add the following at the same level as minDate and maxDate:
"hours": {
"bucket_script": {
"buckets_path": {
"min": "minDate",
"max": "maxDate"
},
"script": "(params.max - params.min) / 3600000"
}
}
For your sample data above, the result would be 0.0175 hours (63 seconds, i.e. roughly 1 minute).
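The arithmetic inside the script can be verified with a few lines of Python, using the two epoch-millisecond values from the sample response. `hours_between` is a hypothetical helper that mirrors the script expression; it is not an Elasticsearch function.

```python
MILLIS_PER_HOUR = 3_600_000  # same constant as in the bucket_script


def hours_between(min_millis, max_millis):
    """Mirror of the script: (params.max - params.min) / 3600000."""
    return (max_millis - min_millis) / MILLIS_PER_HOUR


# minDate and maxDate values taken from the sample response above.
hours = hours_between(1.581487847e12, 1.58148791e12)
print(round(hours, 4))  # 0.0175
```

The difference between the two timestamps is 63000 ms, so the result is 63000 / 3600000 = 0.0175 hours.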

Elastic aggregation to identify period A vs B percentage increases

I have some daily sales data indexed into Elasticsearch. I successfully run a number of aggregations to identify top sellers across a date range etc.
I am now trying to write a single query to do the following:
1. Identify the top n sellers over a date range (Period A).
2. Take the results of Period A and sum sales for these products over a second date range (Period B).
3. Compare sales in Period A to Period B and identify those with percentage increases above X%.
My attempt so far:
{
"query": {
"bool": {
"filter": [
{
"range": {
"date": {
"gte": "2017-10-01",
"lte": "2017-10-14"
}
}
}
]
}
},
"size": 0,
"aggs": {
"data_split": {
"terms": {
"size": 10,
"field": "product_id"
},
"aggs": {
"date_periods": {
"date_range": {
"field": "date",
"format": "YYYY-MM-dd",
"ranges": [
{
"from": "2017-10-01",
"to": "2017-10-07"
},
{
"from": "2017-10-08",
"to": "2017-10-14"
}
]
},
"aggs": {
"product_id_split": {
"terms": {
"field": "product_id"
},
"aggs": {
"unit_sum": {
"sum": {
"field": "units"
}
}
}
}
}
}
}
}
}
}
Although this outputs results for both periods, I don't think it is quite what I want: the initial filter runs from the Period A start date to the Period B end date, so I think the top sellers are being summed over that whole range instead of Period A only. I also don't get the % comparison. I would probably do that at the application level, but I understand it could be handled with a scripted Elastic query?
It would be especially awesome if, instead of the top n results in Period A, I could set a sales threshold of, say, 1,000 sales.
Any pointers would be much appreciated. Thanks in advance!
Currently running Elastic 5.6
{
"query": {
"bool": {
"filter": [
{
"range": {
"date": {
"gte": "2017-10-01",
"lte": "2017-10-14"
}
}
}
]
}
},
"size": 0,
"aggs": {
"data_split": {
"terms": {
"size": 10,
"field": "product_id"
},
"aggs": {
"date_period1": {
"filter": {
"range": {
"date": {
"gte": "2017-10-01",
"lte": "2017-10-07"
}
}
},
"aggs": {
"unit_sum": {
"sum": {
"field": "units"
}
}
}
},
"date_period2": {
"filter": {
"range": {
"date": {
"gte": "2017-10-08",
"lte": "2017-10-14"
}
}
},
"aggs": {
"unit_sum": {
"sum": {
"field": "units"
}
}
}
},
"percentage_increase": {
"bucket_script": {
"buckets_path": {
"firstPeriod": "date_period1>unit_sum",
"secondPeriod": "date_period2>unit_sum"
},
"script": "(params.secondPeriod-params.firstPeriod)*100/params.firstPeriod"
}
},
"retain_buckets": {
"bucket_selector": {
"buckets_path": {
"percentage": "percentage_increase"
},
"script": "params.percentage > 5"
}
}
}
}
}
}
The full test data is in this gist.
This aggregation returns the following result:
"aggregations": {
"data_split": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "A",
"doc_count": 6,
"date_period1": {
"doc_count": 3,
"unit_sum": {
"value": 150
}
},
"date_period2": {
"doc_count": 3,
"unit_sum": {
"value": 160
}
},
"percentage_increase": {
"value": 6.666666666666667
}
},
{
"key": "C",
"doc_count": 2,
"date_period1": {
"doc_count": 1,
"unit_sum": {
"value": 50
}
},
"date_period2": {
"doc_count": 1,
"unit_sum": {
"value": 70
}
},
"percentage_increase": {
"value": 40
}
}
]
}
}
The idea is to use two filter aggregations, one per date interval, and to calculate a sum in each. Then a third aggregation of type bucket_script calculates the percentage increase (note, though, that it will be a negative number if there is a decrease in sales, for example).
Then, using yet another aggregation - of type bucket_selector - you keep the product_ids where the percentage is larger than 5%.
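The bucket_script and bucket_selector steps can be modelled in Python to check the numbers in the result above. This is a sketch, not the Elasticsearch implementation: `period_sums`, `percentage_increase` and `growing_products` are illustrative names, product "B" is a hypothetical entry added only to show a bucket being filtered out, and skipping a zero first-period sum is an assumption of the sketch.

```python
# (Period A sum, Period B sum) per product. "A" and "C" come from the result
# above; "B" is hypothetical and exists only to show a bucket being dropped.
period_sums = {"A": (150, 160), "B": (100, 100), "C": (50, 70)}


def percentage_increase(first_period, second_period):
    """Same expression as the bucket_script."""
    return (second_period - first_period) * 100 / first_period


def growing_products(sums, threshold=5):
    """Keep products whose increase exceeds `threshold` percent."""
    kept = {}
    for product, (first, second) in sums.items():
        if first == 0:
            continue  # assumption of this sketch: no baseline, so skip it
        increase = percentage_increase(first, second)
        if increase > threshold:  # same test as the bucket_selector
            kept[product] = increase
    return kept


result = growing_products(period_sums)
print(sorted(result))  # ['A', 'C']
```

Product A grows by about 6.67% and product C by 40%, matching the percentage_increase values in the response, while the flat hypothetical product B is removed by the threshold.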

Resources