I want to calculate the difference of nested aggregations between two dates.
More concretely, is it possible to calculate date_1.buckets.field_1.buckets.field_2.buckets.field_3.value - date_2.buckets.field_1.buckets.field_2.buckets.field_3.value, given the request and response below? Is that possible with Elasticsearch v1.0.1?
The aggregation query request looks like this:
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"terms": {
"date": [
"2014-08-18 00:00:00.0",
"2014-08-15 00:00:00.0"
]
}
}
]
}
}
}
},
"aggs": {
"date_1": {
"filter": {
"terms": {
"date": [
"2014-08-18 00:00:00.0"
]
}
},
"aggs": {
"my_agg_1": {
"terms": {
"field": "field_1",
"size": 2147483647,
"order": {
"_term": "desc"
}
},
"aggs": {
"my_agg_2": {
"terms": {
"field": "field_2",
"size": 2147483647,
"order": {
"_term": "desc"
}
},
"aggs": {
"my_agg_3": {
"sum": {
"field": "field_3"
}
}
}
}
}
}
}
},
"date_2": {
"filter": {
"terms": {
"date": [
"2014-08-15 00:00:00.0"
]
}
},
"aggs": {
"my_agg_1": {
"terms": {
"field": "field_1",
"size": 2147483647,
"order": {
"_term": "desc"
}
},
"aggs": {
"my_agg_2": {
"terms": {
"field": "field_2",
"size": 2147483647,
"order": {
"_term": "desc"
}
},
"aggs": {
"my_agg_3": {
"sum": {
"field": "field_3"
}
}
}
}
}
}
}
}
}
}
And the response looks like this:
{
"took": 236,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1646,
"max_score": 0,
"hits": []
},
"aggregations": {
"date_1": {
"doc_count": 823,
"field_1": {
"buckets": [
{
"key": "field_1_key_1",
"doc_count": 719,
"field_2": {
"buckets": [
{
"key": "key_1",
"doc_count": 275,
"field_3": {
"value": 100
}
}
]
}
}
]
}
},
"date_2": {
"doc_count": 823,
"field_1": {
"buckets": [
{
"key": "field_1_key_1",
"doc_count": 719,
"field_2": {
"buckets": [
{
"key": "key_1",
"doc_count": 275,
"field_3": {
"value": 80
}
}
]
}
}
]
}
}
}
}
Thank you.
With a newer Elasticsearch version (e.g. 5.6.9) this is possible:
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"bool": {
"filter": [
{
"range": {
"date_created": {
"gte": "2018-06-16T00:00:00+02:00",
"lte": "2018-06-16T23:59:59+02:00"
}
}
}
]
}
}
}
},
"aggs": {
"by_millisec": {
"range" : {
"script" : {
"lang": "painless",
"source": "doc['date_delivered'][0] - doc['date_created'][0]"
},
"ranges" : [
{ "key": "<1sec", "to": 1000.0 },
{ "key": "1-5sec", "from": 1000.0, "to": 5000.0 },
{ "key": "5-30sec", "from": 5000.0, "to": 30000.0 },
{ "key": "30-60sec", "from": 30000.0, "to": 60000.0 },
{ "key": "1-2min", "from": 60000.0, "to": 120000.0 },
{ "key": "2-5min", "from": 120000.0, "to": 300000.0 },
{ "key": "5-10min", "from": 300000.0, "to": 600000.0 },
{ "key": ">10min", "from": 600000.0 }
]
}
}
}
}
Arithmetic operations between the results of two aggregations are not allowed in the Elasticsearch DSL, not even with scripts (up to version 1.1.1, as far as I know).
Such operations need to be handled on the client side after processing the aggs result.
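A minimal client-side sketch of that subtraction, assuming the response shape shown in the question (the helper names index_sums and diff_between_dates are made up for illustration, and a key pair missing on one side is treated as 0, which is an assumption):

```python
def index_sums(date_agg):
    """Map (field_1 key, field_2 key) -> field_3 sum for one date's aggregation."""
    sums = {}
    for b1 in date_agg["field_1"]["buckets"]:
        for b2 in b1["field_2"]["buckets"]:
            sums[(b1["key"], b2["key"])] = b2["field_3"]["value"]
    return sums


def diff_between_dates(aggregations, first="date_1", second="date_2"):
    """Subtract the second date's sums from the first, per key pair."""
    a = index_sums(aggregations[first])
    b = index_sums(aggregations[second])
    # Assumption: a key pair that is absent on one side counts as 0.
    return {k: a.get(k, 0) - b.get(k, 0) for k in sorted(set(a) | set(b))}


def one_date(value):
    """Build the single-bucket structure from the question's response."""
    return {
        "field_1": {"buckets": [{
            "key": "field_1_key_1",
            "field_2": {"buckets": [{"key": "key_1", "field_3": {"value": value}}]},
        }]}
    }


# The "aggregations" part of the response from the question (100 vs 80).
aggregations = {"date_1": one_date(100), "date_2": one_date(80)}
diff = diff_between_dates(aggregations)
print(diff)
```

Keying on the pair of bucket keys avoids relying on both dates returning their buckets in the same order.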
Reference
elasticsearch aggregation to sort by ratio of aggregations
In 1.0.1 I couldn't find anything, but in 1.4.2 you could try the scripted_metric aggregation (still experimental).
Here is the scripted_metric documentation page.
I am not good with the Elasticsearch syntax, but I think your metric inputs would be:
init_script - just initialize an accumulator for each date:
"init_script": "_agg.d1Val = 0; _agg.d2Val = 0;"
map_script - test the date of the document and add to the right accumulator:
"map_script": "if (doc.date == firstDate) { _agg.d1Val += doc.field_3; } else { _agg.d2Val += doc.field_3; };",
reduce_script - accumulate intermediate data from various shards and return the final results:
"reduce_script": "totalD1 = 0; totalD2 = 0; for (agg in _aggs) { totalD1 += agg.d1Val; totalD2 += agg.d2Val; }; return totalD1 - totalD2"
I don't think that in this case you need a combine_script.
Of course, if you can't use 1.4.2 then this is no help :-)
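To make the init/map/reduce flow easier to follow, here is a plain Python simulation of it. This is only a sketch of the logic, not Elasticsearch code: the shard contents and the function names are invented for illustration, and both accumulators are updated with += so that every document on a shard is counted.

```python
def init_state():
    # Mirrors init_script: one accumulator per date, per shard.
    return {"d1Val": 0, "d2Val": 0}


def map_doc(state, doc, first_date):
    # Mirrors map_script: add field_3 to the accumulator for the doc's date.
    if doc["date"] == first_date:
        state["d1Val"] += doc["field_3"]
    else:
        state["d2Val"] += doc["field_3"]


def reduce_states(states):
    # Mirrors reduce_script: combine the per-shard states into one number.
    total_d1 = sum(s["d1Val"] for s in states)
    total_d2 = sum(s["d2Val"] for s in states)
    return total_d1 - total_d2


def run(shards, first_date):
    states = []
    for docs in shards:
        state = init_state()
        for doc in docs:
            map_doc(state, doc, first_date)
        states.append(state)
    return reduce_states(states)


# Hypothetical documents spread over two shards.
shards = [
    [{"date": "2014-08-18", "field_3": 60}, {"date": "2014-08-15", "field_3": 50}],
    [{"date": "2014-08-18", "field_3": 40}, {"date": "2014-08-15", "field_3": 30}],
]
result = run(shards, "2014-08-18")
print(result)
```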
Related
I would like to build a daily/monthly aggregation query.
Is the only solution to create from/to ranges for each day or month? I can generate the ranges, but it seems there should be another way to achieve this.
Also, how can I format the from/to epoch milliseconds as yyyy-MM-dd in each result?
{
"aggs": {
"aggs_sum_amount": {
"filters": {
"filters": {
"Amount1": {
...
},
"Amount2": {
...
}
}
},
"aggs": {
"range": {
"date_range": {
"field": "dateField",
"ranges": [
{
"from": "1613347200000",
"to": "1613433600000"
},
{
"from": "1613433600000",
"to": "1613520000000"
}
...
]
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
}
}
}
}
}
}
}
Example response
{
"aggregations": {
"aggs_sum_amount": {
"buckets": {
"Amount1": {
"doc_count": 26,
"range": {
"buckets": [
{
"key": "1613347200000-1613433600000",
"from": 1.6133472E12,
"from_as_string": "1613347200000",
"to": 1.6134336E12,
"to_as_string": "1613433600000",
"doc_count": 0,
"sum_amount": {
"value": 0.0
}
},
{
...
}
]
}
},
"Amount2": {
...
}
}
}
}
}
You can use the date_histogram aggregation.
It lets you specify the interval for which you want buckets, so you don't have to generate the ranges yourself.
The example on the linked page is quite self-explanatory.
I've updated it to match your use-case:
POST /sales/_search?size=0
{
"aggs": {
"sales_over_time": {
"date_histogram": {
"field": "date",
"calendar_interval": "1M",
"format": "yyyy-MM-dd"
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
}
}
}
}
}
Response should be something like this:
{
...
"aggregations": {
"sales_over_time": {
"buckets": [
{
"key_as_string": "2015-01-01",
"key": 1420070400000,
"doc_count": 3,
"sum_amount": {
"value": 15.0
}
},
{
"key_as_string": "2015-02-01",
"key": 1422748800000,
"doc_count": 2,
"sum_amount": {
"value": 10.0
}
},
{
"key_as_string": "2015-03-01",
"key": 1425168000000,
"doc_count": 2,
"sum_amount": {
"value": 25.0
}
}
]
}
}
}
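As a rough sketch of how a client might build the request for either granularity and read the formatted dates back, assuming the response shape shown above (histogram_request and sums_by_period are hypothetical helper names; "1d" and "1M" are the calendar_interval values for a day and a month):

```python
import json


def histogram_request(date_field, interval, amount_field="amount"):
    """Build a date_histogram body; interval is "1d" (daily) or "1M" (monthly)."""
    return {
        "aggs": {
            "sales_over_time": {
                "date_histogram": {
                    "field": date_field,
                    "calendar_interval": interval,
                    "format": "yyyy-MM-dd",
                },
                "aggs": {"sum_amount": {"sum": {"field": amount_field}}},
            }
        }
    }


def sums_by_period(response, agg_name="sales_over_time", metric="sum_amount"):
    """Map each bucket's formatted date to its summed amount."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key_as_string"]: b[metric]["value"] for b in buckets}


# The sample response from above, trimmed to the fields that are used.
sample = {
    "aggregations": {
        "sales_over_time": {
            "buckets": [
                {"key_as_string": "2015-01-01", "key": 1420070400000, "sum_amount": {"value": 15.0}},
                {"key_as_string": "2015-02-01", "key": 1422748800000, "sum_amount": {"value": 10.0}},
                {"key_as_string": "2015-03-01", "key": 1425168000000, "sum_amount": {"value": 25.0}},
            ]
        }
    }
}

daily_body = histogram_request("dateField", "1d")
totals = sums_by_period(sample)
print(json.dumps(daily_body, indent=2))
print(totals)
```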
Elasticsearch newbie here. I have a series of log messages like these:
{
"#timestamp": "whatever",
"type": "toBeMonitored",
"success": true
}
I was tasked with reacting to a change of -30% in the total number of successful messages compared to the same interval yesterday. So if I run the check at 8 AM today, I should compare today's total count from midnight to 8 AM with yesterday's count for the same interval.
I tried creating a date histogram aggregation, but I would like to get the percentage difference as part of the query result rather than doing the math in application code.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"type": "toBeMonitored"
}
},
{
"term": {
"status": true
}
},
{
"range": {
"#timestamp": {
"gte": "now-1d/d",
"lte": "now/h"
}
}
}
]
}
},
"aggs": {
"histo": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "1h"
}
}
}
}
Any idea on how this might be accomplished?
You can leverage the derivative pipeline aggregation to achieve exactly what you expect:
POST /sales/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"type": "toBeMonitored"
}
},
{
"term": {
"status": true
}
},
{
"range": {
"#timestamp": {
"gte": "now-1d/d",
"lte": "now/h"
}
}
}
]
}
},
"aggs": {
"histo": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "1h"
},
"aggs": {
"successDiff": {
"derivative": {
"buckets_path": "_count"
}
}
}
}
}
}
In each bucket you're going to get the difference between the document count in the previous bucket vs the current bucket.
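For intuition, the values that successDiff carries can be reproduced from a list of hourly document counts. The sketch below uses invented counts and a hypothetical helper name; the first bucket has no previous bucket, so it gets no value:

```python
def derivative(counts):
    """Difference between each bucket's count and the previous bucket's count."""
    if not counts:
        return []
    diffs = [None]  # the first bucket has nothing to compare against
    for prev, cur in zip(counts, counts[1:]):
        diffs.append(cur - prev)
    return diffs


hourly_counts = [10, 14, 9]  # hypothetical doc counts for three hourly buckets
diffs = derivative(hourly_counts)
print(diffs)
```

Note that this compares consecutive hours, so comparing against the same hour of the previous day still needs the two buckets to be paired up separately.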
I ended up dropping the date_histogram aggregation and using the date_range one. It's much easier to work with, even though it does not return the difference compared to yesterday's same time period. I did that in code.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"type": "toBeMonitored"
}
},
{
"term": {
"status": true
}
},
{
"range": {
"#timestamp": {
"gte": "now-1d/d",
"lte": "now/h"
}
}
}
]
}
},
"aggs": {
"ranged_documents": {
"date_range": {
"field": "#timestamp",
"ranges": [
{
"key": "yesterday",
"from": "now-1d/d",
"to": "now-24h/h"
},
{
"key": "today",
"from": "now/d",
"to": "now/h"
}
],
"keyed": true
}
}
}
}
This query would yield a result similar to the one below:
{
"_shards": {
"total": 42,
"failed": 0,
"successful": 42,
"skipped": 0
},
"hits": {
"hits": [],
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null
},
"took": 134,
"timed_out": false,
"aggregations": {
"ranged_documents": {
"buckets": {
"yesterday": {
"from_as_string": "2020-10-12T00:00:00.000Z",
"doc_count": 268300,
"to_as_string": "2020-10-12T12:00:00.000Z",
"from": 1602460800000,
"to": 1602504000000
},
"today": {
"from_as_string": "2020-10-13T00:00:00.000Z",
"doc_count": 251768,
"to_as_string": "2020-10-13T12:00:00.000Z",
"from": 1602547200000,
"to": 1602590400000
}
}
}
}
}
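The comparison done in code can be sketched as follows. It assumes the keyed response shape shown above; percent_change and should_alert are hypothetical names, and the -30% threshold comes from the question:

```python
def percent_change(response, agg="ranged_documents"):
    """Percentage change of today's doc count relative to yesterday's."""
    buckets = response["aggregations"][agg]["buckets"]
    yesterday = buckets["yesterday"]["doc_count"]
    today = buckets["today"]["doc_count"]
    if yesterday == 0:
        return None  # no baseline to compare against
    return (today - yesterday) / yesterday * 100.0


def should_alert(change, threshold=-30.0):
    """True when the count dropped by at least the threshold percentage."""
    return change is not None and change <= threshold


# Doc counts taken from the sample response above.
response = {
    "aggregations": {
        "ranged_documents": {
            "buckets": {
                "yesterday": {"doc_count": 268300},
                "today": {"doc_count": 251768},
            }
        }
    }
}
change = percent_change(response)
print(round(change, 2), should_alert(change))
```

With the sample counts the drop is about 6%, so no alert would fire.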
I would like to execute this kind of operation in Painless:
if (_value >= 'c') {
return _value
} else {
return '__BAD__'
}
The value is a string and I would like the following behaviour:
if the value is foo, I want to replace it with __BAD__; if the value is bar, I want to keep bar. Only values alphabetically after 'c' should be set to __BAD__.
I got this exception:
"lang": "painless",
"caused_by": {
"type": "class_cast_exception",
"reason": "Cannot apply [>] operation to types [java.lang.String] and [java.lang.String]."
}
Is there a way to perform an alphabetical comparison between strings in Painless?
My documents look like this:
{
"id": "doca",
"categoryId": "aaa",
"parentNames": "a$aa$aaa"
},
{
"id": "docb",
"categoryId": "bbb",
"parentNames": "a$aa$bbb"
},
{
"id": "docz",
"categoryId": "zzz",
"parentNames": "a$aa$zzz"
}
and my query looks like this:
{
"query": {
"bool": {
"filter": []
}
},
"size": 0,
"aggs": {
"catNames": {
"terms": {
"size": 10000,
"order": {
"_key": "asc"
},
"script": {
"source": "if(doc['parentNames'].value < 'a$aa$ccc') {return doc['parentNames'].value} return '__BAD__'",
"lang": "painless"
}
},
"aggs": {
"sort": {
"bucket_sort": {
"size": 2
}
},
"catId": {
"terms": {
"field": "categoryId",
"size": 1
}
}
}
}
}
}
I am expecting this result:
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"catNames": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "__BAD__",
"doc_count": 1,
"catId": {
"buckets": [
{
"key": "aaa",
"doc_count": 1
}
]
}
},
{
"key": "a$aa$bbb",
"doc_count": 1,
"catId": {
"buckets": [
{
"key": "bbb",
"doc_count": 1
}
]
}
},
{
"key": "a$aa$zzz",
"doc_count": 1,
"catId": {
"buckets": [
{
"key": "zzz",
"doc_count": 1
}
]
}
}
]
}
}
}
In fact, I can use the compareTo method of java.lang.String, which returns an integer that has to be compared against 0:
if (_value.compareTo('c') > 0) {
return _value
} else {
return '__BAD__'
}
My query becomes:
{
"query": {
"bool": {
"filter": []
}
},
"size": 0,
"aggs": {
"catNames": {
"terms": {
"size": 10000,
"order": {
"_key": "asc"
},
"script": {
"source": "if(doc['parentNames'].value.compareTo('a$aa$ccc') < 0) {return doc['parentNames'].value} return '__BAD__'",
"lang": "painless"
}
},
"aggs": {
"sort": {
"bucket_sort": {
"size": 2
}
},
"catId": {
"terms": {
"field": "categoryId",
"size": 1
}
}
}
}
}
}
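As a quick sanity check of which bucket key each document would get, the same lexicographic rule can be mimicked in Python. This is only an analogy: Python's < on strings stands in for compareTo(...) < 0, which agrees for the ASCII-only values used here, and bucket_key is a made-up helper name.

```python
def bucket_key(parent_names, limit="a$aa$ccc"):
    """Keep values that sort before the limit, replace the rest with __BAD__."""
    return parent_names if parent_names < limit else "__BAD__"


# parentNames values of the three sample documents.
parent_names = ["a$aa$aaa", "a$aa$bbb", "a$aa$zzz"]
keys = [bucket_key(name) for name in parent_names]
print(keys)
```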
I'm using the following terms aggregation to get the views and clicks of each campaign (by campaign_id):
"aggregations": {
"campaigns": {
"terms": {
"field": "campaign_id",
"size": 10,
"order": {
"_term": "asc"
}
},
"aggregations": {
"actions": {
"terms": {
"field": "action",
"size": 10
}
}
}
}}
This is the response I get:
"aggregations": {
"campaigns": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "someId",
"doc_count": 12,
"actions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "click",
"doc_count": 3
},
{
"key": "view",
"doc_count": 9
}
]
}
}
]
}
}
EDIT:
Here is an example of a document (only the relevant parts of it):
{
"_index": "action",
"_type": "click",
"_id": "AVI2XOTl8otXlszOjypT",
"_score": 1,
"_source": {
"ip": "127.0.0.1",
"timestamp": "2016-01-12T15:03:23.622743524Z",
"action": "click",
"campaign_id": "IypmiroC"
}}
I need to be able to retrieve the conversion rate of each campaign (clicks / views), and I can't do it on the client side since I need to be able to sort by conversion rate.
Any help would be much appreciated.
This requires several aggregations and ES 2.x. First, I get all unique campaign_id values with a terms aggregation. Then I filter by action and count the documents with that particular action. Finally, you need a pipeline aggregation (introduced in ES 2.0), specifically the bucket script aggregation, to take the ratio. This is how it looks:
{
"size": 0,
"aggs": {
"unique_campaign": {
"terms": {
"field": "campaign_id",
"size": 10
},
"aggs": {
"click_bucket": {
"filter": {
"term": {
"action": "click"
}
},
"aggs": {
"click_count": {
"value_count": {
"field": "action"
}
}
}
},
"view_bucket": {
"filter": {
"term": {
"action": "view"
}
},
"aggs": {
"view_count": {
"value_count": {
"field": "action"
}
}
}
},
"conversion_ratio": {
"bucket_script": {
"buckets_path": {
"total_clicks": "click_bucket>click_count",
"total_views": "view_bucket>view_count"
},
"script": "total_clicks/total_views"
}
}
}
}
}
}
Also, the action field needs a not_analyzed mapping, as Click won't match click.
Hope this helps!!
As of 7.x, sorting can be achieved as follows, using these two pipeline aggregations (just a demo for reference):
bucket_script
bucket_sort
{
"size": 0,
"aggs": {
"mallBucket": {
"terms": {
"field": "mallId",
"size": 20,
"min_doc_count": 3,
"shard_size": 10000
},
"aggs": {
"totalOrderCount": {
"value_count": {
"field": "orderSn"
}
},
"filteredCoupon": {
"filter": {
"terms": {
"tags": [
"hello",
"cool"
]
}
},
"aggs": {
"couponCount": {
"value_count": {
"field": "orderSn"
}
}
}
},
"countRatio": {
"bucket_script": {
"buckets_path": {
"orderCount": "totalOrderCount",
"couponCount": "filteredCoupon>couponCount"
},
"script": "params.couponCount/params.orderCount"
}
},
"ratio_bucket_sort": {
"bucket_sort": {
"sort": [
{
"countRatio": {
"order": "desc"
}
}
],
"size": 20
}
}
}
}
}
}
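To illustrate what the ratio followed by the sort produces for the campaign data in the question, here is a plain Python sketch of the same two steps. The first bucket is taken from the question's response; the second campaign (otherId) is invented so that there is something to sort, and skipping campaigns without views is an assumption made here to avoid dividing by zero.

```python
def conversion_rates(campaign_buckets):
    """clicks / views per campaign, computed from the nested action buckets."""
    rates = {}
    for campaign in campaign_buckets:
        counts = {a["key"]: a["doc_count"] for a in campaign["actions"]["buckets"]}
        views = counts.get("view", 0)
        if views == 0:
            continue  # assumption: campaigns without views are left out
        rates[campaign["key"]] = counts.get("click", 0) / views
    return rates


def sort_by_rate(rates, size=20):
    """Order campaigns by descending rate and keep the top `size`."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:size]


campaign_buckets = [
    {"key": "someId", "actions": {"buckets": [
        {"key": "click", "doc_count": 3}, {"key": "view", "doc_count": 9}]}},
    # Hypothetical second campaign, not part of the original response.
    {"key": "otherId", "actions": {"buckets": [
        {"key": "click", "doc_count": 4}, {"key": "view", "doc_count": 8}]}},
]
rates = conversion_rates(campaign_buckets)
ranked = sort_by_rate(rates)
print(ranked)
```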
I have a query as follows:
{
"size": 0,
"query": {
"filtered": {
"query": {
"bool": {
"must": [
{
"match": {
"_type": "grx-ipx"
}
},
{
"range": {
"#timestamp": {
"gte": "2015-09-08T15:00:00.000Z",
"lte": "2015-09-08T15:10:00.000Z"
}
}
}
]
}
},
"filter": {
"and": [
{
"terms": {
"inSightCustID": [
"ASD001",
"ZXC049"
]
}
},
{
"terms": {
"reportFamily": [
"GRXoIPX",
"LTEoIPX"
]
}
}
]
}
}
},
"_source": [
"inSightCustID",
"fiveMinuteIn",
"reportFamily",
"#timestamp"
],
"aggs": {
"timestamp": {
"terms": {
"field": "#timestamp",
"size": 5
},
"aggs": {
"reportFamily": {
"terms": {
"field": "reportFamily"
},
"aggs": {
"averageFiveMinute": {
"avg": {
"field": "fiveMinuteIn"
}
}
}
}
}
},
"distinct_timestamps": {
"cardinality": {
"field": "#timestamp"
}
}
}
}
The result of this query looks like:
...
"aggregations": {
"distinct_timestamps": {
"value": 3,
"value_as_string": "1970-01-01T00:00:00.003Z"
},
"timestamp": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1441724700000,
"key_as_string": "2015-09-08T15:05:00.000Z",
"doc_count": 10,
"reportFamily": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "GRXoIPX",
"doc_count": 5,
"averageFiveMinute": {
"value": 1687.6
}
},
{
"key": "LTEoIPX",
"doc_count": 5,
"averageFiveMinute": {
"value": 56710.6
}
}
]
}
},
...
What I want to do is for each bucket in the reportFamily aggregation, I want to show the sum of the averageFiveMinute values. So for instance, in the example above, I would also like to show the sum of 1687.6 and 56710.6. I want to do this for all reportFamily aggregations.
Here is what I have tried:
{
"size": 0,
"query": {
"filtered": {
"query": {
"bool": {
"must": [
{
"match": {
"_type": "grx-ipx"
}
},
{
"range": {
"#timestamp": {
"gte": "2015-09-08T15:00:00.000Z",
"lte": "2015-09-08T15:10:00.000Z"
}
}
}
]
}
},
"filter": {
"and": [
{
"terms": {
"inSightCustID": [
"ASD001",
"ZXC049"
]
}
},
{
"terms": {
"reportFamily": [
"GRXoIPX",
"LTEoIPX"
]
}
}
]
}
}
},
"_source": [
"inSightCustID",
"fiveMinuteIn",
"reportFamily",
"#timestamp"
],
"aggs": {
"timestamp": {
"terms": {
"field": "#timestamp",
"size": 5
},
"aggs": {
"reportFamily": {
"terms": {
"field": "reportFamily"
},
"aggs": {
"averageFiveMinute": {
"avg": {
"field": "fiveMinuteIn"
}
}
}
},
"sum_AvgFiveMinute": {
"sum_bucket": {
"buckets_path": "reportFamily>averageFiveMinute"
}
}
}
},
"distinct_timestamps": {
"cardinality": {
"field": "#timestamp"
}
}
}
}
I have added:
"sum_AvgFiveMinute": {
"sum_bucket": {
"buckets_path": "reportFamily>averageFiveMinute"
}
}
But unfortunately, this triggers an exception: Parse Failure [Could not find aggregator type [sum_bucket] in [sum_AvgFiveMinute]]
I expected the results to be something like:
...
"aggregations": {
"distinct_timestamps": {
"value": 3,
"value_as_string": "1970-01-01T00:00:00.003Z"
},
"timestamp": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1441724700000,
"key_as_string": "2015-09-08T15:05:00.000Z",
"doc_count": 10,
"reportFamily": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "GRXoIPX",
"doc_count": 5,
"averageFiveMinute": {
"value": 1687.6
}
},
{
"key": "LTEoIPX",
"doc_count": 5,
"averageFiveMinute": {
"value": 56710.6
}
}
]
},
"sum_AvgFiveMinute": {
"value": 58398.2
}
},
...
What is wrong with this query and how can I achieve the expected result?
Here is a link to the sum bucket aggregation docs.
Many thanks for the help.
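For reference, the expected sum_AvgFiveMinute value can be computed on the client from the response already shown. This is only a sketch of that fallback, assuming the response shape above (add_sum_of_averages is a hypothetical helper name). The parse failure itself would be consistent with a cluster older than 2.0, since pipeline aggregations were introduced in ES 2.0, but the cluster version here is an assumption.

```python
def add_sum_of_averages(timestamp_buckets, name="sum_AvgFiveMinute"):
    """Attach the sum of the reportFamily averages to each timestamp bucket."""
    for bucket in timestamp_buckets:
        total = sum(
            family["averageFiveMinute"]["value"]
            for family in bucket["reportFamily"]["buckets"]
        )
        bucket[name] = {"value": total}
    return timestamp_buckets


# The timestamp bucket from the response above, trimmed to the fields used.
timestamp_buckets = [
    {
        "key_as_string": "2015-09-08T15:05:00.000Z",
        "reportFamily": {
            "buckets": [
                {"key": "GRXoIPX", "averageFiveMinute": {"value": 1687.6}},
                {"key": "LTEoIPX", "averageFiveMinute": {"value": 56710.6}},
            ]
        },
    }
]
enriched = add_sum_of_averages(timestamp_buckets)
print(enriched[0]["sum_AvgFiveMinute"])
```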