Elasticsearch average over date histogram buckets

I've got a bunch of documents indexed in ElasticSearch, and I need to get the following data:
For each month, get the average number of documents per working day of the month (or if impossible, use 20 days as the default).
I already aggregated my data into monthly buckets using the date histogram aggregation. I tried to nest a stats bucket, but that aggregation uses data extracted from the documents' fields, not from the parent bucket.
Here is my query so far:
{
"query": {
"match_all": {}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"???": "???"
}
}
}
}
Edit:
To make my question clearer, what I need is:
Get the total number of documents created in the month (which is already done thanks to the date_histogram aggregation)
Get the number of working days for the month
Divide the first by the second.

For anyone still interested, you can now do this with the avg_bucket aggregation. It's still a bit tricky, because you cannot simply run the avg_bucket on a date_histogram aggregation result, but with a secondary value_count aggregation on some always-present field it works fine :)
{
"size": 0,
"aggs": {
"orders_per_day": {
"date_histogram": {
"field": "orderedDate",
"interval": "day"
},
"aggs": {
"amount": {
"value_count": {
"field": "dateCreated"
}
}
}
},
"avg_daily_order": {
"avg_bucket": {
"buckets_path": "orders_per_day>amount"
}
}
}
}
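Depending on your Elasticsearch version, you may not even need the helper value_count aggregation: the buckets_path syntax has a special _count key that refers to each bucket's document count. A minimal sketch of that variant (untested, same field names as above):
{
  "size": 0,
  "aggs": {
    "orders_per_day": {
      "date_histogram": {
        "field": "orderedDate",
        "interval": "day"
      }
    },
    "avg_daily_order": {
      "avg_bucket": {
        "buckets_path": "orders_per_day>_count"
      }
    }
  }
}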

Here is a pretty convoluted and not really performant solution, using the following scripted_metric aggregation.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"avg_doc_per_biz_day": {
"scripted_metric": {
"init_script": "_agg.bizdays = []; _agg.allbizdays = [:]; start = new DateTime(1970, 1, 1, 0, 0); now = new DateTime(); while (start < now) { def end = start.plusMonths(1); _agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 }; start = end; }",
"map_script": "_agg.bizdays << _agg.allbizdays[doc. created_date.date.year+'_'+doc. created_date.date.monthOfYear]",
"combine_script": "_agg.allbizdays = null; doc_count = 0; for (d in _agg.bizdays){ doc_count++ }; return doc_count / _agg.bizdays[0]",
"reduce_script": "res = 0; for (a in _aggs) { res += a }; return res"
}
}
}
}
}
}
Let's detail each script below.
What I'm doing in init_script is creating a map of the number of business days for each month since 1970 and storing that in the _agg.allbizdays map.
_agg.bizdays = [];
_agg.allbizdays = [:];
start = new DateTime(1970, 1, 1, 0, 0);
now = new DateTime();
while (start < now) {
def end = start.plusMonths(1);
_agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 };
start = end;
}
In map_script, I'm simply retrieving the number of weekdays for the month of each document:
_agg.bizdays << _agg.allbizdays[doc.created_date.date.year + '_' + doc.created_date.date.monthOfYear];
In combine_script, I'm computing the average doc count per business day for each shard:
_agg.allbizdays = null;
doc_count = 0;
for (d in _agg.bizdays){ doc_count++ };
return doc_count / _agg.bizdays[0];
And finally, in reduce_script, I'm summing up the per-shard averages across nodes:
res = 0;
for (a in _aggs) { res += a };
return res
Again, I think it's pretty convoluted and, as Andrei rightly said, it is probably better to wait for 2.0 to make it work the way it should, but in the meantime you have this solution if you need it.

What you basically need is something like this (which doesn't work, as it's not an available feature):
{
"query": {
"match_all": {}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"average": {
"avg": {
"script": "doc_count / 20"
}
}
}
}
}
}
It doesn't work because there is no way of accessing the doc_count from the "parent" aggregation.
But, this will be possible in the 2.x branch of Elasticsearch and, at the moment, it's being actively developed: https://github.com/elastic/elasticsearch/issues/8110
This new feature will add a second layer of manipulation over the results (buckets) of an aggregation, and it covers not only your use case but many others.
Unless you want to try some ideas out there or perform your own calculations in your app, you need to wait for this feature.
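For reference, once pipeline aggregations became available (and with Painless as the default scripting language in 5.x and later), the "divide by 20 working days" fallback from the question can be expressed with a bucket_script that reads each month bucket's document count via the special _count path. A minimal sketch, untested:
{
  "size": 0,
  "aggs": {
    "docs_per_month": {
      "date_histogram": {
        "field": "created_date",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_per_working_day": {
          "bucket_script": {
            "buckets_path": { "count": "_count" },
            "script": "params.count / 20"
          }
        }
      }
    }
  }
}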

You want to exclude documents with a timestamp on Saturday or Sunday, so you can exclude those documents in your query using a script:
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "doc['#timestamp'].date.dayOfWeek != 7 && doc['#timestamp'].date.dayOfWeek != 6"
}
}
}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"docs_per_day": {
"date_histogram": {
"field": "created_date",
"interval": "day",
"min_doc_count": 0
},
"aggs": {
"docs_count": {
"avg": {
"field": ""
}
}
}
}
}
}
}
}
You may not need the first aggregation by month, since you already have this information using the day interval.
BTW, you need to make sure dynamic scripting is enabled by adding this to your elasticsearch.yml configuration:
script.disable_dynamic: false
Or add a Groovy script under config/scripts and use a filtered query with a script filter referencing it.
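If I remember the 1.x syntax correctly, the file-based variant would look roughly like this: save the day-of-week check as config/scripts/weekday_filter.groovy (a hypothetical file name) and reference it via script_file instead of an inline script. A sketch; parameter names may differ slightly between minor versions:
// config/scripts/weekday_filter.groovy (hypothetical file name)
doc['#timestamp'].date.dayOfWeek != 7 && doc['#timestamp'].date.dayOfWeek != 6
"query": {
  "filtered": {
    "filter": {
      "script": {
        "script_file": "weekday_filter",
        "lang": "groovy"
      }
    }
  }
}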

Related

How to calculate & draw metrics of scripted metrics in elasticsearch

Problem description
We have log files from different devices parsed into our Elasticsearch database, line by line. The log files are built as a ring buffer, so they always have a fixed size of 1000 lines. They can be manually exported whenever needed. After import and parsing in Elasticsearch, each document represents a single line of a log file with the following information:
DeviceID: 12345
FileType: ErrorLog
FileTimestamp: 2022-05-10 01:23:45
LogTimestamp: 2022-05-05 01:23:45
LogMessage: something very important here
Now I want to have a statistic on the timespan that is usually covered by that fixed number of lines. Because, depending on the intensity of the usage of the device, a varying number of log entries is generated and the files can cover from just a few days to several months... But since the log files are split into individual lines, it is not that trivial (I suppose).
My goal is to have a chart that shows me a "histogram" of the different log file timespans...
First Try: Visualize library > Data table
I started by creating a Data table in the Visualize library where I was able to aggregate the data as follows:
I added 3 Buckets --> so I have all lines bucketed by their original file:
Split rows DeviceID.keyword
Split rows FileType.keyword
Split rows FileTimestamp
... and 2 Metrics --> to show the log file timespan (I couldn't find a way to create a max-min metric, so I started with individual metrics for max and min):
Metric Min LogTimeStamp
Metric Max LogTimeStamp
This results in the following query:
{
"aggs": {
"2": {
"terms": {
"field": "DeviceID.keyword",
"order": {
"_key": "desc"
},
"size": 100
},
"aggs": {
"3": {
"terms": {
"field": "FileType.keyword",
"order": {
"_key": "desc"
},
"size": 5
},
"aggs": {
"4": {
"terms": {
"field": "FileTimestamp",
"order": {
"_key": "desc"
},
"size": 100
},
"aggs": {
"1": {
"min": {
"field": "LogTimeStamp"
}
},
"5": {
"max": {
"field": "LogTimeStamp"
}
}
}
}
}
}
}
}
},
"size": 0,
...
}
... and this output:
DeviceID FileType FileTimestamp Min LogTimestamp Max LogTimestamp
---------------------------------------------------------------------------------------------
12345 ErrorLog 2022-05-10 01:23:45 2022-04-10 01:23:45 2022-05-10 01:23:45
...
Looks good so far! The expected result would be exactly 1 month for this example.
But my research showed that it is not possible to add the desired metrics here, so I needed to try something else...
Second Try: Visualize library > Custom visualization (Vega-Lite)
So I started some more research and found out that Vega might be a possibility. I was already able to transfer the bucket part from the first attempt there, and I also added a scripted metric to automatically calculate the timespan (instead of min & max), so far, so good. The request body looks as follows:
body: {
"aggs": {
"DeviceID": {
"terms": { "field": "DeviceID.keyword" },
"aggs": {
"FileType": {
"terms": { "field": "FileType.keyword" } ,
"aggs": {
"FileTimestamp": {
"terms": { "field": "FileTimestamp" } ,
"aggs": {
"timespan": {
"scripted_metric": {
"init_script": "state.values = [];",
"map_script": "state.values.add(doc['#timestamp'].value);",
"combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
"reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
}
}
}
}
}
}
}
}
},
"size": 0,
}
...with this response (unnecessary information removed to reduce complexity):
{
"took": 12245,
"timed_out": false,
"_shards": { ... },
"hits": { ... },
"aggregations": {
"DeviceID": {
"buckets": [
{
"key": "12345",
"FileType": {
"buckets": [
{
"key": "ErrorLog",
"FileTimeStamp": {
"buckets": [
{
"key": 1638447972000,
"key_as_string": "2021-12-02T12:26:12.000Z",
"doc_count": 1000,
"timespan": {
"value": 31339243240
}
},
{
"key": 1636023881000,
"key_as_string": "2021-11-04T11:04:41.000Z",
"doc_count": 1000,
"timespan": {
"value": 31339243240
}
}
]
}
},
{
"key": "InfoLog",
"FileTimeStamp": {
"buckets": [
{
"key": 1635773438000,
"key_as_string": "2021-11-01T13:30:38.000Z",
"doc_count": 1000,
"timespan": {
"value": 2793365000
}
},
{
"key": 1636023881000,
"key_as_string": "2021-11-04T11:04:41.000Z",
"doc_count": 1000,
"timespan": {
"value": 2643772000
}
}
]
}
}
]
}
},
{
"key": "12346",
"FileType": {
...
}
},
...
]
}
}
}
Yeah, it seems to work! Now I have the timespan for each original log file.
Question
Now I am stuck with:
I want to average the timespans for each original log file (identified via the combination of DeviceID + FileType + FileTimeStamp) to prevent devices with multiple imported log files from having a higher weight than devices with only one imported log file. I tried to add another aggregation for the avg, but I couldn't figure out where to put it so that the result of the scripted_metric is used. My closest attempt was to put an avg_bucket after the FileTimeStamp bucket:
Request:
body: {
"aggs": {
"DeviceID": {
"terms": { "field": "DeviceID.keyword" },
"aggs": {
"FileType": {
"terms": { "field": "FileType.keyword" } ,
"aggs": {
"FileTimestamp": {
"terms": { "field": "FileTimestamp" } ,
"aggs": {
"timespan": {
"scripted_metric": {
"init_script": "state.values = [];",
"map_script": "state.values.add(doc['FileTimestamp'].value);",
"combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
"reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
}
}
}
},
// new part - start
"avg_timespan": {
"avg_bucket": {
"buckets_path": "FileTimestamp>timespan"
}
}
// new part - end
}
}
}
}
},
"size": 0,
}
But I receive the following error:
EsError: buckets_path must reference either a number value or a single value numeric metric aggregation, got: [InternalScriptedMetric] at aggregation [timespan]
So is it the right spot (just not applicable to a scripted metric)? Or am I on the wrong path?
I need to plot all this, but I can't find my way through all the buckets, etc.
I read about flattening, which would probably be a good idea (if done by the server, the result would not be that complex), but I don't know where and how to apply the flattening transformation.
I imagine the resulting chart like this:
x-axis = log file timespan, where the timespan is "binned" according to a given step size (e.g. 1 day), so there are only bars for each bin (1 = 0-1days, 2 = 1-2days, 3 = 2-3days, etc.) and not for all the different timespans of log files
y-axis = count of devices
type: lines or vertical bars, split by file type
Any help is really appreciated! Thanks in advance!
If you have the privileges to create a transform, then the Elastic Painless example "Getting duration by using bucket script" can do exactly what you want. It creates a new index where all documents are grouped according to your needs.
To create the transform:
go to Stack Management > Transforms > + Create a transform
select Edit JSON config for the Pivot configuration object
paste & apply the JSON below
check whether the result is as expected in the Transform preview
fill out the rest of the transform details + save the transform
JSON config
{
"group_by": {
"DeviceID": {
"terms": {
"field": "DeviceID.keyword"
}
},
"FileType": {
"terms": {
"field": "FileType.keyword"
}
},
"FileTimestamp": {
"terms": {
"field": "FileTimestamp"
}
}
},
"aggregations": {
"TimeStampStats": {
"stats": {
"field": "#timestamp"
}
},
"TimeSpan": {
"bucket_script": {
"buckets_path": {
"first": "TimeStampStats.min",
"last": "TimeStampStats.max"
},
"script": "params.last - params.first"
}
}
}
}
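If you prefer to test the pivot from the Dev Tools console instead of the UI, recent versions expose a preview endpoint that accepts the same configuration. A sketch; the source index name my-logs is an assumption, replace it with yours:
POST _transform/_preview
{
  "source": {
    "index": "my-logs"
  },
  "pivot": {
    "group_by": {
      "DeviceID": { "terms": { "field": "DeviceID.keyword" } },
      "FileType": { "terms": { "field": "FileType.keyword" } },
      "FileTimestamp": { "terms": { "field": "FileTimestamp" } }
    },
    "aggregations": {
      "TimeStampStats": { "stats": { "field": "#timestamp" } },
      "TimeSpan": {
        "bucket_script": {
          "buckets_path": {
            "first": "TimeStampStats.min",
            "last": "TimeStampStats.max"
          },
          "script": "params.last - params.first"
        }
      }
    }
  }
}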
Now you can create a chart from the new index, for example with these settings:
Vertical Bars
Metrics:
Y-axis = "Count"
Buckets:
X-axis = "TimeSpan"
Split series = "FileType"

Alternative solution to Cumulative Cardinality Aggregation in Elasticsearch

I'm running an Elasticsearch cluster that doesn't have access to X-Pack features on AWS, but I'd still like to do a cumulative cardinality aggregation to determine the daily counts of new users to my site.
Is there an alternate solution to this problem?
For example, how could I transform:
GET /user_hits/_search
{
"size": 0,
"aggs": {
"users_per_day": {
"date_histogram": {
"field": "timestamp",
"calendar_interval": "day"
},
"aggs": {
"distinct_users": {
"cardinality": {
"field": "user_id"
}
},
"total_new_users": {
"cumulative_cardinality": {
"buckets_path": "distinct_users"
}
}
}
}
}
}
To produce the same result without cumulative_cardinality?
Cumulative cardinality was added precisely for that reason -- it wasn't easily calculable before...
As with almost anything in ElasticSearch, though, there's a script to get it done for ya. Here's my take on it.
Set up an index
PUT user_hits
{
"mappings": {
"properties": {
"timestamp": {
"type": "date",
"format": "yyyy-MM-dd"
},
"user_id": {
"type": "keyword"
}
}
}
}
Add 1 new user on one day and 2 more hits the day after, one of which is not from a strictly 'new' user.
POST user_hits/_doc
{"user_id":1,"timestamp":"2020-10-01"}
POST user_hits/_doc
{"user_id":1,"timestamp":"2020-10-02"}
POST user_hits/_doc
{"user_id":3,"timestamp":"2020-10-02"}
Mock a date histogram using a parametrized start date + number of days, group the users accordingly, and then compare each day's results against the previous day's:
GET /user_hits/_search
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2020-10-01"
}
}
},
"aggs": {
"new_users_count_vs_prev_day": {
"scripted_metric": {
"init_script": """
state.by_day_map = [:];
state.start_millis = new SimpleDateFormat("yyyy-MM-dd").parse(params.start_date).getTime();
state.day_millis = 24 * 60 * 60 * 1000;
state.dt_formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);
""",
"map_script": """
for (def step = 1; step < params.num_of_days + 1; step++) {
def timestamp = doc.timestamp.value.millis;
def user_id = doc['user_id'].value;
def anchor = state.start_millis + (step * state.day_millis);
// add a `n__` prefix to more easily sort the resulting map later on
def anchor_pretty = step + '__' + state.dt_formatter.format(Instant.ofEpochMilli(anchor));
if (timestamp <= anchor) {
if (state.by_day_map.containsKey(anchor_pretty)) {
state.by_day_map[anchor_pretty].add(user_id);
} else {
state.by_day_map[anchor_pretty] = [user_id];
}
}
}
""",
"combine_script": """
List keys=new ArrayList(state.by_day_map.keySet());
Collections.sort(keys);
def unique_sorted_map = new TreeMap();
def unique_from_prev_day = [];
for (def key : keys) {
def unique_users_per_day = new HashSet(state.by_day_map.get(key));
unique_users_per_day.removeIf(user -> unique_from_prev_day.contains(user));
// remove the `n__` prefix
unique_sorted_map.put(key.substring(3), unique_users_per_day.size());
unique_from_prev_day.addAll(unique_users_per_day);
}
return unique_sorted_map
""",
"reduce_script": "return states",
"params": {
"start_date": "2020-10-01",
"num_of_days": 5
}
}
}
}
}
yielding
"aggregations" : {
"new_users_count_vs_prev_day" : {
"value" : [
{
"2020-10-01" : 1, <-- 1 new unique user
"2020-10-02" : 1, <-- another new unique user
"2020-10-03" : 0,
"2020-10-04" : 0,
"2020-10-05" : 0
}
]
}
}
The script is guaranteed to be slow but has one, potentially quite useful, advantage -- you can adjust it to return the full list of new user IDs, not just the count that you'd get from the cumulative cardinality which, according to its implementation's author, only works in a sequential, cumulative manner by design.
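For instance, changing a single line of the combine_script above would return the actual IDs per day instead of their counts. A sketch, untested:
"combine_script": """
  List keys = new ArrayList(state.by_day_map.keySet());
  Collections.sort(keys);
  def unique_sorted_map = new TreeMap();
  def unique_from_prev_day = [];
  for (def key : keys) {
    def unique_users_per_day = new HashSet(state.by_day_map.get(key));
    unique_users_per_day.removeIf(user -> unique_from_prev_day.contains(user));
    // keep the new user IDs themselves instead of just their count
    unique_sorted_map.put(key.substring(3), new ArrayList(unique_users_per_day));
    unique_from_prev_day.addAll(unique_users_per_day);
  }
  return unique_sorted_map
""",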

How to get the latest value for the bucket in Elasticsearch?

I have a bunch of documents with just a count field.
I'm trying to get the latest value for that field aggregated by date:
{
"query": {
"match_all": {}
},
"sort": "_timestamp",
"aggs": {
"result": {
"date_histogram": {
"field": "_timestamp",
"interval": "day",
"min_doc_count": 0
},
"aggs": {
"last_value": {
"scripted_metric": {
"params": {
"_agg": {
"last_value": 0
}
},
"map_script": "_agg.last_value = doc['count'].value",
"reduce_script": "return _aggs.last().last_value"
}
}
}
}
}
}
But the problem here is that documents fall into last_value aggregation not sorted by _timestamp, so I can't guarantee that the last value is really the last value.
So, my questions:
Is it possible to sort data by _timestamp when performing last_value aggregation?
Is there any better way to get the last value aggregated by day?
Looks like it is possible to tune scripted_metric aggregations a little bit to solve the first part of the question (sorting by _timestamp):
"last_value": {
"scripted_metric": {
"params": {
"_agg": {
"value": 0,
"timestamp": 0
}
},
"map_script": "_agg.value = doc['count'].value; _agg.timestamp = doc['_timestamp'].value",
"reduce_script": "value = 0; timestamp=0; for (a in _aggs) { if(a.timestamp > timestamp){ value = a.value; timestamp = a.timestamp} }; return value;"
}
}
But I continue to doubt that this is the best way to solve it.
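A simpler alternative, if your version has it (top_hits exists since 1.3), is a top_hits sub-aggregation sorted by _timestamp descending with size 1, which returns the last document of each day directly. A sketch, assuming _timestamp is enabled as in your mapping:
{
  "size": 0,
  "aggs": {
    "result": {
      "date_histogram": {
        "field": "_timestamp",
        "interval": "day",
        "min_doc_count": 0
      },
      "aggs": {
        "last_doc": {
          "top_hits": {
            "size": 1,
            "sort": [ { "_timestamp": { "order": "desc" } } ],
            "_source": [ "count" ]
          }
        }
      }
    }
  }
}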

How to subtract aggregate min from aggregate max (difference) in ES?

How to write an ES query to find the difference between max and min value of a field?
I am a newbie in Elasticsearch.
In my case I feed a lot of events, along with session_id and time, into Elasticsearch.
My event structure is
Event_name string `json:"Event_name"`
Client_id string `json:"Client_id"`
App_id string `json:"App_id"`
Session_id string `json:"Session_id"`
User_id string `json:"User_id"`
Ip_address string `json:"Ip_address"`
Latitude int64 `json:"Latitude"`
Longitude int64 `json:"Longitude"`
Event_time time.Time `json:"Time"`
I want to find the lifetime of a session_id based on the fed events.
For that I can retrieve the maximum Event_time and minimum Event_time for a particular session_id with the following ES query:
{
"size": 0,
"query": {
"match": {
"Session_id": "dummySessionId"
}
},
"aggs": {
"max_time": {
"max": {
"field": "Time"
}
},
"min_time":{
"min": {
"field": "Time"
}
}
}
}
But what I actually want is (max_time - min_time).
How do I write the ES query for that?
Up to Elasticsearch 1.1.1, it is not possible to do any arithmetic operation upon two aggregate functions' results on the Elasticsearch side.
If you want that, you should do it on the client side.
It is not possible through scripts either, as @eliasah suggests.
In upcoming versions they may add such a facility.
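For what it's worth, on later releases (5.x and up, where bucket_script variables are resolved via params in Painless) a bucket_script pipeline aggregation can compute the difference on the server side. A minimal sketch against the fields from the question, assuming Session_id is mapped so it can be used in a terms aggregation:
{
  "size": 0,
  "aggs": {
    "by_session": {
      "terms": { "field": "Session_id" },
      "aggs": {
        "max_time": { "max": { "field": "Time" } },
        "min_time": { "min": { "field": "Time" } },
        "session_length_ms": {
          "bucket_script": {
            "buckets_path": { "max": "max_time", "min": "min_time" },
            "script": "params.max - params.min"
          }
        }
      }
    }
  }
}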
In 1.5.1, using the Scripted Metric Aggregation, you can do this. Not sure about the performance, but it seems to work. This functionality is experimental and may be changed or removed completely in a future release.
POST test_time
POST test_time/data/1
{"Session_id":1234,"Event_time":"2014-01-01T12:00:00"}
POST test_time/data/3
{"Session_id":1234,"Event_time":"2014-01-01T14:00:00"}
GET /test_time/_search
{
"size": 0,
"aggs": {
"by_user": {
"terms": {
"field": "Session_id"
},
"aggs": {
"session_lenght_sec": {
"scripted_metric": {
"map_script": "_agg['v'] = doc['Event_time'].value",
"reduce_script": "min = null; max = null; for (a in _aggs) {if (min == null || a.v < min) { min = a.v}; if (max == null || a.v > max) { max = a.v }}; return (max-min)/1000"
}
}
}
}
}
}
###### RESPONSE #######
{
...,
"aggregations": {
"by_user": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1234,
"doc_count": 2,
"session_lenght_sec": {
"value": "7200"
}
}
]
}
}
}
This answer is bound to the Elasticsearch 7.8 version.
Following up on #pippobaudos's answer above: Elasticsearch has made some major changes since that answer.
The 'scripted_metric' aggregation type now has the sub-attributes init_script, map_script, combine_script and reduce_script, of which only init_script is optional. Following is the modified query.
"aggs": {
"cumulative":{
"scripted_metric": {
"init_script": {
"source": "state.stars = []"
},
"map_script": {
"source": "if (doc.containsKey('star_count')) { state.stars.add(doc['star_count'].value); }"
},
"combine_script": {
"source": "long min=9223372036854775807L,max=-9223372036854775808L; for (a in state.stars) {if ( a < min) { min = a;} if ( a > max) { max = a; }} return (max-min)"
},
"reduce_script": {
"source": "long max = -9223372036854775808L; for (a in states) { if (a != null && a > max){ max=a; } } return max "
}
}
}
}
Giving you the query directly will not help much, so I suggest you read the documentation about script fields and scripting.

Getting count and grouping by date range in elastic search

Is there a way to get the count of rows and group them by hour, day or month?
For instance, assume I have the messages
_source{
"timestamp":"2013-10-01T12:30:25.421Z",
"amount":200
}
_source{
"timestamp":"2013-10-01T12:35:25.421Z",
"amount":300
}
_source{
"timestamp":"2013-10-02T13:53:25.421Z",
"amount":100
}
_source{
"timestamp":"2013-10-03T15:53:25.421Z",
"amount":400
}
Is there a way to get something along the lines of {date, sum}? (Not necessarily in this format, just wondering if there is any way I can achieve this.)
{
{"2013-10-01T12:00:00.000Z", 500},
{"2013-10-02T13:00:00.000Z", 100},
{"2013-10-03T15:00:00.000Z", 400}
}
Thank you
Try with aggregations.
{
"aggs": {
"amount_per_month": {
"date_histogram": {
"field": "timestamp",
"interval": "week"
},
"aggs": {
"total_amount": {
"sum": {
"field": "amount"
}
}
}
}
}
}
In addition, if you want to count the number of documents instead, replace the sum content with:
"sum": {
"script": "1"
}
Hope it helps.
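Also note that every date_histogram bucket already comes back with a doc_count, so you often don't need an extra metric at all; if you do want an explicit one, a value_count on the timestamp field gives the same number. A sketch:
"aggs": {
  "amount_per_week": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "week"
    },
    "aggs": {
      "doc_total": {
        "value_count": {
          "field": "timestamp"
        }
      }
    }
  }
}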
I needed a query to fetch data from Elasticsearch for month-wise and year-wise counts of registered customers on our platform.
The queries below work perfectly and return the data correctly.
Here, CustOnboardedOn is the field recording when the customer was onboarded.
Method type: POST
URL: http://SomeIP:9200/customer/_search?size=0
ES query for month-wise aggregated customers:
{
"aggs": {
"amount_per_month": {
"date_histogram": {
"field": "CustOnboardedOn",
"interval": "month"
}
}
}
}
ES query for year-wise aggregation:
{
"aggs": {
"amount_per_month": {
"date_histogram": {
"field": "CustOnboardedOn",
"interval": "year"
}
}
}
}
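On Elasticsearch 7.x and later, the bare interval parameter is deprecated in favour of calendar_interval (or fixed_interval), so the month-wise query would become, as a sketch:
{
  "aggs": {
    "amount_per_month": {
      "date_histogram": {
        "field": "CustOnboardedOn",
        "calendar_interval": "month"
      }
    }
  }
}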
