Alternative solution to Cumulative Cardinality Aggregation in Elasticsearch

I'm running an Elasticsearch cluster on AWS that doesn't have access to X-Pack, but I'd still like to do a cumulative cardinality aggregation to determine the daily counts of new users to my site.
Is there an alternate solution to this problem?
For example, how could I transform:
GET /user_hits/_search
{
"size": 0,
"aggs": {
"users_per_day": {
"date_histogram": {
"field": "timestamp",
"calendar_interval": "day"
},
"aggs": {
"distinct_users": {
"cardinality": {
"field": "user_id"
}
},
"total_new_users": {
"cumulative_cardinality": {
"buckets_path": "distinct_users"
}
}
}
}
}
}
to produce the same result without cumulative_cardinality?

Cumulative cardinality was added precisely for that reason -- it wasn't easily calculable before...
As with almost anything in ElasticSearch, though, there's a script to get it done for ya. Here's my take on it.
Set up an index
PUT user_hits
{
"mappings": {
"properties": {
"timestamp": {
"type": "date",
"format": "yyyy-MM-dd"
},
"user_id": {
"type": "keyword"
}
}
}
}
Add 1 new user on the first day and 2 more hits the day after, one of which comes from a user who is not strictly 'new'.
POST user_hits/_doc
{"user_id":1,"timestamp":"2020-10-01"}
POST user_hits/_doc
{"user_id":1,"timestamp":"2020-10-02"}
POST user_hits/_doc
{"user_id":3,"timestamp":"2020-10-02"}
Mock a date histogram using a parametrized start date + number of days, group the users accordingly, and then compare each day's result with the previous day's:
GET /user_hits/_search
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2020-10-01"
}
}
},
"aggs": {
"new_users_count_vs_prev_day": {
"scripted_metric": {
"init_script": """
state.by_day_map = [:];
state.start_millis = new SimpleDateFormat("yyyy-MM-dd").parse(params.start_date).getTime();
state.day_millis = 24 * 60 * 60 * 1000;
state.dt_formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);
""",
"map_script": """
for (def step = 1; step < params.num_of_days + 1; step++) {
def timestamp = doc.timestamp.value.millis;
def user_id = doc['user_id'].value;
def anchor = state.start_millis + (step * state.day_millis);
// add a `n__` prefix to more easily sort the resulting map later on
def anchor_pretty = step + '__' + state.dt_formatter.format(Instant.ofEpochMilli(anchor));
if (timestamp <= anchor) {
if (state.by_day_map.containsKey(anchor_pretty)) {
state.by_day_map[anchor_pretty].add(user_id);
} else {
state.by_day_map[anchor_pretty] = [user_id];
}
}
}
""",
"combine_script": """
List keys=new ArrayList(state.by_day_map.keySet());
Collections.sort(keys);
def unique_sorted_map = new TreeMap();
def unique_from_prev_day = [];
for (def key : keys) {
def unique_users_per_day = new HashSet(state.by_day_map.get(key));
unique_users_per_day.removeIf(user -> unique_from_prev_day.contains(user));
// remove the `n__` prefix
unique_sorted_map.put(key.substring(3), unique_users_per_day.size());
unique_from_prev_day.addAll(unique_users_per_day);
}
return unique_sorted_map
""",
"reduce_script": "return states",
"params": {
"start_date": "2020-10-01",
"num_of_days": 5
}
}
}
}
}
yielding
"aggregations" : {
"new_users_count_vs_prev_day" : {
"value" : [
{
"2020-10-01" : 1, <-- 1 new unique user
"2020-10-02" : 1, <-- another new unique user
"2020-10-03" : 0,
"2020-10-04" : 0,
"2020-10-05" : 0
}
]
}
}
The script is guaranteed to be slow, but it has one potentially quite useful advantage: you can adjust it to return the full list of new user IDs, not just their count. That's something cumulative_cardinality cannot give you; according to its implementation's author, it only works in a sequential, cumulative manner by design.
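For instance, a minimal tweak to the combine_script above (everything else unchanged) stores the per-day ID sets instead of their sizes, so the response lists the new user IDs themselves:
"combine_script": """
List keys = new ArrayList(state.by_day_map.keySet());
Collections.sort(keys);
def unique_sorted_map = new TreeMap();
def unique_from_prev_day = [];
for (def key : keys) {
  def unique_users_per_day = new HashSet(state.by_day_map.get(key));
  unique_users_per_day.removeIf(user -> unique_from_prev_day.contains(user));
  // keep the IDs themselves rather than calling .size()
  unique_sorted_map.put(key.substring(3), new ArrayList(unique_users_per_day));
  unique_from_prev_day.addAll(unique_users_per_day);
}
return unique_sorted_map
"""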

Related

How to calculate & draw metrics of scripted metrics in elasticsearch

Problem description
We have log files from different devices parsed into our Elasticsearch database, line by line. The log files are built as a ring buffer, so they always have a fixed size of 1000 lines. They can be manually exported whenever needed. After import and parsing in Elasticsearch, each document represents a single line of a log file with the following information:
DeviceID: 12345
FileType: ErrorLog
FileTimestamp: 2022-05-10 01:23:45
LogTimestamp: 2022-05-05 01:23:45
LogMessage: something very important here
Now I want a statistic on the timespan that is usually covered by that fixed number of lines, because, depending on how intensively a device is used, a varying number of log entries is generated and the files can cover anything from just a few days to several months... But since the log files are split into individual lines, this is not that trivial (I suppose).
My goal is to have a chart that shows me a "histogram" of the different log file timespans...
First Try: Visualize library > Data table
I started by creating a Data table in the Visualize library where I was able to aggregate the data as follows:
I added 3 Buckets --> so I have all lines bucketed by their original file:
Split rows DeviceID.keyword
Split rows FileType.keyword
Split rows FileTimestamp
... and 2 Metrics --> to show the log file timespan (I couldn't find a way to create a max-min metric, so I started with individual metrics for max and min):
Metric Min LogTimeStamp
Metric Max LogTimeStamp
This results in the following query:
{
"aggs": {
"2": {
"terms": {
"field": "DeviceID.keyword",
"order": {
"_key": "desc"
},
"size": 100
},
"aggs": {
"3": {
"terms": {
"field": "FileType.keyword",
"order": {
"_key": "desc"
},
"size": 5
},
"aggs": {
"4": {
"terms": {
"field": "FileTimestamp",
"order": {
"_key": "desc"
},
"size": 100
},
"aggs": {
"1": {
"min": {
"field": "LogTimeStamp"
}
},
"5": {
"max": {
"field": "LogTimeStamp"
}
}
}
}
}
}
}
}
},
"size": 0,
...
}
... and this output:
DeviceID FileType FileTimestamp Min LogTimestamp Max LogTimestamp
---------------------------------------------------------------------------------------------
12345 ErrorLog 2022-05-10 01:23:45 2022-04-10 01:23:45 2022-05-10 01:23:45
...
Looks good so far! The expected result would be exactly 1 month for this example.
But my research showed that it is not possible to add the desired metrics here, so I needed to try something else...
Second Try: Visualize library > Custom visualization (Vega-Lite)
So I did some more research and found out that Vega might be a possibility. I was already able to transfer the bucket part from the first attempt, and I also added a scripted metric to calculate the timespan automatically (instead of min & max). So far, so good. The request body looks as follows:
body: {
"aggs": {
"DeviceID": {
"terms": { "field": "DeviceID.keyword" },
"aggs": {
"FileType": {
"terms": { "field": "FileType.keyword" } ,
"aggs": {
"FileTimestamp": {
"terms": { "field": "FileTimestamp" } ,
"aggs": {
"timespan": {
"scripted_metric": {
"init_script": "state.values = [];",
"map_script": "state.values.add(doc['#timestamp'].value);",
"combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
"reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
}
}
}
}
}
}
}
}
},
"size": 0,
}
...with this response (unnecessary information removed to reduce complexity):
{
"took": 12245,
"timed_out": false,
"_shards": { ... },
"hits": { ... },
"aggregations": {
"DeviceID": {
"buckets": [
{
"key": "12345",
"FileType": {
"buckets": [
{
"key": "ErrorLog",
"FileTimeStamp": {
"buckets": [
{
"key": 1638447972000,
"key_as_string": "2021-12-02T12:26:12.000Z",
"doc_count": 1000,
"timespan": {
"value": 31339243240
}
},
{
"key": 1636023881000,
"key_as_string": "2021-11-04T11:04:41.000Z",
"doc_count": 1000,
"timespan": {
"value": 31339243240
}
}
]
}
},
{
"key": "InfoLog",
"FileTimeStamp": {
"buckets": [
{
"key": 1635773438000,
"key_as_string": "2021-11-01T13:30:38.000Z",
"doc_count": 1000,
"timespan": {
"value": 2793365000
}
},
{
"key": 1636023881000,
"key_as_string": "2021-11-04T11:04:41.000Z",
"doc_count": 1000,
"timespan": {
"value": 2643772000
}
}
]
}
}
]
}
},
{
"key": "12346",
"FileType": {
...
}
},
...
]
}
}
}
Yeah, it seems to work! Now I have the timespan for each original log file.
Question
Now I am stuck with:
I want to average the timespans per original log file (identified via the combination of DeviceID + FileType + FileTimestamp) to prevent devices with multiple imported log files from having a higher weight than devices with only one imported log file. I tried to add another aggregation for the average, but I couldn't figure out where to put it so that the result of the scripted_metric is used. My closest attempt was to put an avg_bucket after the FileTimestamp bucket:
Request:
body: {
"aggs": {
"DeviceID": {
"terms": { "field": "DeviceID.keyword" },
"aggs": {
"FileType": {
"terms": { "field": "FileType.keyword" } ,
"aggs": {
"FileTimestamp": {
"terms": { "field": "FileTimestamp" } ,
"aggs": {
"timespan": {
"scripted_metric": {
"init_script": "state.values = [];",
"map_script": "state.values.add(doc['FileTimestamp'].value);",
"combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
"reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
}
}
}
},
// new part - start
"avg_timespan": {
"avg_bucket": {
"buckets_path": "FileTimestamp>timespan"
}
}
// new part - end
}
}
}
}
},
"size": 0,
}
But I receive the following error:
EsError: buckets_path must reference either a number value or a single value numeric metric aggregation, got: [InternalScriptedMetric] at aggregation [timespan]
So is this the right spot (just not applicable to a scripted metric), or am I on the wrong path?
I need to plot all of this, but I can't find my way through all the buckets, etc.
I read about flattening (which would probably be a good idea, since the result would not be that complex if the server did it), but I don't know where and how to apply the flattening transformation.
I imagine the resulting chart like this:
x-axis = log file timespan, where the timespan is "binned" according to a given step size (e.g. 1 day), so there are only bars for each bin (1 = 0-1days, 2 = 1-2days, 3 = 2-3days, etc.) and not for all the different timespans of log files
y-axis = count of devices
type: lines or vertical bars, split by file type
Any help is really appreciated! Thanks in advance!
If you have the privileges to create a transform, then the elastic painless example Getting duration by using bucket script can do exactly what you want. It creates a new index where all documents are grouped according to your needs.
To create the transform:
go to Stack Management > Transforms > + Create a transform
select Edit JSON config for the Pivot configuration object
paste & apply the JSON below
check whether the result is the expected in the Transform preview
fill out the rest of the transform details + save the transform
JSON config
{
"group_by": {
"DeviceID": {
"terms": {
"field": "DeviceID.keyword"
}
},
"FileType": {
"terms": {
"field": "FileType.keyword"
}
},
"FileTimestamp": {
"terms": {
"field": "FileTimestamp"
}
}
},
"aggregations": {
"TimeStampStats": {
"stats": {
"field": "#timestamp"
}
},
"TimeSpan": {
"bucket_script": {
"buckets_path": {
"first": "TimeStampStats.min",
"last": "TimeStampStats.max"
},
"script": "params.last - params.first"
}
}
}
}
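If you prefer the API over the UI, the same pivot can also be submitted with the transform API; here's a sketch where the transform ID and the source/destination index names are placeholders you'd adapt to your setup:
PUT _transform/logfile-timespan
{
  "source": {
    "index": "my-log-index"
  },
  "dest": {
    "index": "logfile-timespans"
  },
  "pivot": {
    "group_by": {
      "DeviceID": { "terms": { "field": "DeviceID.keyword" } },
      "FileType": { "terms": { "field": "FileType.keyword" } },
      "FileTimestamp": { "terms": { "field": "FileTimestamp" } }
    },
    "aggregations": {
      "TimeStampStats": { "stats": { "field": "#timestamp" } },
      "TimeSpan": {
        "bucket_script": {
          "buckets_path": {
            "first": "TimeStampStats.min",
            "last": "TimeStampStats.max"
          },
          "script": "params.last - params.first"
        }
      }
    }
  }
}
POST _transform/logfile-timespan/_start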
Now you can create a chart from the new index, for example with these settings:
Vertical Bars
Metrics:
Y-axis = "Count"
Buckets:
X-axis = "TimeSpan"
Split series = "FileType"

ElasticSearch Filter by sum of nested documents

I am trying to filter products where the sum of a property over (filtered) nested objects falls within some range.
I have the following mapping:
{
"product": {
"properties": {
"warehouses": {
"type": "nested",
"properties": {
"stock_level": {
"type": "integer"
}
}
}
}
}
}
Example data:
{
"id": 1,
"warehouses": [
{
"id": 2001,
"stock_level": 5
},
{
"id": 2002,
"stock_level": 0
},
{
"id": 2003,
"stock_level": 2
}
]
}
In ElasticSearch 5.6 I used to do this:
GET products/_search
{
"query": {
"bool": {
"filter": [
[
{
"script": {
"script": {
"source": """
int total = 0;
for (def warehouse: params['_source']['warehouses']) {
if (params.warehouse_ids == null || params.warehouse_ids.contains(warehouse.id)) {
total += warehouse.stock_level;
}
}
boolean gte = true;
boolean lte = true;
if (params.gte != null) {
gte = (total >= params.gte);
}
if (params.lte != null) {
lte = (total <= params.lte);
}
return (gte && lte);
""",
"lang": "painless",
"params": {
"gte": 4
}
}
}
}
]
]
}
}
}
The problem is that params['_source']['warehouses'] no longer works in ES 6.8, and I am unable to find a way to access nested documents in the script.
I have tried:
doc['warehouses'] - returns an error ("No field found for [warehouses] in mapping with types []")
ctx._source.warehouses - "Variable [ctx] is not defined."
I have also tried to use a scripted field, but it seems that scripted fields are calculated at the very last stage and are not available during the query.
I also have sorting by the same logic (sort products by the sum of stock levels in the given warehouses), and it works like a charm:
"sort": {
"warehouses.stock_level": {
"order": "desc",
"mode": "sum",
"nested": {
"path": "warehouses"
"filter": {
"terms": {
"warehouses.id": [2001, 2003]
}
}
}
}
}
But I can't find a way to access this sort value either :(
Any ideas how I can achieve this? Thanks.
I recently had the same issue. It turns out the change occurred somewhere around 6.4 during refactoring and while accessing _source is strongly discouraged, it looks like people are still using / wanting to use it.
Here's a workaround taking advantage of the include_in_root parameter.
Adjust your mapping
PUT product
{
"mappings": {
"properties": {
"warehouses": {
"type": "nested",
"include_in_root": true, <--
"properties": {
"stock_level": {
"type": "integer"
}
}
}
}
}
}
Drop & reindex
Reconstruct the individual warehouse items in a for loop while accessing the flattened values:
GET product/_search
{
"query": {
"bool": {
"filter": [
{
"script": {
"script": {
"source": """
int total = 0;
def ids = doc['warehouses.id'];
def levels = doc['warehouses.stock_level'];
for (def i = 0; i < ids.length; i++) {
def warehouse = ['id':ids[i], 'stock_level':levels[i]];
if (params.warehouse_ids == null || params.warehouse_ids.contains(warehouse.id)) {
total += warehouse.stock_level;
}
}
boolean gte = true;
boolean lte = true;
if (params.gte != null) {
gte = (total >= params.gte);
}
if (params.lte != null) {
lte = (total <= params.lte);
}
return (gte && lte);
""",
"lang": "painless",
"params": {
"gte": 4
}
}
}
}
]
}
}
}
Be aware that this approach assumes that all warehouses include a non-null id and stock level.

Elasticsearch partial update based on Aggregation result

I want to partially update all documents based on an aggregation result.
Here is my object:
{
"name": "name",
"identificationHash": "aslkdakldjka",
"isDupe": false,
...
}
My goal is to set isDupe to true for all documents whose identificationHash occurs at least twice.
Currently what I'm doing is:
1. I get all the documents where isDupe = false, with a terms aggregation on identificationHash and a min_doc_count of 2:
{
"query": {
"bool": {
"must": [
{
"term": {
"isDupe": {
"value": false,
"boost": 1
}
}
}
]
}
},
"aggregations": {
"identificationHashCount": {
"terms": {
"field": "identificationHash",
"size": 10000,
"min_doc_count": 2
}
}
}
}
2. With the aggregation result, I do a bulk update with a script that sets ctx._source.isDupe = true for all identificationHash values matching the aggregation result.
3. I repeat steps 1 and 2 until the aggregation query returns no more results.
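For reference, a rough sketch of what the update in step 2 looks like on my side; the index name and the hash values below are just placeholders:
POST my-index/_update_by_query
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "isDupe": false } },
        { "terms": { "identificationHash": [ "aslkdakldjka", "some-other-hash" ] } }
      ]
    }
  },
  "script": {
    "source": "ctx._source.isDupe = true",
    "lang": "painless"
  }
}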
My question is: is there a better solution to this problem? Can I do the same thing with a single scripted query, without looping over batches of 1000 identification hashes?
There's no solution that I know of that allows you to do this in one shot. However, there's a way to do it in two steps, without having to iterate over several batches of hashes.
The idea is to first identify all the hashes to be updated using a feature called Transforms, which is essentially a feature that leverages aggregations and builds a new index out of the aggregation results.
Once that new index has been created by your transform, you can use it as a terms lookup mechanism to run your update by query and update the isDupe boolean for all documents having a matching hash.
So, first, we want to create a transform that will create a new index featuring documents containing all duplicate hashes that need to be updated. This is achieved using a scripted_metric aggregation whose job is to identify all hashes occurring at least twice for which isDupe is false. We're also aggregating by week, so for each week there's going to be a document containing all the duplicate hashes for that week.
PUT _transform/dup-transform
{
"source": {
"index": "test-index",
"query": {
"term": {
"isDupe": "false"
}
}
},
"dest": {
"index": "test-dups",
"pipeline": "set-id"
},
"pivot": {
"group_by": {
"week": {
"date_histogram": {
"field": "lastModifiedDate",
"calendar_interval": "week"
}
}
},
"aggregations": {
"dups": {
"scripted_metric": {
"init_script": """
state.week = -1;
state.hashes = [:];
""",
"map_script": """
// gather all hashes from each shard and count them
def hash = doc['identificationHash.keyword'].value;
// set week
state.week = doc['lastModifiedDate'].value.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR).toString();
// initialize hashes
if (!state.hashes.containsKey(hash)) {
state.hashes[hash] = 0;
}
// increment hash
state.hashes[hash] += 1;
""",
"combine_script": "return state",
"reduce_script": """
def hashes = [:];
def week = -1;
// group the hash counts from each shard and add them up
for (state in states) {
if (state == null) return null;
week = state.week;
for (hash in state.hashes.keySet()) {
if (!hashes.containsKey(hash)) {
hashes[hash] = 0;
}
hashes[hash] += state.hashes[hash];
}
}
// only return the hashes occurring at least twice
return [
'week': week,
'hashes': hashes.keySet().stream().filter(hash -> hashes[hash] >= 2)
.collect(Collectors.toList())
]
"""
}
}
}
}
}
Before running the transform, we need to create the set-id pipeline (referenced in the dest section of the transform) that will define the ID of the target document that is going to contain the hashes so that we can reference it in the terms query for updating documents:
PUT _ingest/pipeline/set-id
{
"processors": [
{
"set": {
"field": "_id",
"value": "{{dups.week}}"
}
}
]
}
We're now ready to start the transform to generate the list of hashes to update and it's as simple as running this:
POST _transform/dup-transform/_start
When it has run, the destination index test-dups will contain one document that looks like this:
{
"_index" : "test-dups",
"_type" : "_doc",
"_id" : "44",
"_score" : 1.0,
"_source" : {
"week" : "2021-11-01T00:00:00.000Z",
"dups" : {
"week" : "44",
"hashes" : [
"12345"
]
}
}
},
Finally, we can run the update by query as follows (add as many terms queries as there are weekly documents in the target index):
POST test/_update_by_query
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{
"terms": {
"identificationHash": {
"index": "test-dups",
"id": "44",
"path": "dups.hashes"
}
}
},
{
"terms": {
"identificationHash": {
"index": "test-dups",
"id": "45",
"path": "dups.hashes"
}
}
}
]
}
},
"script": {
"source": "ctx._source.isDupe = true;"
}
}
That's it in two simple steps!! Try it out and let me know.

Elasticsearch average over date histogram buckets

I've got a bunch of documents indexed in ElasticSearch, and I need to get the following data:
For each month, get the average number of documents per working day of the month (or if impossible, use 20 days as the default).
I already aggregated my data into monthly buckets using the date histogram aggregation. I tried to nest a stats bucket, but this aggregation uses data extracted from the document's field, not from the parent bucket.
Here is my query so far:
{
"query": {
"match_all": {}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
'???': '???'
}
}
}
}
edit
To make my question clearer, what I need is:
Get the total of numbers of documents created for the month (which is already done thanks to the date_histogram aggregation)
Get the number of working days for the month
Divide the first by the second.
For anyone still interested, you can now do this with the avg_bucket aggregation. It's still a bit tricky because you cannot simply run avg_bucket on a date_histogram aggregation result, but with a secondary value_count aggregation on some unique value it works fine :)
{
"size": 0,
"aggs": {
"orders_per_day": {
"date_histogram": {
"field": "orderedDate",
"interval": "day"
},
"aggs": {
"amount": {
"value_count": {
"field": "dateCreated"
}
}
}
},
"avg_daily_order": {
"avg_bucket": {
"buckets_path": "orders_per_day>amount"
}
}
}
}
There is a pretty convoluted and not really performant solution, using the following scripted_metric aggregation.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"avg_doc_per_biz_day": {
"scripted_metric": {
"init_script": "_agg.bizdays = []; _agg.allbizdays = [:]; start = new DateTime(1970, 1, 1, 0, 0); now = new DateTime(); while (start < now) { def end = start.plusMonths(1); _agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 }; start = end; }",
"map_script": "_agg.bizdays << _agg.allbizdays[doc. created_date.date.year+'_'+doc. created_date.date.monthOfYear]",
"combine_script": "_agg.allbizdays = null; doc_count = 0; for (d in _agg.bizdays){ doc_count++ }; return doc_count / _agg.bizdays[0]",
"reduce_script": "res = 0; for (a in _aggs) { res += a }; return res"
}
}
}
}
}
}
Let's detail each script below.
What I'm doing in init_script is creating a map of the number of business days for each month since 1970 and storing that in the _agg.allbizdays map.
_agg.bizdays = [];
_agg.allbizdays = [:];
start = new DateTime(1970, 1, 1, 0, 0);
now = new DateTime();
while (start < now) {
def end = start.plusMonths(1);
_agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 };
start = end;
}
In map_script, I'm simply retrieving the number of weekdays for the month of each document:
_agg.bizdays << _agg.allbizdays[doc.created_date.date.year + '_' + doc.created_date.date.monthOfYear];
In combine_script, I'm computing the average doc count per business day on each shard:
_agg.allbizdays = null;
doc_count = 0;
for (d in _agg.bizdays){ doc_count++ };
return doc_count / _agg.bizdays[0];
And finally in reduce_script, I'm summing up the average doc count for each node:
res = 0;
for (a in _aggs) { res += a };
return res
Again, I think it's pretty convoluted and, as Andrei rightly said, it is probably better to wait for 2.0 to make it work the way it should, but in the meantime you have this solution if you need it.
What you basically need is something like this (which doesn't work, as it's not an available feature):
{
"query": {
"match_all": {}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"average": {
"avg": {
"script": "doc_count / 20"
}
}
}
}
}
}
It doesn't work because there is no way of accessing the doc_count from the "parent" aggregation.
But, this will be possible in the 2.x branch of Elasticsearch and, at the moment, it's being actively developed: https://github.com/elastic/elasticsearch/issues/8110
This new feature will add a second layer of manipulation over the results (buckets) of an aggregation, and it covers not only your use case but many others as well.
Unless you want to try some ideas out there or perform your own calculations in your app, you need to wait for this feature.
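(For readers on a later release where pipeline aggregations did land, the originally wished-for computation can be sketched with bucket_script; the fixed 20 working days is an assumption, and the _count buckets path refers to each month's document count. Painless syntax shown:)
{
  "query": { "match_all": {} },
  "aggs": {
    "docs_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_per_working_day": {
          "bucket_script": {
            "buckets_path": { "count": "_count" },
            "script": "params.count / 20.0"
          }
        }
      }
    }
  }
}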
You want to exclude documents with a timestamp on Saturday or Sunday, so you can exclude those documents in your query using a script:
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "doc['#timestamp'].date.dayOfWeek != 7 && doc['#timestamp'].date.dayOfWeek != 6"
}
}
}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"docs_per_day": {
"date_histogram": {
"field": "created_date",
"interval": "day",
"min_doc_count": 0
},
"aggs": {
"docs_count": {
"avg": {
"field": ""
}
}
}
}
}
}
}
}
You may not need the first aggregation by month, since you already have this information using day interval
BTW you need to make sure dynamic scripting is enabled by adding this to your elasticsearch.yml configuration
script.disable_dynamic: false
Or add a groovy script under /config/scripts and use a filtered query with a script in filter
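A rough sketch of that second option, with a hypothetical file name, and assuming the legacy script_file parameter that the 1.x releases used for file-based scripts (double-check the exact parameter name against your version's docs). First the script file:
// config/scripts/weekday_only.groovy (hypothetical name)
doc['#timestamp'].date.dayOfWeek != 7 && doc['#timestamp'].date.dayOfWeek != 6
Then the filter references it by name:
{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script_file": "weekday_only"
        }
      }
    }
  }
}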

How to get the latest value for the bucket in Elasticsearch?

I have a bunch of documents with just a count field.
I'm trying to get the latest value for that field aggregated by date:
{
"query": {
"match_all": {}
},
"sort": "_timestamp",
"aggs": {
"result": {
"date_histogram": {
"field": "_timestamp",
"interval": "day",
"min_doc_count": 0
},
"aggs": {
"last_value": {
"scripted_metric": {
"params": {
"_agg": {
"last_value": 0
}
},
"map_script": "_agg.last_value = doc['count'].value",
"reduce_script": "return _aggs.last().last_value"
}
}
}
}
}
}
But the problem here is that documents don't arrive at the last_value aggregation sorted by _timestamp, so I can't guarantee that the last value is really the latest one.
So, my questions:
Is it possible to sort data by _timestamp when performing last_value aggregation?
Is there any better way to get the last value aggregated by day?
Looks like it is possible to tune scripted_metric aggregations a little bit to solve the first part of the question (sorting by _timestamp):
"last_value": {
"scripted_metric": {
"params": {
"_agg": {
"value": 0,
"timestamp": 0
}
},
"map_script": "_agg.value = doc['count'].value; _agg.timestamp = doc['_timestamp'].value",
"reduce_script": "value = 0; timestamp=0; for (a in _aggs) { if(a.timestamp > timestamp){ value = a.value; timestamp = a.timestamp} }; return value;"
}
}
But I still doubt that this is the best way to solve it.
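A less script-heavy alternative worth considering, assuming a release where the top_hits aggregation is available (1.3+), is to simply take the single newest document in each day bucket and read its count field from there:
{
  "size": 0,
  "aggs": {
    "result": {
      "date_histogram": {
        "field": "_timestamp",
        "interval": "day",
        "min_doc_count": 0
      },
      "aggs": {
        "last_doc": {
          "top_hits": {
            "size": 1,
            "sort": [
              { "_timestamp": { "order": "desc" } }
            ],
            "_source": [ "count" ]
          }
        }
      }
    }
  }
}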
