How to get the latest value for the bucket in Elasticsearch?

I have a bunch of documents with just a count field.
I'm trying to get the latest value for that field aggregated by date:
{
"query": {
"match_all": {}
},
"sort": "_timestamp",
"aggs": {
"result": {
"date_histogram": {
"field": "_timestamp",
"interval": "day",
"min_doc_count": 0
},
"aggs": {
"last_value": {
"scripted_metric": {
"params": {
"_agg": {
"last_value": 0
}
},
"map_script": "_agg.last_value = doc['count'].value",
"reduce_script": "return _aggs.last().last_value"
}
}
}
}
}
}
But the problem here is that documents don't arrive in the last_value aggregation sorted by _timestamp, so I can't guarantee that the last value really is the last value.
So, my questions:
Is it possible to sort data by _timestamp when performing last_value aggregation?
Is there any better way to get the last value aggregated by day?

Looks like it is possible to tune the scripted_metric aggregation a little bit to solve the first part of the question (sorting by _timestamp):
"last_value": {
"scripted_metric": {
"params": {
"_agg": {
"value": 0,
"timestamp": 0
}
},
"map_script": "_agg.value = doc['count'].value; _agg.timestamp = doc['_timestamp'].value",
"reduce_script": "value = 0; timestamp=0; for (a in _aggs) { if(a.timestamp > timestamp){ value = a.value; timestamp = a.timestamp} }; return value;"
}
}
But I still doubt that this is the best way to solve it.
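For what it's worth, a possibly simpler alternative (untested sketch; it assumes the _timestamp field is stored so it can be sorted on) would be a top_hits sub-aggregation that keeps only the newest document per day:
{
"size": 0,
"aggs": {
"result": {
"date_histogram": {
"field": "_timestamp",
"interval": "day",
"min_doc_count": 0
},
"aggs": {
"last_value": {
// sketch: keep only the newest document per bucket
"top_hits": {
"sort": [ { "_timestamp": { "order": "desc" } } ],
"size": 1,
"_source": [ "count" ]
}
}
}
}
}
}
The count field of the single hit returned in each bucket would then be the latest value for that day.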

Related

How to calculate & draw metrics of scripted metrics in elasticsearch

Problem description
We have log files from different devices parsed into our Elasticsearch database, line by line. The log files are built as a ring buffer, so they always have a fixed size of 1000 lines. They can be manually exported whenever needed. After import and parsing in Elasticsearch, each document represents a single line of a log file with the following information:
DeviceID: 12345
FileType: ErrorLog
FileTimestamp: 2022-05-10 01:23:45
LogTimestamp: 2022-05-05 01:23:45
LogMessage: something very important here
Now I want to have a statistic on the timespan that is usually covered by that fixed number of lines, because depending on how intensively a device is used, a varying number of log entries is generated and the files can cover anything from just a few days to several months... But since the log files are split into individual lines, this is not that trivial (I suppose).
My goal is to have a chart that shows me a "histogram" of the different log file timespans...
First Try: Visualize library > Data table
I started by creating a Data table in the Visualize library where I was able to aggregate the data as follows:
I added 3 Buckets --> so I have all lines bucketed by their original file:
Split rows DeviceID.keyword
Split rows FileType.keyword
Split rows FileTimestamp
... and 2 Metrics --> to show the log file timespan (I couldn't find a way to create a max-min metric, so I started with individual metrics for max and min):
Metric Min LogTimeStamp
Metric Max LogTimeStamp
This results in the following query:
{
"aggs": {
"2": {
"terms": {
"field": "DeviceID.keyword",
"order": {
"_key": "desc"
},
"size": 100
},
"aggs": {
"3": {
"terms": {
"field": "FileType.keyword",
"order": {
"_key": "desc"
},
"size": 5
},
"aggs": {
"4": {
"terms": {
"field": "FileTimestamp",
"order": {
"_key": "desc"
},
"size": 100
},
"aggs": {
"1": {
"min": {
"field": "LogTimeStamp"
}
},
"5": {
"max": {
"field": "LogTimeStamp"
}
}
}
}
}
}
}
}
},
"size": 0,
...
}
... and this output:
DeviceID FileType FileTimestamp Min LogTimestamp Max LogTimestamp
---------------------------------------------------------------------------------------------
12345 ErrorLog 2022-05-10 01:23:45 2022-04-10 01:23:45 2022-05-10 01:23:45
...
Looks good so far! The expected result would be exactly 1 month for this example.
But my research showed that it is not possible to add the desired metric here, so I needed to try something else...
Second Try: Visualize library > Custom visualization (Vega-Lite)
So I did some more research and found out that Vega might be a possibility. I was already able to transfer the bucket part from the first attempt, and I also added a scripted metric to automatically calculate the timespan (instead of min & max). So far, so good. The request body looks as follows:
body: {
"aggs": {
"DeviceID": {
"terms": { "field": "DeviceID.keyword" },
"aggs": {
"FileType": {
"terms": { "field": "FileType.keyword" } ,
"aggs": {
"FileTimestamp": {
"terms": { "field": "FileTimestamp" } ,
"aggs": {
"timespan": {
"scripted_metric": {
"init_script": "state.values = [];",
"map_script": "state.values.add(doc['#timestamp'].value);",
"combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
"reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
}
}
}
}
}
}
}
}
},
"size": 0,
}
...with this response (unnecessary information removed to reduce complexity):
{
"took": 12245,
"timed_out": false,
"_shards": { ... },
"hits": { ... },
"aggregations": {
"DeviceID": {
"buckets": [
{
"key": "12345",
"FileType": {
"buckets": [
{
"key": "ErrorLog",
"FileTimeStamp": {
"buckets": [
{
"key": 1638447972000,
"key_as_string": "2021-12-02T12:26:12.000Z",
"doc_count": 1000,
"timespan": {
"value": 31339243240
}
},
{
"key": 1636023881000,
"key_as_string": "2021-11-04T11:04:41.000Z",
"doc_count": 1000,
"timespan": {
"value": 31339243240
}
}
]
}
},
{
"key": "InfoLog",
"FileTimeStamp": {
"buckets": [
{
"key": 1635773438000,
"key_as_string": "2021-11-01T13:30:38.000Z",
"doc_count": 1000,
"timespan": {
"value": 2793365000
}
},
{
"key": 1636023881000,
"key_as_string": "2021-11-04T11:04:41.000Z",
"doc_count": 1000,
"timespan": {
"value": 2643772000
}
}
]
}
}
]
}
},
{
"key": "12346",
"FileType": {
...
}
},
...
]
}
}
}
Yeah, it seems to work! Now I have the timespan for each original log file.
Question
Now I am stuck with:
I want to average the timespans of the original log files (each identified via the combination of DeviceID + FileType + FileTimestamp), to prevent devices with multiple imported log files from having a higher weight than devices with only one imported log file. I tried to add another aggregation for the average, but I couldn't figure out where to put it so that the result of the scripted_metric is used. My closest attempt was to put an avg_bucket after the FileTimestamp bucket:
Request:
body: {
"aggs": {
"DeviceID": {
"terms": { "field": "DeviceID.keyword" },
"aggs": {
"FileType": {
"terms": { "field": "FileType.keyword" } ,
"aggs": {
"FileTimestamp": {
"terms": { "field": "FileTimestamp" } ,
"aggs": {
"timespan": {
"scripted_metric": {
"init_script": "state.values = [];",
"map_script": "state.values.add(doc['FileTimestamp'].value);",
"combine_script": "long min = Long.MAX_VALUE; long max = 0; for (t in state.values) { long tms = t.toInstant().toEpochMilli(); if(tms > max) max = tms; if(tms < min) min = tms; } return [max,min];",
"reduce_script": "long min = Long.MAX_VALUE; long max = 0; for (a in states) { if(a[0] > max) max = a[0]; if(a[1] < min) min = a[1]; } return max-min;"
}
}
}
},
// new part - start
"avg_timespan": {
"avg_bucket": {
"buckets_path": "FileTimestamp>timespan"
}
}
// new part - end
}
}
}
}
},
"size": 0,
}
But I receive the following error:
EsError: buckets_path must reference either a number value or a single value numeric metric aggregation, got: [InternalScriptedMetric] at aggregation [timespan]
So is this the right spot (just not applicable to a scripted metric)? Or am I on the wrong path?
I need to plot all this, but I can't find my way through all the buckets, etc.
I read about flattening (which would probably be a good idea, so that, if done by the server, the result would not be that complex), but I don't know where and how to apply the flattening transformation.
I imagine the resulting chart like this:
x-axis = log file timespan, where the timespan is "binned" according to a given step size (e.g. 1 day), so there is only one bar per bin (1 = 0-1 days, 2 = 1-2 days, 3 = 2-3 days, etc.) and not one for every distinct log file timespan
y-axis = count of devices
type: lines or vertical bars, split by file type
e.g. something like this:
Any help is really appreciated! Thanks in advance!
If you have the privileges to create a transform, then the Elastic Painless example "Getting duration by using bucket script" can do exactly what you want. It creates a new index where all documents are grouped according to your needs.
To create the transform:
go to Stack Management > Transforms > + Create a transform
select Edit JSON config for the Pivot configuration object
paste & apply the JSON below
check whether the result is the expected in the Transform preview
fill out the rest of the transform details + save the transform
JSON config
{
"group_by": {
"DeviceID": {
"terms": {
"field": "DeviceID.keyword"
}
},
"FileType": {
"terms": {
"field": "FileType.keyword"
}
},
"FileTimestamp": {
"terms": {
"field": "FileTimestamp"
}
}
},
"aggregations": {
"TimeStampStats": {
"stats": {
"field": "#timestamp"
}
},
"TimeSpan": {
"bucket_script": {
"buckets_path": {
"first": "TimeStampStats.min",
"last": "TimeStampStats.max"
},
"script": "params.last - params.first"
}
}
}
}
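If you prefer the API over the Kibana UI, the same pivot can presumably be wrapped in a transform creation request like the one below; the transform name and the index names are placeholders, so adjust them to your setup:
PUT _transform/logfile-timespan
{
"source": {
"index": "my-log-index" // placeholder: your source index
},
"dest": {
"index": "logfile-timespans" // placeholder: the new index to create
},
"pivot": {
"group_by": {
"DeviceID": { "terms": { "field": "DeviceID.keyword" } },
"FileType": { "terms": { "field": "FileType.keyword" } },
"FileTimestamp": { "terms": { "field": "FileTimestamp" } }
},
"aggregations": {
"TimeStampStats": { "stats": { "field": "#timestamp" } },
"TimeSpan": {
"bucket_script": {
"buckets_path": { "first": "TimeStampStats.min", "last": "TimeStampStats.max" },
"script": "params.last - params.first"
}
}
}
}
}
After creating it, you would still need to start it (e.g. with POST _transform/logfile-timespan/_start) before the destination index is populated.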
Now you can create a chart from the new index, for example with these settings:
Vertical Bars
Metrics:
Y-axis = "Count"
Buckets:
X-axis = "TimeSpan"
Split series = "FileType"

Compute percentile with collapsing by user

Let's say I have an index where I store a million tweets (original objects). I want to get the 90th percentile of users based on their number of followers.
I know there is the percentiles aggregation to do this, but my problem is that Elasticsearch uses all documents, so users who tweet a lot add noise to my calculation.
I want to isolate all unique users and then compute the 90th percentile.
The other constraint is that I want to do this in only one or two requests, to keep the response time under 500 ms.
I have tried a lot of things, and I was able to do this with scripted_metric, but when my dataset exceeds 100k tweets the performance degrades critically.
Any advice?
Additional info:
My index stores original tweets & retweets based on user search queries
The index is mapped with a dynamic template mapping (no problem with this)
The index contains approximately 100M documents
Unfortunately, the top_hits aggregation doesn't accept sub-aggs.
The request I'm trying to achieve is:
{
"collapse": {
"field": "user.id" <--- I want this effect on aggregation
},
"query": {
"bool": {
"must": [
{
"term": {
"metadatas.clientId": {
"value": projectId
}
}
},
{
"match": {
"metadatas.blacklisted": false
}
}
],
"filter": [
{
"range": {
"publishedAt": {
"gte": "now-90d/d"
}
}
}
]
}
},
"aggs":{
"twitter": {
"percentiles": {
"field": "user.followers_count",
"percents": [95]
}
}
},
"size": 0
}
Finally, I figured out a workaround.
In the percentiles aggregation, I can use a script. I use the params variable to hold unique keys, then return the preceding _score.
Without a complete explanation of the computation I cannot fine-tune the behavior of my script, but the result is good enough for me.
"aggs": {
"unique":{
"cardinality": {
"field": "collapse_profile"
}
},
"thresholds":{
"percentiles": {
"field": "user.followers_count",
"percents": [90],
"script": {
"source": """
if(params.keys == null){
params.keys = new HashMap();
}
def key = doc['user.id'].value;
def value = doc['user.followers_count'].value;
if(params.keys[key] == null){
params.keys[key] = _score;
return value;
}
return _score;
""",
"lang": "painless"
}
}
}
}
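For comparison, the "isolate all unique users first" idea from the question can also be sketched with a terms aggregation per user plus a percentiles_bucket pipeline. This is only a sketch: a terms size large enough to cover every unique user in ~100M documents may be prohibitively expensive, which is exactly the constraint that led to the workaround above.
"aggs": {
"per_user": {
"terms": {
"field": "user.id",
"size": 10000 // sketch value: must cover the number of unique users
},
"aggs": {
"followers": {
"max": {
"field": "user.followers_count"
}
}
}
},
"followers_90th": {
"percentiles_bucket": {
"buckets_path": "per_user>followers",
"percents": [90]
}
}
}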

Elasticsearch partial update based on Aggregation result

I want to partially update all documents based on an aggregation result.
Here is my object:
{
"name": "name",
"identificationHash": "aslkdakldjka",
"isDupe": false,
...
}
My goal is to set isDupe to true for all documents whose identificationHash is present more than once.
Currently what I'm doing is:
I get all the documents where isDupe = false, with a terms aggregation on identificationHash and a min_doc_count of 2.
{
"query": {
"bool": {
"must": [
{
"term": {
"isDupe": {
"value": false,
"boost": 1
}
}
}
]
}
},
"aggregations": {
"identificationHashCount": {
"terms": {
"field": "identificationHash",
"size": 10000,
"min_doc_count": 2
}
}
}
}
With the aggregation result, I do a bulk update with a script that sets ctx._source.isDupe = true for all documents whose identificationHash matches my aggregation result (a sketch of this step is shown below).
I repeat steps 1 and 2 until the aggregation query returns no more results.
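For reference, the bulk update in step 2 is roughly this kind of update by query (the index name and the hash values are placeholders):
POST my-index/_update_by_query
{
"query": {
"terms": {
"identificationHash": [ "hash-1", "hash-2", "hash-3" ]
}
},
"script": {
"source": "ctx._source.isDupe = true"
}
}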
My question is: is there a better solution to this problem? Can I do the same thing with one script query, without looping over batches of 1000 identification hashes?
There's no solution that I know of that allows you to do this in one shot. However, there's a way to do it in two steps, without having to iterate over several batches of hashes.
The idea is to first identify all the hashes to be updated using a feature called Transforms, which is nothing else than a feature that leverages aggregations and builds a new index out of the aggregation results.
Once that new index has been created by your transform, you can use it as a terms lookup mechanism to run your update by query and update the isDupe boolean for all documents having a matching hash.
So, first, we want to create a transform that will create a new index featuring documents containing all duplicate hashes that need to be updated. This is achieved using a scripted_metric aggregation whose job is to identify all hashes occurring at least twice for documents where isDupe: false. We're also aggregating by week, so for each week there's going to be a document containing all duplicate hashes for that week.
PUT _transform/dup-transform
{
"source": {
"index": "test-index",
"query": {
"term": {
"isDupe": "false"
}
}
},
"dest": {
"index": "test-dups",
"pipeline": "set-id"
},
"pivot": {
"group_by": {
"week": {
"date_histogram": {
"field": "lastModifiedDate",
"calendar_interval": "week"
}
}
},
"aggregations": {
"dups": {
"scripted_metric": {
"init_script": """
state.week = -1;
state.hashes = [:];
""",
"map_script": """
// gather all hashes from each shard and count them
def hash = doc['identificationHash.keyword'].value;
// set week
state.week = doc['lastModifiedDate'].value.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR).toString();
// initialize hashes
if (!state.hashes.containsKey(hash)) {
state.hashes[hash] = 0;
}
// increment hash
state.hashes[hash] += 1;
""",
"combine_script": "return state",
"reduce_script": """
def hashes = [:];
def week = -1;
// group the hash counts from each shard and add them up
for (state in states) {
if (state == null) return null;
week = state.week;
for (hash in state.hashes.keySet()) {
if (!hashes.containsKey(hash)) {
hashes[hash] = 0;
}
hashes[hash] += state.hashes[hash];
}
}
// only return the hashes occurring at least twice
return [
'week': week,
'hashes': hashes.keySet().stream().filter(hash -> hashes[hash] >= 2)
.collect(Collectors.toList())
]
"""
}
}
}
}
}
Before running the transform, we need to create the set-id pipeline (referenced in the dest section of the transform) that will define the ID of the target document that is going to contain the hashes so that we can reference it in the terms query for updating documents:
PUT _ingest/pipeline/set-id
{
"processors": [
{
"set": {
"field": "_id",
"value": "{{dups.week}}"
}
}
]
}
We're now ready to start the transform to generate the list of hashes to update and it's as simple as running this:
POST _transform/dup-transform/_start
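If you want to check whether the transform has finished before querying the destination index, the standard transform stats endpoint can be polled (nothing specific to this example):
GET _transform/dup-transform/_stats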
When it has run, the destination index test-dups will contain one document that looks like this:
{
"_index" : "test-dups",
"_type" : "_doc",
"_id" : "44",
"_score" : 1.0,
"_source" : {
"week" : "2021-11-01T00:00:00.000Z",
"dups" : {
"week" : "44",
"hashes" : [
"12345"
]
}
}
},
Finally, we can run the update by query as follows (add as many terms queries as there are weekly documents in the target index):
POST test/_update_by_query
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{
"terms": {
"identificationHash": {
"index": "test-dups",
"id": "44",
"path": "dups.hashes"
}
}
},
{
"terms": {
"identificationHash": {
"index": "test-dups",
"id": "45",
"path": "dups.hashes"
}
}
}
]
}
},
"script": {
"source": "ctx._source.isDupe = true;"
}
}
That's it in two simple steps!! Try it out and let me know.

Elasticsearch average over date histogram buckets

I've got a bunch of documents indexed in Elasticsearch, and I need to get the following data:
For each month, get the average number of documents per working day of the month (or, if that's impossible, use 20 days as the default).
I already aggregated my data into monthly buckets using the date_histogram aggregation. I tried to nest a stats bucket, but that aggregation uses data extracted from the documents' fields, not from the parent bucket.
Here is my query so far:
{
"query": {
"match_all": {}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
'???': '???'
}
}
}
}
edit
To make my question clearer, what I need is:
Get the total number of documents created in the month (which is already done thanks to the date_histogram aggregation)
Get the number of working days for the month
Divide the first by the second.
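A small worked example with made-up numbers: 440 documents created in a month that has 22 working days should yield 440 / 22 = 20 documents per working day.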
For anyone still interested, you can now do this with the avg_bucket aggregation. It's still a bit tricky, because you cannot simply run avg_bucket on a date_histogram aggregation result, but with a secondary value_count aggregation on some unique value it works fine :)
{
"size": 0,
"aggs": {
"orders_per_day": {
"date_histogram": {
"field": "orderedDate",
"interval": "day"
},
"aggs": {
"amount": {
"value_count": {
"field": "dateCreated"
}
}
}
},
"avg_daily_order": {
"avg_bucket": {
"buckets_path": "orders_per_day>amount"
}
}
}
}
There is a pretty convoluted and not really performant solution, using the following scripted_metric aggregation.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"avg_doc_per_biz_day": {
"scripted_metric": {
"init_script": "_agg.bizdays = []; _agg.allbizdays = [:]; start = new DateTime(1970, 1, 1, 0, 0); now = new DateTime(); while (start < now) { def end = start.plusMonths(1); _agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 }; start = end; }",
"map_script": "_agg.bizdays << _agg.allbizdays[doc. created_date.date.year+'_'+doc. created_date.date.monthOfYear]",
"combine_script": "_agg.allbizdays = null; doc_count = 0; for (d in _agg.bizdays){ doc_count++ }; return doc_count / _agg.bizdays[0]",
"reduce_script": "res = 0; for (a in _aggs) { res += a }; return res"
}
}
}
}
}
}
Let's detail each script below.
What I'm doing in init_script is creating a map of the number of business days for each month since 1970 and storing that in the _agg.allbizdays map.
_agg.bizdays = [];
_agg.allbizdays = [:];
start = new DateTime(1970, 1, 1, 0, 0);
now = new DateTime();
while (start < now) {
def end = start.plusMonths(1);
_agg.allbizdays[start.year + '_' + start.monthOfYear] = (start.toDate()..<end.toDate()).sum {(it.day != 6 && it.day != 0) ? 1 : 0 };
start = end;
}
In map_script, I'm simply retrieving the number of weekdays for the month of each document:
_agg.bizdays << _agg.allbizdays[doc.created_date.date.year + '_' + doc.created_date.date.monthOfYear];
In combine_script, I'm computing the average doc count per business day for each shard:
_agg.allbizdays = null;
doc_count = 0;
for (d in _agg.bizdays){ doc_count++ };
return doc_count / _agg.bizdays[0];
And finally, in reduce_script, I'm summing up the average doc counts from all shards:
res = 0;
for (a in _aggs) { res += a };
return res
Again, I think it's pretty convoluted and, as Andrei rightly said, it is probably better to wait for 2.0 to make it work the way it should, but in the meantime you have this solution if you need it.
What you basically need is something like this (which doesn't work, as it's not an available feature):
{
"query": {
"match_all": {}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"average": {
"avg": {
"script": "doc_count / 20"
}
}
}
}
}
}
It doesn't work because there is no way of accessing the doc_count from the "parent" aggregation.
But, this will be possible in the 2.x branch of Elasticsearch and, at the moment, it's being actively developed: https://github.com/elastic/elasticsearch/issues/8110
This new feature will add a second layer of manipulation over the results (buckets) of an aggregation, and it covers not only your use case but many others as well.
Unless you want to try some ideas out there or perform your own calculations in your app, you need to wait for this feature.
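For readers on 2.0 or later: a rough sketch of what this looks like with the bucket_script pipeline aggregation, using the fixed default of 20 working days mentioned in the question (the _count path refers to each bucket's document count):
{
"size": 0,
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"avg_per_workday": {
"bucket_script": {
"buckets_path": { "total": "_count" },
"script": "params.total / 20" // assumes a fixed 20 working days per month
}
}
}
}
}
}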
You want to exclude documents with a timestamp on Saturday or Sunday, so you can exclude those documents in your query using a script:
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "doc['#timestamp'].date.dayOfWeek != 7 && doc['#timestamp'].date.dayOfWeek != 6"
}
}
}
},
"aggs": {
"docs_per_month": {
"date_histogram": {
"field": "created_date",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"docs_per_day": {
"date_histogram": {
"field": "created_date",
"interval": "day",
"min_doc_count": 0
}
},
"aggs": {
"docs_count": {
"avg": {
"field": ""
}
}
}
}
}
}
}
You may not need the first aggregation by month, since you already have this information using the day interval.
BTW, you need to make sure dynamic scripting is enabled by adding this to your elasticsearch.yml configuration:
script.disable_dynamic: false
Or add a Groovy script under config/scripts and use a filtered query with a script filter.

Getting count and grouping by date range in elastic search

Is there a way to get the count of rows and group them by hour, day, or month?
For instance, assume I have the messages
_source{
"timestamp":"2013-10-01T12:30:25.421Z",
"amount":200
}
_source{
"timestamp":"2013-10-01T12:35:25.421Z",
"amount":300
}
_source{
"timestamp":"2013-10-02T13:53:25.421Z",
"amount":100
}
_source{
"timestamp":"2013-10-03T15:53:25.421Z",
"amount":400
}
Is there a way to get something along the lines of {date, sum} (not necessarily in this format, just wondering if there is any way I can achieve this)?
{
{"2013-10-01T12:00:00.000Z", 500},
{"2013-10-02T13:00:00.000Z", 100},
{"2013-10-03T15:00:00.000Z", 400}
}
Thank you
Try with aggregations.
{
"aggs": {
"amount_per_month": {
"date_histogram": {
"field": "timestamp",
"interval": "week"
},
"aggs": {
"total_amount": {
"sum": {
"field": "amount"
}
}
}
}
}
}
In addition, if you want to count the number of documents instead, replace the sum content with:
"sum": {
"script": "1"
}
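Note that each date_histogram bucket already returns a doc_count, so if you would rather avoid scripting, a value_count on a field that exists in every document (the timestamp field is assumed here) gives the same number:
"total_amount": {
"value_count": {
"field": "timestamp"
}
}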
Hope it helps.
I need a query to fetch data from Elasticsearch for the month-wise and year-wise counts of customers registered on our platform.
The queries below are working perfectly and returning the data correctly:
Here, CustOnboardedOn is the field recording when the customer was onboarded.
Method type: POST
URL: http://SomeIP:9200/customer/_search?size=0
ES query for the month-wise customer aggregation:
{
"aggs": {
"amount_per_month": {
"date_histogram": {
"field": "CustOnboardedOn",
"interval": "month"
}
}
}
}
ES query for the year-wise aggregation:
{
"aggs": {
"amount_per_month": {
"date_histogram": {
"field": "CustOnboardedOn",
"interval": "year"
}
}
}
}
