Bucket aggregation that doesn't depend on the time range in Elasticsearch - elasticsearch

I'm using Elasticsearch 7.9.3 to query time series data metrics which are stored in a form of:
{
"timestamp": <long>,
"name" : <string - metric name>,
"value" : <float>
}
I want to show this data in our UI widgets however the query might bring way too much data for the widget so I went with bucket aggregation that will calculate the average value per bucket and will bring the "calculated" representatives from the time series. Here is a slightly simplified query of what I'm doing
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"name": "METRICS_NAME_COMES_HERE"
}
},
{
"range": {
"timestamp": {
"gte": {{from}},
"lt": {{to}}
}
}
}
]
}
},
"aggs": {
"primary-agg": {
"date_histogram": {
"field": "timestamp",
"fixed_interval": "{{bucket_size}}ms",
"min_doc_count" : 1,
"offset": "{{offset_in_ms}}ms"
},
"aggs": {
"average-value": {
"avg": {
"field": "value"
}
}
}
}
}
}
Now when the time range changes (we have a kibana-like time picker in our ui widget that allows to change the time range translated to 'from'/'to' in the query), the bucket data gets recalculated and it may bring to significant data discrepancy shown in UI.
For example if from UI I see a "spike" of data, and zoom (thus narrowing down the search period) the spike is preserved but the actual values of the "representatives" are changed significantly.
So my question is what are the best practices to create a query that produces the fixed number of results (therefor I understand that I need some kind of aggregation) but the values are not affected by the range changes?

Related

How to sum the size of documents within a time interval?

I'm attempting to estimate the sum of size of n documents across an index using below query :
GET /events/_search
{
"query": {
"bool":{
"must": [
{"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
]
}
},
"aggs": {
"total_size": {
"sum": {
"field": "doc['_source'].bytes"
}
}
}
}
This returns documents but the size of the aggregation is 0 :
"aggregations" : {
"total_size" : {
"value" : 0.0
}
}
How to sum the size of documents within a time interval ?
The best way to achieve what you want is to actually add another field that contains the real source size at indexing time.
However, if you want to run it once to see how it looks like, you can leverage runtime fields to compute this at search time, just know that it can put a heavy burden on your cluster. Since the Painless scripting language doesn't yet provide a way to transform the source document to the same JSON you sent at indexing time, we can only approximate the value you're looking for by stringifying the _source Hashmap, yielding this:
GET /events/_search
{
"runtime_mappings": {
"source.size": {
"type": "double",
"script": """
def size = params._source.toString().length() * 8;
emit(size);
"""
}
},
"query": {
"bool":{
"must": [
{"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
]
}
},
"aggs": {
"size": {
"sum": {
"field": "source.size"
}
}
}
}
Another way is to install the Mapper size plugin so that you can make use of the _size field computed at indexing time.

Which is the most effective way to get all the results of aggregation

I have the following query:
GET my-index-*/my-type/_search
{
"size": 0,
"aggregations": {
"my_agg": {
"terms": {
"script" : "code"
},
"aggs": {
"dates": {
"date_range": {
"field": "created_time",
"ranges": [
{
"from": "2017-12-09T00:00:00.000",
"to": "2017-12-09T16:00:00.000"
},
{
"from": "2017-12-10T00:00:00.000",
"to": "2017-12-10T16:00:00.000"
}
]
}
},
"total_count": {
"sum_bucket": {
"buckets_path": "dates._count"
}
},
"bucket_filter": {
"bucket_selector": {
"buckets_path": {
"totalCount": "total_count"
},
"script": "params.totalCount == 0"
}
}
}
}
}
}
The result of this query is a bunch of buckets. What I need is the list of keys of my buckets. The problem is the aggregation result size is 10 by default, after getting those 10, my bucket_filter filters them by total count, and I get only some of those 10. I need to have all the results, which means I need to specify "size" = n, where n is the distinct count of code values, so that I don't lose any data. I have billions of documents, so in my case n is about 30.000. When I tried executing the query, "Out of memory" occurred on cluster, so I guess it's not the best idea. Is there a good way to get all the results for my query?
Unfortunately this is not recommended for high carnality fields with 30K unique values. The reason is because of memory cost and the large amount of data it needs to collect from the shards as you've discovered. It might work, but then you need more memory...
A more efficient solution is to use the Scroll API and specify in fields in your search request the values you want to retrieve from a field, and then store these values either in your client in-memory or stream it.
Update: since ES 6.5 this has been possible with Composite aggregations, see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html

Filter/aggregate one elasticsearch index of time series data by timestamps found in another index

The Data
So I have reams of different types of time series data. Currently i've chosen to put each type of data into their own index because with the exception of 4 fields, all of the data is very different. Also the data is sampled at different rates and are not guaranteed to have common timestamps across the same sub-second window so fusing them all into one large document is also not a trivial task.
The Goal
One of our common use cases that i'm trying to see if I can solve entirely in Elasticsearch is to return an aggregation result of one index based on the time windows returned from a query of another index. Pictorially:
This is what I want to accomplish.
Some Considerations
For small enough signal transitions on the "condition" data, I can just use a date histogram and some combination of a top hits sub aggregation, but this quickly breaks down when I have 10,000's or 100,000's of occurrences of "the condition". Further this is just one "case", I have 100's of sets of similar situations that i'd like to get the overall min/max from.
The comparisons are basically amongst what I would consider to be sibling level documents or indices, so there doesn't seem to be any obvious parent->child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution instead of brute force building the date ranges outside of Elasticsearch with the results of one query and feeding 100's of time ranges into another query.
Looking through the documentation it feels like some combination of Elasticsearch scripting and some of the pipelined aggregations are going to be what i want, but no definitive solutions are jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.
I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but i'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of opportunity for improvement and optimization and if I discover such a solution (likely through a scripted aggregation) i'll come back and update my solution.
It may not be the optimal solution but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregators.
The "solution"
def time_edges(index, must_terms=[], should_terms=[], filter_terms=[], data_sample_accuracy_window=200):
"""
Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.
index: the Elasticsearch index to search
terms: a list of dictionaries of form { "term": { "<termname>": <value>}}
"""
query = {
"size": 0,
"timeout": "5s",
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
"aggs": {
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
}
}
}
}
}
return es.search(index=index, body=query)
Breaking things down
Get filter the results by 'Index 2'
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms is the required value to be able to get all the results for "the condition" stored in "Index 2".
For example, to limit results to only the last 10 days and when condition is the value 10 or 12 we add the following must_terms
must_terms = [
{
"range": {
"#timestamp": {
"gte": "now-10d",
"lte": "now"
}
}
},
{
"terms": {"condition": [10, 12]}
}
]
This returns a reduced set of documents that we can then pass on into our aggregations to figure out where our "samples" are.
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their id and then "break up" all the occurences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
}
You can get the rising edge of the first occurence and the falling edge of the last occurence using the top_hits aggregation
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on a timestamp. This breaks up your returned results into buckets for every unique timestamp. This is a costly aggregation, but worth it. Using the inline script allows us to use the timestamp value for the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default the histogram aggregation returns a set of buckets with the document count for each bucket, but we need a value. This is what is required for serial_diff aggregation to work, so we have to do a token max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two bucket are approximately adjacent. We then discard samples that are adjacent to eachother and create a combined time range for our condition by using the bucket_selector aggregation. This will throw out buckets that are smaller than our data_sample_accuracy_window. This value is dependent on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for us to determine how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal so the falling edge is unknown without some post-processing. We use the timestampDiff value to figure out where the falling edge is.

Elasticsearch - calculate percentage in nested aggregations in relation to parent bucket

Updated question
In my query I aggregate on date and then on sensor name. It is possible to calculate a ratio from a nested aggregation and the total count of documents (or any other aggregation) of the parent bucket? Example query:
{
"size": 0,
"aggs": {
"over_time": {
"aggs": {
"by_date": {
"date_histogram": {
"field": "date",
"interval": "1d",
"min_doc_count": 0
},
"aggs": {
"measure_count": {
"cardinality": {
"field": "date"
}
},
"all_count": {
"value_count": {
"field": "name"
}
},
"by_name": {
"terms": {
"field": "name",
"size": 0
},
"aggs": {
"count_by_name": {
"value_count": {
"field": "name"
}
},
"my ratio": count_by_name / all_count * 100 <-- How to do that?
}
}
}
}
}
}
}
}
I want a custom metric that gives me the ratio count_by_name / all_count * 100. Is that possible in ES, or do I have to compute that on the client?
This seems very simple to me, but I haven't found a way yet.
Old post:
Is there a way to let Elasticsearch consider the overall count of documents (or any other metric) when calculating the average for a bucket?
Example:
I have like 100000 sensors that generate events on different times. Every event is indexed as a document that has a timestamp and a value.
When I want to calculate a ratio of the value and a date histogram, and some sensors only generated values at one time, I want Elasticsearch to treat the not existing values(documents) for my sensors as 0 instead of null.
So when aggregating by day and a sensor only has generated two values at 10pm (3) and 11pm (5), the aggregate for the day should be (3+5)/24, or formal: SUM(VALUE)/24.
Instead, Elasticsearch calculates the average like (3+5)/2, which is not correct in my case.
There was once a ticket on Github https://github.com/elastic/elasticsearch/issues/9745, but the answer was "handle it in your application". That's no answer for me, as I would have to generate zillions of zero-Value documents for every sensor/time combination to get the average ratio right.
Any ideas on this?
If this is the case , simply divide the results by 24 from application side.And when granularity change , change this value accordingly. Number of hours per day is fixed right ....
You can use the Bucket script aggregation to do what you want.
{
"bucket_script": {
"buckets_path": {
"count_by_name": "count_by_name",
"all_count": "all_count"
},
"script": "count_by_name / all_count*100"
}
}
It's just an example.
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-pipeline-bucket-script-aggregation.html

Concurrent events aggregation in ElasticSearch

I have a number of documents representing events with starts_at and ends_at fields. At a given point in time, an event is considered active, if the point in question is after starts_at and before ends_at.
I'm looking for an aggregation, which should result in a date histogram, where each bucket contains the number of active events in that interval.
So far, the best approximation I have found is to create a set of buckets counting the number of starts in each interval, as well as a corresponding set of buckets counting the number of ends, and then postprocessing them by subtracting the number of starts from the number of ends for each interval:
{
"size": "0",
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"and": [
{
"term": {
"_type": "event"
}
},
{
"range": {
"starts_at": {
"gte": "2015-06-14T05:25:03Z",
"lte": "2015-06-21T05:25:03Z"
}
}
}
]
}
}
},
"aggs": {
"starts": {
"date_histogram": {
"field": "starts_at",
"interval": "15m",
"extended_bounds": {
"max": "2015-06-21T05:25:04Z",
"min": "2015-06-14T05:25:04Z"
},
"min_doc_count": 0
}
},
"ends": {
"date_histogram": {
"field": "ends_at",
"interval": "15m",
"extended_bounds": {
"max": "2015-06-21T05:25:04Z",
"min": "2015-06-14T05:25:04Z"
},
"min_doc_count": 0
}
}
}
}
I'm looking for something like this solution.
Is there a way to achieve that with a single query?
I'm not 100% sure but up-coming pipeline aggregations might solve this problem in near-future in a more elegant way.
Meanwhile you could choose the desired time resolution and at index time in addition to starts_at and ends_at fields you would also generate active_at field. It would be an array of time stamps and you could use either terms (if it is mapped as not_analyzed string) or date_histogram aggregation to get the correct "active events count" for each time-bucket.
The down-side is inflated storage requirements and possibly worse performance since there are more field values to aggregate over. Anyway it shouldn't be too bad if you don't choose a too high time resolution like 1 minute.

Resources