Pipeline aggregations in ElasticSearch 1.5 - elasticsearch

I'm wondering if it is, in any way, possible to make ES run aggregations on other aggregations all in the same query?
Basically, that's called pipelining.
I'm talking about ElasticSearch 1.5, yes I know, that's unfortunate but I'm stuck with AWS and that's what they're selling, I have to live with that.
I'm guessing that is not possible, so I'll write the next phase of the question right away.
Assuming I can query ES multiple times based on results from previous queries, how would you do the following:
Get a list of the top 100 tags, sorted by the number of appearances in the documents, in the past hour. (Each record has a tags field, and I'd like to know which tags are the most common.)
Having that, for each of the 100 tags, get the number of appearances split into 1-hour buckets (denote by Y the number for the last hour).
Then calculate by what percentage Y deviates from the average value of all the other 1-hour buckets.
Thank you for helping !!!

Basically, that's called pipelining.
No, that is not possible in 1.5: pipeline aggregations did not appear until Elasticsearch 2.0. For what it's worth, Elastic offers its own hosted Elasticsearch service, Elastic Cloud, which also runs on AWS.
... how would you do the following
The first two steps are more about narrowing scope (which documents are considered) than about computing on aggregated values.
{
"query": {
"filtered": {
"filter": {
"range" : {
"timestamp": {
"gte": "now-1h"
}
}
}
}
}
}
This will give you the last hour of data.
{
"size": 0,
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
}
}
}
This will give you the top 100 tags for all time.
If you put them together, then you get the top 100 tags in the past hour.
For the second request, it sounds like you want a mix of that, but you also want more than just the last hour.
Whenever performing an aggregation (or GROUP BY query for that matter), you need to think about incremental steps. If you want to group by hour and then do something, then that's the order in which it needs to happen. So it's not a matter of "now that I have the last hour, let's get the other hours too". Once you've narrowed your window (scope), you generally can't go back.
So to get number 2, we need to look at it differently. Group by as many hours as you're interested in looking at (how many 1-hour buckets do you want), then get those and then get the count per bucket. I'll take a guess and say that you want 24, 1-hour buckets (note 24 * 100 is 2400, which is not insignificant!).
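For reference, the per-tag hourly breakdown described above could be sketched roughly as follows. This is only an illustration, not a tested request: the helper name `hourly_counts_per_tag_body` and the sub-aggregation name `per_hour` are made up here, and the field names `tag` and `timestamp` are taken from the other requests in this answer. Note that `date_histogram` buckets are aligned to clock hours rather than to `now`, so the most recent bucket is usually a partial hour.

```python
import json


def hourly_counts_per_tag_body(hours=24, top_n=100,
                               tag_field="tag", time_field="timestamp"):
    """Build a request body: top tags in the window, each split into 1-hour buckets.

    The helper and the aggregation name "per_hour" are hypothetical; the
    query uses the 1.x-style "filtered" query shown elsewhere in this answer.
    """
    return {
        "size": 0,
        "query": {
            "filtered": {
                "filter": {
                    "range": {time_field: {"gte": "now-%dh" % hours}}
                }
            }
        },
        "aggs": {
            "group_by_tag": {
                "terms": {"field": tag_field, "size": top_n},
                "aggs": {
                    # One bucket per clock hour; empty hours are kept so that
                    # every tag returns the same number of buckets.
                    "per_hour": {
                        "date_histogram": {
                            "field": time_field,
                            "interval": "1h",
                            "min_doc_count": 0,
                        }
                    }
                },
            }
        },
    }


body = hourly_counts_per_tag_body()
print(json.dumps(body, indent=2))
```

With the defaults this asks for up to 100 tags with roughly 24 hourly buckets each, which is the bucket count the paragraph above warns about.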
That's a lot of buckets, so maybe we can think about the question differently.
I want the top 100 for the last hour.
I want the average for those top 100 over X time (you define X; reducing it makes the query faster, but the results are naturally limited to the selected window). By limiting with the filter, we reduce the scope of the overall aggregation.
That may look like this:
{
"size": 0,
"query": {
"filtered": {
"filter": {
"range" : {
"timestamp": {
"gte": "now-24h"
}
}
}
}
},
"aggs": {
"group_by_hour_and_day": {
"date_range": {
"field": "timestamp",
"ranges": [
{ "from": "now-1h" },
{ "to": "now-1h" }
]
},
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
}
}
}
}
}
The problem with this request is that it gives you now-24h to now-1h, then now-1h to now. That's loosely what you requested, but the result is not organized by term (which may or may not matter); instead, the terms are grouped by time range (again, steps/order matters). You can then say that the average for the earlier period is the corresponding doc count of the wider window, divided by the window size (23 in this case, for 23 hours). If you want to include the last hour in the average, then you can change "to": "now-1h" to "to": "now" and divide by 24 instead.
We can perhaps flip this to give us the answer differently, but with a little bit more effort (where query still limits by the max time range to consider):
{
"size": 0,
"query": { ... },
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
},
"aggs": {
"group_by_range": {
"date_range": {
"field": "timestamp",
"ranges": [
{ "from": "now-1h" },
{ "to": "now-1h" }
]
}
}
}
}
}
}
Notice that now we aggregate by tag first across the full scope. You could drop the "to": "now-1h" range as a result, because the doc count of each tag bucket already gives you the total for the time window. The problem with this approach is that a tag that is very popular in the last hour may not be popular enough over the full range, and so it won't appear at all.
The solution to that is to add an extra step unfortunately, by making two top-level aggregations. One for the top 100 in the full scope and one for the top 100 in the last hour.
{
"size": 0,
"query": { ... },
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
},
"group_by_last_hour": {
"filter": {
"range": {
"timestamp": {
"gte": "now-1h"
}
}
},
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
}
}
}
}
}
This gives the top 100 for the full window -- whatever that might be -- and then it also separately gives the top 100 for the last hour.
Then calculate by what percentage Y deviates from the average value of all the other 1-hour buckets.
Do this on the client side, using whichever of the request forms above you prefer: divide each tag's doc count for the wider window by the number of hours it covers to get the average, then compare that with the tag's last-hour count.
Considering the type of query, you should then cache the result, which allows you to play with larger window sizes than might otherwise be desirable.
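As a rough sketch of that client-side step, assuming the two top-level aggregations from the last request have already been flattened into plain tag-to-count dictionaries (the function name `deviations` and its arguments are made up for illustration, and the full window is assumed to be 24 hours including the last hour):

```python
def deviations(full_window_counts, last_hour_counts, window_hours=24):
    """Percent deviation of each tag's last-hour count (Y) from the average
    of the other hours in the window.

    full_window_counts: tag -> doc count over the whole window (including the last hour)
    last_hour_counts:   tag -> doc count over the last hour only
    Returns tag -> percentage, or None where no meaningful value exists.
    """
    result = {}
    for tag, y in last_hour_counts.items():
        total = full_window_counts.get(tag)
        if total is None:
            # The tag made the last-hour top 100 but not the full-window top 100.
            result[tag] = None
            continue
        # Average over the other hours, i.e. the window without the last hour.
        mean_other = (total - y) / float(window_hours - 1)
        if mean_other == 0:
            result[tag] = None  # avoid dividing by zero
        else:
            result[tag] = (y - mean_other) / mean_other * 100.0
    return result


full = {"a": 260, "b": 23}
last = {"a": 30, "b": 0, "c": 5}
print(deviations(full, last))
```

Here tag "a" averages 10 per hour over the other 23 hours, so a last-hour count of 30 is a deviation of +200%. Tag "c" is the case mentioned above: popular in the last hour but missing from the full-window list, so it needs a follow-up query or a larger terms size.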

Related

Bucket aggregation that doesn't depend on the time range in Elasticsearch

I'm using Elasticsearch 7.9.3 to query time series data metrics which are stored in a form of:
{
"timestamp": <long>,
"name" : <string - metric name>,
"value" : <float>
}
I want to show this data in our UI widgets. However, the query might return far too much data for a widget, so I went with a bucket aggregation that calculates the average value per bucket and returns these "calculated" representatives of the time series. Here is a slightly simplified version of the query:
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"name": "METRICS_NAME_COMES_HERE"
}
},
{
"range": {
"timestamp": {
"gte": {{from}},
"lt": {{to}}
}
}
}
]
}
},
"aggs": {
"primary-agg": {
"date_histogram": {
"field": "timestamp",
"fixed_interval": "{{bucket_size}}ms",
"min_doc_count" : 1,
"offset": "{{offset_in_ms}}ms"
},
"aggs": {
"average-value": {
"avg": {
"field": "value"
}
}
}
}
}
}
Now, when the time range changes (we have a Kibana-like time picker in our UI widget that changes the time range, which is translated to 'from'/'to' in the query), the bucket data gets recalculated, and this can lead to significant discrepancies in the data shown in the UI.
For example, if I see a "spike" in the UI and zoom in (thus narrowing the search period), the spike is preserved but the actual values of the "representatives" change significantly.
So my question is: what are the best practices for creating a query that produces a fixed number of results (therefore I understand that I need some kind of aggregation) but whose values are not affected by changes to the range?

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) to which I append a document every time a file is downloaded by a client. Each document is quite basic: it contains a field filename and a date field when that records the time of the download.
What I want to achieve is to get, for each file the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads": {
"terms": {
"field": "filename.keyword",
"size": 1000
}
}
},
"size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like that:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads_agg": {
"composite": {
"size": 100,
"sources": [
{
"downloads": {
"terms": {
"field": "filename.keyword"
}
}
}
]
}
}
},
"size": 0
}
This aggregation allows me to paginate (thanks to after_key value in response), but it is not sorted by the number of downloads - it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
Composite aggregations don't allow sorting by document count.
Excerpt from the discussion on elastic forum:
it's designed as a memory-friendly way to paginate over aggregations. Part of the tradeoff is that you lose things like ordering by doc count, since that isn't known until after all the docs have been collected.
I have no experience with Transforms (part of X-pack & Licensed) but you can try that out. Apart from this, I don't see a way to get the expected output.
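If the number of distinct filenames is manageable, one possible workaround is to walk through all composite pages, collect the counts, and sort on the client. The sketch below is not a feature of Elasticsearch: it assumes the bucket lists of each page have already been fetched (following after_key), and the helper names `sort_buckets_by_count` and `paginate` are invented for illustration. The bucket shape matches the composite source named downloads in the question.

```python
def sort_buckets_by_count(pages, key_name="downloads"):
    """Merge composite-aggregation pages and rank keys by doc_count (descending).

    pages: an iterable of bucket lists, one list per composite page, where each
    bucket looks like {"key": {"downloads": "a.txt"}, "doc_count": 3}.
    """
    counts = {}
    for page in pages:
        for bucket in page:
            counts[bucket["key"][key_name]] = bucket["doc_count"]
    # Sort by count descending, then by name so that ties are stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def paginate(sorted_items, page, per_page):
    """Return one zero-based page of the ranked list."""
    start = page * per_page
    return sorted_items[start:start + per_page]


pages = [
    [{"key": {"downloads": "a.txt"}, "doc_count": 3},
     {"key": {"downloads": "b.txt"}, "doc_count": 7}],
    [{"key": {"downloads": "c.txt"}, "doc_count": 7}],
]
ranked = sort_buckets_by_count(pages)
print(paginate(ranked, 0, 2))
```

The cost is that every page has to be read before the first sorted page can be shown, so this only makes sense when the ranked list can be cached.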

How to control the elasticsearch aggregation results with From / Size?

I have been trying to add pagination in elasticsearch term aggregation. In query we can add the pagination like,
{
"from": 0, // to add the start to control the pagination
"size": 10,
"query": { }
}
This is pretty clear, but I want to add pagination to an aggregation. I have read a lot about it but couldn't find anything. My code looks like this:
{
"from": 0,
"size": 0,
"aggs": {
"group_by_name": {
"terms": {
"field": "name",
"size": 20
},
"aggs": {
"top_tag_hits": {
"top_hits": {
"size": 1
}
}
}
}
}
}
Is there any way to create pagination with a function or any other suggestions?
Seems like you probably want partitions. From the docs:
Sometimes there are too many unique terms to process in a single request/response pair so it can be useful to break the analysis up into multiple requests. This can be achieved by grouping the field’s values into a number of partitions at query-time and processing only one partition in each request.
Basically you add "include": { "partition": n, "num_partitions": x } to the terms aggregation, where n is the (zero-based) page and x is the number of pages.
Unfortunately this feature was added fairly recently. If the tags can be believed on the GitHub Issue which spawned this feature, you'll need to be on at least Elasticsearch 5.2 or better.
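Applied to the request from the question, the partitioned version might be built like this. It is a sketch only: the helper name `partitioned_terms_body` is made up, and the `size` default is an arbitrary placeholder that should be large enough to hold every term that falls into one partition.

```python
import json


def partitioned_terms_body(partition, num_partitions, field="name", size=1000):
    """Build the question's request with the terms aggregation limited to one partition.

    partition is zero-based; num_partitions is the total number of "pages".
    """
    return {
        "size": 0,
        "aggs": {
            "group_by_name": {
                "terms": {
                    "field": field,
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": size,
                },
                "aggs": {
                    "top_tag_hits": {"top_hits": {"size": 1}}
                },
            }
        },
    }


# One request body per partition; send them one at a time.
for n in range(3):
    print(json.dumps(partitioned_terms_body(n, 3)))
```

Keep in mind that terms are assigned to partitions by value, so the "pages" are not globally ordered by doc count; each partition is only sorted within itself.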

Elasticsearch - calculate percentage in nested aggregations in relation to parent bucket

Updated question
In my query I aggregate on date and then on sensor name. Is it possible to calculate a ratio from a nested aggregation and the total count of documents (or any other aggregation) of the parent bucket? Example query:
{
"size": 0,
"aggs": {
"over_time": {
"aggs": {
"by_date": {
"date_histogram": {
"field": "date",
"interval": "1d",
"min_doc_count": 0
},
"aggs": {
"measure_count": {
"cardinality": {
"field": "date"
}
},
"all_count": {
"value_count": {
"field": "name"
}
},
"by_name": {
"terms": {
"field": "name",
"size": 0
},
"aggs": {
"count_by_name": {
"value_count": {
"field": "name"
}
},
"my ratio": count_by_name / all_count * 100 <-- How to do that?
}
}
}
}
}
}
}
}
I want a custom metric that gives me the ratio count_by_name / all_count * 100. Is that possible in ES, or do I have to compute that on the client?
This seems very simple to me, but I haven't found a way yet.
Old post:
Is there a way to let Elasticsearch consider the overall count of documents (or any other metric) when calculating the average for a bucket?
Example:
I have like 100000 sensors that generate events on different times. Every event is indexed as a document that has a timestamp and a value.
When I calculate a ratio of the value over a date histogram, and some sensors only generated values at one point in time, I want Elasticsearch to treat the non-existent values (documents) for my sensors as 0 instead of null.
So when aggregating by day, if a sensor has only generated two values, at 10pm (3) and 11pm (5), the aggregate for the day should be (3+5)/24, or formally: SUM(VALUE)/24.
Instead, Elasticsearch calculates the average like (3+5)/2, which is not correct in my case.
There was once a ticket on Github https://github.com/elastic/elasticsearch/issues/9745, but the answer was "handle it in your application". That's no answer for me, as I would have to generate zillions of zero-Value documents for every sensor/time combination to get the average ratio right.
Any ideas on this?
If this is the case, simply divide the results by 24 on the application side, and when the granularity changes, change this value accordingly. The number of hours per day is fixed, after all.
You can use the Bucket script aggregation to do what you want.
{
"bucket_script": {
"buckets_path": {
"count_by_name": "count_by_name",
"all_count": "all_count"
},
"script": "count_by_name / all_count*100"
}
}
It's just an example.
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-pipeline-bucket-script-aggregation.html
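If the ratio ends up being computed on the client instead, as the question considers, a minimal sketch could look like the following. It assumes the by_date buckets have been extracted from the response, and that every document has exactly one name value, so that a bucket's doc_count can stand in for the value_count aggregations; the function name `ratios_per_date` is hypothetical.

```python
def ratios_per_date(by_date_buckets):
    """For each date bucket, return name -> percentage of that date's documents.

    by_date_buckets: the "buckets" list of the by_date date_histogram, where
    each entry carries "key_as_string", "doc_count", and a nested "by_name"
    terms aggregation.
    """
    out = {}
    for day in by_date_buckets:
        total = day["doc_count"]
        out[day["key_as_string"]] = {
            # Guard against empty days (min_doc_count is 0 in the question).
            b["key"]: (b["doc_count"] * 100.0 / total if total else 0.0)
            for b in day["by_name"]["buckets"]
        }
    return out


sample = [
    {"key_as_string": "2016-01-01", "doc_count": 4,
     "by_name": {"buckets": [{"key": "s1", "doc_count": 1},
                             {"key": "s2", "doc_count": 3}]}},
    {"key_as_string": "2016-01-02", "doc_count": 0,
     "by_name": {"buckets": []}},
]
print(ratios_per_date(sample))
```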

Concurrent events aggregation in ElasticSearch

I have a number of documents representing events with starts_at and ends_at fields. At a given point in time, an event is considered active, if the point in question is after starts_at and before ends_at.
I'm looking for an aggregation, which should result in a date histogram, where each bucket contains the number of active events in that interval.
So far, the best approximation I have found is to create a set of buckets counting the number of starts in each interval, as well as a corresponding set of buckets counting the number of ends, and then postprocessing them by subtracting the number of starts from the number of ends for each interval:
{
"size": "0",
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"and": [
{
"term": {
"_type": "event"
}
},
{
"range": {
"starts_at": {
"gte": "2015-06-14T05:25:03Z",
"lte": "2015-06-21T05:25:03Z"
}
}
}
]
}
}
},
"aggs": {
"starts": {
"date_histogram": {
"field": "starts_at",
"interval": "15m",
"extended_bounds": {
"max": "2015-06-21T05:25:04Z",
"min": "2015-06-14T05:25:04Z"
},
"min_doc_count": 0
}
},
"ends": {
"date_histogram": {
"field": "ends_at",
"interval": "15m",
"extended_bounds": {
"max": "2015-06-21T05:25:04Z",
"min": "2015-06-14T05:25:04Z"
},
"min_doc_count": 0
}
}
}
}
I'm looking for something like this solution.
Is there a way to achieve that with a single query?
I'm not 100% sure, but the upcoming pipeline aggregations might solve this problem more elegantly in the near future.
Meanwhile, you could choose the desired time resolution and, at index time, generate an active_at field in addition to the starts_at and ends_at fields. It would be an array of timestamps, and you could use either a terms aggregation (if it is mapped as a not_analyzed string) or a date_histogram aggregation to get the correct "active events count" for each time bucket.
The downside is inflated storage requirements and possibly worse performance, since there are more field values to aggregate over. It shouldn't be too bad, though, as long as you don't choose too fine a time resolution, such as 1 minute.
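Generating the active_at values at index time could be sketched like this. The function name `active_at` and the 15-minute default are assumptions for illustration (the 15 minutes mirrors the interval used in the question), and naive datetimes are treated as UTC.

```python
from datetime import datetime, timedelta


def active_at(starts_at, ends_at, resolution=timedelta(minutes=15)):
    """Return one timestamp per time bucket during which the event is active.

    The first timestamp is starts_at snapped down to the bucket boundary;
    further timestamps follow at the given resolution while they are still
    before ends_at.
    """
    epoch = datetime(1970, 1, 1)
    # Distance of starts_at from the start of its bucket.
    offset = (starts_at - epoch) % resolution
    current = starts_at - offset
    stamps = []
    while current < ends_at:
        stamps.append(current.strftime("%Y-%m-%dT%H:%M:%SZ"))
        current += resolution
    return stamps


print(active_at(datetime(2015, 6, 14, 5, 25, 3),
                datetime(2015, 6, 14, 6, 5, 0)))
```

For this event the field would hold four values (05:15, 05:30, 05:45, and 06:00), so a date_histogram on active_at with a matching interval counts the event once in each of those buckets.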
