Elasticsearch - calculate percentage in nested aggregations in relation to parent bucket

Updated question
In my query I aggregate on date and then on sensor name. Is it possible to calculate a ratio from a nested aggregation and the total count of documents (or any other aggregation) of the parent bucket? Example query:
{
"size": 0,
"aggs": {
"over_time": {
"aggs": {
"by_date": {
"date_histogram": {
"field": "date",
"interval": "1d",
"min_doc_count": 0
},
"aggs": {
"measure_count": {
"cardinality": {
"field": "date"
}
},
"all_count": {
"value_count": {
"field": "name"
}
},
"by_name": {
"terms": {
"field": "name",
"size": 0
},
"aggs": {
"count_by_name": {
"value_count": {
"field": "name"
}
},
"my ratio": count_by_name / all_count * 100 <-- How to do that?
}
}
}
}
}
}
}
}
I want a custom metric that gives me the ratio count_by_name / all_count * 100. Is that possible in ES, or do I have to compute that on the client?
This seems very simple to me, but I haven't found a way yet.
Old post:
Is there a way to let Elasticsearch consider the overall count of documents (or any other metric) when calculating the average for a bucket?
Example:
I have about 100,000 sensors that generate events at different times. Every event is indexed as a document that has a timestamp and a value.
When I want to calculate a ratio of the value over a date histogram, and some sensors only generated values at one time, I want Elasticsearch to treat the missing values (documents) for my sensors as 0 instead of null.
So when aggregating by day, if a sensor has only generated two values, at 10pm (3) and 11pm (5), the aggregate for the day should be (3+5)/24, or formally: SUM(VALUE)/24.
Instead, Elasticsearch calculates the average as (3+5)/2, which is not correct in my case.
There was once a ticket on GitHub, https://github.com/elastic/elasticsearch/issues/9745, but the answer was "handle it in your application". That's no answer for me, as I would have to generate zillions of zero-value documents for every sensor/time combination to get the average ratio right.
Any ideas on this?

If this is the case, simply divide the results by 24 on the application side. When the granularity changes, change this value accordingly. The number of hours per day is fixed, after all.
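A minimal sketch of that client-side division, assuming the date histogram carries a sum sub-aggregation on the value field. The name value_sum and the helper hourly_rate are hypothetical and do not appear in the question:

```python
def hourly_rate(day_buckets, hours_per_bucket=24):
    """Divide each bucket's summed value by the number of hours it covers."""
    rates = {}
    for bucket in day_buckets:
        # "value_sum" is an assumed sub-aggregation name; guard against a null value.
        total = bucket["value_sum"]["value"] or 0.0
        rates[bucket["key_as_string"]] = total / hours_per_bucket
    return rates


# One daily bucket holding the two readings from the question (3 + 5).
buckets = [{"key_as_string": "2015-10-17", "value_sum": {"value": 8.0}}]
print(hourly_rate(buckets))
```

If the histogram interval changes, hours_per_bucket has to change with it (for example 1 for hourly buckets).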

You can use the Bucket script aggregation to do what you want.
{
"bucket_script": {
"buckets_path": {
"count_by_name": "count_by_name",
"all_count": "all_count"
},
"script": "count_by_name / all_count*100"
}
}
It's just an example.
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-pipeline-bucket-script-aggregation.html
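If the ratio ends up being computed on the client, as the question also considers, a rough sketch could look like the following. It assumes the by_date buckets have already been pulled out of the response and that they follow the usual aggregation response shape; the helper name ratios_by_name and the sample data are made up for illustration:

```python
def ratios_by_name(date_buckets):
    """Return {date: {name: percentage of that day's all_count}}."""
    result = {}
    for day in date_buckets:
        total = day["all_count"]["value"]
        per_name = {}
        for name_bucket in day["by_name"]["buckets"]:
            count = name_bucket["count_by_name"]["value"]
            # min_doc_count is 0 in the query, so empty days are possible.
            per_name[name_bucket["key"]] = 100.0 * count / total if total else 0.0
        result[day["key_as_string"]] = per_name
    return result


# Hypothetical response fragment for a single day.
sample = [{
    "key_as_string": "2016-01-01",
    "all_count": {"value": 4},
    "by_name": {"buckets": [
        {"key": "sensor_a", "count_by_name": {"value": 3}},
        {"key": "sensor_b", "count_by_name": {"value": 1}},
    ]},
}]
print(ratios_by_name(sample))
```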

Related

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
"size": 0,
"query": {
"match_all": {}
},
"rescore": [
{
"window_size": 10000,
"query": {
"rescore_query": {
"function_score": {
"boost_mode": "replace",
"script_score": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
],
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1
}
},
"top_score": {
"max": {
"script": {
"inline": "_score"
}
}
}
}
}
}
}
This is a simplified version where the real query has a more complex main query and the rescore function is far more intensive.
Let me explain its purpose first, in case I'm about to spend 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely, the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it to?
From the documentation:
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty continue to expiration in case someone comes up with a better approach.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 10000
},
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1,
"sort": [
{
"_script": {
"type": "number",
"script": {
"source": "doc['topic_score'].value"
},
"order": "desc"
}
}
]
}
},
"top_score": {
"max": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
}
}
}
}
}
}
The sampler aggregation will take the top N hits per shard from the core query and run aggregations over those. Then, in the max aggregator that defines the bucket order, I use the exact same script as the one I use to pick a top hit from the bucket. Now the buckets and the top hits are running over the same top N sets of items, and the buckets will order by the max of the same score, generated from the same script. Unfortunately I still need to run the script twice, once to order the buckets and once to pick a top hit within the bucket. You could use the rescore instead for the top hits ordering, but either way it has to run twice, and I found it was faster as a sort script than as a rescore.

How do I query last 1 hour of data and order it based on time?

I have data in ES such as:
#timestamp --> Timestamp field
record.hostIP
record.destIP
record.port
record.application
etc...
I would like to plot this on a graph in JavaScript, and hence need time on the X axis and the count of record.<> on the Y axis.
The query below gets me the count of all documents per timestamp, sorted by timestamp.
What should I do if I need the count of record.application in the last 1 hour, sorted by timestamp from earliest to latest?
GET _search
{
"size": "0",
"aggs": {
"oneHourTimeRange": {
"filter": {
"range": {
"#timestamp": {
"gte": "now-60m",
"lte": "now"
}
}
},
"aggs": {
"totalTraffic": {
"terms": {
"field": "#timestamp",
"size": 500,
"order": { "_key": "asc" }
}
}
}
}
}
}
Thanks.
Do you mean the unique count of record.application? You would probably want the cardinality aggregation. Nest a cardinality aggregation inside a date histogram, and it should give you what you want. You should also move the filter condition out of the aggregation and into the query.
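A sketch of what that suggestion might look like as a request body, built as a Python dict. The aggregation names, the helper name, the 1m interval, and the use of the older "interval" parameter are assumptions, so adjust them to your Elasticsearch version:

```python
import json


def app_cardinality_last_hour(interval="1m"):
    """Build a search body: range filter in the query, cardinality per time bucket."""
    return {
        "size": 0,
        # The time restriction lives in the query rather than in a filter aggregation.
        "query": {"range": {"#timestamp": {"gte": "now-60m", "lte": "now"}}},
        "aggs": {
            "traffic_over_time": {
                # Date histogram buckets come back ordered by key ascending by default.
                "date_histogram": {"field": "#timestamp", "interval": interval},
                "aggs": {
                    "unique_applications": {
                        "cardinality": {"field": "record.application"}
                    }
                },
            }
        },
    }


print(json.dumps(app_cardinality_last_hour(), indent=2))
```

Each bucket key then goes on the X axis and the unique_applications value on the Y axis.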

Pipeline aggregations in ElasticSearch 1.5

I'm wondering if it is, in any way, possible to make ES run aggregations on other aggregations all in the same query?
Basically, that's called pipelining.
I'm talking about ElasticSearch 1.5. Yes, I know, that's unfortunate, but I'm stuck with AWS and that's what they're selling, so I have to live with that.
I'm guessing that is not possible, so I'll write the next phase of the question right away.
Assuming I can query ES multiple times based on results from previous queries, how would you do the following:
Have a list of the top 100 tags, sorted by the number of appearances in the documents, in the past hour? (I have a field tags for each record, and I'd like to know which tags are the most common.)
Having that, for each of the 100 tags, have the number of appearances split into 1-hour buckets (denote by Y the number representing the last hour).
Then, calculate by what percentage Y deviates from the average value of all the other 1-hour buckets.
Thank you for helping !!!
Basically, that's called pipelining.
No. Pipeline Aggregations did not appear until Elasticsearch 2.0. For what it's worth, Elastic does offer its own ESaaS offering with Elastic Cloud. It also runs on AWS.
... how would you do the following
The first two follow more of a flow of scope rather than working on the values.
{
"query": {
"filtered": {
"filter": {
"range" : {
"timestamp": {
"gte": "now-1h"
}
}
}
}
}
}
This will give you the last hour of data.
{
"size": 0,
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
}
}
}
This will give you the top 100 tags for all time.
If you put them together, then you get the top 100 tags in the past hour.
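Putting the two together might look like the sketch below, written as Python dicts for readability. It reuses the filtered query syntax from above (which fits Elasticsearch 1.x); the variable and function names are made up:

```python
import json

# The two pieces shown above: the last-hour filter and the top-100 terms aggregation.
last_hour_query = {
    "filtered": {"filter": {"range": {"timestamp": {"gte": "now-1h"}}}}
}
top_tags_aggs = {"group_by_tag": {"terms": {"field": "tag", "size": 100}}}


def top_tags_last_hour():
    """Combine the query (scope) with the aggregation (grouping) in one body."""
    return {"size": 0, "query": last_hour_query, "aggs": top_tags_aggs}


print(json.dumps(top_tags_last_hour(), indent=2))
```

The query narrows the scope first, and the aggregation then runs only over the documents that survive it.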
For the second request, it sounds like you want a mix of that, but you also want more than just the last hour.
Whenever performing an aggregation (or GROUP BY query for that matter), you need to think about incremental steps. If you want to group by hour, then do something, then that's the order that it needs to happen in. So it's not a matter of "now that I have the last hour, let's get the other hours too". Once you've narrowed your window (scope), you generally can't go back.
So to get number 2, we need to look at it differently. Group by as many hours as you're interested in looking at (how many 1-hour buckets do you want), then get those and then get the count per bucket. I'll take a guess and say that you want 24, 1-hour buckets (note 24 * 100 is 2400, which is not insignificant!).
That's a lot of buckets, so maybe we can think about the question differently.
I want the last hour results of top 100.
I want all top 100 average for X time (where you define X, and having it reduced will make it faster, but naturally limited to the window of selection). By limiting with the filter, we reduce the scope of the overall aggregation:
This may look like this:
{
"size": 0,
"query": {
"filtered": {
"filter": {
"range" : {
"timestamp": {
"gte": "now-24h"
}
}
}
}
},
"aggs": {
"group_by_hour_and_day": {
"date_range": {
"field": "timestamp",
"ranges": [
{ "from": "now-1h" },
{ "to": "now-1h" }
]
},
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
}
}
}
}
}
The problem with this request is that it gives you now-24h to now-1h, then now-1h to now. That's pretty loosely what you requested, but it doesn't give it by term (which may or may not matter). Instead, the terms are grouped by time range (again, steps/order matters). You can then say that the previous 24h average is the corresponding doc count of the wider window, divided by the window size (23 in this case, for 23 hours). If you want to include the last hour in the average, then you can change "to": "now-1h" to "to": "now".
We can perhaps flip this to give us the answer differently, but with a little bit more effort (where query still limits by the max time range to consider):
{
"size": 0,
"query": { ... },
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
},
"aggs": {
"group_by_range": {
"date_range": {
"field": "timestamp",
"ranges": [
{ "from": "now-1h" },
{ "to": "now-1h" }
]
}
}
}
}
}
}
Notice that now we aggregate by tag first across the full scope. You could remove the second date_range aggregation as a result because you now have the total for the time window. The problem with this approach is that you could end up with a very popular tag in the last hour that is not popular enough in the past full range, and so it won't appear at all.
The solution to that is to add an extra step unfortunately, by making two top-level aggregations. One for the top 100 in the full scope and one for the top 100 in the last hour.
{
"size": 0,
"query": { ... },
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
},
"group_by_last_hour": {
"filter": {
"range": {
"timestamp": {
"gte": "now-1h"
}
}
},
"aggs": {
"group_by_tag": {
"terms": {
"field": "tag",
"size": 100
}
}
}
}
}
}
This gives the top 100 for the full window -- whatever that might be -- and then it also separately gives the top 100 for the last hour.
Then, calculate by what percentage Y deviates from the average value of all the other 1-hour buckets.
Do this on the client side based on whichever form you care to use, and calculate the average by cross-comparing.
And considering the type of query, you should then cache the result, which allows you to play with larger window sizes than might be otherwise desirable.
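The client-side step could be sketched roughly like this, assuming you already have the last-hour count Y and the counts of the other 1-hour buckets for a tag. The function name and the sample numbers are made up:

```python
def percent_deviation(last_hour_count, other_hour_counts):
    """Percentage by which Y deviates from the mean of the other hourly counts."""
    if not other_hour_counts:
        raise ValueError("need at least one other hourly bucket")
    average = sum(other_hour_counts) / len(other_hour_counts)
    if average == 0:
        return None  # the deviation is undefined when the baseline is zero
    return (last_hour_count - average) / average * 100.0


# Hypothetical counts for one tag: Y = 30, earlier hours 10, 20 and 30 (mean 20).
print(percent_deviation(30, [10, 20, 30]))  # 50.0
```

With the date_range form above there are no per-hour counts, so the average would instead be the doc count of the wider range divided by its length in hours (23 in that example).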

Using Date Histogram in Elasticsearch to count sequential activity

I am indexing Tomcat access-log data into Elasticsearch (1.7.3).
The documents that I deal with have the concept of duration, represented as an end time and a duration in milliseconds (the start time can be calculated, though I can store it as well if it helps solve my problem).
For example:
{
"ztime": "10-17-2015T04:05:00.000+02:00",
"duration": 4500,
"thread": "http-nio-8080-exec-14"
},
{
"ztime": "10-17-2015T04:07:42.227+02:00",
"duration": 3100,
"thread": "http-nio-8080-exec-25"
}
My goal is to produce a histogram where I show for each second how many threads existed.
I thought of using a date_histogram that will aggregate my docs into 1 sec buckets.
GET /mindex/mtype/_search?search_type=count
{
"aggs": {
"threads_per_hr": {
"date_histogram": {
"field": "ztime",
"interval": "1s",
"min_doc_count": 1
},
"aggs": {
"per_hr_threads": {
"cardinality": {
"field": "thread"
}
}
}
}
}
}
However, with this approach each thread is bucketed only once.
What I need is for each doc to be bucketized into several buckets.
For example, I will need the first document to be bucketized into the 04:05:00.000, 04:05:01.000, 04:05:02.000, 04:05:03.000 buckets.
What kind of query (Java API and/or REST API) would help me achieve this goal?
You need to use cardinality aggregation here. It gives the number of unique values for the field.
GET /{index}/{type}/_search?search_type=count
{
"aggs": {
"threads_per_hr": {
"date_histogram": {
"field": "ztime",
"interval": "1s",
"min_doc_count": 0
},
"aggs": {
"per_hr_threads": {
"cardinality": {
"field": "thread"
}
}
}
}
}
}

Concurrent events aggregation in ElasticSearch

I have a number of documents representing events with starts_at and ends_at fields. At a given point in time, an event is considered active, if the point in question is after starts_at and before ends_at.
I'm looking for an aggregation, which should result in a date histogram, where each bucket contains the number of active events in that interval.
So far, the best approximation I have found is to create a set of buckets counting the number of starts in each interval, as well as a corresponding set of buckets counting the number of ends, and then postprocessing them by subtracting the number of starts from the number of ends for each interval:
{
"size": "0",
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"and": [
{
"term": {
"_type": "event"
}
},
{
"range": {
"starts_at": {
"gte": "2015-06-14T05:25:03Z",
"lte": "2015-06-21T05:25:03Z"
}
}
}
]
}
}
},
"aggs": {
"starts": {
"date_histogram": {
"field": "starts_at",
"interval": "15m",
"extended_bounds": {
"max": "2015-06-21T05:25:04Z",
"min": "2015-06-14T05:25:04Z"
},
"min_doc_count": 0
}
},
"ends": {
"date_histogram": {
"field": "ends_at",
"interval": "15m",
"extended_bounds": {
"max": "2015-06-21T05:25:04Z",
"min": "2015-06-14T05:25:04Z"
},
"min_doc_count": 0
}
}
}
}
I'm looking for something like this solution.
Is there a way to achieve that with a single query?
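For reference, the postprocessing described above can be sketched as a running total of starts minus ends per interval. This assumes the two bucket lists come from the starts and ends aggregations in the query, and that events already active before the window are passed in separately. The function name and the sample buckets are hypothetical:

```python
def active_counts(start_buckets, end_buckets, initially_active=0):
    """Return [(bucket_key, events still active at the end of that interval)]."""
    starts = {b["key"]: b["doc_count"] for b in start_buckets}
    ends = {b["key"]: b["doc_count"] for b in end_buckets}
    active = initially_active
    result = []
    # Align by key, since the two histograms may not cover identical ranges.
    for key in sorted(set(starts) | set(ends)):
        active += starts.get(key, 0) - ends.get(key, 0)
        result.append((key, active))
    return result


# Three hypothetical 15-minute buckets (keys in epoch milliseconds).
starts = [{"key": 0, "doc_count": 2}, {"key": 900000, "doc_count": 1},
          {"key": 1800000, "doc_count": 0}]
ends = [{"key": 0, "doc_count": 0}, {"key": 900000, "doc_count": 1},
        {"key": 1800000, "doc_count": 2}]
print(active_counts(starts, ends))
```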
I'm not 100% sure, but the upcoming pipeline aggregations might solve this problem in a more elegant way in the near future.
Meanwhile, you could choose the desired time resolution and, at index time, generate an active_at field in addition to the starts_at and ends_at fields. It would be an array of timestamps, and you could use either a terms aggregation (if it is mapped as a not_analyzed string) or a date_histogram aggregation to get the correct "active events count" for each time bucket.
The downside is inflated storage requirements and possibly worse performance, since there are more field values to aggregate over. Still, it shouldn't be too bad as long as you don't choose too fine a time resolution, such as 1 minute.
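A minimal sketch of generating such an active_at array at index time, assuming naive datetimes that represent UTC and a 15-minute resolution. The function name, the flooring to bucket boundaries, and the timestamp format are illustrative choices, not something prescribed by Elasticsearch:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def active_at(starts_at, ends_at, resolution=timedelta(minutes=15)):
    """List every bucket start (ISO 8601, UTC) during which the event is active."""
    # Floor the start time to the nearest bucket boundary at or before it.
    current = starts_at - ((starts_at - EPOCH) % resolution)
    stamps = []
    while current < ends_at:
        stamps.append(current.isoformat() + "Z")
        current += resolution
    return stamps


# Hypothetical event running from 05:25:03 to 06:01:00 on 2015-06-14.
doc = {
    "starts_at": "2015-06-14T05:25:03Z",
    "ends_at": "2015-06-14T06:01:00Z",
    "active_at": active_at(datetime(2015, 6, 14, 5, 25, 3),
                           datetime(2015, 6, 14, 6, 1, 0)),
}
print(doc["active_at"])
```

The array grows with the event duration divided by the resolution, which is where the extra storage cost mentioned above comes from.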
