How to use cumulative_sum with a previous aggregation?

I would like to plot a cumulative sum of some events, per day. The cumulative sum aggregation seems to be the way to go so I tried to reuse the example given in the docs.
The first aggregation works fine. The following query
{
"aggs": {
"vulns_day" : {
"date_histogram" :{
"field": "HOST_START_iso",
"interval": "day"
}
}
}
}
gives replies such as
(...)
{
"key_as_string": "2016-09-08T00:00:00.000Z",
"key": 1473292800000,
"doc_count": 76330
},
{
"key_as_string": "2016-09-09T00:00:00.000Z",
"key": 1473379200000,
"doc_count": 37712
},
(...)
I then wanted to query the cumulative sum of doc_count above via
{
"aggs": {
"vulns_day" : {
"date_histogram" :{
"field": "HOST_START_iso",
"interval": "day"
}
},
"aggs": {
"vulns_cumulated": {
"cumulative_sum": {
"buckets_path": "doc_count"
}
}
}
}
}
but it gives an error:
"reason": {
"type": "search_parse_exception",
"reason": "Could not find aggregator type [vulns_cumulated] in [aggs]",
I see that buckets_path should point to the values to be summed, and the cumulative sum example in the docs creates a specific intermediate sum for that, but I do not have anything to sum (besides doc_count).

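In your query the inner "aggs" block sits next to vulns_day rather than inside it, so Elasticsearch reads it as an aggregation named aggs of type vulns_cumulated, hence the error. I think you should nest it under vulns_day and give cumulative_sum a metric to sum, like this: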
{
"aggs": {
"vulns_day": {
"date_histogram": {
"field": "HOST_START_iso",
"interval": "day"
},
"aggs": {
"document_count": {
"value_count": {
"field": "HOST_START_iso"
}
},
"vulns_cumulated": {
"cumulative_sum": {
"buckets_path": "document_count"
}
}
}
}
}
}

I found a solution. Since doc_count did not seem to be available as a buckets_path, I computed stats on the time field and used its count value. It worked:
{
"size": 0,
"aggs": {
"vulns_day": {
"date_histogram": {
"field": "HOST_START_iso",
"interval": "day"
},
"aggs": {
"dates_stats": {
"stats": {
"field": "HOST_START_iso"
}
},
"vulns_cumulated": {
"cumulative_sum": {
"buckets_path": "dates_stats.count"
}
}
}
}
}
}
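For what it's worth, pipeline aggregations also accept the special _count path in buckets_path, which refers to each bucket's doc_count, so the helper metric may not be needed. The sketch below uses Python purely for illustration: it builds such a request body as a dict (field name and interval copied from the question) and reproduces the running total client-side on the two sample buckets shown in the question. The helper name cumulative_doc_counts is hypothetical, and it is worth checking that the Elasticsearch version in use accepts _count for cumulative_sum.

```python
import json
from itertools import accumulate

# Request body using the special "_count" path (each bucket's doc_count).
# Field name and interval are copied from the question.
query = {
    "size": 0,
    "aggs": {
        "vulns_day": {
            "date_histogram": {"field": "HOST_START_iso", "interval": "day"},
            "aggs": {
                "vulns_cumulated": {
                    "cumulative_sum": {"buckets_path": "_count"}
                }
            },
        }
    },
}


def cumulative_doc_counts(buckets):
    """Return (key_as_string, running total of doc_count) for each bucket."""
    totals = accumulate(b["doc_count"] for b in buckets)
    return [(b["key_as_string"], t) for b, t in zip(buckets, totals)]


# The two sample buckets from the question's response.
sample_buckets = [
    {"key_as_string": "2016-09-08T00:00:00.000Z", "doc_count": 76330},
    {"key_as_string": "2016-09-09T00:00:00.000Z", "doc_count": 37712},
]

if __name__ == "__main__":
    print(json.dumps(query, indent=2))
    for day, total in cumulative_doc_counts(sample_buckets):
        print(day, total)
```

The client-side helper is only there to show what the cumulative sum is expected to contain per bucket; the query dict is the part that would be sent as the search body.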

Related

Elasticsearch: How set 'doc_count' of a FILTER-Aggregation in relation to total 'doc_count'

A seemingly very trivial problem prompted me today to read the Elasticsearch documentation again diligently. So far, however, I have not come across the solution....
Question:
Is there a simple way to set the doc_count of a filter aggregation in relation to the total doc_count?
Here's a snippet from my search-request-json.
In the feature_occurrences aggregation I filtered documents.
Now I want to calculate the ratio filtered/all Docs in each time bucket.
GET my_index/_search
{
"aggs": {
"time_buckets": {
"date_histogram": {
"field": "date",
"calendar_interval": "1d",
"min_doc_count": 0
},
"aggs": {
"feature_occurrences": {
"filter": {
"term": {
"x": "y"
}
}
},
"feature_occurrences_per_doc" : {
// feature_occurrences.doc_count / doc_count
}
Any ideas?

You can use bucket_script to calculate the ratio:
{
"aggs": {
"date": {
"date_histogram": {
"field": "@timestamp",
"interval": "hour"
},
"aggs": {
"feature_occurrences": {
"filter": {
"term": {
"cloud.region": "westeurope"
}
}
},
"ratio": {
"bucket_script": {
"buckets_path": {
"doc_count": "_count",
"features_count": "feature_occurrences._count"
},
"script": "params.features_count / params.doc_count"
}
}
}
}
}
}
Elastic bucket script doc:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-bucket-script-aggregation.html
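
To make the arithmetic concrete, here is a small Python sketch of what that bucket_script computes per bucket, applied to a hypothetical response shape. The sample numbers and the helper name add_ratio are made up for illustration. It also guards against empty buckets, which can appear when min_doc_count is 0 as in the question; how Elasticsearch itself treats a division by zero inside the script is not covered here.

```python
def add_ratio(buckets, filter_name="feature_occurrences"):
    """Compute filtered/total doc_count for each date_histogram bucket."""
    out = []
    for b in buckets:
        total = b["doc_count"]                  # corresponds to "_count"
        matched = b[filter_name]["doc_count"]   # "feature_occurrences._count"
        # Empty buckets have no meaningful ratio; report None instead of
        # dividing by zero.
        ratio = matched / total if total else None
        out.append({"key_as_string": b["key_as_string"], "ratio": ratio})
    return out


# Hypothetical buckets, shaped like a date_histogram response with a
# filter sub-aggregation.
buckets = [
    {"key_as_string": "2021-01-01", "doc_count": 200,
     "feature_occurrences": {"doc_count": 50}},
    {"key_as_string": "2021-01-02", "doc_count": 0,
     "feature_occurrences": {"doc_count": 0}},
]

if __name__ == "__main__":
    for row in add_ratio(buckets):
        print(row)
```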

How to define percentage of result items with specific field in Elasticsearch query?

I have a search query that returns all items matching users that have type manager or lead.
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [
{
"terms": {
"type": ["manager", "lead"]
}
}
]
}
}
}
Is there a way to define what percentage of the results should be of type "manager"?
In other words, I want the results to have 80% of users with type manager and 20% with type lead.

I would suggest using the bucket_script aggregation. As far as I know, this aggregation needs to run in the sub-aggs of a histogram aggregation. Assuming you have a date field in your mapping, I think this query should work for you:
{
"size": 0,
"aggs": {
"NAME": {
"date_histogram": {
"field": "my_datetime",
"interval": "month"
},
"aggs": {
"role_type": {
"terms": {
"field": "type",
"size": 10
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
},
"role_1_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_1 / (params.role_1+params.role_2)*100"
}
},
"role_2_ratio": {
"bucket_script": {
"buckets_path": {
"role_1": "role_type['manager']>count",
"role_2": "role_type['lead']>count"
},
"script": "params.role_2 / (params.role_1+params.role_2)*100"
}
}
}
}
}
}
Please let me know if it didn't work well for you.

ElasticSearch: Nested buckets aggregation

I'm new to ElasticSearch, so this question could be quite trivial for you, but here I go:
I'm using kibana_sample_data_ecommerce, whose documents have a mapping like this:
{
...
"order_date" : <datetime>
"taxful_total_price" : <double>
...
}
I want to get a basic daily behavior of the data:
Expecting documents like this:
[
{
"qtime" : "00:00",
"mean" : 20,
"std" : 40
},
{
"qtime" : "01:00",
"mean" : 150,
"std" : 64
},
...
]
So, the process I think that I need to do is:
Group by day all records ->
Group by time window for each day ->
Sum all records in each time window ->
Cumulative Sum for each sum by time window, thus, I get behavior of a day ->
Extended_stats by the same time window across all days
But I can't unwrap those buckets to compute those statistics. Could you give me some advice on how to do that operation and get that result?
Here is my current query (Kibana Dev Tools):
POST kibana_sample_data_ecommerce/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"order_date": {
"gt": "now-1M",
"lte": "now"
}
}
}
]
}
},
"aggs": {
"day_histo": {
"date_histogram": {
"field": "order_date",
"calendar_interval": "day"
},
"aggs": {
"qmin_histo": {
"date_histogram": {
"field": "order_date",
"calendar_interval": "hour"
},
"aggs": {
"qminute_sum": {
"sum": {
"field": "taxful_total_price"
}
},
"cumulative_qminute_sum": {
"cumulative_sum": {
"buckets_path": "qminute_sum"
}
}
}
}
}
}
}
}

Here's how you pull off the extended stats:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"order_date": {
"gt": "now-4M",
"lte": "now"
}
}
}
]
}
},
"aggs": {
"by_day": {
"date_histogram": {
"field": "order_date",
"calendar_interval": "day"
},
"aggs": {
"by_hour": {
"date_histogram": {
"field": "order_date",
"calendar_interval": "hour"
},
"aggs": {
"by_taxful_total_price": {
"extended_stats": {
"field": "taxful_total_price"
}
}
}
}
}
}
}
}
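This yields extended statistics (count, min, max, avg, sum, std_deviation and so on) for taxful_total_price in every hourly bucket of every day.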
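
The remaining steps from the question (a cumulative sum within each day, then mean and standard deviation per hour of day across all days) are not covered by the query above. Here is a minimal client-side sketch in Python, assuming the per-hour sums have already been pulled out of the nested by_day/by_hour buckets into (day, hour, total) tuples. That input shape, the sample numbers and the function name daily_profile are all hypothetical.

```python
from collections import defaultdict
from itertools import accumulate
from statistics import mean, pstdev


def daily_profile(hourly_sums):
    """hourly_sums: iterable of (day, hour, total) tuples.

    Returns {hour: (mean, std)} of the per-day cumulative totals.
    """
    # Group the hourly totals by day.
    per_day = defaultdict(dict)
    for day, hour, total in hourly_sums:
        per_day[day][hour] = total

    # Cumulative sum within each day, collected by hour of day.
    per_hour = defaultdict(list)
    for hours in per_day.values():
        ordered = sorted(hours)
        running_totals = accumulate(hours[h] for h in ordered)
        for hour, running in zip(ordered, running_totals):
            per_hour[hour].append(running)

    # Mean and population standard deviation across days for each hour.
    return {h: (mean(v), pstdev(v)) for h, v in sorted(per_hour.items())}


# Hypothetical hourly sums for two days.
rows = [
    ("2021-01-01", 0, 10.0), ("2021-01-01", 1, 30.0),
    ("2021-01-02", 0, 30.0), ("2021-01-02", 1, 50.0),
]
profile = daily_profile(rows)

if __name__ == "__main__":
    for hour, (avg, std) in profile.items():
        print(f"{hour:02d}:00 mean={avg} std={std}")
```

Doing the last step client-side is a design choice for the sketch, since it needs values from the same hour in different day buckets side by side.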

Elasticsearch Aggregations: Only return results of one of them?

I'm trying to find a way to only return the results of one aggregation in an Elasticsearch query. I have a max bucket aggregation (the one that I want to see) that is calculated from a sum bucket aggregation based on a date histogram aggregation. Right now, I have to go through 1,440 results to get to the one I want to see. I've already removed the results of the base query with the size: 0 modifier, but is there a way to do something similar with the aggregations as well? I've tried slipping the same thing into a few places with no luck.
Here's the query:
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2018-11-28",
"lte": "2018-11-28"
}
}
},
"aggs": {
"hits_per_minute": {
"date_histogram": {
"field": "timestamp",
"interval": "minute"
},
"aggs": {
"total_hits": {
"sum": {
"field": "hits_count"
}
}
}
},
"max_transactions_per_minute": {
"max_bucket": {
"buckets_path": "hits_per_minute>total_hits"
}
}
}
}

Fortunately enough, you can do that with bucket_sort aggregation, which was added in Elasticsearch 6.4.
Do it with bucket_sort
POST my_index/doc/_search
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "2018-11-28",
"lte": "2018-11-28"
}
}
},
"aggs": {
"hits_per_minute": {
"date_histogram": {
"field": "timestamp",
"interval": "minute"
},
"aggs": {
"total_hits": {
"sum": {
"field": "hits_count"
}
},
"max_transactions_per_minute": {
"bucket_sort": {
"sort": [
{"total_hits": {"order": "desc"}}
],
"size": 1
}
}
}
}
}
}
This will give you a response like this:
{
...
"aggregations": {
"hits_per_minute": {
"buckets": [
{
"key_as_string": "2018-11-28T21:10:00.000Z",
"key": 1543439400000,
"doc_count": 3,
"total_hits": {
"value": 11
}
}
]
}
}
}
Note that there is no extra aggregation in the output and the output of hits_per_minute is truncated (because we asked to give exactly one, topmost bucket).
Do it with filter_path
There is also a generic way to filter the output of Elasticsearch: Response filtering, as this answer suggests.
In this case it will be enough to just do the following query:
POST my_index/doc/_search?filter_path=aggregations.max_transactions_per_minute
{ ... (original query) ... }
That would give the response:
{
"aggregations": {
"max_transactions_per_minute": {
"value": 11,
"keys": [
"2018-11-28T21:10:00.000Z"
]
}
}
}
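
For comparison, the two approaches select the same bucket but shape the output differently. The Python sketch below mimics both on a hypothetical list of per-minute buckets. The sample values and function names are invented for illustration, and it does not call Elasticsearch.

```python
def bucket_sort_top(buckets, metric="total_hits", size=1):
    """Mimic bucket_sort: order buckets by a metric (descending), keep `size`."""
    ranked = sorted(buckets, key=lambda b: b[metric]["value"], reverse=True)
    return ranked[:size]


def max_bucket(buckets, metric="total_hits"):
    """Mimic max_bucket: the maximum value plus the keys of the buckets holding it."""
    top = max(b[metric]["value"] for b in buckets)
    keys = [b["key_as_string"] for b in buckets if b[metric]["value"] == top]
    return {"value": top, "keys": keys}


# Hypothetical per-minute buckets.
buckets = [
    {"key_as_string": "2018-11-28T21:09:00.000Z", "total_hits": {"value": 4}},
    {"key_as_string": "2018-11-28T21:10:00.000Z", "total_hits": {"value": 11}},
    {"key_as_string": "2018-11-28T21:11:00.000Z", "total_hits": {"value": 7}},
]

if __name__ == "__main__":
    print(bucket_sort_top(buckets))  # the whole winning bucket
    print(max_bucket(buckets))       # only the value and its key(s)
```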

How to display only the key from the bucket

I have an index with millions of documents. Suppose each of my documents has some code, and I need to find the list of codes matching some criteria. The only way I found to do that is to use a whole lot of aggregations, so I created an ugly query which does exactly what I want:
POST my-index/_search
{
"query": {
"range": {
"timestamp": {
"gte": "2017-08-01T00:00:00.000",
"lt": "2017-08-08T00:00:00.000"
}
}
},
"size": 0,
"aggs": {
"codes": {
"terms": {
"field": "code",
"size": 10000
},
"aggs": {
"days": {
"date_histogram": {
"field": "timestamp",
"interval": "day",
"format": "dd"
},
"aggs": {
"hours": {
"date_histogram": {
"field": "timestamp",
"interval": "hour",
"format": "yyyy-MM-dd:HH"
},
"aggs": {
"hour_income": {
"sum": {
"field": "price"
}
}
}
},
"max_income": {
"max_bucket": {
"buckets_path": "hours>hour_income"
}
},
"day_income": {
"sum_bucket": {
"buckets_path": "hours.hour_income"
}
},
"more_than_sixty_percent": {
"bucket_script": {
"buckets_path": {
"dayIncome": "day_income",
"maxIncome": "max_income"
},
"script": "params.maxIncome - params.dayIncome * 60 / 100 > 0 ? 1 : 0"
}
}
}
},
"amount_of_days": {
"sum_bucket": {
"buckets_path": "days.more_than_sixty_percent"
}
},
"bucket_filter": {
"bucket_selector": {
"buckets_path": {
"amountOfDays": "amount_of_days"
},
"script": "params.amountOfDays >= 3"
}
}
}
}
}
}
The response I get is a few million lines of JSON, consisting of buckets. Each bucket has more than 700 lines (and buckets of its own), but all I need is its key, so that I have my list of codes. I guess it's not good to have a response a few thousand times larger than necessary, and there might be problems with parsing. So I wanted to ask: is there any way to hide the other info in the bucket and get only the keys?
Thanks.
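
One option worth trying is the response filtering mentioned in the previous thread, for example appending filter_path=aggregations.codes.buckets.key to the search URL. Whether that exact path trims the response as hoped for this query is an assumption that has not been verified here. Failing that, the keys can be picked out client-side. Here is a minimal Python sketch with a hypothetical, heavily shortened response; the codes and the helper name bucket_keys are invented.

```python
def bucket_keys(response, agg_name="codes"):
    """Return only the bucket keys of a terms aggregation from a search response."""
    aggs = response.get("aggregations", {})
    buckets = aggs.get(agg_name, {}).get("buckets", [])
    return [b["key"] for b in buckets]


# Hypothetical, shortened response; real buckets would carry the nested
# "days" histograms and pipeline results as well.
response = {
    "aggregations": {
        "codes": {
            "buckets": [
                {"key": "A17", "doc_count": 120, "days": {"buckets": []}},
                {"key": "B42", "doc_count": 95, "days": {"buckets": []}},
            ]
        }
    }
}

if __name__ == "__main__":
    print(bucket_keys(response))
```

Client-side extraction still transfers the full response, so it only helps with the parsing concern, not with the response size.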
