Range query in Elasticsearch does not work properly

I have an index that contains objects with eventvalue and eventtime fields. I want to write a query that returns an aggregated event count based on eventvalue for the last 30 seconds. I also need empty buckets for seconds in which there were no events, since I need to display this data on a graph.
So I wrote the following query:
{
  "query" : {
    "bool" : {
      "must" : [
        {
          "range" : {
            "eventtime" : {
              "gte" : "now-30s/s",
              "lte" : "now/s",
              "format" : "yyyy-MM-dd HH:mm:ss",
              "time_zone": "+03:00"
            }
          }
        },
        {
          "range" : {
            "eventvalue" : {
              "lte" : 3
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "values_agg": {
      "terms": {
        "field": "eventvalue",
        "min_doc_count" : 0,
        "order": {
          "_term": "asc"
        }
      },
      "aggs": {
        "events_over_time" : {
          "date_histogram" : {
            "field" : "eventtime",
            "interval" : "1s",
            "min_doc_count" : 0,
            "extended_bounds" : {
              "min" : "now-30s/s",
              "max" : "now/s"
            },
            "format" : "yyyy-MM-dd HH:mm:ss",
            "time_zone": "+03:00"
          }
        }
      }
    }
  }
}
This query is not working properly and I don't know why. Specifically, the first range query gives me the desired interval (if I remove it, I get values from all time), but the second range query seems to have no effect. Eventvalue can be anywhere from 1 to 10, and the desired result is three buckets for eventvalues 1-3. However, I get all 10 buckets with all events.
How can I fix this query so that it still returns empty buckets, but only for the selected eventvalues?

I believe you need to remove "min_doc_count": 0 from your terms aggregation. To achieve the empty buckets you're aiming for, you only need min_doc_count in the date_histogram aggregation.
Per the documentation for the terms aggregation:
Setting min_doc_count=0 will also return buckets for terms that didn’t
match any hit.
This explains why you are seeing buckets for eventvalues greater than 3: they were filtered out by the query, but brought back in by the terms aggregation.
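In other words, keep your aggregation section as it is and drop that one line (a sketch of your own aggregations with only min_doc_count removed from the terms level):

"aggs": {
  "values_agg": {
    "terms": {
      "field": "eventvalue",
      "order": {
        "_term": "asc"
      }
    },
    "aggs": {
      "events_over_time" : {
        "date_histogram" : {
          "field" : "eventtime",
          "interval" : "1s",
          "min_doc_count" : 0,
          "extended_bounds" : {
            "min" : "now-30s/s",
            "max" : "now/s"
          },
          "format" : "yyyy-MM-dd HH:mm:ss",
          "time_zone": "+03:00"
        }
      }
    }
  }
}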
UPDATE
Since it is possible that some eventvalues do not occur anywhere in the 30-second time slice, the other approach I would recommend is to manually specify the discrete values you want as buckets using a filters aggregation. See the documentation here.
Try using this for your aggregations:
"aggs": {
"values_agg": {
"filters": {
"filters": {
"1": { "term": { "eventvalue": 1 }},
"2": { "term": { "eventvalue": 2 }},
"3": { "term": { "eventvalue": 3 }}
}
},
"aggs": {
"events_over_time" : {
"date_histogram" : {
"field" : "eventtime",
"interval" : "1s",
"min_doc_count" : 0,
"extended_bounds" : {
"min" : "now-30s/s",
"max" : "now/s"
},
"format" : "yyyy-MM-dd HH:mm:ss",
"time_zone": "+03:00"
}
}
}
}
}
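With named filters, the response keys each bucket by the name you gave it, so every selected eventvalue gets a bucket even when it matched no documents at all. The shape is roughly as follows (illustrative counts, not real output):

"aggregations": {
  "values_agg": {
    "buckets": {
      "1": { "doc_count": 0, "events_over_time": { "buckets": [ ... ] } },
      "2": { "doc_count": 7, "events_over_time": { "buckets": [ ... ] } },
      "3": { "doc_count": 2, "events_over_time": { "buckets": [ ... ] } }
    }
  }
}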

Related

How to filter by sub-aggregated results in Elasticsearch

I've got the following Elasticsearch query to get the number of product sales per hour, grouped by product id and hour of sale.
POST /my_sales/_search?size=0
{
  "aggs": {
    "sales_per_hour": {
      "date_histogram": {
        "field": "event_time",
        "fixed_interval": "1h",
        "format": "yyyy-MM-dd:HH:mm"
      },
      "aggs": {
        "sales_per_hour_per_product": {
          "terms": {
            "field": "name.keyword"
          }
        }
      }
    }
  }
}
One example of the data:
{
  "#timestamp" : "2020-10-29T18:09:56.921Z",
  "name" : "my-beautifull_product",
  "event_time" : "2020-10-17T08:01:33.397Z"
}
This query returns several buckets (one per hour and per product), but I would like to retrieve only those that have a doc_count higher than 10, for example. Is that possible?
For those results I would like to know the id of the product and the event_time bucket.
Thanks for your help.
Perhaps using the Bucket Selector feature will help filter the results.
Try this search query:
{
  "aggs": {
    "sales_per_hour": {
      "date_histogram": {
        "field": "event_time",
        "fixed_interval": "1h",
        "format": "yyyy-MM-dd:HH:mm"
      },
      "aggs": {
        "sales_per_hour_per_product": {
          "terms": {
            "field": "name.keyword"
          },
          "aggs": {
            "the_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "the_doc_count": "_count"
                },
                "script": "params.the_doc_count > 10"
              }
            }
          }
        }
      }
    }
  }
}
It will keep only the buckets whose document count is greater than 10, based on "params.the_doc_count > 10", and filter out the rest.
Thank you for your help. This is not far from what I would like, but not exactly; with the bucket selector I have something like this:
"aggregations" : {
"sales_per_hour" : {
"buckets" : [
{
"key_as_string" : "2020-08-31:23:00",
"key" : 1598914800000,
"doc_count" : 16,
"sales_per_hour_per_product" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "my_product_1",
"doc_count" : 2
},
{
"key" : "my_product_2",
"doc_count" : 2
},
{
"key" : "myproduct_3",
"doc_count" : 12
}
]
}
}
]
}
And sometimes none of the buckets are greater than 10. Is it possible to have the same thing, but with the filter on _count applied to the second-level aggregation (sales_per_hour_per_product) and not to the first level (sales_per_hour)?
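(One possible direction, as an untested sketch: keep the bucket_selector above on the product level, add a max_bucket over the product counts, and use a second bucket_selector at the sales_per_hour level to drop hour buckets in which even the best-selling product stays at or below 10. The max_frequency_last_day pattern in the next answer below does the same kind of collapsing; the names max_product_count and keep_hour here are my own.)

"aggs": {
  "sales_per_hour": {
    "date_histogram": {
      "field": "event_time",
      "fixed_interval": "1h",
      "format": "yyyy-MM-dd:HH:mm"
    },
    "aggs": {
      "sales_per_hour_per_product": {
        "terms": { "field": "name.keyword" },
        "aggs": {
          "the_filter": {
            "bucket_selector": {
              "buckets_path": { "the_doc_count": "_count" },
              "script": "params.the_doc_count > 10"
            }
          }
        }
      },
      "max_product_count": {
        "max_bucket": {
          "buckets_path": "sales_per_hour_per_product>_count"
        }
      },
      "keep_hour": {
        "bucket_selector": {
          "buckets_path": { "max_count": "max_product_count" },
          "script": "params.max_count != null && params.max_count > 10"
        }
      }
    }
  }
}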

Elastic script from buckets and higher level aggregation

I want to compare the daily average of a metric (the frequency of words appearing in texts) to the value on a specific day, over the course of a week. My goal is to check whether there's a spike: if the last day is way higher than the daily average, I'd trigger an alarm.
So from my input in Elasticsearch I compute the daily average over the week and find the value for the last day of that week.
For the daily average, I simply cut out a week's worth of data using a range query on the date field, so all my available data is from the given week. I compute the sum and divide by 7 for a daily average.
For the last day's value, I did a terms aggregation on the date field with descending order and size 1, as suggested in a different question (How to select the last bucket in a date_histogram selector in Elasticsearch).
The whole output is as follows. Here you can see the words "rama0" and "rama1" with their corresponding frequencies.
{
  "aggregations" : {
    "the_keywords" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "rama0",
          "doc_count" : 4200,
          "the_last_day" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 3600,
            "buckets" : [
              {
                "key" : 1580169600000,
                "key_as_string" : "2020-01-28T00:00:00.000Z",
                "doc_count" : 600,
                "the_last_day_frequency" : {
                  "value" : 3000.0
                }
              }
            ]
          },
          "the_weekly_sum" : {
            "value" : 21000.0
          },
          "the_daily_average" : {
            "value" : 3000.0
          }
        },
        {
          "key" : "rama1",
          "doc_count" : 4200,
          "the_last_day" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 3600,
            "buckets" : [
              {
                "key" : 1580169600000,
                "key_as_string" : "2020-01-28T00:00:00.000Z",
                "doc_count" : 600,
                "the_last_day_frequency" : {
                  "value" : 3000.0
                }
              }
            ]
          },
          "the_weekly_sum" : {
            "value" : 21000.0
          },
          "the_daily_average" : {
            "value" : 3000.0
          }
        },
        [...]
      ]
    }
  }
}
Now I have the_daily_average at a high level of the output, and the_last_day_frequency inside the single-element buckets list of the_last_day aggregation. I cannot use a bucket_script to compare them, because I cannot refer to a single bucket if I place the script outside the_last_day aggregation, and I cannot refer to higher-level aggregations if I place the script inside the_last_day.
IMO the reasonable thing to do would be to put the script outside the aggregation and use a buckets_path with the <AGG_NAME><MULTIBUCKET_KEY> syntax mentioned in the docs, but I have tried "var1": "the_last_day[1580169600000]>the_last_day_frequency" and variations (hardcoding it first until it works), and I haven't been able to refer to a particular bucket.
My ultimate goal is to have a list of keywords for which the last day's frequency greatly exceeds the daily average.
For anyone interested, my current query is as follows. Notice that the part I'm struggling with is commented out.
body='{
  "query": {
    "range": {
      "date": {
        "gte": "START",
        "lte": "END"
      }
    }
  },
  "aggs": {
    "the_keywords": {
      "terms": {
        "field": "keyword",
        "size": 100
      },
      "aggs": {
        "the_weekly_sum": {
          "sum": {
            "field": "frequency"
          }
        },
        "the_daily_average" : {
          "bucket_script": {
            "buckets_path": {
              "weekly_sum": "the_weekly_sum"
            },
            "script": {
              "inline": "return params.weekly_sum / 7"
            }
          }
        },
        "the_last_day": {
          "terms": {
            "field": "date",
            "size": 1,
            "order": {"_key": "desc"}
          },
          "aggs": {
            "the_last_day_frequency": {
              "sum": {
                "field": "frequency"
              }
            }
          }
        }/*,
        "the_spike": {
          "bucket_script": {
            "buckets_path": {
              "last_day_frequency": "the_last_day>the_last_day_frequency",
              "daily_average": "the_daily_average"
            },
            "script": {
              "inline": "return last_day_frequency / daily_average"
            }
          }
        }*/
      }
    }
  }
}'
In your query, the_last_day>the_last_day_frequency points to a bucket, not a single value, so it throws an error. You need a single metric value from the_last_day_frequency, which you can get with a max_bucket aggregation. Then you can use a bucket_selector aggregation to compare the last day's value with the average value.
Query:
"aggs": {
"the_keywords": {
"terms": {
"field": "keyword",
"size": 100
},
"aggs": {
"the_weekly_sum": {
"sum": {
"field": "frequency"
}
},
"the_daily_average": {
"bucket_script": {
"buckets_path": {
"weekly_sum": "the_weekly_sum"
},
"script": {
"inline": "return params.weekly_sum / 7"
}
}
},
"the_last_day": {
"terms": {
"field": "date",
"size": 1,
"order": {
"_key": "desc"
}
},
"aggs": {
"the_last_day_frequency": {
"sum": {
"field": "frequency"
}
}
}
},
"max_frequency_last_day": {
"max_bucket": {
"buckets_path": "the_last_day>the_last_day_frequency"
}
},
"the_spike": {
"bucket_selector": {
"buckets_path": {
"last_day_frequency": "max_frequency_last_day",
"daily_average": "the_daily_average"
},
"script": {
"inline": "params.last_day_frequency > params.daily_average"
}
}
}
}
}
}
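The bucket_selector then drops every keyword bucket where the last day's frequency does not exceed the daily average, leaving only the keywords that spiked. The key step is max_bucket: it collapses the single-bucket the_last_day terms aggregation into one metric value that a buckets_path can address directly.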

How to group documents by hours in elastic search aggregation?

I tried to group my documents by hour for one day through an aggregation, but I always get the exception "expected field name but got [START_OBJECT]". What's the problem?
{
  "query" : {
    "bool" : {
      "must" : {
        "range" : {
          "timestamp" : {
            "from" : "2017-08-14 00:00:00",
            "to" : "2017-08-15 00:00:00",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      }
    }
  },
  "aggs": {
    "result_by_hours": {
      "histogram": {
        "script": "doc.timestamp.date.getHourOfDay()",
        "interval": 1
      }
    }
  }
}
What I expect is the number of documents for each hour of yesterday. And how can I use a dynamic, relative time range instead of "2017-08-14 - 2017-08-15"?
Thanks in advance:)
Depending on your ES version, you can use a range filter/query relative to now; for example, now-1d/d goes one day back in time.
See examples at https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
As for the aggs, you can also group by an interval of, for instance, one hour using a date_histogram with an interval.
In ES 5.5:
Query:
"range" : {
"timestamp" : {
"gte" : "now-1d/d,
"lte" : "now/d"
}
}
Aggs:
{
  "aggs" : {
    "values_over_time" : {
      "date_histogram" : {
        "field" : "timestamp",
        "interval" : "1h"
      }
    }
  }
}
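Putting the two together, a full request could look like this (a sketch that simply combines the relative range query above with the hourly histogram):

{
  "query": {
    "bool": {
      "must": {
        "range": {
          "timestamp": {
            "gte": "now-1d/d",
            "lte": "now/d"
          }
        }
      }
    }
  },
  "aggs": {
    "values_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h"
      }
    }
  }
}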

Timeseries histogram of data with Elasticsearch

I have a list of documents organized as followed:
{
  "date": "2010-12-12",       // some valid datetime string
  "category": "some_category" // this can be any string
}
I need to create a frequency distribution for the data within buckets of time. I have looked at the date_histogram API but that only gets me halfway there.
{
  "size": 0,
  "aggs" : {
    "my_search" : {
      "date_histogram" : {
        "field" : "date",
        "interval" : "1s"
      }
    }
  }
}
This returns the count of documents that fall into each 1-second bucket. Within those 1-second buckets, I also need to aggregate the data into category buckets, so that I'm left with buckets of time holding counts per category. Is there a built-in method to do this?
You're on the right path; you simply need to add another terms sub-aggregation for the category field:
{
  "size": 0,
  "aggs" : {
    "my_search" : {
      "date_histogram" : {
        "field" : "date",
        "interval" : "1s"
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "category"
          }
        }
      }
    }
  }
}
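Each 1-second bucket in the response then carries a nested categories object with one sub-bucket per category value, along these lines (illustrative values, not real output):

{
  "key_as_string": "2010-12-12T00:00:00.000Z",
  "key": 1292112000000,
  "doc_count": 5,
  "categories": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "some_category", "doc_count": 3 },
      { "key": "another_category", "doc_count": 2 }
    ]
  }
}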

Converting SQL query to ElasticSearch Query

I want to convert the following SQL query to an Elasticsearch one. Can anyone help with this?
select csgg, sum(amount) from table1
where type in ('a','b','c') and year = 2016 and fc = '33'
group by csgg having sum(amount) = 0
I tried the following:
{
  "size": 500,
  "query" : {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "must" : [
            {"term" : {"fc" : "33"}},
            {"term" : {"year" : 2016}}
          ],
          "should" : [
            {"terms" : {"type" : ["a","b","c"] }}
          ]
        }
      }
    }
  },
  "aggs": {
    "group_by_csgg": {
      "terms": {
        "field": "csgg"
      },
      "aggs": {
        "sum_amount": {
          "sum": {
            "field": "amount"
          }
        }
      }
    }
  }
}
but I'm not sure I'm doing it right, as I can't validate the results. It seems the having condition needs to go inside the aggregation.
Assuming that you use Elasticsearch 2.x, there is a possibility to get the having semantics in Elasticsearch. I'm not aware of a possibility prior to 2.0.
You can use the new Bucket Selector pipeline aggregation, which keeps only the buckets that meet a certain criterion:
POST test/test/_search
{
  "size": 0,
  "query" : {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "must" : [
            {"term" : {"fc" : "33"}},
            {"term" : {"year" : 2016}},
            {"terms" : {"type" : ["a","b","c"] }}
          ]
        }
      }
    }
  },
  "aggs": {
    "group_by_csgg": {
      "terms": {
        "field": "csgg",
        "size": 100
      },
      "aggs": {
        "sum_amount": {
          "sum": {
            "field": "amount"
          }
        },
        "no_amount_filter": {
          "bucket_selector": {
            "buckets_path": {"sumAmount": "sum_amount"},
            "script": "sumAmount == 0"
          }
        }
      }
    }
  }
}
However, there are two caveats. First, depending on your configuration, it might be necessary to enable scripting like this:
script.aggs: true
script.groovy: true
Second, since the bucket_selector operates on the parent aggregation's buckets, it is not guaranteed that you get all buckets with amount = 0: if the terms aggregation happens to select only terms whose sum of amount is non-zero, you will get no results.
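One way to reduce that risk (a sketch on my part, assuming amounts are non-negative so the smallest sums sort first) is to order the terms buckets by sum_amount ascending, so that zero-sum groups are selected before any others:

"group_by_csgg": {
  "terms": {
    "field": "csgg",
    "size": 100,
    "order": { "sum_amount": "asc" }
  },
  "aggs": {
    "sum_amount": {
      "sum": { "field": "amount" }
    },
    "no_amount_filter": {
      "bucket_selector": {
        "buckets_path": {"sumAmount": "sum_amount"},
        "script": "sumAmount == 0"
      }
    }
  }
}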
