Histogram over fixed range of dates (i.e. fixed number of buckets) even when data is absent - elasticsearch

My goal is to build a histogram between a start date and an end date. Dates with no data should still appear in the histogram, with a count of zero.
I am trying the following query to fetch the last 7 days:
POST my_index/_search
{
"size": 0,
"query": {
"range": {
"date": {
"gte": "now-7d/d",
"lte": "now/d"
}
}
},
"aggs" : {
"count_per_day" : {
"date_histogram" : {
"field" : "date",
"interval" : "day",
"order": {"_key": "desc"},
"min_doc_count": 0
}
}
}
}
The issue is that I only have data for the last 3 days; there is no data at all before that. In this case, the result contains only the last 3 days, and the earlier days are not returned at all.
However, if there is a gap (i.e. there is data 6 days ago, but none on the 5th and 4th days), the empty days do appear with a count of zero.
How can I force the absent dates to be returned even if there is no data?
In other words, how can I fix the number of buckets (to 7 in the example above) even if there is no data?

You have already added "min_doc_count": 0 to include empty buckets. All you need to do is add the extended_bounds parameter as well, to force the first and last buckets. It is described in the date_histogram aggregation documentation.
Update your query as below:
{
"size": 0,
"query": {
"range": {
"date": {
"gte": "now-7d/d",
"lte": "now/d"
}
}
},
"aggs": {
"count_per_day": {
"date_histogram": {
"field": "date",
"interval": "day",
"order": {
"_key": "desc"
},
"min_doc_count": 0,
"extended_bounds": {
"min": "now-7d/d",
"max": "now/d"
}
}
}
}
}
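One thing worth checking is the bucket count: the range from now-7d/d through now/d covers eight calendar days, not seven, because both the rounded start day and today are included. A minimal Python sketch that enumerates the daily keys this range spans (the helper name daily_bucket_keys and the example instant are made up for illustration, and UTC day boundaries are assumed):

```python
from datetime import datetime, timedelta, timezone


def daily_bucket_keys(start, end):
    """Return the UTC midnight of every day from start's day through end's day, inclusive."""
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    last = end.replace(hour=0, minute=0, second=0, microsecond=0)
    keys = []
    while day <= last:
        keys.append(day)
        day += timedelta(days=1)
    return keys


# Arbitrary example instant standing in for "now".
now = datetime(2019, 8, 14, 15, 30, tzinfo=timezone.utc)
keys = daily_bucket_keys(now - timedelta(days=7), now)
print(len(keys))  # 8
```

If exactly seven buckets are wanted, starting at now-6d/d in both the range query and extended_bounds would be the corresponding change.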

Related

Histogram is not starting at the right min even filter added

The Mapping
"eventTime": {
"type": "long"
},
The Query
POST some_indices/_search
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"range": {
"eventTime": {
"from": 1563120000000,
"to": 1565712000000,
"format": "epoch_millis"
}
}
}
}
},
"aggs": {
"min_eventTime": { "min" : { "field": "eventTime"} },
"max_eventTime": { "max" : { "field": "eventTime"} },
"time_series": {
"histogram": {
"field": "eventTime",
"interval": 86400000,
"min_doc_count" : 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
}
}
}
}
The Response
"aggregations": {
"max_eventTime": {
"value": 1565539199997
},
"min_eventTime": {
"value": 1564934400000
},
"time_series": {
"buckets": [
{
"key": 1563062400000,
"doc_count": 0
},
{
"key": 1563148800000,
"doc_count": 0
},
{
...
Question
As the reference clearly states:
For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.
I set the filter properly (as the demo does), and the min and max aggregations confirm it.
But why is the first key still SMALLER than the from value (and than min_eventTime)?
This is so weird that I am totally lost now ;(
Any advice will be appreciated ;)
References
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation
I hacked out a solution for now, but I kind of think it's a bug in Elasticsearch.
I am using date_histogram instead, even though the field itself is of type long, and I used offset to move the starting point forward to the right timestamp.
"aggs": {
"time_series": {
"date_histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": "+16h",
"min_doc_count": 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
},
"aggs": {
"order_amount_total": {
"sum": {
"field": "order_amount"
}
}
}
}
}
Updated
Thanks to the help of @Val, I thought about it again and ran the following test:
@Test
public void testComputation() {
System.out.println(1563120000000L % 86400000L); // 57600000
System.out.println(1563062400000L % 86400000L); // 0
}
I want to quote from the doc
With extended_bounds setting, you now can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
But I believe the specific min value should be a multiple of the interval (0, interval, 2 * interval, 3 * interval, ...) rather than an arbitrary value like the one I used in the question.
So in my case, I can use the offset parameter of histogram to solve the issue, as follows.
I don't actually need date_histogram at all.
"histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": 57600000,
"min_doc_count" : 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
}
A clear explanation posted by Elasticsearch team member @polyfractal (thank you for the detailed, crystal-clear explanation) confirms the same logic; more details can be found in that explanation.
One reason for the design that I want to quote here:
if we cut the aggregation off right at the extended_bounds.min/max, we would generate buckets that are not the full interval and that would break many assumptions about how the histogram works.
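The rounding discussed above can be reproduced outside Elasticsearch. The histogram documentation gives the bucket key as floor((value - offset) / interval) * interval + offset; the following Python sketch applies that formula to the values from the question (the function name bucket_key is only illustrative):

```python
def bucket_key(value, interval, offset=0):
    """Bucket key following the documented histogram rounding formula."""
    return ((value - offset) // interval) * interval + offset


DAY_MS = 86_400_000
bounds_min = 1563120000000  # extended_bounds.min from the question

# Without an offset, the lower bound is rounded down to a multiple of the interval.
key_no_offset = bucket_key(bounds_min, DAY_MS)

# With offset = bounds_min % interval, the first bucket starts exactly at the bound.
offset = bounds_min % DAY_MS
key_with_offset = bucket_key(bounds_min, DAY_MS, offset)

print(key_no_offset)    # 1563062400000
print(offset)           # 57600000
print(key_with_offset)  # 1563120000000
```

The first printed key matches the unexpected first bucket in the response, and the computed offset matches the value used in the final histogram snippet.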

How to calculate the number of empty bucket when aggregating by days?

I want to get the number of days that a person stayed in a town in May (Month equal to 5).
This is my query, but it gives me the number of entries in myindex that have PersonID equal to 111 and Month equal to 5. For example, this query may give me an output like 90, but there are at most 31 days in a month.
GET myindex/_search
{
"size":0,
"query": {
"bool": {
"must": [
{ "match": {
"PersonID": "111"
}},
{ "match": {
"Month": "5"
}}
]
} },
"aggs": {
"stay_days": {
"terms" : {
"field": "Month"
}
}
}
}
In myindex I have fields like DateTime, which holds the date and time when a person was registered by a camera, e.g. "2017-05-01T00:30:08". During a single day the same person may pass the camera several times, but that day should be counted only once.
How can I update my query to calculate the number of days per month instead of the number of camera captures?
Assuming your DateTime field is called datetime, one option to consider is the date_histogram aggregation:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"PersonID": "111"
}
},
{
"range": {
"datetime": {
"gte": "2017-05-01",
"lt": "2017-06-01"
}
}
}
]
}
},
"aggregations": {
"my_day_histogram": {
"date_histogram": {
"field": "datetime",
"interval": "1d",
"min_doc_count": 1
}
}
}
}
Note that in the must clause I used a range query on the datetime field instead of matching on Month (not strictly necessary, but it may make the Month field redundant). You may also need to adjust the date format in the range query to match your mapping.
my_day_histogram: divides the data into buckets of separate days by setting "interval": "1d".
"min_doc_count": 1 removes buckets that contain zero documents.
Another approach: remove the range/match for month 5 and extend the histogram to every day in the year.
The day histogram can also be nested under a month histogram, like so (the date field is named first_timestamp in this snippet):
"aggregations": {
"my_month_histogram": {
"date_histogram": {
"field": "first_timestamp",
"interval": "1M",
"min_doc_count": 1
},
"aggregations": {
"my_day_histogram": {
"date_histogram": {
"field": "first_timestamp",
"interval": "1d"
}
}
}
}
}
It's clear to me that, either way, you'll need to count the number of buckets, which gives the number of days.
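Counting the buckets can be done on the client side. A minimal Python sketch, assuming the response has already been parsed into a dict; the function name count_days_with_data and the sample response are hand-written for illustration and are not real output:

```python
def count_days_with_data(response, agg_name="my_day_histogram"):
    """Count the daily buckets that contain at least one document."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return sum(1 for bucket in buckets if bucket["doc_count"] > 0)


# Hand-written sample shaped like a date_histogram response (not real data).
sample_response = {
    "aggregations": {
        "my_day_histogram": {
            "buckets": [
                {"key_as_string": "2017-05-01T00:00:00.000Z", "doc_count": 4},
                {"key_as_string": "2017-05-02T00:00:00.000Z", "doc_count": 0},
                {"key_as_string": "2017-05-03T00:00:00.000Z", "doc_count": 1},
            ]
        }
    }
}

print(count_days_with_data(sample_response))  # 2
```

The doc_count check keeps the count correct even when empty buckets are returned, for example when min_doc_count is left at 0.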

Elastic search date_histogram extended_bounds

I want to get a date_histogram for a specific period. How do I restrict the date period? Should I use the extended_bounds parameter? For example, I want to query the date_histogram between '2016-08-01' and '2016-08-31' with an interval of one day. I query with this expression:
{
"aggs": {
"cf_loan": {
"date_histogram": {
"field": "createDate",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "2016-08-01",
"max": "2016-08-31"
}
}
}
}
}
But the date_histogram I get back is not limited to that range.
You're almost there; you need to add a range query so that only documents whose createDate field is in the desired range are selected.
{
"query": {
"range": {
"createDate": {
"gte": "2016-08-01T00:00:00.000Z",
"lt": "2016-09-01T00:00:00.000Z"
}
}
},
"aggs": {
"cf_loan": {
"date_histogram": {
"field": "createDate",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "2016-08-01",
"max": "2016-08-31"
}
}
}
}
}
The role of the extended_bounds parameter is to make sure you'll get daily buckets from min to max even if there are no documents in them. For instance, say you have 1 document each day between 2016-08-04 and 2016-08-28, then without the extended_bounds parameter, you'd get 25 buckets (2016-08-04, 2016-08-05, 2016-08-06, ..., 2016-08-28).
With the extended_bounds parameter, you'll also get the following buckets but with 0 documents:
2016-08-01
2016-08-02
2016-08-03
2016-08-29
2016-08-30
2016-08-31
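The bucket arithmetic in this example can be checked with a short Python sketch (the helper name days_between is made up; it only does calendar arithmetic and does not call Elasticsearch):

```python
from datetime import date, timedelta


def days_between(first, last):
    """Return every calendar day from first through last, inclusive."""
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


# Days that have a document, per the example above.
with_docs = set(days_between(date(2016, 8, 4), date(2016, 8, 28)))
# Days covered by extended_bounds.
all_buckets = days_between(date(2016, 8, 1), date(2016, 8, 31))
# Buckets that exist only because of extended_bounds.
empty = [d.isoformat() for d in all_buckets if d not in with_docs]

print(len(with_docs), len(all_buckets))  # 25 31
print(empty)
```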

Elasticsearch Date_Histogram does not cover entire filter

I'm using the ES date histogram and a weird behavior started happening, and I'm wondering why.
This is the request I'm sending to Elasticsearch:
{
"from": 0,
"size": 0,
"query": {
"filtered": {
"filter": {
"and": [
{
"bool": {
"must": [
{
"range": {
"publishTime": {
"from": "2010-07-02T12:15:20.000Z",
"to": "2015-07-08T12:43:59.000Z"
}
}
}
]
}
}
]
}
}
},
"aggs": {
"agg|date_histogram|publishTime": {
"date_histogram": {
"field": "publishTime",
"interval": "1d",
"min_doc_count": 0
}
}
}
}
The results I'm getting are buckets, and the first bucket is:
{
"key_as_string": "2010-08-24T00:00:00.000Z",
"key": 1282608000000,
"doc_count": 1
}
So I'm filtering from 2010-07-02 but getting results only from 2010-08-24 onwards.
This is just an example; I also saw this behavior with many more missing buckets (several months).
[edit]
This seems to correlate with the date of the first result: the first result in that time range is from 2010-08-24. But since I included "min_doc_count": 0, I expected to get buckets for the entire range.
min_doc_count is only sufficient for returning empty buckets between the first and last documents matched by your filter. If you want to get results for the entire range, you need to use extended_bounds as well:
"aggs": {
"agg|date_histogram|publishTime": {
"date_histogram": {
"field": "publishTime",
"interval": "1d",
"min_doc_count": 0,
"extended_bounds": {
"min": 1278072920000,
"max": 1436359439000
}
}
}
}
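The min and max values above are the filter's from and to timestamps expressed as epoch milliseconds. A small Python sketch of that conversion (the helper name to_epoch_millis is made up, and it assumes timestamps in exactly the UTC format used in the question):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(iso_utc):
    """Convert a timestamp like 2010-07-02T12:15:20.000Z to epoch milliseconds."""
    parsed = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%S.%fZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_millis("2010-07-02T12:15:20.000Z"))  # 1278072920000
print(to_epoch_millis("2015-07-08T12:43:59.000Z"))  # 1436359439000
```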

How can I count the number of documents where a field is within a certain range?

I am trying to build an elasticsearch query that counts the number of documents where a certain field is within a certain range. This aggregation is also contained inside of a date histogram aggregation, but I don't think that matters for the purpose of this question.
Example Data:
ID: Score
01: 4
02: 5
03: 10
04: 9
I would like to count the number of documents where 'Score' is >= 9. I have tried scripts and filters within this aggregation, but I can't get it to work.
This aggregation counts all documents, not just the ones that match the script.
"aggs": {
"report_days": {
"date_histogram": {
"field": "Date",
"interval": "day"
},
"aggs": {
"value_count": {
"field": "Score",
"script": "_value >=9"
}
}
}
}
This following aggregation gives me a parse failure, saying Parse Failure [Expected [START_OBJECT] under [field], but got a [VALUE_STRING] in [value_count]]:
"aggs": {
"report_days": {
"date_histogram": {
"field": "Date",
"interval": "day"
},
"aggs": {
"value_count": {
"field": "Score",
"filter": {
"range": {
"Score": {
"gte": 9
}
}
}
}
}
}
}
Thanks for any suggestions!
This query will give you the number of docs with score >= 9
{
"query": {
"range": {
"score": {
"gte": 9
}
}
}
}
and this agg will do the same
{
"aggs": {
"my agg": {
"range": {
"field": "score",
"ranges": [
{
"from": 9
}
]
}
}
}
}
Run the query ("score:>=9") and check the hits->total value. See the examples in the doc.
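Since the question asks for the count inside each date_histogram bucket, one possible shape is a filter sub-aggregation nested under the histogram. The sketch below only builds the request body as a Python dict; the aggregation name high_scores and the helper name build_request are made up, the field names Date and Score are taken from the question, and the body has not been run against a cluster:

```python
import json


def build_request(min_score=9):
    """Build a search body that counts, per day, the documents with Score >= min_score."""
    return {
        "size": 0,
        "aggs": {
            "report_days": {
                "date_histogram": {"field": "Date", "interval": "day"},
                "aggs": {
                    # A filter aggregation keeps only matching documents in each bucket.
                    "high_scores": {
                        "filter": {"range": {"Score": {"gte": min_score}}}
                    }
                },
            }
        },
    }


body = build_request()
print(json.dumps(body, indent=2))
```

With a body of this shape, each daily bucket would carry a high_scores object whose doc_count is the number of matching documents for that day.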
