Elasticsearch: aggregation min_doc_count for weeks doesn't work

I have the following aggregation with interval=week and min_doc_count=0:
{
  "aggs": {
    "scores_by_date": {
      "date_histogram": {
        "field": "date",
        "format": "yyyy-MM-dd",
        "interval": "week",
        "min_doc_count": 0
      }
    }
  }
}
and a date filter from Jan-01-2015 to Feb-23-2015:
{
  "range": {
    "document.date": {
      "from": "2015-01-01",
      "to": "2015-02-23"
    }
  }
}
I expected Elasticsearch to fill in all seven weeks, even the empty ones, and return a bucket for each, but I end up with only one:
{
  "aggregations": {
    "scores_by_date": {
      "buckets": [
        {
          "key_as_string": "2015-01-05",
          "key": 1420416000000,
          "doc_count": 5
        }
      ]
    }
  }
}
Elasticsearch version: 1.4.0
What is wrong with my aggregation, or how can I tell Elasticsearch to fill in the missing weeks?

You can try specifying extended bounds (this feature is documented on the official doc page for histogram aggregations). The most relevant nugget from those docs is this:
With the extended_bounds setting, you now can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
So your aggregation may have to look something like this to force ES to return empty buckets in that range:
{
  "aggs": {
    "scores_by_date": {
      "date_histogram": {
        "field": "date",
        "format": "yyyy-MM-dd",
        "interval": "week",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2015-01-01",
          "max": "2015-02-23"
        }
      }
    }
  }
}
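As a rough sanity check, here is a small Python sketch (no Elasticsearch involved) that mimics how the "week" interval rounds bucket keys, assuming ISO weeks starting on Monday in UTC; with the bounds above it predicts nine weekly buckets, the second of which is the 2015-01-05 bucket from the response:

```python
from datetime import date, timedelta

def week_buckets(start, end):
    # Mimic date_histogram "week" rounding: bucket keys are the
    # Mondays of the ISO weeks covering [start, end].
    first = start - timedelta(days=start.weekday())  # round down to Monday
    buckets = []
    current = first
    while current <= end:
        buckets.append(current)
        current += timedelta(weeks=1)
    return buckets

keys = week_buckets(date(2015, 1, 1), date(2015, 2, 23))
print(keys[0], keys[1], len(keys))  # 2014-12-29 2015-01-05 9
```

Note that because keys are rounded down to Monday, you can get one more bucket than the naive "seven weeks" count suggests.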

Related

Date_histogram and top_hits from unique values only

I am trying to do a date_histogram aggregation to show a sum of Duration for each hour.
I have the following documents:
{
  "EntryTimestamp": 1567029600000,
  "Username": "johndoe",
  "UpdateTimestamp": 1567029600000,
  "Duration": 10,
  "EntryID": "ASDF1234"
}
The following works very well, but my problem is that sometimes multiple documents appear with the same EntryID. So ideally I would need to add a top_hits somehow and order by UpdateTimestamp, as I need the last updated document for each unique EntryID. But I'm not sure how to add this to my query.
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "EntryTimestamp": {
              "gte": "1567029600000",
              "lte": "1567065599999",
              "format": "epoch_millis"
            }
          }
        },
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "Username.keyword:johndoe"
          }
        }
      ]
    }
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "interval": "1h",
        "field": "EntryTimestamp",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "1567029600000",
          "max": "1567065599999"
        },
        "format": "epoch_millis"
      },
      "aggs": {
        "1": {
          "sum": {
            "field": "Duration"
          }
        }
      }
    }
  }
}
I think you'll need a top_hits aggregation inside a terms aggregation.
The terms aggregation will get the distinct EntryIDs, and the top_hits aggregation inside it will get only the most recent document (based on UpdateTimestamp) for each bucket (each distinct value) of the terms aggregation.
I don't have exact syntax adapted to your context, and I believe you might run into some issues with the number of sub-aggregations (I ran into some limitations with advanced aggregations in the past).
You can see this post for more info on that; I hope it proves helpful to you.
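A sketch of that shape, built as a Python dict for readability; the field names come from the question, but the terms size, the agg names, and the EntryID.keyword sub-field are assumptions about your mapping, not something verified against it:

```python
# Sketch: terms over distinct EntryIDs, with a top_hits sub-aggregation
# that keeps only the most recently updated document per EntryID.
dedup_agg = {
    "by_entry": {
        "terms": {"field": "EntryID.keyword", "size": 10000},
        "aggs": {
            "latest": {
                "top_hits": {
                    "size": 1,
                    "sort": [{"UpdateTimestamp": {"order": "desc"}}],
                }
            }
        },
    }
}

# top_hits must sit under the terms bucket, not beside it
assert "aggs" in dedup_agg["by_entry"]
assert dedup_agg["by_entry"]["aggs"]["latest"]["top_hits"]["size"] == 1
```

One caveat: top_hits is a metric aggregation, so you cannot directly sum Duration over its results; deduplicated sums generally need client-side post-processing or a different approach.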

Histogram is not starting at the right min even with filter added

The Mapping
"eventTime": {
"type": "long"
},
The Query
POST some_indices/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "eventTime": {
            "from": 1563120000000,
            "to": 1565712000000,
            "format": "epoch_millis"
          }
        }
      }
    }
  },
  "aggs": {
    "min_eventTime": { "min": { "field": "eventTime" } },
    "max_eventTime": { "max": { "field": "eventTime" } },
    "time_series": {
      "histogram": {
        "field": "eventTime",
        "interval": 86400000,
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1563120000000,
          "max": 1565712000000
        }
      }
    }
  }
}
The Response
"aggregations": {
"max_eventTime": {
"value": 1565539199997
},
"min_eventTime": {
"value": 1564934400000
},
"time_series": {
"buckets": [
{
"key": 1563062400000,
"doc_count": 0
},
{
"key": 1563148800000,
"doc_count": 0
},
{
...
Question
As the reference clearly mentions:
For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.
I set the filter properly (as the demo does), and the min and max aggregations also provide the evidence.
But why is the first key still SMALLER THAN the from (or min_eventTime)?
So weird, I'm totally lost now ;(
Any advice will be appreciated ;)
References
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation
I hacked out a solution for now, but I kind of think it's a bug in Elasticsearch.
I am using date_histogram instead, even though the field itself is a long type, and via offset I moved the starting point forward to the right timestamp.
"aggs": {
"time_series": {
"date_histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": "+16h",
"min_doc_count": 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
},
"aggs": {
"order_amount_total": {
"sum": {
"field": "order_amount"
}
}
}
}
}
Updated
Thanks to @Val for the help; I re-thought about it and wrote a test as follows:
@Test
public void testComputation() {
    System.out.println(1563120000000L % 86400000L); // 57600000
    System.out.println(1563062400000L % 86400000L); // 0
}
I want to quote from the doc
With extended_bounds setting, you now can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
But I believe the specific min value should be a multiple of the interval (0, interval, 2 * interval, 3 * interval, ...) rather than an arbitrary value like the one I used in the question.
So basically, in my case I could use the offset parameter of histogram to solve the issue as follows.
I don't actually need date_histogram at all.
"histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": 57600000,
"min_doc_count" : 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
}
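The arithmetic behind those numbers can be reproduced in a few lines of Python, a sketch of the key computation the histogram docs describe (key = floor((value - offset) / interval) * interval + offset):

```python
def bucket_key(value, interval, offset=0):
    # Histogram bucketing: round value down to the nearest
    # interval boundary, shifted by offset.
    return (value - offset) // interval * interval + offset

day = 86400000  # one day in milliseconds

# Without an offset, the extended_bounds.min is rounded down past itself...
print(bucket_key(1563120000000, day))            # 1563062400000
# ...but with offset = min % interval, the first key lands exactly on it.
print(bucket_key(1563120000000, day, 57600000))  # 1563120000000
```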
A clear explanation posted by Elasticsearch team member @polyfractal (thank you for the detailed, crystal-clear explanation) proves the same logic; more details can be found here.
The reason for the design, which I want to quote here:
if we cut the aggregation off right at the extended_bounds.min/max, we would generate buckets that are not the full interval and that would break many assumptions about how the histogram works.

Elasticsearch: how to display all documents matching date range aggregation

Following the Elastic docs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
Question:
How can I make a date range aggregation that displays all the documents matching each date bucket, not just the doc_count?
The Aggregation :
{
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "1M",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
Response:
{
  "aggregations": {
    "articles_over_time": {
      "buckets": [
        {
          "key_as_string": "2013-02-02",
          "key": 1328140800000,
          "doc_count": 1
        },
        {
          "key_as_string": "2013-03-02",
          "key": 1330646400000,
          "doc_count": 2 // how to display the whole JSON?
                         // Here I want to display all the documents
                         // as an array, NOT only doc_count: 2
        },
        ...
      ]
    }
  }
}
Maybe I need to do some sub-aggregation or something else?
Any ideas?
You have to add a top_hits sub-aggregation to your date_histogram aggregation. All the options can be read here.
Your final aggregation would look like this:
{
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "1M",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "documents": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}
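With that sub-aggregation in place, each bucket in the response carries the matching documents under the "documents" key, next to doc_count. Reading them back might look like this (a Python sketch over a mocked response shaped like the examples above; the document fields are placeholders):

```python
# Mocked response fragment: each bucket now carries a "documents"
# top_hits result alongside doc_count.
response = {
    "aggregations": {
        "articles_over_time": {
            "buckets": [
                {
                    "key_as_string": "2013-03-02",
                    "doc_count": 2,
                    "documents": {
                        "hits": {
                            "hits": [
                                {"_source": {"title": "a"}},
                                {"_source": {"title": "b"}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}

# Pull the full documents out of each date bucket.
for bucket in response["aggregations"]["articles_over_time"]["buckets"]:
    docs = [h["_source"] for h in bucket["documents"]["hits"]["hits"]]
    print(bucket["key_as_string"], len(docs))
```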
As Sumit says; however, I think what you really want is to create a filter with a date range:
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-range-query.html#ranges-on-dates
That way you filter out documents outside the date range and only keep the right ones. Then you can do everything you want with the results.

Elasticsearch Date_Histogram does not cover entire filter

I'm using the ES date histogram, and a weird behavior started happening that I'm wondering about.
This is the request I'm sending to Elasticsearch:
{
  "from": 0,
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "and": [
          {
            "bool": {
              "must": [
                {
                  "range": {
                    "publishTime": {
                      "from": "2010-07-02T12:15:20.000Z",
                      "to": "2015-07-08T12:43:59.000Z"
                    }
                  }
                }
              ]
            }
          }
        ]
      }
    }
  },
  "aggs": {
    "agg|date_histogram|publishTime": {
      "date_histogram": {
        "field": "publishTime",
        "interval": "1d",
        "min_doc_count": 0
      }
    }
  }
}
The result I'm getting is a list of buckets, and the first bucket is:
{
  "key_as_string": "2010-08-24T00:00:00.000Z",
  "key": 1282608000000,
  "doc_count": 1
}
So I'm filtering from 2010-07-02 but getting results only from 2010-08-24.
This is just an example; I also saw this behavior with many more missing buckets (several months).
[edit]
This seems to correlate with the date of the first result, meaning that the first result in that time range is from 2010-08-24. But since I included "min_doc_count": 0, I expect to get buckets for the entire range.
min_doc_count is only sufficient for returning empty buckets between the first and last documents matched by your filter. If you want results for the entire range, you need to use extended_bounds as well:
"aggs": {
  "agg|date_histogram|publishTime": {
    "date_histogram": {
      "field": "publishTime",
      "interval": "1d",
      "min_doc_count": 0,
      "extended_bounds": {
        "min": 1278072920000,
        "max": 1436359439000
      }
    }
  }
}
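Those two bounds are just the filter's dates converted to epoch milliseconds; a quick way to double-check such values is a small standard-library Python sketch:

```python
from datetime import datetime

def to_epoch_millis(iso):
    # Parse an ISO-8601 UTC timestamp ("...Z") into epoch milliseconds.
    dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
    return int(dt.timestamp() * 1000)

print(to_epoch_millis("2010-07-02T12:15:20.000Z"))  # 1278072920000
print(to_epoch_millis("2015-07-08T12:43:59.000Z"))  # 1436359439000
```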

Calculating sum of nested fields with date_histogram aggregation in Elasticsearch

I'm having trouble getting the sum of a nested field in Elasticsearch using a date_histogram, and I'm hoping somebody can lend me a hand.
I have a mapping that looks like this:
"client" : {
// various irrelevant stuff here...
"associated_transactions" : {
"type" : "nested",
"include_in_parent" : true,
"properties" : {
"amount" : {
"type" : "double"
},
"effective_at" : {
"type" : "date",
"format" : "dateOptionalTime"
}
}
}
}
I'm trying to get a date_histogram that shows total revenue by month across all clients, i.e. a time series showing the sum of associated_transactions.amount in buckets determined by associated_transactions.effective_at. I tried running this query:
{
  "query": {
    // ...
  },
  "aggregations": {
    "revenue": {
      "date_histogram": {
        "interval": "month",
        "min_doc_count": 0,
        "field": "associated_transactions.effective_at"
      },
      "aggs": {
        "monthly_revenue": {
          "sum": {
            "field": "associated_transactions.amount"
          }
        }
      }
    }
  }
}
But the sum it's giving me isn't right. It seems that what ES is doing is finding all clients who have any transaction in a given month, then summing all of the transactions (from any time) for those clients. That is, it's a sum of the amount spent in the lifetime of a client who made a purchase in a given month, not the sum of purchases in a given month.
Is there any way to get the data I'm looking for, or is this a limitation in how ES handles nested fields?
Thanks very much in advance for your help!
David
Try this?
{
  "query": {
    // ...
  },
  "aggregations": {
    "revenue": {
      "date_histogram": {
        "interval": "month",
        "min_doc_count": 0,
        "field": "associated_transactions.effective_at",
        "aggs": {
          "monthly_revenue": {
            "sum": {
              "field": "associated_transactions.amount"
            }
          }
        }
      }
    }
  }
}
i.e. move the "aggs" key into the "date_histogram" field.
I stumbled upon this question while trying to solve a similar problem with my implementation of ES.
It seems that Elasticsearch looks at the position of an aggregation in the JSON request body tree, not at any inheritance of its objects and fields. So you should not put your sum aggregation "inside" the "date_histogram" body, but place it outside, on the same level of the JSON tree.
This worked for me:
{
  "size": 0,
  "aggs": {
    "histogram_aggregation": {
      "date_histogram": {
        "field": "date_field",
        "calendar_interval": "day"
      },
      "aggs": {
        "views": {
          "sum": {
            "field": "the_field_i_want_to_sum"
          }
        }
      }
    }
  },
  "query": {
    #some query
  }
}
The OP made the mistake of placing the sum aggregation inside the date_histogram aggregation.
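The placement rule is easy to check mechanically; here is a sketch of the two shapes as Python dicts (field names are placeholders, not taken from any real mapping):

```python
# Correct: "aggs" sits next to "date_histogram" inside the named agg.
right = {
    "histogram_aggregation": {
        "date_histogram": {"field": "date_field", "calendar_interval": "day"},
        "aggs": {"views": {"sum": {"field": "some_field"}}},
    }
}

# Wrong: "aggs" buried inside the "date_histogram" body is invalid,
# because date_histogram only accepts its own parameters there.
wrong = {
    "histogram_aggregation": {
        "date_histogram": {
            "field": "date_field",
            "calendar_interval": "day",
            "aggs": {"views": {"sum": {"field": "some_field"}}},
        }
    }
}

assert "aggs" not in right["histogram_aggregation"]["date_histogram"]
assert "aggs" in wrong["histogram_aggregation"]["date_histogram"]
```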
