Histogram not starting at the right min even with a filter added - elasticsearch

The Mapping
"eventTime": {
"type": "long"
},
The Query
POST some_indices/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "eventTime": {
            "from": 1563120000000,
            "to": 1565712000000,
            "format": "epoch_millis"
          }
        }
      }
    }
  },
  "aggs": {
    "min_eventTime": { "min": { "field": "eventTime" } },
    "max_eventTime": { "max": { "field": "eventTime" } },
    "time_series": {
      "histogram": {
        "field": "eventTime",
        "interval": 86400000,
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1563120000000,
          "max": 1565712000000
        }
      }
    }
  }
}
The Response
"aggregations": {
"max_eventTime": {
"value": 1565539199997
},
"min_eventTime": {
"value": 1564934400000
},
"time_series": {
"buckets": [
{
"key": 1563062400000,
"doc_count": 0
},
{
"key": 1563148800000,
"doc_count": 0
},
{
...
Question
As the reference clearly mentions:
For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.
I set the filter properly (as the demo does), and the min and max aggregations also provide the evidence.
But why is the first key still SMALLER than the from (or min_eventTime)?
So weird, and I am totally lost now ;(
Any advice will be appreciated ;)
References
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation

I hacked out a solution for now, but I kind of think it's a bug in Elasticsearch.
I am using date_histogram instead, even though the field itself is a long type, and via offset I moved the starting point forward to the right timestamp.
"aggs": {
"time_series": {
"date_histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": "+16h",
"min_doc_count": 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
},
"aggs": {
"order_amount_total": {
"sum": {
"field": "order_amount"
}
}
}
}
}
Updated
Thanks to @Val for the help; I re-thought about it and ran a test as follows:
@Test
public void testComputation() {
    System.out.println(1563120000000L % 86400000L); // 57600000
    System.out.println(1563062400000L % 86400000L); // 0
}
I want to quote from the doc:
With extended_bounds setting, you now can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
But I believe the specific min value has to fall on a bucket boundary, i.e. offset plus a multiple of the interval (0, interval, 2 * interval, 3 * interval, ...), instead of an arbitrary value as I used in the question. The histogram computes each bucket key as Math.floor((value - offset) / interval) * interval + offset, so with the default offset of 0 my min of 1563120000000 was rounded down to 1563062400000, the nearest multiple of 86400000.
So basically, in my case, I can use the offset parameter of histogram to solve the issue as follows (57600000 being the remainder computed above).
I don't actually need date_histogram at all.
"histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": 57600000,
"min_doc_count" : 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
}
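Since 1563120000000 % 86400000 = 57600000 matches the offset exactly, the requested min now falls on a bucket boundary, so the first bucket key should be 1563120000000 itself. A sketch of the expected response shape (doc counts elided, not an actual captured response):
"time_series": {
  "buckets": [
    {
      "key": 1563120000000,
      "doc_count": ...
    },
    {
      "key": 1563206400000,
      "doc_count": ...
    },
    ...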
A clear explanation posted by Elasticsearch team member @polyfractal (thank you for the detailed, crystal-clear explanation) proves the same logic; more details can be found here.
A reason for the design, which I want to quote here:
if we cut the aggregation off right at the extended_bounds.min/max, we would generate buckets that are not the full interval and that would break many assumptions about how the histogram works.

Related

Is it possible to make elasticsearch aggregations faster when I only want the top 10 buckets?

I understand that elasticsearch aggregation queries by nature take a long time to execute, especially on high-cardinality fields. In our use case, we only need to bring back the first x buckets sorted alphabetically. Considering we only need to bring back 10 buckets, is there a way to make our query faster? Is there a way to get Elasticsearch to look at only the first 10 buckets in each shard and compare those only?
Here is my query...
{
  "size": "600",
  "timeout": "60s",
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "firstname": {
      "terms": {
        "field": "firstname.keyword",
        "size": 10,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": {
          "_key": "asc"
        },
        "include": ".*.*",
        "exclude": ""
      }
    }
  }
}
I think I am on to something using a composite aggregation instead. Like this...
"home_address1": {
"composite": {
"size": 10,
"sources": [
{
"home_address1": {
"terms": {
"field": "home_address1.keyword",
"order": "asc"
}
}
}
]
}
}
Testing in Postman shows that this request is way faster. Is this expected? If so, awesome. How can I add the include/exclude attributes to the composite query? For example, sometimes I only want to include buckets whose value matches "A.*".
If this query should not be any faster, then why does it appear so?
Composite terms aggs unfortunately don't support include, exclude, and many other standard terms aggs' parameters, so you've got 2 options:
Filter out the docs that don't match your prefix from within the query, as @Val pointed out (see the sketch after the script example below),
or
use a script to do the filtering for you. This should be your last resort, though: scripts are pretty much guaranteed to run more slowly than standard query filters.
{
  "size": 0,
  "aggs": {
    "home_address1": {
      "composite": {
        "size": 10,
        "sources": [
          {
            "home_address1": {
              "terms": {
                "order": "asc",
                "script": {
                  "source": """
                    // keep only values starting with "A."; others map to null and are skipped
                    def val = doc['home_address1.keyword'].value;
                    if (val == null) { return null; }
                    if (val.indexOf('A.') == 0) {
                      return val;
                    }
                    return null;
                  """,
                  "lang": "painless"
                }
              }
            }
          }
        ]
      }
    }
  }
}
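For the first option, a minimal sketch of doing the filtering in the query instead (assuming you only need buckets whose values start with "A"; a prefix query is cheap and leaves the composite agg untouched):
{
  "size": 0,
  "query": {
    "prefix": {
      "home_address1.keyword": "A"
    }
  },
  "aggs": {
    "home_address1": {
      "composite": {
        "size": 10,
        "sources": [
          {
            "home_address1": {
              "terms": {
                "field": "home_address1.keyword",
                "order": "asc"
              }
            }
          }
        ]
      }
    }
  }
}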
BTW, your original terms agg already seems optimized, so it surprises me that the composite is faster. (One plausible reason: a composite agg walks the keys in sorted order and only materializes one page of size buckets at a time, while a terms agg still has to collect counts for every distinct value on each shard before it can sort and cut off the top 10.)

Date_histogram and top_hits from unique values only

I am trying to do a date_histogram aggregation to show a sum of Duration for each hour.
I have the following documents:
{
  "EntryTimestamp": 1567029600000,
  "Username": "johndoe",
  "UpdateTimestamp": 1567029600000,
  "Duration": 10,
  "EntryID": "ASDF1234"
}
The following works very well, but my problem is that sometimes multiple documents appear with the same EntryID. So ideally I would need to add a top_hits somehow and order by the UpdateTimestamp, as I need the last-updated document for each unique EntryID. But I'm not sure how to add this to my query.
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "EntryTimestamp": {
              "gte": "1567029600000",
              "lte": "1567065599999",
              "format": "epoch_millis"
            }
          }
        },
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "Username.keyword:johndoe"
          }
        }
      ]
    }
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "interval": "1h",
        "field": "EntryTimestamp",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "1567029600000",
          "max": "1567065599999"
        },
        "format": "epoch_millis"
      },
      "aggs": {
        "1": {
          "sum": {
            "field": "Duration"
          }
        }
      }
    }
  }
}
I think you'll need a top_hits aggregation inside a terms aggregation.
The terms aggregation will get the distinct EntryIDs, and the top_hits aggregation inside it will get only the most recent document (based on UpdateTimestamp) for each bucket (each distinct value) of the terms aggregation.
I don't have exact syntax adapted to your context, and I believe you might run into some issues regarding the number of sub-aggregations (I ran into some limitations with advanced aggregations in the past).
You can see this post for more info on that; I hope it'll prove to be helpful to you.
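As a rough sketch of that shape, reusing the field names from the question (assuming EntryID is indexed with a keyword sub-field; the terms size of 1000 is an arbitrary placeholder to tune to your cardinality):
"aggs": {
  "unique_entries": {
    "terms": {
      "field": "EntryID.keyword",
      "size": 1000
    },
    "aggs": {
      "latest_doc": {
        "top_hits": {
          "size": 1,
          "sort": [
            { "UpdateTimestamp": { "order": "desc" } }
          ]
        }
      }
    }
  }
}
Note that this returns the latest document per EntryID, but the per-hour Duration sum would then have to be computed client-side, since a plain sum under the date_histogram would still count the duplicates.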

Elasticsearch Date_Histogram does not cover entire filter

I'm using ES Date Histogram and a weird behavior started happening and I'm wondering why.
This is the request I'm sending to elasticsearch:
{
  "from": 0,
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "and": [
          {
            "bool": {
              "must": [
                {
                  "range": {
                    "publishTime": {
                      "from": "2010-07-02T12:15:20.000Z",
                      "to": "2015-07-08T12:43:59.000Z"
                    }
                  }
                }
              ]
            }
          }
        ]
      }
    }
  },
  "aggs": {
    "agg|date_histogram|publishTime": {
      "date_histogram": {
        "field": "publishTime",
        "interval": "1d",
        "min_doc_count": 0
      }
    }
  }
}
The results I'm getting are buckets, and the first bucket is:
{
  "key_as_string": "2010-08-24T00:00:00.000Z",
  "key": 1282608000000,
  "doc_count": 1
}
So I'm filtering from 2010-07-02 and getting results only from 2010-08-24.
This is just an example; I also saw this behavior with many more missing buckets (several months).
[edit]
This seems to correlate with the date of the first result, meaning that the first result in that time range is from 2010-08-24. But as I included "min_doc_count": 0, I expect to get results for that entire range.
min_doc_count is only sufficient for returning empty buckets between the first and last documents matched by your filter. If you want to get results for the entire range you need to use extended_bounds as well:
"aggs": {
"agg|date_histogram|publishTime": {
"date_histogram": {
"field": "publishTime",
"interval": "1d",
"min_doc_count": 0
"extended_bounds": {
"min": 1278072920000,
"max": 1436359439000
}
}
}
}
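Note that since this is a date_histogram, the extended_bounds can also be given as date strings instead of epoch millis, which keeps them visibly in sync with the range filter (same bounds as above, just reformatted):
"extended_bounds": {
  "min": "2010-07-02T12:15:20.000Z",
  "max": "2015-07-08T12:43:59.000Z"
}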

Elasticsearch: aggregation min_doc_count for weeks doesn't work

I have the following aggregation with interval=week and min_doc_count=0:
{
  "aggs": {
    "scores_by_date": {
      "date_histogram": {
        "field": "date",
        "format": "yyyy-MM-dd",
        "interval": "week",
        "min_doc_count": 0
      }
    }
  }
}
and a date filter from Jan-01-2015 to Feb-23-2015:
{
  "range": {
    "document.date": {
      "from": "2015-01-01",
      "to": "2015-02-23"
    }
  }
}
I expected Elasticsearch to fill all seven weeks with buckets even if empty, but I ended up with only one item:
{
  "aggregations": {
    "scores_by_date": {
      "buckets": [
        {
          "key_as_string": "2015-01-05",
          "key": 1420416000000,
          "doc_count": 5
        }
      ]
    }
  }
}
Elasticsearch version: 1.4.0
What is wrong with my aggregation, or how can I tell Elasticsearch to fill in the missing weeks?
You can try specifying extended bounds (there's documentation discussing this feature on the official doc page for histogram aggregations). The most relevant nugget from those docs is this:
With extended_bounds setting, you now can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
So your aggregation may have to look something like this to force ES to return empty buckets in that range (note that with a week interval, bucket keys snap to weeks starting on Monday, which is why the one bucket you did get is keyed 2015-01-05 rather than 2015-01-01):
{
  "aggs": {
    "scores_by_date": {
      "date_histogram": {
        "field": "date",
        "format": "yyyy-MM-dd",
        "interval": "week",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2015-01-01",
          "max": "2015-02-23"
        }
      }
    }
  }
}

Calculating sum of nested fields with date_histogram aggregation in Elasticsearch

I'm having trouble getting the sum of a nested field in Elasticsearch using a date_histogram, and I'm hoping somebody can lend me a hand.
I have a mapping that looks like this:
"client" : {
// various irrelevant stuff here...
"associated_transactions" : {
"type" : "nested",
"include_in_parent" : true,
"properties" : {
"amount" : {
"type" : "double"
},
"effective_at" : {
"type" : "date",
"format" : "dateOptionalTime"
}
}
}
}
I'm trying to get a date_histogram that shows total revenue by month across all clients, i.e. a time series showing the sum of associated_transactions.amount in a histogram determined by associated_transactions.effective_at. I tried running this query:
{
  "query": {
    // ...
  },
  "aggregations": {
    "revenue": {
      "date_histogram": {
        "interval": "month",
        "min_doc_count": 0,
        "field": "associated_transactions.effective_at"
      },
      "aggs": {
        "monthly_revenue": {
          "sum": {
            "field": "associated_transactions.amount"
          }
        }
      }
    }
  }
}
But the sum it's giving me isn't right. It seems that what ES is doing is finding all clients who have any transaction in a given month, then summing all of the transactions (from any time) for those clients. That is, it's the sum of the amount spent over the lifetime of each client who made a purchase in a given month, not the sum of purchases in that month.
Is there any way to get the data I'm looking for, or is this a limitation in how ES handles nested fields?
Thanks very much in advance for your help!
David
Try this?
{
  "query": {
    // ...
  },
  "aggregations": {
    "revenue": {
      "date_histogram": {
        "interval": "month",
        "min_doc_count": 0,
        "field": "associated_transactions.effective_at",
        "aggs": {
          "monthly_revenue": {
            "sum": {
              "field": "associated_transactions.amount"
            }
          }
        }
      }
    }
  }
}
i.e. move the "aggs" key into the "date_histogram" field.
I stumbled upon this question while trying to solve a similar problem with my implementation of ES.
It seems that Elasticsearch currently looks at the position of an aggregation in the JSON request body tree, not at the inheritance of its objects and fields. So you should not put your sum aggregation "inside" "date_histogram", but place it outside, on the same JSON tree level.
This worked for me:
{
  "size": 0,
  "aggs": {
    "histogram_aggregation": {
      "date_histogram": {
        "field": "date_field",
        "calendar_interval": "day"
      },
      "aggs": {
        "views": {
          "sum": {
            "field": "the_field_i_want_to_sum"
          }
        }
      }
    }
  },
  "query": {
    // some query
  }
}
The OP made the mistake of placing his sum aggregation inside the date_histogram aggregation.
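Worth adding, since associated_transactions is a nested type in the question's mapping: another way to scope the sum correctly is to wrap everything in a nested aggregation, so that both the histogram and the sum operate on the nested transaction documents themselves rather than on the flattened parent values. A sketch, untested against the question's index:
{
  "size": 0,
  "aggs": {
    "transactions": {
      "nested": {
        "path": "associated_transactions"
      },
      "aggs": {
        "revenue": {
          "date_histogram": {
            "field": "associated_transactions.effective_at",
            "interval": "month",
            "min_doc_count": 0
          },
          "aggs": {
            "monthly_revenue": {
              "sum": {
                "field": "associated_transactions.amount"
              }
            }
          }
        }
      }
    }
  }
}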
