How to limit a date histogram aggregation of nested documents to a specific date range? - elasticsearch

Version
Using Elasticsearch 1.7.2
Objective
I would like to create a graph of the number of predictions made by users per day for the last n days. In this case, 10 days.
Current query
{
"size": 0,
"aggs": {
"predictions": {
"nested": {
"path": "user_answers"
},
"aggs": {
"predictions_over_time": {
"date_histogram": {
"field": "user_answers.created",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0
}
}
}
}
}
}
Issue
This query will return a histogram but will return buckets for all available dates across all documents. It doesn't restrict to a specific date range.
What have I tried?
I've tried a number of approaches to solving this, all of which have failed.
* Range filter, then histogram that
* Date range aggregation, then histogram the buckets
* Using extended_bounds with full dates, now-10d, and timestamps
* Trying a range filter inside the histogram aggregation
Any guidance would be appreciated! Thanks.

A query didn't work for me in that situation. What I used instead is a third level of aggs:
{
"size": 0,
"aggs": {
"user_answers": {
"nested": { "path": "user_answers" },
"aggs": {
"timed_user_answers": {
"filter": {
"range": {
"user_answers.created": {
"gte": "now-10d",
"lte": "now"
}
}
},
"aggs": {
"predictions_over_time": {
"date_histogram": {
"field": "user_answers.created",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0
}
}
}
}
}
}
}
}
One aggs level specifies nested, the next specifies the filter, and the last specifies the actual date_histogram. I don't know why the syntax works this way, but you don't seem to be able to combine two aggregation types in the same aggs entry.
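The same three-level structure can be generated for any window size. The following is a minimal Python sketch, not a tested client call: the function name `nested_histogram_body` and its parameters are hypothetical, and it only builds the request body as a dict. It assumes the lower bound of the range should be `now-<n>d` and the upper bound `now`.

```python
import json


def nested_histogram_body(path, date_field, days):
    """Build a search body: nested -> range filter -> date_histogram.

    Hypothetical helper; only the JSON layout mirrors the query above.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    return {
        "size": 0,
        "aggs": {
            "user_answers": {
                # Level 1: step into the nested documents.
                "nested": {"path": path},
                "aggs": {
                    "timed_user_answers": {
                        # Level 2: keep only nested docs in the date window.
                        "filter": {
                            "range": {
                                date_field: {
                                    "gte": "now-%dd" % days,
                                    "lte": "now",
                                }
                            }
                        },
                        "aggs": {
                            # Level 3: bucket the remaining docs per day.
                            "predictions_over_time": {
                                "date_histogram": {
                                    "field": date_field,
                                    "interval": "day",
                                    "format": "yyyy-MM-dd",
                                    "min_doc_count": 0,
                                }
                            }
                        },
                    }
                },
            }
        },
    }


body = nested_histogram_body("user_answers", "user_answers.created", 10)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the search request body with whatever client you already use.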

You need to add a query. It can be any kind of query, but not a post_filter, because post_filter is applied only after the aggregations have been computed. The query should be nested and contain the date range. One way to do this is a constant_score query that wraps a nested filter, which in turn wraps a range filter.
{
"query": {
"constant_score": {
"filter": {
"nested": {
"path": "user_answers",
"filter": {
"range": {
"user_answers.created": {
"gte": "now-10d",
"lte": "now"
}
}
}
}
}
}
}
}
Confirm if this works for you.

Related

Date_histogram and top_hits from unique values only

I am trying to do a date_histogram aggregation to show a sum of Duration for each hour.
I have the following documents:
{
"EntryTimestamp": 1567029600000,
"Username": "johndoe",
"UpdateTimestamp": 1567029600000,
"Duration": 10,
"EntryID": "ASDF1234"
}
The following works very well, but my problem is that sometimes multiple documents appear with the same EntryID. Ideally I would add a top_hits somehow and order by UpdateTimestamp, since I need only the last updated document for each unique EntryID. I'm not sure how to add this to my query.
{
"size": 0,
"query": {
"bool": {
"filter": [{
"range": {
"EntryTimestamp": {
"gte": "1567029600000",
"lte": "1567065599999",
"format": "epoch_millis"
}
}
}, {
"query_string": {
"analyze_wildcard": true,
"query": "Username.keyword=johndoe"
}
}
]
}
},
"aggs": {
"2": {
"date_histogram": {
"interval": "1h",
"field": "EntryTimestamp",
"min_doc_count": 0,
"extended_bounds": {
"min": "1567029600000",
"max": "1567065599999"
},
"format": "epoch_millis"
},
"aggs": {
"1": {
"sum": {
"field": "Duration"
}
}
}
}
}
}
I think you'll need a top_hits aggregation inside a terms aggregation.
The terms aggregation will get the distinct EntryIDs and the top hit aggregation inside of it will get only the most recent document (based on UpdateTimestamp) for each bucket (each distinct value) of the terms aggregation.
I don't have exact syntax adapted to your context, and I believe you might run into issues with the number of sub-aggregations (I ran into some limitations with advanced aggregations in the past).
You can see this post for more info on that; I hope it proves helpful.
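As a rough illustration of the idea above, here is a minimal Python sketch. It builds a date_histogram with a terms aggregation on EntryID and a top_hits of size 1 sorted by UpdateTimestamp descending, then sums Duration on the client from the response. The aggregation names (`per_hour`, `by_entry`, `latest`), the field name `EntryID.keyword`, the terms size of 1000, and both function names are assumptions. Note that this deduplicates per hour bucket only, and the sample response is hand-written, not real output.

```python
def latest_per_entry_aggs(entry_field="EntryID.keyword", max_entries=1000):
    """Aggs: hour buckets -> one bucket per EntryID -> most recent document."""
    return {
        "per_hour": {
            "date_histogram": {"field": "EntryTimestamp", "interval": "1h"},
            "aggs": {
                "by_entry": {
                    # One bucket per distinct EntryID (field name is assumed).
                    "terms": {"field": entry_field, "size": max_entries},
                    "aggs": {
                        "latest": {
                            # Keep only the most recently updated document.
                            "top_hits": {
                                "size": 1,
                                "sort": [{"UpdateTimestamp": {"order": "desc"}}],
                            }
                        }
                    },
                }
            },
        }
    }


def sum_latest_durations(aggregations):
    """Return {hour_key: total Duration}, counting each EntryID once per hour."""
    totals = {}
    for hour in aggregations["per_hour"]["buckets"]:
        total = 0
        for entry in hour["by_entry"]["buckets"]:
            hits = entry["latest"]["hits"]["hits"]
            if hits:
                total += hits[0]["_source"]["Duration"]
        totals[hour["key"]] = total
    return totals


# Hand-written sample in the shape of an aggregation response.
sample = {
    "per_hour": {
        "buckets": [
            {
                "key": 1567029600000,
                "by_entry": {
                    "buckets": [
                        {
                            "key": "ASDF1234",
                            "latest": {
                                "hits": {"hits": [{"_source": {"Duration": 10}}]}
                            },
                        }
                    ]
                },
            }
        ]
    }
}
print(sum_latest_durations(sample))
```

The sum is done on the client because top_hits returns documents rather than a numeric value, so a sum aggregation cannot be applied to its output directly.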

Defining a time range for aggregation in elasticsearch

I've got an index in ElasticSearch with documents having info about user connections to my platform. I want to build a query with day aggregation where I can count all users connected every day between two given dates.
I have 3 relevant fields to do so: user_id, connection_time_start, connection_time_end. I was doing the query this way:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"connection_time_start": {
"lte": "2017-08-04T23:59:59"
}
}
},
{
"range": {
"connection_time_end": {
"gte": "2017-08-02T00:00:00"
}
}
}
]
}
},
"aggs": {
"franja_horaria": {
"date_histogram": {
"field": "connection_time_start",
"interval": "day",
"format": "yyyy-MM-dd"
},
"aggs": {
"ids": {
"cardinality": {
"field": "user_id"
}
}
}
}
}
}
This query returns buckets containing the number of users whose connection started on August 2, 3, and 4. The problem is that some users have connections that start on day 2 and end on day 3 or even day 4.
These users should count toward the connected-user total for each of those days, but because I'm aggregating on connection_time_start, each connection only counts for the day it started.
I've tried adding a range to the aggregation, something like this (https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-bucket-daterange-aggregation.html), but haven't gotten a good result.
Can anybody help me with this? Thanks in advance!

Sub-aggregation or aggregation filter in elastic

I have a list of records with a name and a timestamp. For each name, I want the maximum timestamp, but only for names whose maximum timestamp is more than an hour old. In other words, the result should list names with their max timestamp, and any name that has a record newer than one hour ago should be left out entirely.
I tried to solve this with aggregations: a terms aggregation over name, then a max aggregation over timestamp, and then a filter to drop names whose max timestamp is newer than one hour ago, as follows:
{
"size": 0,
"aggs": {
"names_aggs": {
"terms": {
"field": "name",
"size": 10
},
"aggs": {
"max_timestamp": {
"max": {
"field": "timestamp"
},
"aggs": {
"sub-agg": {
"filter": {
"range": {
"timestamp": {
"lt": "now-1h"
}
}
}
}
}
}
}
}
}
}
However, this query produces the following error:
{
"type": "aggregation_initialization_exception",
"reason": "Aggregator [max_timestamp] of type [max] cannot accept sub-aggregations"
}
I can basically get a similar functionality by using the timestamp filter before the max aggregation as follows:
{
"size": 0,
"aggs": {
"names_aggs": {
"terms": {
"field": "name",
"size": 10
},
"aggs": {
"maximals": {
"filter": {
"range": {
"timestamp": {
"lt": "now-1h"
}
}
},
"aggs": {
"max_timestamp": {
"max": {
"field": "timestamp"
}
}
}
}
}
}
}
}
With this I get the name and max_timestamp for each name that passed the maximals filter, and a null max_timestamp for each name that didn't. I can work with this solution, but the query does not return when run against a large number of records, because the maximals filter runs for every name.
Thanks in advance for your help.

limiting date histogram to a date range without affecting results

I want to limit the results to a date range while performing the date histogram. But it seems to affect the results set (hits). Is there any way that I can do the same, but not affect the hits area?
A filter aggregation would be an ideal match here.
{
"query": {
"match": {
"Content": "my query"
}
},
"aggs": {
"filterByDate": {
"filter": {
"range": {
"<dateField>": {
"gte": "<StartDate>",
"lt": "<EndDate>"
}
}
},
"aggs": {
"dateStats": {
"date_histogram": {
"field": "<dateField>",
"interval": "<interval>"
}
}
}
}
}
}

ElasticSearch filtering by field1 THEN field2 THEN take max of field3

I am struggling to get the information that I need from ElasticSearch.
My log statements are like this:
field1: Example
field2: Example2
field3: Example3
I would like to search a timeframe (using last 24 hours) to find all data that has this in field1 and that in field2.
There then may be multiple this.that.[field3] entries, so I want to only return the maximum of that field.
In fact, in my data, field3 is actually the key of the entry.
What is the best way of retrieving the information I need? I have managed to get the results returned using aggs, but the data is in buckets, and I am only interested in the data with the max value of field3.
I have added an example of the query that I am looking to do: https://jsonblob.com/54535d49e4b0d117eeaf6bb4
{
"size": 0,
"aggs": {
"agg_129": {
"filters": {
"filters": {
"CarName: Toyota": {
"query": {
"query_string": {
"query": "CarName: Toyota"
}
}
}
}
},
"aggs": {
"agg_130": {
"filters": {
"filters": {
"Attribute: TimeUsed": {
"query": {
"query_string": {
"query": "Attribute: TimeUsed"
}
}
}
}
},
"aggs": {
"agg_131": {
"terms": {
"field": "#timestamp",
"size": 0,
"order": {
"_count": "desc"
}
}
}
}
}
}
}
},
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2014-10-27T00:00:00.000Z",
"lte": "2014-10-28T23:59:59.999Z"
}
}
}
],
"must_not": []
}
}
}
}
}
So, that example above is showing only those that have CarName = Toyota and Attribute = TimeUsed.
My data is as follows:
There are x number of cars CarName and each car has y number of Attributes and each of those Attributes have a document with a timestamp.
To begin with, I was looking for a query for CarName.Attribute.timestamp (latest), however, if I am able to use just ONE query to get the latest timestamp for EVERY attribute for EVERY CarName, then that would decrease query calls from ~50 to one.
If you are using Elasticsearch v1.3+, you can add a top_hits aggregation with size: 1 and a descending sort on the field3 value.
This will return the whole document with maximum value on the field, as you wish.
This example in the documentation might do the trick.
Edit:
Ok, it seems you don't need the whole document, but only the maximum timestamp value. You can use a max aggregation instead of using a top_hits one.
The following query (not tested) should give you the maximum timestamp value for each of the top 10 Attribute values within each of the top 10 CarName values, in a single request.
A terms aggregation is like a GROUP BY clause, so you should not have to query 50 times to retrieve the values for each CarName/Attribute combination: that is the point of nesting a terms aggregation for Attribute inside the CarName aggregation.
Note that, to work properly, the CarName and Attribute fields should be not_analyzed. If they are not, you will get "funny" results in your buckets. The problem (and a possible solution) is very well described here.
Feel free to change the size parameter of the terms aggregation to fit to your case.
{
"size": 0,
"aggs": {
"by_carnames": {
"terms": {
"field": "CarName",
"size": 10
},
"aggs": {
"by_attribute": {
"terms": {
"field": "Attribute",
"size": 10
},
"aggs": {
"max_timestamp": {
"max": {
"field": "#timestamp"
}
}
}
}
}
}
},
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2014-10-27T00:00:00.000Z",
"lte": "2014-10-28T23:59:59.999Z"
}
}
}
]
}
}
}
}
}
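To make the GROUP BY analogy concrete, here is a small Python sketch that flattens the response of a query like the one above into one entry per CarName/Attribute pair. The function name and the sample response are made up for illustration; the bucket layout (`buckets`, `key`, `value`) follows the usual shape of terms and max aggregation results.

```python
def latest_by_car_and_attribute(aggregations):
    """Return {(car_name, attribute): max_timestamp} from nested terms buckets."""
    rows = {}
    for car in aggregations["by_carnames"]["buckets"]:
        for attr in car["by_attribute"]["buckets"]:
            # A max aggregation reports its result under "value".
            rows[(car["key"], attr["key"])] = attr["max_timestamp"]["value"]
    return rows


# Hand-written sample in the shape of an aggregation response; not real data.
sample = {
    "by_carnames": {
        "buckets": [
            {
                "key": "Toyota",
                "by_attribute": {
                    "buckets": [
                        {"key": "TimeUsed", "max_timestamp": {"value": 1414454400000.0}},
                        {"key": "Mileage", "max_timestamp": {"value": 1414450800000.0}},
                    ]
                },
            }
        ]
    }
}

for (car, attribute), ts in sorted(latest_by_car_and_attribute(sample).items()):
    print(car, attribute, ts)
```

Each key of the returned dict corresponds to one row of a GROUP BY CarName, Attribute result.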
