How to calculate the number of empty buckets when aggregating by days? - elasticsearch

I want to get the number of days that a person stayed in a town in May (Month equal to 5).
This is my query, but it returns the number of entries in myindex that have PersonID equal to 111 and Month equal to 5. For example, it may return 90, even though a month has at most 31 days.
GET myindex/_search
{
"size":0,
"query": {
"bool": {
"must": [
{ "match": {
"PersonID": "111"
}},
{ "match": {
"Month": "5"
}}
]
} },
"aggs": {
"stay_days": {
"terms" : {
"field": "Month"
}
}
}
}
In myindex I have fields like DateTime, which holds the date and time when a person was registered by a camera, e.g. "2017-05-01T00:30:08". The same person may pass the camera several times during a single day, but that should be counted as one day.
How can I update my query to calculate the number of days per month instead of the number of camera captures?

Assuming your DateTime field is called datetime, one option is a date_histogram aggregation:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"PersonID": "111"
}
},
{
"range": {
"datetime": {
"gte": "2017-05-01",
"lt": "2017-06-01"
}
}
}
]
}
},
"aggregations": {
"my_day_histogram": {
"date_histogram": {
"field": "datetime",
"interval": "1d",
"min_doc_count": 1
}
}
}
}
Note that in the must clause I used a range query on the datetime field. This is not strictly necessary, but it makes the Month field redundant. You may also need to adjust the date format in the range query to match your mapping.
my_day_histogram divides the data into one bucket per day by setting "interval": "1d".
"min_doc_count": 1 removes buckets that contain zero documents.
Another approach is to remove the range/match on month 5 and extend the histogram to every day of the year.
This can also be combined with a month histogram, like so:
"aggregations": {
"my_month_histogram": {
"date_histogram": {
"field": "datetime",
"interval": "1M",
"min_doc_count": 1
},
"aggregations": {
"my_day_histogram": {
"date_histogram": {
"field": "datetime",
"interval": "1d",
"min_doc_count": 1
}
}
}
}
}
Either way, you need to count the number of buckets returned; that count is the number of days.
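Counting the buckets on the client could look like the sketch below. It assumes the response has already been parsed into a Python dict and that the aggregation is named my_day_histogram as in the query above; the function name and the sample response are made up for illustration.

```python
def count_days_with_hits(response, agg_name="my_day_histogram"):
    """Return how many daily buckets contain at least one document."""
    buckets = response["aggregations"][agg_name]["buckets"]
    # Guard against empty buckets in case min_doc_count was left at 0.
    return sum(1 for bucket in buckets if bucket["doc_count"] > 0)


# Hand-written sample shaped like a date_histogram response (not real data).
sample_response = {
    "aggregations": {
        "my_day_histogram": {
            "buckets": [
                {"key_as_string": "2017-05-01T00:00:00.000Z", "doc_count": 3},
                {"key_as_string": "2017-05-02T00:00:00.000Z", "doc_count": 1},
                {"key_as_string": "2017-05-07T00:00:00.000Z", "doc_count": 5},
            ]
        }
    }
}

print(count_days_with_hits(sample_response))  # 3
```

The person was seen nine times in the sample, but on only three distinct days, which is the number the question asks for.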

Related

Elasticsearch aggregate on term multiple times per different time range

I'm trying to aggregate a field by each half of the time-range given in the query. For example, here's the query:
{
"query": {
"simple_query_string": {
"query": "+sitetype:(redacted) +sort_date:[now-2h TO now]"
}
}
}
...and I want to aggregate on term "product1.keyword" from now-2h to now-1h and aggregate on the same term "product1.keyword" from now-1h to now, so like:
"terms": {
"field": "product1",
"size": 10
}
^ aggregate the top 10 results on product1 in now-2h TO now-1h,
and aggregate the top 10 results on product1 in now-1h TO now.
Clarification: product1 is not a date or time-related field. It would be like a type of car, phone, etc.
If you want to use now in your query, the product1 field must be of type date; then you can try the following:
GET index1/_search
{
"size": 0,
"aggs": {
"dataAgg": {
"date_range": {
"field": "product1",
"ranges": [
{
"from": "now-2h",
"to": "now-1h"
},
{
"from": "now-1h",
"to": "now"
}
]
},
"aggs": {
"top10": {
"top_hits": {
"size": 10
}
}
}
}
}
}
If you can't change the type of product1, you can try a range aggregation, but you must write the times explicitly instead of using now.
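If the time window lives in a separate date field, as sort_date does in the question's query, another option may be to put the date_range aggregation on that field and nest a terms aggregation on product1.keyword under it. The sketch below only builds the request body as a Python dict; the function name and the aggregation names by_hour and top_products are made up, and it assumes sort_date is mapped as a date.

```python
import json


def build_split_terms_body(date_field="sort_date",
                           term_field="product1.keyword",
                           size=10):
    """Build a search body with one terms aggregation per time window."""
    return {
        "size": 0,
        "aggs": {
            "by_hour": {
                # Date math such as now-2h is accepted in date_range bounds.
                "date_range": {
                    "field": date_field,
                    "ranges": [
                        {"from": "now-2h", "to": "now-1h"},
                        {"from": "now-1h", "to": "now"},
                    ],
                },
                # The sub-aggregation runs once inside each range bucket.
                "aggs": {
                    "top_products": {
                        "terms": {"field": term_field, "size": size}
                    }
                },
            }
        },
    }


body = build_split_terms_body()
print(json.dumps(body, indent=2))
```

Each of the two range buckets in the response would then carry its own top-10 list of product1 values.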

Histogram over fixed range of dates (i.e. fixed number of buckets) even when data is absent

My goal is to build a histogram between a start date and an end date. Dates with no data should still appear in the histogram, with a count of zero.
I am trying the following query to fetch the last 7 days:
POST my_index/_search
{
"size": 0,
"query": {
"range": {
"date": {
"gte": "now-7d/d",
"lte": "now/d"
}
}
},
"aggs" : {
"count_per_day" : {
"date_histogram" : {
"field" : "date",
"interval" : "day",
"order": {"_key": "desc"},
"min_doc_count": 0
}
}
}
}
The issue is that I only have data for the last 3 days, so there is no data at all before that. In this case, the result contains only the last 3 days and the earlier days are not returned at all.
But if there is a gap (i.e. there is data 6 days ago, but no data on the 5th and 4th days back), the empty days do appear, with a count of zero.
How can I force the absent dates to be returned even if there is no data?
In other words, how can I fix the number of buckets (to 7 in the example above) even if there is no data?
You have already added "min_doc_count": 0 to include empty buckets. All you need to do is add the extended_bounds parameter as well, to force the starting and ending buckets. The date_histogram documentation covers it in more detail.
Update your query as below:
{
"size": 0,
"query": {
"range": {
"date": {
"gte": "now-7d/d",
"lte": "now/d"
}
}
},
"aggs": {
"count_per_day": {
"date_histogram": {
"field": "date",
"interval": "day",
"order": {
"_key": "desc"
},
"min_doc_count": 0,
"extended_bounds": {
"min": "now-7d/d",
"max": "now/d"
}
}
}
}
}
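One detail worth checking is how many buckets these bounds produce. The sketch below is plain date arithmetic, not an Elasticsearch call; it assumes both bounds are inclusive whole days in a single time zone, and the function name and the fixed "today" are made up for illustration.

```python
from datetime import date, timedelta


def expected_bucket_days(today, days_back=7):
    """List the calendar days covered by now-<days_back>d/d through now/d."""
    start = today - timedelta(days=days_back)
    # Both the first and the last day get a bucket, hence days_back + 1.
    return [start + timedelta(days=i) for i in range(days_back + 1)]


days = expected_bucket_days(date(2020, 1, 10))
print(len(days))          # 8
print(days[0], days[-1])  # 2020-01-03 2020-01-10
```

Under these assumptions the bounds span eight daily buckets, today included. For exactly seven, "min": "now-6d/d" may be closer to what is wanted.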

Defining a time range for aggregation in elasticsearch

I've got an index in Elasticsearch with documents holding info about user connections to my platform. I want to build a query with a daily aggregation that counts all users connected on each day between two given dates.
I have 3 relevant fields to do so: user_id, connection_time_start, connection_time_end. I was doing the query this way:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"connection_time_start": {
"lte": "2017-08-04T23:59:59"
}
}
},
{
"range": {
"connection_time_end": {
"gte": "2017-08-02T00:00:00"
}
}
}
]
}
},
"aggs": {
"franja_horaria": {
"date_histogram": {
"field": "connection_time_start",
"interval": "day",
"format": "yyyy-MM-dd"
},
"aggs": {
"ids": {
"cardinality": {
"field": "user_id"
}
}
}
}
}
}
This query returns buckets containing the number of users whose connection started on August 2, 3, and 4. The problem is that some users have connections that start on day 2 and end on day 3, or even on day 4.
These users should count toward the connected-user total for each of those days, but because I'm aggregating on connection_time_start, they only count for the day the connection started.
I've tried adding a range to the aggregation, something like this (https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-bucket-daterange-aggregation.html), but haven't got a good result.
Can anybody help me with this? Thanks in advance!

Elastic search date_histogram extended_bounds

I want to get a date_histogram for a specific period. How do I restrict the date range? Should I use the extended_bounds parameter? For example, I want the date_histogram between '2016-08-01' and '2016-08-31' with a daily interval. I query with this expression:
{
"aggs": {
"cf_loan": {
"date_histogram": {
"field": "createDate",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "2016-08-01",
"max": "2016-08-31"
}
}
}
}
}
But the date_histogram I get back is not limited to that range.
You're almost there. You need to add a range query (the query section at the top of the request below) so that only documents whose createDate field is in the desired range are selected.
{
"query": {
"range": {
"createDate": {
"gte": "2016-08-01T00:00:00.000Z",
"lt": "2016-09-01T00:00:00.000Z"
}
}
},
"aggs": {
"cf_loan": {
"date_histogram": {
"field": "createDate",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "2016-08-01",
"max": "2016-08-31"
}
}
}
}
}
The role of the extended_bounds parameter is to make sure you'll get daily buckets from min to max even if there are no documents in them. For instance, say you have 1 document each day between 2016-08-04 and 2016-08-28, then without the extended_bounds parameter, you'd get 25 buckets (2016-08-04, 2016-08-05, 2016-08-06, ..., 2016-08-28).
With the extended_bounds parameter, you'll also get the following buckets but with 0 documents:
2016-08-01
2016-08-02
2016-08-03
2016-08-29
2016-08-30
2016-08-31
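The bucket arithmetic in this example can be checked with a short sketch. It uses only date arithmetic to mimic which days extended_bounds would pad; it does not call Elasticsearch, and the helper name daily_range is made up.

```python
from datetime import date, timedelta


def daily_range(start, end):
    """All calendar days from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


# Days that have documents, per the example above.
days_with_data = set(daily_range(date(2016, 8, 4), date(2016, 8, 28)))
# Days covered once extended_bounds is set to the whole month.
days_in_bounds = daily_range(date(2016, 8, 1), date(2016, 8, 31))

# Buckets that only exist because of extended_bounds (doc_count 0).
padded = [d.isoformat() for d in days_in_bounds if d not in days_with_data]

print(len(days_with_data))  # 25
print(len(days_in_bounds))  # 31
print(padded)
```

The padded list holds the six dates listed above, three before the data starts and three after it ends.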

How to limit a date histogram aggregation of nested documents to a specific date range?

Version
Using Elasticsearch 1.7.2
Objective
I would like to create a graph of the number of predictions made by users per day for the last n days. In this case, 10 days.
Current query
{
"size": 0,
"aggs": {
"predictions": {
"nested": {
"path": "user_answers"
},
"aggs": {
"predictions_over_time": {
"date_histogram": {
"field": "user_answers.created",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0
}
}
}
}
}
}
Issue
This query will return a histogram but will return buckets for all available dates across all documents. It doesn't restrict to a specific date range.
What have I tried?
I've tried a number of approaches to solving this, all of which have failed.
* Range filter, then histogram that
* Date range aggregation, then histogram the buckets
* Using extended_bounds with full dates, now-10d, and also timestamps
* Trying a range filter inside the histogram aggregation
Any guidance would be appreciated! Thanks.
A query didn't work for me in that situation; what I used instead was a third aggs level:
{
"size": 0,
"aggs": {
"user_answers": {
"nested": { "path": "user_answers" },
"aggs": {
"timed_user_answers": {
"filter": {
"range": {
"user_answers.created": {
"gte": "now-10d",
"lte": "now"
}
}
},
"aggs": {
"predictions_over_time": {
"date_histogram": {
"field": "user_answers.created",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0
}
}
}
}
}
}
}
}
One aggs level specifies nested, one specifies the filter, and the last specifies the actual aggregation. I don't know why the syntax works this way, but you don't seem to be able to put two of these on the same aggs level.
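The three levels can also be generated for any number of days with a small builder. This is a sketch that only assembles the request body as a Python dict, using the older bound as gte and now as lte; the function name and the aggregation name timed are made up, and the field names are taken from the question.

```python
def build_nested_histogram(path="user_answers", field="created", days=10):
    """Build nested -> filter -> date_histogram for the last `days` days."""
    full_field = f"{path}.{field}"
    return {
        "size": 0,
        "aggs": {
            path: {
                # Level 1: step into the nested documents.
                "nested": {"path": path},
                "aggs": {
                    "timed": {
                        # Level 2: keep only nested docs in the date window.
                        "filter": {
                            "range": {
                                full_field: {"gte": f"now-{days}d", "lte": "now"}
                            }
                        },
                        "aggs": {
                            # Level 3: the histogram over what is left.
                            "predictions_over_time": {
                                "date_histogram": {
                                    "field": full_field,
                                    "interval": "day",
                                    "format": "yyyy-MM-dd",
                                    "min_doc_count": 0,
                                }
                            }
                        },
                    }
                },
            }
        },
    }


body = build_nested_histogram()
print(body["aggs"]["user_answers"]["aggs"]["timed"]["filter"])
```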
You need to add a query. It can be any kind of query except a post_filter. It should be a nested query that contains the date range. One way is to use a constant_score query: inside it, use a nested filter that wraps a range filter.
{
"query": {
"constant_score": {
"filter": {
"nested": {
"path": "user_answers",
"filter": {
"range": {
"user_answers.created": {
"gte": "now-10d",
"lte": "now"
}
}
}
}
}
}
}
}
Confirm if this works for you.
