I have an Elasticsearch index with the following document:
{
  "dates": ["2014-01-31", "2014-02-01"]
}
I want to count all the instances of all the days in my index separated by year and month. I hoped to do this using a date histogram aggregation (which is successful for counting non-array properties):
{
  "from": 0,
  "size": 0,
  "aggregations": {
    "year": {
      "date_histogram": {
        "field": "dates",
        "interval": "1y",
        "format": "yyyy"
      },
      "aggregations": {
        "month": {
          "date_histogram": {
            "field": "dates",
            "interval": "1M",
            "format": "M"
          },
          "aggregations": {
            "day": {
              "date_histogram": {
                "field": "dates",
                "interval": "1d",
                "format": "d"
              }
            }
          }
        }
      }
    }
  }
}
However, I get the following aggregation results:
"aggregations": {
"year": {
"buckets": [
{
"key_as_string": "2014",
"key": 1388534400000,
"doc_count": 1,
"month": {
"buckets": [
{
"key_as_string": "1",
"key": 1388534400000,
"doc_count": 1,
"day": {
"buckets": [
{
"key_as_string": "31",
"key": 1391126400000,
"doc_count": 1
},
{
"key_as_string": "1",
"key": 1391212800000,
"doc_count": 1
}
]
}
},
{
"key_as_string": "2",
"key": 1391212800000,
"doc_count": 1,
"day": {
"buckets": [
{
"key_as_string": "31",
"key": 1391126400000,
"doc_count": 1
},
{
"key_as_string": "1",
"key": 1391212800000,
"doc_count": 1
}
]
}
}
]
}
}
]
}
}
The "day" aggregation ignores the bucket of its parent "month" aggregation, so it processes both elements of the array in each bucket, counting each date twice. The results indicate that two dates appear in each month (and four total), which is obviously incorrect.
I've tried reducing my aggregation to a single date histogram (and bucketing the results in Java based on the key), but the doc_count comes back as one instead of the number of elements in the array (two in my example). Adding a value_count sub-aggregation brings me back to my original issue, in which documents that overlap multiple buckets have their dates double-counted.
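For reference, a minimal sketch of that flattened value_count attempt (the aggregation names here are hypothetical). Each daily bucket's value_count covers every date of every document falling in the bucket, so a document spanning two buckets is counted twice, exactly as described:

{
  "size": 0,
  "aggregations": {
    "day": {
      "date_histogram": {
        "field": "dates",
        "interval": "1d",
        "format": "yyyy-MM-dd"
      },
      "aggregations": {
        "date_count": {
          "value_count": {
            "field": "dates"
          }
        }
      }
    }
  }
}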
Is there a way to add a filter to the date histogram aggregations or otherwise modify them in order to count the elements in my date arrays correctly? Alternatively, does Elasticsearch have an option to unwind arrays like in MongoDB? I want to avoid using scripting due to security concerns.
Thanks,
Thomas
Related
Is it possible to get an array of Elasticsearch document ids while grouping, i.e.
Current output
"aggregations": {,
"types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Text Document",
"doc_count": 3310
},
{
"key": "Unknown",
"doc_count": 15
},
{
"key": "Document",
"doc_count": 13
}
]
}
}
Desired output
"aggregations": {,
"types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Text Document",
"doc_count": 3310,
"ids":["doc1","doc2", "doc3"....]
},
{
"key": "Unknown",
"doc_count": 15,
"ids":["doc11","doc12", "doc13"....]
},
{
"key": "Document",
"doc_count": 13
"ids":["doc21","doc22", "doc23"....]
}
]
}
}
Not sure if this is possible in Elasticsearch or not; below is my aggregation query:
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "field": "docType",
        "size": 10
      }
    }
  }
}
Elasticsearch version:
6.3.2
You can use the top_hits aggregation, which returns the documents falling under each bucket. Using source filtering, you can select which fields appear under hits.
Query:
"aggs": {
"district": {
"terms": {
"field": "docType",
"size": 10
},
"aggs": {
"docs": {
"top_hits": {
"size": 10,
"_source": ["ids"]
}
}
}
}
}
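Note that each hit returned by top_hits already carries its _id in the hit metadata, so the document ids are available even without a matching source field. A bucket would come back roughly like this (ids taken from the desired output above):

{
  "key": "Text Document",
  "doc_count": 3310,
  "docs": {
    "hits": {
      "total": 3310,
      "hits": [
        {
          "_id": "doc1",
          "_source": {}
        },
        {
          "_id": "doc2",
          "_source": {}
        }
      ]
    }
  }
}

Keep in mind that top_hits returns at most a limited number of hits per bucket (capped by the index.max_inner_result_window setting, 100 by default), so it will not enumerate thousands of ids.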
For anyone interested, another solution is to create a custom key using a script that builds a string of delimited values from the doc, including the id. It may not be pretty, but you can parse it out later, and if you just need something minimal like the doc id, it may be worth it.
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": {
        "script": "doc['docType'].value+'::'+doc['_id'].value",
        "size": 10
      }
    }
  }
}
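Each bucket then keys on a composite value, roughly like this (using a doc id from the desired output above):

{
  "key": "Text Document::doc1",
  "doc_count": 1
}

One design consequence: since every document now produces a unique key, each bucket has doc_count 1, the terms size must be raised far above 10 to see all documents, and the per-type counts have to be rebuilt client-side by splitting each key on the :: delimiter.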
I had to create an aggregation that counts the number of documents falling into date ranges.
My query looks like this:
{
  "query": {
    "range": {
      "doc.createdTime": {
        "gte": 1483228800000,
        "lte": 1485907199999
      }
    }
  },
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "doc.createdTime",
        "interval": "604800000ms",
        "format": "yyyy-MM-dd'T'HH:mm:ssZZ",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1483228800000,
          "max": 1485907199999
        }
      }
    }
  }
}
An interval of 604800000 ms equals 7 days (7 × 24 × 3600 × 1000 ms).
As a result, I receive this:
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2016-12-29T00:00:00+00:00",
"key": 1482969600000,
"doc_count": 0
},
{
"key_as_string": "2017-01-05T00:00:00+00:00",
"key": 1483574400000,
"doc_count": 603
},
{
"key_as_string": "2017-01-12T00:00:00+00:00",
"key": 1484179200000,
"doc_count": 3414
},
{
"key_as_string": "2017-01-19T00:00:00+00:00",
"key": 1484784000000,
"doc_count": 71551
},
{
"key_as_string": "2017-01-26T00:00:00+00:00",
"key": 1485388800000,
"doc_count": 105652
}
]
}
}
As you can see, my buckets start from 2016-12-29, even though the range query does not cover that date. I expect my buckets to start from 2017-01-01, as specified in the range query. This problem occurs only with intervals spanning more than one day; with any other interval it works fine. I've tried days, months and hours, and those work as expected.
I've also tried using a filter aggregation first and only then the date_histogram. The result is the same.
I'm using Elasticsearch version 2.2.0.
The question is: how can I force date_histogram to start from the date I need?
Try adding the offset parameter, with a value calculated from the formula below (fixed-length interval buckets are anchored to the Unix epoch, so the offset is the remainder of the desired start date divided by the interval length):

value = start_date_in_ms % week_in_ms = 1483228800000 % 604800000 = 259200000
{
  "query": {
    "range": {
      "doc.createdTime": {
        "gte": 1483228800000,
        "lte": 1485907199999
      }
    }
  },
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "doc.createdTime",
        "interval": "604800000ms",
        "offset": "259200000ms",
        "format": "yyyy-MM-dd'T'HH:mm:ssZZ",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1483228800000,
          "max": 1485907199999
        }
      }
    }
  }
}
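As a sanity check: the first bucket boundary now lands at 1482969600000 + 259200000 = 1483228800000, i.e. exactly the start of the range, so the first bucket should look roughly like this (doc counts will redistribute across the shifted buckets):

{
  "key_as_string": "2017-01-01T00:00:00+00:00",
  "key": 1483228800000,
  "doc_count": ...
}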
Here is the mapping of my index PublicationsLikes:
id : String
account : String
api : String
date : Date
I'm currently running an aggregation in ES that groups the result counts by the id (of the publication):
{
  "key": "<publicationId-1>",
  "doc_count": 25
},
{
  "key": "<publicationId-2>",
  "doc_count": 387
},
{
  "key": "<publicationId-3>",
  "doc_count": 7831
}
The returned "key" (the id) is one piece of information, but I also need to select other fields of the publication, like account and api. A bit like this:
{
  "key": "<publicationId-1>",
  "api": "Facebook",
  "accountId": "65465z4fe6ezf456ezdf",
  "doc_count": 25
},
{
  "key": "<publicationId-2>",
  "api": "Twitter",
  "accountId": "afaez5f4eaz",
  "doc_count": 387
}
How can I manage this?
Thanks.
This requirement is best achieved with the top_hits aggregation, where you can sort the documents in each bucket and choose the first one, and also control which fields are returned:
{
  "size": 0,
  "aggs": {
    "publications": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "sample": {
          "top_hits": {
            "size": 1,
            "_source": ["api", "accountId"]
          }
        }
      }
    }
  }
}
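Each bucket then carries the selected fields of one sample document, roughly like this (values taken from the desired output above):

{
  "key": "<publicationId-1>",
  "doc_count": 25,
  "sample": {
    "hits": {
      "hits": [
        {
          "_source": {
            "api": "Facebook",
            "accountId": "65465z4fe6ezf456ezdf"
          }
        }
      ]
    }
  }
}

This works well when api and accountId are constant per publication id, since any one document is then representative of the whole bucket.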
You can use a sub-aggregation for this.
GET /PublicationsLikes/_search
{
  "aggs": {
    "ids": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "accounts": {
          "terms": {
            "field": "account",
            "size": 1
          }
        }
      }
    }
  }
}
Your result will not be exactly what you want, but it will be similar:
{
  "key": "<publicationId-1>",
  "doc_count": 25,
  "accounts": {
    "buckets": [
      {
        "key": "<account-1>",
        "doc_count": 25
      }
    ]
  }
},
{
  "key": "<publicationId-2>",
  "doc_count": 387,
  "accounts": {
    "buckets": [
      {
        "key": "<account-2>",
        "doc_count": 387
      }
    ]
  }
},
{
  "key": "<publicationId-3>",
  "doc_count": 7831,
  "accounts": {
    "buckets": [
      {
        "key": "<account-3>",
        "doc_count": 7831
      }
    ]
  }
}
You can also check the documentation to find more information.
Thanks to both of you for your quick replies. I think the first solution is the most "beautiful" (in terms of the request but also of retrieving the results), but both appear to be sub-aggregation queries.
{
  "size": 0,
  "aggs": {
    "publications": {
      "terms": {
        "size": 0,
        "field": "publicationId"
      },
      "aggs": {
        "sample": {
          "top_hits": {
            "size": 1,
            "_source": ["accountId", "api"]
          }
        }
      }
    }
  }
}
I think I must be careful with the size: 0 parameter, so, because I work with the Java API, I decided to use Integer.MAX_VALUE instead of 0.
Thanks a lot, guys.
I'm firing a query with only a date_histogram aggregation, like below:
{
  "aggs": {
    "mentionsAnalytics": {
      "date_histogram": {
        "field": "created_at",
        "interval": "day"
      }
    }
  }
}
Now the response I'm getting is not sorted by date. The following is my response:
"aggregations": {
"mentionsAnalytics": {
"buckets": [
{
"key_as_string": "2014-10-17T00:00:00.000Z",
"key": 1413504000000,
"doc_count": 2
}
,
{
"key_as_string": "2015-09-07T00:00:00.000Z",
"key": 1441584000000,
"doc_count": 2
}
,
{
"key_as_string": "2015-09-29T00:00:00.000Z",
"key": 1443484800000,
"doc_count": 2
}
,
{
"key_as_string": "2015-11-09T00:00:00.000Z",
"key": 1447027200000,
"doc_count": 4
}
]
}
}
As you can see, the dates occur in random order. Can I make them sorted by modifying any parameters inside the date_histogram aggregation? I'd rather not issue a separate sort query. Is that possible?
{
  "aggs": {
    "mentionsAnalytics": {
      "date_histogram": {
        "field": "created_at",
        "interval": "day",
        "order": {
          "_key": "desc"
        }
      }
    }
  }
}
The field to sort on should be given inside order, together with the sort direction; here _key sorts the buckets by their date key.
Is it possible to change the sorting of the results of a range aggregation in Elasticsearch? I have a keyed range aggregation and want to sort according to keys instead of doc_count.
My documents are:
POST /docs/doc/1
{
  "price": 12
}

POST /docs/doc/2
{
  "price": 8
}

POST /docs/doc/3
{
  "price": 15
}
And the aggregation query:
POST /docs/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "keyed": true,
        "ranges": [
          {
            "key": "all",
            "from": 0
          },
          {
            "key": "to10",
            "from": 0,
            "to": 10
          },
          {
            "key": "from11",
            "from": 11
          }
        ]
      }
    }
  }
}
The result for this query is:
"aggregations": {
"price_ranges": {
"buckets": {
"to10": {
"from": 0,
"from_as_string": "0.0",
"to": 10,
"to_as_string": "10.0",
"doc_count": 2
},
"all": {
"from": 0,
"from_as_string": "0.0",
"doc_count": 4
},
"from11": {
"from": 11,
"from_as_string": "11.0",
"doc_count": 2
}
}
}
}
I'd like to sort the results by key, not by range value. According to the Elasticsearch documentation it is not possible to specify a sort order, and when I try to specify one I get the following exception:
"reason": "Unknown key for a START_ARRAY in [price_ranges]: [order]."
Any ideas on how to cope with this? Thanks!
Since the buckets seem to be ordered by ascending from value, you can "cheat" a little and change the from value of the all bucket to -1; the all bucket will then appear first, followed by to10 and finally from11:
POST /docs/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "keyed": true,
        "ranges": [
          {
            "key": "all",
            "from": -1
          },
          {
            "key": "to10",
            "from": 0,
            "to": 10
          },
          {
            "key": "from11",
            "from": 11
          }
        ]
      }
    }
  }
}
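The keyed buckets should then come back ordered by their from values, roughly:

"aggregations": {
  "price_ranges": {
    "buckets": {
      "all": {
        "from": -1,
        "doc_count": 4
      },
      "to10": {
        "from": 0,
        "to": 10,
        "doc_count": 2
      },
      "from11": {
        "from": 11,
        "doc_count": 2
      }
    }
  }
}

One caveat: the all bucket now starts at -1, so it would also pick up documents with prices between -1 and 0 if any existed.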
In general you can use a bucket_sort aggregation. Then you can sort by _key, _count or a sub-aggregation (i.e. by a metric calculated for each bucket).
Given that you only have three fixed buckets though, I would simply do the sorting on the client-side instead.
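For completeness, a minimal sketch of the bucket_sort variant mentioned above (this assumes a version with bucket_sort support, 6.1+, and drops keyed, since a keyed response is a JSON object whose entry order carries no meaning):

POST /docs/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "key": "all", "from": 0 },
          { "key": "to10", "from": 0, "to": 10 },
          { "key": "from11", "from": 11 }
        ]
      },
      "aggs": {
        "by_key": {
          "bucket_sort": {
            "sort": [
              { "_key": { "order": "asc" } }
            ]
          }
        }
      }
    }
  }
}

With string keys the sort is alphabetical, so ascending order here yields all, from11, to10.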