Sorting Elasticsearch Range Aggregation

Is it possible to change the sorting of results of a Range Aggregation in elasticsearch? I have a keyed Range query in elasticsearch and want to sort according to keys instead of doc_count.
My documents are:
POST /docs/doc/1
{
"price": 12
}
POST /docs/doc/2
{
"price": 8
}
POST /docs/doc/3
{
"price": 15
}
And the aggregation query:
POST /docs/_search
{
"size": 0,
"aggs": {
"price_ranges": {
"range": {
"field": "price",
"keyed": true,
"ranges": [
{
"key": "all",
"from": 0
},
{
"key": "to10",
"from": 0,
"to": 10
},
{
"key": "from11",
"from": 11
}
]
}
}
}
}
The result for this query is:
"aggregations": {
"price_ranges": {
"buckets": {
"to10": {
"from": 0,
"from_as_string": "0.0",
"to": 10,
"to_as_string": "10.0",
"doc_count": 2
},
"all": {
"from": 0,
"from_as_string": "0.0",
"doc_count": 4
},
"from11": {
"from": 11,
"from_as_string": "11.0",
"doc_count": 2
}
}
}
}
I'd like to sort the results by key rather than by range value. According to the Elasticsearch documentation, a range aggregation does not accept a sort order, and when I specify one I get the following exception:
"reason": "Unknown key for a START_ARRAY in [price_ranges]: [order]."
Any ideas on how to cope with this? Thanks!

Since the buckets seem to be ordered by ascending from value, you can "cheat" a little and change the from value of the all bucket to -1. The all bucket will then appear first, followed by to10 and finally from11:
POST /docs/_search
{
"size": 0,
"aggs": {
"price_ranges": {
"range": {
"field": "price",
"keyed": true,
"ranges": [
{
"key": "all",
"from": -1
},
{
"key": "to10",
"from": 0,
"to": 10
},
{
"key": "from11",
"from": 11
}
]
}
}
}
}

In general you can use a bucket_sort aggregation. Then you can sort by _key, _count or a sub-aggregation (i.e. by a metric calculated for each bucket).
Given that you only have three fixed buckets though, I would simply do the sorting on the client-side instead.
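As a rough illustration of the client-side approach, here is a minimal Python sketch that orders the keyed buckets alphabetically by key. The buckets dictionary is copied from the response in the question; the helper name sort_buckets_by_key is made up for this example and is not part of any Elasticsearch client.

```python
# Keyed range buckets as found under aggregations.price_ranges.buckets
# (values copied from the response shown in the question).
buckets = {
    "to10": {"from": 0, "to": 10, "doc_count": 2},
    "all": {"from": 0, "doc_count": 4},
    "from11": {"from": 11, "doc_count": 2},
}


def sort_buckets_by_key(keyed_buckets):
    """Return (key, bucket) pairs ordered alphabetically by bucket key."""
    return sorted(keyed_buckets.items(), key=lambda item: item[0])


sorted_buckets = sort_buckets_by_key(buckets)
for key, bucket in sorted_buckets:
    print(key, bucket["doc_count"])
```

Any other ordering (for example a fixed, hand-written list of keys) can be obtained by swapping the key function.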

Related

Get buckets containing documents in ElasticSearch

I have a query like that:
https://pastebin.com/9YK6WxEJ
this gives me:
https://pastebin.com/ranpCnzG
Now, the buckets are fine but I want to get the documents' data grouped by bucket name, not just their count in doc_count. Is there any way to do that?
Maybe this works for you?
"aggs": {
"rating_ranges": {
"range": {
"field": "AggregateRating",
"keyed": true,
"ranges": [
{
"key": "bad",
"to": 3
},
{
"key": "average",
"from": 3,
"to": 4
},
{
"key": "good",
"from": 4
}
]
},
"aggs": {
"hits": {
"top_hits": {
"size": 100,
"sort": [
{
"AggregateRating": {
"order": "desc"
}
}
]
}
}
}
}
}
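Once the response comes back, the documents can be regrouped by bucket name on the client. The Python sketch below assumes the aggregation names from the snippet above (rating_ranges and hits) and uses a trimmed, hypothetical response purely to show the shape; the function name documents_by_bucket is invented for this example.

```python
def documents_by_bucket(response, agg_name="rating_ranges", hits_name="hits"):
    """Map each range key to the list of _source documents in that bucket."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {
        # top_hits nests its results as <agg name> -> hits -> hits
        key: [hit["_source"] for hit in bucket[hits_name]["hits"]["hits"]]
        for key, bucket in buckets.items()
    }


# Hypothetical, trimmed response used only to illustrate the structure.
sample_response = {
    "aggregations": {
        "rating_ranges": {
            "buckets": {
                "bad": {
                    "to": 3,
                    "doc_count": 1,
                    "hits": {"hits": {"hits": [{"_source": {"AggregateRating": 2.5}}]}},
                },
                "good": {
                    "from": 4,
                    "doc_count": 2,
                    "hits": {
                        "hits": {
                            "hits": [
                                {"_source": {"AggregateRating": 4.8}},
                                {"_source": {"AggregateRating": 4.1}},
                            ]
                        }
                    },
                },
            }
        }
    }
}

grouped = documents_by_bucket(sample_response)
print(grouped["good"])
```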

Sorting percentiles aggregation with NaN values

I'm using ElasticSearch 2.3.3 and I have the following aggregation:
"aggregations": {
"mainBreakdown": {
"terms": {
"field": "location_i",
"size": 10,
"order": [
{
"comments>medianTime.50": "asc"
}
]
},
"aggregations": {
"comments": {
"filter": {
"term": {
"type_i": 120
}
},
"aggregations": {
"medianTime": {
"percentiles": {
"field": "time_l",
"percents": [
50.0
]
}
}
}
}
}
}
}
For clarity, I've added a suffix to each field name that indicates its mapping:
_i = integer
_l = long (timestamp)
And the aggregation response is:
"aggregations": {
"mainBreakdown": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 100,
"doc_count": 2,
"comments": {
"doc_count": 1,
"medianTime": {
"values": {
"50.0": 20113
}
}
}
},
{
"key": 121,
"doc_count": 14,
"comments": {
"doc_count": 0,
"medianTime": {
"values": {
"50.0": "NaN"
}
}
}
}
]
}
}
My problem is that the medianTime aggregation sometimes has a value of NaN because the parent aggregation comments matched 0 documents, and a bucket with NaN is always sorted last in both "asc" and "desc" order.
I've tried adding "missing": 0 inside the percentiles aggregation, but it still returns NaN.
Can you please help me sort my buckets by medianTime so that the NaN values come first in "asc" order and last in "desc" order?
NaN values are not numbers, so they will always be sorted last.
After a short discussion on the Elasticsearch GitHub, we decided that this is the appropriate way to handle NaN values.
https://github.com/elastic/elasticsearch/issues/36402
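If the requested ordering is still needed, one workaround is to re-sort the returned buckets on the client. The Python sketch below is only an illustration under that assumption: it uses the two buckets from the response in the question, and the helper names median_of and sort_by_median are invented here. Note that it can only reorder the buckets the terms aggregation actually returned.

```python
import math


def median_of(bucket):
    """Read the 50th percentile; the response above serialises NaN as the string "NaN"."""
    raw = bucket["comments"]["medianTime"]["values"]["50.0"]
    # Treat a missing (null) value like NaN as well.
    return float("nan") if raw is None else float(raw)


def sort_by_median(buckets, descending=False):
    """Sort by median; NaN comes first when ascending and last when descending."""

    def sort_key(bucket):
        value = median_of(bucket)
        # The leading 0/1 flag ranks NaN buckets below every real number.
        return (0, 0.0) if math.isnan(value) else (1, value)

    return sorted(buckets, key=sort_key, reverse=descending)


buckets = [
    {"key": 100, "comments": {"medianTime": {"values": {"50.0": 20113}}}},
    {"key": 121, "comments": {"medianTime": {"values": {"50.0": "NaN"}}}},
]

ascending = [b["key"] for b in sort_by_median(buckets)]
descending = [b["key"] for b in sort_by_median(buckets, descending=True)]
print(ascending, descending)
```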

Date_histogram aggregation returns bad results

I need to create an aggregation that counts the number of documents falling into date ranges.
My query looks like:
{
"query":{
"range":{
"doc.createdTime":{
"gte":1483228800000,
"lte":1485907199999
}
}
},
"size":0,
"aggs":{
"by_day":{
"date_histogram":{
"field":"doc.createdTime",
"interval":"604800000ms",
"format":"yyyy-MM-dd'T'HH:mm:ssZZ",
"min_doc_count":0,
"extended_bounds":{
"min":1483228800000,
"max":1485907199999
}
}
}
}
}
The interval of 604800000 ms equals 7 days.
As a result, I receive this:
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2016-12-29T00:00:00+00:00",
"key": 1482969600000,
"doc_count": 0
},
{
"key_as_string": "2017-01-05T00:00:00+00:00",
"key": 1483574400000,
"doc_count": 603
},
{
"key_as_string": "2017-01-12T00:00:00+00:00",
"key": 1484179200000,
"doc_count": 3414
},
{
"key_as_string": "2017-01-19T00:00:00+00:00",
"key": 1484784000000,
"doc_count": 71551
},
{
"key_as_string": "2017-01-26T00:00:00+00:00",
"key": 1485388800000,
"doc_count": 105652
}
]
}
}
As you can see, my buckets start from 29/12/2016, even though the range query does not cover this date. I expected the buckets to start from 01/01/2017, as specified in the range query. This problem occurs only when the interval is longer than one day. With other intervals it works fine: I've tried day, month and hour intervals.
I've also tried using a filter aggregation and running date_histogram inside it. The result is the same.
I'm using Elasticsearch 2.2.0.
So the question is: how can I force date_histogram to start from the date I need?
Try adding the offset parameter, with its value calculated from the following formula:
value = start_date_in_ms % week_in_ms = 1483228800000 % 604800000 = 259200000
{
"query": {
"range": {
"doc.createdTime": {
"gte": 1483228800000,
"lte": 1485907199999
}
}
},
"size": 0,
"aggs": {
"by_day": {
"date_histogram": {
"field": "doc.createdTime",
"interval": "604800000ms",
"offset": "259200000ms",
"format": "yyyy-MM-dd'T'HH:mm:ssZZ",
"min_doc_count": 0,
"extended_bounds": {
"min": 1483228800000,
"max": 1485907199999
}
}
}
}
}
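The offset arithmetic can be checked with a few lines of Python. This is just a sketch of the formula above; the function name histogram_offset_ms is made up for the example.

```python
WEEK_MS = 7 * 24 * 60 * 60 * 1000  # 604800000 ms, the interval used in the query


def histogram_offset_ms(start_ms, interval_ms=WEEK_MS):
    """Offset that makes a fixed-interval bucket boundary fall exactly on start_ms."""
    return start_ms % interval_ms


start_ms = 1483228800000  # 2017-01-01T00:00:00Z, the lower bound of the range query
offset = histogram_offset_ms(start_ms)

# Without an offset, the bucket containing start_ms begins this many ms earlier,
# which is the unexpected 2016-12-29 bucket from the question.
unshifted_bucket_start = start_ms - offset

print(offset, unshifted_bucket_start)
```

The offset works out to three days, which matches the gap between the first returned bucket (29/12/2016) and the expected start (01/01/2017).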

Using Elasticsearch Date Histogram Aggregations to Count Dates in Array Properties

I have an elasticsearch index with the following document:
{
"dates": ["2014-01-31","2014-02-01"]
}
I want to count all the instances of all the days in my index separated by year and month. I hoped to do this using a date histogram aggregation (which is successful for counting non-array properties):
{
"from": 0,
"size": 0,
"aggregations": {
"year": {
"date_histogram": {
"field": "dates",
"interval": "1y",
"format": "yyyy"
},
"aggregations": {
"month": {
"date_histogram": {
"field": "dates",
"interval": "1M",
"format": "M"
},
"aggregations": {
"day": {
"date_histogram": {
"field": "dates",
"interval": "1d",
"format": "d"
}
}
}
}
}
}
}
}
However, I get the following aggregation results:
"aggregations": {
"year": {
"buckets": [
{
"key_as_string": "2014",
"key": 1388534400000,
"doc_count": 1,
"month": {
"buckets": [
{
"key_as_string": "1",
"key": 1388534400000,
"doc_count": 1,
"day": {
"buckets": [
{
"key_as_string": "31",
"key": 1391126400000,
"doc_count": 1
},
{
"key_as_string": "1",
"key": 1391212800000,
"doc_count": 1
}
]
}
},
{
"key_as_string": "2",
"key": 1391212800000,
"doc_count": 1,
"day": {
"buckets": [
{
"key_as_string": "31",
"key": 1391126400000,
"doc_count": 1
},
{
"key_as_string": "1",
"key": 1391212800000,
"doc_count": 1
}
]
}
}
]
}
}
]
}
}
The "day" aggregation ignores the bucket of its parent "month" aggregation, so it processes both elements of the array in each bucket, counting each date twice. The results indicate that two dates appear in each month (and four total), which is obviously incorrect.
I've tried reducing my aggregation to a single date histogram (and bucketing the results in Java based on the key), but the doc_count comes back as one instead of the number of elements in the array (two in my example). Adding a value_count brings me back to my original issue, in which documents that overlap multiple buckets have their dates double-counted.
Is there a way to add a filter to the date histogram aggregations or otherwise modify them in order to count the elements in my date arrays correctly? Alternatively, does Elasticsearch have an option to unwind arrays like in MongoDB? I want to avoid using scripting due to security concerns.
Thanks,
Thomas
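For reference, the single-histogram approach mentioned in the question can be sketched as follows, in Python rather than Java. It assumes a single date_histogram with a 1d interval, and that no document repeats the same day within its array, so that each day bucket's doc_count equals the number of array elements on that day. The bucket keys are taken from the response above; count_dates_by_month is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime, timezone


def count_dates_by_month(day_buckets):
    """Sum daily doc_counts into (year, month) totals, using UTC bucket keys."""
    counts = defaultdict(int)
    for bucket in day_buckets:
        # Bucket keys are epoch milliseconds.
        day = datetime.fromtimestamp(bucket["key"] / 1000, tz=timezone.utc)
        counts[(day.year, day.month)] += bucket["doc_count"]
    return dict(counts)


# Daily buckets for the example document (keys copied from the response above).
day_buckets = [
    {"key": 1391126400000, "doc_count": 1},  # 2014-01-31
    {"key": 1391212800000, "doc_count": 1},  # 2014-02-01
]

monthly = count_dates_by_month(day_buckets)
print(monthly)
```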

Distance banding with Elasticsearch

I would like to apply "distance banding". Instead of simply sorting by distance, I would like the documents within 5 miles to come first, followed by documents at 5-10 mi, then 10-15 mi, 15-25 mi, 25-50 mi and 50+ mi. (Within each distance band they will be sorted by some other criteria.)
I read about function_score decay, but I don't think it quite fits the purpose.
How would you suggest going about it? Boosting?
One way to achieve this is using the geo_distance aggregation to define the bands and then in each band use a top_hits with some sort criteria.
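It would look like this. You will need to change the location field (location) and the sort field (name) to match yours. Note that geo_distance ranges are in meters by default, so the unit parameter is set to mi here to express the bands in miles:
{
"size": 0,
"aggs": {
"rings": {
"geo_distance": {
"field": "location",
"origin": "52.3760, 4.894",
"unit": "mi",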
"ranges": [
{
"to": 5
},
{
"from": 5, "to": 10
},
{
"from": 10, "to": 15
},
{
"from": 15, "to": 25
},
{
"from": 25, "to": 50
},
{
"from": 50
}
]
},
"aggs": {
"hits": {
"top_hits": {
"size": 5,
"sort": {
"name": "asc"
}
}
}
}
}
}
}
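To turn the banded response into one ordered result list, the bands can be walked from nearest to farthest on the client. The Python sketch below assumes the aggregation names from the query above (rings and hits) and uses a trimmed, hypothetical response only to show the structure; banded_results is an invented helper name.

```python
def banded_results(response, agg_name="rings", hits_name="hits"):
    """Flatten per-band top_hits into one list, nearest band first."""
    buckets = response["aggregations"][agg_name]["buckets"]
    ordered = []
    # The first band has no "from", so treat it as starting at 0.
    for bucket in sorted(buckets, key=lambda b: b.get("from", 0.0)):
        for hit in bucket[hits_name]["hits"]["hits"]:
            ordered.append((bucket["key"], hit["_source"]))
    return ordered


# Hypothetical, trimmed response used only to illustrate the structure.
sample_response = {
    "aggregations": {
        "rings": {
            "buckets": [
                {
                    "key": "5.0-10.0",
                    "from": 5.0,
                    "to": 10.0,
                    "hits": {"hits": {"hits": [{"_source": {"name": "b"}}]}},
                },
                {
                    "key": "*-5.0",
                    "to": 5.0,
                    "hits": {"hits": {"hits": [{"_source": {"name": "a"}}]}},
                },
            ]
        }
    }
}

results = banded_results(sample_response)
print(results)
```

Because top_hits already applies the sort inside each band, the flattened list keeps the per-band ordering intact.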
