Distance banding with Elasticsearch

I would like to apply "distance banding". Instead of simply sorting by distance, I would like the documents within 5 miles to come first, followed by the 5-10 mi documents, then 10-15 mi, 15-25 mi, 25-50 mi, and finally 50+ mi. (Within each distance band the documents will be sorted by some other criteria.)
I have read about function_score decay, but I don't think it quite fits the purpose.
How would you suggest going about this? Boosting?

One way to achieve this is to use the geo_distance aggregation to define the bands and then, inside each band, a top_hits aggregation with the desired sort criteria.
It would look like this (note that the unit is set to miles to match your bands). You will need to change the location field (location) and the sort field (name) to match yours:
{
  "size": 0,
  "aggs": {
    "rings": {
      "geo_distance": {
        "field": "location",
        "origin": "52.3760, 4.894",
        "unit": "mi",
        "ranges": [
          { "to": 5 },
          { "from": 5, "to": 10 },
          { "from": 10, "to": 15 },
          { "from": 15, "to": 25 },
          { "from": 25, "to": 50 },
          { "from": 50 }
        ]
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 5,
            "sort": {
              "name": "asc"
            }
          }
        }
      }
    }
  }
}
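If you need the banding applied to the regular hits themselves (for example so that paging works across all bands), another option is a script-based sort that maps each document's distance to a band number, with the secondary criteria as a tie-breaker. This is only a sketch, assuming location is a geo_point field, name is the secondary sort field, and the same origin as above; the band boundaries and the metres-per-mile constant are spelled out in the script:
{
  "query": { "match_all": {} },
  "sort": [
    {
      "_script": {
        "type": "number",
        "order": "asc",
        "script": {
          "lang": "painless",
          "params": { "lat": 52.3760, "lon": 4.894 },
          "source": "double mi = doc['location'].arcDistance(params.lat, params.lon) / 1609.344; if (mi < 5) return 0; if (mi < 10) return 1; if (mi < 15) return 2; if (mi < 25) return 3; if (mi < 50) return 4; return 5;"
        }
      }
    },
    { "name": "asc" }
  ]
}
This keeps everything in the normal hits array (so from/size pagination works as usual), at the cost of evaluating the script for every matching document.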

Related

ElasticSearch - Filtering a result and manipulating the documents

I have the following query - which works fine (this might not be the actual query):
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "location",
            "query": {
              "geo_distance": {
                "distance": "16090km",
                "distance_type": "arc",
                "location.point": {
                  "lat": "51.794177",
                  "lon": "-0.063055"
                }
              }
            }
          }
        },
        {
          "geo_distance": {
            "distance": "16090km",
            "distance_type": "arc",
            "location.point": {
              "lat": "51.794177",
              "lon": "-0.063055"
            }
          }
        }
      ]
    }
  }
}
Now I want to do the following (as part of the query, but without affecting the existing query):
Find all documents that have field_name = 1
On all documents that have field_name = 1, apply an ordering by geo_distance
Remove duplicates that have field_name = 1 and the same value under field_name_2 = 2, leaving only the closest item in the result and removing the rest
Update (further explanation):
Aggregations can't be used, as we want to manipulate the documents in the result whilst also maintaining the order within the documents; meaning:
If I have 20 documents sorted by a field, and 5 of them have field_name = 1, I would like to sort those 5 by distance and eliminate 4 of them, whilst still maintaining the first sort. (Possibly doing the geo-distance sort and elimination before the actual query?)
Not too sure how to do this; any help is appreciated. I'm currently using Elasticsearch DSL DRF, but I can easily convert the query to Elasticsearch DSL.
Example documents (before manipulation):
[{
  "field_name": 1,
  "field_name_2": 2,
  "location": ....
},
{
  "field_name": 1,
  "field_name_2": 2,
  "location": ....
},
{
  "field_name": 55,
  "field_name_5": 22,
  "location": ....
}]
Output (Desired):
[{
  "field_name": 1,
  "field_name_2": 2,
  "location": ....   <- closest
},
{
  "field_name": 55,
  "field_name_5": 22,
  "location": ....
}]
One way to achieve what you want is to keep the query part as you have it now (so you still get the hits you need) and add an aggregation part in order to get the closest document with an additional condition on field_name. The aggregation part would be made of:
a filter aggregation to only consider the documents with field_name = 1
a geo_distance aggregation with a very small distance
a top_hits aggregation to return the document with the closest distance
The aggregation part would look like this:
{
  "query": {
    ...same as you have now...
  },
  "aggs": {
    "field_name": {
      "filter": {
        "term": {
          "field_name": 1                  <--- only select desired documents
        }
      },
      "aggs": {
        "geo_distance": {
          "geo_distance": {
            "field": "location.point",
            "unit": "km",
            "distance_type": "arc",
            "origin": {
              "lat": "51.794177",
              "lon": "-0.063055"
            },
            "ranges": [
              {
                "to": 1                    <---- single bucket for docs < 1km (change as needed)
              }
            ]
          },
          "aggs": {
            "closest": {
              "top_hits": {
                "size": 1,                 <---- closest document
                "sort": [
                  {
                    "_geo_distance": {
                      "location.point": {
                        "lat": "51.794177",
                        "lon": "-0.063055"
                      },
                      "order": "asc",
                      "unit": "km",
                      "mode": "min",
                      "distance_type": "arc",
                      "ignore_unmapped": true
                    }
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}
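For reference, the closest matching document then comes back in the aggregations section of the response rather than in the hits. The response shape would look roughly like this (a sketch of the relevant part only, not actual output):
{
  "aggregations": {
    "field_name": {
      "doc_count": 5,
      "geo_distance": {
        "buckets": [
          {
            "key": "*-1.0",
            "from": 0.0,
            "to": 1.0,
            "doc_count": 2,
            "closest": {
              "hits": {
                "hits": [
                  { "_source": { "field_name": 1, "field_name_2": 2 } }
                ]
              }
            }
          }
        ]
      }
    }
  }
}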
This can be done using Field Collapsing, which is the equivalent of grouping. Below is an example of how this can be achieved:
{"collapse": {"field": "vin",
"inner_hits": {
"name": "closest_dealer",
"size": 1,
"sort": [
{
"_geo_distance": {
"location.point": {
"lat": "latitude",
"lon": "longitude"
},
"order": "desc",
"unit": "km",
"distance_type": "arc",
"nested_path": "location"
}
}
]
}
}
}
The collapsing is done on the field vin, and inner_hits is used to sort the grouped items by ascending distance and return the closest one (size = 1).
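If the overall order of the result list also matters (the "first sort" mentioned in the question), collapse can be combined with a regular top-level sort. A rough sketch, where match_all stands in for the existing bool query, some_sort_field is a hypothetical field driving the outer order, and field_name_2 is assumed to be a keyword or numeric field with doc_values (a requirement for collapsing):
{
  "query": { "match_all": {} },
  "sort": [
    { "some_sort_field": "asc" }
  ],
  "collapse": {
    "field": "field_name_2",
    "inner_hits": {
      "name": "closest",
      "size": 1,
      "sort": [
        {
          "_geo_distance": {
            "location.point": { "lat": 51.794177, "lon": -0.063055 },
            "order": "asc",
            "unit": "km",
            "distance_type": "arc",
            "nested_path": "location"
          }
        }
      ]
    }
  }
}
Note that the representative (top-level) hit of each collapsed group is still chosen by the outer sort; the closest document per group is the one returned inside inner_hits.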

Histogram is not starting at the right min even with a filter added

The Mapping
"eventTime": {
"type": "long"
},
The Query
POST some_indices/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "eventTime": {
            "from": 1563120000000,
            "to": 1565712000000,
            "format": "epoch_millis"
          }
        }
      }
    }
  },
  "aggs": {
    "min_eventTime": { "min": { "field": "eventTime" } },
    "max_eventTime": { "max": { "field": "eventTime" } },
    "time_series": {
      "histogram": {
        "field": "eventTime",
        "interval": 86400000,
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1563120000000,
          "max": 1565712000000
        }
      }
    }
  }
}
The Response
"aggregations": {
"max_eventTime": {
"value": 1565539199997
},
"min_eventTime": {
"value": 1564934400000
},
"time_series": {
"buckets": [
{
"key": 1563062400000,
"doc_count": 0
},
{
"key": 1563148800000,
"doc_count": 0
},
{
...
Question
As the reference clearly mentions:
For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.
I set the filter properly (as the demo does), and the min and max aggregations also provide evidence of that.
But why is the first key still SMALLER THAN the from (or min_eventTime)?
So weird, and I'm totally lost now ;(
Any advice would be appreciated ;)
References
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation
I hacked out a solution for now, but I kind of think it's a bug in Elasticsearch.
I am using date_histogram instead, even though the field itself is a long type, and via offset I moved the starting point forward to the right timestamp.
"aggs": {
"time_series": {
"date_histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": "+16h",
"min_doc_count": 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
},
"aggs": {
"order_amount_total": {
"sum": {
"field": "order_amount"
}
}
}
}
}
Updated
Thanks to the help of @Val, I re-thought it and ran a test as follows:
@Test
public void testComputation() {
    System.out.println(1563120000000L % 86400000L); // 57600000
    System.out.println(1563062400000L % 86400000L); // 0
}
I want to quote from the doc
With extended_bounds setting, you now can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
But I believe the specific min value should be one of 0, interval, 2 * interval, 3 * interval, ... (plus the offset, if any), rather than an arbitrary value like the one I used in the question. Histogram bucket keys are always aligned to multiples of the interval: since 1563120000000 % 86400000 = 57600000, the first key gets rounded down to 1563120000000 - 57600000 = 1563062400000, which is exactly what the response shows.
So basically, in my case I can use the offset parameter of the plain histogram to solve the issue, as follows.
I don't actually need date_histogram at all.
"histogram": {
"field": "eventTime",
"interval": 86400000,
"offset": 57600000,
"min_doc_count" : 0,
"extended_bounds": {
"min": 1563120000000,
"max": 1565712000000
}
}
A clear explanation posted by Elasticsearch team member @polyfractal (thank you for the detailed, crystal-clear explanation) confirms the same logic; more details can be found here.
A reason for the design, which I want to quote here:
if we cut the aggregation off right at the extended_bounds.min/max, we would generate buckets that are not the full interval and that would break many assumptions about how the histogram works.

Elasticsearch: Using the results of a Metric Aggregation to filter the elements of a bucket and run additional aggregations

Given a dataset like
[{
  "type": "A",
  "value": 32
}, {
  "type": "A",
  "value": 34
}, {
  "type": "B",
  "value": 35
}]
I would like to perform the following aggregation:
Firstly, I would like to group by "type" into buckets using the terms aggregation.
After that, I would like to calculate some metrics on the field "value" using extended_stats.
Knowing the std_deviation_bounds (upper and lower), I would like to calculate the average value of the elements of each bucket, excluding those outside the range [std_deviation_bounds.lower, std_deviation_bounds.upper].
The first and second points of my list are trivial. I would like to know whether the third point, using the result of a sibling metric aggregation to filter out elements of the bucket and recalculate the average, is possible. And if it is, I would like a hint about the aggregation structure I would need to use.
The version of the Elasticsearch instance is 5.0.0
Well, OP here.
I still don't know whether Elasticsearch allows formulating an aggregation the way I described in the original question.
What I did to solve this problem was take a different approach. I will post it here just in case it is helpful to anyone else.
So,
POST hostname:9200/index/type/_search
with
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "type"
      },
      "aggs": {
        "histogramAgg": {
          "histogram": {
            "field": "value",
            "interval": 10,
            "offset": 0,
            "order": {
              "_key": "asc"
            },
            "keyed": true,
            "min_doc_count": 0
          },
          "aggs": {
            "statsAgg": {
              "stats": {
                "field": "value"
              }
            }
          }
        },
        "extStatsAgg": {
          "extended_stats": {
            "field": "value",
            "sigma": 2
          }
        }
      }
    }
  }
}
will generate a result like this
{
  "took": 100,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "failed": 0
  },
  "hits": {
    "total": 100000,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "group": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [{
        "key": "A",
        "doc_count": 10000,
        "histogramAgg": {
          "buckets": {
            "0.0": {
              "key": 0.0,
              "doc_count": 1234,
              "statsAgg": {
                "count": 1234,
                "min": 0.0,
                "max": 9.0,
                "avg": 0.004974220783280196,
                "sum": 7559.0
              }
            },
            "10.0": {
              "key": 10.0,
              "doc_count": 4567,
              "statsAgg": {
                "count": 4567,
                "min": 10.0,
                "max": 19.0,
                "avg": 15.544345993923,
                "sum": 331846.0
              }
            },
            [...]
          }
        },
        "extStatsAgg": {
          "count": 10000,
          "min": 0.0,
          "max": 104.0,
          "avg": 16.855123857,
          "sum": 399079395E10,
          "sum_of_squares": 3.734838645273888E15,
          "variance": 1.2690056384124432E9,
          "std_deviation": 35.10540102369,
          "std_deviation_bounds": {
            "upper": 87.06592590438,
            "lower": -54.35567819038
          }
        }
      },
      [...]
      ]
    }
  }
}
If you pay attention to the results of the group aggregation for type: "A", you will notice we now have the average and the count of every sub-group of the histogram.
You will also have noticed that the results of the extStatsAgg aggregation (sibling of the histogram aggregation) show the std_deviation_bounds for every group bucket (for type: "A", type: "B", ...).
As you may have noticed, this doesn't directly give the solution I was looking for.
I needed to do a few calculations in my own code. Example in pseudocode:
for bucket in buckets_groupAggregation
    Long totalCount = 0
    Double accumWeightedAverage = 0.0
    ExtendedStats extendedStats = bucket.extendedStatsAggregation
    Double upperLimit = extendedStats.std_deviation_bounds.upper
    Double lowerLimit = extendedStats.std_deviation_bounds.lower
    Histogram histogram = bucket.histogramAggregation
    for group in histogram
        Stats stats = group.statsAggregation
        if group.key > lowerLimit & group.key < upperLimit
            totalCount += group.count
            accumWeightedAverage += group.count * stats.average
    Double average = accumWeightedAverage / totalCount
Notes:
The size of the histogram interval determines the "accuracy" of the final average: a finer interval gives more accurate results at the cost of a longer aggregation time. (An exact two-request alternative is sketched below.)
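For an exact figure instead of this bucket-level approximation, one alternative (assuming a second round trip is acceptable) is to read std_deviation_bounds from the first response and then issue one follow-up request per type value that filters on that range before averaging. A sketch, with the bounds for type "A" from the example response above plugged in by hand, and assuming type is indexed as a keyword:
POST hostname:9200/index/type/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "A" } },
        { "range": { "value": { "gte": -54.35567819038, "lte": 87.06592590438 } } }
      ]
    }
  },
  "aggs": {
    "avg_within_bounds": {
      "avg": { "field": "value" }
    }
  }
}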
I hope it helps someone else

Sorting an Elasticsearch Range Aggregation

Is it possible to change the sorting of the results of a range aggregation in Elasticsearch? I have a keyed range aggregation and want to sort according to the keys instead of doc_count.
My documents are:
POST /docs/doc/1
{
  "price": 12
}

POST /docs/doc/2
{
  "price": 8
}

POST /docs/doc/3
{
  "price": 15
}
And the aggregation query:
POST /docs/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "keyed": true,
        "ranges": [
          { "key": "all", "from": 0 },
          { "key": "to10", "from": 0, "to": 10 },
          { "key": "from11", "from": 11 }
        ]
      }
    }
  }
}
The result for this query is:
"aggregations": {
"price_ranges": {
"buckets": {
"to10": {
"from": 0,
"from_as_string": "0.0",
"to": 10,
"to_as_string": "10.0",
"doc_count": 2
},
"all": {
"from": 0,
"from_as_string": "0.0",
"doc_count": 4
},
"from11": {
"from": 11,
"from_as_string": "11.0",
"doc_count": 2
}
}
}
}
I'd like to sort the results according to the key, not according to the range values. According to the Elasticsearch documentation it is not possible to specify a sort order, and when I specify one anyway I get the following exception:
"reason": "Unknown key for a START_ARRAY in [price_ranges]: [order]."
Any ideas on how to cope with this? Thanks!
Since the keys seem to be ordered according to ascending values of the from value, you can "cheat" a little bit and change the from value of the all bucket to -1; then the all bucket will appear first, followed by to10 and finally from11:
POST /docs/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "keyed": true,
        "ranges": [
          { "key": "all", "from": -1 },
          { "key": "to10", "from": 0, "to": 10 },
          { "key": "from11", "from": 11 }
        ]
      }
    }
  }
}
In general you can use a bucket_sort aggregation. Then you can sort by _key, _count, or a sub-aggregation (i.e. by a metric calculated for each bucket).
Given that you only have three fixed buckets, though, I would simply do the sorting on the client side instead.

how to build a range aggregation on parent by minimum value in children docs

I have a parent/child relationship between Product and Pricing documents. A Product has many Pricings, each with its own subtotal field, and I'd simply like to create a range aggregation that only considers the minimum subtotal of each product and filters out the others.
I think this is possible using nested aggregations and filters, but this is the closest I've gotten:
POST /test_index/Product/_search
{
  "aggs": {
    "offered-at": {
      "children": {
        "type": "Pricing"
      },
      "aggs": {
        "prices": {
          "aggs": {
            "min_price": {
              "min": {
                "field": "subtotal"
              },
              "aggs": {
                "min_price_buckets": {
                  "range": {
                    "field": "subtotal",
                    "ranges": [
                      { "to": 100 },
                      { "from": 100, "to": 200 },
                      { "from": 200 }
                    ]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
However this results in the error nested: AggregationInitializationException[Aggregator [min_price] of type [min] cannot accept sub-aggregations], which sort of makes sense, because once you have reduced to a single value there is nothing left to aggregate.
But how can I structure this so that the range aggregation only pulls the minimum value from each set of children?
(Here is a Sense gist with mappings and test data: http://sense.qbox.io/gist/01b072b4566ef6885113dc94a796f3bdc56f19a9)
