ElasticSearch - Filtering a result and manipulating the documents - elasticsearch

I have the following query - which works fine (this might not be the actual query):
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "location",
"query": {
"geo_distance": {
"distance": "16090km",
"distance_type": "arc",
"location.point": {
"lat": "51.794177",
"lon": "-0.063055"
}
}
}
}
},
{
"geo_distance": {
"distance": "16090km",
"distance_type": "arc",
"location.point": {
"lat": "51.794177",
"lon": "-0.063055"
}
}
}
]
}
}
}
However, I also want to do the following (as part of the query, but without affecting the existing query):
Find all documents that have field_name = 1
Order those documents (field_name = 1) by geo_distance
Among documents that have field_name = 1 and the same field_name_2 value (e.g. 2), keep only the closest one in the result and remove the rest
Update (further explanation):
Aggregations can't be used, as we want to manipulate the documents in the result while also maintaining their order. For example:
If I have 20 documents sorted by a field, and 5 of them have field_name = 1, I would like to sort those 5 by distance and eliminate 4 of them, while still maintaining the original sort. (Possibly by doing the geo-distance sort and elimination before the actual query?)
Not too sure how to do this, any help is appreciated - I'm currently using ElasticSearch DSL DRF - but I can easily convert the query to ElasticSearch DSL.
Example documents (before manipulation):
[{
"field_name": 1,
"field_name_2": 2,
"location": ....
},
{
"field_name": 1,
"field_name_2": 2,
"location": ....
},
{
"field_name": 55,
"field_name_5": 22,
"location": ....
}]
Output (Desired):
[{
"field_name": 1,
"field_name_2": 2,
"location": .... <- closest
},
{
"field_name": 55,
"field_name_5": 22,
"location": ....
}]
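To make the requirement concrete, here is a minimal Python sketch of the same manipulation done client-side, after the hits have been fetched. It is not an Elasticsearch feature. The function names and the assumed document shape (a single location.point object with numeric lat and lon) are illustrative assumptions, and the haversine formula is used as a stand-in for the "arc" distance.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def dedupe_closest(docs, origin, field_value=1):
    """Return docs in their original order, except that documents with
    field_name == field_value sharing the same field_name_2 value are
    reduced to the single one closest to origin."""
    best = {}  # field_name_2 value -> (distance, index of closest doc so far)
    for index, doc in enumerate(docs):
        if doc.get("field_name") != field_value:
            continue
        point = doc["location"]["point"]  # assumed document shape
        dist = haversine_km(origin["lat"], origin["lon"],
                            point["lat"], point["lon"])
        key = doc.get("field_name_2")
        if key not in best or dist < best[key][0]:
            best[key] = (dist, index)
    keep = {index for _, index in best.values()}
    return [doc for index, doc in enumerate(docs)
            if doc.get("field_name") != field_value or index in keep]
```

The surviving document stays at its own position in the original ordering, so the first sort is preserved for everything that remains.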

One way to achieve what you want is to keep the query part as you have it now (so you still get the hits you need) and add an aggregation part that returns the closest document, with an additional condition on field_name. The aggregation part would be made of:
a filter aggregation to only consider the documents with field_name = 1
a geo_distance aggregation with a very small distance
a top_hits aggregation to return the document with the closest distance
The aggregation part would look like this:
{
"query": {
...same as you have now...
},
"aggs": {
"field_name": {
"filter": {
"term": {
"field_name": 1 <--- only select desired documents
}
},
"aggs": {
"nearby": {
"geo_distance": {
"field": "location.point",
"unit": "km",
"distance_type": "arc",
"origin": {
"lat": "51.794177",
"lon": "-0.063055"
},
"ranges": [
{
"to": 1 <---- single bucket for docs < 1km (change as needed)
}
]
},
"aggs": {
"closest": {
"top_hits": {
"size": 1, <---- closest document
"sort": [
{
"_geo_distance": {
"location.point": {
"lat": "51.794177",
"lon": "-0.063055"
},
"order": "asc",
"unit": "km",
"mode": "min",
"distance_type": "arc",
"ignore_unmapped": true
}
}
]
}
}
}
}
}
}
}
}

This can be done using field collapsing, which is the equivalent of grouping. Below is an example of how this can be achieved:
{"collapse": {"field": "vin",
"inner_hits": {
"name": "closest_dealer",
"size": 1,
"sort": [
{
"_geo_distance": {
"location.point": {
"lat": "latitude",
"lon": "longitude"
},
"order": "asc",
"unit": "km",
"distance_type": "arc",
"nested_path": "location"
}
}
]
}
}
}
The collapsing is done on the field vin, and inner_hits is used to sort the grouped items and return only the closest one (size = 1).
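As a rough illustration, the collapse clause could be assembled and its response read back in Python as below. This is a sketch under assumptions: the function names are hypothetical, the body is a plain dict that would be passed to whatever client is in use, and ascending order is used so that the nearest document in each group comes first. The collapse field (vin here) is assumed to be a single-valued keyword or numeric field.

```python
def build_collapse_search(query, origin_lat, origin_lon, collapse_field="vin"):
    """Build a search body that collapses hits on collapse_field and keeps,
    per group, the single inner hit closest to the origin."""
    return {
        "query": query,
        "collapse": {
            "field": collapse_field,
            "inner_hits": {
                "name": "closest_dealer",
                "size": 1,
                "sort": [
                    {
                        "_geo_distance": {
                            "location.point": {"lat": origin_lat, "lon": origin_lon},
                            "order": "asc",  # nearest first
                            "unit": "km",
                            "distance_type": "arc",
                            "nested_path": "location",  # as in the snippet above
                        }
                    }
                ],
            },
        },
    }


def closest_per_group(response, inner_name="closest_dealer"):
    """Pull the closest document of each collapsed group out of a response dict."""
    closest = []
    for hit in response["hits"]["hits"]:
        inner = hit["inner_hits"][inner_name]["hits"]["hits"]
        if inner:
            closest.append(inner[0])
    return closest
```

The outer hits keep the ordering of the main query, while the inner hit of each group carries the nearest document.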

Related

Bucket sort in composite aggregation?

How can I do Bucket Sort in composite Aggregation?
I need to do Composite Aggregation with Bucket sort.
I have tried Sort with aggregation.
I have tried composite aggregation.
I think this question is a continuation of your previous question, so I have assumed the same use case.
You need to use Bucket sort aggregation that is a parent pipeline
aggregation which sorts the buckets of its parent multi-bucket
aggregation. And please refer to this documentation on composite
aggregation to know more about this.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings":{
"properties":{
"user":{
"type":"keyword"
},
"date":{
"type":"date"
}
}
}
}
Index Data:
{
"date": "2015-01-01",
"user": "user1"
}
{
"date": "2014-01-01",
"user": "user2"
}
{
"date": "2015-01-11",
"user": "user3"
}
Search Query:
The size parameter can be set to define how many composite buckets
should be returned. Each composite bucket is considered as a single
bucket, so setting a size of 10 will return the first 10 composite
buckets created from the values source. The response contains the
values for each composite bucket in an array containing the values
extracted from each value source. Defaults to 10.
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"size": 3, <-- note this
"sources": [
{
"product": {
"terms": {
"field": "user"
}
}
}
]
},
"aggs": {
"mySort": {
"bucket_sort": {
"sort": [
{
"sort_user": {
"order": "desc"
}
}
]
}
},
"sort_user": {
"min": {
"field": "date"
}
}
}
}
}
}
Search Result:
"aggregations": {
"my_buckets": {
"after_key": {
"product": "user3"
},
"buckets": [
{
"key": {
"product": "user3"
},
"doc_count": 1,
"sort_user": {
"value": 1.4209344E12,
"value_as_string": "2015-01-11T00:00:00.000Z"
}
},
{
"key": {
"product": "user1"
},
"doc_count": 1,
"sort_user": {
"value": 1.4200704E12,
"value_as_string": "2015-01-01T00:00:00.000Z"
}
},
{
"key": {
"product": "user2"
},
"doc_count": 1,
"sort_user": {
"value": 1.3885344E12,
"value_as_string": "2014-01-01T00:00:00.000Z"
}
}
]
}
}
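One point worth keeping in mind is that bucket_sort only reorders the buckets its parent returns, and a composite aggregation returns one page of size buckets at a time. The Python sketch below simulates that behaviour on the sample data; it is a simplified model for illustration (the function names are made up), not how Elasticsearch is implemented.

```python
from datetime import datetime, timezone

DOCS = [
    {"date": "2015-01-01", "user": "user1"},
    {"date": "2014-01-01", "user": "user2"},
    {"date": "2015-01-11", "user": "user3"},
]


def epoch_millis(date_str):
    """Convert a yyyy-mm-dd date to epoch milliseconds (UTC)."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def composite_page_sorted(docs, size):
    """Simulate one composite page followed by a bucket_sort on min(date), desc."""
    # One bucket per user, carrying the minimum date (the sort_user metric).
    earliest = {}
    for doc in docs:
        ms = epoch_millis(doc["date"])
        earliest[doc["user"]] = min(ms, earliest.get(doc["user"], ms))
    # Composite buckets come back in ascending key order, `size` per page.
    page = sorted(earliest.items())[:size]
    # bucket_sort then reorders only the buckets on this page.
    return sorted(page, key=lambda bucket: bucket[1], reverse=True)
```

With size 3 the order is user3, user1, user2, as in the result above. With size 2 the page contains only user1 and user2, so user3 does not appear even though it has the latest date.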

Elastic Search Geo Spatial search implementation

I am trying to understand how elastic search supports Geo Spatial search internally.
For the basic search, it uses the inverted index; but how does it combine with the additional search criteria like searching for a particular text within a certain radius.
I would like to understand the internals of how the index would be stored and queried to support these queries
Text and geo queries are executed independently of one another. Let's take a concrete example:
PUT restaurants
{
"mappings": {
"properties": {
"location": {
"type": "geo_point"
},
"menu": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
POST restaurants/_doc
{
"name": "rest1",
"location": {
"lat": 40.739812,
"lon": -74.006201
},
"menu": [
"european",
"french",
"pizza"
]
}
POST restaurants/_doc
{
"name": "rest2",
"location": {
"lat": 40.7403963,
"lon": -73.9950026
},
"menu": [
"pizza",
"kebab"
]
}
You'd then match a text field and apply a geo_distance filter:
GET restaurants/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"menu": "pizza"
}
},
{
"geo_distance": {
"distance": "0.5mi",
"location": {
"lat": 40.7388,
"lon": -73.9982
}
}
},
{
"function_score": {
"query": {
"match_all": {}
},
"boost_mode": "avg",
"functions": [
{
"gauss": {
"location": {
"origin": {
"lat": 40.7388,
"lon": -73.9982
},
"scale": "0.5mi"
}
}
}
]
}
}
]
}
}
}
Since the geo_distance query only assigns a boolean value (--> score=1; only checking if the location is within a given radius), you may want to apply a gaussian function_score to boost the locations that are closer to a given origin.
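To see what the gauss function contributes, here is a small Python sketch of the decay formula described in the function_score documentation. The function and parameter names are my own, and distance is assumed to be already computed in the same unit as scale.

```python
import math


def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    """Gaussian decay score in (0, 1]: 1.0 within `offset` of the origin,
    and exactly `decay` at a distance of offset + scale."""
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    d = max(0.0, distance - offset)
    return math.exp(-d ** 2 / (2.0 * sigma_sq))
```

With scale set to 0.5 (miles) as in the query above, a restaurant at the origin scores 1.0, one at 0.5 miles scores 0.5, and the score keeps falling smoothly beyond that.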
Finally, these scores are overridable by using a _geo_distance sort where you'd order by the proximity (while of course keeping the match query intact):
...
"query": {...},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 40.7388,
"lon": -73.9982
},
"order": "asc"
}
}
]
}

Elasticsearch - Query to Determine All Unique IDs that are distance X away from a particular ID?

I have data in this format generated from a random walk (to simulate people walking around). It is set up in this manner { location : { lat: someLat, lon: someLong }, id: uniqueId, date:date }. I am trying to write a query that, given a user's unique ID, finds how many other unique IDs came within X distance of the given ID within a certain time range. Any hints on how to accomplish this?
My idea is to have a top level filter aggregation, with a nested geo-query of some sort. I think the geo-distance query is the way to go, but I am not sure how to include it in the query below to get all of the unique IDs that come within X distance of the ID I am filtering on. The query below is where I am starting from: I am filtering all documents from now - 1 day to now, where the document's user ID is the provided value. How would I check all other documents for their distances against documents that match this query?
{
"aggs" : {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyyy",
"ranges": [
{ "to": "now" },
{ "from": "now-1d" }
]
}
},
"locations" : {
"filter" : {
"term": { "id.keyword": "7a50ab18-886b-42a2-80ad-3d45112e3cfd" }
}
}
}
}
Your hunch is correct. All of this can be done using range & geo_distance filtering and _geo_distance sorting. You'll want to filter at the query level rather than in the aggs, though:
GET walking/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"date": {
"gte": "now-1d"
}
}
}
],
"filter": [
{
"geo_distance": {
"distance": "20m",
"location": {
"lat": 48.20150179951008,
"lon": 16.39111876487732
}
}
}
]
}
},
"aggs": {
"rings_around_loc": {
"geo_distance": {
"field": "location",
"origin": {
"lat": 48.20150179951008,
"lon": 16.39111876487732
},
"unit": "m",
"keyed": true,
"ranges": [
{
"to": 10
},
{
"from": 10,
"to": 50
},
{
"from": 50
}
]
}
},
"locations": {
"value_count": {
"field": "id.keyword"
}
}
},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 48.20150179951008,
"lon": 16.39111876487732
},
"order": "asc",
"unit": "m",
"mode": "min",
"distance_type": "arc",
"ignore_unmapped": true
}
}
]
}
Not sure what you need the range buckets for so I left them out.
Full steps to replicate:
PUT walking
{
"mappings": {
"properties": {
"date": {
"type": "date"
},
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"location": {
"type": "geo_point"
}
}
}
}
And then POST _bulk this random walk data
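If the goal is specifically the number of distinct IDs, note that value_count counts values rather than distinct values; a cardinality aggregation returns an (approximate) distinct count instead. Below is a minimal Python sketch that builds such a request body around a single point. The function name, the must_not clause that excludes the given user, and the default values are assumptions for illustration.

```python
def build_nearby_ids_search(lat, lon, distance="20m", exclude_id=None, since="now-1d"):
    """Build a search body counting distinct ids seen within `distance`
    of (lat, lon) since `since`, optionally excluding one id."""
    bool_query = {
        "filter": [
            {"range": {"date": {"gte": since}}},
            {"geo_distance": {"distance": distance,
                              "location": {"lat": lat, "lon": lon}}},
        ]
    }
    if exclude_id is not None:
        # Leave out the user whose surroundings are being inspected.
        bool_query["must_not"] = [{"term": {"id.keyword": exclude_id}}]
    return {
        "size": 0,  # only the aggregation result is needed
        "query": {"bool": bool_query},
        "aggs": {"unique_ids": {"cardinality": {"field": "id.keyword"}}},
    }
```

Covering a whole walk rather than one point would mean running this once per position of the given user and merging the results client-side, which is one possible approach rather than the only one.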

With Elasticsearch function score query with decay against a geo-point, is it possible to set a target distance?

It doesn't seem possible based on https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html , but I'd like confirmation.
In plain English, I'm asking to score the results (with geo-point locations) by how close they are to 500km from some latitude, longitude origin.
It's confusing because there is a parameter called "offset", but according to the documentation it doesn't seem to be an offset from the origin (e.g. a distance); it seems to mean "threshold" instead.
I see a few ways to accomplish this:
A. One way would be to simply sort by distance in reverse order from the origin. You'd use a geo_distance query and then sort by distance. In the following query, the most distant documents will come up first, i.e. the sort value is the distance from the origin and we're sorting in decreasing order.
{
"query": {
"filtered": {
"filter": {
"geo_distance": {
"from" : "100km",
"to" : "200km",
"location": {
"lat": 10,
"lon": 20
}
}
}
}
},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 10,
"lon": 20
},
"order": "desc",
"unit": "km",
"distance_type": "plane"
}
}
]
}
B. The second way involves using a geo_distance_range query in order to define a "ring" around the origin. The width of that ring could somehow symbolize the offset + scale you'd use in a gauss function (although there would be no decay). Here we define a ring that is 10km wide at 500km distance from the origin point and sort the documents by distance in that ring.
{
"query": {
"filtered": {
"filter": {
"geo_distance_range": {
"from": "495km",
"to": "505km",
"location": {
"lat": 10,
"lon": 20
}
}
}
}
},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 10,
"lon": 20
},
"order": "desc",
"unit": "km",
"distance_type": "plane"
}
}
]
}
C. The last way is a bit more involved. We're after an "inverse gauss" shape, i.e. this figure (33) but upside-down, or this one, which better represents the donut shape we're after. We can combine solution B above with a gauss function that only scores within that ring. In the query below, we're saying that we're only interested in the locations around 500km from the origin, and we let a gauss function kick in only for those documents. It's not perfect, but it might be close enough to what you need.
{
"query": {
"filtered": {
"filter": {
"geo_distance_range": {
"from": "495km",
"to": "505km",
"location": {
"lat": 10,
"lon": 20
}
}
},
"query": {
"function_score": {
"functions": [
{
"gauss": {
"location": {
"origin": {
"lat": 10,
"lon": 20
},
"offset": "500km",
"scale": "5km"
}
}
}
]
}
}
}
},
"sort": {
"_score": "desc"
}
}
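The difference between offset and a true target distance can be checked numerically. The Python sketch below compares the built-in gauss behaviour (full score everywhere within offset of the origin) with the ring-shaped score the question asks for. The second function is a hypothetical scoring rule written for illustration, not something the decay functions provide, and all names are my own.

```python
import math


def _sigma_sq(scale, decay):
    # Variance chosen so that the score equals `decay` at a distance of `scale`.
    return -scale ** 2 / (2.0 * math.log(decay))


def builtin_gauss(distance_km, scale_km, offset_km=0.0, decay=0.5):
    """Decay as documented: no penalty at all within offset_km of the origin."""
    d = max(0.0, distance_km - offset_km)
    return math.exp(-d ** 2 / (2.0 * _sigma_sq(scale_km, decay)))


def target_distance_score(distance_km, target_km, scale_km, decay=0.5):
    """Hypothetical ring score: peaks at target_km and decays on both sides."""
    d = distance_km - target_km
    return math.exp(-d ** 2 / (2.0 * _sigma_sq(scale_km, decay)))
```

With offset 500km and scale 5km, a document 100km from the origin still gets the full built-in score of 1.0, whereas the ring score is effectively zero there and only peaks at 500km. This is why option C needs the geo_distance_range filter to cut away everything inside the ring.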

ElasticSearch 2 bucket level sorting

The mapping of database is this:
{
"users": {
"mappings": {
"user": {
"properties": {
"credentials": {
"type": "nested",
"properties": {
"achievement_id": {
"type": "string"
},
"percentage_completion": {
"type": "integer"
}
}
},
"current_location": {
"type": "geo_point"
},
"locations": {
"type": "geo_point"
}
}
}
}
}
In the mapping you can see there are two geo_point fields: one is current_location and the other is locations. I want to sort users based on credentials.percentage_completion, which is a nested field. This works fine, for example with this query:
Example Query:
GET /users/user/_search?size=23
{
"sort": [
{
"credentials.percentage_completion": {
"order": "desc",
"missing": "_last"
}
},
"_score"
],
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"geo_distance": {
"distance": "100000000km",
"user.locations": {
"lat": 19.77,
"lon": 73
}
}
}
}
}
}
I want to change the sort order so that results fall into buckets: first show all the people within a 100 km radius of user.current_location, sorted by credentials.percentage_completion, and then the rest of the users, again sorted by credentials.percentage_completion.
I tried putting a condition in the sort and making it multi-level, but that does not work because only nested sorts can have filters, and those apply only to the nested field's children.
I thought I could use _score for sorting and give more relevance to people who are under 1000 km, but geo-distance is a filter, and I can't find any way to assign relevance in a filter.
Is there anything I am missing here? Any help would be great.
Thanks
Finally solved it; posting it here so others who land here can take a lead from it. The way to solve this is to give a constant relevance score to a particular query. Because this was a geo-distance filter, I was not able to use it directly as a query, but then I found the Constant Score query, which allows you to wrap a filter inside a query.
This is how query looks:
GET /users/user/_search?size=23
{
"sort": [
"_score",
{
"credentials.percentage_completion": {
"order": "desc",
"missing": "_last"
}
}
],
"explain": true,
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"constant_score": {
"filter": {
"geo_distance": {
"distance": "100km",
"user.current_location": {
"lat": 19.77,
"lon": 73
}
}
},
"boost": 50
}
},
{
"constant_score": {
"filter": {
"geo_distance": {
"distance": "1000000km",
"user.locations": {
"lat": 19.77,
"lon": 73
}
}
},
"boost": 1
}
}
]
}
},
"filter": {
"geo_distance": {
"distance": "10000km",
"user.locations": {
"lat": 19.77,
"lon": 73
}
}
}
}
}
}
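The resulting order can be mimicked in a few lines of Python, which may help when checking the query against expectations. This is a simplified model with assumed names: each user carries a precomputed near_current flag (within 100km of current_location) and a single pct value, every user is assumed to match the second constant_score clause, and any score normalisation applied by the engine is ignored since it does not change the relative order here.

```python
def bucketed_order(users, near_boost=50.0, base_boost=1.0):
    """Order users by constant score (near users first), then by completion
    percentage descending, with missing percentages last in each bucket."""
    def sort_key(user):
        score = base_boost + (near_boost if user["near_current"] else 0.0)
        pct = user["pct"]
        # Negate for descending order; `pct is None` pushes missing values last.
        return (-score, pct is None, -(pct or 0))
    return sorted(users, key=sort_key)
```

Because the boost is constant, distance itself never influences the order inside a bucket; only membership in the 100km radius does.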
