How to sort before aggregation in search query? - elasticsearch

Here is what I am trying to achieve. Let's say my origin location is Chicago, IL. When I search listings, I get over 5,000 listings from Chicago.
I want to get the nearby cities from these listings, so I decided to use aggregations in the search query. This is an example of the output: [{Illinois: [{Ottawa}, {Chicago}, ... ]}].
However, this does not give me the nearby ones first. I believe this is because the listings (when 5,000 are matched) are not sorted by proximity. The default aggregation size is 10, so if the nearby listings are far down in the list, they will not show up in the result.
I have tried using sort outside of the aggregations, but it made no difference.
"sort": [
{
"_geo_distance": {
"distance_type": "plane",
"location": {
"lat": 41.8781136,
"lan": -87.6297982
},
"mode": "min",
"order": "asc",
"unit": "mi",
}
}
]
So, I believe I need to sort inside of the aggregation, but I am not sure how to do this with this nested aggregation.
My listings model has this (but nothing about the distance, so I am pretty sure I need to get that by using _geo_distance):
Listing = {
state: "Illinois"
city: "Chicago"
...
}
Here is my search query:
"aggs": {
"states": {
"aggs": {
"cities": {
"terms": {
"fields": "city.keyword"
}
}
},
"terms": {
"field": "state.keyword"
}
}
},
"query": {
"bool": {
"filter": [
{
"geo_distance": {
"distance": "100mi",
"distance_type": "plane",
"location": {
"lat": 41.8910166,
"lon": -87.6198982,
}
}
}
...(rest of the search query)
I do not want to sort the buckets after the aggregation. I want to sort by the nearby ones first and then apply the aggregation, so that the nearby ones appear in the result. Any help please?
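For reference, aggregations run over the entire matched document set, so a hits-level sort never changes what ends up in the buckets; terms buckets are ordered by document count by default. A minimal sketch in Python of the request body from the question with its two typos corrected (`lon` instead of `lan`, `field` instead of `fields`); the coordinates are the ones given above:

```python
# Sketch: the corrected search body from the question, assembled as a
# Python dict. Note that "sort" orders the hits only; the terms buckets
# are unaffected by it.
origin = {"lat": 41.8781136, "lon": -87.6297982}  # "lon", not "lan"

body = {
    "query": {
        "bool": {
            "filter": [
                {
                    "geo_distance": {
                        "distance": "100mi",
                        "distance_type": "plane",
                        "location": origin,
                    }
                }
            ]
        }
    },
    "sort": [
        {
            "_geo_distance": {
                "location": origin,
                "mode": "min",
                "order": "asc",
                "unit": "mi",
            }
        }
    ],
    "aggs": {
        "states": {
            "terms": {"field": "state.keyword"},  # "field", not "fields"
            "aggs": {
                "cities": {"terms": {"field": "city.keyword"}}
            },
        }
    },
}
```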

Related

ElasticSearch - Filtering a result and manipulating the documents

I have the following query, which works fine (this might not be the actual query):
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "location",
"query": {
"geo_distance": {
"distance": "16090km",
"distance_type": "arc",
"location.point": {
"lat": "51.794177",
"lon": "-0.063055"
}
}
}
}
},
{
"geo_distance": {
"distance": "16090km",
"distance_type": "arc",
"location.point": {
"lat": "51.794177",
"lon": "-0.063055"
}
}
}
]
}
}
}
Although I want to do the following (as part of the query, but without affecting the existing query):
Find all documents that have field_name = 1
On all documents that have field_name = 1 run ordering by geo_distance
Remove duplicates that have field_name = 1 and the same value under field_name_2 = 2, leave the closest item in the documents result, and remove the rest
Update (further explanation):
Aggregations can't be used, as we want to manipulate the documents in the result, whilst also maintaining the order within the documents. Meaning:
If I have 20 documents sorted by a field, 5 of which have field_name = 1, I would like to sort those 5 by distance and eliminate 4 of them, whilst still maintaining the first sort. (Possibly doing the geo-distance sort and elimination before the actual query?)
Not too sure how to do this; any help is appreciated. I'm currently using ElasticSearch DSL DRF, but I can easily convert the query to ElasticSearch DSL.
Example documents (before manipulation):
[{
"field_name": 1,
"field_name_2": 2,
"location": ....
},
{
"field_name": 1,
"field_name_2": 2,
"location": ....
},
{
"field_name": 55,
"field_name_5": 22,
"location": ....
}]
Output (Desired):
[{
"field_name": 1,
"field_name_2": 2,
"location": .... <- closest
},
{
"field_name": 55,
"field_name_5": 22,
"location": ....
}]
One way to achieve what you want is to keep the query part as you have it now (so you still get the hits you need) and add an aggregation part in order to get the closest document with an additional condition on field_name. The aggregation part would be made of:
a filter aggregation to only consider the documents with field_name = 1
a geo_distance aggregation with a very small distance
a top_hits aggregation to return the document with the closest distance
The aggregation part would look like this:
{
"query": {
...same as you have now...
},
"aggs": {
"field_name": {
"filter": {
"term": {
"field_name": 1 <--- only select desired documents
}
},
"aggs": {
"geo_distance": {
"field": "location.point",
"unit": "km",
"distance_type": "arc",
"origin": {
"lat": "51.794177",
"lon": "-0.063055"
},
"ranges": [
{
"to": 1 <---- single bucket for docs < 1km (change as needed)
}
]
},
"aggs": {
"closest": {
"top_hits": {
"size": 1, <---- closest document
"sort": [
{
"_geo_distance": {
"location.point": {
"lat": "51.794177",
"lon": "-0.063055"
},
"order": "asc",
"unit": "km",
"mode": "min",
"distance_type": "arc",
"ignore_unmapped": true
}
}
]
}
}
}
}
}
}
}
This can be done using Field Collapsing, which is the equivalent of grouping. Below is an example of how this can be achieved:
{
"collapse": {
"field": "vin",
"inner_hits": {
"name": "closest_dealer",
"size": 1,
"sort": [
{
"_geo_distance": {
"location.point": {
"lat": "latitude",
"lon": "longitude"
},
"order": "desc",
"unit": "km",
"distance_type": "arc",
"nested_path": "location"
}
}
]
}
}
}
The collapsing is done on the field vin, and inner_hits is used to sort the grouped items and get the closest one (size = 1).
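The answer above shows the request but not how to read the response. A small sketch of consuming a collapse + inner_hits response (the response dict here is hand-made for illustration; in a real response each top-level hit carries the collapse field under "fields" and the named inner_hits block):

```python
# Sketch: extract the single closest document per "vin" group from a
# collapse + inner_hits response. The inner_hits name "closest_dealer"
# matches the request above.
def closest_per_group(response, inner_name="closest_dealer"):
    out = {}
    for hit in response["hits"]["hits"]:
        # the collapse field is echoed back under "fields"
        vin = hit["fields"]["vin"][0]
        inner = hit["inner_hits"][inner_name]["hits"]["hits"]
        if inner:
            out[vin] = inner[0]["_id"]  # size=1, sorted asc: the closest
    return out

# hand-made example response
resp = {"hits": {"hits": [
    {"_id": "1", "fields": {"vin": ["V1"]},
     "inner_hits": {"closest_dealer": {"hits": {"hits": [{"_id": "1"}]}}}},
]}}
closest = closest_per_group(resp)
```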

Sort multi-bucket aggregation by source fields inside inner multi-bucket aggregation

TL;DR: Using an inner multi-bucket aggregation (top_hits with size: 1) inside an outer multi-bucket aggregation, is it possible to sort the buckets of the outer aggregation by the data in the inner buckets?
I have the following index mappings
{
"parent": {
"properties": {
"children": {
"type": "nested",
"properties": {
"child_id": { "type": "keyword" }
}
}
}
}
}
and each child (in data) has also the properties last_modified: Date and other_property: String.
I need to fetch a list of children (of all the parents but without the parents), but only the one with the latest last_modified per each child_id. Then I need to sort and paginate those results to return manageable amounts of data.
I'm able to get the data and paginate over it with a combination of nested, terms, top_hits, and bucket_sort aggregations (and also get the total count with cardinality)
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"children": {
"nested": {
"path": "children"
},
"aggs": {
"totalCount": {
"cardinality": {
"field": "children.child_id"
}
},
"oneChildPerId": {
"terms": {
"field": "children.child_id",
"order": { "_term": "asc" },
"size": 1000000
},
"aggs": {
"lastModified": {
"top_hits": {
"_source": [
"children.other_property"
],
"sort": {
"children.last_modified": {
"order": "desc"
}
},
"size": 1
}
},
"paginate": {
"bucket_sort": {
"from": 36,
"size": 3
}
}
}
}
}
}
}
}
but after more than a solid day of going through the docs and experimenting, I seem to be no closer to figuring out how to sort the buckets of my oneChildPerId aggregation by the other_property of the single child retrieved by the lastModified aggregation.
Is there a way to sort a multi-bucket aggregation by results in a nested multi-bucket aggregation?
What I've tried:
I thought I could use bucket_sort for that too, but apparently its sort can only be used with paths containing other single-bucket aggregations and ending in a metric one.
I've tried to find a way to somehow transform the 1-result multi-bucket of lastModified into a single-bucket, but haven't found any.
I'm using ElasticSearch 6.8.6 (the bucket_sort and similar tools weren't available in ES 5.x and older).
I had the same problem: I needed a terms aggregation with a nested top_hits, and wanted to sort by a specific field inside the nested aggregation.
Not sure how performant my solution is, but the desired behaviour can be achieved with a single-value metric aggregation at the same level as the top_hits. You can then sort by this new aggregation in the terms aggregation via its order field.
Here an example:
POST books/_doc
{ "genre": "action", "title": "bookA", "pages": 200 }
POST books/_doc
{ "genre": "action", "title": "bookB", "pages": 35 }
POST books/_doc
{ "genre": "action", "title": "bookC", "pages": 170 }
POST books/_doc
{ "genre": "comedy", "title": "bookD", "pages": 80 }
POST books/_doc
{ "genre": "comedy", "title": "bookE", "pages": 90 }
GET books/_search
{
"size": 0,
"aggs": {
"by_genre": {
"terms": {
"field": "genre.keyword",
"order": {"max_pages": "asc"}
},
"aggs": {
"top_book": {
"top_hits": {
"size": 1,
"sort": [{"pages": {"order": "desc"}}]
}
},
"max_pages": {"max": {"field": "pages"}}
}
}
}
}
by_genre has the order field, which sorts by a sub-aggregation called max_pages. max_pages has only been added for this purpose: it creates the single-value metric that order can sort by.
Query above returns (I've shortened the output for clarity):
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
If you change "order": {"max_pages": "asc"} to "order": {"max_pages": "desc"}, the output becomes:
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
The type of the max_pages aggregation can be changed as needed, as long as it is a single-value metric aggregation (e.g. sum, avg, etc.)
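To make the ordering mechanism concrete, here is a small local sketch that reproduces what the query above does with the same sample books: group by genre, compute the max-pages metric per bucket, pick the top hit, then order the buckets by the metric. (This is an illustration of the semantics, not Elasticsearch code.)

```python
# Sketch: reproduce the terms + top_hits + max ordering locally with the
# sample documents indexed above.
books = [
    {"genre": "action", "title": "bookA", "pages": 200},
    {"genre": "action", "title": "bookB", "pages": 35},
    {"genre": "action", "title": "bookC", "pages": 170},
    {"genre": "comedy", "title": "bookD", "pages": 80},
    {"genre": "comedy", "title": "bookE", "pages": 90},
]

def buckets_ordered_by_max_pages(docs, order="asc"):
    # group by genre (the terms aggregation)
    groups = {}
    for d in docs:
        groups.setdefault(d["genre"], []).append(d)
    buckets = []
    for genre, ds in groups.items():
        # top_hits with size 1, sorted by pages desc
        top = max(ds, key=lambda d: d["pages"])
        # the max_pages single-value metric
        buckets.append({"genre": genre, "max_pages": top["pages"],
                        "top_book": top["title"]})
    # what "order": {"max_pages": "asc"/"desc"} does to the buckets
    buckets.sort(key=lambda b: b["max_pages"], reverse=(order == "desc"))
    return buckets
```

With "asc" this yields the comedy bucket (bookE, 90 pages) before the action bucket (bookA, 200 pages), matching the output shown above.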

Elasticsearch apply conditions in query on basis of results count

Is there any way in Elasticsearch for following type of outcome
"Apply the first condition; if no results are found, apply the next condition, and so on."
I am aware of the basics of ES queries. I know this can be done by querying again and again based on the results, but I want to do it in a single query for the sake of time and efficiency.
Here is my current query
GET /_search
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"geo_bounding_box": {
"location": {
"top_left": {
"lat": 28.6143519,
"lon": -81.50773
},
"bottom_right": {
"lat": 28.3479859,
"lon": -81.22977
}
}
}
}
]
}
}
}
},
"size": 10,
"from": 0,
"sort": {
"search_score": {
"order": "desc"
}
}
}
Now what I want to do is: if this query hits zero results, it should search again with an increased set of lat/lon bounds. I can do this by re-querying Elasticsearch, but that would be inefficient.
I want to know if this is possible in Elasticsearch?
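A commonly used workaround (not a true server-side fallback, but it avoids the extra network round trip) is the `_msearch` endpoint: send both the narrow and the widened bounding box in one request, then use the first non-empty response on the client. A sketch building the ndjson body; the widened bounds here are hypothetical, the narrow ones are from the question:

```python
import json

# Sketch: build an _msearch (multi-search) ndjson body containing one
# query per bounding box. The client then picks the first response
# that has hits.
def msearch_body(bounds_list):
    lines = []
    for bounds in bounds_list:
        lines.append(json.dumps({}))  # header line (use default index)
        lines.append(json.dumps({
            "size": 10,
            "query": {"bool": {"filter": [
                {"geo_bounding_box": {"location": bounds}}
            ]}},
            "sort": {"search_score": {"order": "desc"}},
        }))
    return "\n".join(lines) + "\n"  # _msearch requires a trailing newline

narrow = {"top_left": {"lat": 28.6143519, "lon": -81.50773},
          "bottom_right": {"lat": 28.3479859, "lon": -81.22977}}
wider = {"top_left": {"lat": 28.9, "lon": -81.8},  # hypothetical widened box
         "bottom_right": {"lat": 28.1, "lon": -81.0}}
body = msearch_body([narrow, wider])
```

Note the sketch uses the newer bool/filter syntax rather than the deprecated filtered query shown in the question.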

With Elasticsearch function score query with decay against a geo-point, is it possible to set a target distance?

It doesn't seem possible based on https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html , but I'd like confirmation.
In plain English, I'm asking to score the results (with geo-point locations) by how close they are to being 500km away from some latitude/longitude origin.
It's confusing because there is a parameter called "offset", but according to the documentation it doesn't seem to be an offset from the origin (e.g. a distance); instead it seems to mean "threshold".
I see a few ways to accomplish this:
A. One way would be to simply sort by distance in reverse order from the origin. You'd use a geo_distance query and then sort by distance. In the following query, the most distant documents will come up first, i.e. the sort value is the distance from the origin and we're sorting in decreasing order.
{
"query": {
"filtered": {
"filter": {
"geo_distance": {
"from" : "100km",
"to" : "200km",
"location": {
"lat": 10,
"lon": 20
}
}
}
}
},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 10,
"lon": 20
},
"order": "desc",
"unit": "km",
"distance_type": "plane"
}
}
]
}
B. The second way involves using a geo_distance_range query in order to define a "ring" around the origin. The width of that ring could somehow symbolize the offset + scale you'd use in a gauss function (although there would be no decay). Here we define a ring that is 10km wide at 500km distance from the origin point and sort the documents by distance in that ring.
{
"query": {
"filtered": {
"filter": {
"geo_distance_range": {
"from": "495km",
"to": "505km",
"location": {
"lat": 10,
"lon": 20
}
}
}
}
},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 10,
"lon": 20
},
"order": "desc",
"unit": "km",
"distance_type": "plane"
}
}
]
}
C. The last way is a bit more involved. We're basically after an "inverse gauss" shape: a donut centered on the origin. We can combine solution B above with a gauss function that only scores within that ring. In the query below, we're saying that we're only interested in locations around 500km from the origin, and we let a gauss function kick in only for those documents. It's not perfect, but it might be close enough to what you need.
{
"query": {
"filtered": {
"filter": {
"geo_distance_range": {
"from": "495km",
"to": "505km",
"location": {
"lat": 10,
"lon": 20
}
}
},
"query": {
"function_score": {
"functions": [
{
"gauss": {
"location": {
"origin": {
"lat": 10,
"lon": 20
},
"offset": "500km",
"scale": "5km"
}
}
}
]
}
}
}
},
"sort": {
"_score": "desc"
}
}
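To see why the ring filter in solution C is needed at all, here is a small sketch of the gauss decay curve, based on the decay formula given in the function_score documentation: everything within `offset` of the origin scores 1.0, which is exactly why "offset" acts as a no-decay threshold rather than a target distance.

```python
import math

# Sketch: the gauss decay function from the function_score docs.
# score = exp(-(max(0, |distance| - offset))^2 / (2 * sigma^2))
# where sigma^2 = -scale^2 / (2 * ln(decay)), decay defaulting to 0.5.
def gauss_score(distance_km, offset_km, scale_km, decay=0.5):
    sigma_sq = -(scale_km ** 2) / (2.0 * math.log(decay))
    x = max(0.0, distance_km - offset_km)
    return math.exp(-(x ** 2) / (2.0 * sigma_sq))
```

With offset 500km and scale 5km (the values in query C), a document at the origin and a document at 500km both score 1.0, and a document at 505km scores 0.5; without the geo_distance_range filter, everything closer than 500km would get full score, which is why the function alone cannot express "peak at 500km".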

Elasticsearch: Limit filtered query to 5 items per type per day

I'm using Elasticsearch to gather data for the front page of my event portal. The current query is as follows:
{
"query": {
"function_score": {
"filter": {
"and": [
{
"geo_distance": {
"distance": "50km",
"location": {
"lat": 50.78,
"lon": 6.08
},
"_cache": true
}
},
{
"or": [
{
"and": [
{
"term": {
"type": "event"
}
},
{
"range": {
"datetime": {
"gt": "now"
}
}
}
]
},
{
"not": {
"term": {
"type": "event"
}
}
}
]
}
]
},
"functions": [
...
]
}
}
}
So basically: all events within a 50km distance which are future events, plus all other types. Other types could be status, photo, video, soundcloud, etc. All these items have a datetime field and a parent field for the account the item belongs to. There are some functions after the filter for scoring objects based on their distance and age.
Now my question:
Is there a way to filter the query to get only the first (or even better, the highest scored) 5 items per type per account per day?
Currently I have accounts which upload 20 images at the same time. This is too much to display on the front page.
I thought about using filter scripts in a post_filter, but I am not very familiar with this topic.
Any ideas?
many thanks in advance
DTFagus
I solved it this way:
"aggs": {
"byParent": {
"terms": {
"field": "parent_id"
},
"aggs": {
"byType": {
"terms": {
"field": "type"
},
"aggs": {
"perDay": {
"date_histogram" : {
"field" : "datetime",
"interval": "day"
},
"aggs": {
"topHits": {
"top_hits": {
"size": 5,
"_source": {
"include": ["path"]
}
}
}
}
}
}
}
}
}
}
Unfortunately there is no pagination for aggregations (or, put the other way around: the pagination of the query is not applied to them). So I will get the paginated query results plus the aggregation over all hits, and intersect the arrays in JS. That does not sound very efficient, but I currently have no better idea. Anyone?
The only way around this I can see would be to index the data into two indices: one containing all data, and one with only the top 5 per day per type per account. This would be less time-consuming to query, but more time- and storage-consuming while indexing :/
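The client-side intersection mentioned above could look like this (sketched in Python rather than JS): keep only the paginated query hits whose id also appears among the aggregation's top hits, preserving the query's ranking order.

```python
# Sketch: intersect the paginated query hits with the ids collected from
# the aggregation's top_hits buckets, keeping the query order.
def intersect_hits(query_hits, agg_top_hit_ids):
    allowed = set(agg_top_hit_ids)
    return [h for h in query_hits if h["_id"] in allowed]

# hand-made example: one page of query hits, ids gathered from the aggs
page = [{"_id": "a"}, {"_id": "b"}, {"_id": "c"}]
kept = intersect_hits(page, ["c", "a"])
```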
You can limit the number of results returned by your query using the "size" parameter. If you set size to 5, you will get the first 5 results returned by your query.
Check the documentation http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/pagination.html
Hope this helps!
