Sort multi-bucket aggregation by source fields inside inner multi-bucket aggregation - elasticsearch

TL;DR: Using an inner multi-bucket aggregation (top_hits with size: 1) inside an outer multi-bucket aggregation, is it possible to sort the buckets of the outer aggregation by the data in the inner buckets?
I have the following index mapping:
{
"parent": {
"properties": {
"children": {
"type": "nested",
"properties": {
"child_id": { "type": "keyword" }
}
}
}
}
}
Each child (in the data) also has the properties last_modified: Date and other_property: String.
I need to fetch a list of children (of all the parents but without the parents), but only the one with the latest last_modified per each child_id. Then I need to sort and paginate those results to return manageable amounts of data.
I'm able to get the data and paginate over it with a combination of nested, terms, top_hits, and bucket_sort aggregations (and also get the total count with cardinality):
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"children": {
"nested": {
"path": "children"
},
"aggs": {
"totalCount": {
"cardinality": {
"field": "children.child_id"
}
},
"oneChildPerId": {
"terms": {
"field": "children.child_id",
"order": { "_term": "asc" },
"size": 1000000
},
"aggs": {
"lastModified": {
"top_hits": {
"_source": [
"children.other_property"
],
"sort": {
"children.last_modified": {
"order": "desc"
}
},
"size": 1
}
},
"paginate": {
"bucket_sort": {
"from": 36,
"size": 3
}
}
}
}
}
}
}
}
However, after more than a solid day of going through the docs and experimenting, I seem to be no closer to figuring out how to sort the buckets of my oneChildPerId aggregation by the other_property of the single child retrieved by the lastModified aggregation.
Is there a way to sort a multi-bucket aggregation by results in a nested multi-bucket aggregation?
What I've tried:
I thought I could use bucket_sort for that too, but apparently its sort can only be used with paths that contain other single-bucket aggregations and end in a metric one.
I've tried to find a way to somehow transform the 1-result multi-bucket of lastModified into a single-bucket, but haven't found any.
I'm using Elasticsearch 6.8.6 (bucket_sort and similar tools weren't available in ES 5.x and older).

I had the same problem: I needed a terms aggregation with a nested top_hits and wanted to sort by a specific field inside the nested aggregation.
I'm not sure how performant my solution is, but the desired behaviour can be achieved with a single-value metric aggregation on the same level as the top_hits. You can then sort by this new aggregation through the order parameter of the terms aggregation.
Here is an example:
POST books/_doc
{ "genre": "action", "title": "bookA", "pages": 200 }
POST books/_doc
{ "genre": "action", "title": "bookB", "pages": 35 }
POST books/_doc
{ "genre": "action", "title": "bookC", "pages": 170 }
POST books/_doc
{ "genre": "comedy", "title": "bookD", "pages": 80 }
POST books/_doc
{ "genre": "comedy", "title": "bookE", "pages": 90 }
GET books/_search
{
"size": 0,
"aggs": {
"by_genre": {
"terms": {
"field": "genre.keyword",
"order": {"max_pages": "asc"}
},
"aggs": {
"top_book": {
"top_hits": {
"size": 1,
"sort": [{"pages": {"order": "desc"}}]
}
},
"max_pages": {"max": {"field": "pages"}}
}
}
}
}
The by_genre terms aggregation uses its order parameter to sort by a sub-aggregation called max_pages. max_pages has been added only for this purpose: it produces a single-value metric that order can sort by.
Query above returns (I've shortened the output for clarity):
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
If you change "order": {"max_pages": "asc"} to "order": {"max_pages": "desc"}, the output becomes:
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
The type of the max_pages aggregation can be changed as needed, as long as it is a single-value metric aggregation (e.g. sum, avg, etc.).
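To make the ordering concrete, here is a minimal Python sketch that mimics the query above in memory, without talking to Elasticsearch. The function name by_genre_sorted_by_max_pages is made up for illustration. It groups the five sample books by genre, keeps the longest book per group (the role of top_hits) and sorts the groups by the per-group maximum (the role of max_pages).

```python
from collections import defaultdict

# The five sample documents indexed above.
BOOKS = [
    {"genre": "action", "title": "bookA", "pages": 200},
    {"genre": "action", "title": "bookB", "pages": 35},
    {"genre": "action", "title": "bookC", "pages": 170},
    {"genre": "comedy", "title": "bookD", "pages": 80},
    {"genre": "comedy", "title": "bookE", "pages": 90},
]


def by_genre_sorted_by_max_pages(books, descending=False):
    """Hypothetical in-memory stand-in for terms + top_hits + max."""
    groups = defaultdict(list)
    for book in books:
        groups[book["genre"]].append(book)

    buckets = []
    for genre, docs in groups.items():
        buckets.append({
            "key": genre,
            # Role of the max_pages metric: one number per bucket.
            "max_pages": max(d["pages"] for d in docs),
            # Role of top_hits with size 1, sorted by pages desc.
            "top_book": max(docs, key=lambda d: d["pages"]),
        })

    # Role of "order": {"max_pages": ...} on the terms aggregation.
    buckets.sort(key=lambda b: b["max_pages"], reverse=descending)
    return buckets


if __name__ == "__main__":
    for bucket in by_genre_sorted_by_max_pages(BOOKS):
        print(bucket["key"], bucket["top_book"]["title"], bucket["max_pages"])
```

The sketch also shows why the trick works here: the metric and the top_hits sort look at the same field in the same direction. If they used different fields, the bucket order and the displayed hit could disagree.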

Related

Search and aggregation on two indices

Two indexes are created with the dates.
First index mapping:
PUT /index_one
{
"mappings": {
"properties": {
"date_start": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
Second index mapping:
PUT /index_two
{
"mappings": {
"properties": {
"date_end": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
I need to find the dates in a certain range and compute the average of the date differences.
Tried to make a request like this:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
"aggs": {
"filtered_dates": {
"filter": {
"bool": {
"must": [
{
"exists": {
"field": "date_start"
}
},
{
"exists": {
"field": "date_end"
}
}
]
}
},
"aggs": {
"avg_date": {
"avg": {
"script": {
"lang": "painless",
"source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_begin'].value.toInstant().toEpochMilli()"
}
}
}
}
}
}
}
I get the following response to the request:
{
"hits": {
"total": {
"value": 16508
},
"hits": [
{
"_index": "index_one",
"_type": "_doc",
"_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
"_score": 1.0,
"_source": {
"date_begin": "2021-02-26 07:26:29.732+0300"
}
}
]
},
"aggregations": {
"filtered_dates": {
"meta": {},
"doc_count": 0,
"avg_date": {
"value": null
}
}
}
}
Can you please tell me if it is possible to make a query with search and aggregation over two indices in Elasticsearch? If so, how?
If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.
Now, the script context operates on one single document at a time and has "no clue" about the other, potentially related docs. So if you don't store both dates at the same time in at least one doc, you'd need to somehow connect the docs nonetheless.
One option would be to use their ids:
POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }
POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }
POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }
POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }
After that, it's possible to:
Target multiple indices — as you already do.
Group the docs by their IDs and keep only the groups that contain at least 2 docs (assuming the two docs represent the start & the end).
Obtain the min & max dates — essentially cherry-picking the date_start and date_end to be used later down the line.
Use a bucket_script aggregation to calculate their difference (in milliseconds).
Leverage a top-level average bucket aggregation to run over all the difference buckets and ... average them.
In concrete terms:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
"aggs": {
"grouped_by_id": {
"terms": {
"field": "id",
"min_doc_count": 2,
"size": 10
},
"aggs": {
"min_date": {
"min": {
"field": "date_start"
}
},
"max_date": {
"max": {
"field": "date_end"
}
},
"diff": {
"bucket_script": {
"buckets_path": {
"min": "min_date",
"max": "max_date"
},
"script": "params.max - params.min"
}
}
}
},
"avg_duration_across_the_board": {
"avg_bucket": {
"buckets_path": "grouped_by_id>diff",
"gap_policy": "skip"
}
}
}
}
If everything goes right, you'll end up with:
...
"aggregations" : {
"grouped_by_id" : {
...
},
"avg_duration_across_the_board" : {
"value" : 1.70208E10 <-- 17,020,800,000 milliseconds ~ 4,728 hrs
}
}
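As a sanity check on that number, the same arithmetic can be reproduced outside Elasticsearch. The following is only a sketch in plain Python over the four sample docs from above; the names STARTS, ENDS and average_duration_ms are invented for this illustration.

```python
from datetime import date

# Sample docs from above, keyed by their shared "id" field.
STARTS = {1: date(2021, 1, 1), 2: date(2021, 1, 1)}    # index_one
ENDS = {1: date(2021, 12, 31), 2: date(2021, 1, 31)}   # index_two

MS_PER_DAY = 24 * 60 * 60 * 1000


def average_duration_ms(starts, ends):
    """Average (end - start) in milliseconds over ids present on both sides."""
    # Role of min_doc_count: 2, keep only ids that have a start and an end.
    shared_ids = starts.keys() & ends.keys()
    # Role of bucket_script: one difference per id.
    diffs = [(ends[i] - starts[i]).days * MS_PER_DAY for i in shared_ids]
    if not diffs:
        return None
    # Role of avg_bucket: average the per-id differences.
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    avg_ms = average_duration_ms(STARTS, ENDS)
    print(avg_ms)              # 17020800000.0
    print(avg_ms / 3_600_000)  # 4728.0 hours
```

The two durations are 364 days and 30 days, so the average is 197 days, which matches the 1.70208E10 milliseconds shown in the response.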
⚠️ Caveat: note that the grouped_by_id terms aggregation has an adjustable size. You'll probably need to increase it to cover more docs. But there are theoretical and practical limits as to how far it makes sense to increase it.
📖 Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook — containing lots of other real-world, non-trivial examples 🙌

ElasticSearch: Is it possible to do a "Weighted Avg Aggregation" weighted by the score?

I'm trying to compute an average over a price field (price.avg). I want the best matches of the query to have more impact on the average than the later ones, so the average should be weighted by the calculated score field. This is the aggregation that I'm implementing:
{
"query": {...},
"size": 100,
"aggs": {
"weighted_avg_price": {
"weighted_avg": {
"value": {
"field": "price.avg"
},
"weight": {
"script": "_score"
}
}
}
}
}
It should give me what I want. But instead I receive a null value:
{...
"hits": {...},
"aggregations": {
"weighted_avg_price": {
"value": null
}
}
}
Is there something that I'm missing? Is this aggregation query feasible? Is there any workaround?
When you debug what's available from within the script
GET prices/_search
{
"size": 0,
"aggs": {
"weighted_avg_price": {
"weighted_avg": {
"value": {
"field": "price"
},
"weight": {
"script": "Debug.explain(new ArrayList(params.keySet()))"
}
}
}
}
}
the following gets spit out
[doc, _source, _doc, _fields]
None of these contain information about the query _score that you're trying to access because aggregations operate in a context separate from the query-level scoring. This means the weight value needs to either
exist in the doc or
exist in the doc + be modifiable or
be a query-time constant (like 42 or 0.1)
A workaround could be to apply a math function to the retrieved price such as
"script": "Math.pow(doc.price.value, 0.5)"
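For intuition, a weighted average is sum(value * weight) / sum(weight). Here is a small Python sketch of that formula with the square-root weight from the script above. The prices are made up, and returning None for a zero total weight is a choice of this sketch, not a statement about what Elasticsearch does.

```python
import math


def weighted_avg(values, weight_fn):
    """Compute sum(value * weight) / sum(weight) with a per-value weight."""
    weights = [weight_fn(v) for v in values]
    total_weight = sum(weights)
    if total_weight == 0:
        return None  # sketch-only convention for "nothing to average"
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def sqrt_weight(price):
    """Python counterpart of Math.pow(doc.price.value, 0.5)."""
    return math.pow(price, 0.5)


if __name__ == "__main__":
    prices = [4.0, 16.0]  # hypothetical prices
    # Weights are 2 and 4, so the result is (8 + 64) / 6.
    print(weighted_avg(prices, sqrt_weight))
```

Note that such a weight depends only on the document itself, which is exactly the restriction described above: nothing about the query score enters the calculation.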
@jzzfs I'm trying the approach of "avg of the first N results (ordered by _score)", using a top_hits aggregation:
{
"query": {
"bool": {
"should": [
...
],
"minimum_should_match": 0
}
},
"size": 0,
"from": 0,
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"aggs": {
"top_avg_price": {
"avg": {
"field": "price.max"
}
},
"aggs": {
"top_hits": {
"size": 10, // N: Changing the number of results doesn't change the top_avg_price
"_source": {
"includes": [
"price.max"
]
}
}
}
},
"explain": "false"
}
The avg aggregation is computed over the main results, not over the top_hits aggregation.
I guess top_avg_price should be a sub-aggregation of top_hits, but I think that's not possible at the moment.
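One possible fallback, which is not something proposed in this thread, is to request the top N hits (size: N instead of size: 0) and compute the score-weighted average on the client. The sketch below assumes each hit carries _score and a _source shaped like {"price": {"max": ...}}; the sample hits and the function name score_weighted_avg are invented.

```python
def score_weighted_avg(hits, n):
    """Average price.max over the n best hits, weighted by _score."""
    top = sorted(hits, key=lambda h: h["_score"], reverse=True)[:n]
    total_score = sum(h["_score"] for h in top)
    if total_score == 0:
        return None  # no hits, or all scores are zero
    weighted_sum = sum(h["_source"]["price"]["max"] * h["_score"] for h in top)
    return weighted_sum / total_score


# Hypothetical content of response["hits"]["hits"].
SAMPLE_HITS = [
    {"_score": 1.0, "_source": {"price": {"max": 200.0}}},
    {"_score": 3.0, "_source": {"price": {"max": 100.0}}},
    {"_score": 0.5, "_source": {"price": {"max": 1000.0}}},
]

if __name__ == "__main__":
    # Two best hits: (100 * 3 + 200 * 1) / (3 + 1)
    print(score_weighted_avg(SAMPLE_HITS, 2))
```

This moves the work out of the aggregation framework entirely, so it only makes sense when N is small enough to fetch the hits comfortably.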

Excluding inner hits from top hits aggregation with source filter

In my query, I am using the inner_hits to return the list of nested objects that match my query.
I then add an aggregation on the categoryId of my document, and inside it a top_hits aggregation to get the display name for that category.
"aggs": {
"category": {
"terms": {
"field": "categoryId",
"size": 100
},
"aggs": {
"category_value": {
"top_hits": {
"size": 1,
"_source": {
"includes": "categoryName"
}
}
}
}
}
}
Now, when I look at the aggregation buckets, I do get a _source document with only the categoryName property, but I also get the entire inner_hits collection:
{
...
"_source": {
"categoryName": "Armchairs"
},
"inner_hits": {
"my_inner_hits": {
"hits": {
"total": 260,
"max_score": null,
"hits": [{
...
"_source": {
//nested document here
}
}
]
}
}
}
}
Is there a way to not include the inner_hits data in a top_hits aggregation?
Since you only need a single field, I suggest you get rid of the top_hits aggregation and use another terms aggregation for the name:
{
...
"aggs": {
"category": {
"terms": {
"field": "categoryId",
"size": 100
},
"aggs": {
"category_value": {
"terms": {
"field": "categoryName",
"size": 1
}
}
}
}
}
}
That will also be a little bit more efficient.
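In effect, the inner terms aggregation with size 1 returns the most frequent categoryName within each categoryId bucket. A rough in-memory equivalent in Python, with invented sample documents and a made-up function name, looks like this:

```python
from collections import Counter, defaultdict


def name_per_category(docs):
    """Map each categoryId to its most frequent categoryName."""
    names = defaultdict(Counter)
    for doc in docs:
        names[doc["categoryId"]][doc["categoryName"]] += 1
    # Role of the inner terms aggregation with size 1.
    return {cid: counts.most_common(1)[0][0] for cid, counts in names.items()}


# Hypothetical documents; only the two relevant fields are shown.
SAMPLE_DOCS = [
    {"categoryId": 7, "categoryName": "Armchairs"},
    {"categoryId": 7, "categoryName": "Armchairs"},
    {"categoryId": 9, "categoryName": "Sofas"},
]

if __name__ == "__main__":
    print(name_per_category(SAMPLE_DOCS))
```

When every document of a category carries the same name, as is presumably the case here, the most frequent name is simply that name.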
UPDATE:
Another way to keep using terms/top_hits is to leverage response filtering and only return what you need. For instance, appending this to your URL will make sure that you won't find any inner hits inside your aggregation
?filter_path=hits.hits,aggregations.**.key,aggregations.**.doc_count,aggregations.**.hits.hits.hits._source
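If changing the URL is not an option, the same cleanup can be approximated on the client after parsing the response. This is a different technique from filter_path (the inner hits are still sent over the wire), shown only as a sketch; drop_inner_hits is a made-up helper name and the sample bucket is invented.

```python
def drop_inner_hits(node):
    """Return a copy of a parsed JSON value without any "inner_hits" keys."""
    if isinstance(node, dict):
        return {
            key: drop_inner_hits(value)
            for key, value in node.items()
            if key != "inner_hits"
        }
    if isinstance(node, list):
        return [drop_inner_hits(item) for item in node]
    return node


# Hypothetical top_hits entry shaped like the one in the question.
SAMPLE_HIT = {
    "_source": {"categoryName": "Armchairs"},
    "inner_hits": {"my_inner_hits": {"hits": {"total": 260, "hits": []}}},
}

if __name__ == "__main__":
    print(drop_inner_hits(SAMPLE_HIT))
```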

Elasticsearch: Sort top_hits aggregation _score and then doc count

I am looking to sort aggregation buckets by _score and then by the number of docs (in case multiple buckets have the same _score). What I have right now sorts by _score only:
"aggs": {
"name": {
"terms": {
"field": "name",
"order": {"by_score": "desc"}
},
"aggs": {
"top_hits": {
"top_hits": {
"size": 1,
"_source": ["name"]
}
},
"by_score": {
"max": {"script": { "source": "_score" }
}
}
}
}
}
I think I found the answer here Elasticsearch two level sort in aggregation list
The order needs to be in an array:
"order": [
{"by_score": "desc"},
{"_count": "desc"}
]
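The array form behaves like a sort on a compound key: the first criterion decides, and the second only breaks ties. A tiny Python sketch with invented buckets illustrates the resulting order; the bucket values and the function name order_buckets are assumptions made for the example.

```python
def order_buckets(buckets):
    """Sort by by_score desc, then by doc_count desc as a tie-breaker."""
    return sorted(buckets, key=lambda b: (-b["by_score"], -b["doc_count"]))


# Hypothetical buckets; "alpha" and "beta" share the same score.
SAMPLE_BUCKETS = [
    {"key": "alpha", "by_score": 2.5, "doc_count": 3},
    {"key": "beta", "by_score": 2.5, "doc_count": 7},
    {"key": "gamma", "by_score": 4.0, "doc_count": 1},
]

if __name__ == "__main__":
    print([b["key"] for b in order_buckets(SAMPLE_BUCKETS)])
    # ['gamma', 'beta', 'alpha']
```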

elastic search sort aggregation by selected field

How can I sort the output from an aggregation by a field that is in the source data, but not part of the output of the aggregation?
In my source data I have a date field, and I would like the output of the aggregation to be sorted by it.
Is that possible? I've looked at using "order" within the aggregation, but I don't think it can see that date field to use it for sorting?
I've also tried adding a sub aggregation which includes the date field, but again, I cannot get it to sort on this field.
I'm calculating a hash for each document in my ETL on the way in to elastic. My data set contains a lot of duplication, so I'm trying to use the aggregation on the hash field to filter out duplicates and that works fine. I need the output from the aggregation to retain a date sort order so that I can work with the output in angular.
The documents are like this:
{
"_id": 123,
"_source": {
"hash": "01010101010101",
"user": "1",
"dateTime": "2001/2/20 09:12:21",
"action": "Login"
}
}
{
"_id": 124,
"_source": {
"hash": "01010101010101",
"user": "1",
"dateTime": "2001/2/20 09:12:21",
"action": "Login"
}
}
{
"_id": 132,
"_source": {
"hash": "0202020202020",
"user": "1",
"dateTime": "2001/2/20 09:20:43",
"action": "Logout"
}
}
{
"_id": 200,
"_source": {
"hash": "0303030303030303",
"user": "2",
"dateTime": "2001/2/22 09:32:14",
"action": "Login"
}
}
So I want to use an aggregation on the hash value to remove duplicates from my set and then render the response in date order.
My query:
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"action": "Login"
}
}
]
},
"size": 0,
"aggs": {
"md5": {
"terms": {
"field": "hash",
"size": 0
}
},
"size": 0,
"aggs": {
"byDate": {
"terms": {
"field": "dateTime",
"size": 0
}
}
}
}
}
}
}
}
Currently the output is ordered on the hash and I need it ordered on the date field within each hash bucket. Is that possible?
If the aggregation on "hash" is just for removing duplicates, it might work for you to simply aggregate on "dateTime" first, followed by the terms aggregation on "hash". For example:
GET my_index/test/_search
{
"query" : {
"filtered" : {
"filter" : {
"bool": {
"must" : [
{ "term": {"action":"Login"} }
]
}
}
}
},
"size": 0,
"aggs": {
"byDate" : {
"terms": {
"field" : "dateTime",
"order": { "_term": "asc" } <---- EDIT: must specify order here
},
"aggs": {
"byHash": {
"terms": {
"field": "hash"
}
}
}
}
}
}
This way, your results would be sorted by "dateTime" first.
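To see what that nesting yields on the sample documents from the question, here is a rough Python equivalent that runs in memory. The function name logins_by_date_then_hash is made up, and the dates are parsed with the format that the sample values appear to use.

```python
from datetime import datetime

# Flattened copies of the four sample documents from the question.
DOCS = [
    {"_id": 123, "hash": "01010101010101", "dateTime": "2001/2/20 09:12:21", "action": "Login"},
    {"_id": 124, "hash": "01010101010101", "dateTime": "2001/2/20 09:12:21", "action": "Login"},
    {"_id": 132, "hash": "0202020202020", "dateTime": "2001/2/20 09:20:43", "action": "Logout"},
    {"_id": 200, "hash": "0303030303030303", "dateTime": "2001/2/22 09:32:14", "action": "Login"},
]


def logins_by_date_then_hash(docs):
    """Group Login docs by dateTime (ascending), then by distinct hash."""
    grouped = {}
    for doc in docs:
        if doc["action"] != "Login":  # role of the term filter
            continue
        when = datetime.strptime(doc["dateTime"], "%Y/%m/%d %H:%M:%S")
        # A set per date plays the role of the inner terms aggregation on hash.
        grouped.setdefault(when, set()).add(doc["hash"])
    # Role of "order": {"_term": "asc"} on the outer aggregation.
    return [(when, sorted(grouped[when])) for when in sorted(grouped)]


if __name__ == "__main__":
    for when, hashes in logins_by_date_then_hash(DOCS):
        print(when.isoformat(), hashes)
```

Documents 123 and 124 collapse into a single hash under the same date, and the Logout document is filtered out, so two date groups remain in chronological order.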

Resources