Elasticsearch: Sort top_hits aggregation _score and then doc count

I want to sort aggregation buckets by _score and then, when several buckets have the same _score, by the number of docs. So far I can only sort by _score:
"aggs": {
  "name": {
    "terms": {
      "field": "name",
      "order": {"by_score": "desc"}
    },
    "aggs": {
      "top_hits": {
        "top_hits": {
          "size": 1,
          "_source": ["name"]
        }
      },
      "by_score": {
        "max": {"script": {"source": "_score"}}
      }
    }
  }
}

I think I found the answer in the question "Elasticsearch two level sort in aggregation list".
The order needs to be an array, so that the second entry breaks ties left by the first:
"order": [
  {"by_score": "desc"},
  {"_count": "desc"}
]
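To make the tie-breaking concrete, here is a minimal Python sketch that builds the same terms aggregation as a dict and imitates the ordering client-side. The function names and the sample buckets are made up for illustration; only the `order` array, `by_score`, and `_count` come from the query above.

```python
def build_name_agg():
    """Terms aggregation ordered by max _score, then by doc count."""
    return {
        "name": {
            "terms": {
                "field": "name",
                # Array form: later entries break ties left by earlier ones.
                "order": [{"by_score": "desc"}, {"_count": "desc"}],
            },
            "aggs": {
                "top_hits": {"top_hits": {"size": 1, "_source": ["name"]}},
                "by_score": {"max": {"script": {"source": "_score"}}},
            },
        }
    }


def sort_like_order(buckets):
    """Client-side imitation of the two-level ordering (illustration only)."""
    return sorted(buckets, key=lambda b: (-b["by_score"], -b["doc_count"]))


# Hypothetical buckets: "a" and "b" tie on score, so doc_count decides.
sample = [
    {"key": "a", "by_score": 2.0, "doc_count": 3},
    {"key": "b", "by_score": 2.0, "doc_count": 7},
    {"key": "c", "by_score": 5.0, "doc_count": 1},
]
ordered_keys = [b["key"] for b in sort_like_order(sample)]
print(ordered_keys)
```

With these sample values the bucket with the highest score comes first, and the two tied buckets are separated by their document counts.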

Related

ElasticSearch: Is it possible to do a "Weighted Avg Aggregation" weighted by the score?

I'm trying to compute an average over a price field (price.avg), but I want the best matches of the query to have more impact on the average than the lowest-ranked ones, so the average should be weighted by the calculated _score. This is the aggregation I'm using:
{
  "query": {...},
  "size": 100,
  "aggs": {
    "weighted_avg_price": {
      "weighted_avg": {
        "value": {
          "field": "price.avg"
        },
        "weight": {
          "script": "_score"
        }
      }
    }
  }
}
It should give me what I want. But instead I receive a null value:
{...
"hits": {...},
"aggregations": {
"weighted_avg_price": {
"value": null
}
}
}
Is there something that I'm missing? Is this aggregation query feasible? Is there any workaround?

When you debug what's available from within the script
GET prices/_search
{
  "size": 0,
  "aggs": {
    "weighted_avg_price": {
      "weighted_avg": {
        "value": {
          "field": "price"
        },
        "weight": {
          "script": "Debug.explain(new ArrayList(params.keySet()))"
        }
      }
    }
  }
}
the following gets spit out
[doc, _source, _doc, _fields]
None of these contain information about the query _score that you're trying to access because aggregations operate in a context separate from the query-level scoring. This means the weight value needs to either
exist in the doc or
exist in the doc + be modifiable or
be a query-time constant (like 42 or 0.1)
A workaround could be to apply a math function to the retrieved price such as
"script": "Math.pow(doc.price.value, 0.5)"
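If the weighting really has to use _score, another possible workaround is to compute the weighted average client-side from the returned hits, since each hit carries its own `_score`. The sketch below assumes the response has the usual `hits.hits` shape and that `price.avg` is stored as a nested object in `_source`; the function name and the sample hits are hypothetical. It only covers the hits actually returned (100 in the question's request), not every matching document.

```python
def score_weighted_avg(hits, field_path):
    """Average of a numeric _source field, weighted by each hit's _score."""
    total, weight_sum = 0.0, 0.0
    for hit in hits:
        value = hit["_source"]
        for part in field_path.split("."):
            value = value[part]  # walk a dotted path such as "price.avg"
        score = hit["_score"]
        total += score * value
        weight_sum += score
    # Return None when there is nothing to average.
    return total / weight_sum if weight_sum else None


# Hypothetical hits shaped like the "hits.hits" array of a search response.
hits = [
    {"_score": 3.0, "_source": {"price": {"avg": 10.0}}},
    {"_score": 1.0, "_source": {"price": {"avg": 20.0}}},
]
result = score_weighted_avg(hits, "price.avg")
print(result)
```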

@jzzfs I'm trying the approach of "avg of the first N results (ordered by _score)", using the top_hits aggregation:
{
"query": {
"bool": {
"should": [
...
],
"minimum_should_match": 0
}
},
"size": 0,
"from": 0,
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"aggs": {
"top_avg_price": {
"avg": {
"field": "price.max"
}
},
"aggs": {
"top_hits": {
"size": 10, // N: Changing the number of results doesn't change the top_avg_price
"_source": {
"includes": [
"price.max"
]
}
}
}
},
"explain": "false"
}
The avg aggregation is computed over all matching documents, not over the results of the top_hits aggregation.
I guess top_avg_price should be a sub-aggregation of top_hits, but I think that's not possible at the moment.
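One option that may fit here is the sampler aggregation, which restricts its sub-aggregations to the top-scoring documents on each shard. The Python sketch below builds such a request body; the helper name, the `top_n` aggregation name, and the `match_all` placeholder query are assumptions. Note that `shard_size` applies per shard, so on a multi-shard index the average covers up to N documents per shard rather than the global top N.

```python
def top_n_avg_body(query, n, field):
    """Request body that averages `field` over the best-scoring docs per shard."""
    return {
        "query": query,
        "size": 0,
        "aggs": {
            "top_n": {
                # sampler keeps only the top-scoring documents on each shard
                "sampler": {"shard_size": n},
                "aggs": {"top_avg_price": {"avg": {"field": field}}},
            }
        },
    }


# match_all is only a placeholder; a real scoring query would go here.
body = top_n_avg_body({"match_all": {}}, 10, "price.max")
print(body["aggs"]["top_n"]["sampler"])
```

Unlike the request above, the avg here is nested under the sampler, so changing N changes which documents are averaged.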

Sort multi-bucket aggregation by source fields inside inner multi-bucket aggregation

TL;DR: Using an inner multi-bucket aggregation (top_hits with size: 1) inside an outer multi-bucket aggregation, is it possible to sort the buckets of the outer aggregation by the data in the inner buckets?
I have the following index mappings
{
"parent": {
"properties": {
"children": {
"type": "nested",
"properties": {
"child_id": { "type": "keyword" }
}
}
}
}
}
and each child (in data) has also the properties last_modified: Date and other_property: String.
I need to fetch a list of children (of all the parents but without the parents), but only the one with the latest last_modified per each child_id. Then I need to sort and paginate those results to return manageable amounts of data.
I'm able to get the data and paginate over it with a combination of nested, terms, top_hits, and bucket_sort aggregations (and also get the total count with cardinality)
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"children": {
"nested": {
"path": "children"
},
"aggs": {
"totalCount": {
"cardinality": {
"field": "children.child_id"
}
},
"oneChildPerId": {
"terms": {
"field": "children.child_id",
"order": { "_term": "asc" },
"size": 1000000
},
"aggs": {
"lastModified": {
"top_hits": {
"_source": [
"children.other_property"
],
"sort": {
"children.last_modified": {
"order": "desc"
}
},
"size": 1
}
},
"paginate": {
"bucket_sort": {
"from": 36,
"size": 3
}
}
}
}
}
}
}
}
but after more than a solid day of going through the docs and experimenting, I seem to be no closer to figuring out, how to sort the buckets of my oneChildPerId aggregation by the other_property of that single child retrieved by lastModified aggregation.
Is there a way to sort a multi-bucket aggregation by results in a nested multi-bucket aggregation?
What I've tried:
I thought I could use bucket_sort for that too, but apparently its sort can only be used with paths that contain other single-bucket aggregations and end in a metric one.
I've tried to find a way to somehow transform the 1-result multi-bucket of lastModified into a single-bucket, but haven't found any.
I'm using ElasticSearch 6.8.6 (the bucket_sort and similar tools weren't available in ES 5.x and older).

I had the same problem. I needed a terms aggregation with a nested top_hits, and wanted to sort by a specific field inside the nested aggregation.
I'm not sure how performant my solution is, but the desired behaviour can be achieved with a single-value metric aggregation on the same level as the top_hits. You can then sort by this new aggregation in the terms aggregation with the order field.
Here is an example:
POST books/_doc
{ "genre": "action", "title": "bookA", "pages": 200 }
POST books/_doc
{ "genre": "action", "title": "bookB", "pages": 35 }
POST books/_doc
{ "genre": "action", "title": "bookC", "pages": 170 }
POST books/_doc
{ "genre": "comedy", "title": "bookD", "pages": 80 }
POST books/_doc
{ "genre": "comedy", "title": "bookE", "pages": 90 }
GET books/_search
{
  "size": 0,
  "aggs": {
    "by_genre": {
      "terms": {
        "field": "genre.keyword",
        "order": {"max_pages": "asc"}
      },
      "aggs": {
        "top_book": {
          "top_hits": {
            "size": 1,
            "sort": [{"pages": {"order": "desc"}}]
          }
        },
        "max_pages": {"max": {"field": "pages"}}
      }
    }
  }
}
by_genre has the order field, which sorts by a sub-aggregation called max_pages. max_pages has been added only for this purpose: it creates a single-value metric that the order can sort by.
The query above returns (I've shortened the output for clarity):
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
If you change "order": {"max_pages": "asc"} to "order": {"max_pages": "desc"}, the output becomes:
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
The type of the max_pages aggregation can be changed as needed, as long as it is a single-value metric aggregation (e.g. sum, avg, etc.).

Ordering Aggregation Buckets by Score

Is it possible to order the aggregation bucket by score?
"aggs": {
"UnitAggregationBucket": {
"terms": {
"field": "unitId",
"size": 10,
/* "order": order by max score documents per bucket */
}
}
}
I have seen this document which explains the default order is doc_count, but I cannot find out if it is possible and how to order the buckets by score.

Yes, it is possible to do that like this:
{
  "size": 0,
  "query": {
    ...
  },
  "aggs": {
    "UnitAggregationBucket": {
      "terms": {
        "field": "unitId",
        "size": 10,
        "order": {
          "score": "desc"
        }
      },
      "aggs": {
        "score": {
          "max": {
            "script": "_score"
          }
        }
      }
    }
  }
}

ElasticSearch return aggregations random order

I've got the following ElasticSearch-query, to get 10 documents from each "category" grouped on "cat.id":
"aggs": {
"test": {
"terms": {
"size": 10,
"field": "cat.id"
},
"aggs": {
"top_test_hits": {
"top_hits": {
"_source": {
"includes": [
"id"
]
},
"size": 10
}
}
}
}
}
This is working fine. However, I cannot find a way to take 10 random results from each bucket: the results are always the same. I tried all kinds of things that are intended for documents, but none of them seem to work.

As was already suggested in this answer, you can try using random sort in the top_hits aggregation, using a _script like this:
{
"aggs": {
"test": {
"terms": {
"size": 10,
"field": "cat.id"
},
"aggs": {
"top_test_hits": {
"top_hits": {
"_source": {
"includes": [
"id"
]
},
"size": 10,
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": "(System.currentTimeMillis() + doc['_id'].value).hashCode()"
},
"order": "asc"
}
}
}
}
}
}
}
}
Random sorting was broadly covered in this question.
Hope that helps!
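An alternative that avoids a sort script is to randomize the query score itself, since top_hits sorts by the score of the main query by default. The Python sketch below builds such a request with `function_score` and `random_score`; the helper name is hypothetical, `match_all` stands in for the real query, and passing `seed` together with `field` assumes an Elasticsearch version that supports that combination. Because `boost_mode: replace` discards the relevance score, this only suits cases where relevance ranking is not needed.

```python
import time


def random_top_hits_body(seed=None):
    """Request body whose top_hits come back in a seed-dependent random order."""
    if seed is None:
        # A new seed per request gives a different order each time.
        seed = int(time.time() * 1000)
    return {
        "size": 0,
        "query": {
            "function_score": {
                "query": {"match_all": {}},  # placeholder for the real query
                "random_score": {"seed": seed, "field": "_seq_no"},
                "boost_mode": "replace",  # use the random value as the score
            }
        },
        "aggs": {
            "test": {
                "terms": {"size": 10, "field": "cat.id"},
                "aggs": {
                    "top_test_hits": {
                        # No explicit sort: top_hits falls back to _score.
                        "top_hits": {"_source": {"includes": ["id"]}, "size": 10}
                    }
                },
            }
        },
    }


body = random_top_hits_body(seed=42)
print(body["query"]["function_score"]["random_score"])
```

Reusing the same seed should keep the order stable across requests, which can help with pagination; omitting it reshuffles on every call.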

How do I aggregate over top_hits results in elasticsearch

Here are example documents:
{
  "player": "Jim",
  "score": 5,
  "timestamp": 1459492890000
}
{
  "player": "Jim",
  "score": 7,
  "timestamp": 1459492895000
}
{
  "player": "Dave",
  "score": 9,
  "timestamp": 1459492894000
}
{
  "player": "Dave",
  "score": 4,
  "timestamp": 1459492898000
}
I want to get the latest score for each player and then get the average of all those scores. So the answer would be 5.5. Jim's latest score is 7 and Dave's latest score is 4. The average between those two is 5.5
The only way I found to get the "latest" document of a player was to use the top_hits aggregation. However, it does not seem that I am able to do another aggregation after I get the latest document.
This is the best I came up with:
{
  "aggs": {
    "last_score": {
      "terms": { "field": "player" },
      "aggs": {
        "last_score_hits": {
          "top_hits": {
            "sort": [ { "timestamp": { "order": "desc" } } ],
            "size": 1
          },
          "aggs": {
            "avg_score": {
              "avg": { "field": "score" }
            }
          }
        }
      }
    }
  }
}
However, this gives me this error:
Aggregator [last_score_hits] of type [top_hits] cannot accept
sub-aggregations
If there is another way to accomplish this search without using top_hits as well, then I would be all for it.

You're trying to put avg_score as a sub-aggregation of last_score_hits, which top_hits does not allow.
To make the query valid, put avg_score as a sub-aggregation of last_score instead. See the example below:
{
"aggs": {
"last_score": {
"terms": {
"field": "player"
},
"aggs": {
"last_score_hits": {
"top_hits": {
"sort": [
{
"timestamp": {
"order": "desc"
}
}
],
"size": 1
}
},
"avg_score": {
"avg": {
"field": "score"
}
}
}
}
}
}

You can have other aggregations on the same level as top_hits, but you cannot have any sub-aggregation below top_hits. This is not supported by Elasticsearch, and there is a GitHub issue about it.
A parallel-level aggregation looks like this:
"aggs": {
  "top_hits_agg": {
    "top_hits": {
      "size": 10,
      "_source": {
        "includes": ["score"]
      }
    }
  },
  "avg_agg": {
    "avg": {
      "field": "score"
    }
  }
}
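Note that an avg placed next to top_hits averages every document in its scope, not only the hits that top_hits returns. If the goal is the average of each player's latest score (5.5 in the question), one possible workaround is to run the terms + top_hits query and do the final averaging client-side. The Python sketch below assumes the response follows the usual aggregation layout with the aggregation names from the question (`last_score`, `last_score_hits`); the function name and the hand-written response are hypothetical. It also only sees the buckets that the terms aggregation returns, so its `size` must be large enough to cover all players.

```python
def avg_of_latest_scores(response):
    """Average the score of the single latest hit in each player bucket."""
    buckets = response["aggregations"]["last_score"]["buckets"]
    latest = [
        bucket["last_score_hits"]["hits"]["hits"][0]["_source"]["score"]
        for bucket in buckets
        if bucket["last_score_hits"]["hits"]["hits"]  # skip empty buckets
    ]
    return sum(latest) / len(latest) if latest else None


# Hand-written response trimmed to the parts the function reads.
response = {
    "aggregations": {
        "last_score": {
            "buckets": [
                {"key": "Jim", "last_score_hits": {"hits": {"hits": [
                    {"_source": {"player": "Jim", "score": 7,
                                 "timestamp": 1459492895000}}]}}},
                {"key": "Dave", "last_score_hits": {"hits": {"hits": [
                    {"_source": {"player": "Dave", "score": 4,
                                 "timestamp": 1459492898000}}]}}},
            ]
        }
    }
}
average = avg_of_latest_scores(response)
print(average)
```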
