Is it possible to make Elasticsearch aggregations faster when I only want the top 10 buckets?

I understand Elasticsearch aggregation queries take a long time to execute by nature, especially on high-cardinality fields. In our use case, we only need to bring back the first x buckets, sorted alphabetically. Considering we only need 10 buckets, is there a way to make our query faster? Is there a way to get Elasticsearch to look at only the first 10 buckets on each shard and compare just those?
Here is my query...
{
  "size": "600",
  "timeout": "60s",
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "firstname": {
      "terms": {
        "field": "firstname.keyword",
        "size": 10,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": {
          "_key": "asc"
        },
        "include": ".*.*",
        "exclude": ""
      }
    }
  }
}
I think I am on to something using a composite aggregation instead. Like this...
"home_address1": {
"composite": {
"size": 10,
"sources": [
{
"home_address1": {
"terms": {
"field": "home_address1.keyword",
"order": "asc"
}
}
}
]
}
}
Testing in Postman shows that this request runs way faster. Is this expected? If so, awesome. How can I add the include and exclude parameters to the composite query? For example, sometimes I only want to include buckets whose value matches "A.*".
If this query should not be any faster, then why does it appear so?

Composite terms sources unfortunately don't support include, exclude, and many other standard terms agg parameters, so you've got two options:
Filter out the docs that don't match your prefix from within the query, as @Val pointed out (see the prefix sketch after this list),
or
use a script to do the filtering for you. This should be your last resort, though -- scripts are pretty much guaranteed to run more slowly than standard query filters.
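For option 1, a minimal sketch, assuming your pattern "A.*" just means "starts with A", so a plain prefix query on the keyword field is enough:
{
  "size": 0,
  "query": {
    "prefix": {
      "home_address1.keyword": "A"
    }
  },
  "aggs": {
    "home_address1": {
      "composite": {
        "size": 10,
        "sources": [
          {
            "home_address1": {
              "terms": {
                "field": "home_address1.keyword",
                "order": "asc"
              }
            }
          }
        ]
      }
    }
  }
}
For option 2, the scripted variant looks like this: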
{
  "size": 0,
  "aggs": {
    "home_address1": {
      "composite": {
        "size": 10,
        "sources": [
          {
            "home_address1": {
              "terms": {
                "order": "asc",
                "script": {
                  "source": """
                    // skip docs without a value, then keep only values
                    // matching "A.*", i.e. starting with 'A'
                    if (doc['home_address1.keyword'].size() == 0) { return null; }
                    def val = doc['home_address1.keyword'].value;
                    if (val.startsWith('A')) {
                      return val;
                    }
                    return null;
                  """,
                  "lang": "painless"
                }
              }
            }
          }
        ]
      }
    }
  }
}
BTW your original terms agg already seems optimized so it surprises me that the composite is faster.

Related

ElasticSearch: Is it possible to do a "Weighted Avg Aggregation" weighted by the score?

I'm trying to perform an avg over a price field (price.avg), but I want the best matches of the query to have more impact on the average than the lower-ranked ones, so the avg should be weighted by the calculated score. This is the aggregation that I'm implementing:
{
  "query": {...},
  "size": 100,
  "aggs": {
    "weighted_avg_price": {
      "weighted_avg": {
        "value": {
          "field": "price.avg"
        },
        "weight": {
          "script": "_score"
        }
      }
    }
  }
}
It should give me what I want. But instead I receive a null value:
{...
  "hits": {...},
  "aggregations": {
    "weighted_avg_price": {
      "value": null
    }
  }
}
Is there something that I'm missing? Is this aggregation query feasible? Is there any workaround?
When you debug what's available from within the script
GET prices/_search
{
  "size": 0,
  "aggs": {
    "weighted_avg_price": {
      "weighted_avg": {
        "value": {
          "field": "price"
        },
        "weight": {
          "script": "Debug.explain(new ArrayList(params.keySet()))"
        }
      }
    }
  }
}
the following gets spit out
[doc, _source, _doc, _fields]
None of these contain information about the query _score that you're trying to access because aggregations operate in a context separate from the query-level scoring. This means the weight value needs to either
exist in the doc or
exist in the doc + be modifiable or
be a query-time constant (like 42 or 0.1)
A workaround could be to apply a math function to the retrieved price, such as
"script": "Math.pow(doc.price.value, 0.5)"
@jzzfs I'm trying the approach of "avg of the first N results (ordered by _score)", using a top_hits aggregation:
{
  "query": {
    "bool": {
      "should": [
        ...
      ],
      "minimum_should_match": 0
    }
  },
  "size": 0,
  "from": 0,
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "top_avg_price": {
      "avg": {
        "field": "price.max"
      }
    },
    "top_10": {
      "top_hits": {
        "size": 10,
        "_source": {
          "includes": [
            "price.max"
          ]
        }
      }
    }
  },
  "explain": false
}
The avg aggregation is being done over the main results, not the top_hits aggregation (changing the top_hits size N doesn't change top_avg_price).
I guess top_avg_price should be a subaggregation of top_hits, but I think that's not possible ATM.
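One possible workaround is the sampler aggregation, which limits sub-aggregations to the top-scoring shard_size documents on each shard, approximating an "avg of the best N matches". A sketch, with the agg name and the cutoff chosen for illustration:
GET prices/_search
{
  "size": 0,
  "query": {...},
  "aggs": {
    "best_matches": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "top_avg_price": {
          "avg": {
            "field": "price.max"
          }
        }
      }
    }
  }
}
Note that the cutoff is per shard, so on a multi-shard index this only approximates a global top-N average.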

How do I filter after an aggregation?

I am trying to filter after a top_hits aggregation, to find out whether the first appearance of an error falls in a given date range, but I can't find a way.
I have seen something about the bucket selector but can't get it to work:
POST log-*/_search
{
  "size": 100,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
With this top_hits I get the first appearance of a concrete errorID, as I have many documents with the same errorID, but what I want to find out is whether that first appearance is within a given range of dates.
I think a valid solution would be to filter the results of the aggregation to check whether each first appearance is in the range, but I don't know how I could do that.
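A minimal sketch of the bucket_selector approach, assuming @timestamp is a date field: compute the first appearance per errorID with a min aggregation (which returns epoch milliseconds), then keep only the buckets whose value falls inside the range:
POST log-*/_search
{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "errorID.keyword",
        "size": 100
      },
      "aggs": {
        "first_seen": {
          "min": {
            "field": "@timestamp"
          }
        },
        "in_range": {
          "bucket_selector": {
            "buckets_path": {
              "first": "first_seen"
            },
            "script": "params.first >= 1672531200000L && params.first < 1675209600000L" <--- example epoch-millis bounds (2023-01-01 to 2023-02-01)
          }
        }
      }
    }
  }
}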

Aggregated results show fewer items than doc_count?

I have an ElasticSearch query which aggregates the result on a certain field, called _aggregate. Now I have this strange situation given this query:
"size": 100,
"aggregations": {
"results": {
"terms": {
"field": "_aggregate",
"size": 1000,
"order": {
"_count": "desc"
}
},
"aggregations": {
"bundled": {
"top_hits": {
"sort": [
{
"_weight": "asc"
}
]
}
}
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"_aggregate": "5713618784853"
}
}
]
}
}
}
When I run this search, it returns 8 hits (as expected). However, when I look at the aggregated results, I see a doc_count of 8 (so far so good), but only 3 hits are returned inside the bucket.
Increasing the size on the _aggregate terms aggregation does not have any effect.
Does anyone know how this is possible, or what can possibly cause this?
This is because the top_hits metric aggregation returns 3 hits by default. You can override this:
"aggregations": {
"bundled": {
"top_hits": {
"size": 10, <--- add this
"sort": [
{
"_weight": "asc"
}
]
}
}
}

Aggregate over multiple fields without subaggregation

I have documents in my Elasticsearch which have two fields. I want to build an aggregate over the combination of these, kind of like SQL's GROUP BY field_A, field_B, and get a row per existing combination. I read everywhere that I should use a subaggregation for this:
{
  "aggs": {
    "sales_by_article": {
      "terms": {
        "field": "catalogs.article_grouping",
        "size": 1000000,
        "order": {
          "total_amount": "desc"
        }
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "script": "Math.round(doc['amount.value'].value*100)/100.0"
          }
        },
        "sales_by_submodel": {
          "terms": {
            "field": "catalogs.submodel_grouping",
            "size": 1000,
            "order": {
              "total_amount": "desc"
            }
          },
          "aggs": {
            "total_amount": {
              "sum": {
                "script": "Math.round(doc['amount.value'].value*100)/100.0"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
With the following simplified result:
{
  "aggregations": {
    "sales_by_article": {
      "buckets": [
        {
          "key": "19114",
          "total_amount": {
            "value": 426794.25
          },
          "sales_by_submodel": {
            "buckets": [
              {
                "key": "12",
                "total_amount": {
                  "value": 51512.200000000004
                }
              },
              ...
            ]
          }
        },
        ...
      ]
    }
  }
}
However, the problem with this is that the ordering is not what I want. In this particular case it first orders the articles by total_amount per article, and then within each article it orders the submodels by total_amount per submodel. What I want instead is only the deepest level: an aggregation for each combination of article and submodel, ordered by the total_amount of that combination. This is the result I would like:
{
  "aggregations": {
    "sales_by_article_and_submodel": {
      "buckets": [
        {
          "key": "1911412",
          "total_amount": {
            "value": 51512.200000000004
          }
        },
        ...
      ]
    }
  }
}
It's discussed in the docs a bit here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_multi_field_terms_aggregation
Basically you can use a script to create a term which is derived from each document (using as many fields as you want) at query run time, but it will be slow. If you are doing it for ad hoc analysis, it'll work fine. If you need to serve these requests at some high rate, then you probably want to make a field in your model that is a combination of the two fields you're interested in, so the index is populated for you already.
Example query using the script approach:
GET agreements/agreement/_search?size=0
{
  "aggs": {
    "myAggregationName": {
      "terms": {
        "script": {
          "source": "doc['owningVendorCode'].value + '|' + doc['region'].value",
          "lang": "painless"
        }
      }
    }
  }
}
I have since learned that I should use a composite aggregation for this.
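A minimal sketch of that composite variant, reusing the field names from the script example above (the agg and source names are arbitrary). Note that composite buckets are paginated in key order, so ordering by a metric such as total_amount would still have to happen client-side:
GET agreements/_search?size=0
{
  "aggs": {
    "vendor_and_region": {
      "composite": {
        "size": 1000,
        "sources": [
          { "vendor": { "terms": { "field": "owningVendorCode" } } },
          { "region": { "terms": { "field": "region" } } }
        ]
      }
    }
  }
}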

Removing duplicates and sorting (aggs + sort)

I'm trying to find the best solution where a query returns a sorted set, from which I then use aggs to remove duplicates. This works fine; however, when I add a sort on the query results, e.g.
"query": {..},
"sort": {.. "body.make": "asc" ..}
I'd like the aggs to also return the results in that order; however, they always seem to be ordered by the query score.
// Here I'm collecting all body.vin values to remove duplicates
// and then returning only the first in each result set.
"aggs": {
  "dedup": {
    "terms": {
      "size": 8,
      "field": "body.vin"
    },
    "aggs": {
      "dedup_docs": {
        "top_hits": {
          "size": 1,
          "_source": false
        }
      }
    }
  }
},
I've tried to put a terms aggregation in between to see if that would sort:
// Here again, same thing, but I attempt to sort on body.make
// in the document. However, I now realize that each of my buckets,
// being a collection of the duplicates, will sort within the
// duplicates and not across the final results.
"aggs": {
  "dedup": {
    "terms": {
      "size": 8,
      "field": "body.vin"
    },
    "aggs": {
      "order": {
        "terms": {
          "field": "body.make",
          "order": {
            "_term": "asc"
          }
        },
        "aggs": {
          "dedup_docs": {
            "top_hits": {
              "size": 1,
              "_source": false
            }
          }
        }
      }
    }
  }
},
But the results from the aggregation are always based on score.
Also, I've toyed with the idea of adjusting the scores based on the query sort; that way the aggregation would return the proper order, since it orders by score, but there doesn't seem to be any way of doing this with sort: {}.
If anyone has had success in sorting results while removing duplicates, or has ideas/suggestions, please let me know.
This is not the most ideal solution since it only allows sorting on one field; the best would be to change scores/boosts on sorted results.
Trying to explain it made me realize how this could be done once I grasped the concept of buckets, or more precisely how they are passed down. I would still be interested in the sort + score-adjust solution, but via aggregates this works:
// Here we first aggregate on body.make, so the first bucket might be
// {"toyota": {body.vin 123}, "toyota": {body.vin 123}...} and the
// next bucket passed into the dedup aggregate would be, say,
// {"nissan"...
"aggs": {
"sort": {
"terms": {
"size": 8,
"field": "body.make",
"order": {
"_term": "desc"
}
},
"aggs": {
"dedup": {
"terms": {
"size": 8,
"field": "body.vin"
},
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
}
},
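An alternative worth mentioning: field collapsing dedupes on body.vin at query time while keeping the query-level sort, which sidesteps the aggregation-ordering problem entirely. A minimal sketch, assuming a hypothetical cars index:
GET cars/_search <--- index name assumed for illustration
{
  "size": 8,
  "query": {...},
  "collapse": {
    "field": "body.vin"
  },
  "sort": [
    { "body.make": "asc" }
  ]
}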
