I want to understand how Elasticsearch works under the hood for the stats aggregation and the sum aggregation.
My use case needs a date histogram aggregation as the primary aggregation, with a sum or stats aggregation nested inside it. I executed queries using both aggregations on the same data in Kibana, and both queries took a similar time to execute. So, if there is no performance difference between stats and sum, we might just use the stats aggregation for all our use cases.
I couldn't find any detailed information about the internal workings of these aggregations. Could you share any information on this, or point me to documentation describing how they work under the hood?
Elasticsearch version : 7.1
Thank You
When in doubt, go to the source.
If you look at the implementation of StatsAggregator.java and SumAggregator.java, you'll see that they are very similar.
SumAggregator only computes a sum, while StatsAggregator computes sum, min, max, count and avg. Even though the latter seems to do more work, it also iterates only once through the data in order to compute the additional metrics, and those computations are not expensive.
So if you know you only need the sum, use SumAggregator, but if you also need min, max, count or avg, then go for StatsAggregator instead, so you only iterate through the data once.
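To make the comparison concrete, here is a minimal sketch of the two request bodies, built as Python dicts. The field names `timestamp` and `price` are placeholders for whatever your mapping uses, and this uses the pre-7.2 `interval` parameter (7.2+ renamed it to `calendar_interval`/`fixed_interval`):

```python
# Two aggregation request bodies: a date_histogram bucketing by day,
# with either a sum or a stats sub-aggregation on a numeric field.
# "timestamp" and "price" are placeholder field names.

def daily_agg(metric: str) -> dict:
    """Build a date_histogram request with the given metric sub-aggregation."""
    return {
        "size": 0,  # we only want aggregation results, not hits
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "timestamp", "interval": "day"},
                "aggs": {"price_metric": {metric: {"field": "price"}}},
            }
        },
    }

sum_body = daily_agg("sum")      # sum only
stats_body = daily_agg("stats")  # sum, min, max, count, avg in one pass
```

Both bodies are identical apart from the metric name, which matches the observation that their execution times are similar: the bucketing work dominates, and both metrics are computed in a single pass over each bucket's values.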
Related
I have a search query something like this: ((q1 AND q2) OR (q3 AND q4)). I can replicate it using must and should in Elasticsearch's search or scroll API. But I want to know how many such queries can be combined in a single Elasticsearch query. I have around 1000 OR'ed queries; will passing 1000 queries in the should clause have a negative impact on performance? And is there any limit on the number of queries? I know there is a limit on the number of clauses in a query, max_clause_count. Is there a similar limit on the number of queries?
There is no hardcoded limitation.
This will depend on your hardware and data; try it and ask again if you get an error!
To improve:
Use filter
Use constant_score to "disable" the tf/idf calculation (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-constant-score-query.html)
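The two suggestions above can be combined: wrap each (qA AND qB) pair in a constant_score filter so Elasticsearch skips relevance scoring per clause. A minimal sketch, with placeholder field names and values:

```python
# Sketch: combine many (qA AND qB) pairs under a bool "should", wrapping
# each pair in constant_score so the clauses run in filter context and
# no tf/idf scoring work is done. Fields "f1"/"f2" and the values are
# placeholders.

def pair(field_a, val_a, field_b, val_b):
    """One (field_a=val_a AND field_b=val_b) clause, scoring disabled."""
    return {
        "constant_score": {
            "filter": {
                "bool": {
                    "must": [
                        {"term": {field_a: val_a}},
                        {"term": {field_b: val_b}},
                    ]
                }
            }
        }
    }

query = {
    "query": {
        "bool": {
            "should": [pair("f1", "v%d" % i, "f2", "w%d" % i) for i in range(1000)],
            "minimum_should_match": 1,  # at least one pair must match
        }
    }
}
```

Filter-context clauses are also cacheable, which helps if the same pairs are queried repeatedly.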
Just out of curiosity, why do you need 1000 queries?
I am trying to get average of grouped minimum values of a field.
In the image, I am getting two buckets, each with a min_time property. I need the average of these min_time values, so the final result should have only one bucket containing the average of min_time. I think it can be achieved through piping, but I'm not quite getting it.
I think you are looking for pipeline aggregations, which are a feature in version 2.0 and later:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html
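Concretely, the avg_bucket pipeline aggregation averages a metric across sibling buckets. A minimal sketch, where `group_field` and `time` are placeholder field names standing in for the fields behind the two buckets in the screenshot:

```python
# Sketch: a terms aggregation computing a min per bucket, plus an
# avg_bucket pipeline aggregation that averages those minimums into
# a single value. "group_field" and "time" are placeholder fields.

body = {
    "size": 0,
    "aggs": {
        "groups": {
            "terms": {"field": "group_field"},
            "aggs": {"min_time": {"min": {"field": "time"}}},
        },
        # Sibling pipeline agg: path points at groups -> min_time
        "avg_min_time": {
            "avg_bucket": {"buckets_path": "groups>min_time"}
        },
    },
}
```

The response would then contain one `avg_min_time` value alongside the per-group buckets, which is the single averaged result the question asks for.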
I have more than 2 million documents, each with a price and a discount. I have to get the percentage of products with a 10%, 20%, 30%, ..., 90%, 100% discount (discounts are rounded off). It's not possible to fetch the data and aggregate it in the application layer, as it would take too much time. I am also afraid that it would create lag for other users, since a thread would be busy for a long time.
Is there any way I can create custom filters upon aggregation logic?
It doesn't look like you need a custom filter here. This functionality is a standard part of the histogram aggregation. You can also take a look at the range aggregation in case you need more flexible ranges.
If you need complete flexibility in how terms are calculated, you can also use a script in the terms aggregation to return the value you want to group your records by. However, with 2 million documents, it might be better to pre-calculate the discount before indexing each document, store this value as a separate field, and then use the histogram aggregation.
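Assuming the discount is pre-calculated and stored in a field (here called `discount_pct`, a placeholder name), the histogram request is short:

```python
# Sketch: a histogram over a pre-computed discount field with 10-point
# buckets. Each bucket's doc_count divided by the total hit count gives
# the percentage of products at that discount. "discount_pct" is a
# placeholder for a field stored at index time.

body = {
    "size": 0,
    "aggs": {
        "by_discount": {
            "histogram": {
                "field": "discount_pct",
                "interval": 10,       # buckets at 0, 10, 20, ..., 100
                "min_doc_count": 0,   # keep empty buckets so all steps appear
            }
        }
    },
}
```

Elasticsearch does the bucketing on the data nodes, so nothing close to 2 million documents ever crosses the wire to the application layer.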
I have a series of JSON documents like {"type":"A","value":2}, {"type":"B","value":3}, and {"type":"C","value":7}, and I feed that into Elasticsearch.
Let's say I want to do one query to get the average value of all documents with "type": "A".
What is the difference between how Elasticsearch calculates the count vs. how, let's say, Mongo would?
Is Elasticsearch:
1. Automatically creating a "rolling count" for all those types and incrementing something like "typeA_sum", "typeA_count", "typeA_avg" as new data is fed in? If so, that would be sweet, because then it's not actually having to calculate anything.
2. Just creating an index over type and actually calculating the sum each time the query is run?
3. Doing #2 in the background (i.e. precalculating) and just updating some cached value, so that when the query runs it has the result pretty quickly?
It is closest to your #2; however, the results are cached, so that if they are useful in a subsequent query, that query will be very quick. There is no way Elasticsearch could know beforehand what query you are going to run, so #1 is impossible, and #3 would be wasteful.
However, for your example use case you probably do not need two queries, one would be enough. See for instance the stats aggregation that will return count, min, max, average and sum. Combine that with a terms aggregation (and perhaps a missing aggregation) to group the documents on your type field, and you'll get count and average (and min, max, sum) for all types separately with a single query.
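A minimal sketch of that single request, using the `type` and `value` fields from the example documents (the `missing` sibling is optional, for documents with no type):

```python
# Sketch: group documents by "type" and compute stats (count, min, max,
# avg, sum) per group in one request. A "missing" aggregation catches
# documents that have no "type" field at all.

body = {
    "size": 0,
    "aggs": {
        "by_type": {
            "terms": {"field": "type"},
            "aggs": {"value_stats": {"stats": {"field": "value"}}},
        },
        "no_type": {
            "missing": {"field": "type"},
            "aggs": {"value_stats": {"stats": {"field": "value"}}},
        },
    },
}
```

One round trip returns count, min, max, avg and sum for every type, instead of one avg query per type.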
Elasticsearch builds the aggregation results based on all the hits of the query, independently of the from and size parameters. This is what we want in most cases, but I have a particular case in which I need to limit the aggregation to the top N hits. The limit filter is not suitable, as it does not fetch the best N items but only the first X matching the query (per shard), independently of their score.
Is there any way to build a query whose hit count has an upper limit N in order to be able to build an aggregation limited to those top N results? And if so how?
Subsidiary question: Limiting the score of matching documents could be an alternative even though in my case I would require a fixed bound. Does the min_score parameter affect aggregation?
You are looking for Sampler Aggregation.
I have a similar answer explained here
Optionally, you can use the field or script and max_docs_per_value settings to control the maximum number of documents collected on any one shard which share a common value.
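A minimal sketch of a sampler aggregation for this case; the metric and field name (`price`) are placeholders:

```python
# Sketch: a sampler aggregation restricts its sub-aggregations to the
# best-scoring shard_size hits on each shard, then the sub-aggregation
# runs only over that sample. "price" is a placeholder field name.

body = {
    "size": 0,
    "aggs": {
        "top_sample": {
            "sampler": {"shard_size": 100},  # top 100 hits per shard
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        }
    },
}
```

Note that `shard_size` is applied per shard, so the total sample across the index is roughly shard_size times the number of shards, not exactly N; with a single shard it is exactly the top N.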
If you are using an Elasticsearch cluster with a version > 1.3, you can use the top_hits aggregation by nesting it in your aggregation, ordering on the field you want, and setting the size parameter to X.
The related documentation can be found here.
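A minimal sketch of that shape, with placeholder field names and X = 3:

```python
# Sketch: a top_hits aggregation nested under a terms aggregation
# returns the top X documents per bucket, ordered by score here
# (any sort works). "type" is a placeholder field name.

body = {
    "size": 0,
    "aggs": {
        "by_type": {
            "terms": {"field": "type"},
            "aggs": {
                "best_docs": {
                    "top_hits": {
                        "size": 3,  # X best hits per bucket
                        "sort": [{"_score": {"order": "desc"}}],
                    }
                }
            },
        }
    },
}
```

Note that top_hits returns documents per bucket; it does not itself restrict which documents sibling metric aggregations see, so for truly limiting a metric to the top N hits the sampler aggregation mentioned above is the better fit.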
I need to limit the aggregation to the top N hits
With nested aggregations, your top bucket can represent those N hits, with nested aggregations operating on that bucket. I would try a filter aggregation for the top level aggregation.
The tricky part is making use of the _score somehow in the filter and limiting it to exactly N entries... There is a limit filter that works per shard, but I don't think it would work in this context.
It looks like Sampler Aggregation can now be used for this purpose. Note that it is only available as of Elastic 2.0.