I am trying to get the average of grouped minimum values of a field.
In the image, I am getting two buckets, each with a min_time property. I need to get the average of this min_time, so the final result should have only one bucket containing the average of min_time. I think it can be achieved through pipelining, but I'm not quite getting it.
I think you are looking for pipeline aggregations, which are a feature in version 2.0 and later:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html
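For example, a minimal sketch using an avg_bucket pipeline aggregation (the index name, the terms field, and the time field are placeholders for whatever produces your two buckets):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "groups": {
      "terms": { "field": "group_field" },
      "aggs": {
        "min_time": { "min": { "field": "time" } }
      }
    },
    "avg_min_time": {
      "avg_bucket": { "buckets_path": "groups>min_time" }
    }
  }
}
```

The avg_bucket aggregation runs after the groups buckets are built and returns a single value averaging min_time across them.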
I want to understand how Elasticsearch works under the hood for the stats aggregation and the sum aggregation.
My use case needs a date histogram aggregation as the primary aggregation and a sum or stats aggregation nested inside it. I executed queries using both aggregations on the same amount of data in Kibana, and both queries took a similar time to execute. So, if there is no performance difference between stats and sum, we might just use the stats aggregation for all our use cases.
I couldn't find any detailed information about the internal workings of these aggregations. Please share any information on this, or point me to documentation describing how these aggregations work under the hood.
Elasticsearch version: 7.1
Thank you
When in doubt, go to the source.
If you look at the implementation of StatsAggregator.java and SumAggregator.java, you'll see that they are very similar.
SumAggregator only computes a sum, while StatsAggregator computes sum, min, max, count, and avg. Even though the latter seems to do more work, it also iterates through the data only once in order to compute the additional metrics, and those computations are not computationally expensive.
So if you know you need just the sum, use SumAggregator; but if you also need min, max, count, or avg, then go for StatsAggregator instead, so you only iterate through the data once.
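For illustration, a sketch of the two variants nested in a date histogram, as in the use case above (the index name, field names, and the interval value are assumptions; the interval syntax matches 7.1):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "@timestamp", "interval": "day" },
      "aggs": {
        "value_sum":   { "sum":   { "field": "value" } },
        "value_stats": { "stats": { "field": "value" } }
      }
    }
  }
}
```

In practice you would keep only one of the two sub-aggregations; both walk the same doc values once per bucket.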
My Kibana version is 4.5.
My Elasticsearch version is 2.3.1.
See pic1: the uv is 7665.
But see pic2: the uv is 7845.
Why are they different?
The Kibana unique count does not seem correct.
If these charts are based on live data, then I doubt both graphs would show the same count, since you have two different time ranges in the two graphs.
In the first one your time range is yesterday, whereas in the second one you're trying to auto-refresh every minute, which shows as paused. I'm assuming that you're dealing with live data, so some records might have slipped in by the time you paused. If not, I cannot see how these two could show different values.
Just out of curiosity, how do you know that the correct count for uv should be 7665? I can't see the exact value of uv from the snapshot of the graph. Did you double-check against your ES index with a query?
EDIT:
Interestingly, unique counts are based on the cardinality aggregation, which is designed to work efficiently across very large amounts of data and delivers an approximate result, which may be why your results vary. You can try increasing the precision_threshold.
To get a more accurate value, add something like {"precision_threshold": 1000} to the "JSON Input" box for the aggregation.
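Kibana's unique count is just a cardinality aggregation underneath, so the equivalent raw query would look something like this (the index and field names are placeholders):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "uv_count": {
      "cardinality": {
        "field": "uv",
        "precision_threshold": 1000
      }
    }
  }
}
```

Counts at or below the precision_threshold are close to exact; above it the result becomes increasingly approximate.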
Hope this helps!
I'm looking for a setup that actually returns only the top 10% of results of a certain query. We also want to sort that subset afterwards.
Is there an easy way to do this? Can anyone provide a simple example?
I was thinking of scaling the result scores between 0 and 1.0 and basically specifying a min_score of 0.9.
I was trying to create function_score queries, but those seem a bit complex for a simple requirement such as this one, plus I was not sure how sorting would affect the results, since I want the sort to always work on the 10% most relevant articles, of course.
Thanks,
Peter
Since you want to slice the response as a percentage of the overall document count, you need to know that count anyway, and the from / size params will then cut off the required amount at query time.
Given that, it seems the easiest way to achieve your goal is to make two queries (see the sketch after the list):
1. A filtered query with all your filters, no scoring query, and search_type=count, to get the overall document count.
2. Your regular matching query, applying {"from": 0, "size": count/10} with the count obtained from the first response.
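A rough sketch of the two requests, assuming an index named articles and the filtered-query syntax of that Elasticsearch generation (all names are placeholders):

```json
POST /articles/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": { "term": { "status": "published" } }
    }
  }
}
```

If hits.total in the first response is, say, 5000, the second request asks for the top 500:

```json
POST /articles/_search
{
  "query": {
    "filtered": {
      "query":  { "match": { "body": "elasticsearch" } },
      "filter": { "term": { "status": "published" } }
    }
  },
  "from": 0,
  "size": 500
}
```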
As for tweaking the scoring: to me it seems like a bad idea, since multiple documents ending up with the same score is a pretty common situation. Cutting the dataset by min_score will therefore probably result in skewed data.
I have a series of JSON documents like {"type":"A","value":2}, {"type":"B","value":3}, and {"type":"C","value":7}, and I feed them into Elasticsearch.
Let's say I want to do one query to average the value of all documents with "type": "A".
What is the difference between how Elasticsearch calculates the count versus how, say, Mongo would?
Is Elasticsearch:
1. Automatically creating a "rolling count" for all those types and incrementing something like "typeA_sum", "typeA_count", "typeA_avg" as new data is fed in? If so, that would be sweet, because then it's not actually having to calculate anything.
2. Just creating an index over type and actually calculating the sum each time the query is run?
3. Doing #2 in the background (i.e. precalculating) and just updating some cached value so that when the query runs it has the result pretty quickly?
It is closest to your #2; however, the results are cached, so if they are useful for a subsequent query, that query will be very quick. There is no way Elasticsearch could know beforehand what query you are going to run, so #1 is impossible, and #3 would be wasteful.
However, for your example use case you probably do not need two queries; one would be enough. See for instance the stats aggregation, which returns count, min, max, average, and sum. Combine that with a terms aggregation (and perhaps a missing aggregation) to group the documents on your type field, and you'll get count and average (and min, max, sum) for all types separately with a single query, as sketched below.
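A minimal sketch of that single query, using the type and value fields from the example documents (the index name is a placeholder):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "by_type": {
      "terms": { "field": "type" },
      "aggs": {
        "value_stats": { "stats": { "field": "value" } }
      }
    }
  }
}
```

Each type bucket (A, B, C) then carries its own count, min, max, avg, and sum for value.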
Elasticsearch builds the aggregation results based on all the hits of the query, independently of the from and size parameters. This is what we want in most cases, but I have a particular case in which I need to limit the aggregation to the top N hits. The limit filter is not suitable, as it does not fetch the best N items but only the first X matching the query (per shard), regardless of their score.
Is there any way to build a query whose hit count has an upper limit of N, so that I can build an aggregation limited to those top N results? And if so, how?
Subsidiary question: limiting the score of matching documents could be an alternative, even though in my case I would need a fixed bound. Does the min_score parameter affect aggregations?
You are looking for Sampler Aggregation.
I have a similar answer explained here
Optionally, you can use the field or script and max_docs_per_value settings to control the maximum number of documents collected on any one shard which share a common value.
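A minimal sketch of a sampler aggregation that restricts a nested aggregation to the best-matching documents, with shard_size capping how many top-scoring documents each shard contributes (the index, fields, and aggregation names are placeholders):

```json
POST /my-index/_search
{
  "size": 0,
  "query": { "match": { "body": "elasticsearch" } },
  "aggs": {
    "top_sample": {
      "sampler": { "shard_size": 100 },
      "aggs": {
        "by_type": { "terms": { "field": "type" } }
      }
    }
  }
}
```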
If you are using an Elasticsearch cluster with version > 1.3, you can use the top_hits aggregation by nesting it inside your aggregation, ordering on the field you want, and setting the size parameter to X.
The related documentation can be found here.
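For illustration, a sketch of a top_hits aggregation nested inside a terms aggregation, sorted on a field with the size parameter set to X (all names are placeholders, and X here is 3):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "by_type": {
      "terms": { "field": "type" },
      "aggs": {
        "best_hits": {
          "top_hits": {
            "sort": [ { "value": { "order": "desc" } } ],
            "size": 3
          }
        }
      }
    }
  }
}
```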
I need to limit the aggregation to the top N hits
With nested aggregations, your top bucket can represent those N hits, with nested aggregations operating on that bucket. I would try a filter aggregation for the top-level aggregation.
The tricky part is making use of the _score somehow in the filter and limiting it to exactly N entries... There is a limit filter that works per shard, but I don't think it would work in this context.
It looks like the sampler aggregation can now be used for this purpose. Note that it is only available as of Elasticsearch 2.0.