ElasticSearch - How to Aggregate the Geometric Mean?

I'm currently aggregating records to get the average (arithmetic average) of a field in the returned records. My use case requires me to get hold of the geometric average:
The geometric mean is defined as the nth root of the product of n numbers.
How could I go about getting this value? I don't even know where to start!
Thanks!

It is not trivial, but it can be done. The idea is to use a sum of logs and then apply the n-th root:
pow(exp((sum of logs)), 1/n)
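For example, for the two values 2 and 8, the sum of logs is ln 2 + ln 8 = ln 16, and pow(exp(ln 16), 1/2) = sqrt(16) = 4, which is indeed the geometric mean of 2 and 8.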
In fact, the GeometricMean aggregation of the Elasticsearch Index Termlist Plugin does exactly that. (However, this is a third-party plugin; I can't tell if it is stable enough.)
Mapping and sample data
Let's assume we have the following mapping:
PUT geom_mean
{
  "mappings": {
    "nums": {
      "properties": {
        "x": {
          "type": "double"
        }
      }
    }
  }
}
And we insert the following documents:
{"x":33}
{"x":324}
{"x":134}
{"x":0.1}
Now we can try the query.
The ES query
Here is the query to calculate geometric mean:
POST geom_mean/nums/_search
{
  "size": 0,
  "aggs": {
    "aggs_root": {
      "terms": {
        "script": "'Bazinga!'"
      },
      "aggs": {
        "sum_log_x": {
          "sum": {
            "script": {
              "inline": "Math.log(doc.x.getValue())"
            }
          }
        },
        "geom_mean": {
          "bucket_script": {
            "buckets_path": {
              "sum_log_x": "sum_log_x",
              "x_cnt": "_count"
            },
            "script": "Math.pow(Math.exp(params.sum_log_x), 1 / params.x_cnt)"
          }
        }
      }
    }
  }
}
The return value will be:
"aggregations": {
"aggs_root": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Bazinga!",
"doc_count": 4,
"sum_log_x": {
"value": 11.872505784215674
},
"geom_mean": {
"value": 19.455434622111177
}
}
]
}
}
Now a bit of explanation. The sum_log_x aggregation computes the sum of log(x). The aggregation named geom_mean is a pipeline aggregation applied to the result of sum_log_x (its sibling). It uses the special bucket path _count to get the number of elements. (Here you can read about the bucket_script aggregation a bit more.)
The final trick is to wrap both of them in some parent aggregation because, as explained in this issue, bucket_script cannot be a top-level aggregation. Here I do a terms aggregation on a script that always returns 'Bazinga!'.
Thanks to anhzhi who proposed this hack.
Important considerations
Since the geometric mean is computed through logs, all x values should be greater than 0. However:
if any of the values is < 0, the result is "NaN"
if all values are non-negative and less than "+Infinity", but at least one value is 0, the result is "-Infinity"
if both "+Infinity" and "-Infinity" are among the values, the result is "NaN".
The queries were tested with Elasticsearch 5.4. Performance on a large collection of documents was not tested; you might consider indexing x together with its log to make the aggregation more efficient.
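A sketch of that optimization, assuming a hypothetical log_x field computed by your indexing pipeline (the value shown is approximate): each document stores both x and its log, and the scripted sum becomes a plain field sum:
{"x": 33, "log_x": 3.4965}

"sum_log_x": {
  "sum": {
    "field": "log_x"
  }
}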
Hope that helps!

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time the file is downloaded by a client. Each document is quite basic: it contains a filename field and a when date field indicating the time of the download.
What I want to achieve is to get, for each file the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like that:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to after_key value in response), but it is not sorted by the number of downloads - it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
The composite aggregation doesn't allow sorting based on the value field.
Excerpt from the discussion on elastic forum:
it's designed as a memory-friendly way to paginate over aggregations. Part of the tradeoff is that you lose things like ordering by doc count, since that isn't known until after all the docs have been collected.
I have no experience with Transforms (part of X-Pack and licensed), but you can try them out. Apart from that, I don't see a way to get the expected output.
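For completeness, paging through the composite aggregation itself works by echoing each response's after_key back as after in the next request (a sketch based on the query above; the key value shown is illustrative, and the range query is omitted for brevity):
{
  "size": 0,
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ],
        "after": { "downloads": "last-filename-of-previous-page.pdf" }
      }
    }
  }
}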

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "rescore": [
    {
      "window_size": 10000,
      "query": {
        "rescore_query": {
          "function_score": {
            "boost_mode": "replace",
            "script_score": {
              "script": {
                "source": "doc['topic_score'].value"
              }
            }
          }
        },
        "query_weight": 0,
        "rescore_query_weight": 1
      }
    }
  ],
  "aggs": {
    "distinct": {
      "terms": {
        "field": "identical_id",
        "order": {
          "top_score": "desc"
        }
      },
      "aggs": {
        "best_unique_result": {
          "top_hits": {
            "size": 1
          }
        },
        "top_score": {
          "max": {
            "script": {
              "inline": "_score"
            }
          }
        }
      }
    }
  }
}
This is a simplified version; the real query has a more complex main query and a far more intensive rescore function.
Let me explain its purpose first, in case I'm about to spend 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this, I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it to?
From the documentation:
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty run to expiration in case someone comes up with a better approach.
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggs": {
        "distinct": {
          "terms": {
            "field": "identical_id",
            "order": {
              "top_score": "desc"
            }
          },
          "aggs": {
            "best_unique_result": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "_script": {
                      "type": "number",
                      "script": {
                        "source": "doc['topic_score'].value"
                      },
                      "order": "desc"
                    }
                  }
                ]
              }
            },
            "top_score": {
              "max": {
                "script": {
                  "source": "doc['topic_score'].value"
                }
              }
            }
          }
        }
      }
    }
  }
}
The sampler aggregation will take the top N hits per shard from the core query and run the aggregations over those. Then, in the max aggregator that defines the bucket order, I use the exact same script as the one I use to pick a top hit from the bucket. Now the buckets and the top hits run over the same top-N set of items, and the buckets are ordered by the max of the same score, generated from the same script.
Unfortunately I still need to run the script once to order the buckets and once to pick a top hit within the bucket. You could use the rescore instead for the top-hits ordering, but either way it has to run twice, and I found it was faster as a sort script than as a rescore.

Which is the most effective way to get all the results of an aggregation

I have the following query:
GET my-index-*/my-type/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "terms": {
        "script": "code"
      },
      "aggs": {
        "dates": {
          "date_range": {
            "field": "created_time",
            "ranges": [
              {
                "from": "2017-12-09T00:00:00.000",
                "to": "2017-12-09T16:00:00.000"
              },
              {
                "from": "2017-12-10T00:00:00.000",
                "to": "2017-12-10T16:00:00.000"
              }
            ]
          }
        },
        "total_count": {
          "sum_bucket": {
            "buckets_path": "dates._count"
          }
        },
        "bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalCount": "total_count"
            },
            "script": "params.totalCount == 0"
          }
        }
      }
    }
  }
}
The result of this query is a bunch of buckets. What I need is the list of keys of those buckets. The problem is that the aggregation result size is 10 by default; after getting those 10, my bucket_filter filters them by total count, and I get only some of those 10. I need all the results, which means I need to specify "size" = n, where n is the distinct count of code values, so that I don't lose any data. I have billions of documents, so in my case n is about 30,000. When I tried executing the query, an "Out of memory" error occurred on the cluster, so I guess it's not the best idea. Is there a good way to get all the results for my query?
Unfortunately this is not recommended for high-cardinality fields with 30K unique values. The reason is the memory cost and the large amount of data that needs to be collected from the shards, as you've discovered. It might work, but then you need more memory...
A more efficient solution is to use the Scroll API, specifying in your search request the fields whose values you want to retrieve, and then store those values client-side in memory or stream them.
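A minimal sketch of that approach, assuming the value you aggregate on is available in a stored field (here hypothetically called code):
POST my-index-*/my-type/_search?scroll=1m
{
  "size": 1000,
  "_source": ["code"],
  "query": {
    "match_all": {}
  }
}
Each response carries a _scroll_id; POST it to _search/scroll with {"scroll": "1m", "scroll_id": "..."} to fetch the next batch, and repeat until no hits remain.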
Update: since ES 6.5 this has been possible with Composite aggregations, see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
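A sketch of the composite variant, assuming code is (or can be indexed as) a keyword field rather than a script:
GET my-index-*/my-type/_search
{
  "size": 0,
  "aggs": {
    "my_agg": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "code": {
              "terms": {
                "field": "code"
              }
            }
          }
        ]
      }
    }
  }
}
Passing each response's after_key back as after pages through all ~30,000 buckets without holding them in memory at once.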

How to get the total value of a field in Kibana 4.x

I need to get the total value of a field in Kibana using a scripted field.
e.g. if the id field has values like 1, 2, 3, 4, 5
I am looking to sum up all the values of id; I expect the output to be 15.
I need to apply the formula below after getting the total of each field:
lifetime = a - b - c - (d - e - f) * g
where a, b, c, d, e, f, g are each the total of the corresponding field's values.
For more info, please refer to this question raised by me.
You could do something like this in your scripted fields:
doc['id'].value
Then you can use a sum aggregation on that scripted field to get the total value in Kibana.
This SO could be handy!
EDIT
If you're trying to do it using Elasticsearch, you could do something like this within your request body:
"aggs":{
"total":{
"sum":{
"script":"doc['id'].value"
}
}
}
You could follow up this ref, but if you're using Painless make sure you include it within lang, as in the sketch below. Related SO.
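For example, with the language made explicit (a minimal sketch; on 5.x versions the script body key is inline rather than source):
"aggs": {
  "total": {
    "sum": {
      "script": {
        "lang": "painless",
        "source": "doc['id'].value"
      }
    }
  }
}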
You can definitely use sum aggregations to get the sum of id, but to further evaluate your formula, take a look at pipeline aggregations, which let you use the sum values in further calculations.
In particular, take a look at the bucket_script aggregation: with proper bucket paths to the sum aggregators you can compute your equation.
For sample documents
{
  "a": 100,
  "b": 200,
  "c": 400,
  "d": 600
}
query
{
  "size": 0,
  "aggs": {
    "result": {
      "terms": {
        "script": "'nice to have it here'"
      },
      "aggs": {
        "suma": {
          "sum": {
            "field": "a"
          }
        },
        "sumb": {
          "sum": {
            "field": "b"
          }
        },
        "sumc": {
          "sum": {
            "field": "c"
          }
        },
        "equation": {
          "bucket_script": {
            "buckets_path": {
              "suma": "suma",
              "sumb": "sumb",
              "sumc": "sumc"
            },
            "script": "params.suma + params.sumb + 2 * params.sumc"
          }
        }
      }
    }
  }
}
Now you can surely add a term filter to each sum aggregation to restrict the summation for each sum aggregator, as sketched below.
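For instance, one of the sums wrapped in a filter could look like this (a sketch; the type field and its value are hypothetical). The bucket path for the equation then becomes suma>value:
"suma": {
  "filter": {
    "term": { "type": "a" }
  },
  "aggs": {
    "value": {
      "sum": { "field": "a" }
    }
  }
}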

ElasticSearch max score

I'm trying to solve a performance issue we have when querying ElasticSearch for several thousand results. The basic idea is that we do some post-query processing and only show the top X results (a query may have ~100,000 results while we only need the top 100 according to our score mechanics).
The basic mechanics are as follows:
The ElasticSearch score is normalized between 0..1 (score / max(score)); we add our ranking score (also normalized between 0..1) and divide by 2.
What I'd like to do is move this logic into ElasticSearch using custom scoring ( or well, anything that works ): https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-script-score
The problem I'm facing is that using score scripts / score functions I can't seem to find a way to do something like max(_score) to normalize the score between 0 and 1.
"script_score" : {
"script" : "(_score / max(_score) + doc['some_normalized_field'].value)/2"
}
Any ideas are welcome.
You cannot get max_score before you have actually generated the _score for all the matching documents. A script_score query will first generate the _score for all the matching documents, and only then will max_score be known to Elasticsearch.
From what I can understand of your problem, you want to preserve the max_score that was generated by the original query, before you applied script_score. You can get the required result if you do some computation at the front end: in short, apply your formula there and then sort the results.
You can save your factor inside your results using a script_fields query:
{
  "explain": true,
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "total_goals": {
      "script": {
        "lang": "painless",
        "source": """
          int total = 0;
          for (int i = 0; i < doc['goals'].length; ++i) {
            total += doc['goals'][i];
          }
          return total;
        """,
        "params": {
          "last": "any parameters required"
        }
      }
    }
  }
}
I am not sure that I understand your question. Do you want to limit the number of results?
Have you tried this?
{
  "from": 0, "size": 10,
  "query": {
    "term": { "name": "dennis" }
  }
}
You can use sort to define the sort order; by default results are sorted by the main query's score.
You can also use aggregations (with or without function_score):
{
  "query": {
    "function_score": {
      "functions": [
        {
          "gauss": {
            "date": {
              "scale": "3d",
              "offset": "7d",
              "decay": 0.1
            }
          }
        },
        {
          "gauss": {
            "priority": {
              "origin": "0",
              "scale": "100"
            }
          }
        }
      ],
      "query": {
        "match": { "body": "dennis" }
      }
    }
  },
  "aggs": {
    "hits": {
      "top_hits": {
        "size": 10
      }
    }
  }
}
Based on this GitHub ticket, it is simply impossible to normalize the score, and they suggest using boolean similarity as a workaround.
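For reference, boolean similarity is a mapping-level setting (a minimal sketch using the typeless 7.x mapping syntax; the field name is illustrative):
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "boolean"
      }
    }
  }
}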
