Perform a pipelines aggregation over the full set of potential buckets - elasticsearch

When using the _search API of Elasticsearch, if you set size to 10, and perform an avg metric aggregation, the average will be of all values across the dataset matching the query, not just the average of the 10 items returned in the hits array.
On the other hand, if you perform a terms aggregation and set the size of the terms aggregation to be 10, then performing an avg_buckets aggregation on those terms buckets will calculate an average over only those 10 buckets - not all potential buckets.
How can I calculate the an average of some field across all potential buckets, but still only have 10 items in the buckets array?
To make my question more concrete, consider this example: Suppose that I am a hat maker. Multiple stores carry my hats. I have an Elasticsearch index hat-sales which has one document for each time one of my hats is sold. Included in this document is price and that store at which the hat was sold.
Here are two examples of the documents I tested this on:
{
"type": "top",
"color": "black",
"price": 19,
"store": "Macy's"
}
{
"type": "fez",
"color": "red",
"price": 94,
"store": "Walmart"
}
If I want to find the average price of all the hats I have sold, I can run this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_hat_price": {
"avg": {
"field": "price"
}
}
}
}
And average_hat_price will be the same whether size is set to 0, 3, or whatever.
OK, now I want to find the top 3 stores which have sold the most number of hats. I also want to compare them with the average number of hats sold at a store. So I want to do something like this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"by_store": {
"terms": {
"field": "store.keyword",
"size": 3
},
"aggs": {
"sales_count": {
"cardinality": {
"field": "_id"
}
}
}
},
"avg sales at a store": {
"avg_bucket": {
"buckets_path": "by_store>sales_count"
}
}
}
}
which yields a response of
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 8,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
}
]
},
"avg sales at a store" : {
"value" : 4.666666666666667
}
}
The problem is that avg sales at a store is calculated over only Macy's, Walmart, and Dillard's. If I want to find the average over all store, I have to set aggs.by_store.terms.size to 65536. (65536 because that is the default maximum number of terms buckets and I do not know a priori how many buckets there may be.) This gives a result of:
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Target",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Harrod's",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Men's Warehouse",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Sears",
"doc_count" : 1,
"sales_count" : {
"value" : 1
}
}
]
},
"avg sales at a store" : {
"value" : 3.142857142857143
}
}
So the average number of hats sold per store is 3.1, not 4.6. But in the buckets array I want to see only the top 3 stores.

You can achieve what you are aiming at without a pipeline aggregation. It sort of cheats the aggregation framework, but, it works.
Here is the data setup:
PUT hat_sales
{
"mappings": {
"properties": {
"storename": {
"type": "keyword"
}
}
}
}
POST hat_sales/_bulk?refresh=true
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "bar"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
Here is the tricky query:
GET hat_sales/_search?size=0
{
"aggs": {
"stores": {
"terms": {
"field": "storename",
"size": 2
}
},
"average_sales_count": {
"avg_bucket": {
"buckets_path": "stores>_count"
}
},
"cheat": {
"filters": {
"filters": {
"all": {
"exists": {
"field": "storename"
}
}
}
},
"aggs": {
"count": {
"value_count": {
"field": "storename"
}
},
"unique_count": {
"cardinality": {
"field": "storename"
}
},
"total_average": {
"bucket_script": {
"buckets_path": {
"total": "count",
"unique": "unique_count"
},
"script": "params.total / params.unique"
}
}
}
}
}
}
This is a small abuse of the aggs framework. But, the idea is that you effectively want num_stores/num_docs. I restricted the num_docs to only docs that actually have the storefield name.
I got around some validations by using the filters agg which is technically a multi-bucket agg (though I only care about one bucket).
Then I get the unique count through cardinality (num stores) and the total count (value_count) and use a bucket_script to finish it off.
All in all, here is the slightly mangled result :D
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"cheat" : {
"buckets" : {
"all" : {
"doc_count" : 6,
"count" : {
"value" : 6
},
"unique_count" : {
"value" : 3
},
"total_average" : {
"value" : 2.0
}
}
}
},
"stores" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 1,
"buckets" : [
{
"key" : "baz",
"doc_count" : 3
},
{
"key" : "foo",
"doc_count" : 2
}
]
},
"average_sales_count" : {
"value" : 2.5
}
}
}
Note that cheat.buckets.all.total_average is 2.0 (the true average) while the old way (pipeline average) is the non-global average of 2.5

Related

Elasticsearch How to count total docs by date

As my theme, I wanna count docs the day and before by date, it's sample to understand that the chart.
{"index":{"_index":"login-2015.12.23","_type":"logs"}}
{"uid":"1","register_time":"2015-12-23T12:00:00Z","login_time":"2015-12-23T12:00:00Z"}
{"index":{"_index":"login-2015.12.23","_type":"logs"}}
{"uid":"2","register_time":"2015-12-23T12:00:00Z","login_time":"2015-12-23T12:00:00Z"}
{"index":{"_index":"login-2015.12.24","_type":"logs"}}
{"uid":"1","register_time":"2015-12-23T12:00:00Z","login_time":"2015-12-24T12:00:00Z"}
{"index":{"_index":"login-2015.12.25","_type":"logs"}}
{"uid":"1","register_time":"2015-12-23T12:00:00Z","login_time":"2015-12-25T12:00:00Z"}
As you see, index login-2015.12.23 has two docs, index login-2015.12.24 has one doc, index login-2015.12.23 has one doc.
And now I wanna get the result
{
"hits" : {
"total" : 6282,
"max_score" : 1.0,
"hits" : []
},
"aggregations" : {
"group_by_date" : {
"buckets" : [
{
"key_as_string" : "2015-12-23T12:00:00Z",
"key" : 1662163200000,
"doc_count" : 2,
},
{
"key_as_string" : "2015-12-24T12:00:00Z",
"key" : 1662163200000,
"doc_count" : 3,
},
{
"key_as_string" : "2015-12-25T12:00:00Z",
"key" : 1662163200000,
"doc_count" : 4,
}
]
}
If I count the date 2015-12-24T12:00:00Z and it means I must count day 2015-12-23T12:00:00Z and 2015-12-24T12:00:00Z at the same time.
In my project I have many indices like that, and I searching many ways to make this goal come true but not, this is my demo:
{
"query": {"match_all": {}},
"size": 0,
"aggs": {
"group_by_date": {
"date_histogram": {
"field": "timestamp",
"interval": "day"
},
"aggs": {
"intersect": {
"scripted_metric": {
"init_script": "state.inner=[]",
"map_script": "state.inner.add(params.param1 == 3 ? params.param2 * params.param1 : params.param1 * params.param2)",
"combine_script": "return state.inner",
"reduce_script": "return states",
"params": {
"param1": 3,
"param2": 5
}
}
}
}
}
}
}
I wanna group by date, and use scripted_metric to iter the date list, not the second iteration just can in its bucket and not for all the document, so do anyone has better idea to solve this problem?
You can simply use the cumulative sum pipeline aggregation
{
"query": {"match_all": {}},
"size": 0,
"aggs": {
"group_by_date": {
"date_histogram": {
"field": "login_time",
"interval": "day"
},
"aggs": {
"cumulative_docs": {
"cumulative_sum": {
"buckets_path": "_count"
}
}
}
}
}
}
And the results will look like this:
"aggregations" : {
"group_by_date" : {
"buckets" : [
{
"key_as_string" : "2015-12-23T00:00:00.000Z",
"key" : 1450828800000,
"doc_count" : 2,
"cumulative_docs" : {
"value" : 2.0
}
},
{
"key_as_string" : "2015-12-24T00:00:00.000Z",
"key" : 1450915200000,
"doc_count" : 1,
"cumulative_docs" : {
"value" : 3.0
}
},
{
"key_as_string" : "2015-12-25T00:00:00.000Z",
"key" : 1451001600000,
"doc_count" : 1,
"cumulative_docs" : {
"value" : 4.0
}
}
]
}
}

Can Elastic Search do aggregations for within a document?

I have a mapping like this:
mappings: {
"seller": {
"properties" : {
"overallRating": {"type" : byte}
"items": [
{
itemName: {"type": string},
itemRating: {"type" : byte}
}
]
}
}
}
Each item will only have one itemRating. Each seller will only have one overall rating. There can be many items, and at most I'm expecting maybe 50 items with itemRatings. Not all items have to have an itemRating.
I'm trying to get an average rating for each seller that combines all itemRatings and the overallRating. I have looked into aggregations but all I have seen are aggregations for across all documents. The aggregation I'm looking to do is within the document itself, and I am not sure if that is possible. Any tips would be appreciated.
Yes this is very much possible with Elasticeasrch. To produce a combined rating, you simply need to subaggregate by the document id. The only thing present in the bucket would be the individual document . That is what you want.
Here is an example:
Create the index:
PUT /ratings
{
"mappings": {
"properties": {
"overallRating": {"type" : "float"},
"items": {
"type" : "nested",
"properties": {
"itemName" : {"type" : "keyword"},
"itemRating" : {"type" : "float"},
"overallRating": {"type" : "float"}
}
}
}
}
}
Add some data:
POST ratings/_doc/
{
"overallRating" : 1,
"items" : [
{
"itemName" : "labrador",
"itemRating" : 10,
"overallRating" : 1
},
{
"itemName" : "saint bernard",
"itemRating" : 20,
"overallRating" : 1
}
]
}
{
"overallRating" : 1,
"items" : [
{
"itemName" : "cat",
"itemRating" : 5,
"overallRating" : 1
},
{
"itemName" : "rat",
"itemRating" : 10,
"overallRating" : 1
}
]
}
Query the index for a combined rating and sort by the rating:
GET ratings/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_rating": {
"composite": {
"sources": [
{
"ids": {
"terms": {
"field": "_id"
}
}
}
]
},
"aggs": {
"average_rating": {
"nested": {
"path": "items"
},
"aggs": {
"avg": {
"avg": {
"field": "items.compound"
}
}
}
}
}
}
},
"runtime_mappings": {
"items.compound": {
"type": "double",
"script": {
"source": "emit(doc['items.overallRating'].value + doc['items.itemRating'].value)"
}
}
}
}
The result (Pls note that i changed the exact values of ratings between writing the answer and running it in the console, so the averages are a bit different)
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_rating" : {
"after_key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"buckets" : [
{
"key" : {
"ids" : "3_Up44EBbR3hrRYkLsrC"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 151.0
}
}
},
{
"key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 8.5
}
}
}
]
}
}
}
One change for convenience:
I edited your mappings to add the overAllRating to each Item entry. This simplifies the calculations that come subsequently, simply because you only look in the nested scope and never have to step out.
I also had to use a "runtime mapping" to combine the value of each overAllRating and ItemRating, to produce a better average. I basically made a sum of every ItemRating with the OverAllRating and averaged those across every entry.
I had to use a top level composite "id" aggregation so that we only get results per document (which is what you want).
There is some pretty heavy lifting happening here, but it is very possible and easy to edit this as you require.
HTH.

What is the best way to get the average number of documents per bucket in Elasticsearch?

Suppose that we are hat maker and have an Elasticsearch index where each document corresponds to the sale of one hat. Part of the sales record is the name of the store at which the hat was sold. I want to find out the number of hats sold by each store, and the average number of hats sold over all stores. The best way I have been able to figure out is this search:
GET hat_sales/_search
{
"size": 0,
"query": {"match_all": {}},
"aggs": {
"stores": {
"terms": {
"field": "storename",
"size": 65536
},
"aggs": {
"sales_count": {
"cardinality": {
"field": "_id"
}
}
}
},
"average_sales_count": {
"avg_bucket": {
"buckets_path": "stores>sales_count"
}
}
}
}
(Aside: I set the size to 65536 because that is the default maximum number of buckets.)
The problem with this query is that the sales_count aggregation performs a redundant calculation: each stores bucket already has a doc_count property. But how can I access this doc_count in a buckets path?
I think this is what you are after
PUT hat_sales
{
"mappings": {
"properties": {
"storename": {
"type": "keyword"
}
}
}
}
POST hat_sales/_bulk?refresh=true
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "bar"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
GET hat_sales/_search
{
"size": 0,
"query": {"match_all": {}},
"aggs": {
"stores": {
"terms": {
"field": "storename",
"size": 65536
}
},
"average_sales_count": {
"avg_bucket": {
"buckets_path": "stores>_count"
}
}
}
}
The path to get to the _count is stores>_count
The results look like:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"stores" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "baz",
"doc_count" : 3
},
{
"key" : "foo",
"doc_count" : 2
},
{
"key" : "bar",
"doc_count" : 1
}
]
},
"average_sales_count" : {
"value" : 2.0
}
}
}

How to use composite aggregation with a single bucket

The following composite aggregation query
{
"query": {
"range": {
"orderedAt": {
"gte": 1591315200000,
"lte": 1591438881000
}
}
},
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{
"aggregation_target": {
"terms": {
"field": "supplierId"
}
}
}
]
},
"aggs": {
"aggregated_hits": {
"top_hits": {}
},
"filter": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": "params.doc_count > 2"
}
}
}
}
}
}
returns something like below.
{
"took" : 67,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 34,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_buckets" : {
"after_key" : {
"aggregation_target" : "0HQI2G2HG00100G8"
},
"buckets" : [
{
"key" : {
"aggregation_target" : "0HQI2G0K000100G8"
},
"doc_count" : 4,
"aggregated_hits" : {...}
},
{
"key" : {
"aggregation_target" : "0HQI2G18G00100G8"
},
"doc_count" : 11,
"aggregated_hits" : {...}
},
{
"key" : {
"aggregation_target" : "0HQI2G2HG00100G8"
},
"doc_count" : 16,
"aggregated_hits" : {...}
}
]
}
}
}
The aggregated results are put into buckets based on the condition set in the query.
Is there any way to put them in a single bucket and paginate thought the whole result(i.e. 31 documents in this case)?
I don't think you can. A doc's context doesn't include information about other docs unless you perform a cardinality, scripted_metric or terms aggregation. Also, once you bucket your docs based on the supplierId, it'd sort of defeat the purpose of aggregating in the first place...
What you wrote above is as good as it gets and you'll have to combine the aggregated_hits within some post processing step.

Elasticsearch, sort aggs according to sibling fields but from different index

Elasticsearch v7.5
Hello and good day!
We have 2 indices named socialmedia and influencers
Sample contents:
socialmedia:
{
'_id' : 1001,
'title' : "Title 1",
'smp_id' : 1,
},
{
'_id' : 1002,
'title' : "Title 2",
'smp_id' : 2,
},
{
'_id' : 1003,
'title' : "Title 3",
'smp_id' : 3,
}
//omitted other documents
influencers
{
'_id' : 1,
'name' : "John",
'smp_id' : 1,
'smp_score' : 5
},
{
'_id' : 2,
'name' : "Peter",
'smp_id' : 2,
'smp_score' : 10
},
{
'_id' : 3,
'name' : "Mark",
'smp_id' : 3,
'smp_score' : 15
}
//omitted other documents
Now I have this simple query that determines which influencer has the most document in the socialmedia index
GET socialmedia/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"INFLUENCERS": {
"terms": {
"field": "smp_id.keyword"
//smp_id is a **text** based field, that's why we have `.keyword` here
}
}
}
}
SAMPLE OUTPUT:
"aggregations" : {
"INFLUENCERS" : {
"doc_count_error_upper_bound" : //omitted,
"sum_other_doc_count" : //omitted,
"buckets" : [
{
"key" : "1",
"doc_count" : 87258
},
{
"key" : "2",
"doc_count" : 36518
},
{
"key" : "3",
"doc_count" : 34838
},
]
}
}
OBJECTIVE:
My query is able to sort the influencers according to doc_count of their posts in the socialmedia index, now, is there a way for us to sort the INFLUENCERS aggregation or make a way to sort out the influencers according to their SMP_SCORE?
With that idea, smp_id 3 which is Mark, should be the first one to appear since he has an smp_score of 15
Thank you in advance for your help!
What you are looking for is a JOIN operation. Note that Elasticsearch doesn't support JOIN operations unless they are modelled in a way as mentioned in this link.
Instead, a very simplistic approach is to denormalize your data and add the smp_score to your socialmedia index as below:
Mapping:
PUT socialmedia
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"smp_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"smp_score": {
"type": "float"
}
}
}
}
Your ES query would then have two Terms Aggregation as shown below:
Request Query:
POST socialmedia/_search
{
"size": 0,
"aggs": {
"influencers_score_agg": {
"terms": {
"field": "smp_score",
"order": { "_key": "desc" }
},
"aggs": {
"influencers_id_agg": {
"terms": {
"field": "smp_id.keyword"
}
}
}
}
}
}
Basically we are first aggregating on the smp_score and then introducing a sub-aggregation to display the smp_id.
Response:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_influencers_score" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 15.0,
"doc_count" : 1,
"influencers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3",
"doc_count" : 1
}
]
}
},
{
"key" : 10.0,
"doc_count" : 1,
"influencers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "2",
"doc_count" : 1
}
]
}
},
{
"key" : 5.0,
"doc_count" : 1,
"influencers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1
}
]
}
}
]
}
}
}
Do spend sometime in reading the above link, however that would require you to model your index in a different way depending on the options mentioned in it. From what I understand, the solution I've provided would suffice.

Resources