Can Elasticsearch do aggregations within a document? - elasticsearch

I have a mapping like this:
mappings: {
"seller": {
"properties": {
"overallRating": {"type": "byte"},
"items": [
{
"itemName": {"type": "string"},
"itemRating": {"type": "byte"}
}
]
}
}
}
Each item will only have one itemRating. Each seller will only have one overall rating. There can be many items, and at most I'm expecting maybe 50 items with itemRatings. Not all items have to have an itemRating.
I'm trying to get an average rating for each seller that combines all itemRatings and the overallRating. I have looked into aggregations, but all the examples I have seen aggregate across all documents. The aggregation I'm looking for is within the document itself, and I am not sure if that is possible. Any tips would be appreciated.

Yes, this is very much possible with Elasticsearch. To produce a combined rating, you bucket by the document ID and run the rating aggregation as a sub-aggregation. Each bucket then contains exactly one document, which is what you want.
Here is an example:
Create the index:
PUT /ratings
{
"mappings": {
"properties": {
"overallRating": {"type" : "float"},
"items": {
"type" : "nested",
"properties": {
"itemName" : {"type" : "keyword"},
"itemRating" : {"type" : "float"},
"overallRating": {"type" : "float"}
}
}
}
}
}
Add some data:
POST ratings/_doc/
{
"overallRating" : 1,
"items" : [
{
"itemName" : "labrador",
"itemRating" : 10,
"overallRating" : 1
},
{
"itemName" : "saint bernard",
"itemRating" : 20,
"overallRating" : 1
}
]
}
POST ratings/_doc/
{
"overallRating" : 1,
"items" : [
{
"itemName" : "cat",
"itemRating" : 5,
"overallRating" : 1
},
{
"itemName" : "rat",
"itemRating" : 10,
"overallRating" : 1
}
]
}
Query the index for a combined rating per document (the composite buckets come back ordered by document ID, not by rating):
GET ratings/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_rating": {
"composite": {
"sources": [
{
"ids": {
"terms": {
"field": "_id"
}
}
}
]
},
"aggs": {
"average_rating": {
"nested": {
"path": "items"
},
"aggs": {
"avg": {
"avg": {
"field": "items.compound"
}
}
}
}
}
}
},
"runtime_mappings": {
"items.compound": {
"type": "double",
"script": {
"source": "emit(doc['items.overallRating'].value + doc['items.itemRating'].value)"
}
}
}
}
The result (please note that I changed the exact rating values between writing the answer and running it in the console, so the averages are a bit different):
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_rating" : {
"after_key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"buckets" : [
{
"key" : {
"ids" : "3_Up44EBbR3hrRYkLsrC"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 151.0
}
}
},
{
"key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 8.5
}
}
}
]
}
}
}
One change for convenience:
I edited your mappings to add overallRating to each entry in items. This simplifies the subsequent calculations, because you only look in the nested scope and never have to step out of it.
I also had to use a runtime mapping to combine overallRating and itemRating for each item. The runtime field is the sum of the item's itemRating and the overallRating, and the avg aggregation averages that sum across every item in the document.
I had to use a top-level composite aggregation on _id so that we get one result per document (which is what you want).
There is some pretty heavy lifting happening here, but it is very possible and easy to edit this as you require.
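To make the arithmetic concrete, here is a minimal Python sketch (plain Python, not an Elasticsearch API) of what the query computes for the sample documents above. The function `combined_mean` is a hypothetical alternative that is not part of the answer: it treats overallRating as one more rating next to the item ratings, which may be closer to what the question describes.

```python
from statistics import mean

# Sample documents copied from the "Add some data" step above.
docs = [
    {"overallRating": 1, "items": [
        {"itemName": "labrador", "itemRating": 10, "overallRating": 1},
        {"itemName": "saint bernard", "itemRating": 20, "overallRating": 1},
    ]},
    {"overallRating": 1, "items": [
        {"itemName": "cat", "itemRating": 5, "overallRating": 1},
        {"itemName": "rat", "itemRating": 10, "overallRating": 1},
    ]},
]

def compound_average(doc):
    """Mirror the query: average of (overallRating + itemRating) over the nested items."""
    return mean(item["overallRating"] + item["itemRating"] for item in doc["items"])

def combined_mean(doc):
    """Hypothetical alternative: mean over all item ratings plus the single overall rating."""
    ratings = [item["itemRating"] for item in doc["items"] if "itemRating" in item]
    ratings.append(doc["overallRating"])
    return mean(ratings)

for doc in docs:
    print(compound_average(doc), combined_mean(doc))
```

For the second sample document, `compound_average` gives 8.5, the same value as the second bucket in the response above.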
HTH.

Related

Perform a pipeline aggregation over the full set of potential buckets

When using the _search API of Elasticsearch, if you set size to 10, and perform an avg metric aggregation, the average will be of all values across the dataset matching the query, not just the average of the 10 items returned in the hits array.
On the other hand, if you perform a terms aggregation and set the size of the terms aggregation to 10, then performing an avg_bucket aggregation on those terms buckets will calculate an average over only those 10 buckets, not over all potential buckets.
How can I calculate an average of some field across all potential buckets, but still only have 10 items in the buckets array?
To make my question more concrete, consider this example: Suppose that I am a hat maker. Multiple stores carry my hats. I have an Elasticsearch index hat-sales which has one document for each time one of my hats is sold. Each document includes the price and the store at which the hat was sold.
Here are two examples of the documents I tested this on:
{
"type": "top",
"color": "black",
"price": 19,
"store": "Macy's"
}
{
"type": "fez",
"color": "red",
"price": 94,
"store": "Walmart"
}
If I want to find the average price of all the hats I have sold, I can run this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_hat_price": {
"avg": {
"field": "price"
}
}
}
}
And average_hat_price will be the same whether size is set to 0, 3, or whatever.
OK, now I want to find the top 3 stores that have sold the most hats. I also want to compare them with the average number of hats sold at a store. So I want to do something like this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"by_store": {
"terms": {
"field": "store.keyword",
"size": 3
},
"aggs": {
"sales_count": {
"cardinality": {
"field": "_id"
}
}
}
},
"avg sales at a store": {
"avg_bucket": {
"buckets_path": "by_store>sales_count"
}
}
}
}
which yields a response of
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 8,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
}
]
},
"avg sales at a store" : {
"value" : 4.666666666666667
}
}
The problem is that avg sales at a store is calculated over only Macy's, Walmart, and Dillard's. If I want to find the average over all stores, I have to set aggs.by_store.terms.size to 65536. (65536 because that is the default maximum number of terms buckets and I do not know a priori how many buckets there may be.) This gives a result of:
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Target",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Harrod's",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Men's Warehouse",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Sears",
"doc_count" : 1,
"sales_count" : {
"value" : 1
}
}
]
},
"avg sales at a store" : {
"value" : 3.142857142857143
}
}
So the average number of hats sold per store is 3.1, not 4.6. But in the buckets array I want to see only the top 3 stores.
You can achieve what you are aiming at without a pipeline aggregation. It sort of cheats the aggregation framework, but, it works.
Here is the data setup:
PUT hat_sales
{
"mappings": {
"properties": {
"storename": {
"type": "keyword"
}
}
}
}
POST hat_sales/_bulk?refresh=true
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "bar"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
Here is the tricky query:
GET hat_sales/_search?size=0
{
"aggs": {
"stores": {
"terms": {
"field": "storename",
"size": 2
}
},
"average_sales_count": {
"avg_bucket": {
"buckets_path": "stores>_count"
}
},
"cheat": {
"filters": {
"filters": {
"all": {
"exists": {
"field": "storename"
}
}
}
},
"aggs": {
"count": {
"value_count": {
"field": "storename"
}
},
"unique_count": {
"cardinality": {
"field": "storename"
}
},
"total_average": {
"bucket_script": {
"buckets_path": {
"total": "count",
"unique": "unique_count"
},
"script": "params.total / params.unique"
}
}
}
}
}
}
This is a small abuse of the aggs framework. But the idea is that you effectively want num_docs / num_stores. I restricted num_docs to only the docs that actually have the storename field.
I got around some validations by using the filters agg which is technically a multi-bucket agg (though I only care about one bucket).
Then I get the unique count through cardinality (num stores) and the total count (value_count) and use a bucket_script to finish it off.
All in all, here is the slightly mangled result :D
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"cheat" : {
"buckets" : {
"all" : {
"doc_count" : 6,
"count" : {
"value" : 6
},
"unique_count" : {
"value" : 3
},
"total_average" : {
"value" : 2.0
}
}
}
},
"stores" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 1,
"buckets" : [
{
"key" : "baz",
"doc_count" : 3
},
{
"key" : "foo",
"doc_count" : 2
}
]
},
"average_sales_count" : {
"value" : 2.5
}
}
}
Note that cheat.buckets.all.total_average is 2.0 (the true average) while the old way (pipeline average) is the non-global average of 2.5
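The difference between the two numbers can be checked outside Elasticsearch. The following Python sketch (the function names are mine, not Elasticsearch APIs) reproduces both averages for the six documents indexed above: the one avg_bucket computes over the returned buckets, and the one obtained by dividing the document count by the number of distinct stores.

```python
from collections import Counter

def top_n_average(store_names, n):
    """Average doc count over only the n largest buckets (what avg_bucket sees)."""
    top = Counter(store_names).most_common(n)
    return sum(count for _, count in top) / len(top)

def global_average(store_names):
    """Total docs divided by distinct stores (value_count / cardinality)."""
    return len(store_names) / len(set(store_names))

# The six documents from the _bulk request above.
sales = ["foo", "foo", "bar", "baz", "baz", "baz"]

print(top_n_average(sales, 2))  # average over the top 2 buckets only
print(global_average(sales))    # average over all buckets
```

One caveat on the design: the cardinality aggregation is approximate for fields with many distinct values, so on a large index the Elasticsearch result may deviate slightly from the exact division shown here.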

How to use composite aggregation with a single bucket

The following composite aggregation query
{
"query": {
"range": {
"orderedAt": {
"gte": 1591315200000,
"lte": 1591438881000
}
}
},
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{
"aggregation_target": {
"terms": {
"field": "supplierId"
}
}
}
]
},
"aggs": {
"aggregated_hits": {
"top_hits": {}
},
"filter": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": "params.doc_count > 2"
}
}
}
}
}
}
returns something like below.
{
"took" : 67,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 34,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_buckets" : {
"after_key" : {
"aggregation_target" : "0HQI2G2HG00100G8"
},
"buckets" : [
{
"key" : {
"aggregation_target" : "0HQI2G0K000100G8"
},
"doc_count" : 4,
"aggregated_hits" : {...}
},
{
"key" : {
"aggregation_target" : "0HQI2G18G00100G8"
},
"doc_count" : 11,
"aggregated_hits" : {...}
},
{
"key" : {
"aggregation_target" : "0HQI2G2HG00100G8"
},
"doc_count" : 16,
"aggregated_hits" : {...}
}
]
}
}
}
The aggregated results are put into buckets based on the condition set in the query.
Is there any way to put them in a single bucket and paginate through the whole result (i.e. 31 documents in this case)?
I don't think you can. A doc's context doesn't include information about other docs unless you perform a cardinality, scripted_metric or terms aggregation. Also, once you bucket your docs based on the supplierId, it'd sort of defeat the purpose of aggregating in the first place...
What you wrote above is as good as it gets and you'll have to combine the aggregated_hits within some post processing step.
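The post-processing step could look roughly like the Python sketch below. It assumes the search response has already been parsed into a dict; the sample response is made up (the real hits are elided in the output above), and the helper names `flatten_aggregated_hits` and `page` are hypothetical.

```python
def flatten_aggregated_hits(response, agg_name="my_buckets", hits_name="aggregated_hits"):
    """Collect the top_hits of every composite bucket into one flat list."""
    buckets = response["aggregations"][agg_name]["buckets"]
    flat = []
    for bucket in buckets:
        flat.extend(bucket[hits_name]["hits"]["hits"])
    return flat

def page(items, page_number, page_size):
    """Return one 1-based page of the flattened list."""
    start = (page_number - 1) * page_size
    return items[start:start + page_size]

# Made-up response fragment with the same shape as the output above.
response = {"aggregations": {"my_buckets": {"buckets": [
    {"key": {"aggregation_target": "A"}, "doc_count": 3,
     "aggregated_hits": {"hits": {"hits": [{"_id": "1"}, {"_id": "2"}, {"_id": "3"}]}}},
    {"key": {"aggregation_target": "B"}, "doc_count": 4,
     "aggregated_hits": {"hits": {"hits": [{"_id": "4"}, {"_id": "5"}, {"_id": "6"}]}}},
]}}}

all_hits = flatten_aggregated_hits(response)
print([hit["_id"] for hit in page(all_hits, 2, 4)])
```

Note that top_hits returns only three hits per bucket by default, so its size parameter has to be raised if every document in a bucket should end up in the combined list.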

Elasticsearch aggregation on different search in same query

I want to make a query whose aggregations are based only on the match clause, no matter what other parameters (terms, term, etc.) are used.
To be more specific, I have an online shop where I use multiple filters (color, size, etc.). If I check a field, for example color: red, the other colors are no longer aggregated.
A solution that I am using is to make two separate queries (one for the search, where the filters are applied, and one for the aggregations). Any idea how I can combine the two separate queries?
You can take advantage of post_filter which will not apply to your aggregations but will only filter the to-be-returned hits. For example:
Create a shop
PUT online_shop
{
"mappings": {
"properties": {
"color": {
"type": "keyword"
},
"size": {
"type": "integer"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Populate it w/ a few products
POST online_shop/_doc
{"color":"red","size":35,"name":"Louboutin High heels abc"}
POST online_shop/_doc
{"color":"black","size":34,"name":"Louboutin Boots abc"}
POST online_shop/_doc
{"color":"yellow","size":36,"name":"XYZ abc"}
Apply a shared query to the hits as well as aggregations and use post_filter to ... post-filter the hits:
GET online_shop/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
}
]
}
},
"aggs": {
"by_color": {
"terms": {
"field": "color"
}
},
"by_size": {
"terms": {
"field": "size"
}
}
},
"post_filter": {
"bool": {
"must": [
{
"term": {
"color": {
"value": "red"
}
}
}
]
}
}
}
Expected result
{
...
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.11750763,
"hits" : [
{
"_index" : "online_shop",
"_type" : "_doc",
"_id" : "cehma3IBG_KW3EFn1QYa",
"_score" : 0.11750763,
"_source" : {
"color" : "red",
"size" : 35,
"name" : "Louboutin High heels abc"
}
}
]
},
"aggregations" : {
"by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "black",
"doc_count" : 1
},
{
"key" : "red",
"doc_count" : 1
},
{
"key" : "yellow",
"doc_count" : 1
}
]
},
"by_size" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 34,
"doc_count" : 1
},
{
"key" : 35,
"doc_count" : 1
},
{
"key" : 36,
"doc_count" : 1
}
]
}
}
}
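The order of operations is the important part: the aggregations see everything the query matches, and post_filter trims only the hits afterwards. The toy Python model below illustrates that order with the three products from above. It is a sketch of the semantics only, not of how Elasticsearch works internally, and the `search` function is hypothetical.

```python
from collections import Counter

# The three products indexed above.
products = [
    {"color": "red", "size": 35, "name": "Louboutin High heels abc"},
    {"color": "black", "size": 34, "name": "Louboutin Boots abc"},
    {"color": "yellow", "size": 36, "name": "XYZ abc"},
]

def search(docs, query, post_filter):
    """Toy model: aggregations use the query result; post_filter trims only the hits."""
    matched = [d for d in docs if query(d)]
    aggs = {
        "by_color": Counter(d["color"] for d in matched),
        "by_size": Counter(d["size"] for d in matched),
    }
    hits = [d for d in matched if post_filter(d)]
    return hits, aggs

hits, aggs = search(
    products,
    query=lambda d: "abc" in d["name"].lower().split(),  # stands in for the match query
    post_filter=lambda d: d["color"] == "red",           # stands in for the term filter
)
print(len(hits), dict(aggs["by_color"]))
```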

ElasticSearch: How to make an aggregation pipeline?

Imagine the following use case:
We work at Stark Airlines and our marketing team wants to segment our passengers in order to give them discounts or gift cards. They decide that they want two sets of passengers:
Passengers that fly at least 3 times per week
Passengers who have flown at least once but who have not flown for two weeks
With this they can make different marketing campaigns for our passengers!
So, in Elasticsearch we have a trip index in which each document represents a ticket bought by a passenger:
{
"_index" : "trip",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"total_amount" : 300,
"trip_date" : "2020/03/24 13:30:00",
"status" : "completed",
"passenger" : {
"id" : 11,
"name" : "Thiago nunes"
}
}
}
The trip index contains a status field that may have other values, such as pending, open, or canceled.
This means that we can only take into account trips that have the completed status (meaning the passenger did travel).
So, with all this in mind...How would I get those two sets of passengers with elastic search?
I have been trying for a while but with no success.
What I have done until now:
I have built a query that gets all valid trips (trips with status completed):
GET /trip/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"status": {
"value": "completed"
}
}
}
]
}
},
"aggs": {
"status_viagem": {
"terms": {
"field": "status.keyword"
}
}
}
}
This query returns the following:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 200,
"relation" : "eq"
},
"max_score" : 0.18232156,
"hits" : [...]
},
"aggregations" : {
"status_viagem" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "completed",
"doc_count" : 200
}
]
}
}
}
But I am stuck and can't figure out the next step. I know that the next thing to do should be to create buckets of passengers and then filter them into two buckets representing our desired data sets. But I don't know how.
Can someone help?
PS.:
I don't exactly need this to be one single query, just a hint about how to build a query like this would be very helpful
THE OUTPUT SHOULD BE AN ARRAY of passenger IDs
Note: I have shortened the trip index for the sake of simplicity
As I understand your issue:
I have used a date_histogram with a weekly interval to group the trips by week, with a terms sub-aggregation on the passenger. Only passengers that have at least three documents in a week are kept (min_doc_count: 3), which gives you all passengers who have traveled at least three times in a week.
In a second aggregation I have used a terms aggregation to get each passenger and their last travel date. A bucket_selector then keeps only the passengers whose last travel is before a given cutoff date.
Mapping
{
"index87" : {
"mappings" : {
"properties" : {
"passengerid" : {
"type" : "long"
},
"passengername" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"status" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"total_amount" : {
"type" : "long"
},
"trip_date" : {
"type" : "date"
}
}
}
}
}
Query
{
"query": {
"bool": {
"must": [
{
"term": {
"status": {
"value": "completed"
}
}
}
]
}
},
"aggs": {
"travel_thrice_week": {
"date_histogram": {
"field": "trip_date",
"calendar_interval": "week"
},
"aggs": {
"passenger": {
"terms": {
"field": "passengername.keyword",
"min_doc_count": 3,
"size": 10
}
},
// Keep only the weeks in which at least one passenger traveled three or more times
"select_bucket_with_user": {
"bucket_selector": {
"buckets_path": {
"passenger": "passenger._bucket_count"
},
"script": "if(params['passenger']>=1) {return true;} else{ return false;} "
}
}
}
},
"not_flown_last_two_week": {
"terms": {
"field": "passengername.keyword",
"size": 10
},
"aggs": {
"last_travel": {
"max": {
// most recent travel date
"field": "trip_date"
}
},
"last_travel_before_two_week": {
"bucket_selector": {
"buckets_path": {
"traveldate": "last_travel"
},
"script":{
"source": "if(params['traveldate']< params['date_epoch']) return true; else return false;",
"params": {
// Unix epoch (in milliseconds) of the cutoff date
"date_epoch": 1586408336000
}
}
}
}
}
}
}
}
Result:
"aggregations" : {
"not_flown_last_two_week" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Thiago nunes",
"doc_count" : 3,
"last_travel" : {
"value" : 1.5851808E12,
"value_as_string" : "2020-03-26T00:00:00.000Z"
}
},
{
"key" : "john doe",
"doc_count" : 1,
"last_travel" : {
"value" : 1.5799968E12,
"value_as_string" : "2020-01-26T00:00:00.000Z"
}
}
]
},
"travel_thrice_week" : {
"buckets" : [
{
"key_as_string" : "2020-03-23T00:00:00.000Z",
"key" : 1584921600000,
"doc_count" : 3,
"passenger" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Thiago nunes",
"doc_count" : 3
}
]
}
}
]
}
}
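The selection logic behind the two aggregations can be written down in a few lines of Python, which may help when checking the buckets returned above. This is only a sketch of the rules on made-up trips. The flat field names (`passenger_id`, `status`, `trip_date`) are simplified assumptions, not the mapping from the question, and `now` is chosen so that the two-week cutoff falls on 2020-04-09, close to the cutoff used in the query.

```python
from collections import Counter
from datetime import datetime, timedelta

def frequent_flyers(trips, min_trips=3):
    """Passenger ids with at least min_trips completed trips in some ISO week."""
    per_week = Counter()
    for t in trips:
        if t["status"] == "completed":
            iso = t["trip_date"].isocalendar()
            per_week[(t["passenger_id"], iso[0], iso[1])] += 1
    return sorted({pid for (pid, _, _), n in per_week.items() if n >= min_trips})

def lapsed_flyers(trips, now, idle=timedelta(weeks=2)):
    """Passenger ids whose most recent completed trip is older than the idle window."""
    last = {}
    for t in trips:
        if t["status"] == "completed":
            pid = t["passenger_id"]
            last[pid] = max(last.get(pid, t["trip_date"]), t["trip_date"])
    return sorted(pid for pid, when in last.items() if when < now - idle)

# Made-up trips for illustration.
trips = [
    {"passenger_id": 11, "status": "completed", "trip_date": datetime(2020, 3, 23)},
    {"passenger_id": 11, "status": "completed", "trip_date": datetime(2020, 3, 24)},
    {"passenger_id": 11, "status": "completed", "trip_date": datetime(2020, 3, 26)},
    {"passenger_id": 12, "status": "completed", "trip_date": datetime(2020, 1, 26)},
    {"passenger_id": 13, "status": "canceled", "trip_date": datetime(2020, 3, 25)},
]
now = datetime(2020, 4, 23)

print(frequent_flyers(trips))
print(lapsed_flyers(trips, now))
```

Passenger 13 appears in neither list because only completed trips count, mirroring the term filter on status in the query.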

Elasticsearch, sort aggs according to sibling fields but from different index

Elasticsearch v7.5
Hello and good day!
We have 2 indices named socialmedia and influencers
Sample contents:
socialmedia:
{
'_id' : 1001,
'title' : "Title 1",
'smp_id' : 1,
},
{
'_id' : 1002,
'title' : "Title 2",
'smp_id' : 2,
},
{
'_id' : 1003,
'title' : "Title 3",
'smp_id' : 3,
}
//omitted other documents
influencers
{
'_id' : 1,
'name' : "John",
'smp_id' : 1,
'smp_score' : 5
},
{
'_id' : 2,
'name' : "Peter",
'smp_id' : 2,
'smp_score' : 10
},
{
'_id' : 3,
'name' : "Mark",
'smp_id' : 3,
'smp_score' : 15
}
//omitted other documents
Now I have this simple query that determines which influencer has the most documents in the socialmedia index:
GET socialmedia/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"INFLUENCERS": {
"terms": {
"field": "smp_id.keyword"
//smp_id is a **text** based field, that's why we have `.keyword` here
}
}
}
}
SAMPLE OUTPUT:
"aggregations" : {
"INFLUENCERS" : {
"doc_count_error_upper_bound" : //omitted,
"sum_other_doc_count" : //omitted,
"buckets" : [
{
"key" : "1",
"doc_count" : 87258
},
{
"key" : "2",
"doc_count" : 36518
},
{
"key" : "3",
"doc_count" : 34838
},
]
}
}
OBJECTIVE:
My query sorts the influencers by the doc_count of their posts in the socialmedia index. Is there a way to sort the INFLUENCERS aggregation, or otherwise order the influencers, by their smp_score?
With that idea, smp_id 3 which is Mark, should be the first one to appear since he has an smp_score of 15
Thank you in advance for your help!
What you are looking for is a JOIN operation. Note that Elasticsearch doesn't support JOIN operations unless they are modelled in a way as mentioned in this link.
Instead, a very simplistic approach is to denormalize your data and add the smp_score to your socialmedia index as below:
Mapping:
PUT socialmedia
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"smp_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"smp_score": {
"type": "float"
}
}
}
}
Your ES query would then have two terms aggregations, as shown below:
Request Query:
POST socialmedia/_search
{
"size": 0,
"aggs": {
"my_influencers_score": {
"terms": {
"field": "smp_score",
"order": { "_key": "desc" }
},
"aggs": {
"influencers": {
"terms": {
"field": "smp_id.keyword"
}
}
}
}
}
}
Basically we are first aggregating on the smp_score and then introducing a sub-aggregation to display the smp_id.
Response:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_influencers_score" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 15.0,
"doc_count" : 1,
"influencers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3",
"doc_count" : 1
}
]
}
},
{
"key" : 10.0,
"doc_count" : 1,
"influencers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "2",
"doc_count" : 1
}
]
}
},
{
"key" : 5.0,
"doc_count" : 1,
"influencers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1
}
]
}
}
]
}
}
}
Do spend some time reading the above link; however, those approaches would require you to model your index differently, depending on which of its options you choose. From what I understand, the solution I've provided should suffice.
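As a rough illustration of the denormalization step, the Python sketch below copies smp_score from each influencer onto the matching socialmedia documents and then orders the influencers by score. It assumes the copy happens in application code before indexing. The helper names are hypothetical, and smp_id is kept as a number here for brevity, although it is a text field in the mapping above.

```python
# Sample documents from the question.
influencers = [
    {"_id": 1, "name": "John", "smp_id": 1, "smp_score": 5},
    {"_id": 2, "name": "Peter", "smp_id": 2, "smp_score": 10},
    {"_id": 3, "name": "Mark", "smp_id": 3, "smp_score": 15},
]
socialmedia = [
    {"_id": 1001, "title": "Title 1", "smp_id": 1},
    {"_id": 1002, "title": "Title 2", "smp_id": 2},
    {"_id": 1003, "title": "Title 3", "smp_id": 3},
]

def denormalize(posts, influencer_docs):
    """Copy smp_score from the influencer onto each post before indexing."""
    score_by_smp_id = {i["smp_id"]: i["smp_score"] for i in influencer_docs}
    return [{**post, "smp_score": score_by_smp_id[post["smp_id"]]} for post in posts]

def influencers_by_score(posts):
    """Distinct smp_id values ordered by smp_score, highest first."""
    score = {post["smp_id"]: post["smp_score"] for post in posts}
    return sorted(score, key=score.get, reverse=True)

posts = denormalize(socialmedia, influencers)
print(influencers_by_score(posts))
```

The trade-off of this design is that a change to an influencer's smp_score has to be written to all of that influencer's socialmedia documents as well.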
