Sub-aggregate a multi-level nested composite aggregation - elasticsearch

I'm trying to set up a search query that should composite aggregate a collection by a multi-level nested field and give me some sub-aggregation metrics from this collection. I was able to fetch the composite aggregation with its buckets as expected but the sub-aggregation metrics come with 0 for all buckets. I'm not sure if I am failing to correctly point out what fields the sub-aggregation should consider or if it should be placed inside a different part of the query.
My collection looks similar to the following:
{
id: '32ead132eq13w21',
statistics: {
clicks: 123,
views: 456
},
categories: [{ //nested type
name: 'color',
tags: [{ //nested type
slug: 'blue'
},{
slug: 'red'
}]
}]
}
Below you can find what I have tried so far. All buckets come back with a clicks sum of 0 even though every document has a clicks value set.
GET /acounts-123321/_search
{
"size": 0,
"aggs": {
"nested_categories": {
"nested": {
"path": "categories"
},
"aggs": {
"nested_tags": {
"nested": {
"path": "categories.tags"
},
"aggs": {
"group": {
"composite": {
"size": 100,
"sources": [
{ "slug": { "terms" : { "field": "categories.tags.slug"} }}
]
},
"aggregations": {
"clicks": {
"sum": {
"field": "statistics.clicks"
}
}
}
}
}
}
}
}
}
}
The response body I have so far:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1304,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"nested_categories" : {
"doc_count" : 1486,
"nested_tags" : {
"doc_count" : 1486,
"group" : {
"buckets" : [
{
"key" : {
"slug" : "red"
},
"doc_count" : 268,
"clicks" : {
"value" : 0.0
}
}, {
"key" : {
"slug" : "blue"
},
"doc_count" : 122,
"clicks" : {
"value" : 0.0
},
.....
]
}
}
}
}
}

For this to work, the composite sources and every field used by its sub-aggregations would need to live in the same nested context. Here the composite runs inside categories.tags, where statistics.clicks does not exist, so the sum comes back as 0.
I answered something similar a while ago. The asker there needed to put the nested values onto the top level. You have the opposite challenge: since the statistics.clicks field is on the top level, you'd need to duplicate it across each entry of categories.tags, which, I suspect, won't be feasible because you're likely updating these stats every now and then.
If you're OK with skipping the composite approach and using the terms agg without it, you could make the summation work by jumping back to the top level thru reverse_nested:
{
"size": 0,
"aggs": {
"nested_tags": {
"nested": {
"path": "categories.tags"
},
"aggs": {
"by_slug": {
"terms": {
"field": "categories.tags.slug",
"size": 100
},
"aggs": {
"back_to_parent": {
"reverse_nested": {},
"aggs": {
"clicks": {
"sum": {
"field": "statistics.clicks"
}
}
}
}
}
}
}
}
}
}
This works just as well but doesn't offer the pagination that composite provides.
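To make the effect of reverse_nested more concrete, here is a minimal Python sketch that mimics the bucketing client-side. It is not Elasticsearch code: the sample documents and the function name sum_clicks_by_slug are made up for illustration. The idea it shows is that each tag lives in its own nested document without statistics.clicks, so the sum only becomes meaningful once you step back to the parent document, and each parent is counted once per slug bucket.

```python
from collections import defaultdict

# Hypothetical sample documents shaped like the one in the question.
docs = [
    {"id": "a", "statistics": {"clicks": 10},
     "categories": [{"name": "color", "tags": [{"slug": "blue"}, {"slug": "red"}]}]},
    {"id": "b", "statistics": {"clicks": 5},
     "categories": [{"name": "color", "tags": [{"slug": "red"}]}]},
]

def sum_clicks_by_slug(documents):
    """Group by tag slug, then sum the parent's clicks once per parent document."""
    parents_per_slug = defaultdict(set)
    clicks_by_id = {}
    for doc in documents:
        clicks_by_id[doc["id"]] = doc["statistics"]["clicks"]
        for category in doc["categories"]:
            for tag in category["tags"]:
                # The set plays the role of reverse_nested: a parent counts once.
                parents_per_slug[tag["slug"]].add(doc["id"])
    return {slug: sum(clicks_by_id[i] for i in ids)
            for slug, ids in parents_per_slug.items()}

result = sum_clicks_by_slug(docs)
print(result)
```

Document "a" carries both tags, so its 10 clicks show up under both blue and red, which matches how the terms buckets overlap in the query above.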
Clarification
If you needed a color filter, you could do:
{
"size": 0,
"aggs": {
"categories_parent": {
"nested": {
"path": "categories"
},
"aggs": {
"filtered_by_color": {
"filter": {
"term": {
"categories.name": "color"
}
},
"aggs": {
"nested_tags": {
"nested": {
"path": "categories.tags"
},
"aggs": {
"by_slug": {
"terms": {
"field": "categories.tags.slug",
"size": 100
},
"aggs": {
"back_to_parent": {
"reverse_nested": {},
"aggs": {
"clicks": {
"sum": {
"field": "statistics.clicks"
}
}
}
}
}
}
}
}
}
}
}
}
}
}

Related

Elastic Search return object with sum aggregation

I am trying to get a list of the top 100 guests by revenue generated with Elastic Search. To do this I am using a terms and a sum aggregation. Although it does return the correct values, I want to return the entire guest object along with the aggregation.
This is my query:
GET reservations/_search
{
"size": 0,
"aggs": {
"top_revenue": {
"terms": {
"field": "total",
"size": 100,
"order": {
"top_revenue_hits": "desc"
}
},
"aggs": {
"top_revenue_hits": {
"sum": {
"field": "total"
}
}
}
}
}
}
This returns a list of the top 100 guests but only the amount they spent:
{
"aggregations" : {
"top_revenue" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 498,
"buckets" : [
{
"key" : 934.9500122070312,
"doc_count" : 8,
"top_revenue_hits" : {
"value" : 7479.60009765625
}
},
{
"key" : 922.0,
"doc_count" : 6,
"top_revenue_hits" : {
"value" : 5532.0
}
},
...
]
}
}
}
How can I get the query to return the entire guest object, not only the summed amount?
When I run GET reservations/_search it returns:
{
"hits": [
{
"_index": "reservations",
"_id": "1334620",
"_score": 1.0,
"_source": {
"id": "1334620",
"total": 110.8,
"payment": "unpaid",
"contact": {
"name": "John Doe",
"email": "john#mail.com"
}
}
},
... other reservations
]
}
I want to get this to return with the sum aggregation.
I have tried a top_hits aggregation: with _source it does return the entire guest object, but it does not show the total amount spent. Adding _source to the sum aggregation gives an error.
Can I return the entire guest object with a sum aggregation or is this not the correct way?
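I'm assuming that contact.name is mapped as a keyword. The following query should work for you: it groups by guest, sums total per guest, sorts the buckets by that sum, and returns one full reservation document per guest through top_hits.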
{
"size": 0,
"aggs": {
"guests": {
"terms": {
"field": "contact.name",
"size": 100
},
"aggs": {
"sum_total": {
"sum": {
"field": "total"
}
},
"sortBy": {
"bucket_sort": {
"sort": [
{ "sum_total": { "order": "desc" } }
]
}
},
"guest": {
"top_hits": {
"size": 1
}
}
}
}
}
}
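As a rough illustration of what the combination of terms, sum, bucket_sort and top_hits is meant to produce, here is a small client-side Python sketch. The reservations and the function name top_guests are hypothetical; it only mirrors the intended shape of the result and does not talk to Elasticsearch.

```python
from collections import defaultdict

# Hypothetical reservations shaped like the _source in the question.
reservations = [
    {"id": "1", "total": 110.5, "contact": {"name": "John Doe"}},
    {"id": "2", "total": 50.0, "contact": {"name": "Jane Roe"}},
    {"id": "3", "total": 20.0, "contact": {"name": "John Doe"}},
    {"id": "4", "total": 200.0, "contact": {"name": "Jane Roe"}},
]

def top_guests(docs, size=100):
    """Group by guest name, sum totals, sort descending, keep one sample doc."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["contact"]["name"]].append(doc)      # terms on contact.name
    buckets = [
        {"key": name,
         "sum_total": sum(d["total"] for d in members),  # sum on total
         "guest": members[0]}                            # top_hits with size 1
        for name, members in groups.items()
    ]
    buckets.sort(key=lambda b: b["sum_total"], reverse=True)  # sort by sum, desc
    return buckets[:size]

buckets = top_guests(reservations)
print([(b["key"], b["sum_total"]) for b in buckets])
```

One caveat worth checking against your own data: this sketch sorts all guests before truncating, whereas in the query the terms aggregation picks its 100 buckets first (by document count, by default) and bucket_sort only reorders those. If that matters, ordering the terms aggregation itself by the sum_total sub-aggregation may be closer to a true top 100 by revenue.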

bucket aggregation/bucket_script computation

How do I apply a computation to bucket fields via bucket_script? More specifically, I would like to understand how to aggregate over distinct-count results.
For example, below is a sample query, and the response.
What I am looking for is to aggregate the following into two fields:
sum of all buckets dist.value from e.g. response (1+2=3)
sum of all buckets (dist.value x key) from e.g., response (1x10)+(2x20)=50
Query
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"field": "value"
}
}
]
}
},
"aggs":{
"sales_summary":{
"terms":{
"field":"qty",
"size":"100"
},
"aggs":{
"dist":{
"cardinality":{
"field":"somekey.keyword"
}
}
}
}
}
}
Query Result:
{
"aggregations": {
"sales_summary": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 10,
"doc_count": 100,
"dist": {
"value": 1
}
},
{
"key": 20,
"doc_count": 200,
"dist": {
"value": 2
}
}
]
}
}
}
You need a sum_bucket aggregation, a pipeline aggregation that sums the result of the cardinality aggregation across all the buckets.
Search query for the sum of dist.value over all buckets, e.g. (1+2=3) for the response in the question:
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>dist"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
}
}
]
},
"sum_buckets" : {
"value" : 5.0
}
}
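For the second requirement, first compute a new per-bucket value with a bucket_script aggregation, then run sum_bucket over that computed value. Note that the script below multiplies dist by the constant 10 rather than by each bucket's own key; to multiply by the key, the key has to be exposed to the script, for example through a max sub-aggregation on qty that is added to buckets_path.
Search query for the sum over all buckets of (dist.value x key), e.g. (1x10)+(2x20)=50 for the response in the question: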
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
},
"format-value-agg": {
"bucket_script": {
"buckets_path": {
"newValue": "dist"
},
"script": "params.newValue * 10"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>format-value-agg"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
},
"format-value-agg" : {
"value" : 20.0
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
},
"format-value-agg" : {
"value" : 30.0
}
}
]
},
"sum_buckets" : {
"value" : 50.0
}
}
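The two target numbers can be checked by hand. The short Python sketch below redoes the arithmetic on the buckets from the question's example response; it is only a worked calculation, not Elasticsearch code, and the variable names are chosen for illustration.

```python
# Buckets copied from the example response in the question.
buckets = [
    {"key": 10, "doc_count": 100, "dist": {"value": 1}},
    {"key": 20, "doc_count": 200, "dist": {"value": 2}},
]

# Requirement 1: what sum_bucket over sales_summary>dist computes.
sum_dist = sum(b["dist"]["value"] for b in buckets)

# Requirement 2: per-bucket product (the bucket_script step),
# followed by a sum over those products (the sum_bucket step).
weighted = [b["dist"]["value"] * b["key"] for b in buckets]
sum_weighted = sum(weighted)

print(sum_dist, sum_weighted)  # 3 50
```

The per-bucket products are 10 and 40, which is what a bucket_script that multiplies by each bucket's own key would have to emit.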

Elasticsearch sub-aggregation queries that check whether some bucket values meet a condition

Hi guys! I am trying to pull out some aggregations that need to perform some specific computation logic per bucket, and it is killing me.
So I have some data tracking who uses what application feature like this:
[
{
"event_key": "basic_search",
"user": {
"tenant_tier": "free"
},
"origin": {
"visitor_id": "xxxxxxx"
}
},
{
"event_key": "registration",
"user": {
"tenant_tier": "basic"
},
"origin": {
"visitor_id": "xxxxxxx"
}
},
{
"event_key": "advanced_search",
"user": {
"tenant_tier": "basic"
},
"origin": {
"visitor_id": "xxxxxxx"
}
}
]
The user can opt to trial the app features using free tier identity, then register to enjoy other features. The origin.visitor_id is calculated from a website user's IP addresses and User-Agent etc.
With this data, I am hoping to answer this question: "how many people used free trial features BEFORE registering".
I came up with a ES query template like below, but couldn't figure out how to write the sub-aggregations that seem to require some more complex scripting against values in the bucket... Any advice is very much appreciated!
{
"aggs": {
"origin": {
"terms": {
"field": "origin.id.keyword",
"size": 1000
},
"aggs": {
"user_started_out_free": {
# ??????
# need to return a boolean telling whether `user.tenant_tier` of the first document in the bucket is `free`
},
"then_registered": {
# ??????
# need to return a boolean telling whether any `event_key` in the bucket is `registration`
},
"is_trial_user_then_registered": {
"bucket_script": {
"buckets_path": {
"user_started_out_free": "user_started_out_free",
"then_registered": "then_registered"
},
"script": "user_started_out_free && then_registered"
}
}
}
},
"num_trial_then_registered": {
"sum_bucket": {
"buckets_path": "origin>is_trial_user_then_registered"
}
}
}
}
You can use a bucket_selector aggregation to keep only the buckets where both a "trial" event and a "registration" event exist, then use a stats_bucket aggregation to get the number of remaining buckets.
Query
{
"size": 0,
"aggs": {
"visitors": {
"terms": {
"field": "origin.visitor_id.keyword",
"size": 10
},
"aggs": {
"user_started_out_free": {
"filter": {
"term": {
"event_key.keyword": "basic_search"
}
}
},
"then_registered": {
"filter": {
"term": {
"event_key.keyword": "registration"
}
}
},
"user_first_free_then_registerd":{
"bucket_selector": {
"buckets_path": {
"free": "user_started_out_free._count",
"registered": "then_registered._count"
},
"script": "params.free > 0 && params.registered > 0"
}
}
}
},
"bucketcount":{
"stats_bucket":{
"buckets_path":"visitors._count"
}
}
}
}
Result
"visitors" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3",
"doc_count" : 4,
"then_registered" : {
"doc_count" : 3
},
"user_started_out_free" : {
"doc_count" : 1
}
},
{
"key" : "1",
"doc_count" : 3,
"then_registered" : {
"doc_count" : 1
},
"user_started_out_free" : {
"doc_count" : 1
}
},
{
"key" : "2",
"doc_count" : 2,
"then_registered" : {
"doc_count" : 1
},
"user_started_out_free" : {
"doc_count" : 1
}
}
]
},
"bucketcount" : {
"count" : 3,
"min" : 2.0,
"max" : 4.0,
"avg" : 3.0,
"sum" : 9.0
}
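The selection logic can be sketched client-side in Python as follows. The events and the function name are hypothetical. Like the query above, the sketch only checks that both kinds of event exist for a visitor; it does not verify that the trial event came before the registration, since the sample data carries no timestamp.

```python
from collections import defaultdict

# Hypothetical events shaped like the ones in the question.
events = [
    {"event_key": "basic_search", "origin": {"visitor_id": "1"}},
    {"event_key": "registration", "origin": {"visitor_id": "1"}},
    {"event_key": "basic_search", "origin": {"visitor_id": "2"}},
    {"event_key": "registration", "origin": {"visitor_id": "3"}},
]

def visitors_with_trial_and_registration(docs):
    """Return visitor ids that have at least one trial and one registration event."""
    counts = defaultdict(lambda: {"free": 0, "registered": 0})
    for doc in docs:
        bucket = counts[doc["origin"]["visitor_id"]]   # terms on visitor_id
        if doc["event_key"] == "basic_search":         # filter: trial feature used
            bucket["free"] += 1
        elif doc["event_key"] == "registration":       # filter: registered
            bucket["registered"] += 1
    # bucket_selector: keep buckets where both counts are positive
    return sorted(v for v, c in counts.items()
                  if c["free"] > 0 and c["registered"] > 0)

kept = visitors_with_trial_and_registration(events)
print(kept, len(kept))
```

Only visitor "1" has both a trial event and a registration, so the bucket count here is 1.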

How to count number of fields inside nested field? - Elasticsearch

I did the following mapping. I would like to count the number of products in each nested field "products" (for each document separately). I would also like to do a histogram aggregation, so that I would know the number of specific bucket sizes.
PUT /receipts
{
"mappings": {
"properties": {
"id" : {
"type": "integer"
},
"user_id" : {
"type": "integer"
},
"date" : {
"type": "date"
},
"sum" : {
"type": "double"
},
"products" : {
"type": "nested",
"properties": {
"name" : {
"type" : "text"
},
"number" : {
"type" : "double"
},
"price_single" : {
"type" : "double"
},
"price_total" : {
"type" : "double"
}
}
}
}
}
}
I've tried this query, but I get the number of all the products instead of number of products for each document separately.
GET /receipts/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products"
}
}
}
}
}
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count" : 6552,
"bucket_size" : {
"value" : 0
}
}
}
UPDATE
Now I have this code where I make separate buckets for each id and count the number of products inside them.
GET /receipts/_search
{
"query": {
"match_all": {}
},
"size" : 0,
"aggs": {
"terms":{
"terms":{
"field": "_id"
},
"aggs": {
"nested": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products.number"
}
}
}
}
}
}
}
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count_error_upper_bound" : 5,
"sum_other_doc_count" : 490,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"nested" : {
"doc_count" : 21,
"bucket_size" : {
"value" : 21
}
}
},
{
"key" : "10",
"doc_count" : 1,
"nested" : {
"doc_count" : 5,
"bucket_size" : {
"value" : 5
}
}
},
{
"key" : "100",
"doc_count" : 1,
"nested" : {
"doc_count" : 12,
"bucket_size" : {
"value" : 12
}
}
},
...
Is it possible to group these values (21, 5, 12, ...) into buckets to make a histogram of them?
products is only the path to the array of individual products, not an aggregatable field, which is why value_count on it returns 0. You'll need to run value_count on one of the product's fields instead, such as number:
GET receipts/_search
{
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products.number"
}
}
}
}
}
}
Note that if a product has no number, it won't contribute to the total count. It's therefore best practice to always include an ID in each product and then aggregate on that field.
Alternatively you could use a script to account for missing values. Luckily value_count does not deduplicate -- meaning if two products are alike and/or have empty values, they'll still be counted as two:
GET receipts/_search
{
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"script": {
"source": "doc['products.number'].toString()"
}
}
}
}
}
}
}
UPDATE
You could also use a nested composite aggregation which'll give you the histogrammed product count w/ the corresponding receipt id:
GET /receipts/_search
{
"size": 0,
"aggs": {
"my_aggs": {
"nested": {
"path": "products"
},
"aggs": {
"composite_parent": {
"composite": {
"sources": [
{
"receipt_id": {
"terms": {
"field": "_id"
}
}
},
{
"product_number": {
"histogram": {
"field": "products.number",
"interval": 1
}
}
}
]
}
}
}
}
}
}
The interval is modifiable.
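If what you need is a histogram of the per-receipt product counts themselves (21, 5, 12, ...), one option is to post-process the terms result on the client. The Python sketch below is only an illustration of that idea; the function name histogram and the interval of 10 are arbitrary choices, and the counts are the three shown in the question.

```python
from collections import Counter

# Per-receipt product counts, taken from the result shown in the question.
product_counts = {"1": 21, "10": 5, "100": 12}

def histogram(values, interval):
    """Bucket each value by the lower bound of its interval."""
    hist = Counter((v // interval) * interval for v in values)
    return dict(sorted(hist.items()))

hist = histogram(product_counts.values(), interval=10)
print(hist)
```

Each key is the lower bound of a bucket and each value is the number of receipts whose product count falls into it.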

Elastic Search: Selecting multiple values in aggregates

In Elastic Search I have the following index with 'allocated_bytes', 'total_bytes' and other fields:
{
"_index" : "metrics-blockstore_capacity-2017_06",
"_type" : "datapoint",
"_id" : "AVzHwgsi9KuwEU6jCXy5",
"_score" : 1.0,
"_source" : {
"timestamp" : 1498000001000,
"resource_guid" : "2185d15c-5298-44ac-8646-37575490125d",
"allocated_bytes" : 1.159196672E9,
"resource_type" : "machine",
"total_bytes" : 1.460811776E11,
"machine" : "2185d15c-5298-44ac-8646-37575490125d"
}
}
I have the following query to:
1) get a point for each 30-minute interval using date_histogram
2) group by the resource_guid field
3) apply a max aggregation to find the max value
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1497992400000,
"lte": 1497996000000
}
}
}
]
}
},
"aggregations": {
"groupByTime": {
"date_histogram": {
"field": "timestamp",
"interval": "30m",
"order": {
"_key": "desc"
}
},
"aggregations": {
"groupByField": {
"terms": {
"size": 1000,
"field": "resource_guid"
},
"aggregations": {
"maxValue": {
"max": {
"field": "allocated_bytes"
}
}
}
},
"sumUnique": {
"sum_bucket": {
"buckets_path": "groupByField>maxValue"
}
}
}
}
}
}
With this query I only get allocated_bytes, but I need both allocated_bytes and total_bytes at each result point.
Following is the result from the above query:
{
"key_as_string" : "2017-06-20T21:00:00.000Z",
"key" : 1497992400000,
"doc_count" : 9,
"groupByField" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "2185d15c-5298-44ac-8646-37575490125d",
"doc_count" : 3,
"maxValue" : {
"value" : 1.156182016E9
}
}, {
"key" : "c3513cdd-58bb-4f8e-9b4c-467230b4f6e2",
"doc_count" : 3,
"maxValue" : {
"value" : 1.156165632E9
}
}, {
"key" : "eff13403-9737-4d08-9dca-fb6c12c3a6fa",
"doc_count" : 3,
"maxValue" : {
"value" : 1.156182016E9
}
} ]
},
"sumUnique" : {
"value" : 3.468529664E9
}
}
I do need both allocated_bytes and total_bytes. How do I get multiple fields (allocated_bytes, total_bytes) for each point?
For example:
"sumUnique" : {
"Allocatedvalue" : 3.468529664E9,
"TotalValue" : 9.468529664E9
}
or like this:
"allocatedBytessumUnique" : {
"value" : 3.468529664E9
},
"totalBytessumUnique" : {
"value" : 9.468529664E9
}
You can just add another aggregation:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1497992400000,
"lte": 1497996000000
}
}
}
]
}
},
"aggregations": {
"groupByTime": {
"date_histogram": {
"field": "timestamp",
"interval": "30m",
"order": {
"_key": "desc"
}
},
"aggregations": {
"groupByField": {
"terms": {
"size": 1000,
"field": "resource_guid"
},
"aggregations": {
"maxValueAllocated": {
"max": {
"field": "allocated_bytes"
}
},
"maxValueTotal": {
"max": {
"field": "total_bytes"
}
}
}
},
"sumUniqueAllocatedBytes": {
"sum_bucket": {
"buckets_path": "groupByField>maxValueAllocated"
}
},
"sumUniqueTotalBytes": {
"sum_bucket": {
"buckets_path": "groupByField>maxValueTotal"
}
}
}
}
}
}
Be aware that sum_bucket only operates on the output of a sibling aggregation, so in this case it gives the sum of the per-resource max values, not the sum of total_bytes. If you want the sum of total_bytes itself, use a sum aggregation instead.
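The difference between the two numbers is easy to see on a toy example. In the Python sketch below the datapoints are hypothetical and stand for the contents of a single 30-minute bucket; it is a plain calculation, not Elasticsearch code.

```python
# Hypothetical datapoints falling into one 30-minute bucket.
points = [
    {"resource_guid": "a", "total_bytes": 100},
    {"resource_guid": "a", "total_bytes": 120},
    {"resource_guid": "b", "total_bytes": 50},
    {"resource_guid": "b", "total_bytes": 40},
]

# terms on resource_guid with a max sub-aggregation
max_per_resource = {}
for p in points:
    guid = p["resource_guid"]
    current = max_per_resource.get(guid, p["total_bytes"])
    max_per_resource[guid] = max(current, p["total_bytes"])

sum_of_max = sum(max_per_resource.values())          # what sum_bucket over max yields
plain_sum = sum(p["total_bytes"] for p in points)    # what a sum aggregation yields

print(sum_of_max, plain_sum)  # 170 310
```

Which of the two you want depends on whether each resource should contribute once per interval (sum of max) or once per datapoint (plain sum).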

Resources