Get top values from Elasticsearch bucket - elasticsearch

I have some items with brand
I want to return N records, but no more than x from each bucket
So far I have my buckets grouped by brand
"aggs": {
"brand": {
"terms": {
"field": "brand"
}
}
}
"aggregations" : {
"brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "brandA",
"doc_count" : 130
},
{
"key" : "brandB",
"doc_count" : 127
}
]
}
But how do I access specific bucket and get top x values from there?

You can use top hits sub aggregation to get documents under each brand. You can sort those documents and define a size too.
{
"aggs": {
"brand": {
"terms": {
"field": "brand",
"size": 10 --> no of brands
},
"aggs": {
"top_docs": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"size": 1 --> no of documents under each brand
}
}
}
}
}
}

Related

Elastic Search return object with sum aggregation

I am trying to get a list of the top 100 guests by revenue generated with Elastic Search. To do this I am using a terms and a sum aggregation. However it does return the correct values, I wan to return the entire guest object with the aggregation.
This is my query:
GET reservations/_search
{
"size": 0,
"aggs": {
"top_revenue": {
"terms": {
"field": "total",
"size": 100,
"order": {
"top_revenue_hits": "desc"
}
},
"aggs": {
"top_revenue_sum": {
"sum": {
"field": "total"
}
}
}
}
}
}
This returns a list of the top 100 guests but only the amount they spent:
{
"aggregations" : {
"top_revenue" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 498,
"buckets" : [
{
"key" : 934.9500122070312,
"doc_count" : 8,
"top_revenue_hits" : {
"value" : 7479.60009765625
}
},
{
"key" : 922.0,
"doc_count" : 6,
"top_revenue_hits" : {
"value" : 5532.0
}
},
...
]
}
}
}
How can I get the query to return the entire guests object, not only the sum amount.
When I run GET reservations/_search it returns:
{
"hits": [
{
"_index": "reservations",
"_id": "1334620",
"_score": 1.0,
"_source": {
"id": "1334620",
"total": 110.8,
"payment": "unpaid",
"contact": {
"name": "John Doe",
"email": "john#mail.com"
}
}
},
... other reservations
]
}
I want to get this to return with the sum aggregation.
I have tried to use a top_hits aggregation, using _source it does return the entire guest object but it does not show the total amount spent. And when adding _source to the sum aggregation it gives an error.
Can I return the entire guest object with a sum aggregation or is this not the correct way?
I assumed that contact.name is keyword in the mapping. Following query should work for you.
{
"size": 0,
"aggs": {
"guests": {
"terms": {
"field": "contact.name",
"size": 100
},
"aggs": {
"sum_total": {
"sum": {
"field": "total"
}
},
"sortBy": {
"bucket_sort": {
"sort": [
{ "sum_total": { "order": "desc" } }
]
}
},
"guest": {
"top_hits": {
"size": 1
}
}
}
}
}
}

bucket aggregation/bucket_script computation

How to apply computation using bucket fields via bucket_script? More so, I would like to understand how to aggregate on distinct, results.
For example, below is a sample query, and the response.
What I am looking for is to aggregate the following into two fields:
sum of all buckets dist.value from e.g. response (1+2=3)
sum of all buckets (dist.value x key) from e.g., response (1x10)+(2x20)=50
Query
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"field": "value"
}
}
]
}
},
"aggs":{
"sales_summary":{
"terms":{
"field":"qty",
"size":"100"
},
"aggs":{
"dist":{
"cardinality":{
"field":"somekey.keyword"
}
}
}
}
}
}
Query Result:
{
"aggregations": {
"sales_summary": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 10,
"doc_count": 100,
"dist": {
"value": 1
}
},
{
"key": 20,
"doc_count": 200,
"dist": {
"value": 2
}
}
]
}
}
}
You need to use a sum bucket aggregation, which is a pipeline aggregation to find the sum of response of cardinality aggregation across all the buckets.
Search Query for sum of all buckets dist.value from e.g. response (1+2=3):
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>dist"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
}
}
]
},
"sum_buckets" : {
"value" : 5.0
}
}
For the second requirement, you need to first modify the response of value in the bucket aggregation response, using bucket script aggregation, and then use the modified value to perform bucket sum aggregation on it.
Search Query for sum of all buckets (dist.value x key) from e.g., response (1x10)+(2x20)=50
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
},
"format-value-agg": {
"bucket_script": {
"buckets_path": {
"newValue": "dist"
},
"script": "params.newValue * 10"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>format-value-agg"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
},
"format-value-agg" : {
"value" : 20.0
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
},
"format-value-agg" : {
"value" : 30.0
}
}
]
},
"sum_buckets" : {
"value" : 50.0
}
}

How to filter by sub-aggregated results in Elasticsearch

I've got the following elastic search query in order to get the number of product sales per hour grouped by product id and hour of sale.
POST /my_sales/_search?size=0
{
"aggs": {
"sales_per_hour": {
"date_histogram": {
"field": "event_time",
"fixed_interval": "1h",
"format": "yyyy-MM-dd:HH:mm"
},
"aggs": {
"sales_per_hour_per_product": {
"terms": {
"field": "name.keyword"
}
}
}
}
}
}
One example of data :
{
"#timestamp" : "2020-10-29T18:09:56.921Z",
"name" : "my-beautifull_product",
"event_time" : "2020-10-17T08:01:33.397Z"
}
This query returns several buckets (one per hour and per product) but i would like to only retrieve those who have a doc_count higher than 10 for example, is it possible ?
For those results i would like to know the id of the product and the event_time bucket.
Thanks for your help.
Perhaps using the Bucket Selector feature will help on filtering out the results.
Try out this below search query:
{
"aggs": {
"sales_per_hour": {
"date_histogram": {
"field": "event_time",
"fixed_interval": "1h",
"format": "yyyy-MM-dd:HH:mm"
},
"aggs": {
"sales_per_hour_per_product": {
"terms": {
"field": "name.keyword"
},
"aggs": {
"the_filter": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count > 10"
}
}
}
}
}
}
}
}
It will filter out all the documents, whose count is greater than 10 based on "params.the_doc_count > 10"
Thank you for your help this is not far from what i would like but not exactly ; with the bucket selector i have something like this :
"aggregations" : {
"sales_per_hour" : {
"buckets" : [
{
"key_as_string" : "2020-08-31:23:00",
"key" : 1598914800000,
"doc_count" : 16,
"sales_per_hour_per_product" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "my_product_1",
"doc_count" : 2
},
{
"key" : "my_product_2",
"doc_count" : 2
},
{
"key" : "myproduct_3",
"doc_count" : 12
}
]
}
}
]
}
And sometimes none of the buckets are greater than 10, is it possible to have the same thing but with the filter on _count applied to the second level aggregation (sales_per_hour_per_product) and not on the first level (sales_per_hour) ?

ElasticSearch multiple terms aggregation order

I have a document structure which describes a container, some of its fields are:
containerId -> Unique Id,String
containerManufacturer -> String
containerValue -> Double
estContainerWeight ->Double
actualContainerWeight -> Double
I want to run a search aggregation which has two levels of terms aggregations on the two weight fields, but in descending order of the weight fields, like below:
{
"size": 0,
"aggs": {
"by_manufacturer": {
"terms": {
"field": "containerManufacturer",
"size": 10,
"order": {"estContainerWeight": "desc"} //Cannot do this
},
"aggs": {
"by_est_weight": {
"terms": {
"field": "estContainerWeight",
"size": 10,
"order": { "actualContainerWeight": "desc"} //Cannot do this
},
"aggs": {
"by_actual_weight": {
"terms": {
"field": "actualContainerWeight",
"size": 10
},
"aggs" : {
"container_value_sum" : {"sum" : {"field" : "containerValue"}}
}
}
}
}
}
}
}
}
Sample documents:
{"containerId":1,"containerManufacturer":"A","containerValue":12,"estContainerWeight":5.0,"actualContainerWeight":5.1}
{"containerId":2,"containerManufacturer":"A","containerValue":24,"estContainerWeight":5.0,"actualContainerWeight":5.2}
{"containerId":3,"containerManufacturer":"A","containerValue":23,"estContainerWeight":5.0,"actualContainerWeight":5.2}
{"containerId":4,"containerManufacturer":"A","containerValue":32,"estContainerWeight":6.0,"actualContainerWeight":6.2}
{"containerId":5,"containerManufacturer":"A","containerValue":26,"estContainerWeight":6.0,"actualContainerWeight":6.3}
{"containerId":6,"containerManufacturer":"A","containerValue":23,"estContainerWeight":6.0,"actualContainerWeight":6.2}
Expected Output(not complete):
{
"by_manufacturer": {
"buckets": [
{
"key": "A",
"by_est_weight": {
"buckets": [
{
"key" : 5.0,
"by_actual_weight" : {
"buckets" : [
{
"key" : 5.2,
"container_value_sum" : {
"value" : 1234 //Not actual sum
}
},
{
"key" : 5.1,
"container_value_sum" : {
"value" : 1234 //Not actual sum
}
}
]
}
},
{
"key" : 6.0,
"by_actual_weight" : {
"buckets" : [
{
"key" : 6.2,
"container_value_sum" : {
"value" : 1234 //Not actual sum
}
},
{
"key" : 6.3,
"container_value_sum" : {
"value" : 1234 //Not actual sum
}
}
]
}
}
]
}
}
]
}
}
However, I cannot order by the nested aggregations. (Error: Terms buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation...)
For example, for the above sample output, I have no control on the buckets generated if I introduce a size on the terms aggregations (which I will have to do if my data is large),so I would like to only get the top N weights for each terms aggregation.
Is there a way to do this ?
If I understand your problem correctly, you would like to sort the manufacturer terms in decreasing order of the estimated weights of their containers and then each bucket of "estimated weight" in decreasing order of their actual weight.
{
"size": 0,
"aggs": {
"by_manufacturer": {
"terms": {
"field": "containerManufacturer",
"size": 10
},
"by_est_weight": {
"terms": {
"field": "estContainerWeight",
"size": 10,
"order": {
"_term": "desc" <--- change to this
}
},
"by_actual_weight": {
"terms": {
"field": "actualContainerWeight",
"size": 10,
"order" : {"_term" : "desc"} <----- Change to this
},
"aggs": {
"container_value_sum": {
"sum": {
"field": "containerValue"
}
}
}
}
}
}
}
}
}
}

Why elasticsearch cannot support min_doc_count and order by _count asc?

Requirements:
group by hldId having count(*) = 2
Elasticsearch query:
"aggs": {
"groupByHldId": {
"terms": {
"field": "hldId",
"min_doc_count": 2,
"order" : { "_count" : "asc" }
}
}
}
but no records are return
"aggregations" : {
"groupByHldId" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 2660,
"buckets" : [ ]
}
}
but if changed to desc , it has return
"buckets" : [
{
"key" : 200035075,
"doc_count" : 355
},
or if without min_doc_count, it also has return
"buckets" : [
{
"key" : 200000061,
"doc_count" : 1
},
So why both have mid_doc_count and asc direction it returns empty?
You can try like this, bucket selector with a custom script.
{
"aggs": {
"countfield": {
"terms": {
"field": "hldId",
"size": 100,
"order": {
"_count": "desc"
}
},
"aggs": {
"criticals": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": "params.doc_count==2"
}
}
}
}
}
}

Resources