Return top N buckets only - elasticsearch

So in Elasticsearch I can do something like this:
{
"aggs": {
"title": {
"terms": {
"field": "title",
"shard_size": 50,
"size": 5
}
}
},
"query": {...},
"size": 0
}
And this will return me the document counts of the top 5 titles, so we end up with something like (in part):
"buckets" : [
{
"key" : "Delivery Driver",
"doc_count" : 1495
},
{
"key" : "Assistant Manager",
"doc_count" : 1250
},
{
"key" : "Server",
"doc_count" : 1175
},
{
"key" : "Dishwasher",
"doc_count" : 966
},
{
"key" : "Team Member",
"doc_count" : 960
}
]
But now I need to have the document counts in some custom buckets, so I do something like this:
{
"aggs": {
"loc": {
"filters": {
"filters": {
"1042_2": {
"terms": {
"counties": [
...
]
}
},
"1594_2": {
"terms": {
"counties": [
...
]
}
},
"1714_2": {
"terms": {
"counties": [
...
]
}
},
"1746_2": {
"terms": {
"counties": [
...
]
}
},
"1814_2": {
"terms": {
"counties": [
...
]
}
},
"1943_2": {
"terms": {
"counties": [
...
]
}
},
"2658_2": {
"terms": {
"counties": [
...
]
}
}
}
}
}
},
"query": {...},
"size": 0
}
Note that there are 7 buckets, because we don't know in advance which are the largest. Running this returns:
"buckets" : {
"1042_2" : {
"doc_count" : 23687
},
"1594_2" : {
"doc_count" : 8951
},
"1714_2" : {
"doc_count" : 52555
},
"1746_2" : {
"doc_count" : 60534
},
"1814_2" : {
"doc_count" : 63956
},
"1943_2" : {
"doc_count" : 25533
},
"2658_2" : {
"doc_count" : 534
}
}
But I would like it to return only the largest 5 instead of all the buckets. Is there a way to restrict the response to the n largest buckets, the way the size parameter under terms does?

A size parameter does not make sense for the filters aggregation, because by specifying the filters you already explicitly control the number of buckets that get created and returned.
What you may want to consider, though, is letting all the potential buckets be created and then sorting them by descending doc_count.
On the client side you then simply consume the first n buckets.
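A minimal Python sketch of that client-side step, using the bucket names and counts from the response above:

```python
# Sort the filters-aggregation buckets by doc_count, descending, and
# keep only the top n on the client side.
response_buckets = {
    "1042_2": {"doc_count": 23687},
    "1594_2": {"doc_count": 8951},
    "1714_2": {"doc_count": 52555},
    "1746_2": {"doc_count": 60534},
    "1814_2": {"doc_count": 63956},
    "1943_2": {"doc_count": 25533},
    "2658_2": {"doc_count": 534},
}

def top_n_buckets(buckets, n):
    """Return the n (key, doc_count) pairs with the largest counts."""
    ranked = sorted(buckets.items(),
                    key=lambda kv: kv[1]["doc_count"],
                    reverse=True)
    return [(key, body["doc_count"]) for key, body in ranked[:n]]

top5 = top_n_buckets(response_buckets, 5)
# top5 -> [('1814_2', 63956), ('1746_2', 60534), ('1714_2', 52555),
#          ('1943_2', 25533), ('1042_2', 23687)]
```

The two smallest buckets (1594_2 and 2658_2) are simply never consumed.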


How to exclude the buckets having doc count equal to 0

I want to exclude from the date histogram aggregation response those buckets whose doc count is equal to 0, and then get the count of the remaining buckets.
The query is :
GET metricbeat-*/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"host.cpu.usage": {
"gte": 0.8
}
}
},
{
"range": {
"#timestamp": {
"gte": "2022-09-22T10:16:00.000Z",
"lte": "2022-09-22T10:18:00.000Z"
}
}
}
]
}
},
"aggs": {
"hostName": {
"terms": {
"field": "host.name"
},
"aggs": {
"docsOverTimeFrame": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "10s"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "docsOverTimeFrame._bucket_count"
},
"script": {
"source": "params.count == 12"
}
}
}
}
}
}
}
The response that I get right now is :
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 38,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"hostName" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "datahot01",
"doc_count" : 3,
"docsOverTimeFrame" : {
"buckets" : [
{
"key_as_string" : "2022-09-22T10:16:00.000Z",
"key" : 1663841760000,
"doc_count" : 1
},
{
"key_as_string" : "2022-09-22T10:16:10.000Z",
"key" : 1663841770000,
"doc_count" : 1
},
{
"key_as_string" : "2022-09-22T10:16:20.000Z",
"key" : 1663841780000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:16:30.000Z",
"key" : 1663841790000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:16:40.000Z",
"key" : 1663841800000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:16:50.000Z",
"key" : 1663841810000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:00.000Z",
"key" : 1663841820000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:10.000Z",
"key" : 1663841830000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:20.000Z",
"key" : 1663841840000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:30.000Z",
"key" : 1663841850000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:40.000Z",
"key" : 1663841860000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:50.000Z",
"key" : 1663841870000,
"doc_count" : 0
}
]
}
}
]
}
}
}
So, if I am able to exclude the buckets that have doc count = 0, then on the basis of the number of remaining buckets I can check whether the bucket count equals 12 (which I am doing using the bucket selector aggregation).
Is there some way to exclude the buckets having doc count = 0, so that I get a bucket count of 2 instead of 12?
I was able to solve the above use case by using a pipeline aggregation (a bucket_selector aggregation) inside the date histogram aggregation.
The modified query is :
{
"query": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2022-09-22T10:16:00.000Z",
"lte": "2022-09-22T10:22:00.000Z"
}
}
},
{
"range": {
"system.cpu.total.norm.pct": {
"gte": 0.8
}
}
}
]
}
},
"aggs": {
"hostName": {
"terms": {
"field": "host.name"
},
"aggs": {
"docsOverTimeFrame": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "10s"
},
"aggs": {
"histogram_doc_count": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count > 0"
}
}
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "docsOverTimeFrame._bucket_count"
},
"script": {
"source": "params.count == 12"
}
}
}
}
}
}
}
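The intent of the two selectors can be mirrored client-side. A rough Python equivalent (host names and bucket shapes assumed to match the response format shown earlier):

```python
def hosts_with_full_coverage(host_buckets, expected=12):
    """Mimic the nested bucket_selector pipeline: drop empty
    date_histogram buckets, then keep hosts whose remaining
    bucket count equals `expected`."""
    kept = []
    for host in host_buckets:
        non_empty = [b for b in host["docsOverTimeFrame"]["buckets"]
                     if b["doc_count"] > 0]
        if len(non_empty) == expected:
            kept.append(host["key"])
    return kept

# The sample host above has only 2 non-empty 10s buckets out of 12,
# so it is filtered out unless we lower the expectation:
sample = [{
    "key": "datahot01",
    "docsOverTimeFrame": {
        "buckets": [{"doc_count": 1}, {"doc_count": 1}]
                   + [{"doc_count": 0}] * 10
    },
}]
hosts_with_full_coverage(sample)              # -> []
hosts_with_full_coverage(sample, expected=2)  # -> ['datahot01']
```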

bucket aggregation/bucket_script computation

How can I apply a computation using bucket fields via bucket_script? More so, I would like to understand how to aggregate on distinct results.
For example, below is a sample query, and the response.
What I am looking for is to aggregate the following into two fields:
sum of all buckets' dist.value from the example response (1 + 2 = 3)
sum of all buckets' (dist.value × key) from the example response ((1 × 10) + (2 × 20) = 50)
Query
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"field": "value"
}
}
]
}
},
"aggs":{
"sales_summary":{
"terms":{
"field":"qty",
"size":"100"
},
"aggs":{
"dist":{
"cardinality":{
"field":"somekey.keyword"
}
}
}
}
}
}
Query Result:
{
"aggregations": {
"sales_summary": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 10,
"doc_count": 100,
"dist": {
"value": 1
}
},
{
"key": 20,
"doc_count": 200,
"dist": {
"value": 2
}
}
]
}
}
}
You need to use a sum_bucket aggregation, which is a pipeline aggregation, to sum the cardinality aggregation's value across all the buckets.
Search query for the sum of all buckets' dist.value (1 + 2 = 3):
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>dist"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
}
}
]
},
"sum_buckets" : {
"value" : 5.0
}
}
For the second requirement, you first need to modify the value in each bucket using a bucket_script aggregation, and then run a sum_bucket aggregation on the modified value.
Search query for the sum of all buckets' (dist.value × key) ((1 × 10) + (2 × 20) = 50):
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
},
"format-value-agg": {
"bucket_script": {
"buckets_path": {
"newValue": "dist"
},
"script": "params.newValue * 10"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>format-value-agg"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
},
"format-value-agg" : {
"value" : 20.0
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
},
"format-value-agg" : {
"value" : 30.0
}
}
]
},
"sum_buckets" : {
"value" : 50.0
}
}
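As a sanity check, both requested figures can also be computed client-side from the terms buckets. A short Python sketch using the asker's original example response (dist values 1 and 2 for keys 10 and 20):

```python
def summarize(buckets):
    """Sum of dist.value, and sum of dist.value * key, over all buckets."""
    sum_dist = sum(b["dist"]["value"] for b in buckets)
    sum_weighted = sum(b["dist"]["value"] * b["key"] for b in buckets)
    return sum_dist, sum_weighted

buckets = [
    {"key": 10, "doc_count": 100, "dist": {"value": 1}},
    {"key": 20, "doc_count": 200, "dist": {"value": 2}},
]
summarize(buckets)  # -> (3, 50)
```

Note that the bucket_script above multiplies by the literal 10 rather than by each bucket's key, so for the answer's indexed sample (dist values 2 and 3 with keys 10 and 20) the true per-key weighted sum would be 2×10 + 3×20 = 80, not 50.0; a per-key multiplication would need the bucket's key available to the script.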

Elasticsearch sub-aggregation queries that check whether some bucket values meet a condition

Hi guys! I am trying to pull out some aggregations that need to perform specific computation logic per bucket, and it is killing me..
So I have some data tracking who uses what application feature like this:
[
{
"event_key": "basic_search",
"user": {
"tenant_tier": "free"
},
"origin": {
"visitor_id": "xxxxxxx"
}
},
{
"event_key": "registration",
"user": {
"tenant_tier": "basic"
},
"origin": {
"visitor_id": "xxxxxxx"
}
},
{
"event_key": "advanced_search",
"user": {
"tenant_tier": "basic"
},
"origin": {
"visitor_id": "xxxxxxx"
}
}
]
The user can opt to trial the app features under a free-tier identity, then register to enjoy other features. The origin.visitor_id is calculated from the website user's IP address, User-Agent, etc.
With this data, I am hoping to answer the question: "How many people used free trial features BEFORE registering?"
I came up with the ES query template below, but couldn't figure out how to write the sub-aggregations, which seem to require more complex scripting against values in the bucket... Any advice is very much appreciated!
{
  "aggs": {
    "origin": {
      "terms": {
        "field": "origin.id.keyword",
        "size": 1000
      },
      "aggs": {
        "user_started_out_free": {
          # ??????
          # need to return a boolean telling whether `user.tenant_tier` of the first document in the bucket is `free`
        },
        "then_registered": {
          # ??????
          # need to return a boolean telling whether any `event_key` in the bucket is `registration`
        },
        "is_trial_user_then_registered": {
          "bucket_script": {
            "buckets_path": {
              "user_started_out_free": "user_started_out_free",
              "then_registered": "then_registered"
            },
            "script": "params.user_started_out_free && params.then_registered"
          }
        }
      }
    },
    "num_trial_then_registered": {
      "sum_bucket": {
        "buckets_path": "origin>is_trial_user_then_registered"
      }
    }
  }
}
You can use a bucket_selector aggregation to keep only the buckets where both "trial" and "registration" events exist, then use a stats_bucket aggregation to get the bucket count.
Query
{
"size": 0,
"aggs": {
"visitors": {
"terms": {
"field": "origin.visitor_id.keyword",
"size": 10
},
"aggs": {
"user_started_out_free": {
"filter": {
"term": {
"event_key.keyword": "basic_search"
}
}
},
"then_registered": {
"filter": {
"term": {
"event_key.keyword": "registration"
}
}
},
"user_first_free_then_registerd":{
"bucket_selector": {
"buckets_path": {
"free": "user_started_out_free._count",
"registered": "then_registered._count"
},
"script": "if(params.free>0 && params.registered>0) return true;"
}
}
}
},
"bucketcount":{
"stats_bucket":{
"buckets_path":"visitors._count"
}
}
}
}
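The selector logic can also be reproduced client-side. A minimal Python sketch over buckets shaped like the result below (using the same sample visitors):

```python
def trial_then_registered(visitor_buckets):
    """Keep visitors with at least one free-tier event and at least
    one registration event, mirroring the bucket_selector above."""
    return [b["key"] for b in visitor_buckets
            if b["user_started_out_free"]["doc_count"] > 0
            and b["then_registered"]["doc_count"] > 0]

visitor_buckets = [
    {"key": "3", "doc_count": 4,
     "then_registered": {"doc_count": 3},
     "user_started_out_free": {"doc_count": 1}},
    {"key": "1", "doc_count": 3,
     "then_registered": {"doc_count": 1},
     "user_started_out_free": {"doc_count": 1}},
    {"key": "2", "doc_count": 2,
     "then_registered": {"doc_count": 1},
     "user_started_out_free": {"doc_count": 1}},
]
trial_then_registered(visitor_buckets)  # -> ['3', '1', '2']
```

All three sample visitors pass both conditions, matching the bucketcount of 3 in the result.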
Result
"visitors" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3",
"doc_count" : 4,
"then_registered" : {
"doc_count" : 3
},
"user_started_out_free" : {
"doc_count" : 1
}
},
{
"key" : "1",
"doc_count" : 3,
"then_registered" : {
"doc_count" : 1
},
"user_started_out_free" : {
"doc_count" : 1
}
},
{
"key" : "2",
"doc_count" : 2,
"then_registered" : {
"doc_count" : 1
},
"user_started_out_free" : {
"doc_count" : 1
}
}
]
},
"bucketcount" : {
"count" : 3,
"min" : 2.0,
"max" : 4.0,
"avg" : 3.0,
"sum" : 9.0
}

ElasticSearch: Query to find max of count of objects based on field value

For the example document below, I want to find the max count of actions, grouped by component name, across all documents in the index. Could you please help me find a way to do this?
Expected result assuming only one document present in the Index:
comp1 -> action1 -> max 2 times
comp1 -> action2 -> max 1 time
comp2 -> action2 -> max 1 time
comp2 -> action3 -> max 1 time
Sample Document:
{
"id": "AC103902:A13A_AC140008:01BB_5FA2E8FA_1C08:0007",
"tokens": [
{
"name": "comp1",
"items": [
{
"action": "action1",
"attr": "value"
},
{
"action": "action1",
"attr": "value"
},
{
"action": "action2",
"attr": "value"
}
]
},
{
"name": "comp2",
"items": [
{
"action": "action2",
"attr": "value"
},
{
"action": "action3",
"attr": "value"
}
]
}
]
}
ElasticSearch Version: 7.9
I can loop through each document and calculate this client-side, but I am curious whether there is an ES query that can produce this kind of summary from the documents in the index.
You'll need to define both the tokens array and the tokens.items array as nested in order to get the correct stats.
Then, assuming your mapping looks something along the lines of
{
"mappings": {
"properties": {
"tokens": {
"type": "nested",
"properties": {
"items": {
"type": "nested"
}
}
}
}
}
}
the following query can be executed:
GET index_name/_search
{
"size": 0,
"aggs": {
"by_token_name": {
"nested": {
"path": "tokens"
},
"aggs": {
"token_name": {
"terms": {
"field": "tokens.name.keyword"
},
"aggs": {
"by_max_actions": {
"nested": {
"path": "tokens.items"
},
"aggs": {
"max_actions": {
"terms": {
"field": "tokens.items.action.keyword"
}
}
}
}
}
}
}
}
}
}
yielding these buckets:
[
{
"key" : "comp1", <--
"doc_count" : 1,
"by_max_actions" : {
"doc_count" : 3,
"max_actions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "action1", <--
"doc_count" : 2
},
{
"key" : "action2", <--
"doc_count" : 1
}
]
}
}
},
{
"key" : "comp2", <--
"doc_count" : 1,
"by_max_actions" : {
"doc_count" : 2,
"max_actions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "action2", <--
"doc_count" : 1
},
{
"key" : "action3", <--
"doc_count" : 1
}
]
}
}
}
]
which can be easily post-processed at client side.
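For instance, a small Python helper (bucket shapes as in the response above) that flattens the nested terms buckets into per-component action counts might look like:

```python
def flatten_action_counts(component_buckets):
    """Turn the nested terms buckets into a (component, action) -> count map."""
    counts = {}
    for comp in component_buckets:
        for action in comp["by_max_actions"]["max_actions"]["buckets"]:
            counts[(comp["key"], action["key"])] = action["doc_count"]
    return counts

component_buckets = [
    {"key": "comp1", "doc_count": 1,
     "by_max_actions": {"doc_count": 3, "max_actions": {"buckets": [
         {"key": "action1", "doc_count": 2},
         {"key": "action2", "doc_count": 1},
     ]}}},
    {"key": "comp2", "doc_count": 1,
     "by_max_actions": {"doc_count": 2, "max_actions": {"buckets": [
         {"key": "action2", "doc_count": 1},
         {"key": "action3", "doc_count": 1},
     ]}}},
]
flatten_action_counts(component_buckets)
# -> {('comp1', 'action1'): 2, ('comp1', 'action2'): 1,
#     ('comp2', 'action2'): 1, ('comp2', 'action3'): 1}
```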

Elasticsearch aggregations: how to get bucket with 'other' results of terms aggregation?

I use an aggregation to collect data from a nested field and got stuck a little.
Example of document:
{
...
rectangle: {
attributes: [
{_id: 'some_id', ...}
]
}
}
ES allows grouping data by rectangle.attributes._id, but is there any way to get an 'other' bucket holding the documents that were not added to any of the groups? Or maybe there is a way to write a query that creates a bucket for documents matching {"rectangle.attributes._id": {$ne: "{currentDoc}.rectangle.attributes._id"}}.
A bucket would be perfect because I need to run further aggregations on the 'other' docs.
Or maybe there's some cool workaround.
I use a query like this for the aggregation:
"aggs": {
"attributes": {
"nested": {
"path": "rectangle.attributes"
},
"aggs": {
"attributesCount": {
"cardinality": {
"field": "rectangle.attributes._id.keyword"
}
},
"entries": {
"terms": {
"field": "rectangle.attributes._id.keyword"
}
}
}
}
}
And get this result
"buckets" : [
{
"key" : "some_parent_id",
"doc_count" : 27616,
"attributes" : {
"doc_count" : 45,
"entries" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "some_id",
"doc_count" : 45,
"attributeOptionsCount" : {
"value" : 2
}
}
]
}
}
}
]
A result like this would be perfect:
"buckets" : [
{
"key" : "some_parent_id",
"doc_count" : 1000,
"attributes" : {
"doc_count" : 145,
"entries" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "some_id",
"doc_count" : 45
},
{
"key" : "other",
"doc_count" : 100
}
]
}
}
}
]
You can make use of the missing value parameter, which defines how documents missing the field should be treated. Update the aggregation as below:
"aggs": {
"attributes": {
"nested": {
"path": "rectangle.attributes"
},
"aggs": {
"attributesCount": {
"cardinality": {
"field": "rectangle.attributes._id.keyword"
}
},
"entries": {
"terms": {
"field": "rectangle.attributes._id.keyword",
"missing": "other"
}
}
}
}
}
