Why can't Elasticsearch support min_doc_count together with order by _count asc?

Requirements:
group by hldId having count(*) = 2
Elasticsearch query:
"aggs": {
"groupByHldId": {
"terms": {
"field": "hldId",
"min_doc_count": 2,
"order" : { "_count" : "asc" }
}
}
}
But no records are returned:
"aggregations" : {
"groupByHldId" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 2660,
"buckets" : [ ]
}
}
But if I change the order to desc, it does return buckets:
"buckets" : [
{
"key" : 200035075,
"doc_count" : 355
},
Or, if I remove min_doc_count, it also returns buckets:
"buckets" : [
{
"key" : 200000061,
"doc_count" : 1
},
So why does it return an empty result when both min_doc_count and the asc order are used?

You can try it like this, using a bucket_selector with a custom script on top of a terms aggregation:
{
"aggs": {
"countfield": {
"terms": {
"field": "hldId",
"size": 100,
"order": {
"_count": "desc"
}
},
"aggs": {
"criticals": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": "params.doc_count==2"
}
}
}
}
}
}
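As for why the original query comes back empty: as far as I understand the terms aggregation documentation, each shard picks its candidate terms using its own local order, and min_doc_count is applied only after the shard results are merged. With _count ascending the candidates tend to be the rarest terms, which min_doc_count then throws away. Below is a rough Python simulation of that two-step behaviour. The function name, the single-shard data and the shard_size value are all made up for illustration; this is a toy model, not Elasticsearch code.

```python
from collections import Counter

def simulate_terms_agg(shards, order, shard_size, min_doc_count):
    """Toy model: each shard returns its first `shard_size` terms in the
    requested order; min_doc_count is applied only after merging."""
    merged = Counter()
    for shard_values in shards:
        local = Counter(shard_values)
        ranked = sorted(local.items(), key=lambda kv: (kv[1], kv[0]),
                        reverse=(order == "desc"))
        for term, count in ranked[:shard_size]:
            merged[term] += count
    # The min_doc_count filter only sees the candidates the shards sent back
    kept = [(t, c) for t, c in merged.items() if c >= min_doc_count]
    return sorted(kept, key=lambda kv: (kv[1], kv[0]), reverse=(order == "desc"))

# Hypothetical shard: three ids seen once, one id seen twice, one seen five times
shard = ["a", "b", "c", "d", "d", "e", "e", "e", "e", "e"]

asc_result = simulate_terms_agg([shard], "asc", shard_size=3, min_doc_count=2)
desc_result = simulate_terms_agg([shard], "desc", shard_size=3, min_doc_count=2)
print(asc_result)
print(desc_result)
```

In this toy model the ascending run only ever sees the count-1 terms, so nothing survives the filter, while the descending run keeps the frequent terms.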

Related

How to sort buckets by doc_count?

GET /civile/_search
{
"size": 0,
"query": {
"match": {
"distretto": "MI"
}
},
"aggs": {
"our_buckets": {
"composite": {
"size": 1000,
"sources": [
{ "codiceoggetto": { "terms": { "field": "codiceoggetto.keyword", "order": "desc" } } }
]
}
}
}
}
My Elasticsearch query matches documents with distretto = "MI".
With size = 0 I hide the hits.
The most important part is the our_buckets aggregation that I define.
It returns 1000 keys and does a "group by" on the codiceoggetto.keyword field.
Now I want to order my bucket results by doc_count. How can I do that?
Here is the response:
{
"took" : 20,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"our_buckets" : {
"after_key" : {
"codiceoggetto" : "010001"
},
"buckets" : [
{
"key" : {
"codiceoggetto" : "490999"
},
"doc_count" : 3
},
{
"key" : {
"codiceoggetto" : "481312"
},
"doc_count" : 1
},
You can do it using a bucket_sort sub-aggregation:
{
"size": 0,
"query": {
"match": {
"distretto": "MI"
}
},
"aggs": {
"our_buckets": {
"composite": {
"size": 1000,
"sources": [
{
"codiceoggetto": {
"terms": {
"field": "codiceoggetto.keyword",
"order": "desc"
}
}
}
]
},
"aggs": {
"sort_by_count": {
"bucket_sort": {
"sort": [
{
"_count": {
"order": "desc"
}
}
]
}
}
}
}
}
}
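One thing to keep in mind (my understanding, worth checking against your version): bucket_sort only reorders the buckets of the page that composite returns, so when there are more keys than size the order is per page, not global. If that matters, one option is to collect every page and sort on the client. A minimal Python sketch, with the two buckets from the response above hard-coded and a hypothetical helper name:

```python
def sort_buckets_by_count(buckets, descending=True):
    """Return composite buckets ordered by doc_count (ties keep input order)."""
    return sorted(buckets, key=lambda b: b["doc_count"], reverse=descending)

# The two buckets shown in the response, deliberately listed in the wrong order
page = [
    {"key": {"codiceoggetto": "481312"}, "doc_count": 1},
    {"key": {"codiceoggetto": "490999"}, "doc_count": 3},
]

ordered = sort_buckets_by_count(page)
ordered_keys = [b["key"]["codiceoggetto"] for b in ordered]
print(ordered_keys)
```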

bucket aggregation/bucket_script computation

How do I apply a computation to bucket fields via bucket_script? More specifically, I would like to understand how to aggregate over the distinct-count results.
For example, below is a sample query and its response.
What I am looking for is to aggregate the following into two fields:
sum of dist.value over all buckets, e.g. from the response (1+2=3)
sum of (dist.value x key) over all buckets, e.g. from the response (1x10)+(2x20)=50
Query
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"field": "value"
}
}
]
}
},
"aggs":{
"sales_summary":{
"terms":{
"field":"qty",
"size":"100"
},
"aggs":{
"dist":{
"cardinality":{
"field":"somekey.keyword"
}
}
}
}
}
}
Query Result:
{
"aggregations": {
"sales_summary": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 10,
"doc_count": 100,
"dist": {
"value": 1
}
},
{
"key": 20,
"doc_count": 200,
"dist": {
"value": 2
}
}
]
}
}
}
You need to use a sum bucket aggregation, which is a pipeline aggregation, to sum the result of the cardinality aggregation across all the buckets.
Search Query for the sum of dist.value over all buckets, e.g. from the response (1+2=3):
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>dist"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
}
}
]
},
"sum_buckets" : {
"value" : 5.0
}
}
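For the second requirement, first compute a new per-bucket value with a bucket script aggregation, and then run the sum bucket aggregation on that computed value. Note that the script below multiplies by the constant 10, not by each bucket's key, so the total of 50 holds only for this test data (2x10 + 3x10). To multiply by the real key, one option (assuming qty is single-valued) is to add a metric sub-aggregation on qty, for example max, and reference it in buckets_path alongside dist.
Search Query for the sum of (dist.value x key) over all buckets, e.g. from the response (1x10)+(2x20)=50: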
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
},
"format-value-agg": {
"bucket_script": {
"buckets_path": {
"newValue": "dist"
},
"script": "params.newValue * 10"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>format-value-agg"
}
}
}
}
Search Response :
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
},
"format-value-agg" : {
"value" : 20.0
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
},
"format-value-agg" : {
"value" : 30.0
}
}
]
},
"sum_buckets" : {
"value" : 50.0
}
}
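To make the two target numbers concrete, here is the same arithmetic done in plain Python on the buckets from the "Query Result" in the question. It is only a cross-check of what the pipeline aggregations are supposed to produce, not a replacement for them; the variable names are mine.

```python
# Buckets copied from the "Query Result" in the question
buckets = [
    {"key": 10, "doc_count": 100, "dist": {"value": 1}},
    {"key": 20, "doc_count": 200, "dist": {"value": 2}},
]

# Requirement 1: sum of dist.value over all buckets
sum_dist = sum(b["dist"]["value"] for b in buckets)

# Requirement 2: sum of dist.value multiplied by the bucket key
sum_dist_times_key = sum(b["dist"]["value"] * b["key"] for b in buckets)

print(sum_dist, sum_dist_times_key)
```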

Get top values from Elasticsearch bucket

I have some items, each with a brand.
I want to return N records, but no more than x from each bucket.
So far I have my buckets grouped by brand:
"aggs": {
"brand": {
"terms": {
"field": "brand"
}
}
}
"aggregations" : {
"brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "brandA",
"doc_count" : 130
},
{
"key" : "brandB",
"doc_count" : 127
}
]
}
}
But how do I access a specific bucket and get the top x values from it?
You can use a top_hits sub-aggregation to get the documents under each brand. You can sort those documents and limit how many are returned. In the query below, the size under terms is the number of brands, and the size under top_hits is the number of documents returned for each brand.
{
"aggs": {
"brand": {
"terms": {
"field": "brand",
"size": 10
},
"aggs": {
"top_docs": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
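The "N records, but no more than x per brand" rule can also be written out in plain Python, which may help when checking what the aggregation returns. This is only a sketch: the items and the helper name are made up, and the date field is borrowed from the sort in the answer.

```python
from collections import defaultdict

def top_per_brand(items, per_brand, total):
    """Keep the newest `per_brand` items of each brand, then at most `total` overall."""
    by_brand = defaultdict(list)
    for item in sorted(items, key=lambda i: i["date"], reverse=True):
        if len(by_brand[item["brand"]]) < per_brand:
            by_brand[item["brand"]].append(item)
    picked = [i for group in by_brand.values() for i in group]
    picked.sort(key=lambda i: i["date"], reverse=True)
    return picked[:total]

# Hypothetical documents
items = [
    {"brand": "brandA", "date": "2020-01-03"},
    {"brand": "brandA", "date": "2020-01-02"},
    {"brand": "brandA", "date": "2020-01-01"},
    {"brand": "brandB", "date": "2019-12-31"},
]

result = top_per_brand(items, per_brand=2, total=3)
result_brands = [i["brand"] for i in result]
print(result_brands)
```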

Bucket selector in sub aggregation or cardinality aggregation

I have this query
GET /my_index3/_search
{
"size": 0,
"aggs": {
"num1": {
"terms": {
"field": "num1.keyword",
"order" : { "_count" : "desc" }
},
"aggs": {
"count_of_distinct_suffix": {
"cardinality" :{
"field" : "suffix.keyword"
},
"aggs": {
"filter_count": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.doc_count == 2"
}
}
}
}
}
}
}
}
Output:
"key" : "1563866656878888",
"doc_count" : 42,
"count_of_distinct_suffix" : {
"value" : 2
}
},
{
"key" : "1563866656871111",
"doc_count" : 40,
"count_of_distinct_suffix" : {
"value" : 2
}
},
{
"key" : "1563867854325555",
"doc_count" : 36,
"count_of_distinct_suffix" : {
"value" : 1
}
},
{
"key" : "1563867854323333",
"doc_count" : 12,
"count_of_distinct_suffix" : {
"value" : 1
}
},
I want to see only the results that have "count_of_distinct_suffix" : { "value" : 2 }
I was thinking of a bucket selector aggregation, but it cannot be added inside the cardinality aggregation:
"aggs": {
"my_filter": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.doc_count == 2"
}
}
}
It gives me the following error: Aggregator [count_of_distinct_suffix] of type [cardinality] cannot accept sub-aggregations
Do you guys have any idea to solve it?
Thank you very much for any help in advance !!
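A cardinality aggregation is a metric aggregation and cannot have sub-aggregations, so the bucket_selector must not be nested inside it. Instead, add it next to the cardinality aggregation, as a sibling under the terms aggregation, and point buckets_path at the cardinality aggregation: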
{
"size": 0,
"aggs": {
"num1": {
"terms": {
"field": "num1.keyword",
"order": {
"_count": "desc"
}
},
"aggs": {
"count_of_distinct_suffix": {
"cardinality": {
"field": "suffix.keyword"
}
},
"my_filter": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "count_of_distinct_suffix"
},
"script": "params.the_doc_count == 2"
}
}
}
}
}
}
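What the selector does can be mirrored on the output shown in the question. The Python lines below are just that check, with the four buckets copied from the question; they keep only buckets whose distinct-suffix count is 2.

```python
# Buckets copied from the "Output" in the question
buckets = [
    {"key": "1563866656878888", "doc_count": 42, "count_of_distinct_suffix": {"value": 2}},
    {"key": "1563866656871111", "doc_count": 40, "count_of_distinct_suffix": {"value": 2}},
    {"key": "1563867854325555", "doc_count": 36, "count_of_distinct_suffix": {"value": 1}},
    {"key": "1563867854323333", "doc_count": 12, "count_of_distinct_suffix": {"value": 1}},
]

# Same condition as the script "params.the_doc_count == 2"
kept_keys = [b["key"] for b in buckets
             if b["count_of_distinct_suffix"]["value"] == 2]
print(kept_keys)
```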

Elastic Search: Selecting multiple values in aggregates

In Elastic Search I have the following index with 'allocated_bytes', 'total_bytes' and other fields:
{
"_index" : "metrics-blockstore_capacity-2017_06",
"_type" : "datapoint",
"_id" : "AVzHwgsi9KuwEU6jCXy5",
"_score" : 1.0,
"_source" : {
"timestamp" : 1498000001000,
"resource_guid" : "2185d15c-5298-44ac-8646-37575490125d",
"allocated_bytes" : 1.159196672E9,
"resource_type" : "machine",
"total_bytes" : 1.460811776E11,
"machine" : "2185d15c-5298-44ac-8646-37575490125d"
}
}
I have the following query to:
1) get a point for each 30-minute interval using date_histogram
2) group by the resource_guid field
3) use a max aggregation to find the max value
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1497992400000,
"lte": 1497996000000
}
}
}
]
}
},
"aggregations": {
"groupByTime": {
"date_histogram": {
"field": "timestamp",
"interval": "30m",
"order": {
"_key": "desc"
}
},
"aggregations": {
"groupByField": {
"terms": {
"size": 1000,
"field": "resource_guid"
},
"aggregations": {
"maxValue": {
"max": {
"field": "allocated_bytes"
}
}
}
},
"sumUnique": {
"sum_bucket": {
"buckets_path": "groupByField>maxValue"
}
}
}
}
}
}
But with this query I only get allocated_bytes, and I need both allocated_bytes and total_bytes at each result point.
The following is the result of the above query:
{
"key_as_string" : "2017-06-20T21:00:00.000Z",
"key" : 1497992400000,
"doc_count" : 9,
"groupByField" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "2185d15c-5298-44ac-8646-37575490125d",
"doc_count" : 3,
"maxValue" : {
"value" : 1.156182016E9
}
}, {
"key" : "c3513cdd-58bb-4f8e-9b4c-467230b4f6e2",
"doc_count" : 3,
"maxValue" : {
"value" : 1.156165632E9
}
}, {
"key" : "eff13403-9737-4d08-9dca-fb6c12c3a6fa",
"doc_count" : 3,
"maxValue" : {
"value" : 1.156182016E9
}
} ]
},
"sumUnique" : {
"value" : 3.468529664E9
}
}
I need both allocated_bytes and total_bytes. How do I get multiple fields (allocated_bytes, total_bytes) for each point?
For example:
"sumUnique" : {
"Allocatedvalue" : 3.468529664E9,
"TotalValue" : 9.468529664E9
}
or like this:
"allocatedBytessumUnique" : {
"value" : 3.468529664E9
},
"totalBytessumUnique" : {
"value" : 9.468529664E9
}
You can just add another aggregation:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1497992400000,
"lte": 1497996000000
}
}
}
]
}
},
"aggregations": {
"groupByTime": {
"date_histogram": {
"field": "timestamp",
"interval": "30m",
"order": {
"_key": "desc"
}
},
"aggregations": {
"groupByField": {
"terms": {
"size": 1000,
"field": "resource_guid"
},
"aggregations": {
"maxValueAllocated": {
"max": {
"field": "allocated_bytes"
}
},
"maxValueTotal": {
"max": {
"field": "total_bytes"
}
}
}
},
"sumUniqueAllocatedBytes": {
"sum_bucket": {
"buckets_path": "groupByField>maxValueAllocated"
}
},
"sumUniqueTotalBytes": {
"sum_bucket": {
"buckets_path": "groupByField>maxValueTotal"
}
}
}
}
}
}
Be aware that sum_bucket only sums the output of a sibling aggregation. In this case it gives the sum of the per-resource max values, not the sum of total_bytes over all documents. If you want the sum of total_bytes itself, you can use a sum aggregation.
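The difference between the two is easy to see on a tiny example. The Python sketch below uses made-up datapoints (two hypothetical resources with two readings each) and compares the sum of per-resource maxima with a plain sum over every reading.

```python
from collections import defaultdict

# Hypothetical datapoints: (resource_guid, total_bytes)
datapoints = [("m1", 100), ("m1", 120), ("m2", 200), ("m2", 180)]

# Max per resource, like the max sub-aggregation under the terms aggregation
max_per_guid = defaultdict(int)
for guid, total_bytes in datapoints:
    max_per_guid[guid] = max(max_per_guid[guid], total_bytes)

sum_of_max = sum(max_per_guid.values())               # what sum_bucket over the max values gives
plain_sum = sum(value for _, value in datapoints)     # what a sum aggregation gives
print(sum_of_max, plain_sum)
```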
