Finding the sum of the "key" values in bucket aggregations in Elasticsearch

I have the following ES query:
GET database/_search
{
"from": 0,
"size": 0,
"query": {
"bool": {
"must": [
{
"nested": {
"query": {
"term": {
"colleges.institution_full_name": {
"value": "Academy of Sciences",
"boost": 1.0
}
}
},
"path": "colleges"
}
}
]
}
},
"_source": false,
"aggs": {
"publication_years": {
"terms": {
"field": "publication_year"
}
}
}
}
And I got the following response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 232,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"publication_years" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2016,
"doc_count" : 119
},
{
"key" : 2017,
"doc_count" : 90
},
{
"key" : 2018,
"doc_count" : 22
},
{
"key" : 2019,
"doc_count" : 1
}
]
}
}
}
Now I want to calculate the average of the "key" values of the publication years, i.e., the average of 2016, 2017, 2018, and 2019. How can I modify my ES query to get the average of the publication years instead of each year individually? I tried the "avg" aggregation, but it also takes "doc_count" into account when calculating the average.

Try this:
POST database/_search
{
"size": 0,
"aggs": {
"groupByYear": {
"terms": {
"field": "publication_year"
},
"aggs": {
"avgYear": {
"avg": {
"field": "publication_year"
}
}
}
},
"avg_year": {
"avg_bucket": {
"buckets_path": "groupByYear>avgYear"
}
}
}
}

It's not clear what you want. Do you want the average of 2016, 2017, 2018, and 2019, i.e., 2017.5?
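If changing the aggregation isn't an option, the same average can also be computed client-side from the terms buckets, since each bucket's "key" is the year itself. A minimal Python sketch (the function name and the abbreviated `resp` dict are illustrative, mirroring the response shape shown above):

```python
def avg_of_bucket_keys(resp):
    """Average the 'key' values of the terms buckets, ignoring doc_count."""
    buckets = resp["aggregations"]["publication_years"]["buckets"]
    keys = [b["key"] for b in buckets]
    return sum(keys) / len(keys)

# Abbreviated version of the response shown above.
resp = {"aggregations": {"publication_years": {"buckets": [
    {"key": 2016, "doc_count": 119},
    {"key": 2017, "doc_count": 90},
    {"key": 2018, "doc_count": 22},
    {"key": 2019, "doc_count": 1},
]}}}
print(avg_of_bucket_keys(resp))  # 2017.5
```

Note the doc counts never enter the calculation, which is exactly what a plain "avg" aggregation over the field cannot do.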

Related

Query filter for searching rollup index works with epoch time fails with date math

How do we query (filter) a rollup index?
For example, based on the query here
Request:
{
"size": 0,
"aggregations": {
"timeline": {
"date_histogram": {
"field": "timestamp",
"fixed_interval": "7d"
},
"aggs": {
"nodes": {
"terms": {
"field": "node"
},
"aggs": {
"max_temperature": {
"max": {
"field": "temperature"
}
},
"avg_voltage": {
"avg": {
"field": "voltage"
}
}
}
}
}
}
}
}
Response:
{
"took" : 93,
"timed_out" : false,
"terminated_early" : false,
"_shards" : ... ,
"hits" : {
"total" : {
"value": 0,
"relation": "eq"
},
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"timeline" : {
"buckets" : [
{
"key_as_string" : "2018-01-18T00:00:00.000Z",
"key" : 1516233600000,
"doc_count" : 6,
"nodes" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "a",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 5.1499998569488525
}
},
{
"key" : "b",
"doc_count" : 2,
"max_temperature" : {
"value" : 201.0
},
"avg_voltage" : {
"value" : 5.700000047683716
}
},
{
"key" : "c",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 4.099999904632568
}
}
]
}
}
]
}
}
}
Is it possible to filter, say, the last 3 days?
For a test case I used a fixed_interval of 1m (one minute, and also 60 minutes) and tried the following, but the error was that all query shards failed. Is it possible to apply a query filter to rollup aggregations?
Test Query for searching rollup index
{
"size": 0,
"query": {
"range": {
"timestamp": {
"gte": "now-3d/d",
"lt": "now/d"
}
}
},
"aggregations": {
"timeline": {
"date_histogram": {
"field": "timestamp",
"fixed_interval": "7d"
},
"aggs": {
"nodes": {
"terms": {
"field": "node"
},
"aggs": {
"max_temperature": {
"max": {
"field": "temperature"
}
},
"avg_voltage": {
"avg": {
"field": "voltage"
}
}
}
}
}
}
}
}
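Until the date-math filter works, one workaround is to request the unfiltered histogram and drop old buckets client-side using each bucket's epoch-millisecond "key". A sketch (the function name and cutoff value are illustrative):

```python
def filter_buckets_since(buckets, cutoff_ms):
    """Keep only date_histogram buckets whose epoch-millis key is >= cutoff."""
    return [b for b in buckets if b["key"] >= cutoff_ms]

# Two illustrative 7d buckets; the first key matches the sample response.
buckets = [
    {"key": 1516233600000, "doc_count": 6},  # 2018-01-18T00:00:00Z
    {"key": 1516838400000, "doc_count": 4},  # 2018-01-25T00:00:00Z
]
recent = filter_buckets_since(buckets, 1516500000000)
print(len(recent))  # 1
```

This trades extra transfer for correctness: the server still aggregates everything, but the caller sees only the wanted window.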

How to exclude the buckets having doc count equal to 0

I want to exclude those buckets from the date histogram aggregation response, whose doc count is equal to 0. And then, get the count of the filtered buckets.
The query is :
GET metricbeat-*/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"host.cpu.usage": {
"gte": 0.8
}
}
},
{
"range": {
"#timestamp": {
"gte": "2022-09-22T10:16:00.000Z",
"lte": "2022-09-22T10:18:00.000Z"
}
}
}
]
}
},
"aggs": {
"hostName": {
"terms": {
"field": "host.name"
},
"aggs": {
"docsOverTimeFrame": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "10s"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "docsOverTimeFrame._bucket_count"
},
"script": {
"source": "params.count == 12"
}
}
}
}
}
}
}
The response that I get right now is :
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 38,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"hostName" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "datahot01",
"doc_count" : 3,
"docsOverTimeFrame" : {
"buckets" : [
{
"key_as_string" : "2022-09-22T10:16:00.000Z",
"key" : 1663841760000,
"doc_count" : 1
},
{
"key_as_string" : "2022-09-22T10:16:10.000Z",
"key" : 1663841770000,
"doc_count" : 1
},
{
"key_as_string" : "2022-09-22T10:16:20.000Z",
"key" : 1663841780000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:16:30.000Z",
"key" : 1663841790000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:16:40.000Z",
"key" : 1663841800000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:16:50.000Z",
"key" : 1663841810000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:00.000Z",
"key" : 1663841820000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:10.000Z",
"key" : 1663841830000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:20.000Z",
"key" : 1663841840000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:30.000Z",
"key" : 1663841850000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:40.000Z",
"key" : 1663841860000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:50.000Z",
"key" : 1663841870000,
"doc_count" : 0
}
]
}
}
]
}
}
}
So, if I am able to exclude the buckets with doc count = 0, then, based on the number of remaining buckets, I want to check whether the bucket count equals 12 (which I am doing with the bucket selector aggregation).
Is there some way to exclude the buckets with doc count = 0, so that the bucket count comes out to 2 instead of 12?
I was able to solve the above use case by using a pipeline aggregation (i.e., a bucket_selector aggregation) inside the date histogram aggregation.
The modified query is :
{
"query": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2022-09-22T10:16:00.000Z",
"lte": "2022-09-22T10:22:00.000Z"
}
}
},
{
"range": {
"system.cpu.total.norm.pct": {
"gte": 0.8
}
}
}
]
}
},
"aggs": {
"hostName": {
"terms": {
"field": "host.name"
},
"aggs": {
"docsOverTimeFrame": {
"date_histogram": {
"field": "#timestamp",
"fixed_interval": "10s"
},
"aggs": {
"histogram_doc_count": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count > 0"
}
}
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "docsOverTimeFrame._bucket_count"
},
"script": {
"source": "params.count == 12"
}
}
}
}
}
}
}

How to use composite aggregation with a single bucket

The following composite aggregation query
{
"query": {
"range": {
"orderedAt": {
"gte": 1591315200000,
"lte": 1591438881000
}
}
},
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{
"aggregation_target": {
"terms": {
"field": "supplierId"
}
}
}
]
},
"aggs": {
"aggregated_hits": {
"top_hits": {}
},
"filter": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": "params.doc_count > 2"
}
}
}
}
}
}
returns something like this:
{
"took" : 67,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 34,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_buckets" : {
"after_key" : {
"aggregation_target" : "0HQI2G2HG00100G8"
},
"buckets" : [
{
"key" : {
"aggregation_target" : "0HQI2G0K000100G8"
},
"doc_count" : 4,
"aggregated_hits" : {...}
},
{
"key" : {
"aggregation_target" : "0HQI2G18G00100G8"
},
"doc_count" : 11,
"aggregated_hits" : {...}
},
{
"key" : {
"aggregation_target" : "0HQI2G2HG00100G8"
},
"doc_count" : 16,
"aggregated_hits" : {...}
}
]
}
}
}
The aggregated results are put into buckets based on the condition set in the query.
Is there any way to put them in a single bucket and paginate through the whole result (i.e., 31 documents in this case)?
I don't think you can. A doc's context doesn't include information about other docs unless you perform a cardinality, scripted_metric or terms aggregation. Also, once you bucket your docs based on the supplierId, it'd sort of defeat the purpose of aggregating in the first place...
What you wrote above is as good as it gets, and you'll have to combine the aggregated_hits in a post-processing step.
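That post-processing step could look like the following: walk the composite pages (fetched with after_key), keep only buckets that passed the threshold, and flatten their hits into one list that you can paginate yourself. A sketch with the threshold from the query above (`params.doc_count > 2`); the function name and the abbreviated page data are illustrative:

```python
def flatten_buckets(pages, min_docs=3):
    """Merge the top_hits of all composite buckets with >= min_docs docs."""
    merged = []
    for page in pages:
        for bucket in page["aggregations"]["my_buckets"]["buckets"]:
            if bucket["doc_count"] >= min_docs:
                merged.extend(bucket["aggregated_hits"]["hits"]["hits"])
    return merged

# One illustrative response page; hits abbreviated to ids.
pages = [
    {"aggregations": {"my_buckets": {"buckets": [
        {"doc_count": 4, "aggregated_hits": {"hits": {"hits": [{"_id": "a"}]}}},
        {"doc_count": 2, "aggregated_hits": {"hits": {"hits": [{"_id": "b"}]}}},
    ]}}},
]
print([h["_id"] for h in flatten_buckets(pages)])  # ['a']
```

Once flattened, pagination is an ordinary slice over the merged list.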

Is it possible with aggregation to amalgamate all values of an array property from all grouped documents into the coalesced document?

I have documents with the format similar to the following:
[
{
"name": "fred",
"title": "engineer",
"division_id": 20
"skills": [
"walking",
"talking"
]
},
{
"name": "ed",
"title": "ticket-taker",
"division_id": 20
"skills": [
"smiling"
]
}
]
I would like to run an aggs query that shows the complete set of skills for the division, i.e.:
{
"aggs":{
"distinct_skills":{
"cardinality":{
"field":"division_id"
}
}
},
"_source":{
"includes":[
"division_id",
"skills"
]
}
}
... so that the resulting hit would look like:
{
"division_id": 20,
"skills": [
"walking",
"talking",
"smiling"
]
}
I know I can retrieve inner_hits and iterate through the list and amalgamate the values "manually". I assume it would perform better if I could do it in a query.
Just nest two terms aggregations as shown below:
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"my_division_ids": {
"terms": {
"field": "division_id",
"size": 10
},
"aggs": {
"my_skills": {
"terms": {
"field": "skills", <---- If it is not keyword field use `skills.keyword` field if using dynamic mapping.
"size": 10
}
}
}
}
}
}
Below is a sample response:
{
"took" : 490,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_division_ids" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 20, <---- division_id
"doc_count" : 2,
"my_skills" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ <---- Skills
{
"key" : "smiling",
"doc_count" : 1
},
{
"key" : "talking",
"doc_count" : 1
},
{
"key" : "walking",
"doc_count" : 1
}
]
}
}
]
}
}
}
Hope this helps!
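Client-side, the union of skills per division then falls straight out of the nested buckets. A sketch assuming the response shape above (the function name and abbreviated `resp` dict are illustrative):

```python
def skills_by_division(resp):
    """Map each division_id bucket to the sorted skill keys it contains."""
    out = {}
    for div in resp["aggregations"]["my_division_ids"]["buckets"]:
        out[div["key"]] = sorted(s["key"] for s in div["my_skills"]["buckets"])
    return out

# Abbreviated version of the sample response.
resp = {"aggregations": {"my_division_ids": {"buckets": [
    {"key": 20, "doc_count": 2, "my_skills": {"buckets": [
        {"key": "smiling"}, {"key": "talking"}, {"key": "walking"},
    ]}},
]}}}
print(skills_by_division(resp))  # {20: ['smiling', 'talking', 'walking']}
```

One caveat: terms aggregations are capped by their "size" parameter, so with many distinct skills per division you would need to raise "size" (or paginate) to get the complete set.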

How to get the bucket count in Elasticsearch aggregations?

I'm trying to find out how many buckets an aggregation produces within a specific datetime range:
{
"size": 0,
"aggs": {
"filtered_aggs": {
"filter": {
"range": {
"datetime": {
"gte": "2017-03-01T00:00:00.000Z",
"lte": "2017-06-01T00:00:00.000Z"
}
}
},
"aggs": {
"addr": {
"terms": {
"field": "region",
"size": 10000
}
}
}
}
}
}
output:
"took" : 317,
"timed_out" : false,
"num_reduce_phases" : 3,
"_shards" : {
"total" : 1118,
"successful" : 1118,
"failed" : 0
},
"hits" : {
"total" : 1899658551,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"filtered_aggs" : {
"doc_count" : 88,
"addr" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NY",
"doc_count" : 36
},
{
"key" : "CA",
"doc_count" : 13
},
{
"key" : "JS",
"doc_count" : 7
..........
Is there a way to return both requests (buckets + total bucket count) in one search?
I'm using Elasticsearch 5.5.0
Can I get all of them?
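Since the buckets are already in the response, one simple option is to count them client-side after the search returns; a cardinality sub-aggregation on the same field is a common server-side approximation if the terms "size" might truncate the list. A minimal sketch of the client-side count (the helper name and abbreviated `resp` dict are illustrative):

```python
def bucket_count(resp, *path):
    """Count the buckets of the aggregation found at the given path."""
    node = resp["aggregations"]
    for name in path:
        node = node[name]
    return len(node["buckets"])

# Abbreviated version of the truncated response above.
resp = {"aggregations": {"filtered_aggs": {"doc_count": 88, "addr": {"buckets": [
    {"key": "NY", "doc_count": 36},
    {"key": "CA", "doc_count": 13},
    {"key": "JS", "doc_count": 7},
]}}}}
print(bucket_count(resp, "filtered_aggs", "addr"))  # 3
```

With "size": 10000 and fewer than 10000 distinct regions, this count equals the true number of distinct terms in the range.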
