Elasticsearch: How to count total docs by date

As the title says, I want to count, for each date, the docs from that day and all earlier days (a running total). The sample data below should make this easy to understand.
{"index":{"_index":"login-2015.12.23","_type":"logs"}}
{"uid":"1","register_time":"2015-12-23T12:00:00Z","login_time":"2015-12-23T12:00:00Z"}
{"index":{"_index":"login-2015.12.23","_type":"logs"}}
{"uid":"2","register_time":"2015-12-23T12:00:00Z","login_time":"2015-12-23T12:00:00Z"}
{"index":{"_index":"login-2015.12.24","_type":"logs"}}
{"uid":"1","register_time":"2015-12-23T12:00:00Z","login_time":"2015-12-24T12:00:00Z"}
{"index":{"_index":"login-2015.12.25","_type":"logs"}}
{"uid":"1","register_time":"2015-12-23T12:00:00Z","login_time":"2015-12-25T12:00:00Z"}
As you can see, index login-2015.12.23 has two docs, index login-2015.12.24 has one doc, and index login-2015.12.25 has one doc.
This is the result I want to get:
{
"hits" : {
"total" : 6282,
"max_score" : 1.0,
"hits" : []
},
"aggregations" : {
"group_by_date" : {
"buckets" : [
{
"key_as_string" : "2015-12-23T12:00:00Z",
"key" : 1450872000000,
"doc_count" : 2
},
{
"key_as_string" : "2015-12-24T12:00:00Z",
"key" : 1450958400000,
"doc_count" : 3
},
{
"key_as_string" : "2015-12-25T12:00:00Z",
"key" : 1451044800000,
"doc_count" : 4
}
]
}
}
}
In other words, the count for 2015-12-24T12:00:00Z must include the docs from both 2015-12-23T12:00:00Z and 2015-12-24T12:00:00Z.
My project has many indices like these. I have tried several approaches without success. This is my current attempt:
{
"query": {"match_all": {}},
"size": 0,
"aggs": {
"group_by_date": {
"date_histogram": {
"field": "timestamp",
"interval": "day"
},
"aggs": {
"intersect": {
"scripted_metric": {
"init_script": "state.inner=[]",
"map_script": "state.inner.add(params.param1 == 3 ? params.param2 * params.param1 : params.param1 * params.param2)",
"combine_script": "return state.inner",
"reduce_script": "return states",
"params": {
"param1": 3,
"param2": 5
}
}
}
}
}
}
}
I want to group by date and use scripted_metric to iterate over the dates. However, the script only runs over the docs in its own bucket, not over all the docs, so it cannot build a running total. Does anyone have a better idea for solving this problem?

You can simply use the cumulative_sum pipeline aggregation:
{
"query": {"match_all": {}},
"size": 0,
"aggs": {
"group_by_date": {
"date_histogram": {
"field": "login_time",
"interval": "day"
},
"aggs": {
"cumulative_docs": {
"cumulative_sum": {
"buckets_path": "_count"
}
}
}
}
}
}
And the results will look like this:
"aggregations" : {
"group_by_date" : {
"buckets" : [
{
"key_as_string" : "2015-12-23T00:00:00.000Z",
"key" : 1450828800000,
"doc_count" : 2,
"cumulative_docs" : {
"value" : 2.0
}
},
{
"key_as_string" : "2015-12-24T00:00:00.000Z",
"key" : 1450915200000,
"doc_count" : 1,
"cumulative_docs" : {
"value" : 3.0
}
},
{
"key_as_string" : "2015-12-25T00:00:00.000Z",
"key" : 1451001600000,
"doc_count" : 1,
"cumulative_docs" : {
"value" : 4.0
}
}
]
}
}
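To make the arithmetic behind cumulative_sum concrete, here is a minimal Python sketch that reproduces it on the sample data from the question. It only imitates the aggregation on the client side and is not how Elasticsearch computes it; the function name cumulative_docs_by_day is made up for illustration. Note also that on Elasticsearch 7.2 and later the interval parameter of date_histogram is deprecated in favor of calendar_interval, so the query above may need that change on newer clusters.

```python
from collections import Counter
from itertools import accumulate

# login_time values copied from the question's bulk data.
login_times = [
    "2015-12-23T12:00:00Z",
    "2015-12-23T12:00:00Z",
    "2015-12-24T12:00:00Z",
    "2015-12-25T12:00:00Z",
]


def cumulative_docs_by_day(timestamps):
    """Return (day, doc_count, running_total) tuples, oldest day first."""
    # Bucket by calendar day, like a date_histogram with a one-day interval.
    per_day = Counter(ts[:10] for ts in timestamps)
    days = sorted(per_day)
    counts = [per_day[day] for day in days]
    # Running total over the ordered buckets, like cumulative_sum on _count.
    return list(zip(days, counts, accumulate(counts)))


result = cumulative_docs_by_day(login_times)
for day, doc_count, running_total in result:
    print(day, doc_count, running_total)
```

The running totals 2, 3, 4 match the cumulative_docs values in the response above.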

Related

Nested Aggregation for AND Query Not Working

Can someone please help with the question below?
https://discuss.elastic.co/t/nested-aggregation-with-and-always-return-0-match/315722?u=chattes
I have used the following aggregations:
1. Terms aggregation
2. Bucket selector
3. Nested aggregation
First, I grouped by user id using a terms aggregation, then grouped further by skill id inside a nested aggregation. Finally, a bucket selector keeps only the users that have documents under both skills.
Query
GET index5/_search
{
"size": 0,
"aggs": {
"users": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"skills": {
"nested": {
"path": "skills"
},
"aggs": {
"filter_skill": {
"terms": {
"field": "skills.id",
"size": 10,
"include": [
553,
426
]
}
}
}
},
"bucket_count": {
"bucket_selector": {
"buckets_path": {
"skill_count": "skills>filter_skill._bucket_count"
},
"script": "params.skill_count ==2"
}
}
}
}
}
}
Results
"aggregations" : {
"users" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 1,
"skills" : {
"doc_count" : 3,
"filter_skill" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "426",
"doc_count" : 1
},
{
"key" : "553",
"doc_count" : 1
}
]
}
}
},
{
"key" : 2,
"doc_count" : 1,
"skills" : {
"doc_count" : 2,
"filter_skill" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "426",
"doc_count" : 1
},
{
"key" : "553",
"doc_count" : 1
}
]
}
}
}
]
}
}
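The selection logic of that query can be sketched in plain Python. This is only an illustration of what the terms include list and the bucket_selector on _bucket_count achieve together. The sample users are hypothetical (shaped so that users 1 and 2 match the results above, plus a user 3 who lacks one skill), and the function name is an assumption.

```python
# Hypothetical documents shaped like the ones in index5: one per user,
# each with a list of nested skills.
users = [
    {"id": 1, "skills": [{"id": "553"}, {"id": "426"}, {"id": "900"}]},
    {"id": 2, "skills": [{"id": "553"}, {"id": "426"}]},
    {"id": 3, "skills": [{"id": "553"}]},
]

REQUIRED_SKILLS = {"553", "426"}


def users_with_all_skills(docs, required):
    """Keep user ids whose skills cover every id in `required`."""
    matching = []
    for doc in docs:
        # Like the terms agg with "include": only required ids form buckets.
        skill_buckets = {skill["id"] for skill in doc["skills"]} & required
        # Like bucket_selector on _bucket_count: keep the user only when
        # every required skill produced a bucket.
        if len(skill_buckets) == len(required):
            matching.append(doc["id"])
    return matching


matching_users = users_with_all_skills(users, REQUIRED_SKILLS)
print(matching_users)
```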

How to exclude the buckets having doc count equal to 0

I want to exclude buckets whose doc count is 0 from the date histogram aggregation response, and then get the number of buckets that remain.
The query is:
GET metricbeat-*/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"host.cpu.usage": {
"gte": 0.8
}
}
},
{
"range": {
"@timestamp": {
"gte": "2022-09-22T10:16:00.000Z",
"lte": "2022-09-22T10:18:00.000Z"
}
}
}
]
}
},
"aggs": {
"hostName": {
"terms": {
"field": "host.name"
},
"aggs": {
"docsOverTimeFrame": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "10s"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "docsOverTimeFrame._bucket_count"
},
"script": {
"source": "params.count == 12"
}
}
}
}
}
}
}
The response that I get right now is:
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 38,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"hostName" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "datahot01",
"doc_count" : 3,
"docsOverTimeFrame" : {
"buckets" : [
{
"key_as_string" : "2022-09-22T10:16:00.000Z",
"key" : 1663841760000,
"doc_count" : 1
},
{
"key_as_string" : "2022-09-22T10:16:10.000Z",
"key" : 1663841770000,
"doc_count" : 1
},
{
"key_as_string" : "2022-09-22T10:16:20.000Z",
"key" : 1663841780000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:16:30.000Z",
"key" : 1663841790000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:16:40.000Z",
"key" : 1663841800000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:16:50.000Z",
"key" : 1663841810000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:00.000Z",
"key" : 1663841820000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:10.000Z",
"key" : 1663841830000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:20.000Z",
"key" : 1663841840000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:30.000Z",
"key" : 1663841850000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:40.000Z",
"key" : 1663841860000,
"doc_count" : 0
},
{
"key_as_string" : "2022-09-22T10:17:50.000Z",
"key" : 1663841870000,
"doc_count" : 0
}
]
}
}
]
}
}
}
Once the buckets with doc count = 0 are excluded, I want to check the number of remaining buckets (the bucket count) against an expected value. I am already doing that check with the bucket selector aggregation, but it currently sees all 12 buckets.
Is there some way to exclude the buckets that have doc count = 0, so that the bucket count is 2 instead of 12?
I was able to solve this by adding a pipeline aggregation (a bucket_selector) inside the date histogram aggregation.
The modified query is:
{
"query": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"gte": "2022-09-22T10:16:00.000Z",
"lte": "2022-09-22T10:22:00.000Z"
}
}
},
{
"range": {
"system.cpu.total.norm.pct": {
"gte": 0.8
}
}
}
]
}
},
"aggs": {
"hostName": {
"terms": {
"field": "host.name"
},
"aggs": {
"docsOverTimeFrame": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "10s"
},
"aggs": {
"histogram_doc_count": {
"bucket_selector": {
"buckets_path": {
"the_doc_count": "_count"
},
"script": "params.the_doc_count > 0"
}
}
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "docsOverTimeFrame._bucket_count"
},
"script": {
"source": "params.count == 12"
}
}
}
}
}
}
}
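As a rough illustration of what the two selectors do together, the following Python sketch applies the same two steps to the doc counts from the response in the question. It assumes, as the answer describes, that the outer selector sees only the buckets that survive the inner one; the function names are hypothetical.

```python
# doc_count per 10-second bucket for host "datahot01",
# copied from the response in the question.
histogram_counts = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def non_empty_bucket_count(doc_counts):
    """Count buckets that survive the inner selector (doc_count > 0)."""
    return sum(1 for count in doc_counts if count > 0)


def keep_host(doc_counts, expected):
    """Outer selector: keep the host only when the surviving count matches."""
    return non_empty_bucket_count(doc_counts) == expected


bucket_count = non_empty_bucket_count(histogram_counts)
print(bucket_count)
print(keep_host(histogram_counts, 12))
```

With the empty buckets removed the count is 2, so a check against 12 no longer passes for this host, while a check against 2 would.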

ELASTICSEARCH - Total doc_count aggregations

I am looking for a way to sum up the total of an aggregation that I have defined in the query.
For example:
{
"name" : false,
"surname" : false
},
{
"name" : false,
"surname" : false
}
Query:
GET index/_search?size=0
{"query": {
"bool": {
"must": [
{"term": {"name": false}},
{"term": {"surname": false}}
]
}
},
"aggs": {
"name": {
"terms": {
"field": "name"
}
},
"surname": {
"terms": {
"field": "surname"
}
}
}
}
The query returns the value for each field "name" and "surname" with value "false".
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 0,
"key_as_string" : "false",
"doc_count" : 2 <---------
}
]
},
"surname" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 0,
"key_as_string" : "false",
"doc_count" : 2 <---------
}
]
}
}
}
Is it possible to return the total sum of doc_count, so that in this situation it would be "doc_count" : 2 + "doc_count" : 2 == 4?
I've been trying to do it with script but since they are boolean values it doesn't work.
The functionality that most closely resembles the solution I am looking for is sum_bucket.
GET index/_search?filter_path=aggregations
{
"aggs": {
"surname_field": {
"terms": {
"field": "surname",
"size": 1
}
},
"sum": {
"sum_bucket" : {
"buckets_path":"surname_field>_count"
}
}
}
}
In this specific case, where the documents are simple, the result of the query is the same as hits.total.value (the number of documents) when filtering on the boolean field surname:false or name:false.
For documents with more fields, however, this approach lets us count how many times a value occurs in the index.
What I wanted from this result was the total number of matches, not the number of documents in the result.
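The sum_bucket query above totals the buckets of a single terms aggregation. If the goal is the combined total across both fields (2 + 2 == 4), one simple option is to add the per-field sums on the client. The Python sketch below does that on the buckets from the response in the question; it is an assumption about how one might post-process the response, not something the original answer shows.

```python
# Terms-aggregation buckets copied from the response in the question.
aggregations = {
    "name": {"buckets": [{"key_as_string": "false", "doc_count": 2}]},
    "surname": {"buckets": [{"key_as_string": "false", "doc_count": 2}]},
}


def sum_bucket(agg):
    """Sum doc_count over one aggregation's buckets, like sum_bucket on _count."""
    return sum(bucket["doc_count"] for bucket in agg["buckets"])


# One total per field, then the grand total across fields.
per_field = {field: sum_bucket(agg) for field, agg in aggregations.items()}
total = sum(per_field.values())
print(per_field, total)
```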

Perform a pipeline aggregation over the full set of potential buckets

When using the _search API of Elasticsearch, if you set size to 10, and perform an avg metric aggregation, the average will be of all values across the dataset matching the query, not just the average of the 10 items returned in the hits array.
On the other hand, if you perform a terms aggregation and set the size of the terms aggregation to 10, then an avg_bucket aggregation on those terms buckets will calculate an average over only those 10 buckets, not over all potential buckets.
How can I calculate an average of some field across all potential buckets, but still have only 10 items in the buckets array?
To make my question more concrete, consider this example: Suppose that I am a hat maker. Multiple stores carry my hats. I have an Elasticsearch index hat-sales which has one document for each time one of my hats is sold. Each document includes the price and the store at which the hat was sold.
Here are two examples of the documents I tested this on:
{
"type": "top",
"color": "black",
"price": 19,
"store": "Macy's"
}
{
"type": "fez",
"color": "red",
"price": 94,
"store": "Walmart"
}
If I want to find the average price of all the hats I have sold, I can run this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_hat_price": {
"avg": {
"field": "price"
}
}
}
}
And average_hat_price will be the same whether size is set to 0, 3, or whatever.
OK, now I want to find the top 3 stores that have sold the most hats. I also want to compare them with the average number of hats sold per store. So I want to do something like this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"by_store": {
"terms": {
"field": "store.keyword",
"size": 3
},
"aggs": {
"sales_count": {
"cardinality": {
"field": "_id"
}
}
}
},
"avg sales at a store": {
"avg_bucket": {
"buckets_path": "by_store>sales_count"
}
}
}
}
which yields a response of
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 8,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
}
]
},
"avg sales at a store" : {
"value" : 4.666666666666667
}
}
The problem is that avg sales at a store is calculated over only Macy's, Walmart, and Dillard's. If I want the average over all stores, I have to set aggs.by_store.terms.size to 65536. (65536 because that is the default maximum number of terms buckets and I do not know a priori how many buckets there may be.) This gives a result of:
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Target",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Harrod's",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Men's Warehouse",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Sears",
"doc_count" : 1,
"sales_count" : {
"value" : 1
}
}
]
},
"avg sales at a store" : {
"value" : 3.142857142857143
}
}
So the average number of hats sold per store is about 3.1, not 4.7. But in the buckets array I want to see only the top 3 stores.
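The difference between the two averages is easy to reproduce outside Elasticsearch. This Python sketch uses the per-store counts from the full response and compares the mean over the top 3 buckets with the mean over every bucket. The helper top_stores is a hypothetical stand-in for a terms aggregation ordered by doc_count.

```python
from statistics import mean

# Per-store sales counts copied from the full response in the question.
sales_by_store = {
    "Macy's": 6, "Walmart": 5, "Dillard's": 3, "Target": 3,
    "Harrod's": 2, "Men's Warehouse": 2, "Sears": 1,
}


def top_stores(counts, size):
    """Return the `size` largest buckets, like a terms agg ordered by doc_count."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:size]


top3 = top_stores(sales_by_store, 3)
# avg_bucket only sees the buckets that the terms agg returned.
avg_top3 = mean(count for _, count in top3)
# The figure the question is after: the mean over every store.
avg_all = mean(sales_by_store.values())
print(top3, avg_top3, avg_all)
```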
You can achieve what you are aiming at without an avg_bucket pipeline aggregation over the terms buckets. It sort of cheats the aggregation framework, but it works.
Here is the data setup:
PUT hat_sales
{
"mappings": {
"properties": {
"storename": {
"type": "keyword"
}
}
}
}
POST hat_sales/_bulk?refresh=true
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "bar"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
Here is the tricky query:
GET hat_sales/_search?size=0
{
"aggs": {
"stores": {
"terms": {
"field": "storename",
"size": 2
}
},
"average_sales_count": {
"avg_bucket": {
"buckets_path": "stores>_count"
}
},
"cheat": {
"filters": {
"filters": {
"all": {
"exists": {
"field": "storename"
}
}
}
},
"aggs": {
"count": {
"value_count": {
"field": "storename"
}
},
"unique_count": {
"cardinality": {
"field": "storename"
}
},
"total_average": {
"bucket_script": {
"buckets_path": {
"total": "count",
"unique": "unique_count"
},
"script": "params.total / params.unique"
}
}
}
}
}
}
This is a small abuse of the aggs framework. The idea is that you effectively want num_docs/num_stores. I restricted num_docs to only the docs that actually have the storename field.
I got around some validations by using the filters agg, which is technically a multi-bucket agg (though I only care about one bucket).
Then I get the unique count through cardinality (num_stores) and the total count through value_count (num_docs), and use a bucket_script to divide one by the other.
All in all, here is the slightly mangled result :D
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"cheat" : {
"buckets" : {
"all" : {
"doc_count" : 6,
"count" : {
"value" : 6
},
"unique_count" : {
"value" : 3
},
"total_average" : {
"value" : 2.0
}
}
}
},
"stores" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 1,
"buckets" : [
{
"key" : "baz",
"doc_count" : 3
},
{
"key" : "foo",
"doc_count" : 2
}
]
},
"average_sales_count" : {
"value" : 2.5
}
}
}
Note that cheat.buckets.all.total_average is 2.0 (the true average over all stores), while average_sales_count, the avg_bucket pipeline result, is 2.5 because it averages only the top 2 buckets.
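Both numbers can be checked by hand. The Python sketch below recomputes them from the storename values in the _bulk request: the value_count / cardinality ratio over all docs, and the average over only the top 2 terms buckets. It is a client-side illustration only. One caveat to keep in mind is that the cardinality aggregation in Elasticsearch is approximate, so on data with very many distinct stores the ratio may be slightly off, whereas len() here is exact.

```python
from collections import Counter

# storename values from the _bulk request in the answer.
storenames = ["foo", "foo", "bar", "baz", "baz", "baz"]

counts = Counter(storenames)

# "cheat" path: value_count / cardinality, computed over every document.
total = len(storenames)   # value_count
unique = len(counts)      # cardinality (exact here)
total_average = total / unique

# Pipeline path: avg_bucket only sees the top 2 terms buckets.
top2 = counts.most_common(2)
average_sales_count = sum(count for _, count in top2) / len(top2)

print(total_average, average_sales_count)
```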

Elasticsearch aggregations: how to get bucket with 'other' results of terms aggregation?

I use an aggregation to collect data from a nested field, and I am a little stuck.
Example of document:
{
...
rectangle: {
attributes: [
{_id: 'some_id', ...}
]
}
}
ES can group data by rectangle.attributes._id, but is there any way to get an 'other' bucket that holds the documents that were not added to any of the groups? Or is there a way to write a query that creates a bucket for documents matching {"rectangle.attributes._id": {$ne: "{currentDoc}.rectangle.attributes._id"}}?
I think a bucket would be perfect, because I need to run further aggregations on the 'other' docs.
Or maybe there is some cool workaround.
I use a query like this for the aggregation:
"aggs": {
"attributes": {
"nested": {
"path": "rectangle.attributes"
},
"aggs": {
"attributesCount": {
"cardinality": {
"field": "rectangle.attributes._id.keyword"
}
},
"entries": {
"terms": {
"field": "rectangle.attributes._id.keyword"
}
}
}
}
}
And get this result
"buckets" : [
{
"key" : "some_parent_id",
"doc_count" : 27616,
"attributes" : {
"doc_count" : 45,
"entries" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "some_id",
"doc_count" : 45,
"attributeOptionsCount" : {
"value" : 2
}
}
]
}
}
}
]
A result like this would be perfect:
"buckets" : [
{
"key" : "some_parent_id",
"doc_count" : 1000,
"attributes" : {
"doc_count" : 145,
"entries" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "some_id",
"doc_count" : 45
},
{
"key" : "other",
"doc_count" : 100
}
]
}
}
}
]
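You can use the missing parameter of the terms aggregation, which puts documents that have no value for the field into a bucket with the key you choose. Update the aggregation as below: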
"aggs": {
"attributes": {
"nested": {
"path": "rectangle.attributes"
},
"aggs": {
"attributesCount": {
"cardinality": {
"field": "rectangle.attributes._id.keyword"
}
},
"entries": {
"terms": {
"field": "rectangle.attributes._id.keyword",
"missing": "other"
}
}
}
}
}
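The effect of missing can be mimicked in a few lines of Python. The sketch below is only an analogy: the sample attribute objects are hypothetical, and the function terms_with_missing is made up for illustration. It also shows the limit of this approach: only objects that lack the field land in the 'other' bucket. Values that exist but fall outside the terms size are not moved there; Elasticsearch reports those through sum_other_doc_count instead.

```python
from collections import Counter

# Hypothetical nested attribute objects; some have no _id at all.
attributes = [
    {"_id": "some_id"},
    {"_id": "some_id"},
    {},
    {"name": "no id here"},
]


def terms_with_missing(objects, field, missing_key):
    """Count values of `field`; objects lacking it go under `missing_key`."""
    return Counter(obj.get(field, missing_key) for obj in objects)


buckets = terms_with_missing(attributes, "_id", "other")
print(dict(buckets))
```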
