ElasticSearch - total sum by all previous days

I need to summarize all values for each day (exactly the values on that day) and the total values up to each day (the sum of all values before that day, including that day's values).
My code:
curl -XGET http://localhost:9200/tester/test/_search?pretty=true -d '
{
  "size": 0,
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "value": {
          "sum": {
            "field": "my.value"
          }
        }
      }
    }
  }
}
'
Output:
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {...},
  "hits" : {...},
  "aggregations" : {
    "articles_over_time" : {
      "buckets" : [ {
        "key_as_string" : "2014-02-01T00:00:00.000Z",
        "key" : 1391212800000,
        "doc_count" : 36,
        "value" : {
          "value" : 84607.0
        }
      }, {
        "key_as_string" : "2014-03-01T00:00:00.000Z",
        "key" : 1393632000000,
        "doc_count" : 79,
        "value" : {
          "value" : 268928.0
        }
      },
      ... ]
    }
  }
}
This query gives me the first part: the sum of values for each day.
How can I get the second part: the running total up to and including each day?
What do I need:
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {...},
  "hits" : {...},
  "aggregations" : {
    "articles_over_time" : {
      "buckets" : [ {
        "key_as_string" : "2014-02-01T00:00:00.000Z",
        "key" : 1391212800000,
        "doc_count" : 36,
        "value" : {
          "value" : 84607.0
        },
        "total" : {
          "value" : 84607.0
        }
      }, {
        "key_as_string" : "2014-03-01T00:00:00.000Z",
        "key" : 1393632000000,
        "doc_count" : 79,
        "value" : {
          "value" : 268928.0
        },
        "total" : {
          "value" : 353535.0 /// 84607.0 + 268928.0
        }
      },
      ... ]
    }
  }
}

Is this because your second aggregation is nested in the "articles_over_time" section?
Does the following help? If you change from:
curl -XGET http://localhost:9200/tester/test/_search?pretty=true -d '
{
  "size": 0,
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "value": {
          "sum": {
            "field": "my.value"
          }
        }
      }
    }
  }
}
'
To:
curl -XGET http://localhost:9200/tester/test/_search?pretty=true -d '
{
  "size": 0,
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    },
    "value": {
      "sum": {
        "field": "my.value"
      }
    }
  }
}
'
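That moves the sum to the top level, though, which yields a single grand total over all matching documents rather than a running total per bucket. On Elasticsearch 2.0+ a cumulative_sum pipeline aggregation (the same one used in the related question below) produces exactly the per-bucket running total asked for; a minimal sketch, reusing the field names from the question:
curl -XGET http://localhost:9200/tester/test/_search?pretty=true -d '
{
  "size": 0,
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "value": {
          "sum": {
            "field": "my.value"
          }
        },
        "total": {
          "cumulative_sum": {
            "buckets_path": "value"
          }
        }
      }
    }
  }
}
'
Each bucket then carries a "total" equal to its own "value" plus the values of all earlier buckets, matching the desired output above.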

Related

ES cumulative_sum cannot limit the number of returned buckets

I'm confused about how to limit the number of buckets returned from a cumulative_sum aggregation. This is my search:
{
  "query": { "match_all": {} },
  "size": 0,
  "aggs": {
    "group_by_date": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "cumulative_docs": {
          "cumulative_sum": { "buckets_path": "_count" }
        }
      }
    }
  }
}
and it returns the maximum number of buckets:
"aggregations" : {
"group_by_date" : {
"buckets" : [
{
"key_as_string" : "2022-09-03T00:00:00.000Z",
"key" : 1662163200000,
"doc_count" : 19,
"cumulative_docs" : {
"value" : 19.0
}
},
{
"key_as_string" : "2022-09-04T00:00:00.000Z",
"key" : 1662249600000,
"doc_count" : 0,
"cumulative_docs" : {
"value" : 19.0
}
},
{
"key_as_string" : "2022-09-05T00:00:00.000Z",
"key" : 1662336000000,
"doc_count" : 0,
"cumulative_docs" : {
"value" : 19.0
}
},
{
"key_as_string" : "2022-09-06T00:00:00.000Z",
"key" : 1662422400000,
"doc_count" : 0,
"cumulative_docs" : {
"value" : 19.0
}
},
{
"key_as_string" : "2022-09-07T00:00:00.000Z",
"key" : 1662508800000,
"doc_count" : 0,
"cumulative_docs" : {
"value" : 19.0
}
},
{
"key_as_string" : "2022-09-08T00:00:00.000Z",
"key" : 1662595200000,
"doc_count" : 0,
"cumulative_docs" : {
"value" : 19.0
}
},
...
I tried to use a bucket_selector to filter the top 10 (or N) buckets from the cumulative_sum, but it returns an error along the lines of "cumulative_sum cannot support sub aggs", and the size parameter isn't supported either.
If I want to return only ten buckets or so (a number I can specify myself), how can I revise my query?
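On Elasticsearch 6.1+ a bucket_sort pipeline aggregation placed alongside the cumulative_sum can truncate the bucket list; used without a sort clause it simply keeps the first size buckets in their existing chronological order. A sketch (the first_ten name is illustrative):
{
  "query": { "match_all": {} },
  "size": 0,
  "aggs": {
    "group_by_date": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "cumulative_docs": {
          "cumulative_sum": { "buckets_path": "_count" }
        },
        "first_ten": {
          "bucket_sort": { "size": 10 }
        }
      }
    }
  }
}
Note that bucket_sort sits alongside cumulative_docs as a sibling, not nested inside it, which is how it avoids the sub-aggregation error above.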

How to get word count in docs as an aggregate over time in Elasticsearch?

I am trying to get word-count trends in docs as an aggregate result. Using the following approach I can get the doc-count aggregation result, but I have not been able to find any resources showing how to get the word count for the months of jan, feb & mar.
PUT test/_doc/1
{
  "description" : "one two three four",
  "month" : "jan"
}

PUT test/_doc/2
{
  "description" : "one one test test test",
  "month" : "feb"
}

PUT test/_doc/3
{
  "description" : "one one one test",
  "month" : "mar"
}

GET test/_search
{
  "size": 0,
  "query": {
    "match": {
      "description": {
        "query": "one"
      }
    }
  },
  "aggs": {
    "monthly_count": {
      "terms": {
        "field": "month.keyword"
      }
    }
  }
}
OUTPUT
{
  "took" : 706,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "monthly_count" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "feb",
          "doc_count" : 1
        },
        {
          "key" : "jan",
          "doc_count" : 1
        },
        {
          "key" : "mar",
          "doc_count" : 1
        }
      ]
    }
  }
}
EXPECTED WORD COUNT OVER MONTH
"aggregations" : {
"monthly_count" : {
"buckets" : [
{
"key" : "feb",
"word_count" : 2
},
{
"key" : "jan",
"word_count" : 1
},
{
"key" : "mar",
"word_count" : 3
}
]
}
}
Maybe this query can help you:
GET test/_search
{
  "size": 0,
  "aggs": {
    "monthly_count": {
      "terms": {
        "field": "month.keyword"
      },
      "aggs": {
        "count_word_one": {
          "terms": {
            "script": {
              "source": """
                def str = doc['description.keyword'].value;
                def array = str.splitOnToken(' ');
                int i = 0;
                for (item in array) {
                  if (item == 'one') {
                    i++;
                  }
                }
                return i;
              """
            },
            "size": 10
          }
        }
      }
    }
  }
}
Response:
"aggregations" : {
"monthly_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "feb",
"doc_count" : 1,
"count_word_one" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "2",
"doc_count" : 1
}
]
}
},
{
"key" : "jan",
"doc_count" : 1,
"count_word_one" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1
}
]
}
},
{
"key" : "mar",
"doc_count" : 1,
"count_word_one" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3",
"doc_count" : 1
}
]
}
}
]
}
}
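Here the occurrence count comes back as a bucket key (a string). If you want it as a plain numeric metric, closer to the expected output above, a sum aggregation can run the same script; a sketch along those lines (the word_count name is illustrative):
GET test/_search
{
  "size": 0,
  "aggs": {
    "monthly_count": {
      "terms": {
        "field": "month.keyword"
      },
      "aggs": {
        "word_count": {
          "sum": {
            "script": {
              "source": """
                def array = doc['description.keyword'].value.splitOnToken(' ');
                int i = 0;
                for (item in array) {
                  if (item == 'one') {
                    i++;
                  }
                }
                return i;
              """
            }
          }
        }
      }
    }
  }
}
Each month bucket then contains "word_count" : { "value" : ... } directly rather than as a nested bucket key.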

How to select the last bucket in a date_histogram selector in Elasticsearch

I have a date_histogram and I can use max_bucket to get the bucket with the greatest value, but I want to select the last bucket (i.e. the bucket with the highest timestamp).
Using max_bucket to get the greatest value works OK, but I don't know what to put in the buckets_path to get the last bucket.
My mapping:
{
  "ee-2020-02-28" : {
    "mappings" : {
      "dynamic" : "strict",
      "properties" : {
        "date" : {
          "type" : "date"
        },
        "frequency" : {
          "type" : "long"
        },
        "keyword" : {
          "type" : "keyword"
        },
        "text" : {
          "type" : "text"
        }
      }
    }
  }
}
My working query, which returns the bucket for the day with the highest frequency (it's named last_day because this is a WIP query towards my goal):
{
  "query": {
    "range": {
      "date": { /* Start away from the beginning of data, so the rolling avg is full */
        "gte": "2019-02-18"/*,
        "lte": "2020-12-14"*/
      }
    }
  },
  "aggs": {
    "palabrejas": {
      "terms": {
        "field": "keyword",
        "size": 100
      },
      "aggs": {
        "nnndiario": {
          "date_histogram": {
            "field": "date",
            "calendar_interval": "day"
          },
          "aggs": {
            "dailyfreq": {
              "sum": {
                "field": "frequency"
              }
            }
          }
        },
        "ventanuco": {
          "avg_bucket": {
            "buckets_path": "nnndiario>dailyfreq",
            "gap_policy": "insert_zeros"
          }
        },
        "last_day": {
          "max_bucket": {
            "buckets_path": "nnndiario>dailyfreq"
          }
        }
      }
    }
  }
}
Its output (notice I replaced long parts with [...]):
{
  "aggregations" : {
    "palabrejas" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "rama0",
          "doc_count" : 20400,
          "nnndiario" : {
            "buckets" : [
              {
                "key_as_string" : "2020-01-01T00:00:00.000Z",
                "key" : 1577836800000,
                "doc_count" : 600,
                "dailyfreq" : {
                  "value" : 3000.0
                }
              },
              {
                "key_as_string" : "2020-01-02T00:00:00.000Z",
                "key" : 1577923200000,
                "doc_count" : 600,
                "dailyfreq" : {
                  "value" : 3000.0
                }
              },
              {
                "key_as_string" : "2020-01-03T00:00:00.000Z",
                "key" : 1578009600000,
                "doc_count" : 600,
                "dailyfreq" : {
                  "value" : 3000.0
                }
              },
              [...]
              {
                "key_as_string" : "2020-01-31T00:00:00.000Z",
                "key" : 1580428800000,
                "doc_count" : 600,
                "dailyfreq" : {
                  "value" : 3000.0
                }
              }
            ]
          },
          "ventanuco" : {
            "value" : 3290.3225806451615
          },
          "last_day" : {
            "value" : 12000.0,
            "keys" : [
              "2020-01-13T00:00:00.000Z"
            ]
          }
        },
        {
          "key" : "rama1",
          "doc_count" : 20400,
          "nnndiario" : {
            "buckets" : [
              {
                "key_as_string" : "2020-01-01T00:00:00.000Z",
                "key" : 1577836800000,
                "doc_count" : 600,
                "dailyfreq" : {
                  "value" : 3000.0
                }
              },
              [...]
            ]
          },
          "ventanuco" : {
            "value" : 3290.3225806451615
          },
          "last_day" : {
            "value" : 12000.0,
            "keys" : [
              "2020-01-13T00:00:00.000Z"
            ]
          }
        },
        [...]
      ]
    }
  }
}
I don't know what to put in last_day's buckets_path to obtain the last bucket.
You might consider using a terms aggregation instead of a date_histogram aggregation:
"max_date_bucket_agg": {
"terms": {
"field": "date",
"size": 1,
"order": {"_key": "desc"}
}
}
An issue might be the granularity of your data, you may consider storing the date-value of the expected granularity (e.g. day) in a separate field and use that field in the terms-aggregation.
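Applied to the query above, that idea would replace the max_bucket-based last_day inside palabrejas, with the daily sum nested underneath so the returned bucket still carries its frequency total; a sketch:
"last_day": {
  "terms": {
    "field": "date",
    "size": 1,
    "order": { "_key": "desc" }
  },
  "aggs": {
    "dailyfreq": {
      "sum": { "field": "frequency" }
    }
  }
}
The single bucket returned is the one with the highest date key, i.e. the last day.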

How to get the bucket count in Elasticsearch aggregations?

I'm trying to get how many buckets an aggregation produces within a specific datetime range:
{
  "size": 0,
  "aggs": {
    "filtered_aggs": {
      "filter": {
        "range": {
          "datetime": {
            "gte": "2017-03-01T00:00:00.000Z",
            "lte": "2017-06-01T00:00:00.000Z"
          }
        }
      },
      "aggs": {
        "addr": {
          "terms": {
            "field": "region",
            "size": 10000
          }
        }
      }
    }
  }
}
output:
"took" : 317,
"timed_out" : false,
"num_reduce_phases" : 3,
"_shards" : {
"total" : 1118,
"successful" : 1118,
"failed" : 0
},
"hits" : {
"total" : 1899658551,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"filtered_aggs" : {
"doc_count" : 88,
"addr" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NY",
"doc_count" : 36
},
{
"key" : "CA",
"doc_count" : 13
},
{
"key" : "JS",
"doc_count" : 7
..........
Is there a way to return both (the buckets and the total bucket count) in one search?
I'm using Elasticsearch 5.5.0
Can I get all of them?
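One option that should work on 5.x is a sibling stats_bucket pipeline aggregation: the count field it returns equals the number of buckets the terms aggregation produced. A sketch (the addr_stats name is illustrative):
{
  "size": 0,
  "aggs": {
    "filtered_aggs": {
      "filter": {
        "range": {
          "datetime": {
            "gte": "2017-03-01T00:00:00.000Z",
            "lte": "2017-06-01T00:00:00.000Z"
          }
        }
      },
      "aggs": {
        "addr": {
          "terms": {
            "field": "region",
            "size": 10000
          }
        }
      }
    },
    "addr_stats": {
      "stats_bucket": {
        "buckets_path": "filtered_aggs>addr._count"
      }
    }
  }
}
The response then contains both the buckets and addr_stats.count in one search. Keep in mind a terms aggregation returns at most size buckets, so the count is bounded by the 10000 requested here.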

ElasticSearch: retrieving documents belonging to buckets

I am trying to retrieve documents for the past year, bucketed into 1-month-wide buckets. I will take the documents from each 1-month bucket and analyze them further (out of scope of my problem here). From the description, a "Bucket Aggregation" seems to be the way to go, but in the "bucket" response I only get the count of documents in each bucket, not the raw documents themselves. What am I missing?
GET command
{
  "aggs" : {
    "DateHistogram" : {
      "date_histogram" : {
        "field" : "timestamp",
        "interval": "month"
      }
    }
  },
  "size" : 0
}
Resulting Output
{
  "took" : 138,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1313058,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "DateHistogram" : {
      "buckets" : [ {
        "key_as_string" : "2015-02-01T00:00:00.000Z",
        "key" : 1422748800000,
        "doc_count" : 270
      }, {
        "key_as_string" : "2015-03-01T00:00:00.000Z",
        "key" : 1425168000000,
        "doc_count" : 459
      },
      (...and all the other months...)
      {
        "key_as_string" : "2016-03-01T00:00:00.000Z",
        "key" : 1456790400000,
        "doc_count" : 136009
      } ]
    }
  }
}
You're almost there, you simply need to add a top_hits sub-aggregation in order to retrieve some documents for each bucket:
POST /your_index/_search
{
  "aggs" : {
    "DateHistogram" : {
      "date_histogram" : {
        "field" : "timestamp",
        "interval": "month"
      },
      "aggs": { <--- add this
        "docs": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  },
  "size" : 0
}
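If you only need a few fields from each document, top_hits also accepts _source filtering and a sort; a sketch (the field choice is illustrative, reusing the timestamp field from the question):
"docs": {
  "top_hits": {
    "size": 10,
    "sort": [ { "timestamp": { "order": "desc" } } ],
    "_source": [ "timestamp" ]
  }
}
Keep in mind top_hits returns at most size documents per bucket; to process every document in a month, you would typically run a separate filtered query per bucket instead.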
