Elasticsearch sum with more than 10000 results

I would like to know how to do a sum aggregation with more than 10,000 results.
I can't find it in the docs.
Thank you.
GET index/_search?pretty
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "created_at": {
              "gte": "2022-01-01 00:00",
              "format": "yyyy-MM-dd HH:mm"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "nb_sms": {
      "sum": {
        "field": "sms_count"
      }
    }
  },
  "size": 0
}

You can do partitions and then sum the results of the partitions.
You can check this link: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
Partitioning is going to split your data unevenly, but it is not going to duplicate anything.
So you can run the aggregation per partition, with a sum sub-aggregation under the partitioned terms aggregation and a sum_bucket pipeline aggregation over it, as sketched below.
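A minimal sketch of that approach, assuming a hypothetical high-cardinality field user_id to partition on; you send one request per partition, stepping partition from 0 to num_partitions - 1, and add up the total_sms values client-side:
GET index/_search?pretty
{
  "size": 0,
  "query": {
    "range": {
      "created_at": {
        "gte": "2022-01-01 00:00",
        "format": "yyyy-MM-dd HH:mm"
      }
    }
  },
  "aggs": {
    "per_user": {
      "terms": {
        "field": "user_id",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 10000
      },
      "aggs": {
        "nb_sms": {
          "sum": {
            "field": "sms_count"
          }
        }
      }
    },
    "total_sms": {
      "sum_bucket": {
        "buckets_path": "per_user>nb_sms"
      }
    }
  }
}
The num_partitions value of 20 is illustrative; derive it from a cardinality count of the field divided by the terms size.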

Related

Use distinct field for count with significant_terms in Elasticsearch

Is there a way to get the significant_terms aggregation to use document counts based on a distinct field?
I have an index of posts and their hashtags, but they come from multiple sources, so there will be multiple documents with the same permalink field. I only want to count unique permalinks per hashtag. I have managed to get the unique totals using the cardinality aggregation (i.e. "cardinality": { "field": "permalink.keyword" }), but I can't work out how to do this with the significant_terms aggregation. My query is as follows:
GET /posts-index/_search
{
  "aggregations": {
    "significant_hashtag": {
      "significant_terms": {
        "background_filter": {
          "bool": {
            "filter": [
              {
                "range": {
                  "created": {
                    "gte": 1656414622,
                    "lte": 1656630000
                  }
                }
              }
            ]
          }
        },
        "field": "hashtag.keyword",
        "mutual_information": {
          "background_is_superset": false,
          "include_negatives": true
        },
        "size": 100
      }
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "created": {
              "gte": 1656630000,
              "lte": 1659308400
            }
          }
        }
      ]
    }
  },
  "size": 0
}

Defining a time range for aggregation in elasticsearch

I've got an index in Elasticsearch with documents holding info about user connections to my platform. I want to build a query with a per-day aggregation that counts all users connected on each day between two given dates.
I have 3 relevant fields for this: user_id, connection_time_start, and connection_time_end. I was doing the query this way:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "connection_time_start": {
              "lte": "2017-08-04T23:59:59"
            }
          }
        },
        {
          "range": {
            "connection_time_end": {
              "gte": "2017-08-02T00:00:00"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "franja_horaria": {
      "date_histogram": {
        "field": "connection_time_start",
        "interval": "day",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "ids": {
          "cardinality": {
            "field": "user_id"
          }
        }
      }
    }
  }
}
This query returns buckets with the number of users whose connection started on day 2, 3, and 4 of August. The problem is that there are users with connections starting on day 2 and ending on day 3, or even on day 4.
These users should count toward the connected-user total for each of those days, but since the aggregation uses connection_time_start, they only count for the start day.
I've tried to add a range in the aggregation, something like this (https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-bucket-daterange-aggregation.html), but haven't got a good result; a sketch follows below.
Can anybody help me with this? Thanks in advance!
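For reference, a minimal sketch of the date_range aggregation from that page, with illustrative day buckets; note it still assigns each document to a bucket by connection_time_start only, so by itself it does not solve the multi-day span problem:
{
  "size": 0,
  "aggs": {
    "franja_horaria": {
      "date_range": {
        "field": "connection_time_start",
        "format": "yyyy-MM-dd",
        "ranges": [
          { "from": "2017-08-02", "to": "2017-08-03" },
          { "from": "2017-08-03", "to": "2017-08-04" },
          { "from": "2017-08-04", "to": "2017-08-05" }
        ]
      },
      "aggs": {
        "ids": {
          "cardinality": {
            "field": "user_id"
          }
        }
      }
    }
  }
}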

elasticsearch averaging a field on a bucket

I am a newbie to Elasticsearch, trying to understand how aggregations and metrics work. I was running an aggregation query to retrieve the average number of bytes out, grouped by client IP hash, from an Elasticsearch instance. The query I created (using Kibana) is as follows:
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": 1476177616965,
                  "lte": 1481361616965,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "2": {
      "terms": {
        "field": "ClientIP_Hash",
        "size": 50,
        "order": {
          "1": "desc"
        }
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "Bytes Out"
          }
        }
      }
    }
  }
}
It gives me some output (supposed to be the average) grouped on ClientIP_Hash, like below:
ClientIP_Hash                     Average Bytes Out
64e6b1f6447fd044c5368740c3018f49  1,302,210
4ff8598a995e5fa6930889b8751708df  94,038
33b559ac9299151d881fec7508e2d943  68,527
c2095c87a0e2f254e8a37f937a68a2c0  67,083
...
The problem is, if I replace the avg with sum or min or any other metric type, I still get the same values.
ClientIP_Hash                     Sum of Bytes Out
64e6b1f6447fd044c5368740c3018f49  1,302,210
4ff8598a995e5fa6930889b8751708df  94,038
33b559ac9299151d881fec7508e2d943  68,527
c2095c87a0e2f254e8a37f937a68a2c0  67,083
I checked the query generated by Kibana, and it seems to correctly use the keyword 'sum' or 'avg' accordingly. I am puzzled as to why I get the same values for avg, sum, or any other metric.
Could you check whether your sample data set has more than one value per bucket? min, max, and avg all remain the same if you have only one value; see the sketch below for a quick way to verify.
Thanks
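A minimal sketch of that check, reusing the field names from the question: adding a value_count metric next to the average shows how many values each bucket actually aggregates; if it is 1 everywhere, avg, sum, min, and max will all coincide.
{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "ClientIP_Hash",
        "size": 50
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "Bytes Out"
          }
        },
        "values": {
          "value_count": {
            "field": "Bytes Out"
          }
        }
      }
    }
  }
}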

How to limit a date histogram aggregation of nested documents to a specific date range?

Version
Using Elasticsearch 1.7.2
Objective
I would like to create a graph of the number of predictions made by users per day for the last n days. In this case, 10 days.
Current query
{
  "size": 0,
  "aggs": {
    "predictions": {
      "nested": {
        "path": "user_answers"
      },
      "aggs": {
        "predictions_over_time": {
          "date_histogram": {
            "field": "user_answers.created",
            "interval": "day",
            "format": "yyyy-MM-dd",
            "min_doc_count": 0
          }
        }
      }
    }
  }
}
Issue
This query returns a histogram, but with buckets for all available dates across all documents. It doesn't restrict them to a specific date range.
What have I tried?
I've tried a number of approaches to solving this, all of which have failed.
* Range filter, then histogram that
* Date range aggregation, then histogram the buckets
* Using extended_bounds with full dates, now-10d, and also timestamps
* Trying a range filter inside the histogram aggregation
Any guidance would be appreciated! Thanks.
A query didn't work for me in that situation; what I used is a third aggs:
{
  "size": 0,
  "aggs": {
    "user_answers": {
      "nested": {
        "path": "user_answers"
      },
      "aggs": {
        "timed_user_answers": {
          "filter": {
            "range": {
              "user_answers.created": {
                "gte": "now-10d",
                "lte": "now"
              }
            }
          },
          "aggs": {
            "predictions_over_time": {
              "date_histogram": {
                "field": "user_answers.created",
                "interval": "day",
                "format": "yyyy-MM-dd",
                "min_doc_count": 0
              }
            }
          }
        }
      }
    }
  }
}
One aggs specifies the nested path, one specifies the filter, and the last specifies the actual aggregation. I don't know why this syntax makes sense, but you don't seem to be able to combine two of them in the same aggs.
You need to add a query. The query can be anything except a post_filter. It should be nested and contain a date range. One way is to define a constant_score query; inside the constant_score query, use a nested filter that wraps a range filter.
{
  "query": {
    "constant_score": {
      "filter": {
        "nested": {
          "path": "user_answers",
          "filter": {
            "range": {
              "user_answers.created": {
                "gte": "now-10d",
                "lte": "now"
              }
            }
          }
        }
      }
    }
  }
}
Confirm if this works for you.

How to use pagination (size and from) in an elasticsearch aggregation?

How do I use pagination (size and from) in an elasticsearch aggregation? When I used size and from in the aggregation, it threw an exception. For example, I want to query like this:
GET /index/nameorder/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "projectId": "10057"
              }
            }
          ]
        }
      },
      "filter": {
        "range": {
          "purchasedDate": {
            "from": "2012-02-05T00:00:00",
            "to": "2015-02-11T23:59:59"
          }
        }
      }
    }
  },
  "aggs": {
    "group_by_a": {
      "terms": {
        "field": "promocode",
        "size": 40,
        "from": 40
      },
      "aggs": {
        "TotalPrice": {
          "sum": {
            "field": "subtotalPrice"
          }
        }
      }
    }
  }
}
As of now, this feature is not supported.
There is an open issue for this, but it is still under discussion.
Issue: https://github.com/elasticsearch/elasticsearch/issues/4915
In order to implement pagination on top of an aggregation in Elasticsearch, you need to do the following:
1. Define the size of each batch.
2. Run a cardinality count on the grouping field.
3. From the cardinality, define num_partitions = cardinality count / batch size (the batch size must be smaller than the fetch size).
4. Iterate over the partitions using the partition filter, as sketched below. Note that the batch size must be big enough, because the results are not split evenly between the buckets.
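A minimal sketch of those steps, reusing the field names from the question and assuming a version that supports terms partitioning (5.2+). First, get the cardinality of the grouping field:
GET /index/nameorder/_search
{
  "size": 0,
  "aggs": {
    "promocode_count": {
      "cardinality": {
        "field": "promocode"
      }
    }
  }
}
Then fetch one partition at a time, incrementing partition from 0 to num_partitions - 1 (the value 25 is illustrative, derived as cardinality count / batch size):
GET /index/nameorder/_search
{
  "size": 0,
  "aggs": {
    "group_by_a": {
      "terms": {
        "field": "promocode",
        "include": {
          "partition": 0,
          "num_partitions": 25
        },
        "size": 40
      },
      "aggs": {
        "TotalPrice": {
          "sum": {
            "field": "subtotalPrice"
          }
        }
      }
    }
  }
}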
