elasticsearch range aggregation with fixed buckets values - elasticsearch

I need to create an aggregation like "range", but where I can specify the "where" clause (the exact set of values) of each bucket.
For example, if we aggregate on an "age" field, what the range agg offers is:
bucket 1: to 10
bucket 2: from 10 to 50
bucket 3: from 50
What I need is:
bucket 1: [5, 4334, 211 and 76]
bucket 2: [66 and 435]
bucket 3: [5455, 7968, 1, 443 and 765]
I don't want to create 3 "terms" aggregations with the "include" property; what I need is one aggregation with 3 buckets (just like range offers).
Any ideas or alternatives?

Only the first bucket would cause an issue, since its values are not contiguous, but all the other ones can be specified easily with a from/to constraint in range buckets. I suggest something like this:
{
  "aggs" : {
    "age_ranges" : {
      "range" : {
        "field" : "age",
        "ranges" : [
          { "from": 5,  "to": 6 },   <--- only 5
          { "from": 10, "to": 13 },  <--- 10, 11, 12
          { "from": 13, "to": 15 },  <--- 13, 14
          { "from": 20, "to": 26 }   <--- 20, 21, 22, 23, 24, 25
        ]
      }
    }
  }
}
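If the values in a bucket really are non-contiguous (as in the question), a single filters aggregation with one terms filter per named bucket is another option. This is only a sketch of that alternative, not part of the answer above; the bucket names and values are taken from the question:
{
  "aggs": {
    "age_buckets": {
      "filters": {
        "filters": {
          "bucket-1": { "terms": { "age": [5, 4334, 211, 76] } },
          "bucket-2": { "terms": { "age": [66, 435] } },
          "bucket-3": { "terms": { "age": [5455, 7968, 1, 443, 765] } }
        }
      }
    }
  }
}
Each named filter produces one bucket in the response, which matches the "one aggregation with 3 buckets" requirement.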

There is another way of getting a similar result, using a terms aggregation with scripting. Since the keys of the params map are strings, the script converts the numeric age to a string before the lookup:
"aggs": {
"age_ranges": {
"terms": {
"script": {
"inline": "if(ageMap.containsKey(doc['age'].value)){ageMap.get(doc['age'].value)} else {'<unmapped>'} ",
"params": {
"contentIdMapping": {
"1": "bucket-3",
"5": "bucket-1",
"66": "bucket-2",
"76": "bucket-1",
"211": "bucket-1",
"435": "bucket-2",
"443": "bucket-3",
"765": "bucket-3",
"4334": "bucket-1",
"5455": "bucket-3",
"7968": "bucket-3"
}
}
}
}
}
}

Related

Elasticsearch sort by filtered value

I'm using Elasticsearch 7.12, upgrading to 7.17 soon.
The following description of my problem has had the confusing business logic for my exact scenario removed.
I have an integer field in my document named 'Points'. It will usually contain 5-10 values, but may contain more, probably not more than 100 values. Something like:
Document 1:
{
"Points": [3, 12, 34, 60, 1203, 70, 88]
}
Document 2:
{
"Points": [16, 820, 31, 60]
}
Document 3:
{
"Points": [93, 20, 55]
}
My search needs to return documents with values within a range, such as between 10 and 19 inclusive. That part is fine. However I need to sort the results by the values found in that range. From the example above, I might need to find values between 30-39, sorted by the value in that range ascending - it should return Document 2 (containing value of 31) followed by Document 1 (containing value of 34).
Due to the potential range of values and searches I can't break this field down into fields like 0-9, 10-19 etc. to search on them independently - there would be many thousands of fields.
The documents themselves are otherwise quite large and there are a large number of them, so I have been advised to avoid nested fields if possible.
Can I apply a filter to a sort? Do I need a script to achieve this?
Thanks.
There are several ways of doing this:
Histogram aggregation
Aggregate your documents using a histogram aggregation with "hard_bounds". Example query:
POST /my_index/_search?size=0
{
  "query": {
    "constant_score": { "filter": { "range": { "Points": { "gte": "30", "lte": "40" } } } }
  },
  "aggs": {
    "points": {
      "histogram": {
        "field": "Points",
        "interval": 10,
        "hard_bounds": {
          "min": 30,
          "max": 40
        }
      },
      "aggs": { "top": { "top_hits": {} } }
    }
  }
}
This will aggregate all the documents that fall in that range, and the first bucket in the results will contain the documents you want.
Terms aggregation with an include list:
If the range you want is relatively small, e.g. "30 - 39" as you mentioned, a simple terms aggregation with an include list of all the numbers in that range will also give you the desired result.
Example Query:
POST /my_index/_search?size=0
{
  "query": {
    "constant_score": { "filter": { "range": { "Points": { "gte": "30", "lte": "40" } } } }
  },
  "aggs": {
    "points": {
      "terms": {
        "field": "Points",
        "include": ["30", "31", ...., "39"]
      },
      "aggs": { "top": { "top_hits": {} } }
    }
  }
}
Each bucket in the terms aggregation results will contain the documents that have that particular "Point" occurring at least once. The first document in the first bucket has what you want.
The third option involves building a runtime field that trims Points down to only the values inside your range, and then sorting ascending on that field. That will be slower, though. A rough sketch of this idea follows.
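For illustration only (this is not from the original answer), here is roughly what that could look like using a script-based sort instead of a runtime field. The field name Points and the 30-39 range are taken from the question; everything else is an assumption:
POST /my_index/_search
{
  "query": {
    "range": { "Points": { "gte": 30, "lte": 39 } }
  },
  "sort": {
    "_script": {
      "type": "number",
      "order": "asc",
      "script": {
        "lang": "painless",
        "source": "double best = Double.MAX_VALUE; for (def p : doc['Points']) { if (p >= params.min && p <= params.max && p < best) { best = (double) p; } } return best;",
        "params": { "min": 30, "max": 39 }
      }
    }
  }
}
The script returns the smallest in-range value of Points for each matching document, so results sort by their best in-range value; the range query guarantees every returned document has at least one such value.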
HTH.

Issue with date histogram aggregation with filter

I am trying to get a date histogram for a timestamp field over a specific period. I am using the following query:
{
  "aggs": {
    "dataRange": {
      "filter": { "range": { "#timestamp": { "gte": "2020-02-28T17:20:10Z", "lte": "2020-03-01T18:00:00Z" } } },
      "aggs": {
        "severity_over_time": {
          "date_histogram": { "field": "#timestamp", "interval": "28m" }
        }
      }
    }
  },
  "size": 0
}
This is the result I got:
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 32,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "dataRange": {
      "doc_count": 20,
      "severity_over_time": {
        "buckets": [
          {
            "key_as_string": "2020-02-28T17:04:00.000Z",
            "key": 1582909440000,
            "doc_count": 20
          }
        ]
      }
    }
  }
}
The start of the histogram range ("key_as_string") falls outside my filter criteria! My filter starts at "2020-02-28T17:20:10Z", but the key_as_string in the result is "2020-02-28T17:04:00.000Z", which is outside the range filter.
I tried looking at the docs but to no avail. Am I missing something here?
I guess that has to do with the way the buckets are calculated: the 28m interval has to be maintained throughout, i.e. every bucket has the same size.
Notice that the 28m spacing is kept exactly; as a result, the first and last buckets can extend beyond the filter's boundaries in order to keep that fixed interval.
Logically, though, your documents all land in the right buckets, and documents outside the filter range are never counted in the aggregation, regardless of whether the key_as_string boundaries would include them.
Basically, ES doesn't guarantee that the bucket boundaries (key_as_string, or the start and end of each bucket) fall exactly within the scope of the filter you've provided, but it does guarantee that only the documents matched by that range filter are considered for evaluation.
You can think of the bucket keys as the nearest possible boundary values, i.e. approximations.
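To make that concrete (this calculation is mine, not part of the original answer, but it matches how date_histogram rounds fixed intervals to multiples of the interval counted from the Unix epoch when no offset is set): 28m is 1,680,000 ms, and the returned key 1582909440000 ms is exactly 942,208 × 1,680,000 ms, i.e. the last 28-minute boundary at or before your first matching document. That boundary is 2020-02-28T17:04:00Z, which is why the first bucket appears to start before your filter's lower bound of 17:20:10Z.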
If you want to be sure about which documents are filtered, move the range filter out of the aggregation and into the query, as below, and remove size: 0.
Notice I've made use of offset, which shifts the start value of each bucket. Perhaps that is what you are looking for.
One more thing: I've used min_doc_count so that empty buckets are filtered out.
POST <your_index_name>/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "#timestamp": {
              "gte": "2020-02-28T17:20:10Z",
              "lte": "2020-03-01T18:00:01Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "severity_over_time": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "28m",
        "offset": "+11h",
        "min_doc_count": 1
      }
    }
  }
}

How to do a max date aggregation over the same document in Elasticsearch?

I have millions of documents with a block like this one:
{
  "useraccountid": 123456,
  "purchases_history": {
    "last_updated": "Sat Apr 27 13:41:46 UTC 2019",
    "purchases": [
      {
        "purchase_id": 19854284,
        "purchase_date": "Jan 11, 2017 7:53:35 PM"
      },
      {
        "purchase_id": 19854285,
        "purchase_date": "Jan 12, 2017 7:53:35 PM"
      },
      {
        "purchase_id": 19854286,
        "purchase_date": "Jan 13, 2017 7:53:35 PM"
      }
    ]
  }
}
I am trying to figure out how I can do something like:
SELECT useraccountid, max(purchases_history.purchases.purchase_date) FROM my_index GROUP BY useraccountid
I only found the max aggregation, but it aggregates over all the documents in the index, which is not what I need. I need to find the max purchase date for each document. I believe there must be a way to iterate over the purchases_history.purchases.purchase_date values of each document to identify the max purchase date, but I cannot find how to do it (if that is really the best way, of course).
Any suggestion?
I assume that your useraccountid field is unique per user. You will have to do a terms aggregation and, inside it, a max aggregation. I can think of this:
"aggs":{
"unique_user_ids":{
"terms":{
"field": "useraccountid",
"size": 10000 #Default value is 10
},
"aggs":{
"max_date":{
"max":{
"field": "purchases_history.purchases.purchase_date"
}
}
}
}
}
In the aggregations section of the response you'll see each unique user ID and, inside it, their max date.
Note the 10,000 in the size. The terms aggregation is only recommended for returning up to 10,000 buckets.
If you need more, you can use the composite aggregation. With that, you can paginate your results without causing performance issues on your cluster.
I can think of the following if you want to play with composite. Note that composite sources only accept bucketing sources such as terms, histogram, and date_histogram, not metrics like max, so the max goes in a sub-aggregation:
GET /_search
{
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 10000, #Default set to 10
        "sources": [
          { "user_id": { "terms": { "field": "useraccountid" } } }
        ]
      },
      "aggs": {
        "max_date": {
          "max": { "field": "purchases_history.purchases.purchase_date" }
        }
      }
    }
  }
}
The response will include a field called after_key. With that value you can paginate your results in pages of 10,000 elements; take a look at the after parameter of the composite aggregation, as in the sketch below.
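For illustration (not part of the original answer), the follow-up page request could look roughly like this, assuming the previous response returned an after_key of { "user_id": 123456 }:
GET /_search
{
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 10000,
        "sources": [
          { "user_id": { "terms": { "field": "useraccountid" } } }
        ],
        "after": { "user_id": 123456 }
      },
      "aggs": {
        "max_date": {
          "max": { "field": "purchases_history.purchases.purchase_date" }
        }
      }
    }
  }
}
You keep passing the latest after_key back into "after" until the response returns fewer buckets than the requested size.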
Hope this is helpful! :D

Calculate millions of adjacent records and summarize them in Elasticsearch

I'd like to calculate the differences between millions of adjacent records and sum them up at the end in Elasticsearch. How can I do this?
Document data (six documents) in Elasticsearch:
10
20
-30
10
30
100
Calculation:
10 to 20 is 10
20 to -30 is -50
-30 to 10 is 40
10 to 30 is 20
30 to 100 is 70
The total is:
10 + (-50) + 40 + 20 + 70 = 90
How would I do a query with the REST RestHighLevelClient API to achieve this?
Generic case
Most likely the only reasonable way to do this in Elasticsearch is to denormalize and index the already-computed deltas. In that case you will only need a simple sum aggregation (see the sketch below).
This is because data in Elasticsearch is "flat", so it does not know that your documents are adjacent. It excels when all you need to know is already in the document at index time: in this case special indexes are pre-built and aggregations are very fast.
It is like A'tuin, a flat version of Earth from Pratchett's novels: some basic physics, like JOINs from RDBMS, do not work, but magic is possible.
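As a minimal sketch of that denormalized approach (the index name and the delta field are assumptions, not from the question): if each document carries a precomputed delta field, the total is a single sum aggregation:
POST /my_index/_search
{
  "size": 0,
  "aggs": {
    "total_delta": {
      "sum": { "field": "delta" }
    }
  }
}
With the deltas from the question indexed (10, -50, 40, 20, 70), total_delta would come back as 90.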
Time series-specific case
If you have a time series, you can achieve your goal with a combination of the Serial Differencing and Sum Bucket (sibling) pipeline aggregations.
In order to use this approach you would need to aggregate on some date field. Imagine you have a mapping like this:
PUT time_diff
{
  "mappings": {
    "doc": {
      "properties": {
        "eventTime": {
          "type": "date"
        },
        "val": {
          "type": "integer"
        }
      }
    }
  }
}
And one document per day, which looks like this:
POST /time_diff/doc/1
{
  "eventTime": "2018-01-01",
  "val": 10
}
POST /time_diff/doc/2
{
  "eventTime": "2018-01-02",
  "val": 20
}
Then with a query like this:
POST /time_diff/doc/_search
{
  "size": 0,
  "aggs": {
    "my_date_histo": {
      "date_histogram": {
        "field": "eventTime",
        "interval": "day"
      },
      "aggs": {
        "the_sum": {
          "sum": {
            "field": "val"
          }
        },
        "my_diff": {
          "serial_diff": {
            "buckets_path": "the_sum"
          }
        }
      }
    },
    "my_sum": {
      "sum_bucket": {
        "buckets_path": "my_date_histo>my_diff"
      }
    }
  }
}
The response will look like:
{
  ...
  "aggregations": {
    "my_date_histo": {
      "buckets": [
        {
          "key_as_string": "2018-01-01T00:00:00.000Z",
          "key": 1514764800000,
          "doc_count": 1,
          "my_diff": {
            "value": 10
          }
        },
        ...
      ]
    },
    "my_sum": {
      "value": 90
    }
  }
}
This method, though, has obvious limitations:
it only works if you have time series data
it is only correct if you have exactly 1 data point per date bucket (a day in this example)
it will explode in memory consumption if you have many points (millions, as you mentioned)
Hope that helps!

How to perform bucket filtering with ElasticSearch date histogram value_field

Trying to construct a date histogram with ElasticSearch logs of the following type:
{
  "_index": "foo",
  "_source": {
    […]
    "time": "2013-06-12T14:43:13.238-07:00",
    "userName": "bar"
  }
}
where the histogram buckets the "time" field per "day" interval, but also where multiple occurrences of a single userName only get counted once.
I have tried the following:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "histo1": {
      "date_histogram": {
        "key_field": "time",
        "value_script": "doc['userName'].values.length",
        "interval": "day"
      }
    }
  }
}
where I expected the min|max|mean of each "histo1" entry to be the number of unique users in the respective time bucket, but the result consistently returns min = max = mean = 1:
"histo1": {
"_type": "date_histogram",
"entries": [
{
"time": 1370908800000,
"count": 11,
"min": 1,
"max": 1,
"total": 11,
"total_count": 11,
"mean": 1
},
{
"time": 1370995200000,
"count": 18,
"min": 1,
"max": 1,
"total": 18,
"total_count": 18,
"mean": 1
}
]
}
Am I misunderstanding how key/value fields work in the date histogram facet?
I ended up using the elasticsearch timefacets plugin: https://github.com/crate/elasticsearch-timefacets-plugin
Other options included:
https://github.com/bleskes/elasticfacets
https://github.com/ptdavteam/elasticsearch-approx-plugin
Both of them only have support for ES version < 0.90, unfortunately.
