Elasticsearch aggregation using array items as keys

Is it possible to create an aggregation by unnesting an array's elements to use as keys?
Here's an example:
Docs:
[
  {
    "languages": [ 1, 2 ],
    "value": 100
  },
  {
    "languages": [ 1 ],
    "value": 50
  }
]
and its mapping:
{
  "documents": {
    "mappings": {
      "properties": {
        "languages": {
          "type": "integer"
        },
        "value": {
          "type": "integer"
        }
      }
    }
  }
}
and the expected output of a summing aggregation would be:
{
  1: 150,
  2: 100
}

You can achieve what you want by using a simple terms aggregation. Array elements will be bucketed individually:
POST index/_search
{
  "aggs": {
    "languages": {
      "terms": {
        "field": "languages"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "value"
          }
        }
      }
    }
  }
}
Results:
"aggregations" : {
"languages" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 2,
"total" : {
"value" : 150.0
}
},
{
"key" : 2,
"doc_count" : 1,
"total" : {
"value" : 100.0
}
}
]
}
}

The terms agg counts the number of occurrences. If you want the exact key-to-sum map shape from your question, you can instead use a script that sums the values using the languages array items as keys:
GET langs/_search
{
  "size": 0,
  "aggs": {
    "lang_sums": {
      "scripted_metric": {
        "init_script": "state.lang_sums=[:]",
        "map_script": """
          for (def lang : doc['languages']) {
            def lang_str = lang.toString();
            def value = doc['value'].value;
            if (state.lang_sums.containsKey(lang_str)) {
              state.lang_sums[lang_str] += value;
            } else {
              state.lang_sums[lang_str] = value;
            }
          }
        """,
        "combine_script": "return state",
        "reduce_script": "return states"
      }
    }
  }
}
yielding
{
  ...
  "aggregations": {
    "lang_sums": {
      "value": [
        {
          "lang_sums": {
            "1": 150,
            "2": 100
          }
        }
      ]
    }
  }
}
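Note that "reduce_script": "return states" hands back the combined state of each shard as a list, so on an index with more than one shard you would get one partial map per shard. A minimal merging reduce_script might look like this sketch (same state shape as above):
"reduce_script": """
  def merged = [:];
  for (def s : states) {
    for (def entry : s.lang_sums.entrySet()) {
      // add this shard's partial sum for the language key
      merged[entry.getKey()] = (merged.containsKey(entry.getKey()) ? merged[entry.getKey()] : 0) + entry.getValue();
    }
  }
  return merged;
"""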

bucket aggregation/bucket_script computation

How do I apply a computation using bucket fields via bucket_script? More specifically, I would like to understand how to aggregate on distinct results.
For example, below is a sample query, and the response.
What I am looking for is to aggregate the following into two fields:
sum of all buckets' dist.value from the example response (1+2=3)
sum of all buckets' (dist.value x key) from the example response (1x10)+(2x20)=50
Query
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "field": "value"
          }
        }
      ]
    }
  },
  "aggs": {
    "sales_summary": {
      "terms": {
        "field": "qty",
        "size": "100"
      },
      "aggs": {
        "dist": {
          "cardinality": {
            "field": "somekey.keyword"
          }
        }
      }
    }
  }
}
Query Result:
{
  "aggregations": {
    "sales_summary": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 10,
          "doc_count": 100,
          "dist": {
            "value": 1
          }
        },
        {
          "key": 20,
          "doc_count": 200,
          "dist": {
            "value": 2
          }
        }
      ]
    }
  }
}
You need to use a sum_bucket aggregation, a pipeline aggregation that computes the sum of the cardinality aggregation's value across all the buckets.
Search query for the sum of all buckets' dist.value from the example response (1+2=3):
POST idxtest1/_search
{
  "size": 0,
  "aggs": {
    "sales_summary": {
      "terms": {
        "field": "qty",
        "size": "100"
      },
      "aggs": {
        "dist": {
          "cardinality": {
            "field": "pageview"
          }
        }
      }
    },
    "sum_buckets": {
      "sum_bucket": {
        "buckets_path": "sales_summary>dist"
      }
    }
  }
}
Search response:
"aggregations" : {
  "sales_summary" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 10,
        "doc_count" : 3,
        "dist" : {
          "value" : 2
        }
      },
      {
        "key" : 20,
        "doc_count" : 3,
        "dist" : {
          "value" : 3
        }
      }
    ]
  },
  "sum_buckets" : {
    "value" : 5.0
  }
}
For the second requirement, you first need to modify the value in the bucket aggregation response using a bucket_script aggregation, and then run a sum_bucket aggregation over the modified values.
Search query for the sum of all buckets' (dist.value x key) from the example response (1x10)+(2x20)=50:
POST idxtest1/_search
{
  "size": 0,
  "aggs": {
    "sales_summary": {
      "terms": {
        "field": "qty",
        "size": "100"
      },
      "aggs": {
        "dist": {
          "cardinality": {
            "field": "pageview"
          }
        },
        "format-value-agg": {
          "bucket_script": {
            "buckets_path": {
              "newValue": "dist"
            },
            "script": "params.newValue * 10"
          }
        }
      }
    },
    "sum_buckets": {
      "sum_bucket": {
        "buckets_path": "sales_summary>format-value-agg"
      }
    }
  }
}
Search response:
"aggregations" : {
  "sales_summary" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 10,
        "doc_count" : 3,
        "dist" : {
          "value" : 2
        },
        "format-value-agg" : {
          "value" : 20.0
        }
      },
      {
        "key" : 20,
        "doc_count" : 3,
        "dist" : {
          "value" : 3
        },
        "format-value-agg" : {
          "value" : 30.0
        }
      }
    ]
  },
  "sum_buckets" : {
    "value" : 50.0
  }
}
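Note that the bucket_script above hard-codes the multiplier 10, so it only implements (dist.value x key) for the key=10 bucket; the 50.0 total matches the question's example by coincidence. Since the terms aggregation buckets on qty itself, every document in a bucket has qty equal to the bucket key, so a general version can recover the key with a max sub-aggregation and reference it in the script. A sketch, against the same idxtest1 index:
POST idxtest1/_search
{
  "size": 0,
  "aggs": {
    "sales_summary": {
      "terms": {
        "field": "qty",
        "size": "100"
      },
      "aggs": {
        "dist": {
          "cardinality": {
            "field": "pageview"
          }
        },
        "qty_key": {
          "max": {
            "field": "qty"
          }
        },
        "dist-times-key": {
          "bucket_script": {
            "buckets_path": {
              "dist": "dist",
              "key": "qty_key"
            },
            "script": "params.dist * params.key"
          }
        }
      }
    },
    "sum_buckets": {
      "sum_bucket": {
        "buckets_path": "sales_summary>dist-times-key"
      }
    }
  }
}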

Count number of inner elements of array property (Including repeated values)

Given I have the following records.
[
  {
    "profile": "123",
    "inner": [
      {
        "name": "John"
      }
    ]
  },
  {
    "profile": "456",
    "inner": [
      {
        "name": "John"
      },
      {
        "name": "John"
      },
      {
        "name": "James"
      }
    ]
  }
]
I want to get something like:
"aggregations": {
"name": {
"buckets": [
{
"key": "John",
"doc_count": 3
},
{
"key": "James",
"doc_count": 1
}
]
}
}
I'm a beginner using Elasticsearch, and this seems to be a pretty simple operation to do, but I can't find how to achieve this.
If I try a simple aggs using terms, it returns 2 for John instead of 3.
Example request I'm trying:
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": {
        "field": "inner.name"
      }
    }
  }
}
How can I possibly achieve this?
Additional Info: It will be used on Kibana later.
I can change mapping to whatever I want, but AFAIK Kibana doesn't like the "Nested" type. :(
You need to use a value_count aggregation: by default, terms only reports doc_count, whereas value_count counts the number of times a given field's value occurs.
So, for your purposes:
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": {
        "field": "inner.name"
      },
      "aggs": {
        "total": {
          "value_count": {
            "field": "inner.name"
          }
        }
      }
    }
  }
}
Which returns:
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "John",
"doc_count" : 2,
"total" : {
"value" : 3
}
},
{
"key" : "James",
"doc_count" : 1,
"total" : {
"value" : 2
}
}
]
}
}
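For completeness: if inner were mapped as type nested (which the question rules out for Kibana reasons), each inner object would be indexed as a separate hidden document, and a plain terms aggregation wrapped in a nested aggregation would already return 3 for John in doc_count. A sketch, assuming that nested mapping:
{
  "size": 0,
  "aggs": {
    "inner_objects": {
      "nested": {
        "path": "inner"
      },
      "aggs": {
        "name": {
          "terms": {
            "field": "inner.name"
          }
        }
      }
    }
  }
}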

Group by based on last entry of filter in Elastic Search

I have a scenario similar to the following:
An index that contains purchased items of a store, where each item has an order_id.
And I need to group by the color of only the last item of each order.
Data structure:
{
  "order_id": 1,
  "product_id": 235233,
  "color": "Blue",
  "purchase_date": "2020-08-21T05:53:43.362Z"
},
{
  "order_id": 1,
  "product_id": 2352662,
  "color": "Black",
  "purchase_date": "2020-08-23T05:53:43.362Z"
},
{
  "order_id": 2,
  "product_id": 855477,
  "color": "Blue",
  "purchase_date": "2020-08-22T05:53:43.362Z"
},
{
  "order_id": 2,
  "product_id": 322352,
  "color": "Red",
  "purchase_date": "2020-08-24T05:53:43.362Z"
},
{
  "order_id": 3,
  "product_id": 3225235,
  "color": "Red",
  "purchase_date": "2020-08-25T05:53:43.362Z"
}
Expected result:
Black: 1 (color of the last product of order_id 1)
Red: 2 (color of the last products of order_ids 2 and 3)
Based on this answer, I could get the last item of each order as a whole document, but what I am looking for is getting the item count per color directly:
POST /items/_search?search_type=count
{
  "aggs": {
    "group": {
      "terms": {
        "field": "order_id"
      },
      "aggs": {
        "group_items": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "purchase_date": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
And the following gives me the item count per color for all items of an order, not just the last one of each order:
GET /items/_search?search_type=count
{
  "size": 0,
  "aggs": {
    "colors": {
      "terms": {
        "field": "color.keyword"
      }
    }
  }
}
An alternative approach to the problem would be to create and maintain a separate index (latest_by_order) that keeps track of the latest document for each order.
This can be achieved using transforms (see docs).
Such a transform can be created using the following command:
PUT _transform/latest_by_order
{
  "source": {
    "index": "items"
  },
  "dest": {
    "index": "latest_by_order"
  },
  "latest": {
    "unique_key": ["order_id"],
    "sort": "purchase_date"
  }
}
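Creating the transform does not run it yet; it still has to be started. Note that this performs a single batch run by default; to keep latest_by_order continuously up to date as new purchases arrive, the transform would also need a sync block.
POST _transform/latest_by_order/_start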
Then, the secondary analysis can be done on top of the new (transformed) index.
The following request:
GET latest_by_order/_search
{
  "size": 0,
  "aggs": {
    "count_by_color": {
      "terms": {
        "field": "color.keyword"
      }
    }
  }
}
will yield the following response:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "count_by_color" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Red",
          "doc_count" : 2
        },
        {
          "key" : "Black",
          "doc_count" : 1
        }
      ]
    }
  }
}
You could group by color and order by the max purchase_date, like so:
{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "color.keyword",
        "order": {
          "by_latest_purchase": "desc"
        }
      },
      "aggs": {
        "by_latest_purchase": {
          "max": {
            "field": "purchase_date"
          }
        }
      }
    }
  }
}
but you'd still end up with Blue, because it's a color that exists in your docs, and I don't know if it can be filtered away.
When in doubt (or all else fails), scripted metric aggregations to the rescue:
{
  "size": 0,
  "aggs": {
    "by_color": {
      "scripted_metric": {
        "init_script": "state.by_order_id = [:]",
        "map_script": """
          def color = doc['color.keyword'].value;
          def date = doc['purchase_date'].value.millis;
          def order_id = doc['order_id'].value;
          def current_group = ['color': color, 'date': date];
          if (state.by_order_id.containsKey(order_id)) {
            def max_group = state.by_order_id[order_id];
            if (date > max_group.date) {
              // we've found a new maximum
              state.by_order_id[order_id] = current_group;
            }
          } else {
            state.by_order_id[order_id] = current_group;
          }
        """,
        "combine_script": """
          def colors_vs_count = [:];
          for (def group : state.by_order_id.entrySet()) {
            def color = group.getValue()['color'];
            if (colors_vs_count.containsKey(color)) {
              colors_vs_count[color]++;
            } else {
              colors_vs_count[color] = 1;
            }
          }
          return colors_vs_count;
        """,
        "reduce_script": "return states"
      }
    }
  }
}
yielding:
...
"aggregations" : {
  "by_color" : {
    "value" : [
      {
        "Red" : 2,
        "Black" : 1
      }
    ]
  }
}
Here's a JSON-friendly, condensed version of the script:
{"size":0,"aggs":{"by_color":{"scripted_metric":{"init_script":"state.by_order_id = [:]","map_script":" def color = doc['color.keyword'].value;\n def date = doc['purchase_date'].value.millis;\n def order_id = doc['order_id'].value;\n \n def current_group = ['color':color, 'date': date];\n \n if (state.by_order_id.containsKey(order_id)) {\n def max_group = state.by_order_id[order_id];\n if (date > max_group.date) {\n state.by_order_id[order_id] = current_group\n }\n } else {\n state.by_order_id[order_id] = current_group;\n }","combine_script":" def colors_vs_count = [:];\n \n for (def group : state.by_order_id.entrySet()) {\n def order_id = group.getKey();\n def color = group.getValue()['color'];\n if (colors_vs_count.containsKey(color)) {\n colors_vs_count[color]++;\n } else {\n colors_vs_count[color] = 1;\n }\n }\n \n return colors_vs_count;","reduce_script":"return states"}}}}

How to aggregate until a certain value is reached in ElasticSearch?

I would like to aggregate a list of documents (each of them has two fields, timestamp and amount) by the "amount" field until a certain value is reached. For example, I would like to get a list of documents, sorted by timestamp, whose total amount is equal to 100. Is it possible to do this in one query?
Here is my query, which returns the total amount; I would like to add a condition to stop aggregating when a certain value is reached.
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": 1525168583
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "total_amount": {
      "sum": {
        "field": "amount"
      }
    }
  },
  "sort": [
    "timestamp"
  ],
  "size": 10000
}
Thank You
It's perfectly possible using a combination of function_score scripting for mimicking sorting, filter aggs for the range gte query, and a healthy amount of scripted_metric aggs to limit the summation up to a certain amount.
Let's first set up a mapping and ingest some docs:
PUT summation
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "epoch_second"
      }
    }
  }
}

POST summation/_doc
{
  "context": "newest",
  "timestamp": 1587049128,
  "amount": 20
}

POST summation/_doc
{
  "context": "2nd newest",
  "timestamp": 1586049128,
  "amount": 30
}

POST summation/_doc
{
  "context": "3rd newest",
  "timestamp": 1585049128,
  "amount": 40
}

POST summation/_doc
{
  "context": "4th newest",
  "timestamp": 1585049128,
  "amount": 30
}
Then perform the query:
GET summation/_search
{
  "size": 0,
  "aggs": {
    "filtered_agg": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "timestamp": {
                  "gte": 1585049128
                }
              }
            },
            {
              "function_score": {
                "query": {
                  "match_all": {}
                },
                "script_score": {
                  "script": {
                    "source": "return (params['now'] - doc['timestamp'].date.toMillis())",
                    "params": {
                      "now": 1587049676
                    }
                  }
                }
              }
            }
          ]
        }
      },
      "aggs": {
        "limited_sum": {
          "scripted_metric": {
            "init_script": """
              state['my_hash'] = new HashMap();
              state['my_hash'].put('sum', 0);
              state['my_hash'].put('docs', new ArrayList());
            """,
            "map_script": """
              if (state['my_hash']['sum'] <= 100) {
                state['my_hash']['sum'] += doc['amount'].value;
                state['my_hash']['docs'].add(doc['context.keyword'].value);
              }
            """,
            "combine_script": "return state['my_hash']",
            "reduce_script": "return states[0]"
          }
        }
      }
    }
  }
}
yielding
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "filtered_agg" : {
      "meta" : { },
      "doc_count" : 4,
      "limited_sum" : {
        "value" : {
          "docs" : [
            "newest",
            "2nd newest",
            "3rd newest",
            "4th newest"
          ],
          "sum" : 120
        }
      }
    }
  }
}
I've chosen here to return only the doc contexts, but you can adjust it to retrieve whatever you like, be it IDs, amounts, etc.
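For instance, a variant of the map_script that also records each doc's amount next to its context could look like this sketch (same state shape as above):
"map_script": """
  if (state['my_hash']['sum'] <= 100) {
    state['my_hash']['sum'] += doc['amount'].value;
    // keep the context together with the amount for each collected doc
    state['my_hash']['docs'].add(['context': doc['context.keyword'].value, 'amount': doc['amount'].value]);
  }
"""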

Elasticsearch: Selecting multiple values in aggregates

In Elasticsearch I have the following index with 'allocated_bytes', 'total_bytes', and other fields:
{
  "_index" : "metrics-blockstore_capacity-2017_06",
  "_type" : "datapoint",
  "_id" : "AVzHwgsi9KuwEU6jCXy5",
  "_score" : 1.0,
  "_source" : {
    "timestamp" : 1498000001000,
    "resource_guid" : "2185d15c-5298-44ac-8646-37575490125d",
    "allocated_bytes" : 1.159196672E9,
    "resource_type" : "machine",
    "total_bytes" : 1.460811776E11,
    "machine" : "2185d15c-5298-44ac-8646-37575490125d"
  }
}
I have the following query to:
1) get a point per 30-minute interval using a date_histogram,
2) group by the resource_guid field,
3) use a max aggregate to find the max value.
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "timestamp": {
              "gte": 1497992400000,
              "lte": 1497996000000
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "groupByTime": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "30m",
        "order": {
          "_key": "desc"
        }
      },
      "aggregations": {
        "groupByField": {
          "terms": {
            "size": 1000,
            "field": "resource_guid"
          },
          "aggregations": {
            "maxValue": {
              "max": {
                "field": "allocated_bytes"
              }
            }
          }
        },
        "sumUnique": {
          "sum_bucket": {
            "buckets_path": "groupByField>maxValue"
          }
        }
      }
    }
  }
}
With this query I am able to get only allocated_bytes, but I need both allocated_bytes and total_bytes at each result point.
Following is the result from the above query:
{
  "key_as_string" : "2017-06-20T21:00:00.000Z",
  "key" : 1497992400000,
  "doc_count" : 9,
  "groupByField" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "2185d15c-5298-44ac-8646-37575490125d",
        "doc_count" : 3,
        "maxValue" : {
          "value" : 1.156182016E9
        }
      },
      {
        "key" : "c3513cdd-58bb-4f8e-9b4c-467230b4f6e2",
        "doc_count" : 3,
        "maxValue" : {
          "value" : 1.156165632E9
        }
      },
      {
        "key" : "eff13403-9737-4d08-9dca-fb6c12c3a6fa",
        "doc_count" : 3,
        "maxValue" : {
          "value" : 1.156182016E9
        }
      }
    ]
  },
  "sumUnique" : {
    "value" : 3.468529664E9
  }
}
I do need both allocated_bytes and total_bytes. How do I get multiple fields (allocated_bytes, total_bytes) for each point?
For example:
"sumUnique" : {
"Allocatedvalue" : 3.468529664E9,
"TotalValue" : 9.468529664E9
}
or like this:
"allocatedBytessumUnique" : {
"value" : 3.468529664E9
}
"totalBytessumUnique" : {
"value" : 9.468529664E9
},
You can just add another aggregation:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "timestamp": {
              "gte": 1497992400000,
              "lte": 1497996000000
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "groupByTime": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "30m",
        "order": {
          "_key": "desc"
        }
      },
      "aggregations": {
        "groupByField": {
          "terms": {
            "size": 1000,
            "field": "resource_guid"
          },
          "aggregations": {
            "maxValueAllocated": {
              "max": {
                "field": "allocated_bytes"
              }
            },
            "maxValueTotal": {
              "max": {
                "field": "total_bytes"
              }
            }
          }
        },
        "sumUniqueAllocatedBytes": {
          "sum_bucket": {
            "buckets_path": "groupByField>maxValueAllocated"
          }
        },
        "sumUniqueTotalBytes": {
          "sum_bucket": {
            "buckets_path": "groupByField>maxValueTotal"
          }
        }
      }
    }
  }
}
Note that sum_bucket operates on sibling aggregations only; in this case it gives the sum of the max values, not the sum of total_bytes. If you want the sum of total_bytes across all documents, you can use a plain sum aggregation.
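A minimal sketch of that, reusing the index and field from the question:
GET metrics-blockstore_capacity-2017_06/_search
{
  "size": 0,
  "aggs": {
    "total_bytes_sum": {
      "sum": {
        "field": "total_bytes"
      }
    }
  }
}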
