Elasticsearch aggregation on map - on each key

I have the following kind of documents.
document 1
{
"doc": {
"id": 1,
"errors": {
"e1":5,
"e2":20,
"e3":30
},
"warnings": {
"w1":1,
"w2":2
}
}
}
document 2
{
"doc": {
"id": 2,
"errors": {
"e1":10
},
"warnings": {
"w1":1,
"w2":2,
"w3":33
}
}
}
I would like to get the following sum stats in one or more calls. Is it possible? I tried various solutions, but they all work only when the key is known. In my case the map keys (e1, e2, etc.) are not known.
{
"errors": {
"e1": 15,
"e2": 20,
"e3": 30
},
"warnings": {
"w1": 2,
"w2": 4,
"w3": 33
}
}

There are two solutions, and neither of them is pretty. I have to point out that option 2 should be the preferred way to go, since option 1 uses an experimental feature.
1. Dynamic mapping, [experimental] scripted aggregation
Inspired by this answer and the Scripted Metric Aggregation page of ES docs, I began with just inserting your documents to non-existing index (which by default creates dynamic mapping).
NB: I tested this on ES 5.4, but the documentation suggests that this feature is available from at least 2.0.
The resulting query for aggregation is the following:
POST /my_index/my_type/_search
{
"size": 0,
"query" : {
"match_all" : {}
},
"aggs": {
"errors": {
"scripted_metric": {
"init_script" : "params._agg.errors = [:]",
"map_script" : "for (t in params['_source']['doc']['errors'].entrySet()) { params._agg.errors[t.key] = params._agg.errors.containsKey(t.key) ? params._agg.errors[t.key] + t.value : t.value } ",
"combine_script" : "return params._agg.errors",
"reduce_script": "Map res = [:] ; for (a in params._aggs) { for (t in a.entrySet()) { res[t.key] = res.containsKey(t.key) ? res[t.key] + t.value : t.value } } return res"
}
},
"warnings": {
"scripted_metric": {
"init_script" : "params._agg.warnings = [:]",
"map_script" : "for (t in params['_source']['doc']['warnings'].entrySet()) { params._agg.warnings[t.key] = params._agg.warnings.containsKey(t.key) ? params._agg.warnings[t.key] + t.value : t.value } ",
"combine_script" : "return params._agg.warnings",
"reduce_script": "Map res = [:] ; for (a in params._aggs) { for (t in a.entrySet()) { res[t.key] = res.containsKey(t.key) ? res[t.key] + t.value : t.value } } return res"
}
}
}
}
Which produces this output:
{
...
"aggregations": {
"warnings": {
"value": {
"w1": 2,
"w2": 4,
"w3": 33
}
},
"errors": {
"value": {
"e1": 15,
"e2": 20,
"e3": 30
}
}
}
}
If you are following this path you might be interested in the JavaDoc of what params['_source'] is underneath.
Warning: I believe that scripted aggregation is not efficient; for better performance you should check out option 2 or a different data processing engine.
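To make the phases of the scripted metric easier to follow, here is a minimal Python sketch of the same idea, run locally rather than in Elasticsearch. The function names map_phase and reduce_phase are made up for illustration; they stand in for the per-shard scripts (init, map, combine) and the reduce script. The sketch also checks that the totals do not depend on how the documents are spread over shards, which only holds if the per-shard step accumulates values per key.

```python
from collections import Counter

def map_phase(shard_docs, field):
    """Per-shard work: sum every key found under doc[field] on this shard."""
    state = Counter()
    for doc in shard_docs:
        for key, value in doc["doc"].get(field, {}).items():
            state[key] += value
    return dict(state)

def reduce_phase(shard_states):
    """Coordinating-node work: merge the per-shard maps into one result."""
    total = Counter()
    for state in shard_states:
        for key, value in state.items():
            total[key] += value
    return dict(total)

doc1 = {"doc": {"id": 1, "errors": {"e1": 5, "e2": 20, "e3": 30},
                "warnings": {"w1": 1, "w2": 2}}}
doc2 = {"doc": {"id": 2, "errors": {"e1": 10},
                "warnings": {"w1": 1, "w2": 2, "w3": 33}}}

# Both documents on one shard versus one document per shard.
same_shard = reduce_phase([map_phase([doc1, doc2], "errors")])
split_shards = reduce_phase([map_phase([doc1], "errors"),
                             map_phase([doc2], "errors")])
warnings = reduce_phase([map_phase([doc1, doc2], "warnings")])

print(same_shard)
print(split_shards)
print(warnings)
```

Both shard layouts give the same sums for the two sample documents from the question.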
What does experimental mean:
This functionality is experimental and may be changed or removed
completely in a future release. Elastic will take a best effort
approach to fix any issues, but experimental features are not subject
to the support SLA of official GA features.
With this in mind we proceed to option 2.
2. Static nested mapping, nested aggregation
Here the idea is to store your data differently and essentially be able to query and aggregate it differently. Firstly, we need to create a mapping using nested data type.
PUT /my_index_nested/
{
"mappings": {
"my_type": {
"properties": {
"errors": {
"type": "nested",
"properties": {
"name": {"type": "keyword"},
"val": {"type": "integer"}
}
},
"warnings": {
"type": "nested",
"properties": {
"name": {"type": "keyword"},
"val": {"type": "integer"}
}
}
}
}
}
}
A document in such an index will look like this:
{
"_index": "my_index_nested",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"errors": [
{
"name": "e1",
"val": 5
},
{
"name": "e2",
"val": 20
},
{
"name": "e3",
"val": 30
}
],
"warnings": [
{
"name": "w1",
"val": 1
},
{
"name": "w2",
"val": 2
}
]
}
}
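If your data is already in the map form from the question, it has to be reshaped before indexing into the nested mapping. A minimal sketch of that conversion in Python, assuming the documents have the "doc" wrapper shown in the question (the helper names to_nested and convert are hypothetical):

```python
def to_nested(mapping):
    """Turn {"e1": 5} into [{"name": "e1", "val": 5}]."""
    return [{"name": name, "val": val} for name, val in mapping.items()]

def convert(document):
    """Reshape one map-style document into the nested layout."""
    inner = document["doc"]
    return {
        "errors": to_nested(inner.get("errors", {})),
        "warnings": to_nested(inner.get("warnings", {})),
    }

original = {
    "doc": {
        "id": 1,
        "errors": {"e1": 5, "e2": 20, "e3": 30},
        "warnings": {"w1": 1, "w2": 2},
    }
}

nested_doc = convert(original)
print(nested_doc)
```

The result is the body you would index; how the document id is carried over is left out of this sketch.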
Next we need to write the aggregate query. First we need to use nested aggregation, which will allow us to query this special nested data type. But since we actually want to aggregate by name, and sum the values of val, we will need to do a sub-aggregation.
The resulting query is as follows (I am adding comments alongside the query for clarity):
POST /my_index_nested/my_type/_search
{
"size": 0,
"aggs": {
"errors_top": {
"nested": {
// declare which nested objects we want to work with
"path": "errors"
},
"aggs": {
"errors": {
// what we are aggregating - different values of name
"terms": {"field": "errors.name"},
// sub aggregation
"aggs": {
"error_sum": {
// sum all val for same name
"sum": {"field": "errors.val"}
}
}
}
}
},
"warnings_top": {
// analogous to errors
}
}
}
The output of this query will be like:
{
...
"aggregations": {
"errors_top": {
"doc_count": 4,
"errors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "e1",
"doc_count": 2,
"error_sum": {
"value": 15
}
},
{
"key": "e2",
"doc_count": 1,
"error_sum": {
"value": 20
}
},
{
"key": "e3",
"doc_count": 1,
"error_sum": {
"value": 30
}
}
]
}
},
"warnings_top": {
...
}
}
}
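The bucket output is more verbose than the shape asked for in the question. A small client-side sketch in Python that flattens it, assuming the response has already been parsed into a dict (flatten_buckets is a hypothetical helper, and the response below is trimmed to the fields the sketch uses):

```python
def flatten_buckets(aggregations, top_name, terms_name, sum_name):
    """Collapse terms buckets with a sum sub-aggregation into {key: sum}."""
    buckets = aggregations[top_name][terms_name]["buckets"]
    return {bucket["key"]: bucket[sum_name]["value"] for bucket in buckets}

response = {
    "aggregations": {
        "errors_top": {
            "doc_count": 4,
            "errors": {
                "buckets": [
                    {"key": "e1", "doc_count": 2, "error_sum": {"value": 15}},
                    {"key": "e2", "doc_count": 1, "error_sum": {"value": 20}},
                    {"key": "e3", "doc_count": 1, "error_sum": {"value": 30}},
                ]
            },
        }
    }
}

errors = flatten_buckets(response["aggregations"], "errors_top", "errors", "error_sum")
print({"errors": errors})
```

The same helper can be called a second time for the warnings aggregation with whatever names you chose for it.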

Related

ElasticSearch Max Agg on lowest value inside a list property of the document

I'm looking to do a Max aggregation on a value of the property under my document, the property is a list of complex object (key and value). Here's my data:
[{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
},
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}]
When I do the nested max aggregation on "listItems.value", I expect the max value returned to be 200 (and not 5000), because I want the logic to first find the MIN value under listItems for each document and then do the max aggregation on those minimums. Is it possible to do something like this?
Thanks.
The search query performs the following aggregations:
Terms aggregation on the id field
Min aggregation on listItems.value
Max bucket aggregation, a sibling pipeline aggregation that identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s).
Please refer to the nested aggregation documentation for a detailed explanation of it.
Adding a working example with index data, index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"listItems": {
"type": "nested"
},
"id":{
"type":"text",
"fielddata":"true"
}
}
}
}
Index Data:
{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
}
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}
Search Query:
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id"
},
"aggs": {
"nested_entries": {
"nested": {
"path": "listItems"
},
"aggs": {
"min_position": {
"min": {
"field": "listItems.value"
}
}
}
}
}
},
"maxValue": {
"max_bucket": {
"buckets_path": "id_terms>nested_entries>min_position"
}
}
}
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": "2",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 200.0
}
}
}
]
},
"maxValue": {
"value": 200.0,
"keys": [
"2"
]
}
}
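As a sanity check, the same "minimum per document, then maximum of those minimums" logic can be reproduced locally. This is only a Python sketch over the sample data from the question, not something that runs inside Elasticsearch, and max_of_minimums is a hypothetical helper name:

```python
docs = [
    {"id": "1", "listItems": [{"key": "li1", "value": 100},
                              {"key": "li2", "value": 5000}]},
    {"id": "2", "listItems": [{"key": "li3", "value": 200},
                              {"key": "li2", "value": 2000}]},
]

def max_of_minimums(documents):
    """Return (largest per-document minimum, ids of the documents holding it)."""
    minimums = {
        doc["id"]: min(item["value"] for item in doc["listItems"])
        for doc in documents
    }
    best = max(minimums.values())
    keys = [doc_id for doc_id, value in minimums.items() if value == best]
    return best, keys

value, keys = max_of_minimums(docs)
print(value, keys)
```

The per-document minimums are 100 and 200, so the result matches the maxValue entry in the search result.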
The initial post mentioned nested aggregation, so I was sure the question was about nested documents. Since I came to my solution before seeing the other answer, I'm keeping the whole thing for history, but it actually differs only in adding the nested aggregation.
The whole process can be explained like this:
Bucket each document into its own bucket.
Use a nested aggregation to be able to aggregate on nested documents.
Use a min aggregation to find the minimum value among all nested documents of a document, which is the minimum for the document itself.
Finally, use another aggregation to calculate the maximum value among the results of the previous aggregation.
Given this setup:
// PUT /index
{
"mappings": {
"properties": {
"children": {
"type": "nested",
"properties": {
"value": {
"type": "integer"
}
}
}
}
}
}
// POST /index/_doc
{
"children": [
{ "value": 12 },
{ "value": 45 }
]
}
// POST /index/_doc
{
"children": [
{ "value": 7 },
{ "value": 35 }
]
}
I can use these aggregations in a request to get the required value:
{
"size": 0,
"aggs": {
"document": {
"terms": {"field": "_id"},
"aggs": {
"children": {
"nested": {
"path": "children"
},
"aggs": {
"minimum": {
"min": {
"field": "children.value"
}
}
}
}
}
},
"result": {
"max_bucket": {
"buckets_path": "document>children>minimum"
}
}
}
}
{
"aggregations": {
"document": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "O4QxyHQBK5VO9CW5xJGl",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 7.0
}
}
},
{
"key": "OoQxyHQBK5VO9CW5kpEc",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 12.0
}
}
}
]
},
"result": {
"value": 12.0,
"keys": [
"OoQxyHQBK5VO9CW5kpEc"
]
}
}
}
There should also be a workaround that uses a script in the max aggregation: all the script needs to do is find and return the smallest value in the document.

Multiple key aggregation in ElasticSearch

I am new to Elastic Search and was exploring aggregation query. The documents I have are in the format -
{"name":"A",
"class":"10th",
"subjects":{
"S1":92,
"S2":92,
"S3":92
}
}
We have about 40k such documents in our ES, with the subjects varying from student to student. A query to the system can ask to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here; however, this generates a single bucket per document and, in our understanding, requires an explicit mention of every subject.
We want the system to generate a subject-wise aggregate for the data by executing a single aggregation query. The problem I face is that in our data the subjects can vary from student to student and we don't have a global list of subject keys.
We wrote the following script, but it only works if we know all possible subjects.
GET student_data_v1_1/_search
{ "query" :
{"match" :
{ "class" : "' + query + '" }},
"aggs" : { "my_buckets" : { "terms" :
{ "field" : "subjects", "size":10000 },
"aggregations": {"the_avg":
{"avg": { "field": "subjects.value" }}} }},
"size" : 0 }'
But this query only works for the document structure below; it does not work when multiple subjects are defined and we may not know the keys in advance -
{"name":"A",
"class":"10th",
"subjects":{
"value":93
}
}
An alternate form in which the document may be present is one where subjects is a list of dictionaries -
{"name":"A",
"class":"10th",
"subjects":[
{"S1":92},
{"S2":92},
{"S3":92}
]
}
Having an aggregation query to solve either of the 2 document formats would be helpful.
======EDITS======
After updating the document to hold weights for each subject -
{
"class": "10th",
"subject": [
{
"name": "s1",
"marks": 90,
"weight":30
},
{
"name": "s2",
"marks": 80,
"weight":70
}
]}
I have updated the query to be -
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "scores"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
}
}
}
}
},
"size": 0
}
but it throws the error-
{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}
To achieve the required result, I suggest keeping your index mapping as follows:
{
"properties": {
"class": {
"type": "keyword"
},
"subject": {
"type": "nested",
"properties": {
"marks": {
"type": "integer"
},
"name": {
"type": "keyword"
}
}
}
}
}
In the mapping above I have created subject as nested type with two properties, name to hold subject name and marks to hold marks in the subject.
Sample doc:
{
"class": "10th",
"subject": [
{
"name": "s1",
"marks": 90
},
{
"name": "s2",
"marks": 80
}
]
}
Now you can use a nested aggregation and multilevel aggregation (i.e. an aggregation inside an aggregation). I used a nested aggregation with a terms aggregation on subject.name to get buckets for all the available subjects. Then, to get the average for each subject, we add a child avg aggregation to the subjects aggregation as below:
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
}
}
}
}
}
},
"size": 0
}
NOTE: I have added "size" : 0 so that elastic doesn't return matching docs in the result. To include or exclude it depends totally on your use case.
Sample result:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"subjects": {
"doc_count": 6,
"subjects": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "s1",
"doc_count": 3,
"avg_score": {
"value": 80
}
},
{
"key": "s2",
"doc_count": 2,
"avg_score": {
"value": 75
}
},
{
"key": "s3",
"doc_count": 1,
"avg_score": {
"value": 80
}
}
]
}
}
}
}
As you can see the result contains buckets with key as subject name and avg_score.value as the avg of marks.
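UPDATE to include weighted_avg (as far as I know, weighted_avg was added in Elasticsearch 6.4, so an "Unknown BaseAggregationBuilder [weighted_avg]" error suggests an older version; note that the paths and fields below use subject, matching the mapping above):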
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
},
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "subject.marks"
},
"weight": {
"field": "subject.weight"
}
}
}
}
}
}
}
},
"size": 0
}
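For reference, the weighted average computed per bucket is the sum of value times weight divided by the sum of weights. A minimal Python sketch of that arithmetic per subject follows; the first document is the one from the question's edit, the second document and the helper name weighted_avg_by_subject are made up for illustration:

```python
from collections import defaultdict

def weighted_avg_by_subject(documents):
    """Per subject name: sum(marks * weight) / sum(weight) over all entries."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for doc in documents:
        for entry in doc["subject"]:
            weighted_sum[entry["name"]] += entry["marks"] * entry["weight"]
            weight_total[entry["name"]] += entry["weight"]
    # Subjects whose weights sum to zero are skipped to avoid dividing by zero.
    return {
        name: weighted_sum[name] / weight_total[name]
        for name in weighted_sum
        if weight_total[name]
    }

docs = [
    {"class": "10th", "subject": [{"name": "s1", "marks": 90, "weight": 30},
                                  {"name": "s2", "marks": 80, "weight": 70}]},
    # Hypothetical second student, added only to have something to average.
    {"class": "10th", "subject": [{"name": "s1", "marks": 70, "weight": 10},
                                  {"name": "s2", "marks": 60, "weight": 30}]},
]

result = weighted_avg_by_subject(docs)
print(result)
```

For s1 this is (90*30 + 70*10) / 40 = 85.0, and for s2 it is (80*70 + 60*30) / 100 = 74.0.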

Elasticsearch aggregation by field name

Imagine two documents:
[
{
"_id": "abc",
"categories": {
"category-id-1": 1,
"category-id-2": 50
}
},
{
"_id": "def",
"categories": {
"category-id-1": 2
}
}
]
As you can see, each document can be associated with a number of categories, by setting a nested field into the categories field.
With this mapping, I should be able to request the documents from a defined category and to order them by the value set as value for this field.
My problem is that I now want to make an aggregation to count for each category the number of documents. That would give the following result for the dataset I provided:
{
"aggregations": {
"categories" : {
"buckets": [
{
"key": "category-id-1",
"doc_count": 2
},
{
"key": "category-id-2",
"doc_count": 1
}
]
}
}
}
I can't find anything in the documentation to solve this problem. I'm completely new to ElasticSearch so I may be doing something wrong either on my documentation research or on my mapping choice.
Is it possible to make this kind of aggregation with my mapping? I'm using ES 6.x
EDIT: Here is the mapping for the index:
{
"test1234": {
"mappings": {
"_doc": {
"properties": {
"categories": {
"properties": {
"category-id-1": {
"type": "long"
},
"category-id-2": {
"type": "long"
}
}
}
}
}
}
}
}
The most straightforward solution is to add a new field that contains all the distinct categories of a document.
If we call this field categories_list, a solution could look like this:
Change the mapping to
{
"test1234": {
"mappings": {
"_doc": {
"properties": {
"categories": {
"properties": {
"category-id-1": {
"type": "long"
},
"category-id-2": {
"type": "long"
}
}
},
"categories_list": {
"type": "keyword"
}
}
}
}
}
}
Then you need to modify your documents like this :
[
{
"_id": "abc",
"categories": {
"category-id-1": 1,
"category-id-2": 50
},
"categories_list": ["category-id-1", "category-id-2"]
},
{
"_id": "def",
"categories": {
"category-id-1": 2
},
"categories_list": ["category-id-1"]
}
]
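The extra field can be derived from the keys of categories before indexing. A minimal Python sketch, assuming documents are plain dicts as shown above (with_categories_list is a hypothetical helper); it also counts the categories locally to show what the terms aggregation is expected to report:

```python
from collections import Counter

def with_categories_list(document):
    """Return a copy of the document with categories_list built from the keys."""
    enriched = dict(document)
    enriched["categories_list"] = sorted(document.get("categories", {}))
    return enriched

docs = [
    {"_id": "abc", "categories": {"category-id-1": 1, "category-id-2": 50}},
    {"_id": "def", "categories": {"category-id-1": 2}},
]

enriched_docs = [with_categories_list(doc) for doc in docs]

# Local equivalent of counting documents per category.
counts = Counter(
    category for doc in enriched_docs for category in doc["categories_list"]
)

print(enriched_docs)
print(dict(counts))
```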
then your aggregation request should be
{
"aggs": {
"categories": {
"terms": {
"field": "categories_list",
"size": 10
}
}
}
}
and will return
"aggregations": {
"categories": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "category-id-1",
"doc_count": 2
},
{
"key": "category-id-2",
"doc_count": 1
}
]
}
}

Elasticsearch: Aggregate distinct values in array

I am using Elasticsearch to store click traffic and each row includes topics of the page which has been visited. A typical row looks like:
{
"date": "2017-09-10T12:26:53.998Z",
"pageid": "10263779",
"loc_ll": [
-73.6487,
45.4671
],
"ua_type": "Computer",
"topics": [
"Trains",
"Planes",
"Electric Cars"
]
}
I want each topic to be a keyword, so if I search for cars nothing will be returned. Only Electric Cars would return a result.
I also want to run a distinct query on all topics in all rows so that I have a list of all topics used.
Doing this on pageid would look like the following, but I am unsure how to approach this for the topics array.
{
"aggs": {
"ids": {
"terms": {
"field": "pageid",
"size": 10
}
}
}
}
Your approach to querying and getting the available terms looks fine. You should probably check your mapping. If you get results for cars, it looks as if your mapping for topics is an analyzed string (e.g. type text instead of keyword). So please check your mapping for this field.
PUT keywordarray
{
"mappings": {
"item": {
"properties": {
"id": {
"type": "integer"
},
"topics": {
"type": "keyword"
}
}
}
}
}
With this sample data
POST keywordarray/item
{
"id": 123,
"topics": [
"first topic", "second topic", "another"
]
}
and this aggregation:
GET keywordarray/item/_search
{
"size": 0,
"aggs": {
"topics": {
"terms": {
"field": "topics"
}
}
}
}
will result in this:
"aggregations": {
"topics": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "another",
"doc_count": 1
},
{
"key": "first topic",
"doc_count": 1
},
{
"key": "second topic",
"doc_count": 1
}
]
}
}
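The difference between the two field types can be illustrated with a rough Python sketch. The text_terms function is only a stand-in for an analyzed text field (lowercasing and splitting on whitespace); the real analyzers do more than that, so treat it as an approximation. A keyword field keeps each array element as a single term.

```python
def text_terms(values):
    """Rough stand-in for an analyzed text field: lowercase, split on whitespace."""
    return {token for value in values for token in value.lower().split()}

def keyword_terms(values):
    """A keyword field indexes every array element unchanged as one term."""
    return set(values)

topics = ["Trains", "Planes", "Electric Cars"]

matches_text = "cars" in text_terms(topics)
matches_keyword = "cars" in keyword_terms(topics)
matches_exact = "Electric Cars" in keyword_terms(topics)

print(matches_text, matches_keyword, matches_exact)
```

With the analyzed stand-in a search for cars finds a term, while with keyword terms only the exact value Electric Cars matches, which is the behaviour asked for in the question.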
It is very therapeutic asking on SO. Simply changing the mapping type to keyword allowed me to achieve what I needed.
A part of me thought that it would concatenate the array into a string, but it doesn't.
{
"mappings": {
"view": {
"properties": {
"topics": {
"type": "keyword"
},...
}
}
}
}
and a search query like
{
"aggs": {
"ids": {
"terms": {
"field": "topics",
"size": 10
}
}
}
}
This returns a distinct list of all elements in the field's array.

How do I compute for the fields of matching documents in Elasticsearch?

Here is my sample document:
{
"jobID": "ace4c888-1907-4021-a808-4a816e99aa2e",
"startTime": 1415255164835,
"endTime": 1415255164898,
"moduleCode": "STARTING_MODULE"
}
I have thousands of documents.
Each job has a pair of documents with the same jobID: one with moduleCode STARTING_MODULE and one with ENDING_MODULE.
My formula is: ENDING_MODULE endTime minus STARTING_MODULE startTime equals the elapsed time it took the module to process.
My question is: how do I get the total of all results with an elapsed time that is less than, let's say, 28800000?
Is such results possible with Elasticsearch? I'd like to display my results in Kibana too.
Please let me know if this needs more clarification. Thanks!
Try the following, might not be ideal, but it returns a jobID and the elapsed time. First I'm assuming jobID and moduleCode are not_analyzed:
{
"mappings": {
"jobs": {
"properties": {
"jobID":{
"type": "string",
"index": "not_analyzed"
},
"startTime":{
"type": "date"
},
"endTime":{
"type": "date"
},
"moduleCode":{
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
I used the scripted_metric aggregation, available since ES 1.4.0, to compute the difference between those two values. I haven't looked into how to add the filtering for "less than 28800000", but I hope something can be done in that script to limit the results:
{
"query": {
"match_all": {}
},
"aggs": {
"jobIds": {
"terms": {
"field": "jobID"
},
"aggs": {
"executionTimes": {
"scripted_metric": {
"init_script": "_agg['time'] = 0L",
"map_script": "if (doc['moduleCode'].value == \"STARTING_MODULE\") { _agg['time'] -= doc['startTime'].value } else { _agg['time'] += doc['endTime'].value }",
"combine_script": "execution = 0; for (t in _agg.time) { execution += t };return execution",
"reduce_script": "execution = 0; for (a in _aggs) { execution += a }; return execution"
}
}
}
}
}
}
And the result should be something like this:
"aggregations": {
"jobIds": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "ace4c888-1907-4021-a808-4a816e99aa1e",
"doc_count": 2,
"executionTimes": {
"value": 1
}
},
{
"key": "ace4c888-1907-4021-a808-4a816e99aa2e",
"doc_count": 2,
"executionTimes": {
"value": 1000201063
}
},
{
"key": "ace4c888-1907-4021-a808-4a816e99aa3e",
"doc_count": 2,
"executionTimes": {
"value": 10000
}
}
]
}
}
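The "less than 28800000" filter can also be applied on the client once the elapsed time per job is known. The following is a minimal Python sketch of the pairing and filtering logic over raw documents, not an Elasticsearch feature; the job ids, timestamps for the ending documents, and the helper name elapsed_by_job are made up for illustration:

```python
THRESHOLD_MS = 28_800_000

def elapsed_by_job(documents):
    """Pair STARTING_MODULE and ENDING_MODULE documents by jobID."""
    starts, ends = {}, {}
    for doc in documents:
        if doc["moduleCode"] == "STARTING_MODULE":
            starts[doc["jobID"]] = doc["startTime"]
        elif doc["moduleCode"] == "ENDING_MODULE":
            ends[doc["jobID"]] = doc["endTime"]
    # Only jobs that have both documents get an elapsed time.
    return {job: ends[job] - starts[job] for job in starts if job in ends}

docs = [
    {"jobID": "job-a", "startTime": 1415255164835, "endTime": 1415255164898,
     "moduleCode": "STARTING_MODULE"},
    {"jobID": "job-a", "startTime": 1415255170000, "endTime": 1415255174835,
     "moduleCode": "ENDING_MODULE"},
    {"jobID": "job-b", "startTime": 1415255164835, "endTime": 1415255164898,
     "moduleCode": "STARTING_MODULE"},
    {"jobID": "job-b", "startTime": 1415285164000, "endTime": 1415285164835,
     "moduleCode": "ENDING_MODULE"},
]

elapsed = elapsed_by_job(docs)
fast_jobs = {job: ms for job, ms in elapsed.items() if ms < THRESHOLD_MS}

print(elapsed)
print(len(fast_jobs))
```

Here job-a took 10000 ms and job-b took 30000000 ms, so only job-a falls under the threshold. The same post-filter could be applied to the executionTimes values in the buckets returned by the aggregation.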
