Elasticsearch Terms Aggregation - for dynamic keys of an object - elasticsearch

Documents Structure
Doc_1 {
"title":"hello",
"myObject":{
"key1":"value1",
"key2":"value2"
}
}
Doc_2 {
"title":"hello world",
"myObject":{
"key2":"value4",
"key3":"value3"
}
}
Doc_3 {
"title":"hello world2",
"myObject":{
"key1":"value1",
"key3":"value3"
}
}
Information: myObject contains dynamic key-value pairs.
Objective: Write an aggregation query that returns the count of each unique dynamic key-value pair.
Attempt and explanation: I can easily get results for known keys this way.
{
"size":0,
"query":{
"match":{"title":"hello"}
},
"aggs":{
"key1Agg":{
"terms":{"field":"myObject.key1.keyword"}
},
"key2Agg":{
"terms":{"field":"myObject.key2.keyword"}
},
"key3Agg":{
"terms":{"field":"myObject.key3.keyword"}
}
}
}
This is the typical result of the above aggregation with hardcoded keys.
{
...
"aggregations": {
"key1Agg": {
...
"buckets": [
{
"key": "value1",
"doc_count": 2
}
]
},
"key2Agg": {
...
"buckets": [
{
"key": "value2",
"doc_count": 1
},
{
"key": "value4",
"doc_count": 1
}
]
},
"key3Agg": {
...
"buckets": [
{
"key": "value3",
"doc_count": 2
}
]
}
}
}
Now all I want is to return the count of all dynamic key-value pairs, i.e. without putting any hardcoded key names in the aggregation query.
I am using ES 6.3. Thanks in advance!

From the information you have provided, myObject appears to be of the object datatype, not the nested datatype.
There is no easy way to do this without modifying your data. Possibly the simplest solution is to add a field, say myObject_list, of type keyword that holds each pair as a single key_value string. The documents would then look as follows:
Sample Documents:
POST test_index/_doc/1
{
"title":"hello",
"myObject":{
"key1":"value1",
"key2":"value2"
},
"myObject_list": ["key1_value1", "key2_value2"] <--- Note this
}
POST test_index/_doc/2
{
"title":"hello world",
"myObject":{
"key2":"value4",
"key3":"value3"
},
"myObject_list": ["key2_value4", "key3_value3"] <--- Note this
}
POST test_index/_doc/3
{
"title":"hello world2",
"myObject":{
"key1":"value1",
"key3":"value3"
},
"myObject_list": ["key1_value1", "key3_value3"] <--- Note this
}
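The myObject_list values in the sample documents could be derived on the client side before indexing. The following is only a minimal sketch of that idea: the function names and the "_" separator are assumptions chosen to match the samples above, not part of any Elasticsearch API.

```python
def key_value_terms(obj, sep="_"):
    """Flatten a dict of dynamic keys into sorted 'key<sep>value' strings."""
    return sorted(f"{key}{sep}{value}" for key, value in obj.items())


def with_key_value_list(doc, source_field="myObject", target_field="myObject_list"):
    """Return a shallow copy of doc with the flattened list added (hypothetical helper)."""
    enriched = dict(doc)
    enriched[target_field] = key_value_terms(doc.get(source_field, {}))
    return enriched


doc = {"title": "hello", "myObject": {"key1": "value1", "key2": "value2"}}
print(with_key_value_list(doc)["myObject_list"])
# ['key1_value1', 'key2_value2']
```

If keys or values can themselves contain "_", the joined strings become ambiguous, so a separator that never occurs in the data would be a safer choice.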
You can have a query as simple as below:
Request Query:
POST test_index/_search
{
"size": 0,
"aggs": {
"key_value_aggregation": {
"terms": {
"field": "myObject_list", <--- Make sure this is of keyword type
"size": 10
}
}
}
}
Note that I've used Terms Aggregation over here.
Response:
{
"took" : 406,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"key_value_aggregation" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "key1_value1",
"doc_count" : 2
},
{
"key" : "key3_value3",
"doc_count" : 2
},
{
"key" : "key2_value2",
"doc_count" : 1
},
{
"key" : "key2_value4",
"doc_count" : 1
}
]
}
}
}
Hope this helps!

Related

Can Elastic Search do aggregations for within a document?

I have a mapping like this:
"mappings": {
"seller": {
"properties": {
"overallRating": {"type": "byte"},
"items": {
"properties": {
"itemName": {"type": "string"},
"itemRating": {"type": "byte"}
}
}
}
}
}
Each item will only have one itemRating. Each seller will only have one overall rating. There can be many items, and at most I'm expecting maybe 50 items with itemRatings. Not all items have to have an itemRating.
I'm trying to get an average rating for each seller that combines all itemRatings and the overallRating. I have looked into aggregations, but all the examples I have seen aggregate across all documents. The aggregation I'm looking to do is within the document itself, and I am not sure if that is possible. Any tips would be appreciated.
Yes, this is very much possible with Elasticsearch. To produce a combined rating, you simply need to bucket by the document id and run the sub-aggregation inside each bucket. Each bucket then contains only a single document, which is what you want.
Here is an example:
Create the index:
PUT /ratings
{
"mappings": {
"properties": {
"overallRating": {"type" : "float"},
"items": {
"type" : "nested",
"properties": {
"itemName" : {"type" : "keyword"},
"itemRating" : {"type" : "float"},
"overallRating": {"type" : "float"}
}
}
}
}
}
Add some data:
POST ratings/_doc/
{
"overallRating" : 1,
"items" : [
{
"itemName" : "labrador",
"itemRating" : 10,
"overallRating" : 1
},
{
"itemName" : "saint bernard",
"itemRating" : 20,
"overallRating" : 1
}
]
}
POST ratings/_doc/
{
"overallRating" : 1,
"items" : [
{
"itemName" : "cat",
"itemRating" : 5,
"overallRating" : 1
},
{
"itemName" : "rat",
"itemRating" : 10,
"overallRating" : 1
}
]
}
Query the index for a combined rating per document:
GET ratings/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_rating": {
"composite": {
"sources": [
{
"ids": {
"terms": {
"field": "_id"
}
}
}
]
},
"aggs": {
"average_rating": {
"nested": {
"path": "items"
},
"aggs": {
"avg": {
"avg": {
"field": "items.compound"
}
}
}
}
}
}
},
"runtime_mappings": {
"items.compound": {
"type": "double",
"script": {
"source": "emit(doc['items.overallRating'].value + doc['items.itemRating'].value)"
}
}
}
}
The result (please note that I changed the exact rating values between writing the answer and running it in the console, so the averages are a bit different):
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_rating" : {
"after_key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"buckets" : [
{
"key" : {
"ids" : "3_Up44EBbR3hrRYkLsrC"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 151.0
}
}
},
{
"key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 8.5
}
}
}
]
}
}
}
One change for convenience:
I edited your mappings to add overallRating to each items entry. This simplifies the subsequent calculations, because you only look in the nested scope and never have to step out of it.
I also had to use a runtime mapping to combine the overallRating and itemRating of each entry. It sums every itemRating with the overallRating, and the avg aggregation then averages those sums across all entries of the document.
I had to use a top-level composite aggregation on _id so that we get one result per document (which is what you want).
There is some pretty heavy lifting happening here, but it is very possible, and the query is easy to adapt as you require.
HTH.
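To make the arithmetic concrete, here is a small client-side sketch of the same calculation: per item, add overallRating to itemRating, then average the sums within one document. The function name is hypothetical, and skipping items that lack either field is an assumption of this sketch rather than a statement about how the runtime field behaves.

```python
def combined_average(doc):
    """Average of (overallRating + itemRating) across one document's items."""
    sums = [
        item["overallRating"] + item["itemRating"]
        for item in doc.get("items", [])
        if "itemRating" in item and "overallRating" in item
    ]
    return sum(sums) / len(sums) if sums else None


doc = {
    "overallRating": 1,
    "items": [
        {"itemName": "cat", "itemRating": 5, "overallRating": 1},
        {"itemName": "rat", "itemRating": 10, "overallRating": 1},
    ],
}
print(combined_average(doc))
# 8.5
```

For the cat/rat document this gives (6 + 11) / 2 = 8.5, which matches the second bucket in the response above.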

How to use composite aggregation with a single bucket

The following composite aggregation query
{
"query": {
"range": {
"orderedAt": {
"gte": 1591315200000,
"lte": 1591438881000
}
}
},
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{
"aggregation_target": {
"terms": {
"field": "supplierId"
}
}
}
]
},
"aggs": {
"aggregated_hits": {
"top_hits": {}
},
"filter": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": "params.doc_count > 2"
}
}
}
}
}
}
returns something like below.
{
"took" : 67,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 34,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_buckets" : {
"after_key" : {
"aggregation_target" : "0HQI2G2HG00100G8"
},
"buckets" : [
{
"key" : {
"aggregation_target" : "0HQI2G0K000100G8"
},
"doc_count" : 4,
"aggregated_hits" : {...}
},
{
"key" : {
"aggregation_target" : "0HQI2G18G00100G8"
},
"doc_count" : 11,
"aggregated_hits" : {...}
},
{
"key" : {
"aggregation_target" : "0HQI2G2HG00100G8"
},
"doc_count" : 16,
"aggregated_hits" : {...}
}
]
}
}
}
The aggregated results are put into buckets based on the condition set in the query.
Is there any way to put them in a single bucket and paginate through the whole result (i.e. 31 documents in this case)?
I don't think you can. A doc's context doesn't include information about other docs unless you perform a cardinality, scripted_metric or terms aggregation. Also, once you bucket your docs based on the supplierId, it'd sort of defeat the purpose of aggregating in the first place...
What you wrote above is as good as it gets, and you'll have to combine the aggregated_hits in a post-processing step.
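Such a post-processing step might look like the sketch below. It assumes the response has already been parsed into a Python dict with the shape shown above; the function names and the paging scheme are illustrative assumptions.

```python
def flatten_bucket_hits(response, agg_name="my_buckets", hits_name="aggregated_hits"):
    """Merge the top_hits of every composite bucket into one flat list."""
    merged = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        merged.extend(bucket[hits_name]["hits"]["hits"])
    return merged


def page(items, page_number, page_size):
    """Return one zero-based page of the merged hits."""
    start = page_number * page_size
    return items[start:start + page_size]


response = {
    "aggregations": {
        "my_buckets": {
            "buckets": [
                {"aggregated_hits": {"hits": {"hits": [{"_id": "a"}, {"_id": "b"}]}}},
                {"aggregated_hits": {"hits": {"hits": [{"_id": "c"}]}}},
            ]
        }
    }
}
all_hits = flatten_bucket_hits(response)
print(page(all_hits, 0, 2))
```

Note that top_hits returns only three hits per bucket by default, so its size would have to be raised for the merged list to cover every document in a bucket.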

Is it possible with aggregation to amalgamate all values of an array property from all grouped documents into the coalesced document?

I have documents with the format similar to the following:
[
{
"name": "fred",
"title": "engineer",
"division_id": 20,
"skills": [
"walking",
"talking"
]
},
{
"name": "ed",
"title": "ticket-taker",
"division_id": 20,
"skills": [
"smiling"
]
}
]
I would like to run an aggs query that shows the complete set of skills for the division, i.e.,
{
"aggs":{
"distinct_skills":{
"cardinality":{
"field":"division_id"
}
}
},
"_source":{
"includes":[
"division_id",
"skills"
]
}
}
so that the resulting hit would look like:
{
"division_id": 20,
"skills": [
"walking",
"talking",
"smiling"
]
}
I know I can retrieve inner_hits, iterate through the list and amalgamate the values "manually". I assume it would perform better if I could do it in a query.
Just nest two Terms Aggregations, as shown below:
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"my_division_ids": {
"terms": {
"field": "division_id",
"size": 10
},
"aggs": {
"my_skills": {
"terms": {
"field": "skills", <---- If it is not keyword field use `skills.keyword` field if using dynamic mapping.
"size": 10
}
}
}
}
}
}
Below is the sample response:
{
"took" : 490,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_division_ids" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 20, <---- division_id
"doc_count" : 2,
"my_skills" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ <---- Skills
{
"key" : "smiling",
"doc_count" : 1
},
{
"key" : "talking",
"doc_count" : 1
},
{
"key" : "walking",
"doc_count" : 1
}
]
}
}
]
}
}
}
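The nested buckets can then be collapsed into the shape asked for in the question. This is only a sketch, assuming the response is available as a Python dict and the aggregation names are the ones used above; the function name is hypothetical.

```python
def skills_by_division(response):
    """Turn nested terms buckets into one {division_id, skills} dict per division."""
    result = []
    for division in response["aggregations"]["my_division_ids"]["buckets"]:
        result.append({
            "division_id": division["key"],
            "skills": [skill["key"] for skill in division["my_skills"]["buckets"]],
        })
    return result


response = {
    "aggregations": {
        "my_division_ids": {
            "buckets": [
                {
                    "key": 20,
                    "doc_count": 2,
                    "my_skills": {
                        "buckets": [
                            {"key": "smiling", "doc_count": 1},
                            {"key": "talking", "doc_count": 1},
                            {"key": "walking", "doc_count": 1},
                        ]
                    },
                }
            ]
        }
    }
}
print(skills_by_division(response))
# [{'division_id': 20, 'skills': ['smiling', 'talking', 'walking']}]
```

Both terms aggregations are capped by their size setting (10 in the query above), so divisions or skills beyond that limit would not appear in the output.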
Hope this helps!

In Elasticsearch, how do I perform nested sub-aggregations?

In Kibana, I create my index as follows:
PUT cars
{
"mappings":{
"_doc":{
"properties":{
"metadata":{
"type":"nested",
"properties":{
"str_value":{
"type":"keyword"
}
}
}
}
}
}
}
I then insert three records:
POST /cars/_doc/1
{
"metadata": [
{
"key": "model",
"str_value": "Ford"
},
{
"key": "price",
"int_value": 1000
}
]
}
PUT /cars/_doc/2
{
"metadata": [
{
"key": "model",
"str_value": "Ford"
},
{
"key": "price",
"int_value": 2000
}
]
}
PUT /cars/_doc/3
{
"metadata": [
{
"key": "model",
"str_value": "Holden"
},
{
"key": "price",
"int_value": 2500
}
]
}
The schema is a bit unconventional, but I've designed the index this way to avoid mapping explosion:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
What I'd like to be able to do is to get all my car models and the sum of prices for those models, i.e. Ford $3000 and Holden $2500. So far I have:
GET /cars/_search
{
"aggs":{
"metadata":{
"nested":{
"path":"metadata"
},
"aggs":{
"model_filter":{
"filter":{
"term":{
"metadata.key":"model"
}
},
"aggs":{
"model_counter":{
"terms":{
"field":"metadata.str_value",
"size":1000
}
}
}
}
}
}
}
}
This gets me part of the way there, because it returns car models and document counts:
"aggregations": {
"metadata": {
"doc_count": 6,
"model_filter": {
"doc_count": 3,
"model_counter": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Ford",
"doc_count": 2
},
{
"key": "Holden",
"doc_count": 1
}
]
}
}
}
}
How can I modify my query to add a sub-aggregation which will show the sum of prices, i.e. 3000 for Ford (sum of two documents) and 2500 for Holden (sum of one document)?
The query below should give you what you are looking for.
I've simply built on your solution. I've used a Reverse Nested Aggregation to step back to the parent document, then a second Nested Aggregation followed by a Sum Aggregation.
So your query hierarchy is as below:
Nested Aggregation
- Filter Aggregation to keep only the model entries
- Terms Aggregation
- Reverse Nested Aggregation to go back to the parent doc
- Nested Aggregation to enter the nested price document
- Sum Aggregation to calculate the sum of the prices
Aggregation Query:
POST <your_index_name>/_search
{
"size":0,
"aggs":{
"metadata":{
"nested":{
"path":"metadata"
},
"aggs":{
"model_filter":{
"filter":{
"term":{
"metadata.key":"model"
}
},
"aggs":{
"model_counter":{
"terms":{
"field":"metadata.str_value",
"size":1000
},
"aggs":{
"reverseNestedAgg":{
"reverse_nested":{},
"aggs":{
"metadata":{
"nested":{
"path":"metadata"
},
"aggs":{
"sum":{
"sum":{
"field":"metadata.int_value"
}
}
}
}
}
}
}
}
}
}
}
}
}
}
Note that I've added "size": 0 so that only the aggregation results are returned. You can modify it according to your requirements.
Aggregation Response:
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"metadata" : {
"doc_count" : 6,
"model_filter" : {
"doc_count" : 3,
"model_counter" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Ford",
"doc_count" : 2,
"reverseNestedAgg" : {
"doc_count" : 2,
"metadata" : {
"doc_count" : 4,
"sum" : {
"value" : 3000.0
}
}
}
},
{
"key" : "Holden",
"doc_count" : 1,
"reverseNestedAgg" : {
"doc_count" : 1,
"metadata" : {
"doc_count" : 2,
"sum" : {
"value" : 2500.0
}
}
}
}
]
}
}
}
}
}
Note that I've tested the above query in ES version 7.
Important Note:
If your document ends up in the format below, the above query would not give the right result, because the sum would include every int_value in the document, not just the price.
POST /cars/_doc/1
{
"metadata": [
{
"key": "model",
"str_value": "Ford"
},
{
"key": "price",
"int_value": 1000
},
{
"key": "something else",
"int_value": 1000
}
]
}
// There are three nested documents, two of which have an int_value field
I see you mentioned that you want to avoid mapping explosion, which is why your schema is the way it is. However, if the above scenario occurs, you may want to take a step back and redesign your model, or have your service layer handle this aggregation.
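As a rough illustration of the service-layer option, the sketch below sums only the entry whose key is "price" for each model, so extra int_value entries are ignored. It assumes the documents have already been fetched as Python dicts; the function name is hypothetical.

```python
from collections import defaultdict


def sum_prices_by_model(docs):
    """Sum the 'price' int_value per 'model' str_value across documents."""
    totals = defaultdict(int)
    for doc in docs:
        # Index the nested entries by their key for direct lookup.
        entries = {entry["key"]: entry for entry in doc.get("metadata", [])}
        model = entries.get("model", {}).get("str_value")
        price = entries.get("price", {}).get("int_value")
        if model is not None and price is not None:
            totals[model] += price
    return dict(totals)


docs = [
    {"metadata": [
        {"key": "model", "str_value": "Ford"},
        {"key": "price", "int_value": 1000},
        {"key": "something else", "int_value": 1000},
    ]},
    {"metadata": [
        {"key": "model", "str_value": "Ford"},
        {"key": "price", "int_value": 2000},
    ]},
    {"metadata": [
        {"key": "model", "str_value": "Holden"},
        {"key": "price", "int_value": 2500},
    ]},
]
print(sum_prices_by_model(docs))
# {'Ford': 3000, 'Holden': 2500}
```

This trades query-time aggregation for fetching the documents, so it is only practical when the result set is small enough to page through.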
Hope this helps!

filtering on 2 values of same field

I have a status field, which can have one of the following values,
I can filter for data which has status completed. I can also see data which has status ongoing.
But I want to display the data with status completed and ongoing at the same time.
But I don't know how to add a filter for two values on a single field.
How can I achieve what I want?
EDIT - Thanks for the answers, but that is not what I wanted.
Just as I have filtered for status:completed here, I want to filter for two values in exactly this way.
I know I can edit this filter and use your queries, but I need a simple way to do this (the query way is complex), as I have to show it to my marketing team and they don't have any idea about queries. I need to convince them.
If I understand your question correctly, you want to perform an aggregation on 2 values of a field.
This should be possible with a query similar to this one with a terms query:
{
"size" : 0,
"query" : {
"bool" : {
"must" : [ {
"terms" : {
"status" : [ "completed", "unpaid" ]
}
} ]
}
},
"aggs" : {
"freqs" : {
"terms" : {
"field" : "status"
}
}
}
}
This will give a result like this one:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"freqs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "unpaid",
"doc_count" : 4
}, {
"key" : "completed",
"doc_count" : 1
} ]
}
}
}
Here is my toy mapping definition:
{
"bookings" : {
"properties" : {
"status" : {
"type" : "keyword"
}
}
}
}
You need a filter aggregation.
{
"size": 0,
"aggs": {
"agg_name": {
"filter": {
"bool": {
"should": [
{
"terms": {
"status": [
"completed",
"ongoing"
]
}
}
]
}
}
}
}
}
Use the above query to get results like this:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": []
},
"aggregations": {
"agg_name": {
"doc_count": 6
}
}
}
The result you want is the doc_count.
For your reference, in an Elasticsearch bool query, should behaves like an OR condition:
{
"query":{
"bool":{
"should":[
{"term":{"status":"completed"}},
{"term":{"status":"ongoing"}}
]
}
},
"aggs" : {
"booking_status" : {
"terms" : {
"field" : "status"
}
}
}
}
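If the list of statuses varies, the same OR-style request body could be generated rather than written by hand. This is a minimal sketch that only builds the JSON body; the function name is hypothetical, and minimum_should_match is set explicitly so that at least one clause has to match.

```python
import json


def status_or_query(statuses, field="status"):
    """Build a bool/should request body matching any of the given values."""
    return {
        "query": {
            "bool": {
                "should": [{"term": {field: value}} for value in statuses],
                "minimum_should_match": 1,
            }
        },
        "aggs": {"booking_status": {"terms": {"field": field}}},
    }


body = status_or_query(["completed", "ongoing"])
print(json.dumps(body, indent=2))
```

The printed body can be sent to the _search endpoint with whatever client is already in use.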
