I have a question about aggregation.
I want to do aggregation for a field declared as an object array.
It is not aggregation for each element, but aggregation for the whole value.
I have following documents:
PUT value-list-index
{
"mappings": {
"properties": {
"server": {
"type": "keyword"
},
"users": {
"type": "keyword",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
PUT value-list-index/_doc/1
{
"server": "server1",
"users": ["user1"]
}
PUT value-list-index/_doc/2
{
"server": "server2",
"users": ["user1","user2"]
}
PUT value-list-index/_doc/3
{
"server": "server3",
"users": ["user2", "user3"]
}
PUT value-list-index/_doc/4
{
"server": "server4",
"users": ["user1","user2", "user3","user4"]
}
PUT value-list-index/_doc/5
{
"server": "server5",
"users": ["user2", "user3","user4"]
}
PUT value-list-index/_doc/6
{
"server": "server6",
"users": ["user3","user4"]
}
PUT value-list-index/_doc/7
{
"server": "server7",
"users": ["user1","user2", "user3","user4"]
}
PUT value-list-index/_doc/8
{
"server": "server8",
"users": ["user1","user2", "user3","user4"]
}
PUT value-list-index/_doc/9
{
"server": "server9",
"users": ["user1","user2", "user3","user4"]
}
get value-list-index/_search
{
"size" : 0,
"aggs": {
"words": {
"terms": {
"field": "users"
},
"aggs": {
"total": {
"value_count": {
"field": "users"
}
}
}
}
}
}
i want following
"aggregations" : {
"words" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
**"key" : "user1",
"doc_count" : 1,**
"total" : {
"value" : xx
}
},
{
**"key" : "user1","user2",
"doc_count" : 1,**
"total" : {
"value" : xx
}
},
{
"key" : "user1","user2","user3","user4",
"doc_count" : 4,
"total" : {
"value" : xx
}
}
]
}
}
but return each element grouping result like this
"aggregations" : {
"words" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "user2",
"doc_count" : 7,
"total" : {
"value" : 23
}
},
{
"key" : "user3",
"doc_count" : 7,
"total" : {
"value" : 23
}
},
{
"key" : "user1",
"doc_count" : 6,
"total" : {
"value" : 19
}
},
{
"key" : "user4",
"doc_count" : 6,
"total" : {
"value" : 21
}
}
]
}
}
Is the aggregation I want possible?
Maybe this aggs can help you: Frequent items aggregation
But be careful with the performance.
Look this results:
"aggregations": {
"words": {
"buckets": [
{
"key": {
"users": [
"user2"
]
},
"doc_count": 7,
"support": 0.7777777777777778
},
{
"key": {
"users": [
"user2",
"user3"
]
},
"doc_count": 6,
"support": 0.6666666666666666
},
{
"key": {
"users": [
"user3",
"user4"
]
},
"doc_count": 6,
"support": 0.6666666666666666
},
{
"key": {
"users": [
"user1"
]
},
"doc_count": 6,
"support": 0.6666666666666666
},
{
"key": {
"users": [
"user2",
"user3",
"user4"
]
},
"doc_count": 5,
"support": 0.5555555555555556
},
{
"key": {
"users": [
"user2",
"user1"
]
},
"doc_count": 5,
"support": 0.5555555555555556
},
{
"key": {
"users": [
"user2",
"user3",
"user4",
"user1"
]
},
"doc_count": 4,
"support": 0.4444444444444444
}
]
}
}
Related
I have documents with the following structure (very much simplified for the example):
"documents": [
{
"name": "Document 1",
"collections" : [
{
"id": 30,
"title" : "Research"
},
{
"id": 45,
"title" : "Events"
},
{
"id" : 52,
"title" : "International"
}
]
},
{
"name": "Document 2",
"collections" : [
{
"id": 45,
"title" : "Events"
},
{
"id" : 63,
"title" : "Development"
}
]
}
]
I want an aggregation of the collection. It works fine when I do it like this:
"aggs": {
"collections": {
"terms": {
"field": "collections.title",
"size": 30
}
}
}
I get a nice result as expected:
"aggregations" : {
"collections" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Research",
"doc_count" : 18
},
{
"key" : "Events",
"doc_count" : 14
},
{
"key" : "International",
"doc_count" : 13
},
{
"key" : "Development",
"doc_count" : 8
}
]
}
}
However, I want the id included as well. So I tried this:
"aggs": {
"collections": {
"terms": {
"field": "collections.title",
"size": 30
}
},
"aggs": {
"id": {
"terms": {
"field": "collections.id",
"size": 1
}
}
}
}
This is the result:
"aggregations" : {
"collections" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Research",
"doc_count" : 18,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "30",
"doc_count" : 1
}
]
}
},
{
"key" : "Events",
"doc_count" : 14,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "45",
"doc_count" : 1
}
]
}
},
{
"key" : "International",
"doc_count" : 13,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "52",
"doc_count" : 1
}
]
}
},
{
"key" : "Development",
"doc_count" : 8,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "45",
"doc_count" : 1
}
]
}
}
]
}
}
At glance it looks good. But at a closer look the at the last element with Development (scroll down). The id should be 63, but is 45.
I have vague idea why this is, but I cannot find a solution for it. I also tried the multi_terms, but it gives a similar result. I think the issue has to do with the fact there are multiple collections within the document.
Does anyone know the correct solution to solve this issue?
The reason is in an object type mapping there is no relation between "title" and "id" , everything is flatenned by Elasticsearch under the hood, so:
"collections" : [
{
"id": 30,
"title" : "Research"
},
{
"id": 45,
"title" : "Events"
},
{
"id" : 52,
"title" : "International"
}
]
Becomes:
"collections.id": [30,45,52],
"collections.title": [Research, Events, International]
Elasticsearch doesn't know id 30 belongs to Research, or id 45 to Events.
You must use "nested" type to keep the relation between nested properties.
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
Solution: Use nested field type
Mappings
PUT test_nestedaggs
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"collections": {
"type": "nested",
"properties": {
"title": {
"type": "keyword"
},
"id": {
"type": "keyword"
}
}
}
}
}
}
Documents
POST test_nestedaggs/_doc
{
"name": "Document 1",
"collections": [
{
"id": 30,
"title": "Research"
},
{
"id": 45,
"title": "Events"
},
{
"id": 52,
"title": "International"
}
]
}
POST test_nestedaggs/_doc
{
"name": "Document 2",
"collections": [
{
"id": 45,
"title": "Events"
},
{
"id": 63,
"title": "Development"
}
]
}
Query
POST test_nestedaggs/_search?size=0
{
"aggs": {
"nested_collections": {
"nested": {
"path": "collections"
},
"aggs": {
"collections": {
"terms": {
"field": "collections.title"
},
"aggs": {
"ids": {
"terms": {
"field": "collections.id"
}
}
}
}
}
}
}
}
Results
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"nested_collections": {
"doc_count": 5,
"collections": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Events",
"doc_count": 2,
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "45",
"doc_count": 2
}
]
}
},
{
"key": "Development",
"doc_count": 1,
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "63",
"doc_count": 1
}
]
}
},
{
"key": "International",
"doc_count": 1,
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "52",
"doc_count": 1
}
]
}
},
{
"key": "Research",
"doc_count": 1,
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "30",
"doc_count": 1
}
]
}
}
]
}
}
}
}
You can read an article I wrote about that for details:
https://opster.com/guides/elasticsearch/data-architecture/elasticsearch-nested-field-object-field/
NOTE: If the number of child documents is too big and you are doing a lot of updates, consider changing the data model because each child document is an independent document in the index, and on each update on a child document the whole structure will reindex and that may affect the performance, there are also limits in the maximum of nested documents you can add. If the number is small like the example then it's fine.
My Elasticsearch index contains products with a denormalized m:n relationship to categories.
My goal is to derive a categories index from it which contains the same information, but with the relationship inverted.
The index looks like this:
PUT /products
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"article_id": {
"type": "keyword"
},
"categories": {
"type": "nested",
"properties": {
"cat_name": {
"type": "keyword"
}
}
}
}
}
}
containing documents created like this:
POST /products/_doc
{
"name": "radio",
"article_id": "1001",
"categories": [
{ "cat_name": "audio" },
{ "cat_name": "electronics" }
]
}
POST /products/_doc
{
"name": "fridge",
"article_id": "1002",
"categories": [
{ "cat_name": "appliances" },
{ "cat_name": "electronics" }
]
}
I would like to get something like this back from Elasticsearch:
{
"name": "appliances",
"products": [
{
"name": "fridge",
"article_id": "1002"
}
]
},
{
"name": "audio",
"products": [
{
"name": "radio",
"article_id": "1001"
}
]
},
{
"name": "electronics",
"products": [
{
"name": "fridge",
"article_id": "1002"
},
{
"name": "radio",
"article_id": "1001"
}
]
}
which would eventually be put into an index such as:
PUT /categories
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"products": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"article_id": {
"type": "keyword"
}
}
}
}
}
}
I cannot figure out how to do this without loading and grouping all products programmatically.
Here's what I have tried:
Bucket aggregation on field categories.cat_name
This gives me the document count per category but not the product documents. Using top_hits sub-aggregation seems to be limited to 100 documents.
Group using collapse field with expansion
Collapsing is only possible on a single-valued field.
I'm using Elasticsearch 8.1.
The query you need is this one:
POST products/_search
{
"size": 0,
"aggs": {
"cats": {
"nested": {
"path": "categories"
},
"aggs": {
"categories": {
"terms": {
"field": "categories.cat_name",
"size": 10
},
"aggs": {
"root": {
"reverse_nested": {},
"aggs": {
"products": {
"terms": {
"field": "name",
"size": 10
}
}
}
}
}
}
}
}
}
}
Which produces exactly what you need (less the article id, but that's easy):
"buckets" : [
{
"key" : "electronics",
"doc_count" : 2,
"root" : {
"doc_count" : 2,
"products" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "fridge",
"doc_count" : 1
},
{
"key" : "radio",
"doc_count" : 1
}
]
}
}
},
{
"key" : "appliances",
"doc_count" : 1,
"root" : {
"doc_count" : 1,
"products" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "fridge",
"doc_count" : 1
}
]
}
}
},
{
"key" : "audio",
"doc_count" : 1,
"root" : {
"doc_count" : 1,
"products" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "radio",
"doc_count" : 1
}
]
}
}
}
]
I have use case where I need to get all unique user ids from Elasticsearch and it should be sorted by timestamp.
What I'm using currently is composite term aggregation with sub aggregation which will return the latest timestamp.
(I can't sort it in client side as it slow down the script)
Sample data in elastic search
{
"_index": "logstash-2020.10.29",
"_type": "doc",
"_id": "L0Urc3UBttS_uoEtubDk",
"_version": 1,
"_score": null,
"_source": {
"#version": "1",
"#timestamp": "2020-10-29T06:56:00.000Z",
"timestamp_string": "1603954560",
"search_query": "example 3",
"user_uuid": "asdfrghcwehf",
"browsing_url": "https://www.google.com/search?q=example+3",
},
"fields": {
"#timestamp": [
"2020-10-29T06:56:00.000Z"
]
},
"sort": [
1603954560000
]
}
Expected Output:
[
{
"key" : "bjvexyducsls",
"doc_count" : 846,
"1" : {
"value" : 1.603948557E12,
"value_as_string" : "2020-10-29T05:15:57.000Z"
}
},
{
"key" : "lhmsbq2osski",
"doc_count" : 420,
"1" : {
"value" : 1.6039476E12,
"value_as_string" : "2020-10-29T05:00:00.000Z"
}
},
{
"key" : "m2wiaufcbvvi",
"doc_count" : 1,
"1" : {
"value" : 1.603893635E12,
"value_as_string" : "2020-10-28T14:00:35.000Z"
}
},
{
"key" : "rrm3vd5ovqwg",
"doc_count" : 1,
"1" : {
"value" : 1.60389362E12,
"value_as_string" : "2020-10-28T14:00:20.000Z"
}
},
{
"key" : "x42lk4t3frfc",
"doc_count" : 72,
"1" : {
"value" : 1.60389318E12,
"value_as_string" : "2020-10-28T13:53:00.000Z"
}
}
]
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings":{
"properties":{
"user":{
"type":"keyword"
},
"date":{
"type":"date"
}
}
}
}
Index Data:
{
"date": "2015-01-01",
"user": "user1"
}
{
"date": "2014-01-01",
"user": "user2"
}
{
"date": "2015-01-11",
"user": "user3"
}
Search Query:
{
"size": 0,
"aggs": {
"user_id": {
"terms": {
"field": "user",
"order": {
"sort_user": "asc"
}
},
"aggs": {
"sort_user": {
"min": {
"field": "date"
}
}
}
}
}
}
Search Result:
"aggregations": {
"user_id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "user2",
"doc_count": 1,
"sort_user": {
"value": 1.3885344E12,
"value_as_string": "2014-01-01T00:00:00.000Z"
}
},
{
"key": "user1",
"doc_count": 1,
"sort_user": {
"value": 1.4200704E12,
"value_as_string": "2015-01-01T00:00:00.000Z"
}
},
{
"key": "user3",
"doc_count": 1,
"sort_user": {
"value": 1.4209344E12,
"value_as_string": "2015-01-11T00:00:00.000Z"
}
}
]
}
How to go about bucketing on a field and then aggregating all the values of a different field into an array. Here's a sample list.
{
"product": "xyz",
"action": "add",
"user": "bob"
},
{
"product": "xyz",
"action": "update",
"user": "bob"
},
{
"product": "xyz",
"action": "add",
"user": "alice"
},
{
"product": "xyz",
"action": "add",
"user": "eve"
},
{
"product": "xyz",
"action": "delete",
"user": "eve"
}
Expected output:
{
"buckets": [
{
"key": "add",
"doc_count": 3,
"user": ["bob", "alice", "eve"]
},
{
"key": "update",
"doc_count": 1,
"user": ["bob"]
},
{
"key": "delete",
"doc_count": 1,
"user": ["eve"]
}
]
}
How to push user values to an array in each bucket? Is there something similar to mongodb $push or $addToFields in elastic aggregation? Appreciate the help.
Here's the work-in-progress aggregation.
{
"size": 0,
"aggs": {
"product_filter": {
"filter": {
"term": {
"product": "xyz"
}
},
"aggs": {
"group_by_action": {
"terms": {
"field": "action",
"size":1000,
"order": {
"_count": "desc"
}
}
}
}
}
}
}
Would this do? I just added chained one more Terms Aggregation as mentioned below:
Aggregation Query:
POST <your_index_name>
{
"size": 0,
"aggs": {
"product_filter": {
"filter": {
"term": {
"product": "xyz"
}
},
"aggs": {
"group_by_action": {
"terms": {
"field": "action",
"size":1000,
"order": {
"_count": "desc"
}
},
"aggs": {
"myUsers": {
"terms": {
"field": "user",
"size": 10
}
}
}
}
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"product_filter" : {
"doc_count" : 5,
"group_by_action" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "add",
"doc_count" : 3,
"myUsers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "alice",
"doc_count" : 1
},
{
"key" : "bob",
"doc_count" : 1
},
{
"key" : "eve",
"doc_count" : 1
}
]
}
},
{
"key" : "delete",
"doc_count" : 1,
"myUsers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "eve",
"doc_count" : 1
}
]
}
},
{
"key" : "update",
"doc_count" : 1,
"myUsers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "bob",
"doc_count" : 1
}
]
}
}
]
}
}
}
}
I'm not sure if it is possible to have them in a single list as you've mentioned.
Hope this helps!
Say we have following documents in elasticsearch:
[{
"person": {
'name': 'asqar'
},
"bill": [
{
code:2,
value: 210
},
{
code:3,
value: 330
},
{
code:8,
value: 220
},
]
},
{
"person": {
'name': 'asqar'
},
"bill": [
{
code:2,
value: 340
},
{
code:4,
value: 340
},
{
code:1,
value: 200
},
]
},
{
"person": {
'name': 'asqar'
},
"bill": [
{
code:2,
value: 810
},
{
code:4,
value: 630
},
{
code:8,
value: 220
},
]
}]
I want to apply aggregate function on specific object in the bill array iwth some condition, for example I want calculate avg of value which its code is 2.
Field bill needs to be created as nested object to filter on it.
You can then use filter aggregation
Mapping:
PUT testindex/_mapping
{
"properties": {
"person": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"bill": {
"type": "nested",
"properties": {
"code": {
"type": "integer"
},
"value":{
"type": "double"
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "422tAWsBd-1D6Ztt1_Tb",
"_score" : 1.0,
"_source" : {
"person" : {
"name" : "asqar"
},
"bill" : [
{
"code" : 2,
"value" : 210
},
{
"code" : 3,
"value" : 330
},
{
"code" : 8,
"value" : 220
}
]
}
},
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "5G2uAWsBd-1D6ZttpfR9",
"_score" : 1.0,
"_source" : {
"person" : {
"name" : "asqar"
},
"bill" : [
{
"code" : 2,
"value" : 340
},
{
"code" : 4,
"value" : 340
},
{
"code" : 1,
"value" : 200
}
]
}
},
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "5W2vAWsBd-1D6ZttQfQ_",
"_score" : 1.0,
"_source" : {
"person" : {
"name" : "asqar"
},
"bill" : [
{
"code" : 2,
"value" : 810
},
{
"code" : 4,
"value" : 630
},
{
"code" : 8,
"value" : 220
}
]
}
}
]
Query:
GET testindex/_search
{
"size": 0,
"aggs": {
"terms_agg": {
"terms": {
"field": "person.name.keyword"
},
"aggs": {
"bill": {
"nested": {
"path": "bill"
},
"aggs": {
"bill_code": {
"filter": {
"term": {
"bill.code": 2
}
},
"aggs": {
"average": {
"avg": {
"field": "bill.value"
}
}
}
}
}
}
}
}
}
}
Output:
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"terms_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "asqar",
"doc_count" : 3,
"bill" : {
"doc_count" : 9,
"bill_code" : {
"doc_count" : 3,
"average" : {
"value" : 453.3333333333333
}
}
}
}
]
}
}
You need to first make sure that the bill field is of nested type. Then you can use nested aggregation to deal with nested documents. You can use terms aggregation on bill.code and a child avg aggregation on field bill.value to this terms aggregation. This will give you average value for each code. Now since you want only aggregation against the code 2, you can make use of bucket selector aggregation to filter and get only bucket with code 2.
So the final aggregation query will look as below:
{
"aggs": {
"VALUE_NESTED": {
"nested": {
"path": "bill"
},
"aggs": {
"VALUE_TERM": {
"terms": {
"field": "bill.code"
},
"aggs": {
"VALUE_AVG": {
"avg": {
"field": "bill.value"
}
},
"CODE": {
"max": {
"field": "bill.code"
}
},
"CODE_FILTER": {
"bucket_selector": {
"buckets_path": {
"code": "CODE"
},
"script": "params.code == 2"
}
}
}
}
}
}
}
}
Sample o/p for above:
"aggregations": {
"VALUE_NESTED": {
"doc_count": 9,
"VALUE_TERM": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 2,
"doc_count": 3,
"CODE": {
"value": 2
},
"VALUE_AVG": {
"value": 453.3333333333333
}
}
]
}
}
}