Count the number of elements in a list field in Elasticsearch - elasticsearch

I'm still learning to use DSL queries in Elasticsearch. I have documents where one field is a list. I need to count the number of documents that have one element in this field, two elements in this field, etc. For example, here is a document structure:
Document1:
"Volume": [
{
"partition": "s1",
"fieldtype": ["A","B"]
}
]
Document 2:
"Volume": [
{
"partition": "s1",
"fieldtype": ["A"]
}
]
Document 3:
"Volume": [
{
"partition": "s1",
"fieldtype": ["B"]
}
]
I need a way to calculate that there is one document with 2 elements in fieldtype field and 2 documents with one element in fieldtype.
If I try to aggregate them like this:
"size":0,
"aggs": {
"name": {
"terms": {
"field": "fieldtype.keyword"
}
}
}
I get counts of elements (number of As and Bs). Without using keyword, I get an error.

@rabbitbr provided a good answer, but I could not understand why it uses a nested field. I also think we need a terms aggregation here instead of sum. Anyhow, here is a solution without nested:
PUT idx_test
POST idx_test/_bulk
{"index":{ "_id": 1}}
{"Volume":[{"partition": "s1","fieldtype": ["A","B"]}]}
{"index":{ "_id": 2}}
{"Volume":[{"partition": "s1","fieldtype": ["A"]}]}
{"index":{ "_id": 3}}
{"Volume":[{"partition": "s1","fieldtype": ["B"]}]}
GET idx_test/_mapping
GET idx_test/_search
{
"size": 0,
"aggs": {
"size": {
"terms": {
"script": {
"lang": "painless",
"source": "doc['Volume.fieldtype.keyword'].size()"
}
}
}
}
}
Without using keyword, I get an error.
This is normal because without keyword you are trying to build an aggregation on a field whose type is text.
Here is the response for the query above, which is a pretty basic query:
{
....
"aggregations": {
"size": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 2
},
{
"key": "2",
"doc_count": 1
}
]
}
}
}
As you can see, we have 2 documents with 1 sized array and we have 1 document with 2 sized array.
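To make the bucketing explicit, here is a plain-Python sketch of what the scripted terms aggregation computes on the three sample documents. It is not Elasticsearch code. The helper name `fieldtype_size` is made up for illustration, and it assumes that a keyword field's doc values hold each distinct value once per document.

```python
from collections import Counter

# Sample documents copied from the question.
docs = [
    {"Volume": [{"partition": "s1", "fieldtype": ["A", "B"]}]},
    {"Volume": [{"partition": "s1", "fieldtype": ["A"]}]},
    {"Volume": [{"partition": "s1", "fieldtype": ["B"]}]},
]


def fieldtype_size(doc):
    """Hypothetical stand-in for doc['Volume.fieldtype.keyword'].size().

    Assumption: without a nested mapping, the values of every Volume entry
    end up in one multi-valued field, and duplicates are stored once.
    """
    values = {v for volume in doc["Volume"] for v in volume["fieldtype"]}
    return len(values)


# One bucket per distinct size, counting how many documents fall into it.
size_buckets = Counter(fieldtype_size(d) for d in docs)
print(dict(size_buckets))
```

Each key plays the role of a bucket key and each count the role of doc_count in the response above.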

I assumed that you are working with the nested type. Here is my solution:
PUT idx_test
{
"mappings": {
"properties": {
"Volume": {
"type": "nested"
}
}
}
}
POST idx_test/_bulk
{"index":{ "_id": 1}}
{"Volume":[{"partition": "s1","fieldtype": ["A","B"]}]}
{"index":{ "_id": 2}}
{"Volume":[{"partition": "s1","fieldtype": ["A"]}]}
{"index":{ "_id": 3}}
{"Volume":[{"partition": "s1","fieldtype": ["B"]}]}
GET idx_test/_search
{
"size": 0,
"aggs": {
"doc_id": {
"terms": {
"field": "_id",
"size": 10
},
"aggs": {
"volumes": {
"nested": {
"path": "Volume"
},
"aggs": {
"size": {
"sum": {
"script": {
"lang": "painless",
"source": "doc['Volume.fieldtype.keyword'].size()"
}
}
}
}
}
}
}
}
}
Response:
"aggregations" : {
"doc_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"volumes" : {
"doc_count" : 1,
"size" : {
"value" : 2.0
}
}
},
{
"key" : "2",
"doc_count" : 1,
"volumes" : {
"doc_count" : 1,
"size" : {
"value" : 1.0
}
}
},
{
"key" : "3",
"doc_count" : 1,
"volumes" : {
"doc_count" : 1,
"size" : {
"value" : 1.0
}
}
}
]
}
}

Related

Aggregation on Latest Records Of same status in ElasticSearch

I Have following data in ElasticSearch index some_index.
[ {
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 1,
"cart_status": "new",
"grandTotal": 12,
"event": "some_event",
"timestamp": "2022-12-01T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 1,
"cart_status": "paid",
"grandTotal": 12,
"event": "some_event",
"timestamp": "2022-12-02T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 2,
"cart_status": "new",
"grandTotal": 23,
"event": "some_event",
"timestamp": "2022-12-01T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 2,
"cart_status": "paid",
"grandTotal": 23,
"event": "some_event",
"timestamp": "2022-12-04T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 3,
"cart_status": "new",
"grandTotal": 17,
"event": "some_event",
"timestamp": "2022-12-01T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 3,
"cart_status": "new",
"grandTotal": 17,
"event": "some_event",
"timestamp": "2022-12-04T00:00:00.000Z"
}
}
}
]
What I want to get is sum of the grandTotals by the latest cart_statuses of each cart within a given time range.
Given the example above, the result for timestamp >= 2022-12-01 00:00:00 and timestamp <= 2022-12-03 00:00:00 should be something like:
cart_status: new, sum grandTotal: 40, because within that time range "new" is the latest status of cart_id 2 and cart_id 3;
cart_status: paid, sum grandTotal: 12, because "paid" is the latest status of cart_id 1 only.
What I tried is to use sub-aggregation on top_result, top_hits but ElasticSearch complains that "Aggregator [top_result] of type [top_hits] cannot accept sub-aggregations"
Besides I tried with collapse as well to get the latest by status, but according to docs there is also no possibility to aggregate over the results of collapse.
Can someone please help me solve this? It seems like a common calculation, but it is not trivial in Elasticsearch.
In SQL this is quite easy with window functions.
I want to avoid persisting intermediate data into another index. Because I need the dynamic query, as the users may want to get their calculations for any time range.
You can try the following approach. Note that for the "new" status the sum will be 52, because the query includes cart_id 1, which has a "new" record in the given time range, along with carts 2 and 3.
Mappings:
PUT some_index
{
"mappings" : {
"properties": {
"timestamp" : {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||strict_date_optional_time||epoch_millis"
},
"cart_id" : {
"type": "keyword"
},
"cart_status" : {
"type": "keyword"
},
"grand_total" : {
"type": "long"
},
"event":{
"type": "keyword"
}
}
}
}
Bulk Insert:
POST _bulk
{ "index" : { "_index" : "some_index", "_id" : "1" } }
{ "cart_id" : "1" , "grand_total":12, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "2" } }
{ "cart_id" : "1" , "grand_total":12, "cart_status" : "paid","timestamp":"2022-12-02T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "3" } }
{ "cart_id" : "2" , "grand_total":23, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "4" } }
{ "cart_id" : "2" , "grand_total":23, "cart_status" : "paid","timestamp":"2022-12-04T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "5" } }
{ "cart_id" : "3" , "grand_total":17, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "6" } }
{ "cart_id" : "3" , "grand_total":17, "cart_status" : "new","timestamp":"2022-12-04T00:00:00.000Z", "event" : "some_event"}
Query:
GET some_index/_search
{
"size":0,
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": "2022-12-01 00:00:00",
"lte": "2022-12-03 00:00:00"
}
}
}
]
}
},
"aggs": {
"card_status": {
"terms": {
"field": "cart_status"
},
"aggs": {
"grandTotal": {
"sum": {
"field": "grand_total"
}
}
}
}
}
}
Output:
{
"took": 86,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"card_status": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "new",
"doc_count": 3,
"grandTotal": {
"value": 52
}
},
{
"key": "paid",
"doc_count": 1,
"grandTotal": {
"value": 12
}
}
]
}
}
}
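For comparison, the calculation the question describes (keep only the latest record per cart inside the time range, then sum grandTotal by status) can be sketched in plain Python. This is only a reference for the expected numbers, not an Elasticsearch query, and the function name `sum_by_latest_status` is invented for the example.

```python
from collections import defaultdict
from datetime import datetime

# Records copied from the question (only the fields the calculation needs).
records = [
    {"cart_id": 1, "cart_status": "new", "grandTotal": 12, "timestamp": "2022-12-01T00:00:00.000Z"},
    {"cart_id": 1, "cart_status": "paid", "grandTotal": 12, "timestamp": "2022-12-02T00:00:00.000Z"},
    {"cart_id": 2, "cart_status": "new", "grandTotal": 23, "timestamp": "2022-12-01T00:00:00.000Z"},
    {"cart_id": 2, "cart_status": "paid", "grandTotal": 23, "timestamp": "2022-12-04T00:00:00.000Z"},
    {"cart_id": 3, "cart_status": "new", "grandTotal": 17, "timestamp": "2022-12-01T00:00:00.000Z"},
    {"cart_id": 3, "cart_status": "new", "grandTotal": 17, "timestamp": "2022-12-04T00:00:00.000Z"},
]

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def sum_by_latest_status(records, start, end):
    """Sum grandTotal per status, using each cart's latest record in [start, end]."""
    latest = {}  # cart_id -> (timestamp, record)
    for record in records:
        ts = datetime.strptime(record["timestamp"], TIMESTAMP_FORMAT)
        if not (start <= ts <= end):
            continue  # outside the requested time range
        current = latest.get(record["cart_id"])
        if current is None or ts > current[0]:
            latest[record["cart_id"]] = (ts, record)

    totals = defaultdict(int)
    for _, record in latest.values():
        totals[record["cart_status"]] += record["grandTotal"]
    return dict(totals)


totals = sum_by_latest_status(records, datetime(2022, 12, 1), datetime(2022, 12, 3))
print(totals)
```

The query above sums every record in the range, which is why it reports 52 for "new" rather than the 40 the question asks for.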

Sort Aggregated Buckets From Nested Object Array By Specific Field

I have indexed documents such as
// doc 1
{
...,
"list": [{
"value": "a",
"order": 1
}, {
"value": "b",
"order": 2
}]
,...
}
// doc 2
{
...,
"list": [{
"value": "b",
"order": 2
}, {
"value": "c",
"order": 3
}]
,...
}
If I use the aggregation on the list.value:
{
"aggs": {
"values": {
"terms": {
"field": "list.value.keyword"
}
}
}
}
I get buckets in order b, a, c:
"buckets" : [
{
"key" : "b",
"doc_count" : 2
},
{
"key" : "a",
"doc_count" : 1
},
{
"key" : "c",
"doc_count" : 1
}
]
as keys would be sorted by the _count in desc order.
If I use the aggregation on the list.value with sub-aggregation for sorting in form of max(list.order):
{
"aggs": {
"values": {
"terms": {
"field": "list.value.keyword",
"order": { "max_order": "desc" }
},
"aggs": {
"max_order": { "max": { "field": "list.order" } }
}
}
}
}
I get buckets in order b, c, a
"buckets" : [
{
"key" : "b",
"doc_count" : 2,
"max_order" : {
"value" : 3.0
}
},
{
"key" : "c",
"doc_count" : 1,
"max_order" : {
"value" : 3.0
}
},
{
"key" : "a",
"doc_count" : 1,
"max_order" : {
"value" : 2.0
}
}
]
as both b and c have max order 3 in their lists of the object.
However, I want to write a query to get buckets in order c, b, a as their order is 3, 2, 1 respectively. How to achieve that?
You need to use a nested aggregation to get the buckets in the order c, b, a.
Adding a working example with index data, mapping, search query and search result
Index Mapping
PUT testidx1
{
"mappings":{
"properties": {
"list":{
"type": "nested"
}
}
}
}
Index Data:
POST testidx1/_doc/1
{
"list": [
{
"value": "a",
"order": 1
},
{
"value": "b",
"order": 2
}
]
}
POST testidx1/_doc/2
{
"list": [
{
"value": "b",
"order": 2
},
{
"value": "c",
"order": 3
}
]
}
Search Query:
POST testidx1/_search
{
"size": 0,
"aggs": {
"resellers": {
"nested": {
"path": "list"
},
"aggs": {
"unique_values": {
"terms": {
"field": "list.value.keyword",
"order": {
"max_order": "desc"
}
},
"aggs": {
"max_order": {
"max": {
"field": "list.order"
}
}
}
}
}
}
}
}
Search Response:
"aggregations" : {
"resellers" : {
"doc_count" : 4,
"unique_values" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "c",
"doc_count" : 1,
"max_order" : {
"value" : 3.0
}
},
{
"key" : "b",
"doc_count" : 2,
"max_order" : {
"value" : 2.0
}
},
{
"key" : "a",
"doc_count" : 1,
"max_order" : {
"value" : 1.0
}
}
]
}
}
}
}
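The difference between the two results can be reproduced with a small plain-Python sketch. It is not Elasticsearch code. It only assumes that, without a nested mapping, every value in a document is paired with every order in that same document, while a nested mapping keeps each value with its own order. The function names are made up for illustration.

```python
# The two sample documents, reduced to their "list" arrays.
docs = [
    [{"value": "a", "order": 1}, {"value": "b", "order": 2}],
    [{"value": "b", "order": 2}, {"value": "c", "order": 3}],
]


def max_order_flattened(docs):
    """Object mapping: a bucket sees all order values of each matching document."""
    result = {}
    for items in docs:
        doc_max = max(item["order"] for item in items)
        for value in {item["value"] for item in items}:
            result[value] = max(result.get(value, doc_max), doc_max)
    return result


def max_order_nested(docs):
    """Nested mapping: a bucket sees only the order stored next to its own value."""
    result = {}
    for items in docs:
        for item in items:
            key, order = item["value"], item["order"]
            result[key] = max(result.get(key, order), order)
    return result


def bucket_order(max_by_key):
    """Bucket keys sorted by max_order, descending."""
    return sorted(max_by_key, key=lambda key: max_by_key[key], reverse=True)


print(max_order_flattened(docs))
print(max_order_nested(docs), bucket_order(max_order_nested(docs)))
```

In the flattened case b inherits order 3 from the second document, which is exactly the tie between b and c seen in the question.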

Count number of inner elements of array property (Including repeated values)

Given I have the following records.
[
{
"profile": "123",
"inner": [
{
"name": "John"
}
]
},
{
"profile": "456",
"inner": [
{
"name": "John"
},
{
"name": "John"
},
{
"name": "James"
}
]
}
]
I want to get something like:
"aggregations": {
"name": {
"buckets": [
{
"key": "John",
"doc_count": 3
},
{
"key": "James",
"doc_count": 1
}
]
}
}
I'm a beginner using Elasticsearch, and this seems to be a pretty simple operation to do, but I can't find how to achieve this.
If I try a simple aggs using term, it returns 2 for John, instead of 3.
Example request I'm trying:
{
"size": 0,
"aggs": {
"name": {
"terms": {
"field": "inner.name"
}
}
}
}
How can I possibly achieve this?
Additional Info: It will be used on Kibana later.
I can change mapping to whatever I want, but AFAIK Kibana doesn't like the "Nested" type. :(
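You need a value_count sub-aggregation. By default, terms only returns a doc_count (the number of documents in each bucket), whereas value_count counts the values of a given field. Note that it counts the field's values across all documents in the bucket, not only those equal to the bucket's key, which is why James shows a total of 2 in the output below.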
So, for your purposes:
{
"size": 0,
"aggs": {
"name": {
"terms": {
"field": "inner.name"
},
"aggs": {
"total": {
"value_count": {
"field": "inner.name"
}
}
}
}
}
}
Which returns:
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "John",
"doc_count" : 2,
"total" : {
"value" : 3
}
},
{
"key" : "James",
"doc_count" : 1,
"total" : {
"value" : 2
}
}
]
}
}
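To pin down the numbers the question is after, here is a plain-Python sketch that separates the two notions of counting: documents containing a name versus occurrences of a name among the inner elements. It is only a reference calculation on the sample records, not an Elasticsearch query.

```python
from collections import Counter

# Sample records copied from the question.
records = [
    {"profile": "123", "inner": [{"name": "John"}]},
    {"profile": "456", "inner": [{"name": "John"}, {"name": "John"}, {"name": "James"}]},
]

# Number of documents that contain each name at least once (what doc_count reports).
doc_counts = Counter()
for record in records:
    for name in {item["name"] for item in record["inner"]}:
        doc_counts[name] += 1

# Number of inner elements carrying each name, repeats included (the desired result).
occurrences = Counter(item["name"] for record in records for item in record["inner"])

print(dict(doc_counts))
print(dict(occurrences))
```

Only the second counter gives 3 for John and 1 for James at the same time, so any query-side solution has to count inner elements individually.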

elasticsearch nested aggregation is empty

So, I have an index in Elasticsearch 7.6, which has documents similar to this one:
{
"_index": "my-index",
"_type": "_doc",
"_id": "kjdskjwolsjj",
"_version": 1,
"_score": null,
"_source": {
"timestamp": "2018-04-22T20:11:35.0292586Z",
"batchId": "9c96d360-5549-4b3b-85c8-756330117bad",
"userId": "id-001-001",
"things": [
{
"id": 650055867,
"name": "green"
},
{
"id": 523,
"name": "eggs"
},
{
"id": 1269,
"name": "ham"
}
]
}
}
Of course, this is just one document of many in the index. I would like to create an aggregate bucket of all the "things" in my index, so that I could sub aggregate against that bucket.
My agg query looks like this:
{
"aggs": {
"all_things": {
"nested": {
"path": "_source.things"
}
}
}
}
(BTW ... if I used just "things" as the nested path, it complains "[nested] nested path [things] is not nested".)
Finally the result (using the Kibana console) is:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1408,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"all_things" : {
"doc_count" : 0
}
}
}
Could someone explain why I get no docs in my bucket? Or perhaps a decent way to create a bucket of all my "things"?
Thanks.
You've gotta index your things as nested:
PUT my-index
{
"mappings": {
"properties": {
"things": {
"type": "nested"
}
}
}
}
POST my-index/_doc
{
"timestamp": "2018-04-22T20:11:35.0292586Z",
"batchId": "9c96d360-5549-4b3b-85c8-756330117bad",
"userId": "id-001-001",
"things": [
{
"id": 650055867,
"name": "green"
},
{
"id": 523,
"name": "eggs"
},
{
"id": 1269,
"name": "ham"
}
]
}
Then and only then will your nested aggs work:
GET my-index/_search
{
"size": 0,
"aggs": {
"things_ids": {
"nested": {
"path": "things"
},
"aggs": {
"things_ids": {
"cardinality": {
"field": "things.id"
}
}
}
}
}
}
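As background on why the mapping matters: with the default object mapping, an array of objects is flattened into multi-valued fields, so there are no per-object hidden documents for a nested aggregation to visit. A rough plain-Python sketch of that flattening, with a made-up helper name `flatten_object_array`, might look like this:

```python
from collections import defaultdict


def flatten_object_array(field, objects):
    """Illustrative only: collect each sub-field of an object array into one list."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{field}.{key}"].append(value)
    return dict(flat)


# The "things" array from the sample document.
things = [
    {"id": 650055867, "name": "green"},
    {"id": 523, "name": "eggs"},
    {"id": 1269, "name": "ham"},
]

flat = flatten_object_array("things", things)
print(flat)
```

The link between an id and its name is lost in this form, whereas a nested mapping keeps each object as its own unit.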

Elasticsearch: Querying nested objects

Dear Elasticsearch experts,
I have a problem querying nested objects. Let's use the following simplified mapping:
{
"mappings" : {
"_doc" : {
"properties" : {
"companies" : {
"type": "nested",
"properties" : {
"company_id": { "type": "long" },
"name": { "type": "text" }
}
},
"title": { "type": "text" }
}
}
}
}
And put some documents in the index:
PUT my_index/_doc/1
{
"title" : "CPU release",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 2, "name" : "Intel" }
]
}
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/3
{
"title" : "GPU release 2018-03-01",
"companies" : [
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/4
{
"title" : "Chipset release",
"companies" : [
{ "company_id" : 2, "name" : "Intel" }
]
}
Now I want to execute queries like this:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": {
"bool": {
"must": [
{ "match": { "companies.name": "AMD" } }
]
}
},
"inner_hits" : {}
}
}
]
}
}
}
As a result, I want to get the matching companies with the number of matching documents. So the above query should give me:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 }
]
The following query:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": { "match_all": {} },
"inner_hits" : {}
}
}
]
}
}
}
should give me all companies assigned to a document whose title contains "GPU", with the number of matching documents:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 },
{ "company_id" : 3, "name" : "Nvidia", "matched_documents": 2 }
]
Is there any possibility with good performance to achieve this result? I'm explicitly not interested in the matching documents, only in the number of matched documents and the nested objects.
Thanks for your help.
What you need to do in terms of Elasticsearch is:
filter "parent" documents on desired criteria (like having GPU in title, or also mentioning Nvidia in the companies list);
group "nested" documents by a certain criteria, a bucket (e.g. company_id);
count how many "nested" documents there are per each bucket.
Each of the nested objects in the array is indexed as a separate hidden document, which complicates life a bit. Let's see how to aggregate on them.
So how to aggregate and count the nested documents?
You can achieve this with a combination of a nested, terms and top_hits aggregation:
POST my_index/_doc/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "GPU"
}
},
{
"nested": {
"path": "companies",
"query": {
"match_all": {}
}
}
}
]
}
},
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
This will give the following output:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3, <== how many "nested" documents there were
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3, <== this bucket's key: "company_id": 3
"doc_count": 2, <== how many "nested" documents there were with such company_id?
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [ <== an example, "top hit" for such company_id
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
Notice that for Nvidia we have "doc_count": 2.
But what if we want to count the number of "parent" documents that have Nvidia vs. Intel?
What if we want to count parent objects based on a nested bucket?
It can be achieved with reverse_nested aggregation.
We need to change our query just a little bit:
POST my_index/_doc/_search
{
"query": { ... },
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
},
"original doc count": { <== we ask ES to count how many there are parent docs
"reverse_nested": {}
}
}
}
}
}
}
}
The result will look like this:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3,
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 2,
"original doc count": {
"doc_count": 2 <== how many "parent" documents have such company_id
},
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"original doc count": {
"doc_count": 1
},
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
How can I spot the difference?
To make the difference evident, let's change the data a bit and add another Nvidia item in the document list:
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
The last query (the one with reverse_nested) will give us the following:
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 3, <== 3 "nested" documents with Nvidia
"original doc count": {
"doc_count": 2 <== but only 2 "parent" documents
},
"Examples of such company_id": {
"hits": {
"total": 3,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 2
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
As you can see, this is a subtle difference that is hard to grasp, but it changes the semantics completely.
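The two counts can be reproduced in plain Python on the modified sample data. This is only a sketch of the counting logic, not Elasticsearch code, and the whole-word check on the title is a simplified stand-in for the match query.

```python
from collections import Counter

# Sample documents, with the extra Nvidia entry added to the second one.
docs = [
    {"title": "CPU release", "companies": [(1, "AMD"), (2, "Intel")]},
    {"title": "GPU release 2018-01-10", "companies": [(1, "AMD"), (3, "Nvidia"), (3, "Nvidia")]},
    {"title": "GPU release 2018-03-01", "companies": [(3, "Nvidia")]},
    {"title": "Chipset release", "companies": [(2, "Intel")]},
]

# Simplified stand-in for { "match": { "title": "GPU" } }.
matching = [doc for doc in docs if "GPU" in doc["title"].split()]

# nested + terms: every nested object counts once.
nested_counts = Counter(
    company_id for doc in matching for company_id, _ in doc["companies"]
)

# reverse_nested: every parent document counts once per company_id it contains.
parent_counts = Counter(
    company_id
    for doc in matching
    for company_id in {company_id for company_id, _ in doc["companies"]}
)

print(dict(nested_counts))
print(dict(parent_counts))
```

For company_id 3 the first counter reports three nested objects and the second reports two parent documents, matching the annotated output above.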
What about performance?
For most cases the performance of nested queries and aggregations should be sufficient, but it comes at a certain cost. It is therefore recommended to avoid nested or parent-child types when tuning for search speed.
In Elasticsearch the best performance is often achieved through denormalization, although there is no single recipe and you should select the data model depending on your needs.
Hope this clarifies this nested thing for you a bit!
