Elasticsearch: Querying nested objects

Dear Elasticsearch experts,
I have a problem querying nested objects. Let's use the following simplified mapping:
{
"mappings" : {
"_doc" : {
"properties" : {
"companies" : {
"type": "nested",
"properties" : {
"company_id": { "type": "long" },
"name": { "type": "text" }
}
},
"title": { "type": "text" }
}
}
}
}
And put some documents in the index:
PUT my_index/_doc/1
{
"title" : "CPU release",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 2, "name" : "Intel" }
]
}
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/3
{
"title" : "GPU release 2018-03-01",
"companies" : [
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/4
{
"title" : "Chipset release",
"companies" : [
{ "company_id" : 2, "name" : "Intel" }
]
}
Now I want to execute queries like this:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": {
"bool": {
"must": [
{ "match": { "companies.name": "AMD" } }
]
}
},
"inner_hits" : {}
}
}
]
}
}
}
As a result, I want to get the matching companies together with the number of matching documents. So the above query should give me:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 }
]
The following query:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": { "match_all": {} },
"inner_hits" : {}
}
}
]
}
}
}
should give me all companies assigned to a document whose title contains "GPU", together with the number of matching documents:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 },
{ "company_id" : 3, "name" : "Nvidia", "matched_documents": 2 }
]
Is there a way to achieve this result with good performance? I'm explicitly not interested in the matching documents, only in the number of matched documents and the nested objects.
Thanks for your help.

In Elasticsearch terms, you need to:
filter the "parent" documents on the desired criteria (for example, having GPU in the title, or also mentioning Nvidia in the companies list);
group the "nested" documents into buckets by some criterion (e.g. company_id);
count how many "nested" documents fall into each bucket.
Each nested object in the array is indexed as a separate hidden document, which complicates life a bit. Let's see how to aggregate on them.
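The three steps above can be mimicked in plain Python to see which numbers to expect from the sample data. This is only a simulation of the semantics, not Elasticsearch code: the function name `count_nested_by_company` is made up for illustration, and the title match is simplified to a lowercase whitespace-token comparison.

```python
from collections import Counter

# The four sample documents from the question.
DOCS = [
    {"title": "CPU release", "companies": [
        {"company_id": 1, "name": "AMD"}, {"company_id": 2, "name": "Intel"}]},
    {"title": "GPU release 2018-01-10", "companies": [
        {"company_id": 1, "name": "AMD"}, {"company_id": 3, "name": "Nvidia"}]},
    {"title": "GPU release 2018-03-01", "companies": [
        {"company_id": 3, "name": "Nvidia"}]},
    {"title": "Chipset release", "companies": [
        {"company_id": 2, "name": "Intel"}]},
]


def count_nested_by_company(docs, title_term):
    # Step 1: keep only the parent documents whose title contains the term
    # (simplified: lowercase whitespace tokens, not a real analyzer).
    parents = [d for d in docs if title_term.lower() in d["title"].lower().split()]
    # Steps 2 and 3: bucket the nested objects by company_id and count them.
    return Counter(c["company_id"] for d in parents for c in d["companies"])


counts = count_nested_by_company(DOCS, "GPU")
print(dict(counts))
```

With the sample data, two parents match "GPU" and together they carry three nested objects: two for company 3 and one for company 1.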
So how to aggregate and count the nested documents?
You can achieve this with a combination of a nested, terms and top_hits aggregation:
POST my_index/_doc/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "GPU"
}
},
{
"nested": {
"path": "companies",
"query": {
"match_all": {}
}
}
}
]
}
},
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
This will give the following output:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3, <== how many "nested" documents were there?
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3, <== this bucket's key: "company_id": 3
"doc_count": 2, <== how many "nested" documents there were with such company_id?
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [ <== an example, "top hit" for such company_id
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
Notice that for Nvidia we have "doc_count": 2.
But what if we want to count the number of "parent" documents that mention a given company, say Nvidia or Intel?
What if we want to count parent objects based on a nested bucket?
This can be achieved with the reverse_nested aggregation.
We need to change our query just a little bit:
POST my_index/_doc/_search
{
"query": { ... },
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
},
"original doc count": { <== we ask ES to count how many parent docs there are
"reverse_nested": {}
}
}
}
}
}
}
}
The result will look like this:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3,
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 2,
"original doc count": {
"doc_count": 2 <== how many "parent" documents have such company_id
},
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"original doc count": {
"doc_count": 1
},
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
How can I spot the difference?
To make the difference evident, let's change the data a bit and add another Nvidia item to the companies list of document 2:
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
The last query (the one with reverse_nested) will give us the following:
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 3, <== 3 "nested" documents with Nvidia
"original doc count": {
"doc_count": 2 <== but only 2 "parent" documents
},
"Examples of such company_id": {
"hits": {
"total": 3,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 2
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
As you can see, this is a subtle difference that is hard to grasp, but it changes the semantics completely.
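The two counts can be reproduced with a small plain-Python simulation. It is a sketch of the semantics only: the data is the pair of GPU documents after the update above, reduced to their company_id lists, and `nested_and_parent_counts` is a hypothetical helper name.

```python
from collections import defaultdict

# Documents matching "GPU" after document 2 received a second Nvidia entry.
GPU_DOCS = {
    2: [1, 3, 3],  # company_ids in document 2: AMD, Nvidia, Nvidia
    3: [3],        # company_ids in document 3: Nvidia
}


def nested_and_parent_counts(docs):
    nested = defaultdict(int)   # plays the role of the terms bucket doc_count
    parents = defaultdict(set)  # plays the role of the reverse_nested doc_count
    for doc_id, company_ids in docs.items():
        for company_id in company_ids:
            nested[company_id] += 1           # every nested object counts
            parents[company_id].add(doc_id)   # each parent counts only once
    return {cid: (nested[cid], len(parents[cid])) for cid in nested}


result = nested_and_parent_counts(GPU_DOCS)
print(result[3])  # (nested count, parent count) for Nvidia
```

The set is what makes the difference: a parent that lists the same company twice contributes two nested objects but only one parent document.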
What about performance?
While the performance of nested queries and aggregations should be sufficient in most cases, it does come at a certain cost. It is therefore recommended to avoid nested or parent-child types when tuning for search speed.
In Elasticsearch the best performance is often achieved through denormalization, although there is no single recipe and you should select the data model depending on your needs.
Hope this clarifies this nested thing for you a bit!

Related

Count number of inner elements of array property (Including repeated values)

Given I have the following records.
[
{
"profile": "123",
"inner": [
{
"name": "John"
}
]
},
{
"profile": "456",
"inner": [
{
"name": "John"
},
{
"name": "John"
},
{
"name": "James"
}
]
}
]
I want to get something like:
"aggregations": {
"name": {
"buckets": [
{
"key": "John",
"doc_count": 3
},
{
"key": "James",
"doc_count": 1
}
]
}
}
I'm a beginner using Elasticsearch, and this seems to be a pretty simple operation to do, but I can't find how to achieve this.
If I try a simple terms aggregation, it returns 2 for John instead of 3.
Example request I'm trying:
{
"size": 0,
"aggs": {
"name": {
"terms": {
"field": "inner.name"
}
}
}
}
How can I possibly achieve this?
Additional Info: It will be used on Kibana later.
I can change mapping to whatever I want, but AFAIK Kibana doesn't like the "Nested" type. :(
You need a value_count aggregation. By default, terms only reports a doc_count, whereas the value_count aggregation counts the number of values of a given field.
So, for your purposes:
{
"size": 0,
"aggs": {
"name": {
"terms": {
"field": "inner.name"
},
"aggs": {
"total": {
"value_count": {
"field": "inner.name"
}
}
}
}
}
}
Which returns:
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "John",
"doc_count" : 2,
"total" : {
"value" : 3
}
},
{
"key" : "James",
"doc_count" : 1,
"total" : {
"value" : 2
}
}
]
}
}
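The totals above deserve a second look: John's total of 3 matches the wanted count, but James's total of 2 does not (the desired output lists 1). One reading, which is an assumption here and not something the answer states, is that value_count counts the distinct field values held by each document in the bucket, not the occurrences of the bucket key. The plain-Python sketch below models that assumption and reproduces the numbers shown; it also computes a per-element count, which is what the question asks for and what counting nested objects individually would correspond to. Both function names are hypothetical.

```python
from collections import Counter

RECORDS = [
    {"profile": "123", "inner": [{"name": "John"}]},
    {"profile": "456", "inner": [{"name": "John"}, {"name": "John"}, {"name": "James"}]},
]


def flattened_counts(records):
    """Assumed model of terms + value_count on a non-nested field:
    each document contributes its set of distinct names."""
    doc_count, total = Counter(), Counter()
    for rec in records:
        names = {item["name"] for item in rec["inner"]}  # duplicates collapse
        for name in names:
            doc_count[name] += 1       # one per document containing the name
            total[name] += len(names)  # all distinct values of that document
    return doc_count, total


def per_element_counts(records):
    """Counts every array element individually."""
    return Counter(item["name"] for rec in records for item in rec["inner"])


doc_count, total = flattened_counts(RECORDS)
elements = per_element_counts(RECORDS)
print(dict(doc_count), dict(total), dict(elements))
```

Under this model the match for John is a coincidence of the sample data, so the totals are worth checking against a larger data set before relying on them.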

ElasticSearch Max Agg on lowest value inside a list property of the document

I'm looking to do a max aggregation on a value of a property of my document. The property is a list of complex objects (key and value). Here's my data:
[{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
},
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}]
When I run a nested max aggregation on "listItems.value", I want the returned value to be 200 (not 5000): the logic should first find the MIN value under listItems for each document and then run the max aggregation over those minimums. Is it possible to do something like this?
Thanks.
The search query below performs the following aggregations:
Terms aggregation on the id field
Nested min aggregation on listItems.value
Max bucket aggregation, a sibling pipeline aggregation that identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s).
Please refer to the nested aggregation documentation for a detailed explanation.
Adding a working example with index data, index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"listItems": {
"type": "nested"
},
"id":{
"type":"text",
"fielddata":"true"
}
}
}
}
Index Data:
{
"id" : "1",
"listItems" :
[
{
"key" : "li1",
"value" : 100
},
{
"key" : "li2",
"value" : 5000
}
]
}
{
"id" : "2",
"listItems" :
[
{
"key" : "li3",
"value" : 200
},
{
"key" : "li2",
"value" : 2000
}
]
}
Search Query:
{
"size": 0,
"aggs": {
"id_terms": {
"terms": {
"field": "id"
},
"aggs": {
"nested_entries": {
"nested": {
"path": "listItems"
},
"aggs": {
"min_position": {
"min": {
"field": "listItems.value"
}
}
}
}
}
},
"maxValue": {
"max_bucket": {
"buckets_path": "id_terms>nested_entries>min_position"
}
}
}
}
Search Result:
"aggregations": {
"id_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 100.0
}
}
},
{
"key": "2",
"doc_count": 1,
"nested_entries": {
"doc_count": 2,
"min_position": {
"value": 200.0
}
}
}
]
},
"maxValue": {
"value": 200.0,
"keys": [
"2"
]
}
}
The initial post mentioned a nested aggregation, so I was sure the question was about nested documents. Since I arrived at this solution before seeing the other answer, I'm keeping the whole thing for history, but it actually differs only in adding the nested aggregation.
The whole process can be explained like this:
Bucket each document into its own bucket.
Use a nested aggregation to be able to aggregate on nested documents.
Use a min aggregation to find the minimum value among each document's nested documents, which is also the minimum for the document itself.
Finally, use another aggregation to calculate the maximum among the results of the previous aggregation.
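The process above amounts to "maximum of the per-document minimums". A minimal plain-Python sketch of that logic, using the data from the question, might look as follows; `max_of_minimums` is a hypothetical helper and nothing here talks to Elasticsearch.

```python
def max_of_minimums(documents, list_field, value_field):
    # One bucket per document: the minimum value among its nested items.
    minimums = [
        min(item[value_field] for item in doc[list_field])
        for doc in documents
    ]
    # Then the maximum across those per-document minimums.
    return max(minimums)


QUESTION_DATA = [
    {"id": "1", "listItems": [{"key": "li1", "value": 100},
                              {"key": "li2", "value": 5000}]},
    {"id": "2", "listItems": [{"key": "li3", "value": 200},
                              {"key": "li2", "value": 2000}]},
]

result = max_of_minimums(QUESTION_DATA, "listItems", "value")
print(result)
```

The per-document minimums are 100 and 200, so the answer is 200 and not the global maximum of 5000.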
Given this setup:
// PUT /index
{
"mappings": {
"properties": {
"children": {
"type": "nested",
"properties": {
"value": {
"type": "integer"
}
}
}
}
}
}
// POST /index/_doc
{
"children": [
{ "value": 12 },
{ "value": 45 }
]
}
// POST /index/_doc
{
"children": [
{ "value": 7 },
{ "value": 35 }
]
}
I can use these aggregations in a request to get the required value:
{
"size": 0,
"aggs": {
"document": {
"terms": {"field": "_id"},
"aggs": {
"children": {
"nested": {
"path": "children"
},
"aggs": {
"minimum": {
"min": {
"field": "children.value"
}
}
}
}
}
},
"result": {
"max_bucket": {
"buckets_path": "document>children>minimum"
}
}
}
}
The response:
{
"aggregations": {
"document": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "O4QxyHQBK5VO9CW5xJGl",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 7.0
}
}
},
{
"key": "OoQxyHQBK5VO9CW5kpEc",
"doc_count": 1,
"children": {
"doc_count": 2,
"minimum": {
"value": 12.0
}
}
}
]
},
"result": {
"value": 12.0,
"keys": [
"OoQxyHQBK5VO9CW5kpEc"
]
}
}
}
There should also be a workaround that uses a script to calculate the max: the script only needs to find and return the smallest value in each document.

Multiple key aggregation in ElasticSearch

I am new to Elasticsearch and have been exploring aggregation queries. The documents I have are in this format:
{"name":"A",
"class":"10th",
"subjects":{
"S1":92,
"S2":92,
"S3":92
}
}
We have about 40k such documents in our ES, with the subjects varying from student to student. A query to the system can ask to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here; however, this generates a single bucket per document and, in our understanding, requires an explicit mention of every subject.
We want the system to generate a subject-wise aggregate for the data by executing a single aggregation query. The problem I face is that in our data the subjects can vary from student to student and we don't have a global list of subject keys.
We wrote the following script, but it only works if we know all possible subjects.
GET student_data_v1_1/_search
{
"query": {
"match": { "class": "' + query + '" }
},
"aggs": {
"my_buckets": {
"terms": { "field": "subjects", "size": 10000 },
"aggregations": {
"the_avg": { "avg": { "field": "subjects.value" } }
}
}
},
"size": 0
}
This query only works for the document structure below; it does not work when multiple subjects are defined and we may not know the keys in advance:
{"name":"A",
"class":"10th",
"subjects":{
"value":93
}
}
An alternative document format stores the subjects as a list of dictionaries:
{"name":"A",
"class":"10th",
"subjects":[
{"S1":92},
{"S2":92},
{"S3":92}
]
}
Having an aggregation query to solve either of the 2 document formats would be helpful.
======EDITS======
After updating the document to hold weights for each subject -
{
"class": "10th",
"subject": [
{
"name": "s1",
"marks": 90,
"weight":30
},
{
"name": "s2",
"marks": 80,
"weight":70
}
]}
I have updated the query to be -
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "scores"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
}
}
}
}
},
"size": 0
}
but it throws this error:
{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}
To achieve the required result, I suggest defining your index mapping as follows:
{
"properties": {
"class": {
"type": "keyword"
},
"subject": {
"type": "nested",
"properties": {
"marks": {
"type": "integer"
},
"name": {
"type": "keyword"
}
}
}
}
}
In the mapping above I have created subject as nested type with two properties, name to hold subject name and marks to hold marks in the subject.
Sample doc:
{
"class": "10th",
"subject": [
{
"name": "s1",
"marks": 90
},
{
"name": "s2",
"marks": 80
}
]
}
Now you can use a nested aggregation together with multilevel aggregations (i.e. an aggregation inside an aggregation). I used a nested aggregation with a terms aggregation on subject.name to get buckets for all the available subjects. Then, to get the average for each subject, we add an avg sub-aggregation to the subjects aggregation, as below:
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
}
}
}
}
}
},
"size": 0
}
NOTE: I have added "size" : 0 so that Elasticsearch doesn't return the matching docs in the result. Whether to include it depends entirely on your use case.
Sample result:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": [
]
},
"aggregations": {
"subjects": {
"doc_count": 6,
"subjects": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "s1",
"doc_count": 3,
"avg_score": {
"value": 80
}
},
{
"key": "s2",
"doc_count": 2,
"avg_score": {
"value": 75
}
},
{
"key": "s3",
"doc_count": 1,
"avg_score": {
"value": 80
}
}
]
}
}
}
}
As you can see, the result contains one bucket per subject, with the subject name as key and the average of marks in avg_score.value.
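UPDATE to include weighted_avg (this assumes each subject entry, and the mapping, also carries a numeric weight field; the weighted_avg aggregation needs Elasticsearch 6.4 or later, and the "Unknown BaseAggregationBuilder [weighted_avg]" error above suggests an older version):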
{
"query": {
"match": {
"class": "10th"
}
},
"aggs": {
"subjects": {
"nested": {
"path": "subject"
},
"aggs": {
"subjects": {
"terms": {
"field": "subject.name"
},
"aggs": {
"avg_score": {
"avg": {
"field": "subject.marks"
}
},
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "subject.marks"
},
"weight": {
"field": "subject.weight"
}
}
}
}
}
}
}
},
"size": 0
}
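For intuition, a weighted average per subject is the sum of marks times weight divided by the sum of weights. The plain-Python sketch below computes that per subject name; it does not call Elasticsearch. The first document is the one from the question's edit, the second is hypothetical and added only so that one subject spans two documents, and `weighted_average_by_subject` is a made-up helper name.

```python
from collections import defaultdict


def weighted_average_by_subject(documents):
    # name -> [sum(marks * weight), sum(weight)]
    sums = defaultdict(lambda: [0.0, 0.0])
    for doc in documents:
        for entry in doc["subject"]:
            sums[entry["name"]][0] += entry["marks"] * entry["weight"]
            sums[entry["name"]][1] += entry["weight"]
    # Skip subjects whose total weight is zero to avoid dividing by zero.
    return {name: num / den for name, (num, den) in sums.items() if den}


DOCS = [
    # Document from the question's edit.
    {"class": "10th", "subject": [{"name": "s1", "marks": 90, "weight": 30},
                                   {"name": "s2", "marks": 80, "weight": 70}]},
    # Hypothetical second document, for illustration only.
    {"class": "10th", "subject": [{"name": "s1", "marks": 60, "weight": 10}]},
]

averages = weighted_average_by_subject(DOCS)
print(averages)
```

For s1 this gives (90 * 30 + 60 * 10) / (30 + 10) = 82.5, whereas a plain average of the two marks would be 75.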

ElasticSearch 1x - aggregate on object conditions

I want to aggregate on data, which has inner objects. For example:
{
"_index": "product_index-en",
"_type": "elasticproductmodel",
"_id": "000001111",
"_score": 6.3316255,
"_source": {
"productId": "11111111111",
"productIdOnlyLetterAndDigit": "11111111111",
"productIdOnlyDigit": "11111111111",
"productNumber": "11111111111",
"name": "Glow Plug",
"nameOnlyLetterAndDigit": "glowplug",
"productImageLarge": "11111111111.jpg",
"itemGroupId": "11111",
"relatedProductIds": [],
"dataAreaCountries": [
"fra",
"pol",
"uk",
"sie",
"sve",
"atl",
"ita",
"hol",
"dk"
],
"oemItems": [
{
"manufactorName": "BERU",
"manufacType": "0"
},
{
"manufactorName": "LUCAS",
"manufacType": "0"
}
]
}
}
I need to aggregate the oemItems.manufactorName values, but only where oemItems.manufacType is "0". I have tried a number of examples, such as the accepted one here ( Elastic Search Aggregate into buckets on conditions ), but I just cannot seem to wrap my head around it.
I've tried the following, hoping it would aggregate on manufacType first, which it does, and then on manufactorName for each type, for which it seems to display the correct hit count. However, the buckets for manufactorName are empty:
GET /product_index-en/_search
{
"size": 0,
"aggs": {
"baked_goods": {
"nested": {
"path": "oemItems"
},
"aggs": {
"test1": {
"terms": {
"field": "oemItems.manufacType",
"size": 500
},
"aggs": {
"test2": {
"terms": {
"field": "oemItems.manufactorName",
"size": 500
}
}
}
}
}
}
}
}
And the result:
{
"took": 27,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 471214,
"max_score": 0,
"hits": []
},
"aggregations": {
"baked_goods": {
"doc_count": 677246,
"test1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "0",
"doc_count": 436557,
"test2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
},
{
"key": "1",
"doc_count": 240689,
"test2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
}
]
}
}
}
}
I have also tried adding a nested term filter to look only at oemItems with manufacType 1, using the following query. However, it returns whole products in which at least one oemItems entry has manufacType 1, meaning the oemItems within those products still contain either 1 or 0 as manufacType. I don't see how aggregating on this response would return only the oemItems.manufactorName values where oemItems.manufacType is 0.
GET /product_index-en/_search
{
"query" : { "match_all" : {} },
"filter" : {
"nested" : {
"path" : "oemItems",
"filter" : {
"bool" : {
"must" : [
{
"term" : {"oemItems.manufacType" : "1"}
}
]
}
}
}
}
}
Good start so far. Just try it like this:
POST /product_index-en/_search
{
"size": 0,
"query": {
"nested": {
"path": "oemItems",
"query": {
"term": {
"oemItems.manufacType": "0"
}
}
}
},
"aggs": {
"baked_goods": {
"nested": {
"path": "oemItems"
},
"aggs": {
"test1": {
"terms": {
"field": "oemItems.manufactorName",
"size": 500
}
}
}
}
}
}
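One caveat, raised in the question itself, still applies to the query above: a nested query selects whole parent products, so the nested aggregation also sees the manufacType 1 entries of those products. A possible refinement is to put a filter aggregation inside the nested aggregation so that only matching nested entries reach the terms buckets. The sketch below builds such a request body as a Python dict; the aggregation names (`oem_items`, `only_type`, `names`) and the helper function are made up for illustration, and the body has not been run against a 1.x cluster.

```python
import json


def manufacturer_names_for_type(manufac_type, size=500):
    """Build a search body that buckets manufactorName only for nested
    oemItems entries whose manufacType equals manufac_type."""
    return {
        "size": 0,
        "aggs": {
            "oem_items": {
                "nested": {"path": "oemItems"},
                "aggs": {
                    "only_type": {
                        # Filter aggregation: narrows the nested documents
                        # themselves, not the parent products.
                        "filter": {"term": {"oemItems.manufacType": manufac_type}},
                        "aggs": {
                            "names": {
                                "terms": {
                                    "field": "oemItems.manufactorName",
                                    "size": size,
                                }
                            }
                        },
                    }
                },
            }
        },
    }


body = manufacturer_names_for_type("0")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the same _search endpoint as the query above.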

Elasticsearch Terms or Cardinality Aggregation - Order by number of distinct values

Friends,
I am doing some analysis to find unique pairs from 100s of millions of documents. The mock example is as shown below:
doc  field1  field2
1    AAA     BBB
2    AAA     CCC
3    PPP     QQQ
4    PPP     QQQ
5    XXX     YYY
6    XXX     YYY
7    MMM     NNN
About 90% of the documents contain a unique pair, as in docs 3, 4, 5, 6 and 7 above, and I am not interested in those in my aggregation result. I am interested in aggregating docs 1 and 2.
Terms Aggregation Query:
"aggs": {
"f1": {
"terms": {
"field": "FIELD1",
"min_doc_count": 2
},
"aggs": {
"f2": {
"terms": {
"field": "FIELD2"
}
}
}
}
}
Term Aggregation Result
"aggregations": {
"f1": {
"buckets": [
{
"key": "PPP",
"doc_count": 2,
"f2": {
"buckets": [
{
"key": "QQQ",
"doc_count": 2
}
]
}
},
{
"key": "XXX",
"doc_count": 2,
"f2": {
"buckets": [
{
"key": "YYY",
"doc_count": 2
}
]
}
},
{
"key": "AAA",
"doc_count": 2,
"f2": {
"buckets": [
{
"key": "BBB",
"doc_count": 1
},
{
"key": "CCC",
"doc_count": 1
}
]
}
}
]
}
}
I am interested only in key AAA appearing in the aggregation result. What is the best way to filter the aggregation result so that it contains only the distinct pairs?
I tried a cardinality aggregation, which returns the unique value count. However, I am not able to filter out what I am not interested in from the aggregation results.
Cardinality Aggregation Query
"aggs": {
"f1": {
"terms": {
"field": "FIELD1",
"min_doc_count": 2
},
"aggs": {
"f2": {
"cardinality": {
"field": "FIELD2"
}
}
}
}
}
Cardinality Aggregation Result
"aggregations": {
"f1": {
"buckets": [
{
"key": "PPP",
"doc_count": 2,
"f2": {
"value" : 1
}
},
{
"key": "XXX",
"doc_count": 2,
"f2": {
"value" : 1
}
},
{
"key": "AAA",
"doc_count": 2,
"f2": {
"value" : 2
}
}
]
}
}
If I could at least sort by the cardinality value, that would help me find some workarounds. Please help me in this regard.
P.S: Writing a spark/mapreduce program to post process/filter the aggregation result is not expected solution for this issue.
I suggest using a filter query along with aggregations, since you are only interested in field1=AAA.
I have a similar example here.
For example, I have an index of all patients in my hospital. I store their drug use in a nested object DRUG. Each patient can take different drugs, and each can take a single drug multiple times.
Now if I wanted to find the number of patients who took aspirin at least once, the query could be:
{
"size": 0,
"_source": false,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"nested": {
"path": "DRUG",
"filter": {
"bool": {
"must": [{ "term": { "DRUG.NAME": "aspirin" } }]
}}}}}},
"aggs": {
"DRUG_FACETS": {
"nested": {
"path": "DRUG"
},
"aggs": {
"DRUG_NAME_FACETS": {
"terms": { "field": "DRUG.NAME", "size": 0 },
"aggs": {
"DISTINCT": { "cardinality": { "field": "DRUG.PATIENT" } }
}
}}}}
}
Sample result:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0,
"hits": []
},
"aggregations": {
"DRUG_FACETS": {
"doc_count": 11,
"DRUG_NAME_FACETS": {
"buckets": [
{
"key": "aspirin",
"doc_count": 6,
"DISTINCT": {
"value": 6
}
},
{
"key": "vitamin-b",
"doc_count": 3,
"DISTINCT": {
"value": 2
}
},
{
"key": "vitamin-c",
"doc_count": 2,
"DISTINCT": {
"value": 2
}
}
]
}
}
}
}
The first bucket would be aspirin. But you can see that 2 of these patients had also taken vitamin-b when they took aspirin.
If you change the field value of DRUG.NAME to another drug name, for example "vitamin-b", I suppose you would get vitamin-b in the first position of the buckets.
Hopefully this helps with your question.
A bit late, but I hope this helps others.
A simple approach is to keep only the 'AAA' records with a top-level filter aggregation:
{
"size": 0,
"aggregations": {
"filterAAA": {
"filter": {
"term": {
"FIELD1": "AAA"
}
},
"aggregations": {
"f1": {
"terms": {
"field": "FIELD1",
"min_doc_count": 2
},
"aggregations": {
"f2": {
"terms": {
"field": "FIELD2"
}
}
}
}
}
}
}
}
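Both answers hardcode AAA. When the interesting keys are not known in advance, one option that goes beyond what the answers propose is to keep only the buckets whose cardinality exceeds 1. The sketch below shows that condition twice: as a request body using a bucket_selector pipeline aggregation, and as a client-side filter applied to the cardinality response from the question. The bucket_selector aggregation is not available in Elasticsearch 1.x (which the filtered query above suggests is in use), and the `params.distinct` script syntax is that of later versions, so treat the body as an assumption to verify; the names `f2_distinct`, `multi_only` and `keep_multi_valued` are hypothetical.

```python
# Request body sketch (assumes a version with pipeline aggregations).
BODY = {
    "size": 0,
    "aggs": {
        "f1": {
            "terms": {"field": "FIELD1", "min_doc_count": 2},
            "aggs": {
                "f2_distinct": {"cardinality": {"field": "FIELD2"}},
                "f2": {"terms": {"field": "FIELD2"}},
                "multi_only": {
                    # Drops every f1 bucket with a single distinct FIELD2 value.
                    "bucket_selector": {
                        "buckets_path": {"distinct": "f2_distinct"},
                        "script": "params.distinct > 1",
                    }
                },
            },
        }
    },
}

# Buckets copied from the cardinality result in the question.
CARDINALITY_BUCKETS = [
    {"key": "PPP", "doc_count": 2, "f2": {"value": 1}},
    {"key": "XXX", "doc_count": 2, "f2": {"value": 1}},
    {"key": "AAA", "doc_count": 2, "f2": {"value": 2}},
]


def keep_multi_valued(buckets, metric="f2", min_distinct=2):
    # Same condition as the script above, applied to a response client-side.
    return [b["key"] for b in buckets if b[metric]["value"] >= min_distinct]


kept = keep_multi_valued(CARDINALITY_BUCKETS)
print(kept)
```

On versions without pipeline aggregations, the client-side filter is the fallback, at the cost of transferring every bucket first.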
