Aggregate nested objects in ElasticSearch - elasticsearch

Let's say we have this document:
{
"Article" : [
{
"id" : 12,
"title" : "An article title",
"categories" : [1,3,5,7],
"tag" : ["elasticsearch", "symfony", "Obtao"],
"author" : [
{
"firstname" : "Francois",
"surname": "francoisg",
"id" : 18
},
{
"firstname" : "Gregory",
"surname" : "gregquat",
"id" : "2"
}
]
},
{
"id" : 13,
"title" : "A second article title",
"categories" : [1,7],
"tag" : ["elasticsearch", "symfony", "Obtao"],
"author" : [
{
"firstname" : "Gregory",
"surname" : "gregquat",
"id" : "2"
}
]
}
]
}
How can I find all unique authors by id? What is the proper query? I need to return all unique authors ("author.id")
Thanks for help.

First, map the author field with the nested type, so that each author object is indexed as its own hidden document.
Second, as @Taras_Kohut mentioned, once you have re-indexed all the data, you can run:
{
"size": 0,
"aggregations": {
"records": {
"nested": {
"path": "author"
},
"aggregations": {
"ids": {
"terms": {
"field": "author.id"
}
}
}
}
}
}
See Nested Aggregation
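The two steps above can be sketched as plain Python dictionaries that any client could send as request bodies. This is only an illustration under stated assumptions: the helper names are made up, the mapping uses the typeless format of Elasticsearch 7+, each article is indexed as its own document with a top-level author array, and author.id is indexed with a single type (the sample mixes 18 and "2").

```python
def author_nested_mapping():
    """Index body that declares `author` as a nested field (assumed ES 7+ format)."""
    return {
        "mappings": {
            "properties": {
                "author": {
                    "type": "nested",
                    "properties": {
                        # keyword is an assumption; a numeric type would work too
                        "id": {"type": "keyword"},
                        "firstname": {"type": "text"},
                        "surname": {"type": "keyword"},
                    },
                }
            }
        }
    }


def unique_author_ids_body(size=100):
    """Search body: nested aggregation on `author`, then one bucket per author.id."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggregations": {
            "records": {
                "nested": {"path": "author"},
                "aggregations": {
                    "ids": {"terms": {"field": "author.id", "size": size}}
                },
            }
        },
    }


def extract_author_ids(response):
    """Pull the unique ids out of the aggregation part of a search response."""
    buckets = response["aggregations"]["records"]["ids"]["buckets"]
    return [bucket["key"] for bucket in buckets]
```

A terms aggregation returns only the top 10 buckets unless size is set, which is why the sketch raises it; with a very large number of authors a composite aggregation may be a better fit.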

Related

Elasticsearch Nested 2 Step Sorting

Given the following data with nested objects (members within teams), I need to do a 2 step sort:
Return the youngest member of each team.
Sort the teams by the name of that youngest member.
I have a query below that is close: it does get the youngest member of each team, but then it sorts the teams using the names of all the members, not just the one selected per team.
What would the query be to do this?
And would such a query be performant assuming there was a lot of data? (Probably a few million objects each having 1-3 nested objects.)
Note: Although it's not clear in this simple example, I cannot simply store the youngest member, since in my real world case, the sorting of the nested objects is determined by a formula that includes an external parameter. This is just a very simplified example of the many sorts like this I would have to do on a larger data set, where I need to get the single best matching nested document for each outer document sorted in one way, but then sort the outer objects based on some other property of that selected nested object.
Data
PUT nested_test
{
"mappings": {
"dynamic": "strict",
"properties": {
"team": { "type": "keyword", "index": true, "doc_values": true },
"members": {
"type": "nested",
"properties": {
"name": { "type": "keyword", "index": true, "doc_values": true },
"age": { "type": "integer", "index": true, "doc_values": true}
}
}
}
}
}
PUT nested_test/_doc/1
{
"team" : "A" ,
"members" :
[
{ "name" : "Curt" , "age" : "34" } ,
{ "name" : "Dave" , "age" : "33" }
]
}
PUT nested_test/_doc/2
{
"team" : "B" ,
"members" :
[
{ "name" : "Alex" , "age" : "36" } ,
{ "name" : "Earl" , "age" : "32" }
]
}
PUT nested_test/_doc/3
{
"team" : "C" ,
"members" :
[
{ "name" : "Brad" , "age" : "35" } ,
{ "name" : "Gary" , "age" : "31" }
]
}
Attempted Query
GET nested_test/_search?filter_path=hits.hits._source.team,hits.hits.sort.*,hits.hits.inner_hits.members.hits.hits._source.*,hits.hits.inner_hits.members.hits.hits.sort.*
{
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "members",
"query": {
"match_all" : { }
} ,
"inner_hits": {
"size": 1,
"sort": {
"members.age": { "order": "asc" }
}
}
}
}
]
}
}
,
"sort": [
{ "members.name": {
"order": "asc" ,
"nested": {
"path": "members",
"filter": { "match_all" : { } }
}
} }
]
}
Results (If the query was correct, the teams would be in A, B, C order, but they are B, C, A)
{
"hits" : {
"hits" : [
{
"_source" : {
"team" : "B"
},
"inner_hits" : {
"members" : {
"hits" : {
"hits" : [
{
"_source" : {
"name" : "Earl",
"age" : "32"
}
}
]
}
}
}
},
{
"_source" : {
"team" : "C"
},
"inner_hits" : {
"members" : {
"hits" : {
"hits" : [
{
"_source" : {
"name" : "Gary",
"age" : "31"
}
}
]
}
}
}
},
{
"_source" : {
"team" : "A"
},
"inner_hits" : {
"members" : {
"hits" : {
"hits" : [
{
"_source" : {
"name" : "Dave",
"age" : "33"
}
}
]
}
}
}
}
]
}
}
This is not feasible with a nested sort, and you can't use the result of inner_hits to sort your documents.
You could perhaps use a runtime field with a complex script to extract the name of the youngest member at search time, but it would be ugly and would perform poorly at scale.
Since you use a nested model, you have all the data you need at index time to store the youngest member's name in a dedicated field at the root of the document.
You can then use a standard sort for this use case.
That is the right way to do it in Elasticsearch if you want to keep good performance.
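A minimal sketch of that index-time denormalization, assuming a hypothetical root-level field named youngest_member_name. Because the mapping in the question is "dynamic": "strict", that field would first have to be added to the mapping (for example as a keyword). The function names are illustrative only.

```python
def youngest_member_name(team_doc):
    """Return the name of the member with the lowest age, or None without members."""
    members = team_doc.get("members", [])
    if not members:
        return None
    # Ages are strings in the sample data, so convert before comparing.
    youngest = min(members, key=lambda member: int(member["age"]))
    return youngest["name"]


def with_youngest_member(team_doc, field="youngest_member_name"):
    """Copy the document and add the denormalized root-level field before indexing."""
    enriched = dict(team_doc)
    enriched[field] = youngest_member_name(team_doc)
    return enriched


def sort_by_youngest_body(field="youngest_member_name"):
    """Search body that sorts teams with a standard (non-nested) sort."""
    return {"sort": [{field: {"order": "asc"}}]}
```

With the three sample teams this stores Dave, Earl and Gary, so a plain ascending sort on the new field returns the teams in A, B, C order. It does not help when the selection depends on a search-time parameter, as noted in the question.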

ES query to group by parent_id

I have the data below in Elasticsearch and would like to query it grouped by parent_id.
{
"id" : "001",
"parent_id" : "001",
"name" : "test001"
},
{
"id" : "002",
"parent_id" : "001",
"name" : "test002"
},
{
"id" : "003",
"parent_id" : "001",
"name" : "test003"
},
{
"id" : "004",
"parent_id" : "004",
"name" : "test004"
}
Here is my expected format:
{
"id" : "001",
"parent_id" : "001",
"name" : "test001",
"children": [
{
"id" : "002",
"parent_id" : "001",
"name" : "test002"
},
{
"id" : "003",
"parent_id" : "001",
"name" : "test003"
}
]
},
{
"id" : "004",
"parent_id" : "004",
"name" : "test004"
}
Is there any way I can achieve this with an Elasticsearch query?
Assuming that parent_id is of the keyword field type, and/or has a multi-field mapping similar to:
"parent_id" : {
"type" : "text",
"fields" : {
"keyword" : { <---
"type" : "keyword"
}
}
}
you could first group all your documents by parent_id.keyword and then list all the children (including #001) using a top_hits aggregation:
POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.children.hits.hits._source
{
"size": 0,
"aggs": {
"by_parent_id": {
"terms": {
"field": "parent_id.keyword",
"size": 10
},
"aggs": {
"children": {
"top_hits": {
"sort": [
{
"id.keyword": {
"order": "asc"
}
}
],
"size": 10
}
}
}
}
}
}
yielding
{
"aggregations" : {
"by_parent_id" : {
"buckets" : [
{
"key" : "001",
"children" : {
"hits" : {
"hits" : [
{
"_source" : {
"id" : "001",
"parent_id" : "001",
"name" : "test001"
}
},
{
"_source" : {
"id" : "002",
"parent_id" : "001",
"name" : "test002"
}
},
{
"_source" : {
"id" : "003",
"parent_id" : "001",
"name" : "test003"
}
}
]
}
}
},
{
"key" : "004",
"children" : {
"hits" : {
"hits" : [
{
"_source" : {
"id" : "004",
"parent_id" : "004",
"name" : "test004"
}
}
]
}
}
}
]
}
}
}
The sort inside top_hits (marked below) controls the order of the children within each bucket; here it is id.keyword, but any sortable field works:
POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.children.hits.hits._source
{
"size": 0,
"aggs": {
"by_parent_id": {
"terms": {
"field": "parent_id.keyword",
"size": 10
},
"aggs": {
"children": {
"top_hits": {
"sort": [ <---
{
"id.keyword": {
"order": "asc"
}
}
],
"size": 10
}
}
}
}
}
}
Finally, you can control the order of the top-level terms aggregation too, through its order parameter.
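The aggregation response still has to be reshaped on the client to get the format asked for in the question. A small sketch of that step, assuming the response shape shown above and that the parent is the document whose id equals the bucket key; the function name is hypothetical.

```python
def group_children(response):
    """Turn by_parent_id buckets into parent documents with a `children` list."""
    buckets = response["aggregations"]["by_parent_id"]["buckets"]
    grouped = []
    for bucket in buckets:
        docs = [hit["_source"] for hit in bucket["children"]["hits"]["hits"]]
        # The parent is the document that points at itself.
        parent = next((doc for doc in docs if doc["id"] == bucket["key"]), None)
        if parent is None:
            continue  # parent not among the returned top hits; skipped in this sketch
        children = [doc for doc in docs if doc["id"] != bucket["key"]]
        result = dict(parent)
        if children:
            result["children"] = children
        grouped.append(result)
    return grouped
```

Both size values in the query cap what comes back (10 parents and 10 documents per parent here), so larger groups would be truncated unless those limits are raised.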

Unable to retrieve nested object within Elastic Search

An ELK noob here, with an ELK task dropped on me at the last minute.
We are adding extra data named prospects to the vehicle index so that users can search on it. I'm able to add the prospects to the index, but I'm unable to query the nested prospects object within the vehicle index. I'm using Elasticsearch & Kibana v6.8.11 and the elasticsearch-rails gem, and I have checked the docs on nested objects. My search method looks correct according to the docs. I would like an expert to point out what went wrong here; please let me know if you need more info.
Here is the expected index object -
{
"_index" : "vehicles",
"_type" : "_doc",
"_id" : "3MZBxxxxxxx",
"_score" : 0.0,
"_source" : {
"vin" : "3MZBxxxxxxx",
"make" : "mazda",
"model" : "mazda3",
"color" : "unknown",
"year" : 2018,
"vehicle" : "2018 mazda mazda3",
"trim" : "grand touring",
"estimated_mileage" : null,
"dealership" : [
209
],
"current_owner_group_id" : null,
"current_owner_customer_id" : null,
"last_service_date" : null,
"last_service_revenue" : null,
"purchase_type" : [ ],
"in_service_date" : null,
"deal_headers" : [ ],
"services" : [ ],
"customers" : [ ],
"salesmen" : null,
"service_appointments" : [ ],
"prospects" : [
{
"first_name" : "Kammy",
"last_name" : "Maytag",
"name" : "Kammy Maytag",
"company_name" : null,
"emails" : [ ],
"phone_numbers" : [ ],
"address" : "31119 field",
"city" : "helen",
"state" : "keller",
"zip" : "81411",
"within_dealership_aoi_region" : true,
"dealership_ids" : [
209
],
"dealership_dppa_protected_ids" : [
209
],
"registration_id" : 12344,
"id" : 1054,
"prospect_source_id" : "12344",
"type" : "Prospect"
}
]
}
}
Here is how I'm trying to get it -
GET /vehicles/_search
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": [
{ "term": { "dealership": "209" } },
{
"nested": {
"path": "prospects",
"query": {
"bool": {
"must": [
{ "term": { "prospects.first_name": "Kammy" } },
{ "term": { "prospects.dealership": "209" } },
{ "term": { "prospects.type": "Prospect" } }
]
}
}
}
},
{ "bool": { "must_not": { "term": { "purchase_type": "Wholesale" } } } }
]
}
},
"sort": [{ "_doc": { "order": "asc" } }]
}
I see two issues with the nested query:
You're querying prospects.dealership, but the example doc only shows prospects.dealership_ids. Change the query to target prospects.dealership_ids.
More importantly, you're using a term query on prospects.first_name and prospects.type. I'm assuming your index mapping doesn't define those as keywords, which means they were analyzed as text and most likely lowercased at index time (for reasons explained here), whereas term looks for an exact match against the indexed tokens, so "Kammy" never matches "kammy".
Option 1: Use match instead of term.
Option 2: Change prospects.first_name → prospects.first_name.keyword and do the same for .type.
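Putting both fixes together, the nested clause might look like the sketch below, written as a Python dictionary. It uses Option 1 (match) for first_name and Option 2 (.keyword) for type; the .keyword sub-field is an assumption that only holds if the field was created by default dynamic mapping, and the function name is made up for illustration.

```python
def prospects_filter(first_name, dealership_id, prospect_type="Prospect"):
    """Nested clause for the bool/filter list of the vehicles search."""
    return {
        "nested": {
            "path": "prospects",
            "query": {
                "bool": {
                    "must": [
                        # Option 1: match analyzes the input like the indexed text.
                        {"match": {"prospects.first_name": first_name}},
                        # Field name corrected to the one present in the document.
                        {"term": {"prospects.dealership_ids": dealership_id}},
                        # Option 2: exact match on the assumed keyword sub-field.
                        {"term": {"prospects.type.keyword": prospect_type}},
                    ]
                }
            },
        }
    }
```

The returned dictionary replaces the nested entry in the filter array of the original request; the other filters can stay as they are.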

How to search Parent documents along with count of associated child documents

I am looking for the best way to search parent documents along with counts of their associated child documents. Example:
We have Organization documents and User documents. There could be thousands of users belonging to one particular organization.
Organization document :
{
"id" : "001",
"name" : "orgname1"
},
{
"id" : "002",
"name" : "orgname2"
}
Users documents :
{
"id" : "testusr1",
"name" : "xyz1",
"orgId" : "001"
},
{
"id" : "testusr2",
"name" : "xyz2",
"orgId" : "001"
},
{
"id" : "testusr3",
"name" : "xyz3",
"orgId" : "001"
},
{
"id" : "testusr4",
"name" : "xyz4",
"orgId" : "001"
},
{
"id" : "testusr5",
"name" : "xyz5",
"orgId" : "002"
},
{
"id" : "testusr6",
"name" : "xyz6",
"orgId" : "002"
}
In the above example, 4 users are associated with organization 001 and 2 users with organization 002. On the front end, an admin will search for an organization, and I want the response to include the user count for that organization.
You can solve your issue in three ways. Each has its own advantages and disadvantages.
1. Index Parent and child separately
This requires two queries. First query the organization index to get the organization id, then query the user index to count the users with that orgId.
Advantage:
A change in one index doesn't affect the other index.
Disadvantage:
You need to use two queries.
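A rough sketch of the two-query approach, with the request bodies as Python dictionaries. It assumes organizations and users live in two separate indices, that orgId is mapped as a keyword, and that the second query counts users for all matched organizations at once with a terms aggregation; the function names and the userCount field are hypothetical.

```python
def find_organizations_body(name):
    """First query: search organizations by name."""
    return {"query": {"match": {"name": name}}}


def user_counts_body(org_ids):
    """Second query: count users per orgId for the organizations found."""
    return {
        "size": 0,  # only the aggregation is needed
        "query": {"terms": {"orgId": org_ids}},
        "aggs": {
            "users_per_org": {
                # one bucket per requested organization; size must be at least 1
                "terms": {"field": "orgId", "size": max(len(org_ids), 1)}
            }
        },
    }


def attach_user_counts(organizations, counts_response):
    """Merge the bucket counts into the organization documents."""
    buckets = counts_response["aggregations"]["users_per_org"]["buckets"]
    counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
    # Organizations without any user get no bucket, so default to 0.
    return [dict(org, userCount=counts.get(org["id"], 0)) for org in organizations]
```

Counting all matched organizations in one aggregation keeps the total at two requests regardless of how many organizations the first query returns.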
2. Nested Documents
Mapping:
PUT index9
{
"mappings": {
"properties": {
"id":{
"type": "integer"
},
"name":{
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"user":{
"type": "nested",
"properties": {
"id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
},
"name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
}
}
}
}
}
POST index9/_doc
{
"id" : 1,
"name" : "orgname1",
"user":[
{
"id":"testuser1",
"name":"xyz1"
},
{
"id":"testuser2",
"name":"xyz2"
}
]
}
Query:
GET index9/_search
{
"query": {
"match_all": {}
},
"aggs": {
"organization": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"user": {
"nested": {
"path": "user"
},
"aggs": {
"count": {
"value_count": {
"field": "user.id.keyword"
}
}
}
}
}
}
}
}
Result:
"aggregations" : {
"organization" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 1,
"user" : {
"doc_count" : 2,
"count" : {
"value" : 2
}
}
}
]
}
}
Nested queries are faster than parent/child queries.
However, nested documents require re-indexing the parent together with all its children, while parent/child lets you re-index, add, or delete specific children.
3. Parent Child Relationship
Mapping
{
"my_index" : {
"mappings" : {
"properties" : {
"id" : {
"type" : "keyword"
},
"my_join_field" : {
"type" : "join",
"eager_global_ordinals" : true,
"relations" : {
"organization" : "user"
}
},
"name" : {
"type" : "text"
},
"orgId" : {
"type" : "long"
}
}
}
}
}
Data:
POST my_index/_doc/1
{
"id": 1,
"name" : "orgname1",
"my_join_field": "organization"
}
POST my_index/_doc/2
{
"id" : 2,
"name" : "orgname2",
"my_join_field": "organization"
}
POST my_index/_doc/3?routing=1
{
"id": "testusr1",
"name": "xyz1",
"orgId": 1,
"my_join_field": {
"name": "user",
"parent": 1
}
}
POST my_index/_doc/4?routing=2
{
"id" : "testusr5",
"name" : "xyz5",
"orgId" : 2,
"my_join_field": {
"name": "user",
"parent": 2
}
}
POST my_index/_doc/5?routing=2
{
"id" : "testusr6",
"name" : "xyz6",
"orgId" : 2,
"my_join_field": {
"name": "user",
"parent": 2
}
}
Query:
{
"query": {
"has_child": {
"type": "user",
"query": { "match_all": {} }
}
},
"aggs": {
"organization": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"user": {
"children": {
"type": "user"
},
"aggs": {
"count": {
"value_count": {
"field": "id"
}
}
}
}
}
}
}
}
Result:
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"name" : "orgname1",
"my_join_field" : "organization"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "orgname2",
"my_join_field" : "organization"
}
}
]
},
"aggregations" : {
"organization" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"user" : {
"doc_count" : 1,
"count" : {
"value" : 1
}
}
},
{
"key" : "2",
"doc_count" : 1,
"user" : {
"doc_count" : 2,
"count" : {
"value" : 2
}
}
}
]
}
Benefits:
1. Parent document and children are separate documents.
2. Parent and child can be updated separately without re-indexing the other.
3. It is useful when child documents are large in number and need to be added or changed frequently.
4. Child documents can be returned as the results of a search request.
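Both the nested and the parent/child responses above expose the count in the same place, under the organization bucket's user sub-aggregation. A small sketch for reading it, with a made-up function name; bucket keys are normalized to strings because one response returns 1 and the other "1".

```python
def users_per_organization(response):
    """Map each organization bucket key to the number of users in that bucket."""
    buckets = response["aggregations"]["organization"]["buckets"]
    # `user.doc_count` is the number of nested or child user documents.
    return {str(bucket["key"]): bucket["user"]["doc_count"] for bucket in buckets}
```

Reading doc_count directly means the inner value_count aggregation is optional when only the number of users is needed.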

ElasticSearch : search and return nested type

I am pretty new to ElasticSearch and I am having trouble using nested mapping / query.
I have the following data structure added to my index :
{
"_id": "3",
"_rev": "6-e9e1bc15b39e333bb4186de05ec1b167",
"skuCode": "test",
"name": "Dragon vol. 1",
"pages": [
{
"id": "1",
"tags": [
{
"name": "dragon"
},
{
"name": "japonese"
}
]
},
{
"id": "2",
"tags": [
{
"name": "tagforanotherpage"
}
]
}
]
}
This index mapping is defined as below:
{
"metabook" : {
"metabook" : {
"properties" : {
"_rev" : {
"type" : "string"
},
"name" : {
"type" : "string"
},
"pages" : {
"type" : "nested",
"properties" : {
"tags" : {
"properties" : {
"name" : {
"type" : "string"
}
}
}
}
},
"skuCode" : {
"type" : "string"
}
}
}
}
}
My goal is to search all pages containing a specific tag, and return the book object with the filtered page list (I would like ES to return only pages that match the given tag). Something like (ignoring the second page) :
{
"_id": "3",
"_rev": "6-e9e1bc15b39e333bb4186de05ec1b167",
"skuCode": "test",
"name": "Dragon vol. 1",
"pages": [
{
"id": "1",
"tags": [
{
"name": "dragon"
},
{
"name": "japonese"
}
]
}
]
}
Here is the query I actually use :
{
"from": 0,
"size": 10,
"query" : {
"nested" : {
"path" : "pages",
"score_mode" : "avg",
"query" : {
"term" : { "tags.name" : "japonese" }
}
}
}
}
But it actually returns an empty result. What am I doing wrong ? Maybe I should index my "pages" directly instead of books ? What am I missing ?
Thank you in advance !
Sadly, you can't get back only parts of a document. If the document matches a query, you will get the whole thing back: the root and all nested docs. If you want to get only parts back, then you could look at using parent/child docs.
Also, you aren't seeing any hits because the field name in the nested query is wrong: it must be the full path from the root, pages.tags.name, not tags.name. Look closely at the field name:
{
"from": 0,
"size": 10,
"query" : {
"nested" : {
"path" : "pages",
"score_mode" : "avg",
"query" : {
"term" : { "pages.tags.name" : "japonese" }
}
}
}
}
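Since the whole document comes back, one workaround is to drop the non-matching pages in application code after the search. A minimal sketch of that client-side filter, assuming the _source layout from the question; the function name is hypothetical and this is not something Elasticsearch does for you here.

```python
def keep_matching_pages(book, tag_name):
    """Return a copy of the book that keeps only pages carrying the given tag."""
    matching = [
        page
        for page in book.get("pages", [])
        if any(tag.get("name") == tag_name for tag in page.get("tags", []))
    ]
    filtered = dict(book)  # shallow copy, the original hit is left untouched
    filtered["pages"] = matching
    return filtered
```

Applied to the sample book with the tag "japonese", this keeps page 1 and drops page 2, which is the result shape asked for in the question.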
If you need help with parent child docs feel free to ask! (There should be examples if you do a google search)
Good luck!
