ES query to group by parent_id

ES query to group by parent_id - elasticsearch

I have the below data in my elasticsearch. I would like to query these data with group.
{
"id" : "001",
"parent_id" : "001",
"name" : "test001"
},
{
"id" : "002",
"parent_id" : "001",
"name" : "test002"
},
{
"id" : "003",
"parent_id" : "001",
"name" : "test003"
}
{
"id" : "004",
"parent_id" : "004",
"name" : "test004"
}
Here is my expected format:
{
"id" : "001",
"parent_id" : "001",
"name" : "test001"
"children": [
{
"id" : "002",
"parent_id" : "001",
"name" : "test002"
},
{
"id" : "003",
"parent_id" : "001",
"name" : "test003"
}
]
},
{
"id" : "004",
"parent_id" : "004",
"name" : "test004"
}
Is there any way I can achieve this using elastic search query?

Assuming that parent_id is of the keyword field type, and/or has a multi-field mapping similar to:
"parent_id" : {
"type" : "text",
"fields" : {
"keyword" : { <---
"type" : "keyword"
}
}
}
you could first group all your documents by parent_id.keyword and then list all the children (including #001) using a top_hits aggregation:
POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.children.hits.hits._source
{
"size": 0,
"aggs": {
"by_parent_id": {
"terms": {
"field": "parent_id.keyword",
"size": 10
},
"aggs": {
"children": {
"top_hits": {
"sort": [
{
"id.keyword": {
"order": "asc"
}
}
],
"size": 10
}
}
}
}
}
}
yielding
{
"aggregations" : {
"by_parent_id" : {
"buckets" : [
{
"key" : "001",
"children" : {
"hits" : {
"hits" : [
{
"_source" : {
"id" : "001",
"parent_id" : "001",
"name" : "test001"
}
},
{
"_source" : {
"id" : "002",
"parent_id" : "001",
"name" : "test002"
}
},
{
"_source" : {
"id" : "003",
"parent_id" : "001",
"name" : "test003"
}
}
]
}
}
},
{
"key" : "004",
"children" : {
"hits" : {
"hits" : [
{
"_source" : {
"id" : "004",
"parent_id" : "004",
"name" : "test004"
}
}
]
}
}
}
]
}
}
}
You can also sort the children by a metric of your choice — perhaps by id.keyword:
POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.children.hits.hits._source
{
"size": 0,
"aggs": {
"by_parent_id": {
"terms": {
"field": "parent_id.keyword",
"size": 10
},
"aggs": {
"children": {
"top_hits": {
"sort": [ <---
{
"id.keyword": {
"order": "asc"
}
}
],
"size": 10
}
}
}
}
}
}
Finally, you can control the order of the top-level, terms aggregation too.

Related

Elasticsearch DSL query

{
"name" : "Danny",
"id" : "123",
"lastProfileUpdateTime" : "2021-06-26T20:08:25.089Z"
},
{
"name" : "Harry",
"id" : "124",
"lastProfileUpdateTime" : "2021-04-12T20:08:25.089Z"
},
{
"name" : "Danny Brown",
"id" : "123",
"lastProfileUpdateTime" : "2021-07-26T20:08:25.089Z"
},
{
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
}
I have a usecase where i need to find if a particular id has been updated or not and filter out latest profile. In above case since id:123 has updated profile. so expected outcome should be:
{
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
}
If a id has more than one entry, pick the one which has latest lastProfileUpdateTime.

This can be done using
Terms aggregation
: to group by "id"
Bucket selector: to get ids where count > 1
Top_hits: To get latest document for given id
Query
{
"size": 0,
"aggs": {
"Updated_docs": {
"terms": {
"field": "id.keyword",
"size": 10
},
"aggs": {
"filter_count_more_than_one": {
"bucket_selector": {
"buckets_path": {
"count": "_count"
},
"script": "params.count>1"
}
},
"latest_document": {
"top_hits": {
"size": 1,
"sort": [
{
"lastProfileUpdateTime": "desc"
}
]
}
}
}
}
}
}
Result
"aggregations" : {
"Updated_docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "123",
"doc_count" : 3,
"latest_document" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "index4",
"_type" : "_doc",
"_id" : "huE1i3sBZ5_aIkj8cvpR",
"_score" : null,
"_source" : {
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
},
"sort" : [
1630008505089
]
}
]
}
}
}
]
}
}

How to search Parent documents along with count of associated child documents

I am looking for a best way to search parent documents along with counts for associated child document? Example :
We have Organization documents and User documents. There could be thousands of users belong to one particular organization.
Organization document :
{
"id" : "001"
"name" : "orgname1"
}
{
"id" : "002"
"name" : "orgname2"
}
Users documents :
{
"id" : "testusr1"
"name" : "xyz1"
"orgId" : "001"
},
{
"id" : "testusr2"
"name" : "xyz2"
"orgId" : "001"
}
{
"id" : "testusr3"
"name" : "xyz3"
"orgId" : "001"
}
{
"id" : "testusr4"
"name" : "xyz4"
"orgId" : "001"
}
{
"id" : "testusr5"
"name" : "xyz5"
"orgId" : "002"
}
{
"id" : "testusr6"
"name" : "xyz6"
"orgId" : "002"
}
In above example, we have 4 users associated with organization with 001 and 2 users associated with 002. So on front end, admin will search for organization and as a result, I want to give response along with users count for that organization.

You can solve you issue in three ways. Each have its own advantages and disadvantages
1. Index Parent and child separately
This will require two queries . First you need to query user index and get orgId and then query child index and get its count
Advantage.
Change in one index doesn't affect other index
Disadvantage .
You need to use two queries
2. Nested Documents
Mapping:
PUT index9
{
"mappings": {
"properties": {
"id":{
"type": "integer"
},
"name":{
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"user":{
"type": "nested",
"properties": {
"id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
},
"name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
}
}
}
}
}
POST index9/_doc
{
"id" : 1,
"name" : "orgname1",
"user":[
{
"id":"testuser1",
"name":"xyz1"
},
{
"id":"testuser2",
"name":"xyz2"
}
]
}
Query:
GET index9/_search
{
"query": {
"match_all": {}
},
"aggs": {
"organization": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"user": {
"nested": {
"path": "user"
},
"aggs": {
"count": {
"value_count": {
"field": "user.id.keyword"
}
}
}
}
}
}
}
}
Result:
"aggregations" : {
"organization" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 1,
"user" : {
"doc_count" : 2,
"count" : {
"value" : 2
}
}
}
]
}
}
Nested are faster compared to parent/child,
Nested docs require reindexing the parent with all its children, while parent child allows to reindex / add / delete specific children.
3. Parent Child Relationship
Mapping
{
"my_index" : {
"mappings" : {
"properties" : {
"id" : {
"type" : "keyword"
},
"my_join_field" : {
"type" : "join",
"eager_global_ordinals" : true,
"relations" : {
"organization" : "user"
}
},
"name" : {
"type" : "text"
},
"orgId" : {
"type" : "long"
}
}
}
}
Data:
POST my_index/_doc/1
{
"id": 1,
"name" : "orgname1",
"my_join_field": "organization"
}
POST my_index/_doc/2
{
"id" : 2,
"name" : "orgname2",
"my_join_field": "organization"
}
POST my_index/_doc/3?routing=1
{
"id": "testusr1",
"name": "xyz1",
"orgId": 1,
"my_join_field": {
"name": "user",
"parent": 1
}
}
POST my_index/_doc/4?routing=2
{
"id" : "testusr5",
"name" : "xyz5",
"orgId" : 1,
"my_join_field": {
"name": "user",
"parent": 2
}
}
POST my_index/_doc/5?routing=2
{
"id" : "testusr6",
"name" : "xyz6",
"orgId" : 2,
"my_join_field": {
"name": "user",
"parent": 2
}
}
Query:
{
"query": {
"has_child": {
"type": "user",
"query": { "match_all": {} }
}
},
"aggs": {
"organization": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"user": {
"children": {
"type": "user"
},
"aggs": {
"count": {
"value_count": {
"field": "id"
}
}
}
}
}
}
}
}
Result:
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"name" : "orgname1",
"my_join_field" : "organization"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "orgname2",
"my_join_field" : "organization"
}
}
]
},
"aggregations" : {
"organization" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"user" : {
"doc_count" : 1,
"count" : {
"value" : 1
}
}
},
{
"key" : "2",
"doc_count" : 1,
"user" : {
"doc_count" : 2,
"count" : {
"value" : 2
}
}
}
]
}
Benefits:
1. Parent document and children are separate documents
Parent and child can be updated separately without re-indexing the other
It is useful when child documents are large in number and need to be added or
changed frequently.
Child documents can be returned as the results of a search request.

Elasticsearch, terms aggs according to sibling nested fields

Elasticsearch v7.5
Hello and good day!
We have 2 indices named socialmedia and influencers
Sample contents:
socialmedia:
{
'_id' : 1001,
'title' : "Title 1",
'smp_id' : 1,
"latest" : [
{
"soc_mm_score" : "5",
}
]
},
{
'_id' : 1002,
'title' : "Title 2",
'smp_id' : 2,
"latest" : [
{
"soc_mm_score" : "10",
}
]
},
{
'_id' : 1003,
'title' : "Title 3",
'smp_id' : 3,
"latest" : [
{
"soc_mm_score" : "35",
}
]
},
{
'_id' : 1004,
'title' : "Title 4",
'smp_id' : 2,
"latest" : [
{
"soc_mm_score" : "30",
}
]
}
//omitted some other fields
influencers:
{
'_id' : 1,
'name' : "John",
'smp_id' : 1
},
{
'_id' : 2,
'name' : "Peter",
'smp_id' : 2
},
{
'_id' : 3,
'name' : "Mark",
'smp_id' : 3
}
Now I have this simple query that determines which documents in the socialmedia index has the most latest.soc_mm_score value, and also displaying their corresponding influencers determined by the smp_id
GET socialmedia/_search
{
"size": 0,
"_source": "latest",
"query": {
"match_all": {}
},
"aggs": {
"LATEST": {
"nested": {
"path": "latest"
},
"aggs": {
"MM_SCORE": {
"terms": {
"field": "latest.soc_mm_score",
"order": {
"_key": "desc"
},
"size": 3
},
"aggs": {
"REVERSE": {
"reverse_nested": {},
"aggs": {
"SMP_ID": {
"top_hits": {
"_source": ["smp_id"],
"size": 1
}
}
}
}
}
}
}
}
}
}
SAMPLE OUTPUT:
"aggregations" : {
"LATEST" : {
"doc_count" : //omitted,
"MM_SCORE" : {
"doc_count_error_upper_bound" : //omitted,
"sum_other_doc_count" : //omitted,
"buckets" : [
{
"key" : 35,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1003",
"_score" : 1.0,
"_source" : {
"smp_id" : "3"
}
}
]
}
}
}
},
{
"key" : 30,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1004",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
}
}
]
}
}
}
},
{
"key" : 10,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1002",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
}
}
]
}
}
}
}
]
}
}
}
with the query above, I was able to successfully display which documents have the highest latest.soc_mm_score values
The sample output above only displays DOCUMENTS, telling that the influencers (a.k.a smp_id) related to them are the TOP INFLUENCERS according to latest.soc_mm_score
Ideally just by using this aggs query,
"terms" : {
"field" : "smp_id"
}
portrays the concept of which influencers are the top according to the doc_count
Now, displaying the terms query according to latest.soc_mm_score displays TOP DOCUMENTS
"terms" : {
"field" : "latest.soc_mm_score"
}
REAL OBJECTIVE:
I want to display the TOP INFLUENCERS according to the latest.soc_mm_count in the socialmedia index. If Elasticsearch can count all the documents where according to unique smp_id, is there a way for ES to sum all latest.soc_mm_score values and use it as terms?
My objective above should output these:
smp_id 2 as the Top Influencer because he has 2 posts (with soc_mm_score of 30 and 10), adding them gets him 40 soc_mm_score
smp_id 3 as the 2nd Top Influencer, he has 1 post with 35 soc_mm_score
smp_id 1 as the 3rd Top Influencer, he has 1 post with 5 soc_mm_score
Is there a proper query to meet this objective?

FINALLY! FOUND AN ANSWER!!!
"aggs": {
"INFS": {
"terms": {
"field": "smp_id.keyword",
"order": {
"LATEST > SUM_SVALUE": "desc"
}
},
"aggs": {
"LATEST": {
"nested": {
"path": "latest"
},
"aggs": {
"SUM_SVALUE": {
"sum" : {
"field": "latest.soc_mm_score"
}
}
}
}
}
}
}
Displays the following sample:

elastic: query the sum of a filtered subset of nested documents

Consider a document (post) like this in elasticsearch index:
{
title: "I love ice cream!"
comments: [
{
body: "me too!",
reaction: 'positive',
likes: 20
},
{
body: "huh!",
reaction: 'sarcastic',
likes: 5
}
]
}
The comments is a field of nested type.
How can elastic answer this:
Give me all posts, where the total sum of likes on "sarcastic" comments is greater than 100.
I'm open to any other way of modelling data which helps answer such queries.

This can be solved using bucket selector aggregation.
Mapping:
{
"index1" : {
"mappings" : {
"properties" : {
"comments" : {
"type" : "nested",
"properties" : {
"body" : {
"type" : "text"
},
"likes" : {
"type" : "integer"
},
"reaction" : {
"type" : "text"
}
}
},
"title" : {
"type" : "keyword"
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "p0y9DGsBfPdKzuAGdQrm",
"_score" : 1.0,
"_source" : {
"title" : "I love ice cream!",
"comments" : [
{
"body" : "me too!",
"reaction" : "positive",
"likes" : 20
},
{
"body" : "huh!",
"reaction" : "sarcastic",
"likes" : 5
}
]
}
},
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "qEy9DGsBfPdKzuAGnwox",
"_score" : 1.0,
"_source" : {
"title" : "I hate ice cream!",
"comments" : [
{
"body" : "me too!",
"reaction" : "positive",
"likes" : 10
},
{
"body" : "huh!",
"reaction" : "sarcastic",
"likes" : 5
}
]
}
}
]
}
Query:
GET index1/_search
{
"size": 0,
"aggs": {
"title": {
"terms": {
"field": "title"
},
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"reaction": {
"filter": {
"term": {
"comments.reaction": "positive"
}
},
"aggs": {
"total_likes": {
"sum": {
"field": "comments.likes"
}
}
}
}
}
},
"total_likes_filter": {
"bucket_selector": {
"buckets_path": {
"likes": "comments>reaction>total_likes"
},
"script": "params.likes > 15"
}
}
}
}
}
}
Result:
"aggregations" : {
"title" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "I love ice cream!",
"doc_count" : 1,
"comments" : {
"doc_count" : 2,
"reaction" : {
"doc_count" : 1,
"total_likes" : {
"value" : 20.0
}
}
}
}
]
}
}
}
Bucket contains only "I love ice cream!" where total likes for reaction positive is greater than 20.
I hate ice cream! has total sum 5 for positive reaction so it is not included.

ElasticSearch get last n distinct records

I am trying to implement a search query over records stored in elasticsearch.
The record structure looks something like this.
{
"_index" : "box_info_store",
"_type" : "boxes",
"_id" : "pWjQLWkBIJk0ORjd0X2P",
"_score" : null,
"_source" : {
"transactionID" : "60ab66cf24c9924f562bf1a2b5d92305d0a6",
"boxNumber" : "Box3",
"createDate" : "2013-09-17T00:00:00",
"itemNumber" : "Item1",
"address" : "Sample Address"
}
}
one box can contain multiple items. For example Box3 can have Item1, Item2 and Item3. So in elasticsearch i will have 3 different documents. Also at the same time, same box and same item can also exist but with different address. The transactionID may or maynot be the same for these documents.
My requirement is to fetch last n recent and distinct transactionIDs, along with their records.
I tried following query to fetch last 7 distinct transactionIDs
GET /box_info_store/boxes/_search?size=7
{
"query": {
"bool": {
"must": [
{"match":{"boxNumber":"Box3"}},
{"match":{"itemNumber":"Item1"}}
]
}
},
"sort": [
{
"createDate": {
"order": "desc"
}
}
],
"aggs": {
"distinct_transactions": {
"terms": { "field": "transactionID"}
}
}
}
This fetched me last 7 documents where boxNumber is Box3 and itemNumber is Item1, but not 7 distinct transactionIDs, two out of these seven documents have the same transactionID(both having separate address though).
But my requirement is to get 7 distinct transactionIds, no matter how many document it returns.
Hope i was able to explain myself.
Appreciate any kind of help here
Thanks
------Edited #gaurav9620, i ran the first query and got count as 32, then i ran the second query with distinct count as 3 i got the following result
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 32,
"max_score" : null,
"hits" : [
{
"_index" : "box_info_store",
"_type" : "boxes",
"_id" : "RWjRLWkBIJk0ORjdEX-L",
"_score" : null,
"_source" : {
"transactionID" : "3087e106244f6247a5290fb21ce64254529c",
"boxNumber" : "Box3",
"createDate" : "2017-11-15T00:00:00",
"itemNumber" : "Item1",
"address" : "sampleAddress12",
},
"sort" : [
1510704000000
]
},
{
"_index" : "box_info_store",
"_type" : "boxes",
"_id" : "MGjQLWkBIJk0ORjdwX0M",
"_score" : null,
"_source" : {
"transactionID" : "60ab66cf24c9924f562bf1a2b5d92305d0a6",
"boxNumber" : "Box3",
"createDate" : "2016-04-03T00:00:00",
"itemNumber" : "Item1",
"address" : "sampleAddress321",
},
"sort" : [
1459641600000
]
},
..........
..........
..........
{
"_index" : "box_info_store",
"_type" : "boxes",
"_id" : "AGjRLWkBIJk0ORjdK4CJ",
"_score" : null,
"_source" : {
"transactionID" : "3087e106244f6247a5290fb21ce64254529c",
"boxNumber" : "Box3",
"createDate" : "1996-02-16T00:00:00",
"itemNumber" : "Item1",
"address" : "sampleAddress4324",
},
"sort" : [
824428800000
]
}
]
},
"aggregations" : {
"unique_transactions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 16,
"buckets" : [
{
"key" : "3087e106244f6247a5290fb21ce64254529c",
"doc_count" : 6
},
{
"key" : "27c5f3422f4482495d29e7b2c15c0e311743",
"doc_count" : 5
},
{
"key" : "c40e53212e74e24bf02a5bd2b134cf92bffb",
"doc_count" : 5
}
]
}
}
}

The size which you have used : represents number of raw documents that are retrieved.
If your case what you need to do is :
Mention size as 0 -> which will return you no raw documents
Include a size parameter in aggregation which will return you unique 7 ids.
GET /box_info_store/boxes/_search?size=7
{
"query": {
"bool": {
"must": [
{
"match": {
"boxNumber": "Box3"
}
},
{
"match": {
"itemNumber": "Item1"
}
}
]
}
},
"sort": [
{
"createDate": {
"order": "desc"
}
}
],
"aggs": {
"distinct_transactions": {
"terms": {
"field": "transactionID",
"size": 7
}
}
}
}
EDIT-------------------------------------
First fire this query
GET /box_info_store/boxes/_search?size=0
{
"query": {
"bool": {
"must": [
{
"match": {
"boxNumber": "Box3"
}
},
{
"match": {
"itemNumber": "Item1"
}
}
]
}
}
}
Here you will find total number of documents matching your query which you can set as n
After this fire your query as below
GET /box_info_store/boxes/_search?size=**n**
{
"query": {
"bool": {
"must": [
{
"match": {
"boxNumber": "Box3"
}
},
{
"match": {
"itemNumber": "Item1"
}
}
]
}
},
"sort": [
{
"createDate": {
"order": "desc"
}
}
],
"aggs": {
"distinct_transactions": {
"terms": {
"field": "transactionID",
"size": NUMBER_OF_UNIQUE_TRANSACTION_IDS_TO_BE_FETCHED
}
}
}
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

ES query to group by parent_id - elasticsearch

Related

Elasticsearch DSL query

How to search Parent documents along with count of associated child documents

Elasticsearch, terms aggs according to sibling nested fields

elastic: query the sum of a filtered subset of nested documents

ElasticSearch get last n distinct records

Categories

Resources