3-level parent-child relationships: documents have a _parent field, but a has_child query returns no results - elasticsearch

I have three types of documents: request, job, and driver. There's a parent/child relationship between request and job, and a parent/child relationship between job and driver. This query:
GET batch-val-weekly-2015-02-v1/driver/_search
{
"query": {
"match_all": {}
},
"fields": ["_parent"]
}
returns a list of documents. The parent field is present in each and gives the id of the appropriate job document. But this query:
GET batch-val-weekly-2015-02-v1/job/_search
{
"query": {
"has_child": {
"type": "driver",
"query": {
"match_all": {}
}
}
}
}
returns no hits. I also get no hits from a has_parent query.
EDIT: This was answered in Elasticsearch deeper level Parent-child relationship (grandchild). See my explanation below.

When I remove the "driver" property from the "driver" type, it seems to work fine; if including it was a mistake, then removing it solves the problem. More concretely, first I create the index (notice the difference in the definition of "driver" from what you posted above):
DELETE /test_index
PUT /test_index
{
"mappings": {
"request": {
"_all": {
"enabled": false
},
"_timestamp": {
"enabled": true
}
},
"job": {
"_all": {
"enabled": false
},
"_timestamp": {
"enabled": true
},
"_parent": {
"type": "request"
}
},
"driver": {
"_all": {
"enabled": false
},
"_parent": {
"type": "job"
},
"properties": {
"rules": {
"type": "nested"
}
}
}
}
}
add three docs related to each other:
PUT /test_index/_bulk
{"index": {"_index": "test_index", "_type": "request", "_id": 1}}
{}
{"index": {"_index": "test_index", "_type": "job", "_id": 1, "_parent":1}}
{}
{"index": {"_index": "test_index", "_type": "driver", "_id": 1, "_parent":1}}
{}
first query works:
POST /test_index/driver/_search
{
"query": {
"match_all": {}
},
"fields": [
"_parent"
]
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "driver",
"_id": "1",
"_score": 1,
"fields": {
"_parent": "1"
}
}
]
}
}
second query works:
POST /test_index/job/_search
{
"query": {
"has_child": {
"type": "driver",
"query": {
"match_all": {}
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "job",
"_id": "1",
"_score": 1,
"_source": {}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/602af489fbfb6595c65cd27cff7a1926642ea205

It turns out that has_parent and has_child queries are problematic with more than two generations. That's because these queries rely on the parent and child being on the same shard. By default, documents are routed to shards using their IDs (more on this here). When you establish a parent-child relationship, child documents are instead routed based on their parent IDs, to ensure that they end up on the same shard as the parent.
So if you have three generations -- parents, children, grandchildren -- the children are routed using the parents' IDs, but the grandchildren are routed using the children's IDs by default. Since the children themselves were not routed by their own IDs, the grandchildren end up on a different shard from their parents, and has_parent/has_child queries can't match them up.
The way to fix this is to route the grandchildren using the top-level parents' (i.e. the grandparents') IDs as well. This is configurable, see the Elasticsearch routing documentation here. You can also set up custom routing via the Java API. That's what I did, and it fixed the problem.
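As a minimal sketch of what that looks like with the index above (using hypothetical IDs r1, j1 and d1 purely for illustration), the grandchild-level document keeps its parent for the join but is routed by the grandparent's ID:
PUT /test_index/request/r1
{}

# the job is routed by its parent ID (r1) automatically
PUT /test_index/job/j1?parent=r1
{}

# the driver keeps ?parent= for the parent/child join, but routing is forced to
# the grandparent ID (r1); without it the driver would be routed by j1 and could
# land on a different shard than the request/job pair
PUT /test_index/driver/d1?parent=j1&routing=r1
{}
With that routing in place, the has_child query on job should find the driver again.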
Check out this stackoverflow answer for more: Elasticsearch deeper level Parent-child relationship (grandchild)

Related

Elasticsearch search for a child and all his sibling documents grouped by parent

I would like to be able to submit a query which matches on child documents and returns the parent and all of its child documents.
I have parent and child documents in my Elasticsearch index related through a join: https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html.
I have items divided into groups; each item in my index is a separate child document (NOTE: it's required to be able to search children separately with a different query, so I can NOT use nested objects). The parent document contains a few meaningful fields (name, sku, image), so it's required to get the parent along with its children.
I've achieved my requirements using the following query:
GET my_index/_search
{
"query": {
"has_child": {
"type": "child",
"query": {
"has_parent": {
"parent_type": "parent",
"query": {
"has_child": {
"type": "child",
"query": {
"multi_match": {
"query": "NV1540JR",
"fields": [
"name",
"sku"
]
}
}
}
}
}
},
"inner_hits": {}
}
}
}
It returns the following result, which is exactly what I need:
{
"took": 301,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "Az9GEAT",
"_score": 1.0,
"_source": {
"id": "Az9GEAT",
"name": "Gold Calacatta 2.0",
"sku": "NV1540",
"my_join-field": "parent"
},
"inner_hits": {
"child": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "zx9EEAR",
"_score": 1.0,
"_routing": "Az9GEAT",
"_source": {
"id": "zx9EEAR",
"name": "Gold Calacatta 12\" x 24\"",
"sku": "NV1540M-2",
"familyName": "Gold Calacatta 2.0",
"familySku": "NV1540",
"my_join-field": {
"name": "child",
"parent": "Az9GEAT"
}
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "Az9NEAT",
"_score": 1.0,
"_routing": "Az9GEAT",
"_source": {
"id": "Az9NEAT",
"name": "Gold Calacatta 2.0, 24\" x 48\"",
"sku": "NV1540JR",
"familyName": "Gold Calacatta 2.0",
"familySku": "NV1540",
"my_join-field": {
"name": "child",
"parent": "Az9GEAT"
}
}
}
]
}
}
}
}
]
}
}
Alternatively, I could implement an application-side join by making three separate query calls (one to get all the matching data, a second to get the siblings, and a third to get the parents) and combining the results in my application. But I'm not sure that would be faster, because of the extra HTTP round trips and data-processing time.
I'm very new to Elasticsearch and can't estimate how bad this is. How does it affect query performance? Are there other ways to get the desired result, or could my query be improved? I'd be glad to hear any suggestions or thoughts! Thanks
In ES it's standard practice to retrieve a list of object IDs and then perform a second request to return the complete document set.
You can implement your logic using two queries (see the sketch after these steps):
Request (1): all documents satisfying your child search criteria. Select only the child.id and child.parent_id fields so that you load only indexed data and don't touch the document _source. This request will be relatively fast.
In your application code, determine the unique list of parent_ids and orphaned_child_ids.
Request (2): all documents satisfying the criteria: parent_id in parent_ids OR (parent_id is NULL AND child_id in orphaned_child_ids).
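A rough sketch of the two requests in console form, assuming the join-field mapping from the question above (field names such as name, sku and my_join-field come from that example, and for brevity the first request uses _source filtering rather than doc values):
# Request (1): match the children, returning only the id and the join field
# (whose "parent" value carries the parent ID)
GET my_index/_search
{
  "_source": ["id", "my_join-field"],
  "query": {
    "bool": {
      "filter": [
        { "term": { "my_join-field": "child" } },
        { "multi_match": { "query": "NV1540JR", "fields": ["name", "sku"] } }
      ]
    }
  }
}

# Request (2): fetch the collected parents plus all of their children;
# "Az9GEAT" stands in for the parent IDs gathered by the application
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "ids": { "values": ["Az9GEAT"] } },
        {
          "has_parent": {
            "parent_type": "parent",
            "query": { "ids": { "values": ["Az9GEAT"] } }
          }
        }
      ]
    }
  }
}
Children without a parent can be fetched in the same second request by adding their IDs in another ids clause.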

Elasticsearch OR query with nested objects returns inner_hits not matching the criteria

I'm getting weird results when querying nested objects. Imagine the following structure:
{ owner.name = "fred",
...,
pets [
{ name = "daisy", ... },
{ name = "flopsy", ... }
]
}
If I only have the document shown above, and I search for pets matching these criteria:
pets.name = "daisy" OR
(owner.name = "julie" and pet.name = "flopsy")
I would expect to only get one result ("daisy"), but I'm getting both pet names.
This is one way to reproduce this:
# Create nested mapping
PUT pet-owners
{
"mappings": {
"animals": {
"properties": {
"owner": {"type": "text"},
"pets": {
"type": "nested",
"properties": {
"name": {"type": "text", "fielddata": true}
}
}
}
}
}
}
# Insert nested object
PUT pet-owners/animals/1?op_type=create
{
"owner" : "fred",
"pets" : [
{ "name" : "daisy"},
{ "name" : "flopsy"}
]
}
# Query
GET pet-owners/_search
{ "from": 0, "size": 50,
"query": {
"constant_score": {
"filter": { "bool": {"must": [
{"bool": {"should": [
{"nested": {"query":
{"term": {"pets.name": "daisy"}},
"path":"pets",
"inner_hits": {
"name": "pets_hits_1",
"size": 99,
"_source": false,
"docvalue_fields": ["pets.name"]
}
}},
{"bool": {"must": [
{"term": {"owner": "julie"}},
{"nested": {"query":
{"term": {"pets.name": "flopsy"}},
"path":"pets",
"inner_hits": {
"name": "pets_hits_2",
"size": 99,
"_source": false,
"docvalue_fields": ["pets.name"]
}
}}
]}}
]}}
]}}}},
"_source": false
}
The query returns both pets names (as opposed to the expected one).
Is this behavior normal? Am I doing something wrong, or is my reasoning about the nested structure or the query behavior flawed?
Any help or guidance will be much appreciated.
I'm running this query under ElasticSearch 6.3.x
EDIT: I'm adding the response received, to better illustrate the case
{
"took": 16,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "pet-owners",
"_type": "animals",
"_id": "1",
"_score": 1,
"inner_hits": {
"pets_hits_1": {
"hits": {
"total": 1,
"max_score": 0.6931472,
"hits": [
{
"_index": "pet-owners",
"_type": "animals",
"_id": "1",
"_nested": {
"field": "pets",
"offset": 0
},
"_score": 0.6931472,
"fields": {
"pets.name": [
"daisy"
]
}
}
]
}
},
"pets_hits_2": {
"hits": {
"total": 1,
"max_score": 0.6931472,
"hits": [
{
"_index": "pet-owners",
"_type": "animals",
"_id": "1",
"_nested": {
"field": "pets",
"offset": 1
},
"_score": 0.6931472,
"fields": {
"pets.name": [
"flopsy"
]
}
}
]
}
}
}
}
]
}
}
So we can see that it's not that the query matches and returns the whole existing document, but that it returns each of the pets independently, one inside each of the inner_hits. It's this result that's surprising to me.
(edited) - in summary, this issue is about the context in which the inner_hits are evaluated:
It looks like the inner_hits 'pets_hits_2' returns a match because it belongs to the nested query that simply searches the pets field for 'flopsy'.
As an independent query on our single document, that is a valid hit.
However, because that nested query sits inside a bool/must together with another query (the owner = 'julie' term) that does not match our document, you might well expect the inner_hits to take that into account and therefore not return a hit.
I haven't been able to find any docs clarifying whether this is intentional behaviour or not - it might be worth raising with Elastic.
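To see the per-clause behaviour in isolation, here is a small sketch (against the pet-owners index from the question) that runs only the second branch of the should on its own:
GET pet-owners/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "owner": "julie" } },
        {
          "nested": {
            "path": "pets",
            "query": { "term": { "pets.name": "flopsy" } },
            "inner_hits": { "name": "pets_hits_2" }
          }
        }
      ]
    }
  }
}
Run alone, this returns no hits at all, because the owner is "fred" rather than "julie". In the combined query, however, the document still matches via the "daisy" branch, and the inner_hits defined on the "flopsy" nested clause are populated anyway, which is exactly the surprising behaviour described above.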

ElasticSearch: Grandchild/child/parent relations not working properly

I'm facing some odd behavior of Elasticsearch while searching grandchildren. The grandchild documents don't seem to be matched up with every parent document. When I ask Elasticsearch to return the children of a parent, it returns all the expected hits. But when I ask it to return those children which have grandchildren, I get incorrect results: sometimes no hits, sometimes fewer than expected. When I check the routing and parent ID of my grandchildren, they do point to the correct parents, so I can't understand why I'm getting incorrect results. Has anybody encountered this kind of issue?
I checked my code three times and didn't find any mistake :-(
Let me show you the steps to reproduce this error.
Here is my mapping:
PUT /test_index
{
"mappings":{
"parentDoc":{
"properties":{
"id":{
"type":"integer"
},
"name":{
"type":"text"
}
}
},
"childDoc": {
"_parent": {
"type": "parentDoc"
},
"properties":{
"id":{
"type":"integer"
},
"name":{
"type":"text"
},
"contact": {
"type":"text"
}
}
},
"grandChildDoc": {
"_parent": {
"type": "childDoc"
},
"properties":{
"id":{
"type":"integer"
},
"description":{
"type":"text"
}
}
}
}
}
Indexing parentDoc:
PUT /test_index/parentDoc/1
{
"pdId":1,
"name": "First parentDoc"
}
PUT /test_index/parentDoc/2
{
"pdId":2,
"name": "Second parentDoc"
}
Indexing childDoc:
PUT /test_index/childDoc/10?parent=1
{
"cdId":10,
"name": "First childDoc",
"contact" : "+XX0000000000"
}
PUT /test_index/childDoc/101?parent=1
{
"cdId":101,
"name": "Second childDoc",
"contact" : "+XX0000000111"
}
PUT /test_index/childDoc/20?parent=2
{
"cdId":20,
"name": "Third childDoc",
"contact" : "+XX0011100000"
}
Indexing grandChildDoc:
PUT /test_index/grandChildDoc/100?parent=10
{
"gcdId":100,
"name": "First grandChildDoc"
}
PUT /test_index/grandChildDoc/200?parent=10
{
"gcdId":200,
"name": "Second grandChildDoc"
}
PUT /test_index/grandChildDoc/300?parent=20
{
"gcdId":300,
"name": "Third grandChildDoc"
}
Now when I ask Elasticsearch to show me those parentDoc documents which have a childDoc, it returns:
POST /test_index/parentDoc/_search
{
"query": {
"has_child": {
"type": "childDoc",
"query": {
"match_all": {}
}
}
}
}
Result (this seems fine!):
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "parentDoc",
"_id": "2",
"_score": 1,
"_source": {
"pdId": 2,
"name": "Second parentDoc"
}
},
{
"_index": "test_index",
"_type": "parentDoc",
"_id": "1",
"_score": 1,
"_source": {
"pdId": 1,
"name": "First parentDoc"
}
}
]
}
}
Now when I ask Elasticsearch to show me those childDoc documents which have a grandChildDoc, it returns:
POST /test_index/childDoc/_search
{
"query": {
"has_child": {
"type": "grandChildDoc",
"query": {
"match_all": {}
}
}
}
}
Result (here you will notice that some hits are missing; for example, childDoc with ID 10 is missing even though it has grandchildren):
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "childDoc",
"_id": "20",
"_score": 1,
"_routing": "2",
"_parent": "2",
"_source": {
"cdId": 20,
"name": "Third childDoc",
"contact": "+XX0011100000"
}
}
]
}
}
Any idea what mistake I'm making, or is this a bug? Any workaround or solution?
[Note: I'm using elasticsearch v5.4]
I have got the same setup working. I am using Logstash to index the documents into Elasticsearch.
Root Cause:
I have explored the root cause. By default Elasticsearch assigns 5 shards to an index, and all the documents of one parent-child-grandchild set must be located in the same shard. Unfortunately, with the default routing the data gets spread across shards, and Elasticsearch only returns those records that happen to sit on the same shard.
Solution:
For parent-child-grandchild to work, you need to use the grandparent document ID as the routing value of the grandchild document.
For a single level (parent-child), the parent ID is the default routing value, which works fine. But for three levels, you need to configure routing explicitly for each grandchild document.
As mentioned, the routing value should be the grandparent ID.
Here is an example using Logstash:
Parent
"index" => "search"
"document_type" => "parent"
"document_id" => "%{appId}"
Child: works by default, since the parent value (which doubles as the routing value) is the parent document ID. Routing formula: shard_num = hash(_routing) % num_primary_shards.
"index" => "search"
"document_type" => "child"
"document_id" => "%{lId}"
"parent" => "%{appId}"
Grandchild: note that the routing is appId, which is the grandparent document ID.
"index" => "search"
"document_type" => "grandchild"
"document_id" => "%{lBId}"
"parent" => "%{lId}"
"routing" => "%{appId}"
This routes all the related documents to the same shard, and the search works fine in this use case; the same fix expressed as plain index requests is sketched below.
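For comparison, here is roughly what the same fix looks like as plain index requests against the test_index from the question above (the ?parent= parameter keeps the parent/child join intact while routing is forced to the top-level parentDoc ID):
PUT /test_index/grandChildDoc/100?parent=10&routing=1
{
  "gcdId": 100,
  "name": "First grandChildDoc"
}
PUT /test_index/grandChildDoc/200?parent=10&routing=1
{
  "gcdId": 200,
  "name": "Second grandChildDoc"
}
PUT /test_index/grandChildDoc/300?parent=20&routing=2
{
  "gcdId": 300,
  "name": "Third grandChildDoc"
}
After reindexing the grandchildren this way, the has_child query on childDoc should return both childDoc 10 and childDoc 20.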

Elasticsearch aggregation turns results to lowercase

I've been playing with ElasticSearch a little and found an issue when doing aggregations.
I have two endpoints, /A and /B. The first one holds parents for the second one, so one or many objects in B must belong to one object in A. Therefore, objects in B have an attribute "parentId" containing the parent's ID as generated by Elasticsearch.
I want to filter parents in A by the attributes of their children in B. To do this, I first filter the children in B by attributes and collect their unique parent IDs, which I'll later use to fetch the parents.
I send this request:
POST http://localhost:9200/test/B/_search
{
"query": {
"query_string": {
"default_field": "name",
"query": "derp2*"
}
},
"aggregations": {
"ids": {
"terms": {
"field": "parentId"
}
}
}
}
And get this response:
{
"took": 91,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "child",
"_id": "AU_fjH5u40Hx1Kh6rfQG",
"_score": 1,
"_source": {
"parentId": "AU_ffvwM40Hx1Kh6rfQA",
"name": "derp2child2"
}
},
{
"_index": "test",
"_type": "child",
"_id": "AU_fjD_U40Hx1Kh6rfQF",
"_score": 1,
"_source": {
"parentId": "AU_ffvwM40Hx1Kh6rfQA",
"name": "derp2child1"
}
},
{
"_index": "test",
"_type": "child",
"_id": "AU_fjKqf40Hx1Kh6rfQH",
"_score": 1,
"_source": {
"parentId": "AU_ffvwM40Hx1Kh6rfQA",
"name": "derp2child3"
}
}
]
},
"aggregations": {
"ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "au_ffvwm40hx1kh6rfqa",
"doc_count": 3
}
]
}
}
}
For some reason, the aggregated key is returned in lowercase, so I can't use it to fetch the parent from Elasticsearch:
GET http://localhost:9200/test/A/au_ffvwm40hx1kh6rfqa
Response:
{
"_index": "test",
"_type": "A",
"_id": "au_ffvwm40hx1kh6rfqa",
"found": false
}
Any ideas on why is this happening?
The difference between the hits and the aggregation results is that aggregations work on the indexed terms, and they also return those terms, whereas the hits return the original source.
How are these terms created? By the chosen analyzer, which in your case is the default one, the standard analyzer. One of the things this analyzer does is lowercase all the characters of the terms. As mentioned by Andrei, you should configure the parentId field to be not_analyzed:
PUT test
{
"mappings": {
"B": {
"properties": {
"parentId": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
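After recreating the index with this mapping and reindexing the B documents, the same terms aggregation should return the parent ID with its original casing, along the lines of:
"aggregations": {
  "ids": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "AU_ffvwM40Hx1Kh6rfQA",
        "doc_count": 3
      }
    ]
  }
}
That key can then be used directly in GET http://localhost:9200/test/A/AU_ffvwM40Hx1Kh6rfQA.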
I am late to the party, but I had the same issue and found that it is caused by normalization.
You have to change the mapping of the index if you want to prevent normalization from changing the aggregated values to lowercase.
You can check the current mapping in the Dev Tools console by typing:
GET /A/_mapping
GET /B/_mapping
When you see the structure of the index, check the settings of the parentId field.
If you don't want to change the behaviour of the field but you do want to avoid the normalization during the aggregation, you can add a sub-field to the parentId field.
To change the type of an existing field you would have to delete the index and recreate it with the new mapping; adding a multi-field to an existing field, however, can be done with a mapping update. See:
creating the index
Adding multi-fields to an existing field
In your case it looks like this (it contains only the parentId field)
PUT /B/_mapping
{
"properties": {
"parentId": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
Then you have to use the sub-field in the query (note that documents indexed before the sub-field was added need to be reindexed, for example with _update_by_query, before it is populated):
POST http://localhost:9200/test/B/_search
{
"query": {
"query_string": {
"default_field": "name",
"query": "derp2*"
}
},
"aggregations": {
"ids": {
"terms": {
"field": "parentId.keyword",
"order": {"_key": "desc"}
}
}
}
}

Elasticsearch: get multiple specified documents in one request?

I am new to Elasticsearch and would like to know whether this is possible.
Basically, each of my documents has a unique value in its "code" property. Now I have the codes of several documents and hope to retrieve them all in one request by supplying multiple codes.
Is this doable in Elasticsearch?
Regards.
Edit
This is the mapping of the field:
"code" : { "type" : "string", "store": "yes", "index": "not_analyzed"},
Two example values of this property:
0Qr7EjzE943Q
GsPVbMMbVr4s
What is the ES syntax to retrieve the two documents in ONE request?
First, you probably don't want "store":"yes" in your mapping, unless you have _source disabled (see this post).
So, I created a simple index like this:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"code": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
added the two docs with the bulk API:
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"code":"0Qr7EjzE943Q"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"code":"GsPVbMMbVr4s"}
There are a number of ways I could retrieve those two documents. The most straightforward, especially since the field isn't analyzed, is probably with a terms query:
POST /test_index/_search
{
"query": {
"terms": {
"code": [
"0Qr7EjzE943Q",
"GsPVbMMbVr4s"
]
}
}
}
both documents are returned:
{
"took": 21,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.04500804,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.04500804,
"_source": {
"code": "0Qr7EjzE943Q"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.04500804,
"_source": {
"code": "GsPVbMMbVr4s"
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/a3e3e4f05753268086a530b06148c4552bfce324
