ElasticSearch self-referential structure, orphaned children

We have one index with one type, clients/client, and clients can be self-referential in a parent-child hierarchy (but not using ES parent-child, as that doesn't support a self-referential structure).
We are considering using nested documents for this, but the hierarchy is potentially endless, which makes nested queries a bit of a hassle, or maybe even impossible.
What we primarily want to find is all top-level parents, so we build our search query by filtering for all elements that don't have a reference to a parent (a simple term value with the parent ID). We also save a reference to each element's children inside that element, as a list of child IDs, so that the frontend can issue subsequent requests when the user views that element, for a hierarchical visualization.
However, the thing that gives us a headache is: how do we, without post-processing, find child elements whose parent WASN'T found, i.e. orphaned children, so that they don't get lost in the process? The query described above, which finds top-level parents that each look up their own children, doesn't work if the search query matches ONLY a child element. The only idea we have is to do a second request, but that destroys the score sorting. We have been toying with many ideas, but have fallen short of finding a one-request Elasticsearch solution for this issue. Is there such a thing?
Our data looks something like the sample below, but of course we could also save the entire tree in each element. The question is which approach is best.
"hits": {
"total": 5,
"max_score": 1,
"hits": [
{
"_index": "clientsv3",
"_type": "client",
"_id": "5",
"_score": 1,
"_source": {
"name": "Client 2 sub2",
"country": "Belgium",
"parentId": 2
}
},
{
"_index": "clientsv3",
"_type": "client",
"_id": "2",
"_score": 1,
"_source": {
"name": "Client 2",
"country": "France",
"children": [
3,
5
]
}
},
{
"_index": "clientsv3",
"_type": "client",
"_id": "4",
"_score": 1,
"_source": {
"name": "Client 2 sub sub",
"country": "Germany",
"parentId": 3
}
},
{
"_index": "clientsv3",
"_type": "client",
"_id": "1",
"_score": 1,
"_source": {
"name": "Client 1",
"country": "Germany"
}
},
{
"_index": "clientsv3",
"_type": "client",
"_id": "3",
"_score": 1,
"_source": {
"name": "Client 2 sub",
"country": "Germany",
"children": [
4
],
"parentId": 2
}
}
]
}
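To make "orphaned children" concrete, here is a minimal Python sketch of the post-processing step the question hopes to avoid. It assumes the hit layout shown above (`_id`, `_source.parentId`); the function name `split_hits` and the three-hit result page are made up for illustration and are not part of the original setup.

```python
def split_hits(hits):
    """Partition hits into top-level parents, children whose parent was also
    matched, and orphaned children (parent missing from this result set)."""
    # _id is a string while parentId is a number in the sample, so compare as strings.
    matched_ids = {str(hit["_id"]) for hit in hits}
    top_level, attached, orphaned = [], [], []
    for hit in hits:
        parent_id = hit["_source"].get("parentId")
        if parent_id is None:
            top_level.append(hit)      # no parent reference: top-level client
        elif str(parent_id) in matched_ids:
            attached.append(hit)       # parent is part of the same result set
        else:
            orphaned.append(hit)       # parent did not match the search
    return top_level, attached, orphaned


# Hypothetical result page in which only child documents matched the query.
hits = [
    {"_id": "5", "_score": 1, "_source": {"name": "Client 2 sub2", "parentId": 2}},
    {"_id": "4", "_score": 1, "_source": {"name": "Client 2 sub sub", "parentId": 3}},
    {"_id": "3", "_score": 1, "_source": {"name": "Client 2 sub", "children": [4], "parentId": 2}},
]
top_level, attached, orphaned = split_hits(hits)
print([h["_id"] for h in orphaned])  # ['5', '3']
```

Because the hits keep their original order inside each list, the score sorting survives this kind of client-side split, but it only sees the hits of the current page.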

Related

Elasticsearch search for a child and all his sibling documents grouped by parent

I would like to be able to submit a query which matches on child documents and returns the parent and all his child documents.
I have parent and child documents in my Elasticsearch index related through a join: https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html?baymax=rec&rogue=rec-1&elektra=guide.
I have items divided into groups, and each item in my index is a separate child document (NOTE: children must be searchable separately by different queries, so I can NOT use nested objects). The parent document contains a few meaningful fields (name, sku, image), so the parent must be returned along with its children.
I've achieved my requirements using the following query:
GET my_index/_search
{
  "query": {
    "has_child": {
      "type": "child",
      "query": {
        "has_parent": {
          "parent_type": "parent",
          "query": {
            "has_child": {
              "type": "child",
              "query": {
                "multi_match": {
                  "query": "NV1540JR",
                  "fields": [
                    "name",
                    "sku"
                  ]
                }
              }
            }
          }
        }
      },
      "inner_hits": {}
    }
  }
}
It returns the following result, which is exactly what I need:
{
"took": 301,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "Az9GEAT",
"_score": 1.0,
"_source": {
"id": "Az9GEAT",
"name": "Gold Calacatta 2.0",
"sku": "NV1540",
"my_join-field": "parent"
},
"inner_hits": {
"child": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "zx9EEAR",
"_score": 1.0,
"_routing": "Az9GEAT",
"_source": {
"id": "zx9EEAR",
"name": "Gold Calacatta 12\" x 24\"",
"sku": "NV1540M-2",
"familyName": "Gold Calacatta 2.0",
"familySku": "NV1540",
"my_join-field": {
"name": "child",
"parent": "Az9GEAT"
}
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "Az9NEAT",
"_score": 1.0,
"_routing": "Az9GEAT",
"_source": {
"id": "Az9NEAT",
"name": "Gold Calacatta 2.0, 24\" x 48\"",
"sku": "NV1540JR",
"familyName": "Gold Calacatta 2.0",
"familySku": "NV1540",
"my_join-field": {
"name": "child",
"parent": "Az9GEAT"
}
}
}
]
}
}
}
}
]
}
}
Alternatively, I could implement an application-side join by making three separate query calls (one to get all matching data, a second to get siblings, a third to get parents) and combining the results in my application. But I'm not sure that would be faster, because of the HTTP request time and data processing time.
I'm very new to Elasticsearch and can't estimate how bad this is. How does it affect query performance? Are there any other ways to get the desired result? Or how could my query be improved? I'd be glad to hear any suggestions or thoughts! Thanks
For ES it's standard practice to retrieve a list of object IDs and then perform a second request to return the complete document set.
You can implement your logic using 2 queries:
Request (1): all documents satisfying your child search criteria. Select only the child.id and child.parent_id fields so that only index data is loaded and no document _source is fetched. This request will be relatively fast.
In your application code, determine the unique list of parent_ids and orphaned_child_ids.
Request (2): all documents satisfying the criteria: parent_id in parent_ids OR parent_id = NULL AND child_id in orphaned_child_ids
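The application-side step between the two requests could look roughly like the Python sketch below. It is only an illustration under assumptions: the first request is taken to return flat records with `id` and `parent_id` keys, those same names are assumed to be queryable fields, and `plan_second_request` is a hypothetical helper name. The second request body mirrors the criteria above using a bool query with terms and exists clauses.

```python
def plan_second_request(first_hits):
    """Derive parent_ids / orphaned_child_ids from request (1) and build
    the body for request (2). Field names are assumptions."""
    parent_ids, orphaned_child_ids = [], []
    for hit in first_hits:
        parent_id = hit.get("parent_id")
        if parent_id is None:
            orphaned_child_ids.append(hit["id"])   # child without a parent
        elif parent_id not in parent_ids:
            parent_ids.append(parent_id)           # keep each parent once
    body = {
        "query": {
            "bool": {
                "should": [
                    # parent_id in parent_ids
                    {"terms": {"parent_id": parent_ids}},
                    # parent_id = NULL AND child_id in orphaned_child_ids
                    {
                        "bool": {
                            "must": [{"terms": {"id": orphaned_child_ids}}],
                            "must_not": [{"exists": {"field": "parent_id"}}],
                        }
                    },
                ],
                "minimum_should_match": 1,
            }
        }
    }
    return parent_ids, orphaned_child_ids, body


# Hypothetical output of request (1); "loose-1" is an invented orphan.
first_hits = [
    {"id": "zx9EEAR", "parent_id": "Az9GEAT"},
    {"id": "Az9NEAT", "parent_id": "Az9GEAT"},
    {"id": "loose-1", "parent_id": None},
]
parent_ids, orphaned_child_ids, body = plan_second_request(first_hits)
print(parent_ids, orphaned_child_ids)  # ['Az9GEAT'] ['loose-1']
```

With a join field the parent reference is not stored under a plain `parent_id` field, so the clauses would have to be adapted to the actual mapping.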

Kibana Visualization from multiple Elastic Search Indexes

I have a requirement to find the number of mobile applications registered by a customer. The Elastic Search index is designed as below (mobile apps in one index, customers in another, and the association between the two in a third). When I created a Kibana index pattern covering these 3 indices together, it did not provide a meaningful/valid set of fields to query them.
mobile_users
{
"_index": "mobile_users",
"_type": "_doc",
"_id": "mobileuser_id1",
"_score": 1,
"_source": {
"userid": "mobileuser_id1",
"name": "jack",
"username": "jtest",
"identifiers": [ ],
"contactEmails": [ ],
"creationDate": "2020-09-29 09:18:36 GMT",
"lastUpdated": 1601371117354,
"isSuspended": false,
"authStrategyIds": [ ],
"subscription": false
}
}
mobile_applications
{
"_index": "mobile_applications",
"_type": "_doc",
"_id": "mobileapp_id1",
"_source": {
"appDefinition": {
"info": {
"version": "1.0",
"title": "TEST.MobileAPP"
},
"AppDisplayName": "TEST.MobileAPP1.0",
"appName": "TEST.MobileAPP",
"appVersion": "1.0",
"maturityState": "Test",
"isActive": false,
"owner": "mobileappowner",
"creationDate": "2020-09-24 11:21:44 GMT",
"lastModified": "2020-10-13 11:58:22 GMT",
"id": "mobileapp_id1"
}
}
}
registered_mobile_applications
{
"_index": "registered_mobile_applications",
"_type": "_doc",
"_id": "mobileuser_id1",
"_version": 1,
"_score": 1,
"_source": {
"applicationId": "mobileuser_id1",
"mobileappIds": [
"mobileapp_id1", "mobileapp_id2"
],
"lastUpdated": 1601371117929
}
}
Can you advise if there is any way to get the count of registered applications for the given customer?
It's Elasticsearch, not Elastic Search :)
Given that each of your document structures is dramatically different, it's not surprising you can't get much meaning from a single index pattern.
However, there's no way to natively count the values of an array in a document in Kibana. You could create a scripted field that does it, or add the count as a separate field during ingestion.
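The "separate field during ingestion" option might be sketched like this in Python, assuming the documents pass through your own code before being indexed. The field name `mobileappCount` and the helper `with_app_count` are invented for this example; nothing in the original indices defines them.

```python
def with_app_count(source):
    """Return a copy of the document with the array length stored as its own field."""
    enriched = dict(source)  # shallow copy, the input stays untouched
    enriched["mobileappCount"] = len(source.get("mobileappIds", []))
    return enriched


# Document body taken from the registered_mobile_applications sample.
registration = {
    "applicationId": "mobileuser_id1",
    "mobileappIds": ["mobileapp_id1", "mobileapp_id2"],
    "lastUpdated": 1601371117929,
}
enriched = with_app_count(registration)
print(enriched["mobileappCount"])  # 2
```

Once the count is a plain numeric field, it can be used in Kibana like any other number.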

Parsing source fields in a SearchResult

{
"hits": {
"total": 4,
"max_score": 12.914036,
"hits": [
{
"_index": "cars",
"_type": "sports",
"_id": "359809062-169200612195",
"_score": 12.914036,
"_source": {
"uniqueId": "35980",
"productName": "Tesla",
"Year": "2008"
}
},
{
"_index": "cars",
"_type": "sports",
"_id": "359809061-169200612191",
"_score": 11.914036,
"_source": {
"uniqueId": "33980",
"productName": "Ferrari",
"Year": "2015"
}
}
]
}
}
How do I parse all the _source fields? I'm trying to return a list that contains only the _source fields of the hits.
val searchHits = searchResult.getHits(classOf[Object]).toList
searchHits.map(hit => {
  CarDetails(
    hit.source.get("uniqueId").getAsString(),
    hit.source.get("productName").getAsString(),
    hit.source.get("Year").getAsString()
  )
})
With this code, I get the error: value get is not a member of Object, which is kind of expected.
I'm trying to parse the result without defining a model. Is that possible?
I'm using the Jest client with Scala.
Context:
CarDetails is a case class. Basically, what I'm trying to do is parse the hits (only the items inside _source) and return a list of CarDetails objects to the method that calls this function.
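Independent of the client library, the goal amounts to walking hits.hits[*]._source in the raw response JSON. The Python sketch below shows only that idea; it does not use the Jest API, and the snake_case attribute names on `CarDetails` as well as the helper `parse_sources` are assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class CarDetails:
    unique_id: str
    product_name: str
    year: str


def parse_sources(response_json):
    """Extract only the _source objects from a raw search response string."""
    response = json.loads(response_json)
    sources = (hit["_source"] for hit in response["hits"]["hits"])
    return [
        CarDetails(src["uniqueId"], src["productName"], src["Year"])
        for src in sources
    ]


# Trimmed copy of the response shown above.
raw = """
{"hits": {"total": 4, "max_score": 12.914036, "hits": [
  {"_index": "cars", "_type": "sports", "_id": "359809062-169200612195",
   "_source": {"uniqueId": "35980", "productName": "Tesla", "Year": "2008"}},
  {"_index": "cars", "_type": "sports", "_id": "359809061-169200612191",
   "_source": {"uniqueId": "33980", "productName": "Ferrari", "Year": "2015"}}
]}}
"""
cars = parse_sources(raw)
print([car.product_name for car in cars])  # ['Tesla', 'Ferrari']
```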

ElasticSearch query with conditions on multiple documents

I have data in this format in Elasticsearch; each item is in a separate document:
{ 'pid': 1, 'nm' : 'tom'}, { 'pid': 1, 'nm' : 'dick'}, { 'pid': 1, 'nm' : 'harry'}, { 'pid': 2, 'nm' : 'tom'}, { 'pid': 2, 'nm' : 'harry'}, { 'pid': 3, 'nm' : 'dick'}, { 'pid': 3, 'nm' : 'harry'}, { 'pid': 4, 'nm' : 'harry'}
{
"took": 137,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": null,
"hits": [
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KS86AaDUbQTYUmwY",
"_score": null,
"_source": {
"pid": 1,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KJ9BAaDUbQTYUmwW",
"_score": null,
"_source": {
"pid": 1,
"nm": "Tom"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KRlbAaDUbQTYUmwX",
"_score": null,
"_source": {
"pid": 1,
"nm": "Dick"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KYnKAaDUbQTYUmwa",
"_score": null,
"_source": {
"pid": 2,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KXL5AaDUbQTYUmwZ",
"_score": null,
"_source": {
"pid": 2,
"nm": "Tom"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KbcpAaDUbQTYUmwb",
"_score": null,
"_source": {
"pid": 3,
"nm": "Dick"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9Kdy5AaDUbQTYUmwc",
"_score": null,
"_source": {
"pid": 3,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KetLAaDUbQTYUmwd",
"_score": null,
"_source": {
"pid": 4,
"nm": "Harry"
}
}
]
}
}
And I need to find the pids which have 'harry' and do not have 'tom', which in the above example are 3 and 4. This essentially means: look for groups of documents sharing the same pid where none of them has nm with value 'tom' but at least one of them has nm with value 'harry'.
How do I query that?
EDIT: Using Elasticsearch version 5
You could use a bool query in a POST request body, something like the one below:
POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "nm" : "harry" }
      },
      "must_not" : {
        "term" : { "nm" : "tom" }
      }
    }
  }
}
I am relatively new to Elasticsearch, so I might be wrong, but I have never seen such a query. Simple filters cannot be used here, as they are applied to individual documents (not to aggregations), which is not what you want. What I see is that you want to do a "group by" query with a "having" clause (in SQL terms). But group-by queries involve some aggregation (like avg, max or min of a field), which is then used in the "having" clause. Basically, you use a reducer for post-processing of the aggregation results. For queries like this, the Bucket Selector Aggregation can be used. Read this
But your case is different. You do not want to apply a "having" clause to a metric aggregation; you want to check whether some value is present in a field (or column) of your "group by" data. In SQL terms, you want to do a "where" query inside a "group by". This is what I have never seen. You can also read this
However, at the application level, you can easily do this by splitting your query in two. First, find the unique pids where nm = harry using a terms aggregation. Then, for those pids, check for documents with nm = tom and discard every pid that has one.
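The two application-level steps can be sketched in Python over an in-memory copy of the sample documents. This only illustrates the set logic; in practice each step would be a request to Elasticsearch. The helper name `pids_with_but_without` is made up, and names are compared case-insensitively as an assumption, since the indexed data uses "Harry" while the question writes 'harry'.

```python
def pids_with_but_without(docs, wanted, unwanted):
    """Return pids that have at least one `wanted` doc and no `unwanted` doc."""
    wanted, unwanted = wanted.lower(), unwanted.lower()
    # Step 1: unique pids with a `wanted` document (the terms aggregation's job).
    candidates = {d["pid"] for d in docs if d["nm"].lower() == wanted}
    # Step 2: candidate pids that also own an `unwanted` document.
    excluded = {
        d["pid"]
        for d in docs
        if d["pid"] in candidates and d["nm"].lower() == unwanted
    }
    return sorted(candidates - excluded)


docs = [
    {"pid": 1, "nm": "Harry"}, {"pid": 1, "nm": "Tom"}, {"pid": 1, "nm": "Dick"},
    {"pid": 2, "nm": "Harry"}, {"pid": 2, "nm": "Tom"},
    {"pid": 3, "nm": "Dick"}, {"pid": 3, "nm": "Harry"},
    {"pid": 4, "nm": "Harry"},
]
print(pids_with_but_without(docs, "harry", "tom"))  # [3, 4]
```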
P.S. I am very new to ES, and I will be very happy if anyone contradicts me and shows a way to do this in one query. I will learn from that too.

Does the elasticsearch ID have to be unique to a type or to the index?

Elasticsearch allows you to store a _type along with the _index. I was wondering if I were to provide my own _id should it be unique across the index?
It should be unique together with the type, i.e. the combination of _type and _id must be unique within the index:
PUT so
PUT /so/t1/1
{}
PUT /so/t2/1
{}
GET /so/_search
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "so",
"_type": "t2",
"_id": "1",
"_score": 1,
"_source": {}
},
{
"_index": "so",
"_type": "t1",
"_id": "1",
"_score": 1,
"_source": {}
}
]
}
}
And the reason for that: you'd never get documents from an index without knowing the doctype, and querying ES with an index-wide query returns documents including their types and indexes.
Absolutely, there are a few ways of doing it.
The first is using the PUT API, which allows us to specify an ID for a document. So, for the index index and the type type:
curl -XPUT "http://localhost:9200/index/type/1/" -d'
{
  "test":"test"
}'
Which gives me this document:
{
"_index": "index",
"_type": "type",
"_id": "1",
"_score": 1,
"_source": {
"test": "test"
}
}
Another way is to route the ID to a unique field in your mapping. For example, an md5 hash. So, for an index called index with a type called type, we can specify the following mapping:
curl -XPUT "http://localhost:9200/index/_mapping/type" -d'
{
  "type": {
    "_id":{
      "path" : "md5"
    },
    "properties": {
      "md5": {
        "type":"string"
      }
    }
  }
}'
This time, I'm going to use the POST API, which automatically generates an ID. If you haven't specified a path in your mapping, it will automatically generate one for you.
curl -XPOST "http://localhost:9200/index/type/" -d'
{
"md5":"00000000000011111111222222223333"
}'
Which gives me the following document in a search:
{
"_index": "index",
"_type": "type",
"_id": "00000000000011111111222222223333",
"_score": 1,
"_source": {
"md5": "00000000000011111111222222223333"
}
}
The second method is generally preferred, because it provides consistency across the index. A perfectly valid id for an index could be 1 like in the example, or dog in another case.
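The answer does not say how the md5 value is produced. One possibility, shown here purely as an assumption, is to hash a canonical JSON form of the document's other fields on the client before indexing; `md5_id` is a hypothetical helper name.

```python
import hashlib
import json


def md5_id(document):
    """Hash a canonical JSON rendering so equal content gives an equal ID."""
    # sort_keys makes the result independent of key order in the dict.
    canonical = json.dumps(document, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


doc = {"name": "example", "size": 3}
doc_with_id = dict(doc, md5=md5_id(doc))  # store the hash in the "md5" field
print(len(doc_with_id["md5"]))  # 32
```

An ID derived this way is deterministic, so indexing the same content twice targets the same document instead of creating a duplicate.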
