Elasticsearch deep tree model - elasticsearch

I'm researching database tools, and I'm not quite sure whether Elasticsearch can cope with my requirements.
I have a tree data structure: a family tree.
The root is the first man, Adam; after him come his children, their children, and so on.
Elements look like this (ignore the marriage relations; this data is just to convey the idea):
{
  id: 1,
  name: "Adam",
  parentId: 0
}, {
  id: 2,
  name: "Cain",
  parentId: 1
}, {
  id: 3,
  name: "Abel",
  parentId: 1
}, {
  id: 4,
  name: "johnny (Cain junior)",
  parentId: 2
}, … {
  id: 12324568,
  name: "Cain b",
  parentId: 1434
}
Queries I’d like to exec:
'Full text' search on the element name; the response should include the matching documents and the path to them. For example, searching for 'Cain' should return:
a. Adam/Cain
b. ../David/Danny/Cain b
CRUD a person by id (ids are unique)
Get a family tree by id; the response should be a hierarchical tree (nested JSON) with 'id' as the root
The tree is about 20-30 levels deep, with up to 10,000 elements
Finally, my questions:
Can Elasticsearch provide this functionality?
Should I use the parent/child scheme?
How should the index mapping look?

To answer your questions:
3) Your index mapping could look something like this:
{
  "mappings": {
    "my_index": {
      "properties": {
        "id": {
          "type": "integer" <-- numeric fields use doc_values (enabled by default), so they can be aggregated on
        },
        "parentId": {
          "type": "integer"
        },
        "name": {
          "type": "text" <-- can be text/keyword depending on your requirement
        }
      }
    }
  }
}
2) I would suggest using the parent-child mapping so that you can have a one-to-many relationship. Elasticsearch maintains a map of how parents correspond to their children, and query-time joins are fast because of it. You could read up on this SO post for benchmarks of the parent-child mapping versus nested objects.
1) You can always do a full-text search as long as the field is mapped as text. This should help you understand the difference between the text and keyword types. You can add a single document to your index, or use bulk indexing to add multiple documents at once; the same applies to the other CRUD operations. I'm still unsure, though, how the hierarchical tree would be returned when you request documents by a parent id.
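One gap the answer leaves open is returning the path ("Adam/Cain/...") alongside each hit. Since the whole tree is at most ~10,000 elements, one option is to keep an id → document map client-side and reconstruct paths there. A minimal Python sketch under that assumption, using the id/name/parentId fields from the question (parentId 0 marks the root):

```python
# Reconstruct "Adam/Cain/..." style paths for search hits client-side.
# Assumes every document carries `id`, `name`, `parentId` as in the question;
# parentId == 0 marks the root (no document has id 0).
def build_path(doc_id, docs_by_id):
    """Walk parentId links up to the root and join the names."""
    names = []
    current = docs_by_id.get(doc_id)
    while current is not None:
        names.append(current["name"])
        current = docs_by_id.get(current["parentId"])
    return "/".join(reversed(names))

docs = [
    {"id": 1, "name": "Adam", "parentId": 0},
    {"id": 2, "name": "Cain", "parentId": 1},
    {"id": 4, "name": "Johnny", "parentId": 2},
]
docs_by_id = {d["id"]: d for d in docs}
print(build_path(4, docs_by_id))  # Adam/Cain/Johnny
```

Alternatively, a materialized-path field could be indexed on each document, so Elasticsearch returns the path directly without any client-side walk.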
Hope this helps!

Related

Elastic Search indexing for many different mapping types

I have implemented something like Class and Instance logic in my application, where I create an object named Category that serves as a blueprint for its instances.
Users are free to create as many categories as they like, with whatever fields, so I used to create one new TYPE per category in my Elasticsearch index mapping, until mapping types were deprecated in the latest upgrades.
With the latest upgrades of ES, I can think of only these two approaches:
creating one index for each category
keeping one object-type field, named after the TYPE, that holds the fields for each category, and updating this one mapping every time
I am trying to decide which approach to take for the ES upgrade from version 5 to 7 while keeping the dynamic nature of my data modelling. Searches are governed by the TYPE string, a system-generated ID for each category, hence the need to group fields by the category they belong to.
OLD MAPPINGS - NOW DEPRECATED
the first one - one for each TYPE (category)
{
  "type_cat1": {
    "dynamic": "strict",
    "mapping": {
      "field11": {...}
    }
  }
}
the second one, and so on
{
  "type_cat2": {
    "dynamic": "strict",
    "mapping": {
      "field21": {...}
    }
  }
}
NEW MAPPING WITH OBJECTS FOR EACH OLD TYPE
{
  "mapping": {
    "properties": {
      "type_cat1": {
        "properties": {
          "field11": {...}
        }
      },
      "type_cat2": {
        "properties": {
          "field21": {...}
        }
      }
    }
  }
}
ALTERNATIVE NEW MAPPING - ONE INDEX PER CATEGORY (not more than 500)
One index would be created separately for each category...
Please advise whether a better approach is out there, or which of these to choose.
I have a similar use-case at my workplace where the user can create an object with any number of fields, each field can be of any datatype.
Our approach is similar to one of yours:
All categories will be mapped to a single index.
Whenever a new object is created, the index mappings are updated to accommodate the new object (a category in your case).
This is what our mappings look like when molded to your needs:
{
  "mappings": {
    "properties": {
      "category": { // this field is present in all documents
        "type": "keyword"
      },
      "createdTime": { // this field is present in all documents
        "type": "date"
      },
      "id": { // this field is present in all documents
        "type": "long"
      },
      "fields": {
        "properties": {
          "type_cat1": {
            "properties": {
              "field1": {...},
              "field2": {...}
            }
          },
          "type_cat2": {
            "properties": {
              "field1": {...},
              "field2": {...}
            }
          }
          // ... one object per category
        }
      }
    }
  }
}
Get all records of a certain category:
"category": "cat1"
Get all records of cat1 where field2 == "dummy_value":
"category": "cat1" AND "fields.type_cat1.field2.keyword": "dummy_value"
When a new category is created, the fields part of our mappings gets updated.
Extracting the common fields (category, createdTime, id) eliminates redundancy in the mappings.
Some worthy points:
As the number of unique categories is only 500, you can also go with a separate index per category. This is more beneficial if there will be many records (> 100,000) per category.
If the categories are sparse in nature (each category has few records), then ES can easily handle everything in a single index.
If we assume 50 fields per category on average, the single-index approach totals 50 * 500 = 25,000 fields. This is a manageable number.
Of course, in the end, many things will depend upon resources allocated to the cluster.
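The "mappings get updated when a new category is created" step above can be sketched as building the body for a PUT _mapping request, nesting the new category's fields under the shared fields object. A minimal Python sketch; the category id and field definitions are illustrative:

```python
# Build the body for a `PUT /<index>/_mapping` call that registers the
# fields of a newly created category under the shared `fields` object.
# The category id and field definitions below are illustrative.
def mapping_update_for_category(category_id, field_defs):
    """Nest the category's fields under `fields.<category_id>` so each
    category keeps its own namespace inside the single shared index."""
    return {
        "properties": {
            "fields": {
                "properties": {
                    category_id: {"properties": field_defs}
                }
            }
        }
    }

body = mapping_update_for_category(
    "type_cat3",
    {"field31": {"type": "keyword"}, "field32": {"type": "long"}},
)
print(body["properties"]["fields"]["properties"]["type_cat3"])
```

Sending this to the _mapping endpoint only adds the new category's sub-object; existing fields are left untouched, which is what keeps the single-index approach workable.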

Indexing strategy for hierarchical structures on ElasticSearch

Let's say I have hierarchical types such as in example below:
base_type
child_type1
child_type3
child_type2
child_type1 and child_type2 inherit metadata properties from base_type. child_type3 has all properties inherited from both child_type1 and base_type.
To add to the example, here's several objects with their properties:
base_type_object: {
  base_type_property: "bto_prop_value_1"
},
child_type1_object: {
  base_type_property: "ct1o_prop_value_1",
  child_type1_property: "ct1o_prop_value_2"
},
child_type2_object: {
  base_type_property: "ct2o_prop_value_1",
  child_type2_property: "ct2o_prop_value_2"
},
child_type3_object: {
  base_type_property: "ct3o_prop_value_1",
  child_type1_property: "ct3o_prop_value_2",
  child_type3_property: "ct3o_prop_value_3"
}
When I query for base_type_object, I expect to search base_type_property values in each and every one of the child types as well. Likewise, if I query for child_type1_property, I expect to search through all types that have such property, meaning objects of type child_type1 and child_type3.
I see that mapping types have been removed. What I'm wondering is whether this use case warrants indexing under separate indices.
My current line of thinking using example above would be to create 4 indices: base_type_index, child_type1_index, child_type2_index and child_type3_index. Each index would only have mappings of their own properties, so base_type_index would only have base_type_property, child_type1_index would have child_type1_property etc. Indexing child_type1_object would create an entry on both base_type_index and child_type1_index indices.
This seems convenient because, as far as I can see, it's possible to search multiple indices using GET /my-index-000001,my-index-000002/_search. So I would theoretically just need to list hierarchy of my types in GET request: GET /base_type_index,child_type1_index/_search.
To make it easier to understand, here is how it would be indexed:
base_type_index
base_type_object: {
  base_type_property: "bto_prop_value_1"
},
child_type1_object: {
  base_type_property: "ct1o_prop_value_1"
},
child_type2_object: {
  base_type_property: "ct2o_prop_value_1"
},
child_type3_object: {
  base_type_property: "ct3o_prop_value_1"
}
child_type1_index
child_type1_object: {
  child_type1_property: "ct1o_prop_value_2"
},
child_type3_object: {
  child_type1_property: "ct3o_prop_value_2"
}
I think values for child_type2_index and child_type3_index are apparent, so I won't list them in order to keep the post length at a more reasonable level.
Does this make sense and is there a better way of indexing for my use case?
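The strategy described above means each object is written to its own type's index plus every ancestor's index, and a search over a type targets its index plus all ancestor indices. A small Python sketch of that bookkeeping, using the example hierarchy from the question (the `_index` naming is the question's own convention):

```python
# Fan-out indexing helper: compute which indices to search for a given type.
# The hierarchy dict mirrors the example: child_type1/child_type2 inherit
# from base_type, child_type3 inherits from child_type1.
PARENTS = {
    "child_type1": "base_type",
    "child_type2": "base_type",
    "child_type3": "child_type1",
}

def indices_for(type_name):
    """Return the type's index and every ancestor's index, root last."""
    chain = [type_name]
    while chain[-1] in PARENTS:
        chain.append(PARENTS[chain[-1]])
    return [f"{t}_index" for t in chain]

# Multi-index search target, as in GET /<idx1>,<idx2>,.../_search
print(",".join(indices_for("child_type3")))
# child_type3_index,child_type1_index,base_type_index
```

The comma-joined result is exactly the index list the GET /index1,index2/_search form expects, so the type hierarchy only needs to be known client-side.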

Having values as keys VS having them as a nested object array in ElasticSearch

Currently, I have an Elasticsearch index with a field that has subfields, say A, B, C, as below:
"myfield": {
  "A": {
    "name": "A",
    "prop1": {
      "sub-prop1": 1,
      "sub-prop2": 2
    },
    "prop2": {}
  },
  "B": {
    "name": "B",
    "prop1": {
      "sub-prop1": 3,
      "sub-prop2": 8,
      "sub-prop3": 4,
      "sub-prop4": 7
    },
    "prop2": {}
  },
  "C": {}
}
As can be seen, the structure of the A and B fields is the same, but the sub-props under prop1 can be dynamic: the mapping may change as documents are added. That is not an issue in itself, since A and B exist as separate keys. However, it causes another problem: with dynamic mapping, sub-fields like A, B, C, D and so on keep getting added to the mapping as new documents arrive, which might eventually exceed index.mapping.total_fields.limit. To avoid that, I am planning to map "myfield" as an array of objects instead, so that A, B, C... are stored as array elements rather than ever-new mapping fields.
The question is: is this a feasible solution, and how would I search for, say, "myfield.A.prop1.sub-prop1" >= 3?
the new mapping looks something like:
"myfield": [
  {
    "name": "A",
    "prop1": {
      "sub-prop1": 1,
      "sub-prop2": 2
    },
    "prop2": {}
  },
  {
    "name": "B",
    "prop1": {
      "sub-prop1": 3,
      "sub-prop2": 8,
      "sub-prop3": 4,
      "sub-prop4": 7
    },
    "prop2": {}
  },
  {}
]
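Assuming "myfield" is mapped with "type": "nested" (so that name and prop1 stay paired within each array element), the search from the question, the element named "A" with prop1.sub-prop1 >= 3, could be expressed as a nested query combining a term and a range clause. A Python sketch building the request body, with the field names taken from the question:

```python
# Nested query sketch: match array elements of `myfield` whose name is the
# given one and whose prop1.sub-prop1 is >= the given value. Assumes
# `myfield` is mapped as "type": "nested" and `name` as a keyword field.
def nested_subprop_query(element_name, min_value):
    return {
        "query": {
            "nested": {
                "path": "myfield",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"myfield.name": element_name}},
                            {"range": {"myfield.prop1.sub-prop1": {"gte": min_value}}},
                        ]
                    }
                },
            }
        }
    }

body = nested_subprop_query("A", 3)
print(body["query"]["nested"]["path"])  # myfield
```

Note that without "type": "nested" the object array would be flattened, and a term on name "A" could match a sub-prop value belonging to element "B"; the nested mapping is what preserves the per-element pairing.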

Elasticsearch & X-Pack: how to get vertices/connections from nested documents

I just started using X-Pack for Elasticsearch and want to connect vertices from a nested document type. However, looking for documentation on this hasn't got me anywhere.
What I have is an index of documents which have person names/ids as nested documents (one document can have many persons, one person can be related to many documents). The desired result is to get a graph data with connections between persons.
Does anyone have a clue or can tell me if this is even possible?
Part of my mappings:
mappings: {
  legend: {
    properties: {
      persons: {
        type: 'nested',
        properties: {
          id: {
            type: 'string',
            index: 'not_analyzed'
          },
          name: {
            type: 'string',
            index: 'not_analyzed'
          }
        }
      }
    }
  }
}
And my Graph API query, which of course doesn't work because I don't know how to handle the "name" field of the nested "persons" field.
POST sagenkarta_v3/_xpack/_graph/_explore
{
  "controls": {
    "use_significance": true,
    "sample_size": 20000,
    "timeout": 2000
  },
  "vertices": [
    {
      "field": "persons.name"
    }
  ],
  "connections": {
    "vertices": [
      {
        "field": "persons.name"
      }
    ]
  }
}
Thanks in advance!
The following question was discussed here:
https://discuss.elastic.co/t/elasticsearch-x-pack-how-to-get-vertices-connections-from-nested-documents/88709
quote from Mark_Harwood - Elastic Team Member:
Unfortunately Graph does not support nested documents but you can use
copy_to in your mappings to put the person data in an indexed field in
the containing root document.
I can see that you have the classic problem of
"computers-want-IDs-but-people-want-labels" and have both these
values. In Graph (and arguably the rest of Kibana too) I suggest you
use tokens that combine IDs for uniqueness' sake and names for
readability by humans.
The copy_to and IDs-and-labels tips are part of the modelling
suggestions in my elasticon talk this year:
https://www.elastic.co/elasticon/conf/2017/sf/getting-your-data-graph-ready
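Combining both of Mark's tips, copy_to plus IDs-and-labels tokens, could look something like the sketch below: each person gets a combined "id|name" token that is copied up to a flat root-level field for Graph to explore. All field names here (person_labels, id_and_name) are illustrative, not from the original mapping:

```python
# Mapping sketch for the copy_to workaround: nested person fields are copied
# into a flat root-level `person_labels` field that the Graph API can use
# as a vertex field. Field names `person_labels` / `id_and_name` are
# illustrative assumptions, not part of the original mapping.
mapping = {
    "mappings": {
        "properties": {
            "person_labels": {"type": "keyword"},  # flat field for Graph vertices
            "persons": {
                "type": "nested",
                "properties": {
                    # combined token indexed per person, copied up to the root
                    "id_and_name": {"type": "keyword", "copy_to": "person_labels"},
                    "id": {"type": "keyword"},
                    "name": {"type": "keyword"},
                },
            },
        }
    }
}

def id_and_name(person):
    """Combine id (unique) and name (human-readable) into one Graph-friendly token."""
    return f"{person['id']}|{person['name']}"

print(id_and_name({"id": "p42", "name": "Ada"}))  # p42|Ada
```

The Graph explore request would then use "field": "person_labels" instead of the nested "persons.name", since the copied field lives on the root document.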

Using elastic search to build flow/funnel results based on unique identifiers

I want to be able to return a set of counts of individual documents from a single index based on a previous set of results, and am wondering if there is a way to do it without running a separate query for each.
So, given a data set like this (simplified version of my ES documents):
{
"name": "visit",
"sessionId": "session1"
},
{
"name": "visit",
"sessionId": "session2"
},
{
"name": "visit",
"sessionId": "session3"
},
{
"name": "click",
"sessionId": "session1"
},
{
"name": "click",
"sessionId": "session3"
}
What I would like to do is search for name: visit and get a count of all those. That part is easy. But I would also like to count the name: click docs whose sessionId appears in the name: visit result set, and return that count alongside the visit count.
Is there an easy way to do this? I have looked at aggregation APIs but they all seem to not quite fit my needs. There also seems to be a parent/child relationship but it doesn't apply to my situation since both documents I want to individually get counts of are of the same type.
Expected result would be something like this:
{
"count": {
// total number of visit events since this is my start point
"visit": 3,
// the amount of click results that have sessionId
// matching my previous search's sessionId values
"click": 2
}
}
At first glance, you need to do this in two queries:
the first aggregation query to retrieve the sessionIds and
a second aggregation query filtered with those sessionIds to find the count of clicks.
I don't think it's a big deal to run those two queries, but that depends on how much data you have and how many sessionIds you want to retrieve at once.
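The two-step flow above can be sketched in plain Python over the sample documents; against a real cluster, step 1 would be a terms aggregation on sessionId and step 2 a count query filtered by those ids:

```python
# Two-step funnel count, simulated in plain Python over the sample docs.
# Step 1 stands in for a terms aggregation on sessionId over "visit" events;
# step 2 stands in for a count of "click" events filtered by those ids.
docs = [
    {"name": "visit", "sessionId": "session1"},
    {"name": "visit", "sessionId": "session2"},
    {"name": "visit", "sessionId": "session3"},
    {"name": "click", "sessionId": "session1"},
    {"name": "click", "sessionId": "session3"},
]

# Step 1: collect the sessionIds of all "visit" events.
visit_sessions = {d["sessionId"] for d in docs if d["name"] == "visit"}

# Step 2: count "click" events restricted to those sessionIds.
click_count = sum(
    1 for d in docs if d["name"] == "click" and d["sessionId"] in visit_sessions
)

print({"count": {"visit": len(visit_sessions), "click": click_count}})
# {'count': {'visit': 3, 'click': 2}}
```

The main scaling concern is the size of the sessionId set carried between the two queries, which matches the caveat above about how many sessionIds you retrieve at once.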
