How to save persistent data structure into elastic search - elasticsearch

I have a relational DB that stores a persistent segment tree structure, meaning parents and children are always in a many-to-many relationship. The height of this tree is about 4 to 5 levels.
Persistent segment tree reference: https://www.geeksforgeeks.org/persistent-segment-tree-set-1-introduction/
Because it is a segment tree, node data won't change much; when it does change, a new node is generated.
My question is: which way should I save my data into Elasticsearch?
My current thought:
nested will build too many documents: any change to a leaf node will generate a new root, and the new root will carry a lot of redundant data, because many unchanged (un-versioned) nodes get copied along, and it is hard to trace node versioning
parent-child mode looks like a good fit; however, each node may have multiple parents and children.
This many-to-many relationship is almost everywhere in my DB, so denormalizing it seems impossible. The only solution I can think of is maintaining children and parent IDs at each level, but that is hard to maintain, and the joins might be slow.
Please enlighten me with your ideas.


Elastic Search - Joins best practices

I came across the following as part of the documentation:
In Elasticsearch the key to good performance is to de-normalize your data into documents
And also,
the restriction that both the child and parent documents must be on the same shard
Given a scenario of a multilevel hierarchy (grandparent --> parent --> child), where some parents have more children than others, the data might be skewed and a few shards might contain exponentially more data than other shards.
What are the best practices with respect to gaining more performance?
Is it a good idea to put the whole hierarchy in a single document (rather than one document per level)? The parent data might be redundant if there are many children, as the parent data needs to be copied to all the documents.
Yes, both of the statements you mentioned are correct. Let me answer both of your questions in the context of your use case.
Is it a good idea to put all the hierarchy in a single document (rather than one document for each level). The parent data might be redundant if there are more children as the parent data need to be copied to all the documents?
Answer: In general, if you have all the data in a single document, searching will definitely be much faster, and that's the whole reason for denormalizing data, which is also what the first statement says: you don't have to spawn multiple worker threads and combine results from multiple documents/shards/nodes. Also, storage is cheap; denormalization costs some storage but saves compute, which is more expensive than storage. In short, if you are worried about query performance, denormalizing your data will give it a major boost.
What are the best practices with respect to gain more performance?
Answer: if you still go ahead with the normalization approach, then, as mentioned, you should keep all the related docs on the same shard, and you should implement custom routing to achieve that.
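For reference, a minimal sketch of what the normalized approach looks like in recent Elasticsearch versions (6.x and later), which model parent/child with a `join` field and take the routing value on the index request. The index, field, and relation names below are made up for illustration:

```json
PUT /hierarchy
{
  "mappings": {
    "properties": {
      "relation": {
        "type": "join",
        "relations": { "grandparent": "parent", "parent": "child" }
      }
    }
  }
}

PUT /hierarchy/_doc/child-1?routing=grandparent-1
{
  "name": "some child",
  "relation": { "name": "child", "parent": "parent-1" }
}
```

Because every document of one family must carry the same routing value, a family with very many descendants lands entirely on one shard, which is exactly the skew described in the question.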

Deserializing parts of a serialized whole; Deserializing object contents without deserializing the entire object

I have an algorithm that requires behavior that I have not previously seen. While traversing a tree that represents a file structure, I want to be able to serialize entire subtrees once they are built (all nodes populated), and store them away to a file or app data for later use.
The hard part is, I want to build the system in such a way that I can deserialize only an individual node of the tree structure, without having to deserialize the whole tree that was initially stored away.
For context, the algorithm is intended for my application to use only when building file structures of extremely large file systems, with millions of files and directories. The structure becomes so large that it is too costly to keep millions of representative file nodes loaded in memory. I need to be able to store parts or all of the tree away as I go, so that I can reduce the memory footprint; however, I still need to be able to retrieve information about any part of the tree at any given time, whether it has already been serialized and stored away or not. If at any point I find that a node references another node in the tree structure, I need to be able to do a lookup on that node and deserialize or retrieve it, without having to unpack the whole tree.
This one is hurting my brain.
A good way to structure a file storing a filesystem tree is as a list of (path, serialized node data) pairs sorted by path. You can then look up a node using binary search. If you want to quickly load all children of a given node, you should use a different key: (number of path components, path). Then all children of a node will be stored in a contiguous range of the file.
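As a sketch (in Go, since no language was specified; the types and helper names are made up), the two lookups over such a sorted list might look like:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// entry is one (path, serialized node data) record.
type entry struct {
	path string
	data string
}

// lookup finds a node by path in a slice sorted by path,
// using binary search.
func lookup(entries []entry, path string) (string, bool) {
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].path >= path
	})
	if i < len(entries) && entries[i].path == path {
		return entries[i].data, true
	}
	return "", false
}

// childKey orders records so all children of a node are contiguous:
// the primary key is the number of path components, the secondary
// key is the path itself.
func childKey(path string) (int, string) {
	return strings.Count(path, "/"), path
}

func main() {
	entries := []entry{
		{"/a", "dir a"},
		{"/a/x", "file x"},
		{"/b", "dir b"},
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].path < entries[j].path })
	d, ok := lookup(entries, "/a/x")
	fmt.Println(d, ok) // file x true
}
```

The same `sort.Search` call works against a memory-mapped file of fixed-size records, which is what makes the on-disk variant practical.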
I haven't tried it myself, but it seems that LevelDB implements such storage.

Suitable tree data structure

Which is the most suitable tree data structure to model hierarchical (containment relationship) content? My language is a bit informal, as I don't have much theoretical background on these topics.
Parent node can have multiple children.
Unique parent
Tree structure is rarely changed, so it is OK to recreate it rather than add/rearrange nodes.
Two way traversal
Mainly interested in: find parent, find children, find a node by a unique id
Every node has a unique id
There might be only hundreds of nodes in total, so performance may not be a big concern.
Persistence may be good to have, but not necessary as I plan to use it in memory after reading the data from DB.
My language of choice is Go (golang), so the available libraries are limited. Please give a recommendation, without considering the language, that best fits the above requirements.
http://godashboard.appspot.com/ lists some of the available tree libraries. I am not sure about their quality or how active they are. I have read good things about:
https://github.com/petar/GoLLRB
http://www.stathat.com/src/treap
Please let know any additional information required.
Since there are only hundreds of nodes, just make the structure the same as you have described.
Each node has a reference to its parent node.
Each node has a list of child nodes.
Each node has an id.
There is an (external) map from id --> node. It may not even be necessary.
Two-way traversal is possible, since the parent and child nodes are known. The same goes for find parent and find children.
Find by id can be done by traversing the whole tree if there is no map, or you can use the map to find the node quickly.
Adding a node is easy, since there is a list in each node. Rearranging is also easy, since you can freely add/remove from the list of child nodes and reassign the parent node.
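A minimal Go sketch of the structure described above (the type and method names are my own):

```go
package main

import "fmt"

// Node is a tree node with a back pointer to its parent,
// a list of children, and a unique id.
type Node struct {
	ID       string
	Parent   *Node
	Children []*Node
}

// Tree keeps an id -> node map for O(1) lookup by id.
type Tree struct {
	Root *Node
	byID map[string]*Node
}

func NewTree(rootID string) *Tree {
	root := &Node{ID: rootID}
	return &Tree{Root: root, byID: map[string]*Node{rootID: root}}
}

// Add attaches a new node under the node with the given parent id.
func (t *Tree) Add(parentID, id string) *Node {
	p := t.byID[parentID]
	n := &Node{ID: id, Parent: p}
	p.Children = append(p.Children, n)
	t.byID[id] = n
	return n
}

// Find returns the node with the given id, or nil if absent.
func (t *Tree) Find(id string) *Node { return t.byID[id] }

func main() {
	t := NewTree("root")
	t.Add("root", "a")
	t.Add("a", "a1")
	fmt.Println(t.Find("a1").Parent.ID) // a
}
```

With hundreds of nodes, the map is optional, but it turns find-by-id from a full traversal into a single lookup.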
I'm answering this question from a language-agnostic angle. This is a tree structure without any constraints on its shape, so ready-made implementations are not that common.
I think a B-Tree is the way to go for your requirements: http://en.wikipedia.org/wiki/B-tree
Points 1, 2, 3: a B-Tree inherently supports these (multiple children, unique parent, and insertion/deletion of elements).
Points 4, 5: each node has pointers to its children in the default implementation. Additionally, you can maintain a parent pointer for each node, and you can implement your search/traversal operations with BFS/DFS with the help of these pointers.
Point 6: depends on the implementation of your insert method, if you don't allow duplicate records.
Points 7, 8: not an issue, as you have mentioned that you only have hundreds of records. B-Trees are also quite a good data structure for external disk storage.

Persisting a hierarchical ordered list (flatfile/sql/nosql)

I want to store hierarchical ordered lists. One example would be nested todo lists. Another example would be XML. It would just be a tree where the children are in order. For simplicity, entries are just strings of text.
The thing is that the list will be edited by the user, so it is important that the common operations are fast:
Edit an element
Delete an element
Insert an entry before another
I can imagine how to do this in a data structure: entries are linked list nodes; if an entry has children, it also points to the head of another linked list. There is a hash table mapping entry ids to the actual data.
Editing is looking up the hash and then replacing the data part of the linked list node
Deletion is looking up the hash and doing linked list deletion
Insertion is looking up the hash and doing linked list insertion
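A rough Go sketch of this linked-list-plus-hash-table design (the type and function names are made up):

```go
package main

import "fmt"

// Entry is a node in a doubly linked sibling list; FirstChild points
// to the head of a child list, so the whole tree is linked lists.
type Entry struct {
	ID         int
	Text       string
	Prev, Next *Entry
	FirstChild *Entry
	Parent     *Entry
}

// index maps entry id to the entry itself, so edit/delete/insert
// start with an O(1) lookup instead of a tree walk.
var index = map[int]*Entry{}

// Edit replaces the text of an entry.
func Edit(id int, text string) { index[id].Text = text }

// InsertBefore links a new entry in front of an existing one.
func InsertBefore(id, newID int, text string) {
	at := index[id]
	e := &Entry{ID: newID, Text: text, Prev: at.Prev, Next: at, Parent: at.Parent}
	if at.Prev != nil {
		at.Prev.Next = e
	} else if at.Parent != nil {
		at.Parent.FirstChild = e
	}
	at.Prev = e
	index[newID] = e
}

// Delete unlinks an entry from its sibling list.
func Delete(id int) {
	e := index[id]
	if e.Prev != nil {
		e.Prev.Next = e.Next
	} else if e.Parent != nil {
		e.Parent.FirstChild = e.Next
	}
	if e.Next != nil {
		e.Next.Prev = e.Prev
	}
	delete(index, id)
}

func main() {
	root := &Entry{ID: 1, Text: "todo"}
	index[1] = root
	child := &Entry{ID: 2, Text: "buy milk", Parent: root}
	root.FirstChild = child
	index[2] = child
	InsertBefore(2, 3, "wake up")
	fmt.Println(root.FirstChild.Text) // wake up
}
```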
However, I need to persist the data, and I have no idea how to achieve this. I don't want to save the entire tree when only one element changes. What is the best way? Flat files/SQL/NoSQL/voodoo?
Using a relational database is a viable solution. For your needs (fast insert, update, delete) I'd use an Adjacency List with additional customizations, as such:
id
parent_id
cardinality -- sort order for all nodes with the same parent_id
depth -- distance from the root node
Calculating cardinality and depth is done either in code or (preferably) in a database trigger on any insert, delete or update. In addition, for retrieving an entire hierarchy with one SELECT statement, a hierarchy bridge table is called for:
id
descendent_id
This table would also be populated via the same trigger mentioned above and serves as a means for retrieving all nodes above or beneath a given id.
See this question for additional detail around Adjacency List, Hierarchy Bridge and other approaches for storing hierarchical data in a relational database.
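To make the customizations concrete, here is a hedged in-memory Go sketch of how the trigger logic could derive cardinality and depth on insert (the relational machinery is omitted; the names are made up):

```go
package main

import "fmt"

// Row mirrors the adjacency-list columns from the answer:
// id, parent_id, cardinality (sort order among siblings), depth.
type Row struct {
	ID          int
	ParentID    int // 0 means root
	Cardinality int
	Depth       int
}

var rows = map[int]*Row{}

// Insert plays the role of the insert trigger: it derives depth from
// the parent row and cardinality from the count of existing siblings.
func Insert(id, parentID int) *Row {
	depth := 0
	if p, ok := rows[parentID]; ok {
		depth = p.Depth + 1
	}
	card := 0
	for _, r := range rows {
		if r.ParentID == parentID {
			card++
		}
	}
	r := &Row{ID: id, ParentID: parentID, Cardinality: card, Depth: depth}
	rows[id] = r
	return r
}

func main() {
	Insert(1, 0) // root
	Insert(2, 1) // first child
	Insert(3, 1) // second child
	fmt.Println(rows[3].Depth, rows[3].Cardinality) // 1 1
}
```

A real trigger would do the sibling count with a `SELECT COUNT(*)` against the same table rather than a scan of an in-memory map.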
Finally to provide some additional clarification on the options you listed:
Flat Files: a combination of linked lists and memory mapped files would probably serve, but you're really just rolling your own at that point, where a SQL or NoSQL solution would probably do better.
SQL: this would be my approach - tooling is the best here for data manipulation, backup and recovery.
XML: this is also a possibility with a database, very vendor specific, you'll need to study the syntax for node insert, update and delete. Can be very fast if the database offers an XML data type.
NoSQL: if you're talking about key-value storage, the typical approach for hierarchical data appears to be a materialized path, but this requires recalculating the path for all affected nodes on change, which is likely slow. Instead, consider the Java Content Repository (JCR) API - Apache Jackrabbit is an implementation - the entire API is centered around representing and persisting hierarchical structured data, though it is perhaps too heavyweight for the problem you're trying to solve.
voodoo: um...
Update
If you implement all the pieces from this answer, add is cheap, re-sort is a small cost, and move is expensive. The trade-off is fast hierarchy-traversal reads; for instance, finding a node's complete ancestry is one operation. Specifically, adding a leaf is an O(1) operation. Re-sort means updating cardinality for all peer nodes coming after the moved node. Move means updating (1) cardinality for source and destination peer nodes coming after, (2) depth for the moved node and its descendants, and (3) removal and addition of ancestry rows in the hierarchy bridge table.
However, go with an Adjacency List alone (i.e. id, parent_id) and writes become cheap, reads of one level are cheap, but reads that traverse the hierarchy are expensive. The latter then require recursive SQL such as Oracle's CONNECT BY or Common Table Expressions as found in SQL Server and other RDBMSs.
You store lists (or rather trees) and don't want to rewrite the entire tree once a small piece of it changes. From this I conclude the structures are huge and small changes happen relatively often.
Linked lists are all about pointer chasing, and pointers and what they reference are much like keys and values. You need to efficiently store key-value pairs. Order of items is preserved by the linked list structure.
Suppose you use a typical key-value store, from xDBM or Berkeley DB to any of the modern NoSQL offerings. You could also take a compact SQL engine, e.g. SQLite. They typically use trees to index keys, so it takes O(log N) to access a key, or hash tables that take about as much time or a bit less.
You haven't specified when you persist your data incrementally. If you only do it once in a while (not on every update), you'll need to effectively compare the database to your primary data structure. This will be relatively time-consuming, because you'll need to traverse the entire tree and look up each node ID in the database. This is logarithmic, but with a huge constant because of the necessary I/O. Then you'll want to clean your persistent store of items that are no longer referenced. It may turn out that just dumping the tree as JSON is far more efficient; in fact, that's what many in-memory databases do.
If you update your persistent structure on every update to the main structure, there's no point in having that main structure anyway. It's better to replace it with an in-memory key-value store such as Redis, which already has persistence mechanisms (and some other nice things).

Store hierarchies in a way that is resistant to corruption

I was thinking today about the best way to store a hierarchical set of nodes, e.g.
[tree diagram] (source: www2002.org)
The most obvious way to represent this (to me at least) would be for each node to have nextSibling and childNode pointers, either of which could be null.
This has the following properties:
Requires a small number of changes if you want to add or remove a node somewhere
Is highly susceptible to corruption: if one node is lost, you could potentially lose a large number of other nodes that can only be found through that node's pointers.
Another method you might use is to come up with a system of coordinates, e.g. 1.1, 1.2, 1.2.1, 1.2.2. Here 1.2.3 would be the 3rd node at the 3rd level, with the 2nd node at the prior level as its parent. Unanticipated loss of a node would not affect the ability to resolve any other nodes. However, adding a node somewhere can potentially change the coordinates of a large number of other nodes.
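For illustration, resolving a node's parent under such a coordinate scheme is just dropping the last component; a small Go sketch (assuming dot-separated coordinates):

```go
package main

import (
	"fmt"
	"strings"
)

// ParentCoord returns the coordinate of a node's parent by dropping
// the last component: "1.2.3" -> "1.2". The root ("1") has no parent.
func ParentCoord(coord string) (string, bool) {
	i := strings.LastIndex(coord, ".")
	if i < 0 {
		return "", false
	}
	return coord[:i], true
}

func main() {
	p, _ := ParentCoord("1.2.3")
	fmt.Println(p) // 1.2
}
```

Note that no pointer chasing is involved, which is why losing one node does not hide the rest of the tree.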
What are ways that you could store a hierarchy of nodes that requires few changes to add or delete a node and is resilient to corruption of a few nodes? (not implementation-specific)
When you refer to corruption, are you talking about RAM or some other storage? Perhaps during transmission over some medium?
In any case, when you are dealing with data corruption you are talking about an entire field of computer science that deals with error detection and correction.
When you talk about losing a node, the first thing you have to figure out is 'how do I know I've lost a node?', that's error detection.
As for the problem of protecting data from corruption, pretty much the only way to do this is with redundancy. The amount of redundancy is determined by what limit you want to put on the degree of corruption you would like to be able to recover from. You can't really protect yourself from this with a clever structure design, as you are just as likely to suffer corruption in the critical 'clever' part of your structure :)
The ever-wise wikipedia is a good place to start: Error detection and correction
I was thinking today about the best way to store a hierarchical set of nodes
So you are writing a filesystem? ;-)
The most obvious way to represent this (to me at least) would be for each node to have nextSibling and childNode pointers
Why? The sibling information is present at the parent node, so all you need is a pointer back to the parent. A doubly linked-list, so to speak.
What are ways that you could store a hierarchy of nodes that requires few changes to add or delete a node and is resilient to corruption of a few nodes?
There are actually two different questions involved here.
Is the data corrupt?
How do I fix corrupt data (aka self healing systems)?
Answers to these two questions will determine the exact nature of the solution.
Data Corruption
If your only aim is to know if your data is good or not, store a hash digest of child node information with the parent.
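As a sketch of this idea in Go (a Merkle-tree-style digest; the names are made up): each node's digest covers its own data plus its children's digests, so recomputing and comparing detects corruption anywhere in the subtree:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Node stores a digest over its subtree so corruption below it
// can be detected by recomputing and comparing.
type Node struct {
	Data     string
	Children []*Node
	Digest   [32]byte
}

// Seal recomputes digests bottom-up: each node hashes its own
// data together with the digests of all its children.
func Seal(n *Node) [32]byte {
	h := sha256.New()
	h.Write([]byte(n.Data))
	for _, c := range n.Children {
		d := Seal(c)
		h.Write(d[:])
	}
	copy(n.Digest[:], h.Sum(nil))
	return n.Digest
}

// Verify reports whether the stored digest still matches the subtree.
func Verify(n *Node) bool {
	saved := n.Digest
	defer func() { n.Digest = saved }()
	return Seal(n) == saved
}

func main() {
	root := &Node{Data: "root", Children: []*Node{{Data: "a"}, {Data: "b"}}}
	Seal(root)
	fmt.Println(Verify(root)) // true
	root.Children[0].Data = "corrupted"
	fmt.Println(Verify(root)) // false
}
```

This only answers "is the data good?"; healing still needs redundancy, as discussed below.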
Self Healing Structures
Any self healing structure will need the following information:
Is there a corruption? (See above)
Where is the corruption?
Can it be healed?
Different algorithms exist to fix data with varying degree of effectiveness. The root idea is to introduce redundancy. Reconstruction depends on your degree of redundancy. Since the best guarantees are given by the most redundant systems -- you'll have to choose.
I believe there is some scope of narrowing down your question to a point where we can start discussing individual bits and pieces of the puzzle.
A simple option is to store a reference to the root node in every node - this way it is easy to detect orphan nodes.
Another interesting option is to store the hierarchy information as a descendants (transitive closure) table. So for node 1.2.3 you'd have the following relations:
1., 1.2.3. - root node is ascendant of 1.2.3.
1.2., 1.2.3. - 1.2. node is ascendant of 1.2.3.
1., 1.2. - root node is ascendant of 1.2.
etc...
This table can be more resistant to errors, as it holds some redundant info.
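A small Go sketch of such a descendants table held in memory (the names are made up; a real implementation would be rows in a database):

```go
package main

import "fmt"

// ancestry holds (ascendant, descendant) pairs - the transitive
// closure of the parent relation, including the redundant
// long-range rows that make the table error-resistant.
var ancestry = map[[2]string]bool{}

// AddNode records a node under its whole chain of ancestors; for node
// "1.2.3." with ancestors ["1.", "1.2."] it stores both
// ("1.", "1.2.3.") and ("1.2.", "1.2.3.").
func AddNode(id string, ancestors []string) {
	for _, a := range ancestors {
		ancestry[[2]string{a, id}] = true
	}
}

// Descendants returns every recorded descendant of the given node.
func Descendants(id string) []string {
	var out []string
	for pair := range ancestry {
		if pair[0] == id {
			out = append(out, pair[1])
		}
	}
	return out
}

func main() {
	AddNode("1.2.", []string{"1."})
	AddNode("1.2.3.", []string{"1.", "1.2."})
	fmt.Println(len(Descendants("1."))) // 2
}
```

Losing one pair leaves the others intact, so a node's position can often be reconstructed from the remaining redundant rows.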
Goran
The typical method of storing a hierarchy is to have a ParentNode property/field in every node. For the root, ParentNode is null; for all other nodes it has a value. This does mean the tree may lose entire branches, but in memory that seems unlikely, and in a DB you can guard against it using constraints.
This approach doesn't directly support finding all siblings. If that is a requirement, I'd add another property/field for depth: the root has depth 0, all nodes below the root have depth 1, and so on. All nodes with the same depth are on the same level; those that also share a parent are siblings.
