Easy tree traversal and fast random node access - data-structures

Edited after Alex Taggart's remark below.
I am using a zipper to easily traverse and edit a tree which can grow to many thousands of nodes. Each node is incomplete when it is first created. Data is going to be added/removed all the time in random positions, leaf nodes are going to be replaced by branches, etc.
The tree can be very unbalanced.
Fast random access to a node is also important.
One implementation would be to traverse the tree with a zipper while also maintaining a hash table of the nodes indexed by key. Needless to say, this would be very inefficient, as:
two copies of each node need to be created
any change needs to be consistently mirrored between the two data structures (tree and hash map).
In short, is there a time/space-efficient way to combine the ease of traversing/updating with a zipper and the fast access of a hash table in Clojure?

Clojure's data structures are persistent and use structural sharing. This means that operations like adding, removing, or accumulating are not as inefficient as you describe; the memory cost will be minimal, since you are not duplicating what's already there.
By default, Clojure's data structures are immutable. The nodes in your tree-like structure will thus not update themselves unless you use some sort of reference type (like a Var). I don't know enough about your specific use case to advise on the best way to access nodes. One way to access nodes in a nested structure is the get-in function, where you supply the path to the node to return its value.
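The answer above is Clojure-specific, but the no-duplication point generalizes. As a rough sketch in Java (the language used for the other examples on this page; the Node class and names are hypothetical, not from the question), an index over tree nodes stores references to the very objects the tree already holds, not copies:

import java.util.HashMap;
import java.util.Map;

class SharedNodes {
    static final class Node {
        final String key;
        final Node left, right;
        Node(String key, Node left, Node right) {
            this.key = key; this.left = left; this.right = right;
        }
    }

    public static void main(String[] args) {
        Node b = new Node("b", null, null);
        Node a = new Node("a", b, null);           // a tiny two-node tree
        Map<String, Node> index = new HashMap<>();
        index.put("a", a);                         // the map stores references,
        index.put("b", b);                         // not copies of the nodes
        System.out.println(index.get("b") == a.left);   // true: same object
    }
}

Clojure's persistent maps behave analogously: indexing the tree's nodes in a map adds references, so it does not double the memory footprint.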
Hope this helps solve your problem.

Related

Benefits of Hybrid Data Structures on Efficiency

I have this homework assignment in my Computer Science class that involves combining different data structures for apparently increased efficiency.
TL;DR --- Scroll Down
""""Build a data structure which behaves like a linked list with a binary tree as an indexing structure. It should be able to be used as a linked list and inherited from to construct indexed queues and indexed stacks. You may assume that all things that will be put into this data structure are Comparable, so that the indexing tree will function as a binary search tree. You should build a class of iterators to facilitate interaction with this data structure. Insertion into the list can be done 'after' a location specified by a list iterator (which could sometimes be returned by a find method). Naturally, in an inherited indexed queue, insertion will only be at the back of the queue, however the indexing via the tree will need to preserve the binary search tree ordering, and similarly for an inherited indexed stack. You should have methods to insert and delete, and methods to find (returning an iterator) and sort (any sorting technique will suffice for this question, though you might well want to take advantage of the inherent ordering information derived from the tree!!). Test this structure using a main method which plays with people (perhaps compared via height?).""""
TL;DR --- What are the benefits of having Binary Search Tree nodes containing the same Objects as doubly linked list nodes?
Also, how would inheritance work with such a list?
What are the benefits of having Binary Search Tree nodes containing the same Objects as doubly linked list nodes?
Perhaps a better way of asking the same question would be "what are the benefits of connecting the nodes of a Binary Search Tree (BST) with additional links to construct a linked list out of the same nodes?"
The benefit of adding the extra links is the ability to iterate over the entire tree using O(1) memory. Without them you would need O(log N) memory to iterate the tree, because you would have to keep your position at each level.
The "payment" for this is an additional O(N) memory for the links, and a somewhat more complex algorithm for maintaining the data structure. This may be a fair deal when you iterate the same tree a significant number of times, while insertions and modifications are generally rare.
How would inheritance work with such a list?
Rather than inheriting from a list and also from a tree, you would implement interfaces for the list and for the tree.

How to calculate that a B+ tree is O(log(n)) for lookups

I'm studying B+ trees for indexing, and I'm trying to understand more than just memorize the structure. As far as I understand, the inner nodes of a B+ tree form an index on the leaves, and the leaves contain pointers to where the data is stored on disk. Correct? Then how are lookups made? If a B+ tree is so much better than a binary tree, why don't we use B+ trees instead of binary trees everywhere?
I read the Wikipedia article on B+ trees and I understand the structure, but not how an actual lookup is performed. Could you perhaps guide me with some link to reading material?
What are some other uses of B+ trees besides database indexing?
I'm studying B+ trees for indexing, and I'm trying to understand more than just memorize the structure. As far as I understand, the inner nodes of a B+ tree form an index on the leaves, and the leaves contain pointers to where the data is stored on disk. Correct?
No: the index is formed by the inner (non-leaf) nodes. Depending on the implementation, the leaves may contain either key/value pairs or key/pointer-to-value pairs. A database index, for example, uses the latter, unless it is an IOT (Index-Organized Table), in which case the values are inlined in the leaves. This depends mainly on whether the value is insanely large relative to the key.
Then how are lookups made?
In the general case, where the root node is not a leaf (at first, when the tree is small, it is one), the root node contains a sorted array of N keys and N+1 pointers. You binary search for the two adjacent keys S0 and S1 such that S0 <= K < S1 (where K is what you are looking for), and this gives you the pointer to the next node.
You repeat the process until you (finally) hit a leaf node, which contains a sorted list of key-value pairs, and make a last binary-search pass over those.
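Here is a minimal sketch of that descent in Java (hypothetical node classes and field names; a real implementation also handles splits, merges, and disk pages):

import java.util.Arrays;

class BPlusLookup {
    abstract static class Node { long[] keys; }
    static class Inner extends Node { Node[] children; }  // keys.length + 1 children
    static class Leaf  extends Node { String[] values; }  // values[i] goes with keys[i]

    static String lookup(Node node, long k) {
        while (node instanceof Inner) {
            Inner inner = (Inner) node;
            int i = Arrays.binarySearch(inner.keys, k);
            // Found at i: keys >= keys[i] live in child i + 1.
            // Not found: binarySearch returns -(insertionPoint) - 1.
            i = (i >= 0) ? i + 1 : -(i + 1);
            node = inner.children[i];
        }
        Leaf leaf = (Leaf) node;
        int i = Arrays.binarySearch(leaf.keys, k);
        return (i >= 0) ? leaf.values[i] : null;   // null means: key not present
    }
}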
If a B+tree is so much better than a binary tree, why don't we use B+trees instead of binary trees everywhere?
Binary trees are simpler to implement. One tough cookie with B+ trees is sizing the number of keys/pointers in inner nodes and the number of key/value pairs in leaf nodes. Another tough cookie is deciding on the low and high watermarks that trigger merging two nodes or splitting one.
Binary trees also offer memory stability: an inserted element is never moved in memory. On the other hand, inserting an element into a B+ tree, or removing one, is likely to shuffle elements around.
B+ trees are tailored for the small-keys/large-values case. They also require that keys can be duplicated (hopefully cheaply).
Could you guide me perhaps with some link to reading material?
I hope the rough algorithm I explained helped out, otherwise feel free to ask in the comments.
What are some other uses of B+ trees besides database indexing?
In the same vein: file-system indexing also benefits.
The idea is always the same: a B+ tree is really great with small keys/large values and caching. The idea is to keep all the keys (the inner nodes) in your fast memory (CPU cache >> RAM >> disk), and the B+ tree achieves that for large collections by pushing the values down to the leaves so the inner nodes stay small. With all inner nodes in fast memory, you only pay one slow memory access per search (to fetch the value).
B+ trees are better than binary trees for this purpose; it's why every DBMS uses them.
A lookup in a B+ tree costs O(log_F N), where the base F of the logarithm is the fan-out. The lookup is performed exactly like in a binary tree, but with a bigger fan-out and therefore a lower height; that's why it is way better.
B+ trees are usually known for keeping the data in the leaves (though probably not if the index is unclustered), which means you don't have to make another jump to disk to get the data; you take it straight from the leaf.
B+ trees are used almost everywhere: operating systems use them, data warehouses (not so much here, but still), and lots of applications.
B+ trees are perfect for range queries, and are used whenever you have unique values, like a primary key, or any field with low cardinality.
If you can get this book, http://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638, it's one of the best. It's basically the bible for any database guy.
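As a rough worked example of the O(log_F N) claim (my numbers, not the answer's): with a fan-out of F = 100 and N = 10^8 keys, the height is about log_100(10^8) = 4, so a lookup touches about 4 nodes. A balanced binary tree over the same keys is about log_2(10^8) ≈ 27 levels deep, i.e. roughly 27 slow accesses instead of 4.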

What are the advantages of storing all elements in the leaf nodes?

I'm reading Advanced Data Structures by Peter Brass.
In the beginning of the chapter on search trees, he states that there are two models of search trees: one where nodes contain the actual objects (the values, if the tree is used as a dictionary), and another where all objects are stored in leaves and internal nodes are only for comparisons.
What are the advantages of the second model over the first one?
One of the big advantages of a binary tree where data is only in the leaf nodes is that you can partition based on elements that are not in your dataset.
For example, if I have a possible dataset of 0-1 million, but the vast majority of items are at either the high end or the low end and not in the middle, I may still want my first comparison to be against 500,000 - even though that number is not in my data set. If every node held data, I could not do this. While not normally needed in theory, I've run into many cases where partitioning on a value outside my data simplified the implementation.
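A minimal sketch of this second model (hypothetical Java, mine rather than the book's): internal nodes carry only a pivot, which need not be a value present in the data, and the items live solely in the leaves:

class LeafOnlyTree {
    abstract static class Tree {}
    static class Split extends Tree {   // internal node: comparison only
        long pivot;                     // need not occur in the dataset
        Tree lo, hi;                    // keys < pivot go lo, the rest hi
    }
    static class Bucket extends Tree {  // leaf: holds the actual items
        long[] items;
    }

    static Bucket find(Tree t, long key) {
        while (t instanceof Split) {
            Split s = (Split) t;
            t = (key < s.pivot) ? s.lo : s.hi;
        }
        return (Bucket) t;
    }
}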
B+ trees are an example of a case where all keys/values are stored in leaf nodes. The primary advantage here is that, since all items are in the leaf nodes, the leaf nodes can be linked together to form a linked list, which allows rapid in-order traversal. If you access a particular element, you can always find the next element in the sequence without visiting any parents, because the leaf nodes are linked together. Filesystems and database storage systems take advantage of this structure for range searches and similar operations.
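As a rough sketch of such a range search (hypothetical Java types; a real B+ tree would first descend from the root to the starting leaf):

class LeafChain {
    static class Leaf {
        long[] keys;        // sorted within the leaf
        String[] values;
        Leaf next;          // the next leaf in key order
    }

    // Print all entries with lo <= key <= hi by walking the leaf chain;
    // no parent node is ever revisited.
    static void rangeScan(Leaf first, long lo, long hi) {
        for (Leaf leaf = first; leaf != null; leaf = leaf.next) {
            for (int i = 0; i < leaf.keys.length; i++) {
                if (leaf.keys[i] < lo) continue;
                if (leaf.keys[i] > hi) return;    // keys are sorted: done
                System.out.println(leaf.keys[i] + " -> " + leaf.values[i]);
            }
        }
    }
}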
Let's say you are building a tree over some objects based on some complex criterion, for example one calculated from multiple properties. Sometimes you can't change the object to store the calculated value, and calculating the criterion is expensive. So you calculate it only once, and store the objects in leaves based on the result. Then, when your tree is complete, you can find the required object much faster because you don't have to recalculate the criterion at each tree node along your path.
Storing information objects in the nodes, in which case we are talking about a trie, is useful for fast retrieval of information: faster than storing things in an array or hash table, where worst-case access is O(n), whereas in a trie it is O(m), where m is the length of the key.
Look here:
https://en.wikipedia.org/wiki/Trie
In a search tree these operations can be much more complicated (see AVL trees, O(log n)), and so can be slower and more complex to implement.
Which data structure to choose depends on what you want to do.

Why No Cycles in Eric Lippert's Immutable Binary Tree?

I was just looking at Eric Lippert's simple implementation of an immutable binary tree, and I have a question about it. After showing the implementation, Eric states that
Note that another nice feature of immutable data structures is that it is impossible to accidentally (or deliberately!) create a tree which contains a cycle.
It seems that this feature of Eric's implementation does not come from the immutability alone, but also from the fact that the tree is built up from the leaves. This naturally prevents a node from having any of its ancestors as children. It seems that if you built the tree in the other direction, you'd introduce the possibility of cycles.
Am I right in my thinking, or does the impossibility of cycles in this case come from the immutability alone? Considering the source, I wonder whether I'm missing something.
EDIT: After thinking it over a bit more, it seems that building up from the leaves might be the only way to create an immutable tree. Am I right?
If you're using an immutable data structure in a strict (as opposed to lazy) language, it's impossible to create a cycle: you must create the elements in some order, and once an element is created, you cannot mutate it to point at an element created later. So if you created node n, and then created node m which points at n (perhaps indirectly), you can never complete the cycle by making n point at m, as you are not allowed to mutate n, nor anything that n already points to.
Yes, you are correct that you can only ever create an immutable tree by building up from the leaves; if you started from the root, you would have to modify the root to point at its children as you create them. Only by starting from the leaves, and creating each node to point to its children, can you construct a tree from immutable nodes.
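A minimal sketch of this in Java (my names; Eric Lippert's implementation is C#): with final child fields, a node can only ever reference nodes that existed before it, so no order of construction can produce a cycle:

class ImmutableTree {
    static final class Tree {
        final int value;
        final Tree left, right;   // set once at construction, never reassigned
        Tree(int value, Tree left, Tree right) {
            this.value = value; this.left = left; this.right = right;
        }
    }

    // Built leaves-first: each node references only pre-existing nodes,
    // so no node can ever (directly or indirectly) reach itself.
    static Tree example() {
        Tree leaf1 = new Tree(1, null, null);
        Tree leaf3 = new Tree(3, null, null);
        return new Tree(2, leaf1, leaf3);
    }
}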
If you really want to try hard at it you could create a tree with cycles in it that is immutable. For example, you could define an immutable graph class and then say:
Graph g = Graph.Empty
.AddNode("A")
.AddNode("B")
.AddNode("C")
.AddEdge("A", "B")
.AddEdge("B", "C")
.AddEdge("C", "A");
And hey, you've got a "tree" with "cycles" in it - because of course you haven't got a tree in the first place, you've got a directed graph.
But with a data type that actually uses a traditional "left and right sub trees" implementation of a binary tree then there is no way to make a cyclic tree (modulo of course sneaky tricks like using reflection or unsafe code.)
When you say "built up from the leaves", I guess you're including the fact that the constructor takes children but never takes a parent.
It seems that if you built the tree in the other direction, you'd introduce the possibility of cycles.
No, because then you'd have the opposite constraint: the constructor would have to take a parent but never a child. Therefore you can never create a descendant until all its ancestors are created. Therefore no cycles are possible.
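A minimal sketch of that opposite constraint (hypothetical Java, not from the original post): the constructor takes only a parent, so every ancestor must exist before any of its descendants:

class UpwardTree {
    static final class Node {
        final String label;
        final Node parent;    // null only for the root
        Node(String label, Node parent) {
            this.label = label; this.parent = parent;
        }
    }

    // The root is created first and every ancestor before its descendants,
    // so following parent links always terminates; a cycle is impossible.
    static Node example() {
        Node root  = new Node("root", null);
        Node child = new Node("child", root);
        return new Node("grandchild", child);
    }
}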
After thinking it over a bit more, it seems that building up from the leaves might be the only way to create an immutable tree. Am I right?
No... see my comments to Brian and ergosys.
For many applications, a tree whose child nodes point to their parents is not very useful. I grant that. If you need to traverse the tree in an order determined by its hierarchy, an upward-pointing tree makes that hard.
However for other applications, that sort of tree is exactly the sort we want. For example, we have a database of articles. Each article can have one or more translations. Each translation can have translations. We create this data structure as a relational database table, where each record has a "foreign key" (pointer) to its parent. None of these records need ever change its pointer to its parent. When a new article or translation is added, the record is created with a pointer to the appropriate parent.
A common use case is to query the table of translations, looking for translations for a particular article, or translations in a particular language. Ah, you say, the table of translations is a mutable data structure.
Sure it is. But it's separate from the tree. We use the (immutable) tree to record the hierarchical relationships, and the mutable table for iteration over the items. In a non-database situation, you could have a hash table pointing to the nodes of the tree. Either way, the tree itself (i.e. the nodes) never get modified.
Here's another example of this data structure, including how to usefully access the nodes.
My point is that the answer to the second half of the OP's question is "yes": I agree with the rest of you that the prevention of cycles does come from immutability alone. While you can build a tree in the other direction (top-down), if you do, and it's immutable, it still cannot have cycles.
When you're talking about powerful theoretical guarantees like "another nice feature of immutable data structures is that it is impossible to accidentally (or deliberately!) create a tree which contains a cycle" [emphasis in original],
"such a tree wouldn't be very useful" pales in comparison -- even if it were true.
People create un-useful data structures by accident all the time, let alone creating supposedly-useless ones on purpose. The putative uselessness doesn't protect the program from the pitfalls of cycles in your data structures. A theoretical guarantee does (assuming you really meet the criteria it states).
P.S. one nice feature of upward-pointing trees is that you can guarantee one aspect of the definition of trees that downward-pointing tree data structures (like Eric Lippert's) don't: that every node has at most one parent. (See David's comment and my response.)
You can't build it from the root; that would require you to mutate nodes you have already added.

When to choose RB tree, B-Tree or AVL tree?

As a programmer, when should I consider using an RB tree, a B-tree, or an AVL tree?
What are the key points that need to be considered before deciding on the choice?
Can someone please explain with a scenario for each tree structure why it is chosen over others with reference to the key points?
Take this with a pinch of salt:
B-tree when you're managing more than thousands of items and you're paging them from a disk or some slow storage medium.
RB tree when you're doing fairly frequent inserts, deletes and retrievals on the tree.
AVL tree when your inserts and deletes are infrequent relative to your retrievals.
I think B+ trees are a good general-purpose ordered container data structure, even in main memory. Even when virtual memory isn't an issue, cache-friendliness often is, and B+ trees are particularly good for sequential access - the same asymptotic performance as a linked list, but with cache-friendliness close to a simple array. All this and O(log n) search, insert and delete.
B+ trees do have problems, though - such as the items moving around within nodes when you do inserts/deletes, invalidating pointers to those items. I have a container library that does "cursor maintenance" - cursors attach themselves to the leaf node they currently reference in a linked list, so they can be fixed or invalidated automatically. Since there's rarely more than one or two cursors, it works well - but it's an extra bit of work all the same.
Another thing is that the B+ tree is essentially just that. I guess you can strip off or recreate the non-leaf nodes depending on whether you need them or not, but with binary tree nodes you get a lot more flexibility. A binary tree can be converted to a linked list and back without copying nodes - you just change the pointers then remember that you're treating it as a different data structure now. Among other things, this means you get fairly easy O(n) merging of trees - convert both trees to lists, merge them, then convert back to a tree.
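As a rough sketch of one direction of that conversion (hypothetical Java; the author's library is C++, but the pointer rewiring is the same idea), an in-order walk can rewire left/right into prev/next in place, copying no nodes:

import java.util.ArrayDeque;
import java.util.Deque;

class TreeToList {
    static class Node {
        int value;
        Node left, right;   // as a tree: children; as a list: prev/next
    }

    // Iterative in-order traversal that relinks nodes into a sorted
    // doubly linked list and returns its head.
    static Node toList(Node root) {
        Node head = null, tail = null;
        Deque<Node> stack = new ArrayDeque<>();
        Node cur = root;
        while (cur != null || !stack.isEmpty()) {
            while (cur != null) { stack.push(cur); cur = cur.left; }
            cur = stack.pop();
            Node right = cur.right;            // save before rewiring
            if (tail == null) head = cur;
            else tail.right = cur;
            cur.left = tail;                   // left now means "previous"
            tail = cur;
            cur = right;
        }
        if (tail != null) tail.right = null;   // terminate the list
        return head;
    }
}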
Yet another thing is memory allocation and freeing. In a binary tree, this can be separated out from the algorithms - the user can create a node then call the insert algorithm, and deletes can extract nodes (detach them from the tree, but don't free the memory). In a B-tree or B+-tree, that obviously doesn't work - the data will live in a multi-item node. Writing insert methods that "plan" the operation without modifying nodes until they know how many new nodes are needed, and that they can be allocated, is a challenge.
Red black vs. AVL? I'm not sure it makes any big difference. My own library has a policy-based "tool" class to manipulate nodes, with methods for double-linked lists, simple binary trees, splay trees, red-black trees and treaps, including various conversions. Some of those methods were only implemented because I was bored at one time or another. I'm not sure I've even tested the treap methods. The reason I chose red-black trees rather than AVL is because I personally understand the algorithms better - which doesn't mean they're simpler, it's just a fluke of history that I'm more familiar with them.
One last thing - I only originally developed my B+ tree containers as an experiment. It's one of those experiments that never ended really, but it's not something I'd encourage others to repeat. If all you need is an ordered container, the best answer is to use the one that your existing library provides - e.g. std::map etc in C++. My library evolved over years, it took quite a while to get it stable, and I just relatively recently discovered it's technically non-portable (dependent on a bit of undefined behaviour WRT offsetof).
In memory, a B-tree has the advantage when the number of items is more than 32000... Look at speedtest.pdf from stx-btree.
When choosing data structures you are trading off factors such as
speed of retrieval v speed of update
how well the structure copes with worst case operations, for example insertion of records that arrive in a sorted order
space wasted
I would start by reading the Wikipedia articles referenced by Robert Harvey.
Pragmatically, when working in languages such as Java, the average programmer tends to use the collection classes provided. If, in a performance-tuning activity, one discovers that the collection performance is problematic, then one can seek alternative implementations. It's rarely the first thing a business-led development has to consider. It's extremely rare that one needs to implement such data structures by hand; there are usually libraries that can be used.