Data structure: a graph that's similar to a tree - but not a tree - data-structures

I have implemented a data structure in C, based upon a series of linked lists, that appears to be similar to a tree - but not enough to be referred as such, because in theory it allows the existence of cycles. Here's a basic outline of the nodes:
There is a single, identifiable root that doesn't have a parent node or brothers;
Each node contains a pointer to its "father", its nearest "brother" and the first of his "children";
There are "outer" nodes without children and brothers.
How can I name such a data structure? It cannot be a tree, because even if the pointers are clearly labelled and used differently, cycles like father->child->brother->father may very well exist. My question is: terms such as "father", "children" and "brother" can be used in the context of a graph or they are only reserved for trees? After quite a bit of research I'm still unable to clarify this matter.
Thanks in advance!

I'd say you can still call it a tree, because the foundation is a tree data structure. There is precedence for my claim: "Patricia Tries" are referred to as trees even though their leaf nodes may point up the tree (creating cycles). I'm sure there are other examples as well.
It sounds like the additional links you have are essentially just for convenience, and could be determined implicitly (rather than stored explicitly). By storing them explicitly you impose additional invariants on your tree operations (insert, delete, etc), but do not affect the underlying organization, which is a tree.
Precisely because you are naming and treating those additional links separately, they can be thought of as an "overlay" on top of your tree.
In the end it doesn't really matter what you call it or what category it falls into if it works for you. Reading a book like Sedgewick's "Algorithms in C" you realize that there are tons of data structures out there, and nothing wrong with inventing your own!
One more point: Trees are a special case of graphs, so there is nothing wrong with referring to it as a graph (or using graph algorithms on it) as well.

Related

Decision Tree clarification

I just want to ask/clarify if decision trees are essentially binary trees where each node is a boolean, and it continues down until a desired result is reached?
Not necessarily. Some nodes may share children which is not the case in Binary trees. However, the essence of the decision tree is what you mentioned.
It's a tree where based on the probability of an outcome you move down the graph until you hit an outcome.
See Wikipedia's page on desicion trees for more info.
As mentioned by Ares, not all decision trees are binary (they can be "n-ary") although most implementation I have seen are binary trees.
For instance if you have a color variable (i.e. categorical) that can take three values : red, blue or green; you might want to split in three directly at a node instead of splitting in two and then again in two (or more).
The choice between binary and "n-ary" will usually depends on your data. I suspect that most people use binary trees anyway because it is relatively easier to implement and more flexible.
Then as you said the tree is developed until the desired outcome is reached. Decision Tree suffers major drawbacks such as overfitting and there exist many different ways to tackle this issue (pruning, boosting, etc.) but this is beyond the scope of the question/answer.
I recommend to have a look at this great visualization that explains well the decision tree : http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Will be happy to give more details about decision tree

How do I balance a BK-Tree and is it necessary?

I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.
I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.
If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?
If it is possibly or likely for me to have a balance problem with BK-Trees, is there any way to balance such a tree after it has been constructed?
What would the algorithm look like to properly balance a BK-tree?
My thinking so far:
It seems that child nodes are distinct on distance, so I can't simply rotate a given node in the tree without re-calibrating the entire tree under it. However, if I can find an optimal new root node this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node though.
I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.
Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly and might be probabilistically guaranteed to be sub-optimal).
Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).
FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.
There is a lisp example in the article: http://cliki.net/bk-tree. About unbalancing the tree I think the data structure and the method seems to be complicated enough and also the author didn't say anything about unbalanced tree. When you experience unbalanced tree maybe it's not for you?

Why No Cycles in Eric Lippert's Immutable Binary Tree?

I was just looking at Eric Lippert's simple implementation of an immutable binary tree, and I have a question about it. After showing the implementation, Eric states that
Note that another nice feature of
immutable data structures is that it
is impossible to accidentally (or
deliberately!) create a tree which
contains a cycle.
It seems that this feature of Eric's implementation does not come from the immutability alone, but also from the fact that the tree is built up from the leaves. This naturally prevents a node from having any of its ancestors as children. It seems that if you built the tree in the other direction, you'd introduce the possibility of cycles.
Am I right in my thinking, or does the impossibility of cycles in this case come from the immutability alone? Considering the source, I wonder whether I'm missing something.
EDIT: After thinking it over a bit more, it seems that building up from the leaves might be the only way to create an immutable tree. Am I right?
If you're using an immutable data structure, in a strict (as opposed to lazy) language, it's impossible to create a cycle; as you must create the elements in some order, and once an element is created, you cannot mutate it to point at an element created later. So if you created node n, and then created node m which pointed at n (perhaps indirectly), you could never complete the cycle by causing n to point at m as you are not allowed to mutate n, nor anything that n already points to.
Yes, you are correct that you can only ever create an immutable tree by building up from the leaves; if you started from the root, you would have to modify the root to point at its children as you create them. Only by starting from the leaves, and creating each node to point to its children, can you construct a tree from immutable nodes.
If you really want to try hard at it you could create a tree with cycles in it that is immutable. For example, you could define an immutable graph class and then say:
Graph g = Graph.Empty
.AddNode("A")
.AddNode("B")
.AddNode("C")
.AddEdge("A", "B")
.AddEdge("B", "C")
.AddEdge("C", "A");
And hey, you've got a "tree" with "cycles" in it - because of course you haven't got a tree in the first place, you've got a directed graph.
But with a data type that actually uses a traditional "left and right sub trees" implementation of a binary tree then there is no way to make a cyclic tree (modulo of course sneaky tricks like using reflection or unsafe code.)
When you say "built up from the leaves", I guess you're including the fact that the constructor takes children but never takes a parent.
It seems that if you built the tree in
the other direction, you'd introduce
the possibility of cycles.
No, because then you'd have the opposite constraint: the constructor would have to take a parent but never a child. Therefore you can never create a descendant until all its ancestors are created. Therefore no cycles are possible.
After thinking it over a bit more, it
seems that building up from the leaves
might be the only way to create an
immutable tree. Am I right?
No... see my comments to Brian and ergosys.
For many applications, a tree whose child nodes point to their parents is not very useful. I grant that. If you need to traverse the tree in an order determined by its hierarchy, an upward-pointing tree makes that hard.
However for other applications, that sort of tree is exactly the sort we want. For example, we have a database of articles. Each article can have one or more translations. Each translation can have translations. We create this data structure as a relational database table, where each record has a "foreign key" (pointer) to its parent. None of these records need ever change its pointer to its parent. When a new article or translation is added, the record is created with a pointer to the appropriate parent.
A common use case is to query the table of translations, looking for translations for a particular article, or translations in a particular language. Ah, you say, the table of translations is a mutable data structure.
Sure it is. But it's separate from the tree. We use the (immutable) tree to record the hierarchical relationships, and the mutable table for iteration over the items. In a non-database situation, you could have a hash table pointing to the nodes of the tree. Either way, the tree itself (i.e. the nodes) never get modified.
Here's another example of this data structure, including how to usefully access the nodes.
My point is that the answer to the OP's question is "yes", I agree with the rest of you, that the prevention of cycles does come from immutability alone. While you can build a tree in the other direction (top-down), if you do, and it's immutable, it still cannot have cycles.
When you're talking about powerful theoretical guarantees like
another nice feature of immutable data structures is that
it is impossible to accidentally (or
deliberately!) create a tree which
contains a cycle [emphasis in original]
"such a tree wouldn't be very useful" pales in comparison -- even if it were true.
People create un-useful data structures by accident all the time, let alone creating supposedly-useless ones on purpose. The putative uselessness doesn't protect the program from the pitfalls of cycles in your data structures. A theoretical guarantee does (assuming you really meet the criteria it states).
P.S. one nice feature of upward-pointing trees is that you can guarantee one aspect of the definition of trees that downward-pointing tree data structures (like Eric Lippert's) don't: that every node has at most one parent. (See David's comment and my response.)
You can't build it from the root, it requires you to mutate nodes you already added.

When to choose RB tree, B-Tree or AVL tree?

As a programmer when should I consider using a RB tree, B- tree or an AVL tree?
What are the key points that needs to be considered before deciding on the choice?
Can someone please explain with a scenario for each tree structure why it is chosen over others with reference to the key points?
Take this with a pinch of salt:
B-tree when you're managing more than thousands of items and you're paging them from a disk or some slow storage medium.
RB tree when you're doing fairly frequent inserts, deletes and retrievals on the tree.
AVL tree when your inserts and deletes are infrequent relative to your retrievals.
I think B+ trees are a good general-purpose ordered container data structure, even in main memory. Even when virtual memory isn't an issue, cache-friendliness often is, and B+ trees are particularly good for sequential access - the same asymptotic performance as a linked list, but with cache-friendliness close to a simple array. All this and O(log n) search, insert and delete.
B+ trees do have problems, though - such as the items moving around within nodes when you do inserts/deletes, invalidating pointers to those items. I have a container library that does "cursor maintenance" - cursors attach themselves to the leaf node they currently reference in a linked list, so they can be fixed or invalidated automatically. Since there's rarely more than one or two cursors, it works well - but it's an extra bit of work all the same.
Another thing is that the B+ tree is essentially just that. I guess you can strip off or recreate the non-leaf nodes depending on whether you need them or not, but with binary tree nodes you get a lot more flexibility. A binary tree can be converted to a linked list and back without copying nodes - you just change the pointers then remember that you're treating it as a different data structure now. Among other things, this means you get fairly easy O(n) merging of trees - convert both trees to lists, merge them, then convert back to a tree.
Yet another thing is memory allocation and freeing. In a binary tree, this can be separated out from the algorithms - the user can create a node then call the insert algorithm, and deletes can extract nodes (detach them from the tree, but dont free the memory). In a B-tree or B+-tree, that obviously doesn't work - the data will live in a multi-item node. Writing insert methods that "plan" the operation without modifying nodes until they know how many new nodes are needed and that they can be allocated is a challenge.
Red black vs. AVL? I'm not sure it makes any big difference. My own library has a policy-based "tool" class to manipulate nodes, with methods for double-linked lists, simple binary trees, splay trees, red-black trees and treaps, including various conversions. Some of those methods were only implemented because I was bored at one time or another. I'm not sure I've even tested the treap methods. The reason I chose red-black trees rather than AVL is because I personally understand the algorithms better - which doesn't mean they're simpler, it's just a fluke of history that I'm more familiar with them.
One last thing - I only originally developed my B+ tree containers as an experiment. It's one of those experiments that never ended really, but it's not something I'd encourage others to repeat. If all you need is an ordered container, the best answer is to use the one that your existing library provides - e.g. std::map etc in C++. My library evolved over years, it took quite a while to get it stable, and I just relatively recently discovered it's technically non-portable (dependent on a bit of undefined behaviour WRT offsetof).
In memory B-Tree has the advantage when the number of items is more than 32000... Look at speedtest.pdf from stx-btree.
When choosing data structures you are trading off factors such as
speed of retrieval v speed of update
how well the structure copes with worst case operations, for example insertion of records that arrive in a sorted order
space wasted
I would start by reading the Wikipedia articles referenced by Robert Harvey.
Pragmatically, when working in languages such as Java the average programmer tends to use the collection classes provided. If in a performance tuning activity one discovers that the collection performance is problematic then one can seek alternative implementations. It's rarely the first thing a business-led development has to consider. It's extremely rare that one needs to implement such data structures by hand, there are usually libraries that can be used.

What's a good data structure for building equivalence classes on nodes of a tree?

I'm looking for a good data structure to build equivalence classes on nodes of a tree. In an ideal structure, the following operations should be fast (O(1)/O(n) as appropriate) and easy (no paragraphs of mystery code):
(A) Walk the tree from the root; on each node --> child transition enumerate all the equivalent versions of the child node
(B) Merge two equivalence classes
(C) Create new nodes from a list of existing nodes (the children) and other data
(D) Find any nodes structurally equivalent to node (i.e. they have the same number of children, corresponding children belong to the same equivalence class, and their "other data" is equal) so that new (or newly modified) nodes may be put in the right equivalence class (via a merge)
So far I've considered (some of these could be used in combination):
A parfait, where the children are references to collections of nodes instead of to nodes. (A) is fast, (B) requires walking the tree and updating nodes to point to the merged collection, (C) requires finding the collection containing each child of the new node, (D) requires walking the tree
Maintaining a hash of nodes by their characteristics. This makes (D) much faster but (B) slower (since the hash would have to be updated when equivalence classes were merged)
String the nodes together into a circular linked list. (A) is fast, (B) would be fast but for the fact that that "merging" part of a circular list with itself actually splits the list (C) would be fast, (D) would require walking the tree
Like above, but with an additional "up" pointer in each node, which could be used to find a canonical member of the circular list.
Am I missing a sweet alternative?
You seem to have two forms of equivalence to deal with. Plain equivalence (A), tracked as equivalence classes which are kept up to date and structural equivalence (D), for which you occasionally go build a single equivalence class and then throw it away.
It sounds to me like the problem would be conceptually simpler if you maintain equivalence classes for both plain and structural equivalence. If that introduces too much churn for the structural equivalence, you could maintain equivalence classes for some aspects of structural equivalence. Then you could find a balance where you can afford the maintenance of those equivalence classes but still greatly reduce the number of nodes to examine when building a list of structurally equivalent nodes.
I don't think any one structure is going to solve your problems, but you might take a look at the Disjoint-set data structure. An equivalence class, after all, is the same thing as a partitioning of a set. It should be able to handle some of those operations speedily.
Stepping back for a moment I'd suggest against using a tree at all. Last time I had to confront a similar problem, I began with a tree, but later moved onto an array.
Reasons being multiple but number one reason was performance, my classes with up to 100 or so children would actually perform better while manipulating them as array than through the the nodes of a tree, mostly because of hardware locality, and CPU prefetch logic, and CPU pipelining.
So although algorithmically an array structure requires a larger N of operations than a tree, performing these dozens of operations is likely faster than chasing pointers across memory.

Resources