Why do we always want a shallow binary tree? In what cases is a shallow (minimum-depth) binary tree better than a non-shallow one?
I am just confused, as my prof keeps saying we want to aim for the shallowest possible binary tree, but I do not understand why. I guess smaller is better, but is there any specific, concrete reason? Sorry for my bad English, and thanks for your help.
I'm assuming this is in regards to binary search trees - if not, please let me know and I can update this answer.
In a binary search tree, the cost of almost every operation (insertion, deletion, lookup, successor, predecessor, min, max, range search, split, join, etc.) depends on the height of the binary search tree. The reason for this is that these operations work by walking down the tree from the root until they either fall off the tree or find what they're looking for. The deeper the tree, the longer this can take if you get bad inputs.
By shuffling nodes around to keep the tree height low, we can make it so that these operations are, in general, very fast. A tree with height h can have at most 2^h - 1 nodes in it, which is a huge number compared with h (figure that if h = 20, 2^h - 1 is over a million!), so if you make an effort to pack the nodes into the tree higher up and closer to the root, you'll get better operation speeds all around.
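To put numbers on that, here is a small sketch in Python. The function names max_nodes and min_height are made up for illustration, and "height" is counted in levels, so a single node has height 1.

```python
def max_nodes(height):
    """Largest number of nodes a binary tree with `height` levels can hold."""
    return 2 ** height - 1


def min_height(n):
    """Smallest number of levels needed to hold n nodes.

    This is the smallest h with 2**h - 1 >= n, which for a
    non-negative integer n is exactly n.bit_length().
    """
    return n.bit_length()


# Best case (packed tree) versus worst case (a chain of n nodes).
for n in (7, 1_000, 1_000_000):
    print(n, "nodes: best height", min_height(n), "worst height", n)
```

A lookup in the packed tree touches at most min_height(n) nodes, while a lookup in the chain can touch all n of them.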
There are some cases where it's actually beneficial to have trees that are as imbalanced as possible. For example, if you have a binary search tree and know in advance that some elements will be looked up more than others, you may want to shuffle the nodes around in the tree to put the high-frequency items higher up and the low-frequency items deeper in the tree. In non-binary-search-tree contexts, the randomized meldable priority queue works by randomly walking down a tree doing merges, and the less balanced the tree is the more likely it is for these operations to end early by falling off the tree.
I have been studying Splay trees because I would like to implement one. Currently, I have some "autodidactic" experience with Red-Black trees, AVL trees, Skip lists and other simpler data structures. I want to implement my first splay tree, but I want a recursive implementation for it, if possible (I love recursion).
However, I think it's difficult because you have to see two levels down the tree to observe all the possible cases (zig-zag, zig-zig, zig), and there is no way to mark the target without another field. Should I use another field, like in red-black trees, to mark the visited nodes and splay the target node?
It's easy enough to use a recursive algorithm, and it might work out looking fairly clean. No marking is necessary. Remember that the splay operation (which is used for find, insert and delete) brings the target node to the top of the tree; in other words, it returns the (splayed) tree with the target node at the top.
In essence, you need to decide from a given node what the next two moves will be (left-left, right-right, or anything else.) The rotation happens when you go the same direction twice.
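For what it's worth, here is a sketch of one common recursive formulation in Python. It is only an illustration, not the one true way to do it: the Node class and the names splay, rotate_left, rotate_right, bst_insert and inorder are all made up for this example. Each call looks two levels down, recurses on the grandchild, and then rotates on the way back up, so no marker field is needed.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def rotate_right(h):
    x = h.left
    h.left, x.right = x.right, h
    return x


def rotate_left(h):
    x = h.right
    h.right, x.left = x.left, h
    return x


def splay(h, key):
    """Return the tree rooted at h with `key` (or the last node on its
    search path, if the key is absent) brought to the root."""
    if h is None:
        return None
    if key < h.key:
        if h.left is None:
            return h                      # key absent; h is the closest node
        if key < h.left.key:              # left-left (zig-zig)
            h.left.left = splay(h.left.left, key)
            h = rotate_right(h)
        elif key > h.left.key:            # left-right (zig-zag)
            h.left.right = splay(h.left.right, key)
            if h.left.right is not None:
                h.left = rotate_left(h.left)
        return h if h.left is None else rotate_right(h)   # final zig
    if key > h.key:
        if h.right is None:
            return h
        if key > h.right.key:             # right-right (zig-zig)
            h.right.right = splay(h.right.right, key)
            h = rotate_left(h)
        elif key < h.right.key:           # right-left (zig-zag)
            h.right.left = splay(h.right.left, key)
            if h.right.left is not None:
                h.right = rotate_right(h.right)
        return h if h.right is None else rotate_left(h)
    return h                              # key is already at the root


def bst_insert(node, key):
    """Plain (non-splaying) BST insert, used only to build a test tree."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = bst_insert(node.left, key)
    elif key > node.key:
        node.right = bst_insert(node.right, key)
    return node


def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)
```

Usage would look like root = splay(root, key) followed by a check of root.key; find, insert and delete can all be built on top of that one call.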
There's a nice implementation for functional languages in Chris Okasaki's Purely Functional Data Structures, which imho is one of the finest short CS texts in existence.
On Wikipedia you can find a really nice article about splay trees. You should not love recursion, because recursion can easily get out of hand; it's better to use iteration.
I'm studying how to balance trees and I have some questions
Is it possible to balance a normal binary tree? If yes, which algorithm should be used?
Do I necessarily have to use an AVL or Red-black tree to obtain a balanced tree? How do these work?
I read something about rotations, weights but I'm kind of confused right now
Is it possible to balance a normal binary tree? If yes, which algorithm should be used?
In O(n) you can build a complete tree, and populate it with the elements in in-order traversal.
It cannot be done better, because a BST might in rare cases decay into a chain (linked list), where every node has one child that is null. In this case, accessing the element in the middle is O(n) itself.
Do I necessarily have to use an AVL or Red-black tree to obtain a balanced tree?
There are other balanced trees such as B+ trees, and other data structures (not trees) such as skip-lists. You might want to have a look at a list of known data structures, especially the trees section.
How do these work?
I find the wikipedia articles both on AVL tree and Red-Black tree very informative. If you have something specific you don't understand there - you should ask.
Also: trying to implement a balanced tree on your own (implementing a known tree, not inventing a new one, of course) is great for educational purposes, and by doing so you will definitely understand how it works.
Well... AVL and red-black trees are "normal binary trees" that are balanced, and keep that balance (for some definition of "balanced"). I'm not a computer science teacher to come up with my own explanation of the algorithms, and I guess you aren't looking for a cut&paste from Wikipedia :-)
Now, for balancing binary trees: if the tree is a search tree (i.e. 'sorted', but 'balanced' doesn't really make all that much sense if it's not) you could always just recreate the tree. The simplest algorithm is to use an array with all the elements from the tree, in sorted order (easily obtained from an inorder traversal). Then build an algorithm around this general idea:
Take the middle element of the array as the root of the tree. This will create a tree node, and two arrays "left" and "right", which are meant to form the left and right subtrees.
Apply this same algorithm recursively to create a tree from the "left" array and one from the "right" array. These two trees become the children of the parent node.
You might have to be careful with the case when the array has an even number of elements: there is no obvious "middle element", and removing one of the two candidates will create arrays of different sizes. The two sizes differ by only one element, though, so this does not upset the balance.
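The idea above can be sketched in Python roughly like this. It is a minimal illustration, not a tuned implementation; the Node class and the names inorder, build_balanced, rebalance and height are invented for the example. Instead of copying "left" and "right" arrays, it passes index bounds into the same sorted list.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def inorder(node, out=None):
    """Collect the keys of a search tree in sorted order."""
    if out is None:
        out = []
    if node is not None:
        inorder(node.left, out)
        out.append(node.key)
        inorder(node.right, out)
    return out


def build_balanced(keys, lo=0, hi=None):
    """Build a tree from the sorted slice keys[lo:hi]."""
    if hi is None:
        hi = len(keys)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2          # with an even count, either middle works
    return Node(keys[mid],
                build_balanced(keys, lo, mid),
                build_balanced(keys, mid + 1, hi))


def rebalance(root):
    """Recreate the whole tree in O(n): flatten, then rebuild."""
    return build_balanced(inorder(root))


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))
```

Because inorder is recursive, a very long degenerate chain could hit Python's recursion limit; an explicit stack would avoid that.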
Of course, doing something like this every time you change the tree isn't such a great idea; you really want to use self-balancing trees like AVL for that. Doing it after creating the tree might not be all that useful either: you could just use the array itself and do binary searches on it, instead of making a tree. The array IS just another form of a binary tree...
EDIT: there is a reason why a lot of computer scientists have spent a lot of time developing data structures and algorithms that perform well in certain situations. Rolling your own version of a balanced binary tree is unlikely to beat these...
Can you balance an unbalanced tree?
Yes, you can. You use the same balance function you created for your AVL tree inside a PostOrderTraversal function.
Should You Do it?
No!!! You should recreate it! Balancing the tree will cost you unnecessarily.
How do I recreate it?
Use an InOrderTraversal function to put your nodes into an array. Then repeatedly take the middle element of the current range as the next node to add to the new tree, and do the same for the left half and the right half of the range.
Is it possible to balance a normal binary tree? If yes, which algorithm should be used?
Do I necessarily have to use an AVL or Red-black tree to obtain a balanced tree? How do these work?
In general, trees are either unbalanced or balanced. AVL, Red-Black, 2-3, etc. are just trees with some properties, and according to their properties they use some extra variables and functions. Those extra variables and functions can also be used in "normal" binary trees. In other words, those functions and variables are not bound to their respective type of tree. The nodes of a "normal" binary tree always had a balance! You just didn't use it because you didn't care if the "normal" binary tree was balanced or not. They also always had a height, depth, etc. You just didn't care. In general, you will realize at some point that it is all a trade-off between speed and memory. If you know what you are doing, more memory usage will make your program faster. Less memory usage means more calculations, so you will have a slower program.
I have just finished a job interview and I was struggling with this question, which seems to me a very hard question to give in a 15-minute interview.
The question was:
Write a function, which given a stream of integers (unordered), builds a balanced search tree.
Now, you can't wait for the input to end (it's a stream), so you need to balance the tree on the fly.
My first answer was to use a Red-Black tree, which of course does the job, but I have to assume they didn't expect me to implement a red-black tree in 15 minutes.
So, is there any simple solution for this problem that I'm not aware of?
Thanks,
Dave
I personally think that the best way to do this would be to go for a randomized binary search tree like a treap. This doesn't absolutely guarantee that the tree will be balanced, but with high probability the tree will have a good balance factor. A treap works by augmenting each element of the tree with a uniformly random number, then ensuring that the tree is a binary search tree with respect to the keys and a heap with respect to the uniform random values. Insertion into a treap is extremely easy:
Pick a random number to assign to the newly-added element.
Insert the element into the BST using standard BST insertion.
While the newly-inserted element's random number is greater than the random number of its parent, perform a tree rotation to bring the new element above its parent.
That last step is the only really hard one, but if you had some time to work it out on a whiteboard I'm pretty sure that you could implement this on-the-fly in an interview.
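As a rough sketch of those three steps in Python (the names TreapNode, treap_insert, keys_in_order and is_heap are made up here, and duplicates are simply ignored, which is an assumption the question does not state): the recursion does the BST insertion on the way down and the rotations on the way back up. Note that the rotation test compares the random priorities, not the keys.

```python
import random


class TreapNode:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()   # step 1: random number per element
        self.left = None
        self.right = None


def rotate_right(node):
    child = node.left
    node.left, child.right = child.right, node
    return child


def rotate_left(node):
    child = node.right
    node.right, child.left = child.left, node
    return child


def treap_insert(node, key):
    """Insert key and return the new root of this subtree."""
    if node is None:
        return TreapNode(key)             # step 2: ordinary BST insertion
    if key < node.key:
        node.left = treap_insert(node.left, key)
        if node.left.priority > node.priority:    # step 3: restore heap order
            node = rotate_right(node)
    elif key > node.key:
        node.right = treap_insert(node.right, key)
        if node.right.priority > node.priority:
            node = rotate_left(node)
    return node                           # equal keys: nothing to do


def keys_in_order(node):
    if node is None:
        return []
    return keys_in_order(node.left) + [node.key] + keys_in_order(node.right)


def is_heap(node):
    """True if every parent's priority is at least its children's."""
    if node is None:
        return True
    for child in (node.left, node.right):
        if child is not None and child.priority > node.priority:
            return False
    return is_heap(node.left) and is_heap(node.right)
```

Feeding a stream into it is just root = treap_insert(root, value) for each incoming value; even sorted input tends to give a reasonably shallow tree, though that is a probabilistic expectation and not a guarantee.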
Another option that might work would be to use a splay tree. It's another type of fast BST that can be implemented assuming you have a standard BST insert function and the ability to do tree rotations. Importantly, splay trees are extremely fast in practice, and it's known that they are (to within a constant factor) at least as good as any other static binary search tree.
Depending on what's meant by "search tree," you could also consider storing the integers in some structure optimized for lookup of integers. For example, you could use a bitwise trie to store the integers, which supports lookup in time proportional to the number of bits in a machine word. This can be implemented quite nicely using a recursive function to look over the bits, and doesn't require any sort of rotations. If you needed to blast out an implementation in fifteen minutes, and if the interviewer allows you to deviate from the standard binary search trees, then this might be a great solution.
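A bitwise trie along those lines might look like the following Python sketch. It assumes non-negative integers that fit in 32 bits (the constant BITS and the function names are invented for the example), and each trie node is just a two-element list indexed by the next bit.

```python
BITS = 32   # assumed word size; values must satisfy 0 <= value < 2**BITS


def new_trie():
    return [None, None]          # children for bit 0 and bit 1


def trie_insert(node, value, bit=BITS - 1):
    """Walk from the most significant bit down, creating nodes as needed."""
    if bit < 0:
        return
    b = (value >> bit) & 1
    if node[b] is None:
        node[b] = [None, None]
    trie_insert(node[b], value, bit - 1)


def trie_contains(node, value, bit=BITS - 1):
    """True if value was inserted; follows one child per bit."""
    if bit < 0:
        return True
    child = node[(value >> bit) & 1]
    return child is not None and trie_contains(child, value, bit - 1)
```

Every operation takes exactly BITS steps regardless of how many integers are stored, and there is nothing to rebalance.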
Hope this helps!
AA Trees are a bit simpler than Red-Black trees, but I couldn't implement one off the top of my head.
One of the simplest balanced binary search trees is the BB(α)-tree. You pick the constant α, which says how unbalanced the tree can get. At all times, #descendants(child) <= (1-α) × #descendants(node) must hold. You treat it as a normal binary search tree, but when the formula no longer holds for some node, you just rebuild that part of the tree from scratch, so that it is perfectly balanced.
The amortized time complexity for insertion or deletion is still O(log N), just as with other balanced binary trees.
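Here is a sketch of that rebuild-on-violation scheme in Python, for insertion only. The choice α = 0.25, the convention that a node counts among its own descendants, and all of the names (Node, insert, rebuild, flatten, build, height) are assumptions made for the example.

```python
ALPHA = 0.25   # assumed balance constant; each child holds at most 75% of its parent


class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None
        self.size = 1            # number of nodes in this subtree, itself included


def size(node):
    return node.size if node is not None else 0


def flatten(node, out):
    """Append the nodes of the subtree to out in sorted order."""
    if node is not None:
        flatten(node.left, out)
        out.append(node)
        flatten(node.right, out)


def build(nodes, lo, hi):
    """Relink nodes[lo:hi] into a perfectly balanced subtree."""
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    root = nodes[mid]
    root.left = build(nodes, lo, mid)
    root.right = build(nodes, mid + 1, hi)
    root.size = hi - lo
    return root


def rebuild(node):
    nodes = []
    flatten(node, nodes)
    return build(nodes, 0, len(nodes))


def insert(node, key):
    """Ordinary BST insert, followed by a balance check on the way back up."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node              # duplicate: ignored
    node.size = 1 + size(node.left) + size(node.right)
    if max(size(node.left), size(node.right)) > (1 - ALPHA) * node.size:
        node = rebuild(node)     # formula violated: rebuild this part from scratch
    return node


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))
```

With this α, every step down the tree shrinks the subtree to at most 75% of its size, so the depth stays logarithmic even when the keys arrive in sorted order.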
I've seen binary trees and binary searching mentioned in several books I've read lately, but as I'm still at the beginning of my studies in Computer Science, I've yet to take a class that's really dealt with algorithms and data structures in a serious way.
I've checked around the typical sources (Wikipedia, Google) and most descriptions of the usefulness and implementation of (in particular) Red-Black trees have come off as dense and difficult to understand. I'm sure for someone with the necessary background, it makes perfect sense, but at the moment it reads like a foreign language almost.
So what makes binary trees useful in some of the common tasks you find yourself doing while programming? Beyond that, which trees do you prefer to use (please include a sample implementation) and why?
Red Black trees are good for creating well-balanced trees. The major problem with binary search trees is that you can make them unbalanced very easily. Imagine your first number is a 15. Then all the numbers after that are increasingly smaller than 15. You'll have a tree that is very heavy on the left side and has nothing on the right side.
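To see that effect concretely, here is a tiny Python sketch with a plain, non-balancing insert (Node, bst_insert and height are names invented for the example). The same fifteen keys give very different heights depending only on the order they arrive in.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def bst_insert(node, key):
    """Plain BST insertion with no rebalancing at all."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = bst_insert(node.left, key)
    elif key > node.key:
        node.right = bst_insert(node.right, key)
    return node


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def build(keys):
    root = None
    for key in keys:
        root = bst_insert(root, key)
    return root


# 15 first, then ever smaller numbers: everything hangs off the left side.
descending = build(range(15, 0, -1))
# The same keys in a lucky order fill the tree level by level.
lucky = build([8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15])
print(height(descending), height(lucky))
```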
Red Black trees solve that by forcing your tree to be balanced whenever you insert or delete. They accomplish this through a series of rotations between ancestor nodes and child nodes. The algorithm is actually pretty straightforward, although it is a bit long. I'd suggest picking up the CLRS (Cormen, Leiserson, Rivest and Stein) textbook, "Introduction to Algorithms" and reading up on RB Trees.
The implementation is also not really so short so it's probably not really best to include it here. Nevertheless, trees are used extensively for high performance apps that need access to lots of data. They provide a very efficient way of finding nodes, with a relatively small overhead of insertion/deletion. Again, I'd suggest looking at CLRS to read up on how they're used.
While BSTs may not be used explicitly, one example of the use of trees in general is in almost every single modern RDBMS. Similarly, your file system is almost certainly represented as some sort of tree structure, and files are likewise indexed that way. Trees power Google. Trees power just about every website on the internet.
I'd like to address only the question "So what makes binary trees useful in some of the common tasks you find yourself doing while programming?"
This is a big topic that many people disagree on. Some say that the algorithms taught in a CS degree such as binary search trees and directed graphs are not used in day-to-day programming and are therefore irrelevant. Others disagree, saying that these algorithms and data structures are the foundation for all of our programming and it is essential to understand them, even if you never have to write one for yourself. This filters into conversations about good interviewing and hiring practices. For example, Steve Yegge has an article on interviewing at Google that addresses this question. Remember this debate; experienced people disagree.
In typical business programming you may not need to create binary trees or even trees very often at all. However, you will use many classes which internally operate using trees. Many of the core organization classes in every language use trees and hashes to store and access data.
If you are involved in more high-performance endeavors or situations that are somewhat outside the norm of business programming, you will find trees to be an immediate friend. As another poster said, trees are core data structures for databases and indexes of all kinds. They are useful in data mining and visualization, advanced graphics (2d and 3d), and a host of other computational problems.
I have used binary trees in the form of BSP (binary space partitioning) trees in 3d graphics. I am currently looking at trees again to sort large amounts of geocoded data and other data for information visualization in Flash/Flex applications. Whenever you are pushing the boundary of the hardware or you want to run on lower hardware specifications, understanding and selecting the best algorithm can make the difference between failure and success.
None of the answers mention what it is exactly BSTs are good for.
If all you want to do is look up by value, then a hashtable is much faster: O(1) insert and lookup (amortized, in the expected case).
A balanced BST will be O(log N) for lookup, where N is the number of nodes in the tree; inserts are also O(log N).
RB and AVL trees are important, as another answer mentioned, because they guarantee this property: if a plain BST is created from values inserted in sorted order, the tree will be as high as the number of values inserted, which is bad for lookup performance.
The difference between RB and AVL trees is in the rotations required to rebalance after an insert or delete: an AVL deletion may need O(log N) rotations, while an RB tree needs only O(1) rotations per update (although recoloring can still touch O(log N) nodes). An example of the benefit of this constant bound is a case where you might be keeping a persistent data source: if you need to track changes to roll back, you would have to track O(log N) possible changes with an AVL tree.
Why would you be willing to pay for the cost of a tree over a hash table? ORDER! Hash tables have no order, BSTs on the other hand are always naturally ordered by virtue of their structure. So if you find yourself throwing a bunch of data in an array or other container and then sorting it later, a BST may be a better solution.
The tree's order property gives you ordered iteration through an in-order traversal; the usual tree walks (depth-first in pre-order or post-order, and breadth-first) are also available. These iteration algorithms are useful in different circumstances if you want to look them up.
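As a small illustration of what ORDER buys you, here is a Python sketch on a plain BST (the names Node, bst_insert, in_order and range_query are made up for the example). An in-order walk yields the keys sorted without any separate sort step, and a range query can skip whole subtrees that cannot contain matches, which a hash table has no way to do.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def bst_insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = bst_insert(node.left, key)
    elif key > node.key:
        node.right = bst_insert(node.right, key)
    return node


def in_order(node):
    """Yield every key in ascending order."""
    if node is not None:
        yield from in_order(node.left)
        yield node.key
        yield from in_order(node.right)


def range_query(node, lo, hi):
    """Yield the keys k with lo <= k <= hi, in ascending order."""
    if node is None:
        return
    if lo < node.key:                 # smaller keys might still be in range
        yield from range_query(node.left, lo, hi)
    if lo <= node.key <= hi:
        yield node.key
    if node.key < hi:                 # larger keys might still be in range
        yield from range_query(node.right, lo, hi)


root = None
for key in [50, 30, 70, 20, 40, 60, 80]:
    root = bst_insert(root, key)
```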
Red-black trees are used internally in many ordered containers of language libraries: C++ std::set and std::map (in the common implementations), .NET SortedDictionary, Java TreeSet, etc.
So trees are very useful, and you may use them quite often without even knowing it. You most likely will never need to write one yourself, though I would highly recommend it as an interesting programming exercise.
Red-black trees and B-trees are used in all sorts of persistent storage; because the trees are balanced, the cost of breadth and depth traversals stays bounded.
Nearly all modern database systems use trees for data storage.
BSTs make the world go round, as said by Micheal. If you're looking for a good tree to implement, take a look at AVL trees (Wikipedia). They have a balancing condition, so lookups are guaranteed to be O(log n). This kind of searching efficiency makes it logical to put into any kind of indexing process. The only thing that would be more efficient would be a hash table, but hashing functions get ugly quick, fast, and in a hurry. Also, you run into collisions, as the Birthday Paradox (which is related to the pigeonhole principle) predicts.
What textbook are you using? We used Data Structures and Algorithm Analysis in Java by Mark Allen Weiss. I actually have it open in my lap as I'm typing this. It has a great section about Red-Black trees, and even includes the code necessary to implement all the trees it talks about.
Red-black trees stay balanced, so you don't have to traverse deep to get items out. The time saved makes RB trees O(log(n)) in the WORST case, whereas unlucky binary trees can get into a lopsided configuration and cause O(n) retrievals in the worst case. This does happen in practice, for example when keys arrive in sorted order. So if you need time-critical code (database retrievals, network servers, etc.), you use RB trees to support ordered or unordered lists/sets.
But RBTrees are for noobs! If you are doing AI and you need to perform a search, you find you fork the state information a lot. You can use a persistent red-black tree to fork new states in O(log(n)). A persistent red-black tree keeps a copy of the tree before and after a morphological operation (insert/delete), but without copying the entire tree (normally an O(n) operation). I have open sourced a persistent red-black tree for Java:
http://edinburghhacklab.com/2011/07/a-java-implementation-of-persistent-red-black-trees-open-sourced/
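To illustrate the "keep both versions without copying everything" idea, here is a Python sketch of path copying on a plain, unbalanced BST. This is not the linked Java library and it leaves out the red-black rebalancing entirely; the names Node, persistent_insert and inorder are invented for the example. Only the nodes on the search path are copied, and every untouched subtree is shared between the old and the new version.

```python
class Node:
    """Treated as immutable: nodes are never modified after creation."""
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def persistent_insert(node, key):
    """Return the root of a NEW version; the old version stays valid."""
    if node is None:
        return Node(key)
    if key < node.key:
        # Copy this node, share the untouched right subtree.
        return Node(node.key, persistent_insert(node.left, key), node.right)
    if key > node.key:
        # Copy this node, share the untouched left subtree.
        return Node(node.key, node.left, persistent_insert(node.right, key))
    return node                      # key already present: reuse as is


def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)


v1 = None
for key in [50, 30, 70, 20, 40]:
    v1 = persistent_insert(v1, key)
v2 = persistent_insert(v1, 10)       # fork a new state from v1
```

In a balanced tree the copied path has O(log(n)) nodes, which is where the O(log(n)) cost of forking a state comes from.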
The best description of red-black trees I have seen is the one in Cormen, Leiserson and Rivest's 'Introduction to Algorithms'. I could even understand it enough to partially implement one (insertion only). There are also quite a few applets such as This One on various web pages that animate the process and allow you to watch and step through a graphical representation of the algorithm building a tree structure.
Since you ask which tree people use, you need to know that a Red Black tree is fundamentally a 2-3-4 B-tree (i.e. a B-tree of order 4). A B-tree is not equivalent to a binary tree (as asked in your question).
Here's an excellent resource describing the initial abstraction known as the symmetric binary B-tree that later evolved into the RBTree. You would need a good grasp on B-trees before it makes sense. To summarize: a 'red' link on a Red Black tree is a way to represent nodes that are part of a B-tree node (values within a key range), whereas 'black' links are nodes that are connected vertically in a B-tree.
So, here's what you get when you translate the rules of a Red Black tree in terms of a B-tree (I'm using the format Red Black tree rule => B Tree equivalent):
1) A node is either red or black. => A node in a B-tree can either be part of a node or be a node on a new level.
2) The root is black. (This rule is sometimes omitted, since it doesn't affect analysis.) => The root node can be thought of either as part of an internal root node or as a child of an imaginary parent node.
3) All leaves (NIL) are black. (All leaves are the same color as the root.) => Since one way of representing an RB tree is by omitting the leaves, we can rule this out.
4) Both children of every red node are black. => The children of an internal node in a B-tree always lie on another level.
5) Every simple path from a given node to any of its descendant leaves contains the same number of black nodes. => A B-tree is kept balanced as it requires that all leaf nodes are at the same depth. (Hence the height of a B-tree node is represented by the number of black links from the root to the leaf of a Red Black tree.)
Also, there's a simpler 'non-standard' implementation by Robert Sedgewick here: (He's the author of the book Algorithms along with Wayne)
Lots and lots of heat here, but not much light, so let's see if we can provide some.
First, an RB tree is an associative data structure, unlike, say, an array, which cannot take a key and return an associated value, well, unless that's an integer "key" in a 0% sparse index of contiguous integers. An array cannot grow in size either (yes, I know about realloc() too, but under the covers that may require a new array and then a memcpy()), so if you have either of these requirements, an array won't do. An array's memory efficiency is perfect. Zero waste, but not very smart, or flexible, realloc() notwithstanding.
Second, in contrast to a bsearch() on an array of elements, which IS an associative data structure, an RB tree can grow (AND shrink) itself in size dynamically. The bsearch() works fine for indexing a data structure of a known size, which will remain that size. So if you don't know the size of your data in advance, or new elements need to be added, or deleted, a bsearch() is out. Bsearch() and qsort() are both well supported in classic C, and have good memory efficiency, but are not dynamic enough for many applications. They are my personal favorite though because they're quick, easy, and if you're not dealing with real-time apps, quite often are flexible enough. In addition, in C/C++ you can sort an array of pointers to data records, each pointing to the struct member you wish to compare, for example, and rearrange the pointers in the pointer array so that reading the pointers in order at the end of the pointer sort yields your data in sorted order. Using this with memory-mapped data files is extremely memory efficient, fast, and fairly easy. All you need to do is add a few "*"s to your compare function/s.
Third, in contrast to a simple fixed-size hashtable, which cannot be grown once filled without rehashing every element into a larger table, an RB tree will automagically grow itself and balance itself to maintain its O(log(n)) performance guarantee. Especially if the RB tree's key is an int, it can be faster than a hash, because even though a hashtable's complexity is O(1), that 1 can be a very expensive hash calculation. A tree's multiple 1-clock integer compares often outperform 100-clock+ hash calculations, to say nothing of rehashing, and malloc()ing space for hash collisions and rehashes. Finally, if you want ISAM access, as well as key access to your data, a hash is ruled out, as there is no ordering of the data inherent in the hashtable, in contrast to the natural ordering of data in any tree implementation. The classic use for a hash table is to provide keyed access to a table of reserved words for a compiler. Its memory efficiency is excellent.
Fourth, and very low on any list, is the linked, or doubly-linked list, which, in contrast to an array, naturally supports element insertions and deletions, and as that implies, resizing. It's the slowest of all the data structures, as each element only knows how to get to the next element, so you have to search, on average, (element_knt/2) links to find your datum. It is mostly used where insertions and deletions somewhere in the middle of the list are common, and especially, where the list is circular and feeds an expensive process which makes the time to read the links relatively small. My general RX is to use an arbitrarily large array instead of a linked list if your only requirement is that it be able to increase in size. If you run out of size with an array, you can realloc() a larger array. The STL does this for you "under the covers" when you use a vector. Crude, but potentially 1,000s of times faster if you don't need insertions, deletions or keyed lookups. Its memory efficiency is poor, especially for doubly-linked lists. In fact, a doubly-linked list, requiring two pointers per element, is about as memory inefficient as a red-black tree while having NONE of its appealing fast, ordered retrieval characteristics.
Fifth, trees support many more operations on their sorted data than any other data structure. For example, many database queries make use of the fact that a range of leaf values can be easily specified by specifying their common parent, and then focusing subsequent processing on the part of the tree that parent "owns". The potential for multi-threading offered by this approach should be obvious, as only a small region of the tree needs to be locked - namely, only the nodes the parent owns, and the parent itself.
In short, trees are the Cadillac of data structures. You pay a high price in terms of memory used, but you get a completely self-maintaining data structure. This is why, as pointed out in other replies here, transaction databases use trees almost exclusively.
If you would like to see how a Red-Black tree is supposed to look graphically, I have coded an implementation of a Red-Black tree that you can download here
IME, almost no one understands the RB tree algorithm. People can repeat the rules back to you, but they don't understand why those rules and where they come from. I am no exception :-)
For this reason, I prefer the AVL algorithm, because it's easy to comprehend. Once you understand it, you can then code it up from scratch, because it makes sense to you.
Trees can be fast. If you have a million nodes in a balanced binary tree, it takes about twenty comparisons to find any one item. If you have a million nodes in a linked list, it takes five hundred thousand comparisons on average to find the same item.
If the tree is unbalanced, though, it can be just as slow as a list, and also take more memory to store. Imagine a tree where most nodes have a right child, but no left child; it is a list, but you still have to hold memory space to put in the left node if one shows up.
Anyways, the AVL tree was the first balanced binary tree algorithm, and the Wikipedia article on it is pretty clear. The Wikipedia article on red-black trees is clear as mud, honestly.
Beyond binary trees, B-Trees are trees where each node can have many values. B-Tree is not a binary tree, just happens to be the name of it. They're really useful for utilizing memory efficiently; each node of the tree can be sized to fit in one block of memory, so that you're not (slowly) going and finding tons of different things in memory that was paged to disk. Here's a phenomenal example of the B-Tree.