How do a Trie and a B+ tree compare for indexing lexicographically sorted strings (on the order of some billions)?
It should support range queries as well.
I am interested in performance as well as implementation complexity.
I would say it depends on what you mean by "range".
If your range is expressed as "all words beginning with ...", then a Trie is the right choice, I'd say. On the other hand, Tries are not meant for requests like "all words between XX and ZZ".
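To make the distinction concrete, here is a small sketch using java.util.TreeSet (a red-black tree in the JDK, not a B+ Tree, but any sorted index answers both kinds of range query the same way); the sample words are made up:
import java.util.List;
import java.util.TreeSet;

public class RangeQueryDemo {
    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(List.of(
                "apple", "apply", "banana", "band", "bandana", "cherry"));

        // "All words between XX and ZZ": a plain ordered range scan.
        System.out.println(index.subSet("band", true, "cherry", false));
        // -> [band, bandana]

        // "All words beginning with ...": on sorted strings this is also a
        // range scan: start at the prefix and stop at the first non-match.
        for (String word : index.tailSet("app")) {
            if (!word.startsWith("app")) break;
            System.out.println(word);   // apple, apply
        }
    }
}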
Note that the branching factor b of the B+ Tree affects its performance (it determines the number of intermediate nodes). If h is the height of the tree, then n_max ≈ b^h, and therefore h ≈ log(n_max) / log(b).
With n = 1 000 000 000 and b = 100, we get h ≈ 5. That means only about 5 pointer dereferences to go from the root to a leaf, which is much more cache-friendly than a Trie.
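To make that arithmetic concrete, a back-of-the-envelope sketch (it ignores fill factor and leaf layout, so treat the result as an estimate only):
// Estimated height of a B+ Tree with branching factor b holding n keys:
// h ≈ log(n) / log(b), rounded up.
static int estimatedHeight(long n, int b) {
    return (int) Math.ceil(Math.log(n) / Math.log(b));
}
// estimatedHeight(1_000_000_000L, 100) == 5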
Finally, a B+ Tree is admittedly more difficult to implement than a Trie: it's more on a Red-Black Tree level of complexity.
Depends on your actual task:
If you want to get the whole subtree, a B+Tree is your best choice because it is space efficient.
But if you want to get the first N children of a subtree, then a Trie is the best choice, because you simply visit fewer nodes than in the B+ Tree scenario.
The most popular task that a Trie handles well is word prefix completion (see the sketch below).
Wikipedia has some algorithmic complexity facts:
B+ tree (section Characteristics), Trie (unfortunately spread all over the article). Hope that helps.
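To illustrate the prefix-completion case mentioned above, here is a minimal, unoptimized trie sketch (class and method names are made up for the example):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.isWord = true;
    }

    // Collect every stored word that starts with the given prefix.
    List<String> complete(String prefix) {
        Node cur = root;
        for (char c : prefix.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return List.of();   // no word has this prefix
        }
        List<String> out = new ArrayList<>();
        collect(cur, new StringBuilder(prefix), out);
        return out;
    }

    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
Walking the prefix costs O(|prefix|) regardless of how many words are stored; the collection step then only visits nodes under that prefix.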
Related
I am trying to consolidate my understanding of the difference between a patricia trie (radix tree with r = 2) and a binary trie. As far as I can see, the implementations of a binary trie and a patricia trie (radix tree with r = 2) are identical?
Just for context: I fully understand how to implement radix trees and have written classes and tests for said implementations; I just don't understand how the implementation of a patricia trie (radix tree with r = 2) differs from that of a binary trie.
I have 2 separate guesses that I would like feedback on regarding the difference:
In a patricia trie (radix tree with r = 2), a node branch can have an edge key from the set K = {0, 1} with values from the set V = {null, node<pointer>} (due to the binary radix constraint); but if a parent has only single-child descendants all the way down to a leaf (a singly-linked-list-shaped subtree), the final edge pointing to that leaf node (which is also a word-terminator node) can carry a whole series of bits (e.g. 10010). Writing this out, it makes sense to me, because it would explain what makes this a radix tree: a compression has been applied.
There is no compression difference between a binary trie and a patricia trie (radix tree with r = 2); the only difference is that word-terminator nodes have value attributes holding the entire string of the added word. This could make sense if a binary trie only stores the characters of an inserted word along its edges, whereas a patricia trie stores the word at the terminator node as well as along the edges. The problem with this guess is: how exactly would that make the patricia trie an optimised version of the binary trie? We would simply have added to our space complexity.
I feel there are subtle details I am missing that could help me fully grasp the differences with confidence. I would appreciate any help!
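For what it's worth, guess 1 matches how the distinction is usually drawn: a binary trie spends one node per bit, while a patricia trie (radix tree with r = 2) collapses single-child chains so an edge can carry several bits (classic PATRICIA stores a skip index instead of the bit string, but the idea is the same). A rough sketch of the two node shapes, with hypothetical field names just to make the structural difference visible:
// Binary trie: every edge consumes exactly one bit, so a key of k bits
// always needs k node hops, even along single-child chains.
class BinaryTrieNode {
    BinaryTrieNode zero, one;   // child followed for bit 0 / bit 1
    boolean isKey;              // marks the end of a stored key
}

// Patricia / radix tree with r = 2: single-child chains are collapsed,
// so an outgoing edge carries a whole bit string (e.g. "10010").
class PatriciaNode {
    String edgeLabelZero, edgeLabelOne;   // compressed bit strings (may be null)
    PatriciaNode childZero, childOne;
    boolean isKey;
}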
I am trying to find out which is more efficient in terms of search speed: a trie or a B-Tree. I have a dictionary of English words and I want to locate a word in that dictionary efficiently.
If by "more efficient in time of search" you refer to theoretical time complexity, then a B-Tree offers O(log n * |S|) time complexity for search (1), while a trie offers O(|S|), where |S| is the length of the searched string and n is the number of elements in the dictionary.
If by "more efficient in time of search" you refer to actual real-life run time, that depends on the actual implementation, the actual data and the actual search behaviour. Some examples that might influence the answer:
Size of data
Storage system (for example: RAM/Flash/disk/distributed filesystem/...)
Distribution of searches
Code optimizations of each implementation
(and much more)
(1) There are O(log n) comparisons, and each comparison takes O(|S|) time, since in the worst case you need to traverse the entire string to decide which one is greater.
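To make the footnote concrete, here is a plain binary search over a lexicographically sorted array (used here only as a stand-in for a comparison-based index): there are about log2(n) iterations, and each compareTo may inspect up to |S| characters of the query in the worst case.
// O(log n) comparisons, each comparison up to O(|S|) character checks.
static boolean contains(String[] sortedDictionary, String word) {
    int lo = 0, hi = sortedDictionary.length - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        int cmp = sortedDictionary[mid].compareTo(word);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1;
        else hi = mid - 1;
    }
    return false;
}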
It depends on what your need is. If you want to get the whole subtree, a B+ Tree is your best choice because it is space-efficient. Also, the branching factor b of the B+ Tree affects its performance (it determines the number of intermediate nodes): if h is the height of the tree, then n_max ≈ b^h, and therefore h ≈ log(n_max) / log(b).
With n = 1 000 000 000 and b = 100, we get h ≈ 5. That means only about 5 pointer dereferences to go from the root to a leaf, which is more cache-friendly than a Trie.
But if you want to get the first N children of a subtree, then a Trie is the best choice, because you simply visit fewer nodes than in the B+ Tree scenario.
Also, word prefix completion is handled well by a trie.
I wanted to understand how a red-black tree works. I understood the algorithm and how to fix the properties after insert and delete operations, but something isn't clear to me: why is a red-black tree more balanced than a plain binary search tree? I want to understand the intuition behind why rotations and fixing the tree properties make a red-black tree more balanced.
Thanks.
Suppose you create a plain binary tree by inserting the following items in order: 1, 2, 3, 4, 5, 6, 7, 8, 9. Each new item will always be the largest item in the tree, and so will be inserted as the right-most possible node. Your "tree" would look like this:
1
 \
  2
   \
    3
     .
      .
       .
        9
The rotations performed in a red-black tree (or any type of balanced binary tree) ensure that neither the left nor right subtree of any node is significantly deeper than the other (typically, the difference in height is 0 or 1, but any constant factor would do.) This way, operations whose running time depends on the height h of the tree are always O(lg n), since the rotations maintain the property that h = O(lg n), whereas in the worst case shown above h = O(n).
For a red-black tree in particular, the node coloring is simply a bookkeeping trick that helps in proving that the rotations always maintain h = O(lg n). Different types of balanced binary trees (AVL trees, 2-3 trees, etc.) use different bookkeeping techniques for maintaining the same property.
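As a concrete picture of what a single rotation does, here is a generic left rotation on a plain binary search tree (the red-black colour bookkeeping is left out on purpose):
class Node {
    int key;
    Node left, right;
}

// Left rotation around x: x's right child y moves up, x becomes y's left
// child, and y's old left subtree becomes x's new right subtree. The
// in-order key ordering is preserved; only the heights of the two sides change.
static Node rotateLeft(Node x) {
    Node y = x.right;
    x.right = y.left;
    y.left = x;
    return y;   // y is the new root of this subtree
}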
Why is a red-black tree more balanced than a binary search tree?
Because a red-black tree guarantees O(log N) performance for insertion, deletion and lookups for any order of operations.
Why do rotations and fixing the tree properties make a red-black tree more balanced?
Apart from the general properties that any binary search tree must obey, a red-black tree also obeys the following properties:
No node has two red links connected to it.
Every path from root to null link has the same number of black links.
Red links lean left.
Now we want to prove the following proposition:
Proposition. Height of tree is ≤ 2 lg N in the worst case.
Proof.
Since every path from the root to any null link has the same number of black links, and two red links are never in a row, the maximum height will always be less than or equal to 2 lg N in the worst case.
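Spelled out a little more (a sketch of the usual argument): if every path from the root to a null link contains exactly B black links, then the black links alone form a perfectly balanced tree of height B, so the tree holds at least 2^B - 1 nodes, which gives B ≤ lg(N + 1). Since two red links are never in a row, any path has at most one red link per black link, so its length is at most 2B ≤ 2 lg(N + 1), i.e. roughly 2 lg N.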
Although this is quite late: since I was recently studying red-black trees and struggling with the intuition behind why some magical rotation and coloring balances the tree, I was asking myself the same question as the OP:
why rotations and fixing tree properties make a red-black tree more balanced
After a few days of "research", I had the eureka moment and decided to write it up in detail. I won't copy-paste it here because some formatting would not come out right, so anyone who is interested can check it on GitHub. I tried to explain it with a lot of images and simulations. I hope it helps someone someday who happens to trip into this thread searching for the same question :)
Why is it important that a binary tree be balanced
Imagine a tree that looks like this:
A
 \
  B
   \
    C
     \
      D
       \
        E
This is a valid binary tree, but now most operations are O(n) instead of O(lg n).
The balance of a binary tree is governed by a property called skewness. The more skewed a tree is, the higher the time complexity of accessing an element of the binary tree. Take, say, this tree:
  1
 / \
2   3
 \   \
  7   4
       \
        5
         \
          6
The above is also a binary tree, but it is right-skewed. It has 7 elements, so an ideal binary tree would need at most about log2(7) ≈ 3 lookups, whereas here reaching the deepest node takes 5 lookups in the worst case. So the skew here only costs a couple of extra levels. But consider a tree with thousands of nodes: the skew could be far more considerable in that case. So it is important to keep the binary tree balanced.
Then again, skewness is a matter of debate: probabilistic analysis of random binary search trees shows that the expected height of a random binary search tree with n nodes is about 4.3 ln n. So it is really a matter of balancing versus skewness.
One more interesting thing: computer scientists have even found an advantage in skewness and proposed a skewed data structure called the skew heap.
To ensure log(n) search time, you need to roughly halve the number of remaining candidate nodes at each branch. If, for example, you have a linear tree that never branches on the way from the root to the leaf, then the search time is linear, just as in a linked list.
An extremely unbalanced tree, for example one where all nodes are linked to the left, means you still search through every single node before finding the last one, which defeats the point of a tree and has no benefit over a linked list. Balancing the tree gives better search times: O(log(n)) as opposed to O(n).
As we know, most operations on a binary search tree are proportional to the height of the tree, so it is desirable to keep the height small. That keeps the search time within O(log(n)).
That said, most of the available tree-balancing techniques apply best to trees that are perfectly full or close to being perfectly balanced.
In the end, if you want simplicity together with guaranteed balance, go for a self-balancing binary tree such as a red-black tree or an AVL tree.
We always see that operations on a (binary search) tree have O(log n) worst-case running time because the tree height is log n. I wonder: if we are told that an algorithm has a running time that is a function of log n, e.g. m + n log n, can we conclude that it must involve an (augmented) tree?
EDIT:
Thanks to your comments, I now realize that divide-and-conquer and binary trees are very similar visually/conceptually; I had never made the connection between the two. But I can think of a case where an O(log n) algorithm is not divide-and-conquer, yet involves a tree that has none of the properties of a BST/AVL/red-black tree.
That's the disjoint-set data structure with Find/Union operations, whose running time is O(N + M log N), with N being the number of elements and M the number of Find operations.
Please let me know if I'm missing something, but I cannot see how divide-and-conquer comes into play here. I just see that in this (disjoint-set) case there is a tree with no BST property and a running time that is a function of log N. So my question is why (or why not) I can generalize from this case.
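For reference, here is the kind of structure I mean: a minimal weighted quick-union sketch (union by size, no path compression). With union by size the parent-pointer trees have height at most lg N, so Find is O(log N), yet nothing about them is a binary search tree.
class DisjointSet {
    private final int[] parent;
    private final int[] size;

    DisjointSet(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    int find(int x) {
        while (parent[x] != x) x = parent[x];   // climb to the root
        return x;
    }

    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;          // attach the smaller tree under the larger
        size[ra] += size[rb];
    }
}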
What you have is exactly backwards. O(lg N) generally means some sort of divide-and-conquer algorithm, and one common way of implementing divide and conquer is a binary tree. While binary trees are a substantial subset of all divide-and-conquer algorithms, they are still only a subset.
In some cases, you can transform other divide-and-conquer algorithms fairly directly into binary trees (e.g. comments on another answer have already made an attempt at claiming a binary search is similar). Just for another obvious example, however, a multiway tree (e.g. a B-tree, B+ tree or B* tree), while clearly a tree, is just as clearly not a binary tree.
Again, if you want to badly enough, you can stretch the point that a multiway tree can be represented as sort of a warped version of a binary tree. If you want to, you can probably stretch all the exceptions to the point of saying that all of them are (at least something like) binary trees. At least to me, however, all that does is make "binary tree" synonymous with "divide and conquer". In other words, all you accomplish is warping the vocabulary and essentially obliterating a term that's both distinct and useful.
No: you can also binary-search a sorted array, for instance. But don't take my word for it: http://en.wikipedia.org/wiki/Binary_search_algorithm
As a counter example:
// Given an array 'a' of length n, this loop runs about log2(n) times.
static int counterExample(int[] a) {
    int y = 0;
    for (int x = 1; x < a.length; x = x * 2) {
        y = y + 1;
    }
    return y;
}
The run time is O(log(n)), but there is no tree here!
Answer is no. Binary search of a sorted array is O(log(n)).
Algorithms taking logarithmic time are commonly found in operations on binary trees.
Examples of O(log n):
Finding an item with a binary search in a sorted array or in a balanced search tree.
Looking up a value in a sorted input array by bisection.
Since O(log(n)) is only an upper bound, all O(1) algorithms, such as function(a, b) { return a + b; }, also satisfy the condition.
But I have to agree that all Theta(log(n)) algorithms kind of look like tree algorithms, or at least can be abstracted to a tree.
Short Answer:
Just because an algorithm has log(n) as part of its analysis does not mean that a tree is involved. For example, the following is a very simple algorithm that is O(log(n)):
for (int i = 1; i < n; i = i * 2)
    System.out.println("hello");
As you can see, no tree was involved. John also provides a good example of how binary search can be done on a sorted array. These both take O(log(n)) time, and there are plenty of other code examples that could be created or referenced. So don't make assumptions based on the asymptotic time complexity; look at the code to know for sure.
More On Trees:
Just because an algorithm involves "trees" doesn't imply O(log n) either. You need to know the type of tree and how the operation affects it.
Some Examples:
Example 1)
Inserting into or searching an unbalanced tree (such as the degenerate, list-shaped trees shown earlier) would be O(n).
Example 2)
Inserting into or searching a balanced tree would be O(log(n)). This holds for both of the following cases:
Balanced binary tree (image)
Balanced tree of degree 3 (image)
Additional Comments
If the trees you are using don't have a way to "balance" themselves, then there is a good chance that your operations will take O(n) time, not O(log n). If you use trees that are self-balancing, then inserts normally take a bit more time, because the balancing of the tree normally occurs during the insert phase.
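If you just want the O(log(n)) behaviour without writing the balancing yourself, the standard libraries already ship self-balancing trees; for example, java.util.TreeMap is backed by a red-black tree, so inserts pay a little extra for rebalancing but lookups stay logarithmic even for adversarial insertion orders:
import java.util.TreeMap;

public class TreeMapDemo {
    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>();
        // Inserting keys in sorted order would degenerate a plain BST into a
        // list, but the red-black rebalancing keeps the height O(log n).
        for (int i = 1; i <= 1_000_000; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.get(999_999));   // O(log n) lookup
        System.out.println(map.firstKey());     // 1
    }
}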