Trie vs red-black tree: which is better in space and time? - data-structures

Tries and red-black trees are very efficient for storing strings.
Which has better time complexity? How about space complexity?

This depends on many factors, such as which strings you are storing and how the trie is represented.
In a red-black tree, you will need to do O(log n) string comparisons on each operation (insertion, deletion, lookup, etc.). The cost of a comparison is small if you compare two strings that don't have a common prefix, or if you compare two strings where on string is small. The worst-case for a comparison is when one string is a prefix of another, in which case all the characters have to be read. Consequently, if you wanted to look up a string of length L in a red-black tree of strings, in the worst case you would do O(L log n) work by making O(log n) comparisons, each of which look at all the characters from the input string. However, in the best case this would require only O(log n) time, which happens if the comparisons always immediately fail.
In terms of space usage, the red/black tree would need two pointers per node and one node per string. (The red/black bit can usually be packed into the low bits of a pointer). Thus the total space is 2n + (total length of all strings).
In a trie, insertions, deletions, and lookups take O(L) in the worst-case (if all characters of the input must be considered) and O(1) in the best case (if you fall off the trie early). This is faster than the red-black tree by a factor of O(log n), which could potentially be significant for large collections. However, the trie has worse locality, since there is a lot more pointer-chasing involved and no contiguous arrays of characters to scan.
In terms of memory usage, a trie with an alphabet of size k typically needs a total of kn pointers, where n is the number of nodes. This can be dramatically worse than the red/black tree if the alphabet size is large. However, that space overhead can be reduced by compressing the trie using the Patricia tree representation, using a more efficient data structure to store child pointers, or building a DAWG from the trie.
Hope this helps!

Related

Why do we sort via Heaps instead of Binary Search Trees?

A heap can be constructed from a list in O(n logn) time, because inserting an element into a heap takes O(logn) time and there are n elements.
Similarly, a binary search tree can be constructed from a list in O(n logn) time, because inserting an element into a BST takes on average logn time and there are n elements.
Traversing a heap from min-to-max takes O(n logn) time (because we have to pop n elements, and each pop requires an O(logn) sink operation). Traversing a BST from min-to-max takes O(n) time (literally just inorder traversal).
So, it appears to me that constructing both structures takes equal time, but BSTs are faster to iterate over. So, why do we use "Heapsort" instead of "BSTsort"?
Edit: Thank you to Tobias and lrlreon for your answers! In summary, below are the points why we use heaps instead of BSTs for sorting.
Construction of a heap can actually be done in O(n) time, not O(nlogn) time. This makes heap construction faster than BST construction.
Additionally, arrays can be easily transformed into heaps in-place, because heaps are always complete binary trees. BSTs can't be easily implemented as an array, since BSTs are not guaranteed to be complete binary trees. This means that BSTs require additional O(n) space allocation to sort, while Heaps require only O(1).
All operations on heaps are guaranteed to be O(logn) time. BSTs, unless balanced, may have O(n) operations. Heaps are dramatically simpler to implement than Balanced BSTs are.
If you need to modify a value after creating the heap, all you need to do is apply the sink or swim operations. Modifying a value in a BST is much more conceptually difficult.
There are multiple reasons I can imagine you would want to prefer a (binary) heap over a search tree:
Construction: A binary heap can actually be constructed in O(n) time by applying the heapify operations bottom-up from the smallest to the largest subtrees.
Modification: All operations of the binary heap are rather straightforward:
Inserted an element at the end? Sift it up until the heap condition holds
Swapped the last element to the beginning? Swift it down until the heap condition holds
Changed the key of an entry? Sift it up or down depending on the direction of the change
Conceptual simplicity: Due to its implicit array representation, a binary heap can be implemented by anyone who knows the basic indexing scheme (2i+1, 2i+2 are the children of i) without considering many difficult special cases.
If you look at these operations in a binary search tree, in theory
they are also quite simple, but the tree has to be stored explicitly, e.g. using pointers, and most of the operations require the tree to be
rebalanced to preserve the O(log n) height, which requires complicated rotations (red black-trees) or splitting/merging
nodes (B-trees)
EDIT: Storage: As Irleon pointed out, to store a BST you also need more storage, as at least two child pointers need to be stored for every entry in addition to the value itself, which can be a large storage overhead especially for small value types. At the same time, the heap needs no additional pointers.
To answer your question about sorting: A BST takes O(n) time to traverse in-order, the construction process takes O(n log n) operations which, as mentioned before, are much more complex.
At the same time Heapsort can actually be implemented in-place by building a max-heap from the input array in O(n) time and and then repeatedly swapping the maximum element to tbe back and shrinking the heap. You can think of Heapsort as Insertion sort with a helpful data structure that lets you find the next maximum in O(log n) time.
If the sorting method consists of storing the elements in a data structure and after extracting in a sorted way, then, although both approaches (heap and bst) have the same asymptotic complexity O(n log n), the heap tends to be faster. The reason is the heap always is a perfectly balanced tree and its operations always are O(log n), in a determistic way, not on average. With bst's, depending on the approah for balancing, insertion and deletion tend to take more time than the heap, no matter which balancing approach is used. In addition, a heap is usually implemented with an array storing the level traversal of the tree, without the need of storing any kind of pointers. Thus, if you know the number of elements, which usually is the case, the extra storage required for a heap is less than the used for a bst.
In the case of sorting an array, there is a very important reason which it would rather be preferable a heap than a bst: you can use the same array for storing the heap; no need to use additional memory.

Data structure with O(1) insertion and O(log(n)) search complexity?

Is there any data structure available that would provide O(1) -- i.e. constant -- insertion complexity and O(log(n)) search complexity even in the worst case?
A sorted vector can do a O(log(n)) search but insertion would take O(n) (taken the fact that I am not always inserting the elements either at the front or the back). Whereas a list would do O(1) insertion but would fall short of providing O(log(n)) lookup.
I wonder whether such a data structure can even be implemented.
Yes, but you would have to bend the rules a bit in two ways:
1) You could use a structure that has O(1) insertion and O(1) search (such as the CritBit tree, also called bitwise trie) and add artificial cost to turn search into O(log n).
A critbit tree is like a binary radix tree for bits. It stores keys by walking along the bits of a key (say 32bits) and use the bit to decide whether to navigate left ('0') or right ('1') at every node. The maximum complexity for search and insertion is both O(32), which becomes O(1).
2) I'm not sure that this is O(1) in a strict theoretical sense, because O(1) works only if we limit the value range (to, say, 32 bit or 64 bit), but for practical purposes, this seems a reasonable limitation.
Note that the perceived performance will be O(log n) until a significant part of the possible key permutations are inserted. For example, for 16 bit keys you probably have to insert a significant part of 2^16 = 65563 keys.
No (at least in a model where the elements stored in the data structure can be compared for order only; hashing does not help for worst-case time bounds because there can be one big collision).
Let's suppose that every insertion requires at most c comparisons. (Heck, let's make the weaker assumption that n insertions require at most c*n comparisons.) Consider an adversary that inserts n elements and then looks up one. I'll describe an adversarial strategy that, during the insertion phase, forces the data structure to have Omega(n) elements that, given the comparisons made so far, could be ordered any which way. Then the data structure can be forced to search these elements, which amount to an unsorted list. The result is that the lookup has worst-case running time Omega(n).
The adversary's goal is to give away as little information as possible. Elements are sorted into three groups: winners, losers, and unknown. Initially, all elements are in the unknown group. When the algorithm compares two unknown elements, one chosen arbitrarily becomes a winner and the other becomes a loser. The winner is deemed greater than the loser. Similarly, unknown-loser, unknown-winner, and loser-winner comparisons are resolved by designating one of the elements a winner and the other a loser, without changing existing designations. The remaining cases are loser-loser and winner-winner comparisons, which are handled recursively (so the winners' group has a winner-unknown subgroup, a winner-winners subgroup, and a winner-losers subgroup). By an averaging argument, since at least n/2 elements are compared at most 2*c times, there exists a subsub...subgroup of size at least n/2 / 3^(2*c) = Omega(n). It can be verified that none of these elements are ordered by previous comparisons.
I wonder whether such a data structure can even be implemented.
I am afraid the answer is no.
Searching OK, Insertion NOT
When we look at the data structures like Binary search tree, B-tree, Red-black tree and AVL tree, they have average search complexity of O(log N), but at the same time the average insertion complexity is same as O(log N). Reason is obvious, the search will follow (or navigate through) the same pattern in which the insertion happens.
Insertion OK, Searching NOT
Data structures like Singly linked list, Doubly linked list have average insertion complexity of O(1), but again the searching in Singly and Doubly LL is painful O(N), just because they don't have any indexing based element access support.
Answer to your question lies in the Skiplist implementation, which is a linked list, still it needs O(log N) on average for insertion (when lists are expected to do insertion in O(1)).
On closing notes, Hashmap comes very close to meet the speedy search and speedy insertion requirement with the cost of huge space, but if horribly implemented, it can result into a complexity of O(N) for both insertion and searching.

What is the running time and space complexity of a huffman decode algorithm?

Say we started with a text file like:
a 00
b 01
c 10
d 11
00000001011011
The algorithm would be the typical one where you use the prefixes to build a Huffman tree, read in the encoded bits while traversing the tree until you reach a leaf, then returning the character in at that leaf.
Could someone explain how I would determine the running time and space complexity?
Basically there are three methods on a Huffman Tree, construction, encoding, and decoding. The time complexity may vary from each other.
We should first notice that (see Wikipedia [link]):
In many cases, time complexity is not very important in the choice of algorithm here, since n here is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded); whereas complexity analysis concerns the behavior when n grows to be very large.
The complexity of construction is linear (O(n)) if input probabilities are sorted, see this paper. In most cases, we use a greedy O(n*log(n)) construction method:
http://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
If you build a bidirection hashtable for all symbols, both encoding and decoding would be constant (O(1)).
Assume an encoded text string of length n and an alphabet of k symbols.
For every encoded symbol you have to traverse the tree in order to decode that symbol. The tree contains k nodes and, on average, it takes O(log k) node visits to decode a symbol. So the time complexity would be O(n log k).
Space complexity is O(k) for the tree and O(n) for the decoded text.

LSM Tree lookup time

What's the worst case time complexity in a log-structured merge tree for a simple search query (like querying a single WHERE clause)?
Is it O(log N)? O(N*Log N)? Something else?
How about for a multiple query, like searching for multiple WHERE clauses in a key-value database?
The wikipedia page on LSM trees is currently lacking this info.
And I'm trying to make sense of the original paper.
I have been wondering the same.
If you have a series of trees, getting smaller by a constant factor each time, and you need to search them all for a single key, the cost seems to be O(log(N)^2).
Say the first (binary) tree takes log_2(N) branches to reach a node. The second might be half the size, and take (log_2(N) - 1) branches to find a node. The smallest tree will be some O(1) constant in size and there are roughly log_2(N) trees total. Summing the series gives O(log_2(N)^2).
However, I'm wondering if there is some more clever scheme where arbitrary single-key lookups, insertions or deletions have amortized cost O(log(N)), but haven't been able to find an answer (yet).
For a simple search indexed by a LSM tree, it is O(log n). This is because the biggest tree in the LSM tree is a B tree, which is O(log n), and the other trees are subsets of B trees or in the case of in memory trees, more efficient trees, which are no worse than O(log n). The number of trees is a constant, so it doesn't affect the order of the search time.

O(1) extra space lookup data structure

I was wondering if there was a simple data structure that supports amortized log(n) lookup and insertion like a self balancing binary search tree but with constant memory overhead. (I don't really care about deleting elements).
One idea I had was to store everything in one contiguous block of memory divided into two contiguous blocks: an S part where all elements are sorted, and a U that isn't sorted.
To perform an insertion, we could add an element to U, and if the size of U exceeds log(size of S), then you sort the entire contiguous array (treat both S and U as one contiguous array), so that after the sort everything is in S and U is empty.
To perform lookup run binary search on S and just look through all of U.
However, I am having trouble calculating the amortized insertion time of my algorithm.
Ultimately I would just appreciate some reasonably simple algorithm/datastructure with desired properties, and some guarantee that it runs reasonably fast in amortized time.
Thank you!
If by constant amount of memory overhead you mean that for N elements stored in the data-structure the space consumption should be O(N), then any balanced tree will do -- in fact, any n-ary tree storing the elements in external leaves, where n > 1 and every external tree contains an element, has this property.
This follows from the fact that any tree graph with N nodes has N - 1 edges.
If by constant amount of memory overhead you mean that for N elements the space consumption should be N + O(1), then neither the balanced trees nor the hash tables have this property -- both will use k * N memory, where k > 1 due to extra node pointers in the case of trees and the load factor in the case of hash tables.
I find your approach interesting, but I do not think it will work even if you only sort U, and then merge the two sets in linear time. You would need to do a sort (O(logN * log(logN)) operations) after every logN updates, followed by an O(n) merging of S and U (note that so far nobody actually knows how to do this in linear time in place, that is, without an extra array).
The amortized insertion time would be O(n / logN). But you could maybe use your approach to achieve something close to O(√n) if you allow the size of U to grow to the square root of S.
Any hashtable will do that. The only tricky part about it is how you resolve conflicts - there are few ways of doing it, the other tricky part is correct hash computing.
See:
http://en.wikipedia.org/wiki/Hash_table

Resources