In general, suffix trees are said to be less space-efficient than suffix arrays. More specifically, the O(n) space bound for a suffix tree hides a constant factor of roughly 20 machine words per character, compared with roughly 4 for a suffix array. Why is this happening?
Typically, a suffix tree is represented by having each node store one pointer per character in the alphabet, with that pointer indicating where the child node for that character is. Each child pointer is also annotated with a pair of indices into the original string, indicating which range of characters from the original string labels the given edge. So for each character in your alphabet (plus the $ character), each suffix tree node needs to store one pointer and two machine words, i.e. three words per character. If you're doing something in a computational genomics application where the alphabet is {A, C, T, G}, for example, you'd need fifteen machine words per node in the suffix tree. The number of nodes in a suffix tree is at most 2n - 1, where n is the number of suffixes of the string, so you're talking about needing roughly 30n machine words.
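As a back-of-the-envelope check of the numbers above (the class layout is purely illustrative, not a real suffix tree implementation):

```python
# For the alphabet {A, C, G, T, $}, each node keeps one child pointer
# plus two string indices per character: 5 * 3 = 15 machine words.
ALPHABET = "ACGT$"

class SuffixTreeNode:
    def __init__(self):
        # One entry per alphabet character: (child, edge_start, edge_end)
        self.children = {c: (None, -1, -1) for c in ALPHABET}

def words_per_node(alphabet_size):
    """Machine words per node: one pointer + two indices per character."""
    return alphabet_size * 3

print(words_per_node(len(ALPHABET)))  # → 15
```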
Contrast this with a suffix array, where for each character in the string you just need one machine word (the index of the suffix), for a total of n machine words to store the suffix array. This is a substantial savings over the suffix tree. Usually, suffix arrays are paired with LCP arrays (which give more insight into the structure of the array), which require another n - 1 machine words, bringing the total to roughly 2n - 1 machine words. This is a huge savings over the suffix tree, which is one of the reasons why suffix arrays are used so much in practice.
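For comparison, a naive suffix array plus LCP array is tiny (illustrative only; real constructions run in linear or near-linear time):

```python
def suffix_array(s):
    """Naive O(n^2 log n) construction: sort suffix start indices
    by the suffixes they denote. Fine for illustration."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """Longest common prefix of each suffix with its predecessor in
    sorted order (n - 1 entries, matching the count in the text)."""
    def lcp(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n
    return [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

sa = suffix_array("banana")
print(sa)                        # → [5, 3, 1, 0, 4, 2]
print(lcp_array("banana", sa))   # → [1, 3, 0, 0, 2]
```

One machine word per entry in each array, as described above.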
While it is hard to find a unanimous definition of "Radix Tree", most accepted definitions indicate that it is a compacted Prefix Tree. What I'm struggling to understand is the significance of the term "radix" in this case. Why are compacted prefix trees so named (i.e. Radix Tree) while non-compacted ones are not?
Wikipedia can answer this, https://en.wikipedia.org/wiki/Radix:
In mathematical numeral systems, the radix or base is the number of
unique digits, including zero, used to represent numbers in a
positional numeral system. For example, for the decimal system (the
most common system in use today) the radix is ten, because it uses the
ten digits from 0 through 9.
and the tree https://en.wikipedia.org/wiki/Radix_tree:
a data structure that represents a space-optimized trie in which each
node that is the only child is merged with its parent. The result is
that the number of children of every internal node is at least the
radix r of the radix tree, where r is a positive integer and a power x
of 2, having x ≥ 1
Finally, check a dictionary:
radix (noun)
A primitive word, from which other words spring.
The radix in the radix tree determines the balance between the number of children (or depth) of the tree and its 'sparseness', or how many suffixes are unique.
EDIT - elaboration
the number of children of every internal node is at least the radix r
Let's consider the words "aba", "abnormal", "acne", and "abysmal". In a regular prefix tree (or trie), every arc adds a single letter to the word, so we have:
-a-b-a-
n-o-r-m-a-l-
y-s-m-a-l-
-c-n-e-
My drawing is a bit misleading - in tries the letters usually sit on arcs, so '-' is a node and the letters are edges. Note many internal nodes have one child! Now the compact (and obvious) form:
-a-b -a-
normal-
ysmal-
cne-
Well, now we have an inner node (behind b) with 3 children! The radix is a positive power of 2, so it's 2 in this case. Why 2 and not, say, 3? Well, first note the root has 2 children. In addition, suppose we want to add a word. Options:
- it shares the b prefix - well, then the node has 4 children, and 4 is greater than 2.
- it shares an edge of a child of b - say 'abnormally'. The way insertion works, the shared part will split and we'll have:
Relevant branch:
-normal-ly-
-
normal is an inner node now, but has 2 children (one a leaf).
- another case would be deleting acne, for example. But now the compactness property says the node after b must be merged back, since it's the only child, so the tree becomes:
-ab-a
-normal-ly-
-
-ysmal
so, we still maintain children ≥ 2.
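The splitting described above can be sketched with a dict-based radix tree (a toy illustration, not a production structure; edge labels are dict keys and '$' marks end-of-word):

```python
def insert(node, word):
    """Insert into a dict-based compressed trie (radix tree).
    Splits an edge when the new word shares only part of its label."""
    word += "$"
    while True:
        # Find an existing edge sharing a prefix with the word.
        for label in list(node):
            common = 0
            while (common < min(len(label), len(word))
                   and label[common] == word[common]):
                common += 1
            if common == 0:
                continue
            if common < len(label):
                # Split the edge, as in the 'abnormally' case above.
                child = node.pop(label)
                node[label[:common]] = {label[common:]: child}
            node = node[label[:common]]
            word = word[common:]
            if not word:          # duplicate word, nothing left to add
                return
            break
        else:
            node[word] = {}       # no shared edge: attach a new leaf
            return

root = {}
for w in ["aba", "abnormal", "acne", "abysmal"]:
    insert(root, w)
print(root)
# → {'a': {'b': {'a$': {}, 'normal$': {}, 'ysmal$': {}}, 'cne$': {}}}
```

This reproduces the drawing above: the node behind b ends up with 3 children, and the root branches on "a" alone.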
Hope this clarifies!
Given a dictionary as a hashtable, find the minimum # of deletions needed for a given word in order to make it match any word in the dictionary.
Is there some clever trick to solve this problem in less than exponential complexity (trying all possible combinations)?
For starters, suppose that you have a single word w in the hash table and that your word is x. You can delete letters from x to form w if and only if w is a subsequence of x, and in that case the number of letters you need to delete from x to form w is |x| - |w|. So certainly one option would be to just iterate over the hash table and, for each word, see whether that word is a subsequence of x, taking the best match you find across the table.
To analyze the runtime of this operation, let's suppose that there are n total words in your hash table and that their total length is L. Then the runtime of this operation is O(L), since you'll process each character across all the words at most once. The complexity of your initial approach is O(|x| · 2^|x|), because there are 2^|x| possible words you can make by deleting letters from x and you'll spend O(|x|) time processing each one. Depending on the size of your dictionary and the size of your word, one algorithm might be better than the other, but we can say that the runtime is O(min{L, |x| · 2^|x|}) if you take the better of the two approaches.
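A minimal sketch of the dictionary-scan approach (function names are mine, not from the answer):

```python
def is_subsequence(w, s):
    """True if w can be formed by deleting letters from s."""
    it = iter(s)
    return all(c in it for c in w)  # 'in' advances the iterator

def min_deletions(x, dictionary):
    """Best match across the table: if w is a subsequence of x,
    reaching w costs len(x) - len(w) deletions. O(L) overall."""
    best = None
    for w in dictionary:
        if is_subsequence(w, x):
            cost = len(x) - len(w)
            if best is None or cost < best:
                best = cost
    return best  # None when no dictionary word is reachable

print(min_deletions("abcde", {"ace", "bd", "xyz"}))  # → 2
```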
You can build a trie and then see where your given word fits into it. The difference in the depth of your word and the closest existing parent is the number of deletions required.
I've recently been updating my knowledge of algorithms and have been reading up on suffix arrays. Every text I've read has defined them as an array of suffixes over a single search string, but some articles have mentioned it's 'trivial' to generalize to an entire list of search strings, but I can't see how.
Assume I'm trying to implement a simple substring search over a word list and wish to return a list of words matching a given substring. The naive approach would appear to be to insert the lexicographic end character '$' between words in my list, concatenate them all together, and produce a suffix tree from the result. But this would seem to generate large numbers of irrelevant entries. If I create a source string of 'banana$muffin' then I'll end up generating suffixes for 'ana$muffin' which I'll never use.
I'd appreciate any hints as to how to do this right, or better yet, a pointer to some algorithm texts that handle this case.
With suffix arrays you usually don't use multiple strings, just one string: the concatenated version of the several strings, with an end token (a different one for every string) between them. For the suffix array, you use pointers (or array indices) to reference each suffix (only the position of its first character is needed).
So the space required is the array plus one pointer per suffix. (That is just a pretty simple implementation; you should do more to get better performance.)
In that case you can optimize the sorting algorithm for the suffixes, since you only need to sort each suffix up to its end token. Everything behind the end token does not need to be used in the sorting algorithm.
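The concatenation scheme just described can be sketched as follows (the end tokens '#', '$', ... are my own choice and are assumed not to occur in the input strings):

```python
def concat_suffix_array(strings):
    """Concatenate with a distinct end token per string and sort
    suffix start positions. Because the tokens are unique and sort
    before letters, comparisons effectively stop at the end tokens."""
    tokens = "#$%&"
    text = "".join(s + tokens[i] for i, s in enumerate(strings))
    return text, sorted(range(len(text)), key=lambda i: text[i:])

text, sa = concat_suffix_array(["banana", "muffin"])
print(text)  # → banana#muffin$
print(sa)    # → [6, 13, 5, 3, 1, 0, 9, 10, 11, 7, 12, 4, 2, 8]
```

Note that suffixes such as 'ana#muffin$' still appear in the array, which is exactly the redundancy the question complains about; the GSA described below avoids it.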
After having now read through most of the book Algorithms on Strings, Trees and Sequences by Dan Gusfield, the answer seems clear.
If you start with a multi-string suffix tree, one of the standard conversion algorithms will still work. However, instead of getting an array of integers, you end up with an array of lists. Each list contains one or more pairs of a string identifier and a starting offset in that string.
The resulting structure is still useful, but not as efficient as a normal suffix array.
From Iowa State University, taken from Prefix.pdf:
Suffix trees and suffix arrays can be generalized to multiple strings. The generalized suffix tree of a set of strings S = {s1, s2, ..., sk}, denoted GST(S) or simply GST, is a compacted trie of all suffixes of each string in S. We assume that the unique termination character $ is appended to the end of each string. A leaf label now consists of a pair of integers (i, j), where i denotes that the suffix is from string si and j denotes the starting position of the suffix in si. Similarly, an edge label in a GST is a substring of one of the strings. An edge label is represented by a triplet of integers (i, j, l), where i denotes the string number, and j and l denote the starting and ending positions of the substring in si. For convenience of understanding, we will continue to show the actual edge labels. Note that two strings may have identical suffixes. This is compensated by allowing leaves in the tree to have multiple labels. If a leaf is multiply labelled, each suffix should come from a different string. If N is the total number of characters (including the $ in each string) of all strings in S, the GST has at most N leaf nodes and takes up O(N) space. The generalized suffix array of S, denoted GSA(S) or simply GSA, is a lexicographically sorted array of all suffixes of each string in S. Each suffix is represented by an integer pair (i, j) denoting the suffix starting from position j in si. If suffixes from different strings are identical, they occupy consecutive positions in the GSA. For convenience, we make an exception for the suffix $ by listing it only once, though it occurs in each string. The GST and GSA of strings apple and maple are shown in Figure 1.2.
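The pair-based GSA described in that passage can be sketched naively (end tokens omitted for brevity; identical suffixes from different strings land in consecutive positions, as the passage notes):

```python
def generalized_suffix_array(strings):
    """All suffixes of all strings as (i, j) pairs: the suffix of
    string i starting at position j, sorted by the suffix text."""
    suffixes = [(s[j:], i, j)
                for i, s in enumerate(strings)
                for j in range(len(s))]
    suffixes.sort()
    return [(i, j) for _, i, j in suffixes]

print(generalized_suffix_array(["apple", "maple"]))
# → [(1, 1), (0, 0), (0, 4), (1, 4), (0, 3),
#    (1, 3), (1, 0), (0, 2), (1, 2), (0, 1)]
```

For example, the identical suffixes "le" of apple and maple appear back-to-back as (0, 3) and (1, 3).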
Here you have an article about an algorithm to construct a GSA:
Generalized enhanced suffix array construction in external memory
I'm currently trying to get my head around the variations of Trie and was wondering if anyone would be able to clarify a couple of points. (I got quite confused with the answer to this question Tries versus ternary search trees for autocomplete?, especially within the first paragraph).
From what I have read, is the following correct? Supposing we have stored n elements in the data structures, and L is the length of the string we are searching for:
A Trie stores its keys at the leaf nodes, and if we have a positive hit for the search then this means that it will perform O(L) comparisons. For a miss however, the average performance is O(log2(n)).
Similarly, a Radix tree (with R = 2^r) stores the keys at the leaf nodes and will perform O(L) comparisons for a positive hit. However misses will be quicker, and occur on average in O(logR(n)).
A Ternary Search Trie is essentially a BST with operations <,>,= and with a character stored in every node. Instead of comparing the whole key at a node (as with BST), we only compare a character of the key at that node. Overall, supposing our alphabet size is A, then if there is a hit we must perform (at most) O(L*A) = O(L) comparisons. If there is not a hit, on average we have O(log3(n)).
With regards to the Radix tree, if for example our alphabet is {0,1} and we set R = 4, for a binary string 0101 we would only need two comparisons right? Hence if the size of our alphabet is A, we would actually only perform L * (A / R) comparisons? If so then I guess this just becomes O(L), but I was curious if this was correct reasoning.
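For concreteness, the per-node <, =, > branching I'm describing looks roughly like this (a minimal ternary search trie sketch; names are illustrative):

```python
class TSTNode:
    """Ternary search trie node: one character plus <, =, > links
    (a BST on characters, as described above)."""
    def __init__(self, ch):
        self.ch = ch
        self.lt = self.eq = self.gt = None
        self.is_word = False

def tst_insert(node, word, i=0):
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lt = tst_insert(node.lt, word, i)
    elif ch > node.ch:
        node.gt = tst_insert(node.gt, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def tst_contains(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_contains(node.lt, word, i)
    if ch > node.ch:
        return tst_contains(node.gt, word, i)
    if i + 1 < len(word):
        return tst_contains(node.eq, word, i + 1)
    return node.is_word

root = None
for w in ["cat", "cap", "car", "dog"]:
    root = tst_insert(root, w)
print(tst_contains(root, "cap"))  # → True
print(tst_contains(root, "ca"))   # → False
```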
Thanks for any help you folks can give!
This is the question:
Given a flat text file that contains a range of IP addresses that map to a location (e.g. 192.168.0.0-192.168.0.255 = Boston, MA), come up with an algorithm that will find a city for a specific IP address if a mapping exists.
My only idea is to parse the file and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits) and place them in a list, while also putting the lower end of each range into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search: if the index is odd, subtract 1 and look in the hash; if it's even, just look in the hash.
Any faults in my plans, or better solutions?
Your approach seems perfectly reasonable.
If you are interested in doing a bit of research / extra coding, there are algorithms that will asymptotically outperform the standard binary search technique, relying on the fact that your IP addresses can be interpreted as integers in the range from 0 to 2^32 - 1. For example, the van Emde Boas tree and y-fast trie data structures can implement the predecessor search operation that you're looking at in time O(log log U), where U is the maximum possible IP address, as opposed to the O(log N) of binary search. The constant factors are higher, though, which means that there is no guarantee that this approach will be any faster. However, it might be worth exploring as another approach that could potentially be even faster.
Hope this helps!
The problem smells of ranges, and one of the good data-structures for this problem would be a Segment Tree. Some resources to help you get started.
The root of the segment tree can represent the addresses (0.0.0.0 - 255.255.255.255). The left sub-tree would represent the addresses (0.0.0.0 - 127.255.255.255) and the right sub-tree would represent the range (128.0.0.0 - 255.255.255.255), and so on. This will go on till we reach ranges which cannot be further sub-divided. Say, if we have the range 32.0.0.0 - 63.255.255.255, mapped to some arbitrary city, it will be a leaf node, we will not further subdivide that range when we arrive there, and tag it to the specific city.
To search for a specific mapping, we follow the tree, just as we do in a Binary Search Tree. If your IP lies in the range of the left sub-tree, move to the left sub-tree, else move to the right sub-tree.
The good parts:
You need not have all sub-trees, only add the sub-trees which are required. For example, if in your data, there is no city mapped for the range (0.0.0.0 - 127.255.255.255), we will not construct that sub-tree.
We are space efficient. If the entire range is mapped to one city, we will create only the root node!
This is a dynamic data-structure. You can add more cities, split-up ranges later on, etc.
You will be making a constant number of operations, since the maximum depth of the tree would be 4 × log2(256) = 32. For this particular problem it turns out that Segment Trees would be as fast as van Emde Boas trees, and require less space (O(N)).
This is a simple, but non-trivial data-structure, which is better than sorting, because it is dynamic, and easier to explain to your interviewer than van-Emde Boas trees.
This is one of the easiest non-trivial data-structures to code :)
Please note that in some Segment Tree tutorials, arrays are used to represent the tree. This is probably not what you want here, since we would not be populating the entire tree; dynamically allocating nodes, just as in a standard Binary Tree, is best.
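A minimal sketch of such a lazily allocated tree (class and method names are mine; the hex constants are just the Boston example from the question as 32-bit integers):

```python
class RangeNode:
    """Segment-tree-style node over the 32-bit address space.
    Children are allocated lazily; a fully covered node stores the city."""
    def __init__(self, lo=0, hi=2**32 - 1):
        self.lo, self.hi = lo, hi
        self.city = None
        self.left = self.right = None

    def add(self, lo, hi, city):
        if lo <= self.lo and self.hi <= hi:
            self.city = city          # whole node covered: tag and stop
            return
        mid = (self.lo + self.hi) // 2
        if lo <= mid:
            if self.left is None:
                self.left = RangeNode(self.lo, mid)
            self.left.add(lo, min(hi, mid), city)
        if hi > mid:
            if self.right is None:
                self.right = RangeNode(mid + 1, self.hi)
            self.right.add(max(lo, mid + 1), hi, city)

    def lookup(self, ip):
        ans = self.city               # deepest tag on the path wins
        mid = (self.lo + self.hi) // 2
        child = self.left if ip <= mid else self.right
        if child is not None:
            found = child.lookup(ip)
            if found is not None:
                ans = found
        return ans

root = RangeNode()
root.add(0xC0A80000, 0xC0A800FF, "Boston, MA")  # 192.168.0.0-192.168.0.255
print(root.lookup(0xC0A80010))  # → Boston, MA
print(root.lookup(0x08080808))  # → None
```

Since the example range is an aligned /24 block, only one node per level is created; the whole structure here is about 24 nodes deep.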
My only idea is to parse the file and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits)...
If following this approach, you would probably want to multiply by 256^3, 256^2, 256 and 1 respectively for A, B, C and D in an address A.B.C.D. That effectively recreates the IP address as a 32-bit unsigned number.
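That conversion might look like this (helper name is mine):

```python
def ip_to_int(dotted):
    """'A.B.C.D' -> 32-bit unsigned int, weighting the octets by
    256^3, 256^2, 256 and 1 as described above."""
    a, b, c, d = (int(x) for x in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

print(ip_to_int("192.168.0.255"))  # → 3232235775
```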
... and place them in a list, while also putting the lower end of each range into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search: if the index is odd, subtract 1 and look in the hash; if it's even, just look in the hash.
I would suggest creating a contiguous array (a std::vector) containing structs with the lower and upper ends of each range (and the location name - discussed below). Then, as you say, you can binary search for a range including a specific value, without any odd/even hassles.
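A sketch of that lookup with a sorted array and binary search (a Python stand-in for the std::vector approach; the first range and its city are made up for illustration, IPs already converted to integers):

```python
import bisect

# Sorted, non-overlapping ranges as (lo, hi, city) tuples.
ranges = [
    (167772160, 167837695, "Somerville, MA"),  # 10.0.0.0-10.0.255.255 (made up)
    (3232235520, 3232235775, "Boston, MA"),    # 192.168.0.0-192.168.0.255
]
lows = [lo for lo, _, _ in ranges]

def find_city(ip):
    """Binary search for the last range starting at or below ip,
    then check that ip actually falls inside it."""
    i = bisect.bisect_right(lows, ip) - 1
    if i >= 0 and ranges[i][0] <= ip <= ranges[i][1]:
        return ranges[i][2]
    return None

print(find_city(3232235530))  # → Boston, MA (192.168.0.10)
print(find_city(42))          # → None
```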
Using the lower end of the range as a key in a hash is one way to avoid storing the location names in the array, but given the average number of characters in a city name, the likely size of pointers, and a choice between a sparsely populated hash table with lengthy displacement lists to search in successive alternative buckets, or further indirection to arbitrary-length containers - you'd need to be pretty desperate to bother. In the first instance, storing the location in the struct alongside the IP value range seems good.
Alternatively, you could create a tree based on e.g. the individual 0-255 octet values: each level in the tree could be either an array of 256 values for direct indexing, or a sorted array of the populated values. That can reduce the number of IP value comparisons you're likely to need to make (from O(log2 N) to O(1)).
In your example, 192.168.0.0-192.168.0.255 = Boston, MA.
Will the first three octets (192.168.0) be the same for both IP addresses in the entry?
Also, will the first three octets be unique for a city?
If so, then this problem can be solved more easily.