What is the difference between trie and radix trie data structures?


Are the trie and radix trie data structures the same thing?
If they aren't the same, then what is the meaning of radix trie (AKA Patricia trie)?

A radix tree is a compressed version of a trie. In a trie, each edge is labelled with a single letter, while in a PATRICIA tree (or radix tree) an edge can be labelled with a whole string of letters.
Now, assume you have the words hello, hat and have. To store them in a trie, it would look like:
    e - l - l - o
  /
h - a - t
      \
       v - e
And you need nine nodes, not counting the root. I have placed the letters in the nodes, but in fact they label the edges.
In a radix tree, you will have:
            *
           /
       (ello)
         /
* - h - * -(a) - * - (t) - *
                  \
                  (ve)
                    \
                     *
and you need only five nodes besides the root. In the picture above, the nodes are the asterisks.
So, overall, a radix tree takes less memory, but it is harder to implement. Otherwise the use case of both is pretty much the same.
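As a rough check of the node counts above, here is a minimal Python sketch that builds a plain trie for hello, hat and have, then merges every chain of single-child nodes into one edge. The nested-dict layout, the END marker and the function names are assumptions made for illustration, not part of the answer.

```python
END = "$"  # assumed marker meaning "a word ends at this node"; keys must not contain it


def build_trie(words):
    """Plain trie as nested dicts: one edge per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def compress(node):
    """Merge each chain of single-child, non-terminal nodes into one labelled edge."""
    out = {}
    for label, child in node.items():
        if label == END:
            out[END] = True
            continue
        # Keep extending the edge label while the child has exactly one
        # outgoing edge and no word ends there.
        while len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


def count_edges(node):
    """Number of edges, which equals the number of non-root nodes."""
    return sum(1 + count_edges(child)
               for label, child in node.items() if label != END)


trie = build_trie(["hello", "hat", "have"])
radix = compress(trie)
print(count_edges(trie), count_edges(radix))
```

With these three words the plain trie has nine edges and the compressed one has five, matching the two pictures.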

My question is whether Trie data structure and Radix Trie are the same thing?
In short, no. The category Radix Trie describes a particular category of Trie, but that doesn't mean that all tries are radix tries.
If they are[n't] same, then what is the meaning of Radix trie (aka Patricia Trie)?
I assume you meant to write aren't in your question, hence my correction.
Similarly, PATRICIA denotes a specific type of radix trie, but not all radix tries are PATRICIA tries.
What is a trie?
"Trie" describes a tree data structure suitable for use as an associative array, where branches or edges correspond to parts of a key. The definition of parts is rather vague, here, because different implementations of tries use different bit-lengths to correspond to edges. For example, a binary trie has two edges per node that correspond to a 0 or a 1, while a 16-way trie has sixteen edges per node that correspond to four bits (or a hexadecimal digit: 0x0 through to 0xf).
This diagram, retrieved from Wikipedia, seems to depict a trie with (at least) the keys 'A', 'to', 'tea', 'ted', 'ten', 'i', 'in' and 'inn' inserted:
If this trie were to store items for the keys 't' or 'te' there would need to be extra information (the numbers in the diagram) present at each node to distinguish between nullary nodes and nodes with actual values.
What is a radix trie?
"Radix trie" seems to describe a form of trie that condenses common prefix parts, as Ivaylo Strandjev described in his answer. Consider a 256-way trie that indexes the keys "smile", "smiled", "smiles" and "smiling" using the following static assignments:
root['s']['m']['i']['l']['e']['\0'] = smile_item;
root['s']['m']['i']['l']['e']['d']['\0'] = smiled_item;
root['s']['m']['i']['l']['e']['s']['\0'] = smiles_item;
root['s']['m']['i']['l']['i']['n']['g']['\0'] = smiling_item;
Each subscript accesses an internal node. That means to retrieve smile_item, you must access seven nodes. Eight node accesses correspond to smiled_item and smiles_item, and nine to smiling_item. For these four items, there are fourteen nodes in total. They all have the first four bytes (corresponding to the first four nodes) in common, however. By condensing those four bytes to create a root that corresponds to ['s']['m']['i']['l'], four node accesses have been optimised away. That means less memory and fewer node accesses, which is a good sign. The optimisation can be applied recursively to reduce the need to access unnecessary suffix bytes. Eventually, you get to a point where you're only comparing differences between the search key and indexed keys at locations indexed by the trie. This is a radix trie.
root = smil_dummy;
root['e'] = smile_item;
root['e']['d'] = smiled_item;
root['e']['s'] = smiles_item;
root['i'] = smiling_item;
To retrieve items, each node needs a position. With a search key of "smiles" and a root.position of 4, we access root["smiles"[4]], which happens to be root['e']. We store this in a variable called current. current.position is 5, which is the location of the difference between "smiled" and "smiles", so the next access will be current["smiles"[5]], which is current['s']. This brings us to smiles_item, and the end of our string. Our search has terminated, and the item has been retrieved, with just three node accesses instead of eight.
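The position-based lookup described above can be sketched in Python roughly as follows. The Node class, its field names and the lookup function are hypothetical; the sketch also assumes that each item node keeps its full key, so that the skipped characters can be verified once at the end.

```python
class Node:
    def __init__(self, position, key=None):
        self.position = position  # index of the character this node branches on
        self.key = key            # full key stored here, or None for a dummy node
        self.children = {}        # character -> Node


def lookup(root, search_key):
    """Return (stored key or None, number of nodes accessed)."""
    node = root
    accesses = 1
    while True:
        # Past the end of the key we branch on '\0', as in the static example.
        if node.position < len(search_key):
            ch = search_key[node.position]
        else:
            ch = "\0"
        child = node.children.get(ch)
        if child is None:
            break
        node = child
        accesses += 1
    # Skipped characters were never inspected, so compare the whole key once.
    found = node.key == search_key
    return (node.key if found else None), accesses


root = Node(4)                                   # smil_dummy
smile = root.children["e"] = Node(5, "smile")
smile.children["d"] = Node(6, "smiled")
smile.children["s"] = Node(6, "smiles")
root.children["i"] = Node(7, "smiling")

print(lookup(root, "smiles"))
```

Looking up "smiles" touches root, the "smile" node and the "smiles" node, which is the three accesses mentioned above. A key such as "xmiles" follows the same path and is rejected only by the final comparison.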
What is a PATRICIA trie?
A PATRICIA trie is a variant of radix tries for which there should only ever be n nodes used to contain n items. In our crudely demonstrated radix trie pseudocode above, there are five nodes in total: root (which is a nullary node; it contains no actual value), root['e'], root['e']['d'], root['e']['s'] and root['i']. In a PATRICIA trie there should only be four. Let's take a look at how these prefixes might differ by looking at them in binary, since PATRICIA is a binary algorithm.
smile: 0111 0011 0110 1101 0110 1001 0110 1100 0110 0101 0000 0000 0000 0000
smiled: 0111 0011 0110 1101 0110 1001 0110 1100 0110 0101 0110 0100 0000 0000
smiles: 0111 0011 0110 1101 0110 1001 0110 1100 0110 0101 0111 0011 0000 0000
smiling: 0111 0011 0110 1101 0110 1001 0110 1100 0110 1001 0110 1110 0110 0111 ...
Let us consider that the nodes are added in the order they are presented above. smile_item is the root of this tree. The difference, bolded to make it slightly easier to spot, is in the last byte of "smile", at bit 36. Up until this point, all of our nodes have the same prefix. smiled_node belongs at smile_node[0]. The difference between "smiled" and "smiles" occurs at bit 43, where "smiles" has a '1' bit, so smiled_node[1] is smiles_node.
Rather than using NULL as branches and/or extra internal information to denote when a search terminates, the branches link back up the tree somewhere, so a search terminates when the offset to test decreases rather than increasing. Here's a simple diagram of such a tree (though PATRICIA really is more of a cyclic graph, than a tree, as you'll see), which was included in Sedgewick's book mentioned below:
A more complex PATRICIA algorithm involving keys of variant length is possible, though some of the technical properties of PATRICIA are lost in the process (namely that any node contains a common prefix with the node prior to it):
By branching like this, there are a number of benefits:
Every node contains a value. That includes the root. As a result, the code becomes a lot shorter and simpler, and probably a bit faster in reality.
At least one branch and at most k branches (where k is the number of bits in the search key) are followed to locate an item.
The nodes are tiny, because they store only two branches each, which makes them fairly suitable for cache locality optimisation.
These properties make PATRICIA my favourite algorithm so far...
I'm going to cut this description short here, in order to reduce the severity of my impending arthritis, but if you want to know more about PATRICIA you can consult books such as "The Art of Computer Programming, Volume 3" by Donald Knuth, or any of the "Algorithms in {your-favourite-language}, parts 1-4" by Sedgewick.

TRIE:
We can have a search scheme where, instead of comparing a whole search key with all existing keys (such as in a hash scheme), we compare the search key one character at a time. Following this idea, we can build a structure (as shown below) which has three existing keys – “dad”, “dab”, and “cab”.
                [root]
            ...// | \\...
                  |  \
                  c   d
                  |    \
                 [*]    [*]
             ...//|\.  ./|\\...        Fig-I
                  a       a
                 /       /
               [*]      [*]
          ...//|\..  ../|\\...
              /        /   \
             b        b     d
            /        /       \
           []       []       []
         (cab)    (dab)    (dad)
This is essentially an M-ary tree with internal nodes, represented as [ * ], and leaf nodes, represented as [ ].
This structure is called a trie. The branching factor at each node can be kept equal to the number of unique symbols of the alphabet, say R. For the lowercase English letters a-z, R=26; for the extended ASCII alphabet, R=256; and for binary digits/strings, R=2.
Compact TRIE:
Typically, a node in a trie uses an array of size R and thus wastes memory when a node has few edges. To circumvent the memory concern, various proposals have been made. Based on those variations, tries are also named “compact trie” and “compressed trie”. While consistent nomenclature is rare, the most common version of a compact trie is formed by merging chains of edges wherever nodes have a single edge. Using this concept, the above (Fig-I) trie with keys “dad”, “dab”, and “cab” can take the form below.
                [root]
            ...// | \\...
                  |  \
                 cab  da
                  |    \
                 [ ]   [*]                Fig-II
                      ./|\\...
                        |  \
                        b   d
                        |    \
                       []    []
Note that each of ‘c’, ‘a’, and ‘b’ is the sole edge of its parent node and, therefore, they are conglomerated into a single edge “cab”. Similarly, ‘d’ and ‘a’ are merged into a single edge labelled “da”.
Radix Trie:
The term radix, in mathematics, means the base of a number system, and it essentially indicates the number of unique symbols needed to represent any number in that system. For example, the decimal system is radix ten, and the binary system is radix two. Using a similar concept, when we’re interested in characterizing a data structure or an algorithm by the number of unique symbols of the underlying representational system, we tag the concept with the term “radix”; “radix sort”, for example, names a certain sorting algorithm. By the same logic, all variants of trie whose characteristics (such as depth, memory need, search miss/hit runtime, etc.) depend on the radix of the underlying alphabet may be called radix tries. For example, an un-compacted as well as a compacted trie that uses the alphabet a-z can be called a radix 26 trie. Any trie that uses only two symbols (traditionally ‘0’ and ‘1’) can be called a radix 2 trie. However, much of the literature restricts the term “Radix Trie” to the compacted trie.
Prelude to PATRICIA Tree/Trie:
It would be interesting to notice that even strings as keys can be represented using binary-alphabets. If we assume ASCII encoding, then a key “dad” can be written in binary form by writing the binary representation of each character in sequence, say as “011001000110000101100100” by writing binary forms of ‘d’, ‘a’, and ‘d’ sequentially.
Using this concept, a trie (with radix two) can be formed. Below we depict this concept using the simplifying assumption that the letters ‘a’, ‘b’, ‘c’, and ‘d’ come from a smaller alphabet instead of ASCII.
Note for Fig-III:
As mentioned, to make the depiction easy, let’s assume an alphabet with only 4 letters {a,b,c,d} whose binary representations are “00”, “01”, “10” and “11” respectively. With this, our string keys “dad”, “dab”, and “cab” become “110011”, “110001”, and “100001” respectively. The trie for this will be as shown below in Fig-III (bits are read from left to right just like strings are read from left to right).
          [root]
               \1
                \
                [*]
              0/   \1
              /     \
           [*]       [*]
          0/         /
          /         /0
        [*]       [*]
       0/         /
       /         /0
     [*]       [*]
    0/        0/ \1          Fig-III
    /         /   \
  [*]       [*]   [*]
    \1        \1    \1
     \         \     \
     []        []    []
    (cab)     (dab)  (dad)
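The encoding and the uncompacted binary trie of Fig-III can be reproduced with a short Python sketch. The CODES table follows the 2-bit assignment given in the note; the nested-dict trie and the function names are assumptions for illustration only.

```python
# 2-bit codes for the 4-letter alphabet assumed in the note for Fig-III.
CODES = {"a": "00", "b": "01", "c": "10", "d": "11"}


def to_bits(key):
    """Concatenate the codes of the characters, read left to right."""
    return "".join(CODES[ch] for ch in key)


def build_binary_trie(keys):
    """Radix-2 trie as nested dicts with one edge per bit ('0' or '1')."""
    root = {}
    for key in keys:
        node = root
        for bit in to_bits(key):
            node = node.setdefault(bit, {})
    return root


def count_nodes(node):
    """Count this node plus everything below it."""
    return 1 + sum(count_nodes(child) for child in node.values())


keys = ["dad", "dab", "cab"]
trie = build_binary_trie(keys)
print([to_bits(k) for k in keys], count_nodes(trie))
```

For these three keys the sketch counts 14 nodes including the root, far more than the 3 keys stored, which is the overhead that compaction and PATRICIA aim to remove.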
PATRICIA Trie/Tree:
If we compact the above binary trie (Fig-III) using single edge compaction, it would have far fewer nodes than shown above and yet the nodes would still number more than 3, the number of keys it contains. Donald R. Morrison found (in 1968) an innovative way to use a binary trie to depict N keys using only N nodes, and he named this data structure PATRICIA. His trie structure essentially got rid of single edges (one-way branching); and in doing so, he also got rid of the notion of two kinds of nodes – inner nodes (that don’t depict any key) and leaf nodes (that depict keys). Unlike the compaction logic explained above, his trie uses a different concept where each node includes an indication of how many bits of a key are to be skipped to make the branching decision. Yet another characteristic of his PATRICIA trie is that it doesn’t store the keys, which means such a data structure is not suitable for answering questions like “list all keys that match a given prefix”, but is good for finding whether a key exists in the trie. Nonetheless, the term Patricia Tree or Patricia Trie has since been used in many different but similar senses, such as to indicate a compact trie [NIST], or to indicate a radix trie with radix two [as indicated in a subtle way in WIKI], and so on.
Trie that may not be a Radix Trie:
The Ternary Search Trie (aka Ternary Search Tree), often abbreviated as TST, is a data structure (proposed by J. Bentley and R. Sedgewick) which looks very similar to a trie with three-way branching. In such a tree, each node has a characteristic character ‘x’, so that the branching decision is driven by whether a character of a key is less than, equal to or greater than ‘x’. Due to this fixed 3-way branching feature, it provides a memory-efficient alternative to a trie, especially when R (the radix) is very large, such as for Unicode alphabets. Interestingly, the TST, unlike the (R-way) trie, doesn’t have its characteristics influenced by R. For example, a search miss for a TST is ln(N), as opposed to log_R(N) for an R-way trie. The memory requirement of a TST, unlike that of an R-way trie, is NOT a function of R either. So we should be careful about calling a TST a radix trie. I, personally, don’t think we should call it a radix trie, since none (as far as I know) of its characteristics are influenced by the radix, R, of its underlying alphabet.

In tries, most of the nodes don’t store keys and are just hops on a
path between a key and the ones that extend it. Most of these hops are
necessary, but when we store long words, they tend to produce long
chains of internal nodes, each with just one child. This is the main
reason tries need too much space, sometimes more than BSTs.
Radix tries (aka radix trees, aka Patricia trees) are based on the
idea that we can somehow compress the path, for example after
"intermediate t node", we could have "hem" in one node, or "idote" in
one node.
Here is a graph to compare trie vs radix trie:
The original trie has 9 nodes and 8 edges, and if we assume 9 bytes
for an edge, with a 4-byte overhead per node, this means
9 * 4 + 8 * 9 = 108 bytes.
The compressed trie on the right has 6 nodes and 5 edges but in this
case each edge carries a string, not just a character; however, we can
simplify the operation by accounting for edge references and string
labels separately. This way, we would still count 9 bytes per edge
(because we would include the string terminator byte in the edge
cost), but we could add the sum of string lengths as a third term in
the final expression; the total number of bytes needed is given by
6 * 4 + 5 * 9 + 8 * 1 = 77 bytes.
For this simple trie, the compressed version requires roughly 29% less
memory (77 bytes instead of 108).
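The byte counts above are easy to re-derive. The following Python sketch simply restates the cost model from the text (4 bytes per node, 9 bytes per edge, plus one byte per label character in the compressed case); the function and constant names are made up for the example.

```python
NODE_OVERHEAD = 4  # bytes per node, as assumed in the text
EDGE_COST = 9      # bytes per edge, as assumed in the text


def trie_bytes(nodes, edges, label_bytes=0):
    """Total size under the simple cost model used above."""
    return nodes * NODE_OVERHEAD + edges * EDGE_COST + label_bytes


plain = trie_bytes(nodes=9, edges=8)
compressed = trie_bytes(nodes=6, edges=5, label_bytes=8)
saving = 1 - compressed / plain

print(plain, compressed, round(saving * 100, 1))
```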

Related

Significance of the term "Radix" in Radix Tree

While it is hard to find a unanimous definition of "Radix Tree", most accepted definitions of Radix Tree indicate that it is a compacted Prefix Tree. What I'm struggling to understand is the significance of the term "radix" in this case. Why are compacted prefix trees so named (i.e. Radix Tree), while non-compacted ones are not called Radix Trees?

Wikipedia can answer this, https://en.wikipedia.org/wiki/Radix:
In mathematical numeral systems, the radix or base is the number of
unique digits, including zero, used to represent numbers in a
positional numeral system. For example, for the decimal system (the
most common system in use today) the radix is ten, because it uses the
ten digits from 0 through 9.
and the tree https://en.wikipedia.org/wiki/Radix_tree:
a data structure that represents a space-optimized trie in which each
node that is the only child is merged with its parent. The result is
that the number of children of every internal node is at least the
radix r of the radix tree, where r is a positive integer and a power x
of 2, having x ≥ 1
Finally check a dictionary:
1.radix(Noun)
A primitive word, from which other words spring.
The radix in the radix tree determines the balance between the number of children (or depth) of the tree and the 'sparseness', or how many suffixes are unique.
EDIT - elaboration
the number of children of every internal node is at least the radix r
Let's consider the words "aba,abnormal,acne, and abysmal". In a regular prefix tree (or trie), every arc adds a single letter to the word, so we have:
-a-b-a-
     n-o-r-m-a-l-
     y-s-m-a-l-
  -c-n-e-
My drawing is a bit misleading - in tries the letters usually sit on arcs, so '-' is a node and the letters are edges. Note many internal nodes have one child! Now the compact (and obvious) form:
-a-b -a-
      normal-
      ysmal-
   cne-
Well, now we have an inner node (behind b) with 3 children! The radix is a positive power of 2, so 2 in this case. Why 2 and not, say, 3? Well, first note that the root has 2 children. In addition, suppose we want to add a word. Options:
The word shares the b prefix - well, 4 is greater than 2.
The word shares an edge of a child of b - say 'abnormally'. The way insertion works, the shared part will split and we'll have:
Relevant branch:
-normal-ly-
       -
normal is an inner node now, but it has 2 children (one a leaf).
Another case would be deleting acne, for example. But now the compactness property says the node after b must be merged back, since it's the only child, so the tree becomes
tree:
-ab-a
   -normal-ly-
          -
   -ysmal
so we still maintain children >= 2.
Hope this clarifies!

Why hidden factor on suffix tree space efficiency is 20?

In general, suffix trees are said to be less space-efficient than suffix arrays. More specifically, the O(n) space bound of a suffix tree hides a constant factor of about 20, compared with about 4 for a suffix array. Why is this?

Typically, a suffix tree is represented by having each node store one pointer per character in the alphabet, with that pointer indicating where the child node is for the indicated character. Each child pointer is also annotated with a pair of indices into the original string indicating what range of characters from the original string is used to label the given edge. This means that for each character in your alphabet (plus the $ character), each suffix tree node will need to store one pointer and two machine words. This means that if you're doing something in a computational genomics application where the alphabet is {A, C, T, G}, for example, you'd need fifteen machine words per node in the suffix tree. The number of nodes in a suffix tree is at most 2n - 1, where n is the number of suffixes of the string, so you're talking about needing roughly 30n machine words.
Contrast this with a suffix array, where for each character in the string you just need one machine word (the index of the suffix), so there are a total of n machine words needed to store the suffix array. This is a substantial savings over the original suffix tree. Usually, suffix arrays are paired with LCP arrays (which give more insight into the structure of the array), which requires another n - 1 machine words, so you're coming out to a total of roughly 2n - 1 machine words needed. This is a huge savings over the suffix tree, which is one of the reasons why suffix arrays are used so much in practice.
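To make the comparison concrete, here is a small Python sketch of the word counts described above. It only encodes the assumptions already stated (one pointer plus two indices per child slot, at most 2n - 1 nodes, and an optional LCP array of n - 1 words); the function names are hypothetical.

```python
def suffix_tree_words(n, alphabet_size):
    """Upper bound on machine words for the array-of-children representation."""
    slots = alphabet_size + 1       # one child slot per character, plus $
    words_per_node = 3 * slots      # a pointer and two string indices per slot
    max_nodes = 2 * n - 1
    return max_nodes * words_per_node


def suffix_array_words(n, with_lcp=True):
    """One word per suffix, plus n - 1 words if an LCP array is kept."""
    return n + (n - 1 if with_lcp else 0)


n = 1000
print(suffix_tree_words(n, alphabet_size=4), suffix_array_words(n))
```

For the DNA alphabet and n = 1000 this gives 29985 words against 1999, a ratio of 15 under these particular assumptions. Other node layouts give other constants.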

Is there any tree for optimal prefix code other than Huffman tree? Will the height of that be the same as that of Huffman tree?

I know that a Huffman tree is a kind of tree used for optimal prefix codes, but is there any tree for optimal prefix codes other than a Huffman tree? If there are such trees, will their heights be the same?
Many thanks!

Huffman trees are constructed recursively by taking the two currently lowest probability symbols and combining them.
If there are other symbols with the same low probability, then these symbols can be combined instead.
This means that the final tree is not uniquely defined and there are multiple optimal prefix codes with potentially different heights.
For example, consider the symbols and probabilities below:
A 1/3
B 1/3
C 1/6
D 1/6
Can be encoded as:
A 0
B 10
C 110
D 111
or
A 00
B 01
C 10
D 11
Both encodings have an expected number of bits per symbol equal to 2, but different heights.
However, all optimal prefix codes can be constructed by the Huffman algorithm for a suitable choice of ordering with respect to probability ties.
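The two encodings in the example can be checked mechanically. This Python sketch uses exact fractions to compute the expected code length and the tree height of each code; the variable and function names are illustrative only.

```python
from fractions import Fraction

probs = {"A": Fraction(1, 3), "B": Fraction(1, 3),
         "C": Fraction(1, 6), "D": Fraction(1, 6)}

skewed = {"A": "0", "B": "10", "C": "110", "D": "111"}
balanced = {"A": "00", "B": "01", "C": "10", "D": "11"}


def expected_length(code):
    """Average number of bits per symbol under probs."""
    return sum(probs[sym] * len(word) for sym, word in code.items())


def height(code):
    """Height of the code tree, i.e. the longest codeword."""
    return max(len(word) for word in code.values())


def is_prefix_free(code):
    """True if no codeword is a proper prefix of another."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


for name, code in (("skewed", skewed), ("balanced", balanced)):
    print(name, expected_length(code), height(code), is_prefix_free(code))
```

Both codes average exactly 2 bits per symbol, while their heights are 3 and 2 respectively.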

Within the constraints of the Huffman code problem, i.e. representation of each symbol by a prefix-unique sequence of bits, then there is exactly one optimal total number of bits that can be achieved, and the Huffman algorithm achieves that. There are other approaches that arrive at the same answer.
As Peter de Rivaz noted, in certain special cases the Huffman algorithm has more than one choice at some steps in which of two minimum probability set of codes to pick, which can result in different trees. So the tree height/depth you mention is not unique, but the total number of bits (sum of the bit lengths of each symbol weighted by its probability) is always the same.

What does radix mean in a radix tree?

I was trying to understand the radix tree (or compact prefix tree) data structure.
I understand how lookup, insert and delete works with it. But I could not understand what does radix mean in a radix tree.
What is the purpose of radix here?

As already mentioned by @ta in the Wikipedia etymology link, the 'radix' is the base of your trie. In this case we mean the numeric base, and we'll consider storing binary data. Radix R = 2 ^ x, where x >= 1.
Take the example of a binary (2-ary) trie. The radix is 2, so at each node you can compare one bit. The two children will handle all of the possible outcomes:
the bit is 0
the bit is 1
The next level of complexity would be a 4-ary trie. As @Garrett mentioned above, the radix must be a power of two so that it can always handle all possible sorting outcomes of the binary data we're using it for. A 4-ary trie can compare two binary bits with the four possible outcomes:
00
01
10
11
These four options each lead to a different child node.
Now, in answer to your question about the radix for the English alphabet. You want to encode letters from a to z (26 letters) so will need to have a radix of at least 2^5 = 32. This is the smallest radix that will let you switch between every letter and conform to the 'powers of two' rule. 2^4 = 16 wouldn't handle all of the letters.
As an example, let's imagine the following encoding:
00000 represents 'a',
00001 represents 'b',
... etc
11001 represents 'z',
11010 to 11111 are not in use (yet)
We can now compare five bits at every node in the tree, so every node can switch between any Roman-alphabet letter. If you want a trie that will handle upper case letters then you will require a larger radix. A radix of 2^6 will give you enough to do this, but it comes at the cost of more wasted space (unused branches) in the trie.
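The imagined encoding can be written down in a few lines of Python. This is only a sketch of the scheme described above ('a' maps to 0, 'z' to 25, five bits per letter); letter_bits and RADIX are names invented for the example.

```python
RADIX = 2 ** 5  # 32 branches per node, the smallest power of two >= 26


def letter_bits(ch, width=5):
    """Return the fixed-width binary code of a lowercase letter a-z."""
    index = ord(ch) - ord("a")
    if not 0 <= index < 26:
        raise ValueError(f"not a lowercase letter: {ch!r}")
    return format(index, f"0{width}b")


unused_branches = RADIX - 26  # codes 11010 to 11111 stay unused

print(letter_bits("a"), letter_bits("b"), letter_bits("z"), unused_branches)
```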
Further reading: Sedgewick, Ch 15.4, on Multiway Tries. The 3rd edition of Algorithms by Cormen is generally excellent but doesn't do much for multiway tries.

There is some info on Wikipedia:
The result is that every internal node has up to the number of children of the radix r of the radix trie, where r is a positive integer and a power x of 2, having x ≥ 1.
So the radix signifies the number of children of each internal node, and that number must be a power of 2. When the radix is 2, we have a familiar binary tree.

I've posted an answer to this as a response to another thread - What is the difference between trie and radix trie data structures?. Specifically the sections Radix Trie and Trie that may not be a Radix Trie might be of special interest for this question.

Understanding Fusion Trees?

I stumbled across the Wikipedia page for them:
Fusion tree
And I read the class notes pdfs linked at the bottom, but it gets hand-wavy about the data structure itself and goes into a lot of detail about the sketch(x) function. I think part of my confusion is that the papers are trying to be very general, and I would like a specific example to visualize.
Is this data structure appropriate for storing data based on arbitrary 32 or 64 bit integer keys? How does it differ from a B-tree? There is one section that says it's basically a B-tree with a branching factor B = (lg n)^(1/5). For a fully populated tree with 32 bit keys, B would be 2. Does this just become a binary tree? Is this data structure intended to use much longer bit-strings as keys?
My Googling didn't turn up anything terribly useful, but I would welcome any good links on the topic. This is really just a passing curiosity, so I haven't been willing to pay for the PDFs at portal.acm.org yet.

You've asked a number of great questions here:
Is a fusion tree a good data structure for storing 32-bit or 64-bit numbers? Or is it designed to store longer bitstrings?
How does a fusion tree differ from a B-tree?
A fusion tree picks b = w^(1/5), where w is the machine word size. Does this mean that b = 2 on a 32-bit machine, and does that make it just a binary tree?
Why is so much of the discussion of a fusion tree focused on sketching?
Is there a visualization of a fusion tree available to help understand how the structure works?
I'd like to address each of these questions in turn.
Q1: What do you store in a fusion tree? Are they good for 32-bit integers?
Your first question was about what fusion trees are designed to store. The fusion tree data structure is specifically designed to store integers that fit into a single machine word. As a result, on a 32-bit machine, you'd use the fusion tree to store integers of up to 32 bits, and on a 64-bit machine you'd use a fusion tree to store integers of up to 64 bits.
Fusion trees are not designed to handle arbitrarily long bitstrings. The design of fusion trees, which we'll get to in a little bit, is based on a technique called word-level parallelism, in which individual operations on machine words (multiplications, shifts, subtractions, etc.) are performed to implicitly operate on a large collection of numbers in parallel. In order for these techniques to work correctly, the numbers being stored need to fit into individual machine words. (It is technically possible to adapt the techniques here to work for numbers that fit into a constant number of machine words, though.)
But before we go any further, I need to include a major caveat: fusion trees are of theoretical interest only. Although fusion trees at face value seem to have excellent runtime guarantees (O(log_w n) time per operation, where w is the size of the machine word), the actual implementation details are such that the hidden constant factors are enormous and a major barrier to practical adoption. The original paper on fusion trees was mostly geared toward proving that it was possible to surpass the Ω(log n) lower bound on BST operations by using word-level parallelism and without regard to wall-clock runtime costs. So in that sense, if your goal in understanding fusion trees is to use one in practice, I would recommend stopping here and searching for another data structure. On the other hand, if you're interested in seeing just how much latent power is available in humble machine words, then please read on!
Q2: How does a fusion tree differ from a regular B-tree?
At a high level, you can think of a fusion tree as a regular B-tree with some extra magic thrown in to speed up searches.
As a reminder, a B-tree of order b is a multiway search tree where, intuitively, each node stores (roughly) b keys. The B-tree is a multiway search tree, meaning that the keys in each node are stored in sorted order, and the child trees store elements that are ordered relative to those keys. For example, consider this B-tree node:
     +-----+-----+-----+-----+
     | 103 | 161 | 166 | 261 |
     +-----+-----+-----+-----+
    /      |     |     |      \
   /       |     |     |       \
  A        B     C     D        E
Here, A, B, C, D, and E are subtrees of the root node. The subtree A consists of keys strictly less than 103, since it's to the left of 103. Subtree B consists of keys between 103 and 161, since subtree B is sandwiched between 103 and 161. Similarly, subtree C consists of keys between 161 and 166, subtree D consists of keys between 166 and 261, and subtree E consists of keys greater than 261.
To perform a search in a B-tree, you begin at the root node and repeatedly ask which subtree you need to descend into to continue the search. For example, if I wanted to look up 137 in the above tree, I'd need to somehow determine that 137 resides in subtree B. There are two "natural" ways that we could do this search:
Run a linear search over the keys to find the spot where we need to go. Time: O(b), where b is the number of keys in the node.
Run a binary search over the keys to find the spot where we need to go. Time: O(log b), where b is the number of keys in the node.
Because each node in a B-tree has a branching factor of b or greater, the height of a B-tree of order b is O(log_b n). Therefore, if we use the first strategy (linear search) to find what tree to descend into, the worst-case work required for a search is O(b log_b n), since we do O(b) work per level across O(log_b n) levels. Fun fact: the quantity b log_b n is minimized when b = e, and gets progressively worse as we increase b beyond this limit.
On the other hand, if we use a binary search to find the tree to descend into, the runtime ends up being O(log b · log_b n). Using the change of base formula for logarithms, notice that
log b · log_b n = log b · (log n / log b) = log n,
so the runtime of doing lookups this way is O(log n), independent of b. This matches the time bounds of searching a regular balanced BST.
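A quick numerical check of both observations, as a Python sketch. It ignores constant factors and treats the per-level and per-tree costs exactly as the formulas above do; linear_cost and binary_cost are names chosen for this example.

```python
import math


def linear_cost(b, n):
    """b * log_b(n): linear scan of each node across all levels."""
    return b * math.log(n, b)


def binary_cost(b, n):
    """log2(b) * log_b(n): binary search inside each node across all levels."""
    return math.log2(b) * math.log(n, b)


n = 10 ** 6
for b in (2, 3, 4, 16, 1024):
    print(b, round(linear_cost(b, n), 2), round(binary_cost(b, n), 2))
```

The binary-search column stays at log2(n), about 19.93, for every b, while the linear-scan column is smallest at b = 3, the integer closest to e, and grows as b increases beyond that.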
The magic of the fusion tree is in finding a way to determine which subtree to descend into in time O(1). Let that sink in for a minute - we can have multiple children per node in our B-tree, stored in sorted order, and yet we can find which two keys our element is between in time O(1)! Doing so is decidedly nontrivial and is the bulk of the magic of the fusion tree. But for now, assuming that we can do this, notice that the runtime of searching the fusion tree would be O(log_b n), since we do O(1) work at each of the O(log_b n) layers in the tree!
The question now is how to do this.
Q3: A fusion tree picks b = w^(1/5), where w is the machine word size. Does this mean that b = 2 on a 32-bit machine, and does that make it just a binary tree?
For technical reasons that will become clearer later on, a fusion tree works by choosing, as the branching parameter for the B-tree, the value b = w^(1/5), where w is the machine word size. On a 32-bit machine, that means that we'd pick
b = w^(1/5) = (2^5)^(1/5) = 2,
and on a 64-bit machine we'd pick
b = w^(1/5) = (2^6)^(1/5) = 2^(6/5) ≈ 2.30,
which we'd likely round down to 2. So does that mean that a fusion tree is just a binary tree?
The answer is "not quite." In a B-tree, each node stores between b - 1 and 2b - 1 total keys. With b = 2, that means that each node stores between 1 and 3 total keys. (In other words, our B-tree would be a 2-3-4 tree, if you're familiar with that lovely data structure). This means that we'll be branching slightly more than a regular binary search tree, but not much more.
Returning to our earlier point, fusion trees are primarily of theoretical interest. The fact that we'd pick b = 2 on a real machine and barely do better than a regular binary search tree is one of the many reasons why this is the case.
On the other hand, if we were working on, say, a machine whose word size was 32,768 bits (I'm not holding my breath on seeing one of these in my lifetime), then we'd get a branching factor of b = 8, and we might actually start seeing something that beats a regular BST.
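The branching factors quoted in this section can be recomputed with a couple of lines of Python. This is just the arithmetic from the text; fusion_branching and key_range are hypothetical helper names.

```python
def fusion_branching(w):
    """Branching parameter b = w^(1/5) for a machine word of w bits."""
    return w ** (1 / 5)


def key_range(b):
    """Minimum and maximum number of keys per B-tree node, as stated in the text."""
    return b - 1, 2 * b - 1


for w in (32, 64, 32768):
    print(w, round(fusion_branching(w), 2))

print(key_range(2))
```

The results are approximately 2, 2.3 and 8 for 32-bit, 64-bit and 32,768-bit words, and a node holds between 1 and 3 keys when b = 2.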
Q4: Why is so much of the discussion of a fusion tree focused on sketching?
As mentioned above, the "secret sauce" of the fusion tree is the ability to augment each node in the B-tree with some auxiliary information that makes it possible to efficiently (in time O(1)) determine which subtree of the B-tree to descend into. Once you have the ability to get this step working, the remainder of the data structure is basically just a regular B-tree. Consequently, it makes sense to focus extensively (exclusively?) on how this step works.
This is also, by far, the most complicated step in the process. Getting this step working requires the development of several highly nontrivial subroutines that, collectively, give the overall behavior.
The first technique that we'll need is a parallel rank operation. Let's return to the key question about our B-tree search: how do we determine which subtree to descend into? Let's look back to our B-tree node, as shown here:
  +-----+-----+-----+-----+
  | 103 | 161 | 166 | 261 |
  +-----+-----+-----+-----+
 /      |     |     |      \
/       |     |     |       \
T0      T1    T2    T3      T4
This is the same drawing as before, but instead of labeling the subtrees A, B, C, D, and E, I've labeled them T0, T1, T2, T3, and T4.
Let's imagine I want to search for 162. That should put me into subtree T2. One way to see this is that 162 is bigger than 161 and less than 166. But there's another perspective we can take here: we want to search T2 because 162 is greater than both 103 and 161, the two keys that come before it. Interesting - we want tree index 2, and we're bigger than two of the keys in the node. Hmmm.
Now, search for 196. That puts us in tree T3, and 196 happens to be bigger than 103, 161, and 166, a total of three keys. Interesting. What about 17? That would be in tree T0, and 17 is greater than zero of the keys.
This hints at a key strategy we're going to use to get the fusion tree to work:
To determine which subtree to descend into, we need to count how many keys our search key is greater than. (This number is called the rank of the search key.)
The key insight in the fusion tree is how to do this in time O(1).
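As a plain (non-O(1)) baseline, the rank idea can be sketched like this. It just loops over the node's keys, so it takes time proportional to the number of keys; the whole point of the fusion tree is to replace this loop with a constant number of word operations. The names `rank` and `node_keys` are made up for the illustration.

```python
def rank(keys, k):
    """Number of keys in the node that the search key k is strictly greater than."""
    return sum(1 for x in keys if x < k)


node_keys = [103, 161, 166, 261]

if __name__ == "__main__":
    # The three searches walked through above.
    for k in (162, 196, 17):
        print(f"search {k}: descend into T{rank(node_keys, k)}")
```

The rank doubles as the index of the subtree to descend into, which matches the three examples above (T2, T3, and T0).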
Before jumping into sketching, let's build out a key primitive that we'll need for later on. The idea is the following: suppose that you have a collection of small integers, where, here, "small" means "so small that lots of them can be packed into a single machine word." Through some very clever techniques, if you can pack multiple small integers into a machine word, you can solve the following problem in time O(1):
Parallel rank: Given a key k, which is a small integer, and a fixed collection of small integers x_1, ..., x_b, determine how many of the x_i are less than or equal to k.
For example, we might have a bunch of 6-bit numbers, say 31, 41, 59, 26, and 53, and we could then execute queries like "how many of these numbers are less than or equal to 37?"
To give a brief glimpse of how this technique works, the idea is to pack all of the small integers into a single machine word, separated by zero bits. That number might look like this:
00111110101001011101100110100110101
0 31   0 41   0 59   0 26   0 53
Now, suppose we want to see how many of these numbers are less than or equal to 37. To do so, we begin by forming an integer that consists of several replicated copies of the number 37, each of which is preceded by a 1 bit. That would look like this:
11001011100101110010111001011100101
1 37   1 37   1 37   1 37   1 37
Something very cool happens if we subtract the first number from this second number. Watch this:
  11001011100101110010111001011100101      1  37 1  37 1  37 1  37 1  37
- 00111110101001011101100110100110101    - 0  31 0  41 0  59 0  26 0  53
-------------------------------------    -------------------------------
  10001100111100010101010010110110000      1   6 0  -4 0 -22 1  11 0 -16
  ^      ^      ^      ^      ^            ^     ^     ^     ^     ^
The bits that I've highlighted here are the extra bits that we added to the front of each number. Notice that
if the top number is greater than or equal to the bottom number, then the bit in front of the subtraction result will be 1, and
if the top number is smaller than the bottom number, then the bit in front of the subtraction result will be 0.
To see why this is, if the top number is greater than or equal to the bottom number, then when we perform the subtraction, we'll never need to "borrow" from that extra 1 bit we put in front of the top number, so that bit will stay a 1. Otherwise, the top number is smaller, so to make the subtraction work out we have to borrow from that 1 bit, marking it as a zero. In other words, this single subtraction operation can be thought of as doing a parallel comparison between the original key and each of the small numbers. We're doing one subtraction, but, logically, it's five comparisons!
If we can count up how many of the marked bits are 1s, then we have the answer we want. This turns out to require some additional creativity to work in time O(1), but it is indeed possible.
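Here is a minimal sketch of the subtraction trick, using Python's arbitrary-precision integers to stand in for a machine word. The function names (`pack`, `parallel_rank`) are made up for this illustration. Two simplifications to be aware of: the replication constant is built with a loop, and the final count of marked bits uses `bin(...).count("1")`, not the constant-time technique alluded to above.

```python
def pack(values, bits):
    """Pack values into one integer; each gets a (bits + 1)-wide field whose top bit is 0."""
    word = 0
    for v in values:
        assert 0 <= v < (1 << bits), "value does not fit in the field"
        word = (word << (bits + 1)) | v
    return word


def parallel_rank(packed, count, bits, k):
    """How many of the `count` packed values are less than or equal to k."""
    field = bits + 1
    # A word with a 1 in the lowest bit of every field: ...0000001 0000001.
    ones = sum(1 << (i * field) for i in range(count))
    # Multiplying by `ones` replicates a pattern into every field.
    query = ((1 << bits) | k) * ones   # k preceded by a 1 bit, in each field
    mask = (1 << bits) * ones          # just the marker bit of each field
    diff = query - packed              # one subtraction, `count` comparisons
    # A marker bit survives exactly when k >= the value stored in that field.
    return bin(diff & mask).count("1")


if __name__ == "__main__":
    packed = pack([31, 41, 59, 26, 53], bits=6)
    print(format(packed, "035b"))
    print(parallel_rank(packed, count=5, bits=6, k=37))
```

With the numbers from the example, the packed word matches the bit string shown above, and the query for 37 reports 2 (namely 31 and 26).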
This parallel rank operation shows that if we have a lot of really small keys - so small that we can pack them into a machine word - we could indeed go and compute the rank of our search key in time O(1), which would tell us which subtree we need to descend into. However, there's a catch - this strategy assumes that our keys are really small, but in general, we have no reason to assume this. If we're storing full 32-bit or 64-bit machine words as keys, we can't pack lots of them into a single machine word. We can fit exactly one key into a machine word!
To address this, fusion trees use another insight. Let's imagine that we pick the branching factor of our B-tree to be very small compared to the number of bits in a machine word (say, b = w^(1/5)). If you have a small number of machine words, the main insight you need is that only a few of the bits in those machine words are actually relevant for determining the ordering. For example, suppose I have the following 32-bit numbers:
A: 00110101000101000101000100000101
B: 11001000010000001000000000000000
C: 11011100101110111100010011010101
D: 11110100100001000000001000000000
Now, imagine I wanted to sort these numbers. To do so, I only really need to look at a few of the bits. For example, some of the numbers differ in their first bit (the top number A has a 0 there, and the rest have a 1). So I'll write down that I need to look at the first bit of the number. The second bit of these numbers doesn't actually help sort things - anything that differs at the second bit already differs at the first bit (do you see why?). The third bit, on the other hand, does help us rank them, because numbers B, C, and D, which have the same first bit, diverge at the third bit into the groups (B, C) and D. I also would need to look at the fourth bit, which splits (B, C) apart into B and C.
In other words, to compare these numbers against one another, we'd only need to store these marked bits. If we process these bits, in order, we'd never need to look at any others:
A: 00110101000101000101000100000101
B: 11001000010000001000000000000000
C: 11011100101110111100010011010101
D: 11110100100001000000001000000000
   ^ ^^
This is the sketching step you were referring to in your question, and it's used to take a small number of large numbers and turn them into a small number of small numbers. Once we have a small number of small numbers, we can then use our parallel rank step from earlier on to do rank operations in time O(1), which is what we needed to do.
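To illustrate on the four numbers above, here is a rough sketch that finds the "interesting" bit positions and extracts them. It is not how a real fusion tree does it: the names `distinguishing_bits` and `sketch` are made up, the bits are pulled out with a loop and not in O(1), and it only covers comparing keys that are already in the group. Bit positions are counted from 1 at the most significant bit, as in the text.

```python
def distinguishing_bits(keys, width):
    """Bit positions (1 = most significant) where adjacent keys in sorted order first differ."""
    ordered = sorted(keys)
    positions = set()
    for a, b in zip(ordered, ordered[1:]):
        diff = a ^ b  # nonzero as long as the keys are distinct
        # The highest set bit of the XOR is the first bit where a and b disagree.
        positions.add(width - diff.bit_length() + 1)
    return sorted(positions)


def sketch(key, positions, width):
    """Concatenate the bits of key found at the given positions."""
    out = 0
    for p in positions:
        out = (out << 1) | ((key >> (width - p)) & 1)
    return out


A = int("00110101000101000101000100000101", 2)
B = int("11001000010000001000000000000000", 2)
C = int("11011100101110111100010011010101", 2)
D = int("11110100100001000000001000000000", 2)

if __name__ == "__main__":
    bits = distinguishing_bits([A, B, C, D], width=32)
    print(bits)
    print([format(sketch(x, bits, 32), "03b") for x in (A, B, C, D)])
```

On these inputs the positions come out as the first, third, and fourth bits, and the 3-bit sketches are 011, 100, 101, and 111, which sort in the same order as A, B, C, and D.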
Of course, there are a lot of steps that I'm skipping over here. How do you determine which bits are "interesting" bits that we need to look at? How do you extract those bits from the numbers? If you're given a number that isn't in the group, how do you figure out how it compares against the numbers in the group, given that it might differ in other bit positions? These aren't trivial questions to answer, and they're what give rise to most of the complexity of the fusion tree.
Q5: Is there a visualization of a fusion tree available to help understand how the structure works?
Yes, and no. I'll say "yes" because there are resources out there that show how the different steps work. However, I'll say "no" because I don't believe there's any one picture you can look at that will cause the whole data structure to suddenly click into focus.
I teach a course in advanced data structures and spent two 80-minute lectures building up to the fusion tree by using techniques from word-level parallelism. The discussion here is based on those lectures, which go into more depth about each step and include visualizations of the different substeps (how to compute rank in constant time, how the sketching step works, etc.), and each of those steps individually might give you a better sense for how the whole structure works. Those materials are linked here:
Part One discusses word-level parallelism, computing ranks in time O(1), building a variant of the fusion tree that works for very small integers, and computing most-significant bits in time O(1).
Part Two explores the full version of the fusion tree, introducing the basics behind the sketching step (which I call "Patricia codes" based on the connection to the Patricia trie).
To Summarize
In summary:
A fusion tree is a modification of a B-tree. The basic structure matches that of a regular B-tree, except that each node has some auxiliary information to speed up searching.
Fusion trees are purely of theoretical interest at this point. The hidden constant factors are too high and the branching factor too low to meaningfully compete with binary search trees.
Fusion trees use word-level parallelism to speed up searches, commonly by packing multiple numbers into a single machine word and using individual operations to simulate parallel processing.
The sketching step is used to reduce the number of bits in the input numbers to a point where parallel processing with a machine word is possible.
There are lecture slides detailing this in a lot more depth.
Hope this helps!

I've read (just a quick pass) the seminal paper, and it seems interesting. It also answers most of your questions on the first page.
You may download the paper from here
HTH!

I've read the fusion tree paper. The ideas are pretty clever, and in big-O terms the author can make a case for a win.
It isn't clear to me that it is a win in practice. The constant factor matters a lot, and the chip designers work really hard to manage cheap local references.
He has to have B in his faux B-trees pretty small for real machines (B=5 for 32 bits, maybe 10 for 64 bits). That many pointers pretty much fits in a cache line. After the first cache line touch (which he can't avoid) of several hundred cycles, you can pretty much do a linear search through the keys in a few cycles per key, which means a carefully coded B-tree traditional implementation seems like it should outrun fusion trees. (I've built such B-tree code to support our program transformation system).
He claims a list of applications, but there are no comparative numbers.
Anybody have any hard evidence? (Implementations and comparisons?)

The idea behind the fusion tree is actually fairly simple. Suppose you have w-bit (say, 64-bit) keys. The idea is to compress (i.e., sketch) every 64 consecutive keys into a 64-element array. The sketching function ensures a constant-time mapping between the original keys and the array index for a given group. Searching for a key then becomes searching for the group containing the key, which is O(log(n/64)).
As you can see, the main challenge is the sketching function.
