How to decode huffman tree? - algorithm

Is there a better way than just going left or right based on the input digit, 0 or 1?

There are some papers on efficient decoding algorithms for Huffman Trees. Personally, I only used one of them, for academic reasons, but it was a long time ago. The title of the paper was "A Memory Efficient and Fast Huffman Decoding Algorithm" by Hong-Chung Chen, Yue-Li Wang and Yu-Feng Lan.
The algorithm gives results in O(log n) time. In order to use this algorithm, you have to construct a table with all the symbols of your tree (leaves), and for each symbol you have to specify a weight:
w(i) = 2 ^ (h - l)
where h is the height of Huffman Tree and l is the level of the symbol, and a count:
count(i) = count(i-1) + w(i)
The count of root, count(0), is equal to its weight.
When you have all that, there are 3 simple steps in the algorithm, that are described in the paper.
I don't know if this is what you were looking for.

Yes, there is, and you can use lookup tables.
Note that you will use quite a bit of memory to store these tables, and you will either have to ship that table with the data (probably negating the effect of the compression altogether) or construct the table before decompressing (which would negate some, if not all, of the speedup you gain from it.)
Here's how the table would work.
Let's say, for a sample portion of compressed data that the bit-sequences for symbols are as follows:
1 -> A
01 -> B
00 -> C
What you do now is to produce a table, indexed by a byte (since you need to read 1 byte minimum to decode the first symbol), containing entries like this:
key        symbol   bit-shift
1xxxxxxx   A        1
01xxxxxx   B        2
00xxxxxx   C        2
The x's mean that you need to store an entry for every possible combination of those bits, all carrying the same data; the bit-shift column is the length of the code, i.e. the number of bits the symbol consumes. For the first row, this means that every byte-key with the high bit set maps to A/1.
The table would have entries for all 256 key values: half of them mapped to A/1, and a quarter each to B/2 and C/2, according to the rules above.
Of course, if the longest bit sequence for a symbol is 9-16 bits, you need a table keyed by a 16-bit integer, and so on.
Now, when you decode, here's how you would do this:
read first and second byte as key, append them together
loop:
    look up the first 8 bits of the current key in the table
    emit symbol found in table
    shift the key bitwise to the left the number of bits specified in the table
    if less than 8 bits left in the key, get next byte and append to key
when at end, just pad with 0-bytes, and as with all Huffman decompression you need to know how many symbols to emit before you start.
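The steps above can be sketched in Python roughly as follows. This is only an illustration for the three sample codes; the names build_table and decode are made up for this example, and each table entry stores the symbol together with the number of bits its code consumes.

```python
CODES = {"1": "A", "01": "B", "00": "C"}  # the sample codes from above


def build_table(codes, width=8):
    """Build a 2**width entry table mapping each key to (symbol, code length)."""
    table = [None] * (1 << width)
    for bits, symbol in codes.items():
        free = width - len(bits)            # number of "x" bits after the code
        base = int(bits, 2) << free
        for tail in range(1 << free):       # every combination of the x bits
            table[base | tail] = (symbol, len(bits))
    return table


def decode(data, n_symbols, table, width=8):
    """Decode n_symbols symbols from data (bytes), padding with zero bytes."""
    out = []
    buf = 0      # bit buffer; the oldest unread bits sit at the top
    nbits = 0    # number of unread bits currently in buf
    pos = 0
    for _ in range(n_symbols):
        while nbits < width:                # refill so a full key is available
            byte = data[pos] if pos < len(data) else 0
            pos += 1
            buf = (buf << 8) | byte
            nbits += 8
        key = (buf >> (nbits - width)) & ((1 << width) - 1)
        symbol, length = table[key]
        out.append(symbol)
        nbits -= length                     # consume only the bits of this code
        buf &= (1 << nbits) - 1
    return "".join(out)


table = build_table(CODES)
# "ABCA" encodes to the bits 1 01 00 1, padded with zeros to the byte 10100100
print(decode(bytes([0b10100100]), 4, table))
```

Keeping a bit count instead of physically shifting a fixed-size key is just a convenience in Python; the lookup itself is the same single table access per symbol.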

Sure, instead of binary trees you can use k-ary trees and get an O(log_k(n)) lookup instead of O(log_2(n)). It's not much of a speedup.
If the max key size is small (say < 8 bits) or you've got lots of memory, you can use a straight lookup table & get O(1).

Related

algorithm to find duplicated byte sequences

Hello to everyone reading this post. Please help me solve the following task:
As input I have an array of bytes. I need to detect duplicated sequences of bytes so that I can compress the duplicates. Can anyone help me find an acceptable algorithm?
Usually this type of problem comes with CPU/memory tradeoffs.
The extreme solutions would be (1) and (2) below; you can improve from there:
(1) High CPU / low memory - iterate over all possible combination sizes and letters and find duplicates (two nested for loops).
(2) Low CPU / high memory - create a lookup table (hash map) of all required combinations and lengths, traverse the array and add to the table, then traverse the table and find your candidates.
Improve from here - if you want the ideas from (2) with lower memory, decrease the lookup table size so that it gets more hits and deal with the extra hits later. If you want faster lookup by sequence length, create a separate lookup table per length.
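As a rough illustration of the hash map extreme for one fixed sequence length, something like this would do. The function name find_duplicates and the choice of a single fixed length are assumptions made for the example, not part of the answer above.

```python
from collections import defaultdict


def find_duplicates(data: bytes, length: int):
    """Map every byte sequence of the given length to the positions where it occurs,
    keeping only the sequences that occur more than once."""
    positions = defaultdict(list)
    for i in range(len(data) - length + 1):
        positions[data[i:i + length]].append(i)
    return {seq: pos for seq, pos in positions.items() if len(pos) > 1}


print(find_duplicates(b"abcXabcY", 3))
```

Memory grows with the number of distinct windows, which is exactly the tradeoff described above.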
Build a tree in which every node can branch on any byte value (256 branches).
Then traverse the array, building new sub-branches. At each node, store a list of positions where this sequence is found.
For example: Let's say you are at node AC,40,2F. This sequence in the tree means: "Byte AC was found at position xx (one of its positions stored in that node). The next byte, 40, was at position yy=xx+1 (among others). The byte 2F was at position zz=yy+1."
Now you want to "compress" only sequences of some size (e.g. 5). So traverse the tree and pay attention to depths of 5 or more. In the 5th-deep subnode of a node you have already stored all positions where such a sequence (or a longer one) is found in the array. Those positions are the ones you want to store in your compressed file.

Repeated DNA sequence

The problem is to find all the sequences of length k in a given DNA sequence which occur more than once. I found an approach using a rolling hash function, where for each sequence of length k a hash is computed and stored in a map. To check if the current sequence is a repetition, we compute its hash and check if the hash already exists in the hash map. If yes, then we include this sequence in our result; otherwise we add it to the hash map.
Rolling hash here means, when moving on to the next sequence by sliding the window by one, we use the hash of previous sequence in a way that we remove the contribution of the first character of previous sequence and add the contribution of the newly added char i.e. the last character of the new sequence.
Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
This algorithm looks perfect, but I can't go about making a perfect hash function so that collisions are avoided. It would be a great help if somebody can explain how to make a perfect hash under any circumstance and most importantly in this case.
This is actually a research problem.
Let's come to terms with some facts
Input = N, Input length = |N|
You have to move a size k, here k=10, sliding window over the input. Therefore you must live with O(|N|) or more.
Your rolling hash is a form of locality-sensitive deterministic hashing. The downside of deterministic hashing is that its benefit is greatly diminished: the more often you encounter similar strings, the harder they are to hash.
The longer your input the less effective hashing will be
Given these facts "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.
So what alternatives do you have?
Bloom filters. They are much more robust than simple hashing. The downside is that they sometimes give false positives, but this can be mitigated by using several filters.
Cuckoo hashes. Similar to Bloom filters, but they use less memory and have locality-sensitive "hashing" and worst-case constant lookup time.
Just stick every suffix in a suffix trie. Once this is done, just output every string at depth 10 that also has at least 2 children with one of the children being a leaf.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward but memory consumption is lower.
My favorite: the FM-index. In my opinion the cleanest solution uses the Burrows-Wheeler Transform. This technique is also used in industry tools like Bowtie and BWA.
Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encode the sequence into an integer by bit manipulation.
If your input k is relatively small, let's say around 10, then you can encode your DNA sequence in an int via bit manipulation. For each character in the sequence there are only 4 possibilities (A, C, G, T), so you can simply make your own mapping which uses 2 bits to represent a letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
In this way, if k is 10, you won't need a string with 10 characters as hash key. Instead, you can only use 20 bits in an integer to represent the previous key string.
Then when you do your rolling hash, you left shift the integer that stores your previous sequence for 2 bits, then use any bit operations like |= to set the last two bits with your new character. And remember to clear the 2 left most bits that you just shifted, meaning you are removing them from your sliding window.
By doing this, a string could be stored in an integer, and using that integer as hash key might be nicer and cheaper in terms of the complexity of the hash function computation. If your input length k is slightly longer than 16, you may be able to use a long value. Otherwise, you might be able to use a bitset or a bitarray. But to hash them becomes another issue.
Therefore, I'd say this solution is a nice attempt for this problem when the sequence length is relatively small, i.e. can be stored in a single integer or long integer.
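A minimal sketch of this trick, using the mapping given above (the function name repeated_kmers is just chosen for the example). Because the integer is an exact encoding of the window rather than a lossy hash, two different windows can never collide.

```python
def repeated_kmers(dna, k):
    """Return the set of length-k substrings that occur more than once."""
    code = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    mask = (1 << (2 * k)) - 1        # keeps only the lowest 2*k bits
    seen, repeated = set(), set()
    window = 0
    for i, ch in enumerate(dna):
        # shift in the new letter, then drop the letter that left the window
        window = ((window << 2) | code[ch]) & mask
        if i >= k - 1:               # the window holds k letters from here on
            if window in seen:
                repeated.add(dna[i - k + 1:i + 1])
            else:
                seen.add(window)
    return repeated


print(repeated_kmers("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 10))
```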
You can build the suffix array and the LCP array. Iterate through the LCP array, every time you see a value greater or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater or equal to k, ignore all following values until reaching one that is less than k (this avoids reporting repeated values).
The construction of both, the suffix array and the LCP, can be done in linear time. So overall the solution is linear with respect to the size of the input plus output.
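To show the reporting logic only, here is a sketch that builds the suffix array and the LCP values in the naive way. It is nowhere near linear time (a real implementation would use a linear-time construction), and the function names are invented for this example.

```python
def repeated_substrings(text, k):
    """Report each length-k substring occurring more than once, via suffix array + LCP."""
    # Naive suffix array: sort suffix start positions by the suffix text.
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def lcp(i, j):
        """Length of the longest common prefix of the suffixes starting at i and j."""
        n = 0
        while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
            n += 1
        return n

    result = []
    in_run = False                   # True while consecutive LCP values stay >= k
    for prev, cur in zip(sa, sa[1:]):
        if lcp(prev, cur) >= k:
            if not in_run:           # report only once per run
                result.append(text[cur:cur + k])
                in_run = True
        else:
            in_run = False
    return result


print(repeated_substrings("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 10))
```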
What you could do is use Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT means that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size 10^18 more or less. With a sufficiently large modulus, you can more or less disregard the idea of a collision happening at all---as my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than a collision to happen, since you can drive that collision probability to be as arbitrarily small as you like.
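A sketch of that idea with the three moduli mentioned above. The base 257 and the function name are assumptions made for the example; the tuple of three residues is used directly as the key, which by CRT is equivalent to one hash modulo the product.

```python
MODULI = (10**6 + 3, 10**6 + 33, 10**6 + 37)   # the three primes from the answer
BASE = 257                                     # assumed base, larger than any char code used


def repeated_by_hash(text, k):
    """Find length-k substrings occurring more than once using a triple rolling hash."""
    if len(text) < k:
        return set()
    top = tuple(pow(BASE, k - 1, m) for m in MODULI)   # weight of the leading character
    h = [0, 0, 0]
    for ch in text[:k]:                                # hash of the first window
        h = [(x * BASE + ord(ch)) % m for x, m in zip(h, MODULI)]
    seen = {tuple(h)}
    repeated = set()
    for i in range(k, len(text)):
        out_ch, in_ch = ord(text[i - k]), ord(text[i])
        # remove the outgoing character, shift, add the incoming character
        h = [((x - out_ch * t) * BASE + in_ch) % m
             for x, t, m in zip(h, top, MODULI)]
        key = tuple(h)
        if key in seen:
            repeated.add(text[i - k + 1:i + 1])
        else:
            seen.add(key)
    return repeated


print(repeated_by_hash("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", 10))
```

Collisions are still possible in principle; the point is only that they become extremely unlikely.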

Understanding assumptions about machine word size in analyzing computer algorithms

I am reading the book Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. In the second chapter, under "Analyzing Algorithms", it is mentioned that:
We also assume a limit on the size of each word of data. For example, when working with inputs of size n, we typically assume that integers are represented by c lg n bits for some constant c >= 1. We require c >= 1 so that each word can hold the value of n, enabling us to index the individual input elements, and we restrict c to be a constant so that the word size doesn't grow arbitrarily. (If the word size could grow arbitrarily, we could store huge amounts of data in one word and operate on it all in constant time - clearly an unrealistic scenario.)
My questions are: why do we assume that each integer is represented by c lg n bits, and how does c >= 1 allow us to index the individual input elements?
first, by lg they apparently mean log base 2, so lg n is the number of bits in n.
then what they are saying is that if they have an algorithm that takes a list of numbers (i am being more specific in my example to help make it easier to understand) like 1,2,3,...n then they assume that:
a "word" in memory is big enough to hold any of those numbers.
a "word" in memory is not big enough to hold all the numbers (in one single word, packed in somehow).
when calculating the number of "steps" in an algorithm, an operation on one "word" takes one step.
the reason they are doing this is to keep the analysis realistic (you can only store numbers up to some size in "native" types; after that you need to switch to arbitrary precision libraries) without choosing a particular example (like 32 bit integers) that might be inappropriate in some cases, or become outdated.
You need at least lg n bits to represent integers of size n, so that's a lower bound on the number of bits needed to store inputs of size n. Setting the constant c >= 1 makes it a lower bound. If the constant multiplier were less than 1, you wouldn't have enough bits to store n.
This is a simplifying step in the RAM model. It allows you to treat each individual input value as though it were accessible in a single slot (or "word") of memory, instead of worrying about complications that might arise otherwise. (Loading, storing, and copying values of different word sizes would take differing amounts of time if we used a model that allowed varying word lengths.) This is what's meant by "enabling us to index the individual input elements." Each input element of the problem is assumed to be accessible at a single address, or index (meaning it fits in one word of memory), simplifying the model.
This question was asked very long ago and the explanations really helped me, but I feel like there could still be a little more clarification about how the lg n came about. For me talking through things really helps:
Let's choose a random number in base 10, like 27. We need 5 bits to store it. Why? Because 27 is 11011 in binary. Notice that 11011 has 5 digits; each 'digit' is what we call a bit, hence 5 bits.
Think of each bit as being a slot. For binary, each of those slots can hold a 0 or 1. What's the largest number I can store with 5 bits? Well, the largest number would fill each slot: 11111
11111 = 31 = 2^5 - 1, so 5 bits can store every number from 0 to 31, which is 2^5 different values.
Generally (and I will use very explicit names for clarity):
largestNumToStore = 2 ^ numBitsNeeded - 1
Since log (base 2) is the mathematical inverse of the exponent, we get:
numBitsNeeded = log(largestNumToStore + 1)
For a number that is not of the form 2^b - 1 this does not result in an integer, so we use ceil to round the answer up. Applying this to find how many bits are needed to store the numbers 31 and 27:
log(31 + 1) = 5 bits
log(27 + 1) = 4.807354922057604, which rounds up to 5 bits
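If it helps, the bit count can be checked in a few lines of Python. The helper name bits_needed is made up for this example; it uses the equivalent form floor(log2(n)) + 1 and compares against Python's built-in int.bit_length().

```python
import math


def bits_needed(n):
    """Number of binary digits needed to write the non-negative integer n."""
    if n == 0:
        return 1                      # "0" still takes one digit
    return math.floor(math.log2(n)) + 1


for n in (27, 31, 32):
    print(n, bin(n), bits_needed(n), n.bit_length())
```

Note that 32 needs 6 bits even though log2(32) is exactly 5, which is why the formula is written in terms of n + 1 (or floor plus one) rather than a plain ceil of log2(n).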

Understanding Fusion Trees?

I stumbled across the Wikipedia page for them:
Fusion tree
And I read the class notes pdfs linked at the bottom, but it gets hand-wavy about the data structure itself and goes into a lot of detail about the sketch(x) function. I think part of my confusion is that the papers are trying to be very general, and I would like a specific example to visualize.
Is this data structure appropriate for storing data based on arbitrary 32 or 64 bit integer keys? How does it differ from a B-tree? There is one section that says it's basically a B-tree with a branching factor B = (lg n)^(1/5). For a fully populated tree with 32 bit keys, B would be 2. Does this just become a binary tree? Is this data structure intended to use much longer bit-strings as keys?
My Googling didn't turn up anything terribly useful, but I would welcome any good links on the topic. This is really just a passing curiosity, so I haven't been willing to pay for the PDFs at portal.acm.org yet.
You've asked a number of great questions here:
Is a fusion tree a good data structure for storing 32-bit or 64-bit numbers? Or is it designed to store longer bitstrings?
How does a fusion tree differ from a B-tree?
A fusion tree picks b = w^(1/5), where w is the machine word size. Does this mean that b = 2 on a 32-bit machine, and does that make it just a binary tree?
Why is so much of the discussion of a fusion tree focused on sketching?
Is there a visualization of a fusion tree available to help understand how the structure works?
I'd like to address each of these questions in turn.
Q1: What do you store in a fusion tree? Are they good for 32-bit integers?
Your first question was about what fusion trees are designed to store. The fusion tree data structure is specifically designed to store integers that fit into a single machine word. As a result, on a 32-bit machine, you'd use the fusion tree to store integers of up to 32 bits, and on a 64-bit machine you'd use a fusion tree to store integers of up to 64 bits.
Fusion trees are not designed to handle arbitrarily long bitstrings. The design of fusion trees, which we'll get to in a little bit, is based on a technique called word-level parallelism, in which individual operations on machine words (multiplications, shifts, subtractions, etc.) are performed to implicitly operate on a large collection of numbers in parallel. In order for these techniques to work correctly, the numbers being stored need to fit into individual machine words. (It is technically possible to adapt the techniques here to work for numbers that fit into a constant number of machine words, though.)
But before we go any further, I need to include a major caveat: fusion trees are of theoretical interest only. Although fusion trees at face value seem to have excellent runtime guarantees (O(log_w n) time per operation, where w is the size of the machine word), the actual implementation details are such that the hidden constant factors are enormous and a major barrier to practical adoption. The original paper on fusion trees was mostly geared toward proving that it was possible to surpass the Ω(log n) lower bound on BST operations by using word-level parallelism and without regard to wall-clock runtime costs. So in that sense, if your goal in understanding fusion trees is to use one in practice, I would recommend stopping here and searching for another data structure. On the other hand, if you're interested in seeing just how much latent power is available in humble machine words, then please read on!
Q2: How does a fusion tree differ from a regular B-tree?
At a high level, you can think of a fusion tree as a regular B-tree with some extra magic thrown in to speed up searches.
As a reminder, a B-tree of order b is a multiway search tree where, intuitively, each node stores (roughly) b keys. The B-tree is a multiway search tree, meaning that the keys in each node are stored in sorted order, and the child trees store elements that are ordered relative to those keys. For example, consider this B-tree node:
    +-----+-----+-----+-----+
    | 103 | 161 | 166 | 261 |
    +-----+-----+-----+-----+
   /      |     |     |      \
  /       |     |     |       \
 A        B     C     D        E
Here, A, B, C, D, and E are subtrees of the root node. The subtree A consists of keys strictly less than 103, since it's to the left of 103. Subtree B consists of keys between 103 and 161, since subtree B is sandwiched between 103 and 161. Similarly, subtree C consists of keys between 161 and 166, subtree D consists of keys between 166 and 261, and subtree E consists of keys greater than 261.
To perform a search in a B-tree, you begin at the root node and repeatedly ask which subtree you need to descend into to continue the search. For example, if I wanted to look up 137 in the above tree, I'd need to somehow determine that 137 resides in subtree B. There are two "natural" ways that we could do this search:
Run a linear search over the keys to find the spot where we need to go. Time: O(b), where b is the number of keys in the node.
Run a binary search over the keys to find the spot where we need to go. Time: O(log b), where b is the number of keys in the node.
Because each node in a B-tree has a branching factor of b or greater, the height of a B-tree of order b is O(log_b n). Therefore, if we use the first strategy (linear search) to find what tree to descend into, the worst-case work required for a search is O(b log_b n), since we do O(b) work per level across O(log_b n) levels. Fun fact: the quantity b log_b n is minimized when b = e, and gets progressively worse as we increase b beyond this limit.
On the other hand, if we use a binary search to find the tree to descend into, the runtime ends up being O(log b · log_b n). Using the change of base formula for logarithms, notice that
log b · log_b n = log b · (log n / log b) = log n,
so the runtime of doing lookups this way is O(log n), independent of b. This matches the time bounds of searching a regular balanced BST.
The magic of the fusion tree is in finding a way to determine which subtree to descend into in time O(1). Let that sink in for a minute - we can have multiple children per node in our B-tree, stored in sorted order, and yet we can find which two keys our element is between in time O(1)! Doing so is decidedly nontrivial and is the bulk of the magic of the fusion tree. But for now, assuming that we can do this, notice that the runtime of searching the fusion tree would be O(log_b n), since we do O(1) work at each of the O(log_b n) layers in the tree!
The question now is how to do this.
Q3: A fusion tree picks b = w^(1/5), where w is the machine word size. Does this mean that b = 2 on a 32-bit machine, and does that make it just a binary tree?
For technical reasons that will become clearer later on, a fusion tree works by choosing, as the branching parameter for the B-tree, the value b = w^(1/5), where w is the machine word size. On a 32-bit machine, that means that we'd pick
b = w^(1/5) = (2^5)^(1/5) = 2,
and on a 64-bit machine we'd pick
b = w^(1/5) = (2^6)^(1/5) = 2^(6/5) ≈ 2.30,
which we'd likely round down to 2. So does that mean that a fusion tree is just a binary tree?
The answer is "not quite." In a B-tree, each node stores between b - 1 and 2b - 1 total keys. With b = 2, that means that each node stores between 1 and 3 total keys. (In other words, our B-tree would be a 2-3-4 tree, if you're familiar with that lovely data structure). This means that we'll be branching slightly more than a regular binary search tree, but not much more.
Returning to our earlier point, fusion trees are primarily of theoretical interest. The fact that we'd pick b = 2 on a real machine and barely do better than a regular binary search tree is one of the many reasons why this is the case.
On the other hand, if we were working on, say, a machine whose word size was 32,768 bits (I'm not holding my breath on seeing one of these in my lifetime), then we'd get a branching factor of b = 8, and we might actually start seeing something that beats a regular BST.
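To make the numbers concrete, here is a tiny sketch that evaluates w^(1/5) for the word sizes discussed above. The function name branching_factor is only illustrative.

```python
def branching_factor(w):
    """Fusion tree branching parameter b = w^(1/5) for a w-bit machine word."""
    return w ** (1 / 5)


for w in (32, 64, 32768):
    b = branching_factor(w)
    print(f"w = {w:>5}: b = {b:.3f}")
```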
Q4: Why is so much of the discussion of a fusion tree focused on sketching?
As mentioned above, the "secret sauce" of the fusion tree is the ability to augment each node in the B-tree with some auxiliary information that makes it possible to efficiently (in time O(1)) determine which subtree of the B-tree to descend into. Once you have the ability to get this step working, the remainder of the data structure is basically just a regular B-tree. Consequently, it makes sense to focus extensively (exclusively?) on how this step works.
This is also, by far, the most complicated step in the process. Getting this step working requires the development of several highly nontrivial subroutines that, collectively, give the overall behavior.
The first technique that we'll need is a parallel rank operation. Let's return to the key question about our B-tree search: how do we determine which subtree to descend into? Let's look back to our B-tree node, as shown here:
    +-----+-----+-----+-----+
    | 103 | 161 | 166 | 261 |
    +-----+-----+-----+-----+
   /      |     |     |      \
  /       |     |     |       \
 T0       T1    T2    T3       T4
This is the same drawing as before, but instead of labeling the subtrees A, B, C, D, and E, I've labeled them T0, T1, T2, T3, and T4.
Let's imagine I want to search for 162. That should put me into subtree T2. One way to see this is that 162 is bigger than 161 and less than 166. But there's another perspective we can take here: we want to search T2 because 162 is greater than both 103 and 161, the two keys that come before it. Interesting - we want tree index 2, and we're bigger than two of the keys in the node. Hmmm.
Now, search for 196. That puts us in tree T3, and 196 happens to be bigger than 103, 161, and 166, a total of three keys. Interesting. What about 17? That would be in tree T0, and 17 is greater than zero of the keys.
This hints at a key strategy we're going to use to get the fusion tree to work:
To determine which subtree to descend into, we need to count how many keys our search key is greater than. (This number is called the rank of the search key.)
The key insight in fusion tree is how to do this in time O(1).
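Setting the O(1) part aside for a moment, the rank idea itself is easy to try out. This sketch uses Python's bisect module (an ordinary binary search, so O(log b), not the fusion tree technique) on the node from the drawing; the function name child_index is chosen for the example.

```python
import bisect

NODE_KEYS = [103, 161, 166, 261]   # the keys of the example node, in sorted order


def child_index(keys, search_key):
    """Rank of search_key: the number of keys it is strictly greater than,
    which is also the index of the subtree to descend into."""
    return bisect.bisect_left(keys, search_key)


for key in (162, 196, 17):
    print(key, "-> T%d" % child_index(NODE_KEYS, key))
```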
Before jumping into sketching, let's build out a key primitive that we'll need for later on. The idea is the following: suppose that you have a collection of small integers, where, here, "small" means "so small that lots of them can be packed into a single machine word." Through some very clever techniques, if you can pack multiple small integers into a machine word, you can solve the following problem in time O(1):
Parallel rank: Given a key k, which is a small integer, and a fixed collection of small integers x_1, ..., x_b, determine how many of the x_i's are less than or equal to k.
For example, we might have a bunch of 6-bit numbers, for example, 31, 41, 59, 26, and 53, and we could then execute queries like "how many of these numbers are less than or equal to 37?"
To give a brief glimpse of how this technique works, the idea is to pack all of the small integers into a single machine word, separated by zero bits. That number might look like this:
00111110101001011101100110100110101
0  31  0  41  0  59  0  26  0  53
Now, suppose we want to see how many of these numbers are less than or equal to 37. To do so, we begin by forming an integer that consists of several replicated copies of the number 37, each of which is preceded by a 1 bit. That would look like this:
11001011100101110010111001011100101
1  37  1  37  1  37  1  37  1  37
Something very cool happens if we subtract the first number from this second number. Watch this:
  11001011100101110010111001011100101      1  37  1  37  1  37  1  37  1  37
- 00111110101001011101100110100110101    - 0  31  0  41  0  59  0  26  0  53
  -----------------------------------      ---------------------------------
  10001100111100010101010010110110000      1   6  0  -4  0 -22  1  11  0 -16
  ^      ^      ^      ^      ^            ^      ^      ^      ^      ^
The bits that I've highlighted here are the extra bits that we added in to the front of each number. Notice that
if the top number is greater than or equal to the bottom number, then the bit in front of the subtraction result will be 1, and
if the top number is smaller than the bottom number, then the bit in front of the subtraction result will be 0.
To see why this is, if the top number is greater than or equal to the bottom number, then when we perform the subtraction, we'll never need to "borrow" from that extra 1 bit we put in front of the top number, so that bit will stay a 1. Otherwise, the top number is smaller, so to make the subtraction work out we have to borrow from that 1 bit, marking it as a zero. In other words, this single subtraction operation can be thought of as doing a parallel comparison between the original key and each of the small numbers. We're doing one subtraction, but, logically, it's five comparisons!
If we can count up how many of the marked bits are 1s, then we have the answer we want. This turns out to require some additional creativity to work in time O(1), but it is indeed possible.
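The subtraction trick can be tried out with Python's arbitrary-size integers standing in for the machine word. This is only a sketch: the function names are invented for the example, and the final count uses bin(...).count("1") as a stand-in, which is not the constant-time counting step mentioned above.

```python
def pack(values, bits):
    """Pack values into one integer, each in a (bits + 1)-wide field whose top bit is 0."""
    word = 0
    for v in values:
        assert 0 <= v < (1 << bits)
        word = (word << (bits + 1)) | v
    return word


def parallel_rank(values, key, bits):
    """Count how many of the values are <= key using a single subtraction."""
    field = bits + 1
    packed = pack(values, bits)
    query = 0
    flags = 0
    for i in range(len(values)):
        query = (query << field) | (1 << bits) | key   # a 1 bit, then a copy of key
        flags |= 1 << (i * field + bits)               # mask selecting the flag bits
    diff = query - packed                              # one subtraction, many comparisons
    return bin(diff & flags).count("1")                # surviving flag bits = values <= key


numbers = [31, 41, 59, 26, 53]
print(format(pack(numbers, 6), "035b"))
print(parallel_rank(numbers, 37, 6))
```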
This parallel rank operation shows that if we have a lot of really small keys - so small that we can pack them into a machine word - we could indeed go and compute the rank of our search key in time O(1), which would tell us which subtree we need to descend into. However, there's a catch - this strategy assumes that our keys are really small, but in general, we have no reason to assume this. If we're storing full 32-bit or 64-bit machine words as keys, we can't pack lots of them into a single machine word. We can fit exactly one key into a machine word!
To address this, fusion trees use another insight. Let's imagine that we pick the branching factor of our B-tree to be very small compared to the number of bits in a machine word (say, b = w^(1/5)). If you have a small number of machine words, the main insight you need is that only a few of the bits in those machine words are actually relevant for determining the ordering. For example, suppose I have the following 32-bit numbers:
A: 00110101000101000101000100000101
B: 11001000010000001000000000000000
C: 11011100101110111100010011010101
D: 11110100100001000000001000000000
Now, imagine I wanted to sort these numbers. To do so, I only really need to look at a few of the bits. For example, some of the numbers differ in their first bit (the top number A has a 0 there, and the rest have a 1). So I'll write down that I need to look at the first bit of the number. The second bit of these numbers doesn't actually help sort things - anything that differs at the second bit already differs at the first bit (do you see why?). The third bit of the number, on the other hand, does help us rank them, because numbers B, C, and D, which have the same first bit, diverge at the third bit into the groups (B, C) and D. I also would need to look at the fourth bit, which splits (B, C) apart into B and C.
In other words, to compare these numbers against one another, we'd only need to store these marked bits. If we process these bits, in order, we'd never need to look at any others:
A: 00110101000101000101000100000101
B: 11001000010000001000000000000000
C: 11011100101110111100010011010101
D: 11110100100001000000001000000000
   ^ ^^
This is the sketching step you were referring to in your question, and it's used to take a small number of large numbers and turn them into a small number of small numbers. Once we have a small number of small numbers, we can then use our parallel rank step from earlier on to do rank operations in time O(1), which is what we needed to do.
Of course, there are a lot of steps that I'm skipping over here. How do you determine which bits are "interesting" bits that we need to look at? How do you extract those bits from the numbers? If you're given a number that isn't in the group, how do you figure out how it compares against the numbers in the group, given that it might differ in other bit positions? These aren't trivial questions to answer, and they're what give rise to most of the complexity of the fusion tree.
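For the keys already in the group, one simple (and not constant-time) way to find the interesting bits is to sort the keys and take the highest bit at which each pair of neighbours differs. The sketch below does that for the four example numbers; the function names are made up, and it says nothing about the harder questions of extracting the bits in O(1) or handling keys outside the group.

```python
def interesting_bits(keys, width=32):
    """Bit positions (0 = leftmost) at which sorted neighbours first differ.
    Assumes the keys are distinct."""
    ordered = sorted(keys)
    positions = set()
    for a, b in zip(ordered, ordered[1:]):
        msb = (a ^ b).bit_length() - 1     # highest differing bit, counted from the right
        positions.add(width - 1 - msb)     # convert to a position counted from the left
    return sorted(positions)


def sketch(key, positions, width=32):
    """Concatenate the bits of key found at the given positions."""
    out = 0
    for p in positions:
        out = (out << 1) | ((key >> (width - 1 - p)) & 1)
    return out


keys = [int(s, 2) for s in (
    "00110101000101000101000100000101",   # A
    "11001000010000001000000000000000",   # B
    "11011100101110111100010011010101",   # C
    "11110100100001000000001000000000",   # D
)]
bits = interesting_bits(keys)
print(bits)                                # first, third and fourth bit
print([format(sketch(k, bits), "03b") for k in keys])
```

The sketches keep the keys in the same relative order as the full 32-bit numbers, which is what makes them usable for the parallel rank step.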
Q5: Is there a visualization of a fusion tree available to help understand how the structure works?
Yes, and no. I'll say "yes" because there are resources out there that show how the different steps work. However, I'll say "no" because I don't believe there's any one picture you can look at that will cause the whole data structure to suddenly click into focus.
I teach a course in advanced data structures and spent two 80-minute lectures building up to the fusion tree by using techniques from word-level parallelism. The discussion here is based on those lectures, which go into more depth about each step and include visualizations of the different substeps (how to compute rank in constant time, how the sketching step works, etc.), and each of those steps individually might give you a better sense for how the whole structure works. Those materials are linked here:
Part One discusses word-level parallelism, computing ranks in time O(1), building a variant of the fusion tree that works for very small integers, and computing most-significant bits in time O(1).
Part Two explores the full version of the fusion tree, introducing the basics behind the sketching step (which I call "Patricia codes" based on the connection to the Patricia trie).
To Summarize
In summary:
A fusion tree is a modification of a B-tree. The basic structure matches that of a regular B-tree, except that each node has some auxiliary information to speed up searching.
Fusion trees are purely of theoretical interest at this point. The hidden constant factors are too high and the branching factor too low to meaningfully compete with binary search trees.
Fusion trees use word-level parallelism to speed up searches, commonly by packing multiple numbers into a single machine word and using individual operations to simulate parallel processing.
The sketching step is used to reduce the number of bits in the input numbers to a point where parallel processing with a machine word is possible.
There are lecture slides detailing this in a lot more depth.
Hope this helps!
I've read (just a quick pass) the seminal paper and seems interesting. It also answers most of your questions in the first page.
You may download the paper from here
HTH!
I've read the fusion tree paper. The ideas are pretty clever, and by O notation terms he can make a case for a win.
It isn't clear to me that it is a win in practice. The constant factor matters a lot, and the chip designers work really hard to manage cheap local references.
He has to have B in his faux B-trees pretty small for real machines (B=5 for 32 bits, maybe 10 for 64 bits). That many pointers pretty much fits in a cache line. After the first cache line touch (which he can't avoid) of several hundred cycles, you can pretty much do a linear search through the keys in a few cycles per key, which means a carefully coded B-tree traditional implementation seems like it should outrun fusion trees. (I've built such B-tree code to support our program transformation system).
He claims a list of applications, but there are no comparative numbers.
Anybody have any hard evidence? (Implementations and comparisons?)
The idea behind the fusion tree is actually fairly simple. Suppose you have w-bit (say 64-bit) keys. The idea is to compress (i.e. sketch) every 64 consecutive keys into a 64-element array. The sketching function ensures a constant-time mapping between the original keys and the array index for a given group. Searching for a key then becomes searching for the group containing the key, which is O(log(n/64)).
As you can see, the main challenge is the sketching function.

Hash Functions and Tables of size of the form 2^p

While calculating the hash table bucket index from the hash code of a key, why do we avoid use of remainder after division (modulo) when the size of the array of buckets is a power of 2?
When calculating the hash, you want as much information as you can cheaply munge things into with good distribution across the entire range of bits: e.g. 32-bit unsigned integers are usually good, unless you have a lot (>3 billion) of items to store in the hash table.
It's converting the hash code into a bucket index that you're really interested in. When the number of buckets n is a power of two, all you need to do is do an AND operation between hash code h and (n-1), and the result is equal to h mod n.
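A quick way to convince yourself of that equivalence (the function name bucket_index is just for this example):

```python
def bucket_index(h, n):
    """Bucket index for hash code h in a table of n buckets, n a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return h & (n - 1)


n = 16
print(all(bucket_index(h, n) == h % n for h in range(100000)))
```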
A reason this may be bad is that the AND operation is simply discarding bits - the high-level bits - from the hash code. This may be good or bad, depending on other things. On one hand, it will be very fast, since AND is a lot faster than division (and is the usual reason why you would choose to use a power of 2 number of buckets), but on the other hand, poor hash functions may have poor entropy in the lower bits: that is, the lower bits don't change much when the data being hashed changes.
Let us say that the table size is m = 2^p.
Let k be a key.
Then, whenever we do k mod m, we will only get the last p bits of the binary representation of k. Thus, if I put in several keys that have the same last p bits, the hash function will perform VERY VERY badly as all keys will be hashed to the same slot in the table. Thus, avoid powers of 2
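A small sketch of that failure mode. The keys below were picked by hand so that their last 4 bits are identical, and the prime table size 13 is only an assumed point of comparison.

```python
def slots(keys, m):
    """Slot for each key when the hash is simply k mod m."""
    return [k % m for k in keys]


keys = [0x10, 0x20, 0x30, 0xA0]   # all end in the bits 0000
print(slots(keys, 16))            # m = 2^4: only the last 4 bits are used
print(slots(keys, 13))            # a prime m mixes in the higher bits too
```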
