Understanding Fusion Trees? - data-structures

I stumbled across the Wikipedia page for them:
Fusion tree
And I read the class notes pdfs linked at the bottom, but it gets hand-wavy about the data structure itself and goes into a lot of detail about the sketch(x) function. I think part of my confusion is that the papers are trying to be very general, and I would like a specific example to visualize.
Is this data structure appropriate for storing data based on arbitrary 32 or 64 bit integer keys? How does it differ from a B-tree? There is one section that says it's basically a B-tree with a branching factor B = (lg n)^(1/5). For a fully populated tree with 32 bit keys, B would be 2. Does this just become a binary tree? Is this data structure intended to use much longer bit-strings as keys?
My Googling didn't turn up anything terribly useful, but I would welcome any good links on the topic. This is really just a passing curiosity, so I haven't been willing to pay for the PDFs at portal.acm.org yet.

You've asked a number of great questions here:
Is a fusion tree a good data structure for storing 32-bit or 64-bit numbers? Or is it designed to store longer bitstrings?
How does a fusion tree differ from a B-tree?
A fusion tree picks b = w^(1/5), where w is the machine word size. Does this mean that b = 2 on a 32-bit machine, and does that make it just a binary tree?
Why is so much of the discussion of a fusion tree focused on sketching?
Is there a visualization of a fusion tree available to help understand how the structure works?
I'd like to address each of these questions in turn.
Q1: What do you store in a fusion tree? Are they good for 32-bit integers?
Your first question was about what fusion trees are designed to store. The fusion tree data structure is specifically designed to store integers that fit into a single machine word. As a result, on a 32-bit machine, you'd use the fusion tree to store integers of up to 32 bits, and on a 64-bit machine you'd use a fusion tree to store integers of up to 64 bits.
Fusion trees are not designed to handle arbitrarily long bitstrings. The design of fusion trees, which we'll get to in a little bit, is based on a technique called word-level parallelism, in which individual operations on machine words (multiplications, shifts, subtractions, etc.) are performed to implicitly operate on a large collection of numbers in parallel. In order for these techniques to work correctly, the numbers being stored need to fit into individual machine words. (It is technically possible to adapt the techniques here to work for numbers that fit into a constant number of machine words, though.)
But before we go any further, I need to include a major caveat: fusion trees are of theoretical interest only. Although fusion trees at face value seem to have excellent runtime guarantees (O(log_w n) time per operation, where w is the size of the machine word), the actual implementation details are such that the hidden constant factors are enormous and a major barrier to practical adoption. The original paper on fusion trees was mostly geared toward proving that it was possible to surpass the Ω(log n) lower bound on BST operations by using word-level parallelism and without regard to wall-clock runtime costs. So in that sense, if your goal in understanding fusion trees is to use one in practice, I would recommend stopping here and searching for another data structure. On the other hand, if you're interested in seeing just how much latent power is available in humble machine words, then please read on!
Q2: How does a fusion tree differ from a regular B-tree?
At a high level, you can think of a fusion tree as a regular B-tree with some extra magic thrown in to speed up searches.
As a reminder, a B-tree of order b is a multiway search tree where, intuitively, each node stores (roughly) b keys. The B-tree is a multiway search tree, meaning that the keys in each node are stored in sorted order, and the child trees store elements that are ordered relative to those keys. For example, consider this B-tree node:
+-----+-----+-----+-----+
| 103 | 161 | 166 | 261 |
+-----+-----+-----+-----+
/ | | | \
/ | | | \
A B C D E
Here, A, B, C, D, and E are subtrees of the root node. The subtree A consists of keys strictly less than 103, since it's to the left of 103. Subtree B consists of keys between 103 and 161, since subtree B is sandwiched between 103 and 161. Similarly, subtree C consists of keys between 161 and 166, subtree D consists of keys between 166 and 261, and subtree E consists of keys greater than 261.
To perform a search in a B-tree, you begin at the root node and repeatedly ask which subtree you need to descend into to continue the search. For example, if I wanted to look up 137 in the above tree, I'd need to somehow determine that 137 resides in subtree B. There are two "natural" ways that we could do this search:
Run a linear search over the keys to find the spot where we need to go. Time: O(b), where b is the number of keys in the node.
Run a binary search over the keys to find the spot where we need to go. Time: O(log b), where b is the number of keys in the node.
Because each node in a B-tree has a branching factor of b or greater, the height of a B-tree of order b is O(log_b n). Therefore, if we use the first strategy (linear search) to find what tree to descend into, the worst-case work required for a search is O(b log_b n), since we do O(b) work per level across O(log_b n) levels. Fun fact: the quantity b log_b n is minimized when b = e, and gets progressively worse as we increase b beyond this limit.
On the other hand, if we use a binary search to find the tree to descend into, the runtime ends up being O(log b · log_b n). Using the change of base formula for logarithms, notice that
log b · log_b n = log b · (log n / log b) = log n,
so the runtime of doing lookups this way is O(log n), independent of b. This matches the time bounds of searching a regular balanced BST.
The magic of the fusion tree is in finding a way to determine which subtree to descend into in time O(1). Let that sink in for a minute - we can have multiple children per node in our B-tree, stored in sorted order, and yet we can find which two keys our element is between in time O(1)! Doing so is decidedly nontrivial and is the bulk of the magic of the fusion tree. But for now, assuming that we can do this, notice that the runtime of searching the fusion tree would be O(log_b n), since we do O(1) work at each of the O(log_b n) layers in the tree!
The question now is how to do this.
Q3: A fusion tree picks b = w^(1/5), where w is the machine word size. Does this mean that b = 2 on a 32-bit machine, and does that make it just a binary tree?
For technical reasons that will become clearer later on, a fusion tree works by choosing, as the branching parameter for the B-tree, the value b = w^(1/5), where w is the machine word size. On a 32-bit machine, that means that we'd pick
b = w^(1/5) = (2^5)^(1/5) = 2,
and on a 64-bit machine we'd pick
b = w^(1/5) = (2^6)^(1/5) = 2^(6/5) ≈ 2.30,
which we'd likely round down to 2. So does that mean that a fusion tree is just a binary tree?
The answer is "not quite." In a B-tree, each node stores between b - 1 and 2b - 1 total keys. With b = 2, that means that each node stores between 1 and 3 total keys. (In other words, our B-tree would be a 2-3-4 tree, if you're familiar with that lovely data structure). This means that we'll be branching slightly more than a regular binary search tree, but not much more.
Returning to our earlier point, fusion trees are primarily of theoretical interest. The fact that we'd pick b = 2 on a real machine and barely do better than a regular binary search tree is one of the many reasons why this is the case.
On the other hand, if we were working on, say, a machine whose word size was 32,768 bits (I'm not holding my breath on seeing one of these in my lifetime), then we'd get a branching factor of b = 8, and we might actually start seeing something that beats a regular BST.
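The arithmetic above can be checked with a few lines of C. This is only a sketch: the function name fusion_branching is made up for illustration, and it simply rounds w^(1/5) down using integer arithmetic, which is the rounding this answer assumes.

```c
#include <assert.h>

/* Largest integer b with b^5 <= w, i.e. floor(w^(1/5)).
   Integer arithmetic avoids floating-point rounding at exact fifth powers. */
static unsigned fusion_branching(unsigned w)
{
    assert(w >= 1);
    unsigned b = 1;
    /* The cast makes the whole product 64-bit, so it cannot overflow. */
    while ((unsigned long long)(b + 1) * (b + 1) * (b + 1) * (b + 1) * (b + 1) <= w)
        b++;
    return b;
}
```

Calling fusion_branching with 32 or 64 gives 2, and 32768 gives 8, matching the values worked out above.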
Q4: Why is so much of the discussion of a fusion tree focused on sketching?
As mentioned above, the "secret sauce" of the fusion tree is the ability to augment each node in the B-tree with some auxiliary information that makes it possible to efficiently (in time O(1)) determine which subtree of the B-tree to descend into. Once you have the ability to get this step working, the remainder of the data structure is basically just a regular B-tree. Consequently, it makes sense to focus extensively (exclusively?) on how this step works.
This is also, by far, the most complicated step in the process. Getting this step working requires the development of several highly nontrivial subroutines that, collectively, give the overall behavior.
The first technique that we'll need is a parallel rank operation. Let's return to the key question about our B-tree search: how do we determine which subtree to descend into? Let's look back to our B-tree node, as shown here:
+-----+-----+-----+-----+
| 103 | 161 | 166 | 261 |
+-----+-----+-----+-----+
/ | | | \
/ | | | \
T0 T1 T2 T3 T4
This is the same drawing as before, but instead of labeling the subtrees A, B, C, D, and E, I've labeled them T0, T1, T2, T3, and T4.
Let's imagine I want to search for 162. That should put me into subtree T2. One way to see this is that 162 is bigger than 161 and less than 166. But there's another perspective we can take here: we want to search T2 because 162 is greater than both 103 and 161, the two keys that come before it. Interesting - we want tree index 2, and we're bigger than two of the keys in the node. Hmmm.
Now, search for 196. That puts us in tree T3, and 196 happens to be bigger than 103, 161, and 166, a total of three keys. Interesting. What about 17? That would be in tree T0, and 17 is greater than zero of the keys.
This hints at a key strategy we're going to use to get the fusion tree to work:
To determine which subtree to descend into, we need to count how many keys our search key is greater than. (This number is called the rank of the search key.)
The key insight in the fusion tree is how to do this in time O(1).
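For reference, here is the straightforward O(b) way to compute that rank, as a baseline for what the fusion tree has to beat. It is a minimal sketch: node_rank is a hypothetical helper name, and the keys are assumed to be stored sorted in a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of keys in the node that are strictly less than `query`.
   `keys` is assumed to be sorted ascending, as in a B-tree node.
   The result is also the index of the subtree to descend into. */
static size_t node_rank(const uint64_t *keys, size_t nkeys, uint64_t query)
{
    assert(keys != NULL || nkeys == 0);
    size_t rank = 0;
    while (rank < nkeys && keys[rank] < query)
        rank++;
    return rank;
}
```

With the node 103, 161, 166, 261 from the drawing, a query of 162 yields rank 2 (subtree T2), 196 yields 3, and 17 yields 0.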
Before jumping into sketching, let's build out a key primitive that we'll need for later on. The idea is the following: suppose that you have a collection of small integers, where, here, "small" means "so small that lots of them can be packed into a single machine word." Through some very clever techniques, if you can pack multiple small integers into a machine word, you can solve the following problem in time O(1):
Parallel rank: Given a key k, which is a small integer, and a fixed collection of small integers x1, ..., xb, determine how many of the xi's are less than or equal to k.
For example, we might have a bunch of 6-bit numbers, say 31, 41, 59, 26, and 53, and we could then execute queries like "how many of these numbers are less than or equal to 37?"
To give a brief glimpse of how this technique works, the idea is to pack all of the small integers into a single machine word, separated by zero bits. That number might look like this:
00111110101001011101100110100110101
0 31   0 41   0 59   0 26   0 53
Now, suppose we want to see how many of these numbers are less than or equal to 37. To do so, we begin by forming an integer that consists of several replicated copies of the number 37, each of which is preceded by a 1 bit. That would look like this:
11001011100101110010111001011100101
1 37   1 37   1 37   1 37   1 37
Something very cool happens if we subtract the first number from this second number. Watch this:
  11001011100101110010111001011100101      1 37    1 37    1 37    1 37    1 37
- 00111110101001011101100110100110101    - 0 31    0 41    0 59    0 26    0 53
  -----------------------------------      -------------------------------------
  10001100111100010101010010110110000      1 6     0 -4    0 -22   1 11    0 -16
  ^      ^      ^      ^      ^            ^       ^       ^       ^       ^
The bits that I've highlighted here are the extra bits that we added in to the front of each number. Notice that
if the top number is greater than or equal to the bottom number, then the bit in front of the subtraction result will be 1, and
if the top number is smaller than the bottom number, then the bit in front of the subtraction result will be 0.
To see why this is, if the top number is greater than or equal to the bottom number, then when we perform the subtraction, we'll never need to "borrow" from that extra 1 bit we put in front of the top number, so that bit will stay a 1. Otherwise, the top number is smaller, so to make the subtraction work out we have to borrow from that 1 bit, marking it as a zero. In other words, this single subtraction operation can be thought of as doing a parallel comparison between the original key and each of the small numbers. We're doing one subtraction, but, logically, it's five comparisons!
If we can count up how many of the marked bits are 1s, then we have the answer we want. This turns out to require some additional creativity to work in time O(1), but it is indeed possible.
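Here is a small C sketch of that subtraction trick for exactly the layout used above: five 6-bit keys in 7-bit fields inside a 64-bit word. The names pack_keys, parallel_rank, LOW_ONES and FLAG_BITS are invented for this illustration, and the final counting step uses one multiplication to add up the flag bits, which is one possible way to do the count in O(1) and not necessarily the one from the lectures.

```c
#include <assert.h>
#include <stdint.h>

#define LOW_ONES  0x10204081ULL     /* bit 0 of each 7-bit field: 2^0 + 2^7 + 2^14 + 2^21 + 2^28 */
#define FLAG_BITS (LOW_ONES << 6)   /* bit 6 (the spare top bit) of each field */

/* Pack five 6-bit keys, first key in the most significant field,
   each preceded by a 0 flag bit. */
static uint64_t pack_keys(const unsigned keys[5])
{
    uint64_t word = 0;
    for (int i = 0; i < 5; i++) {
        assert(keys[i] < 64);
        word = (word << 7) | keys[i];
    }
    return word;
}

/* How many of the packed keys are <= query, using a single subtraction. */
static unsigned parallel_rank(uint64_t packed, unsigned query)
{
    assert(query < 64);
    /* Five copies of the query, each preceded by a 1 bit. */
    uint64_t tiled = ((uint64_t)query * LOW_ONES) | FLAG_BITS;
    /* A flag bit survives the subtraction exactly when key <= query. */
    uint64_t flags = ((tiled - packed) & FLAG_BITS) >> 6;
    /* Multiplying by LOW_ONES sums the five flags into the field at bit 28. */
    return (unsigned)(((flags * LOW_ONES) >> 28) & 0x7F);
}
```

Packing 31, 41, 59, 26, 53 and asking for the rank of 37 returns 2, since only 31 and 26 are less than or equal to 37. Note that the keys do not need to be sorted for the count to come out right.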
This parallel rank operation shows that if we have a lot of really small keys - so small that we can pack them into a machine word - we could indeed go and compute the rank of our search key in time O(1), which would tell us which subtree we need to descend into. However, there's a catch - this strategy assumes that our keys are really small, but in general, we have no reason to assume this. If we're storing full 32-bit or 64-bit machine words as keys, we can't pack lots of them into a single machine word. We can fit exactly one key into a machine word!
To address this, fusion trees use another insight. Let's imagine that we pick the branching factor of our B-tree to be very small compared to the number of bits in a machine word (say, b = w^(1/5)). If you have a small number of machine words, the main insight you need is that only a few of the bits in those machine words are actually relevant for determining the ordering. For example, suppose I have the following 32-bit numbers:
A: 00110101000101000101000100000101
B: 11001000010000001000000000000000
C: 11011100101110111100010011010101
D: 11110100100001000000001000000000
Now, imagine I wanted to sort these numbers. To do so, I only really need to look at a few of the bits. For example, some of the numbers differ in their first bit (the top number A has a 0 there, and the rest have a 1). So I'll write down that I need to look at the first bit of the number. The second bit of these numbers doesn't actually help sort things - anything that differs at the second bit already differs at the first bit (do you see why?). The third bit of the number, on the other hand, does help us rank them, because numbers B, C, and D, which have the same first bit, diverge at the third bit into the groups (B, C) and D. I also would need to look at the fourth bit, which splits (B, C) apart into B and C.
In other words, to compare these numbers against one another, we'd only need to store these marked bits. If we process these bits, in order, we'd never need to look at any others:
A: 00110101000101000101000100000101
B: 11001000010000001000000000000000
C: 11011100101110111100010011010101
D: 11110100100001000000001000000000
   ^ ^^
This is the sketching step you were referring to in your question, and it's used to take a small number of large numbers and turn them into a small number of small numbers. Once we have a small number of small numbers, we can then use our parallel rank step from earlier on to do rank operations in time O(1), which is what we needed to do.
Of course, there are a lot of steps that I'm skipping over here. How do you determine which bits are "interesting" bits that we need to look at? How do you extract those bits from the numbers? If you're given a number that isn't in the group, how do you figure out how it compares against the numbers in the group, given that it might differ in other bit positions? These aren't trivial questions to answer, and they're what give rise to most of the complexity of the fusion tree.
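To make the first two of those questions concrete, here is a deliberately naive C sketch. It finds the "interesting" bits as the highest bit in which each pair of adjacent sorted keys differs, and it extracts them with a plain loop. The names highest_bit, important_bits and sketch are made up for this example, and the loops are not constant time: a real fusion tree has to do the extraction with a constant number of word operations, which is where most of the difficulty lies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Isolate the most significant set bit of x (x must be nonzero). */
static uint32_t highest_bit(uint32_t x)
{
    assert(x != 0);
    uint32_t bit = 0x80000000u;
    while (!(x & bit))
        bit >>= 1;
    return bit;
}

/* Mask of the bit positions needed to tell the sorted, distinct keys apart:
   the first (highest) bit at which each adjacent pair differs. */
static uint32_t important_bits(const uint32_t *sorted, size_t n)
{
    uint32_t mask = 0;
    for (size_t i = 1; i < n; i++) {
        assert(sorted[i - 1] < sorted[i]);
        mask |= highest_bit(sorted[i - 1] ^ sorted[i]);
    }
    return mask;
}

/* Gather the bits of x selected by mask into a small number,
   keeping them in most-significant-first order. */
static uint32_t sketch(uint32_t x, uint32_t mask)
{
    uint32_t out = 0;
    for (uint32_t bit = 0x80000000u; bit != 0; bit >>= 1)
        if (mask & bit)
            out = (out << 1) | ((x & bit) != 0);
    return out;
}
```

For the keys A, B, C and D above, important_bits selects the first, third and fourth bits, and the resulting 3-bit sketches are 3, 4, 5 and 7, which sort in the same order as the original 32-bit keys.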
Q5: Is there a visualization of a fusion tree available to help understand how the structure works?
Yes, and no. I'll say "yes" because there are resources out there that show how the different steps work. However, I'll say "no" because I don't believe there's any one picture you can look at that will cause the whole data structure to suddenly click into focus.
I teach a course in advanced data structures and spent two 80-minute lectures building up to the fusion tree by using techniques from word-level parallelism. The discussion here is based on those lectures, which go into more depth about each step and include visualizations of the different substeps (how to compute rank in constant time, how the sketching step works, etc.), and each of those steps individually might give you a better sense for how the whole structure works. Those materials are linked here:
Part One discusses word-level parallelism, computing ranks in time O(1), building a variant of the fusion tree that works for very small integers, and computing most-significant bits in time O(1).
Part Two explores the full version of the fusion tree, introducing the basics behind the sketching step (which I call "Patricia codes" based on the connection to the Patricia trie).
To Summarize
In summary:
A fusion tree is a modification of a B-tree. The basic structure matches that of a regular B-tree, except that each node has some auxiliary information to speed up searching.
Fusion trees are purely of theoretical interest at this point. The hidden constant factors are too high and the branching factor too low to meaningfully compete with binary search trees.
Fusion trees use word-level parallelism to speed up searches, commonly by packing multiple numbers into a single machine word and using individual operations to simulate parallel processing.
The sketching step is used to reduce the number of bits in the input numbers to a point where parallel processing with a machine word is possible.
There are lecture slides detailing this in a lot more depth.
Hope this helps!

I've read (just a quick pass) the seminal paper and seems interesting. It also answers most of your questions in the first page.
You may download the paper from here
HTH!

I've read the fusion tree paper. The ideas are pretty clever, and by O notation terms he can make a case for a win.
It isn't clear to me that it is a win in practice. The constant factor matters a lot, and the chip designers work really hard to manage cheap local references.
He has to have B in his faux B-trees pretty small for real machines (B=5 for 32 bits, maybe 10 for 64 bits). That many pointers pretty much fits in a cache line. After the first cache line touch (which he can't avoid) of several hundred cycles, you can pretty much do a linear search through the keys in a few cycles per key, which means a carefully coded B-tree traditional implementation seems like it should outrun fusion trees. (I've built such B-tree code to support our program transformation system).
He claims a list of applications, but there are no comparative numbers.
Anybody have any hard evidence? (Implementations and comparisons?)

The idea behind the fusion tree is actually fairly simple. Suppose you have w-bit (say 64-bit) keys. The idea is to compress (i.e. sketch) every 64 consecutive keys into a 64-element array. The sketching function ensures a constant-time mapping between the original keys and the array index for a given group. Then searching for the key becomes searching for the group containing the key, which is O(log(n/64)).
As you can see, the main challenge is the sketching function.

Related

How to find all the binary string of length 9, having 4 ones and rest zeroes and hamming distance of 4 (if we consider any two strings) [duplicate]

Problem:
Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.
Actual data structure to hold the list is open, performance requirements dictate an in-memory solution, cost to build the data structure is secondary, low cost to query the data structure is critical.
Example:
For a maximum Hamming Distance of 1 (values typically will be quite small)
And input:
00001000100000000000000001111101
The values:
01001000100000000000000001111101
00001000100000000010000001111101
should match because there is only 1 position in which the bits are different.
11001000100000000010000001111101
should not match because 3 bit positions are different.
My thoughts so far:
For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.
If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1?
Question: What do we know about the Hamming distance d(x,y)?
Answer:
It is non-negative: d(x,y) ≥ 0
It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
It is symmetric: d(x,y) = d(y,x)
It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)
Question: Why do we care?
Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.
Metric tree (Wikipedia)
BK-tree (Wikipedia)
M-tree (Wikipedia)
VP-tree (Wikipedia)
Cover tree (Wikipedia)
You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.
Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:
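/* Hamming distance between two words: XOR marks the differing bits,
   and the GCC popcount builtin counts them. */
static inline int distance(unsigned x, unsigned y)
{
    return __builtin_popcount(x ^ y);
}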
If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.
Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
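The "usual mask/shift/add code" referred to here is typically some variant of the following. This is the well-known SWAR bit count and not necessarily the exact version that was benchmarked; the names popcount32, hamming32 and within_distance are chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits by summing adjacent 1-, 2- and 4-bit groups in parallel. */
static unsigned popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit sums  */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit sums  */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    return (v * 0x01010101u) >> 24;                   /* add 4 bytes */
}

/* Hamming distance: XOR marks the differing bits, popcount counts them. */
static unsigned hamming32(uint32_t x, uint32_t y)
{
    return popcount32(x ^ y);
}

/* Nonzero when x and y differ in at most max_dist bit positions. */
static int within_distance(uint32_t x, uint32_t y, unsigned max_dist)
{
    assert(max_dist <= 32);
    return hamming32(x, y) <= max_dist;
}
```

Which variant is faster depends on the compiler flags and CPU, so it is worth measuring both on the target machine.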
Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.
Database: 100M pseudorandom integers
Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
Results: Average # of query hits (very approximate)
Speed: Number of queries per second
Coverage: Average percentage of database examined per query
               -- BK Tree --    -- VP Tree --    -- Linear --
Dist  Results  Speed   Cov      Speed   Cov      Speed   Cov
1     0.90     3800    0.048%   4200    0.048%
2     11       300     0.68%    330     0.65%
3     130      56      3.8%     63      3.4%
4     970      18      12%      22      10%
5     5700     8.5     26%      10      22%
6     2.6e4    5.2     42%      6.0     37%
7     1.1e5    3.7     60%      4.1     54%
8     3.5e5    3.0     74%      3.2     70%
9     1.0e6    2.6     85%      2.7     82%
10    2.5e6    2.3     91%      2.4     90%
any                                              2.2     100%
In your comment, you mentioned:
I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.
I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.
A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
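For readers who have not seen one, here is a compact BK-tree over 32-bit values with the Hamming distance as the metric. It is a minimal sketch and not the code that was profiled above: the names BKNode, bk_insert and bk_query are invented here, every node holds a single value (no flat-array leaves), and freeing the tree is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct BKNode {
    uint32_t value;
    struct BKNode *child[33];   /* child[d]: values at distance d (1..32) from `value` */
} BKNode;

static unsigned hamming(uint32_t a, uint32_t b)
{
    uint32_t v = a ^ b;
    unsigned n = 0;
    while (v) { v &= v - 1; n++; }   /* clear the lowest set bit */
    return n;
}

/* Insert `value`; duplicates are ignored. Returns the (possibly new) root. */
static BKNode *bk_insert(BKNode *root, uint32_t value)
{
    BKNode **slot = &root;
    while (*slot) {
        unsigned d = hamming((*slot)->value, value);
        if (d == 0)
            return root;
        slot = &(*slot)->child[d];
    }
    *slot = calloc(1, sizeof **slot);
    assert(*slot != NULL);
    (*slot)->value = value;
    return root;
}

/* Append every stored value within `radius` of `query` to out[] (up to cap
   entries) and return the new count. By the triangle inequality, only the
   children whose distance index lies in [d - radius, d + radius] can match. */
static size_t bk_query(const BKNode *node, uint32_t query, unsigned radius,
                       uint32_t *out, size_t count, size_t cap)
{
    assert(radius <= 32);
    if (!node)
        return count;
    unsigned d = hamming(node->value, query);
    if (d <= radius && count < cap)
        out[count++] = node->value;
    unsigned lo = d > radius ? d - radius : 1;
    unsigned hi = d + radius < 32 ? d + radius : 32;
    for (unsigned i = lo; i <= hi; i++)
        count = bk_query(node->child[i], query, radius, out, count, cap);
    return count;
}
```

Inserting the three example values from the question and querying around 00001000100000000000000001111101 with radius 1 returns the two values that differ in one bit and skips the one that differs in three.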
I wrote a solution where I represent the input numbers in a bitset of 2^32 bits, so I can check in O(1) whether a certain number is in the input. Then for a queried number and maximum distance, I recursively generate all numbers within that distance and check them against the bitset.
For example, for maximum distance 5, this is 242825 numbers (the sum over d = 0 to 5 of (32 choose d)). For comparison, Dietrich Epp's VP-tree solution for example goes through 22% of the 100 million numbers, i.e., through 22 million numbers.
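The approach can be sketched in C as follows. This is not the linked implementation: the KeySet type and the keyset_new, keyset_add, keyset_has and search functions are names invented for this example, and the key width is a parameter so the idea can be tried on a small universe (a width of 32 needs 2^32 bits, i.e. 512 MiB).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t *words;   /* one bit per possible key */
    unsigned width;    /* key width in bits, 1..32 */
} KeySet;

static KeySet keyset_new(unsigned width)
{
    assert(width >= 1 && width <= 32);
    uint64_t nbits = 1ULL << width;
    KeySet s = { calloc((size_t)((nbits + 63) / 64), sizeof(uint64_t)), width };
    assert(s.words != NULL);
    return s;
}

static void keyset_add(KeySet *s, uint32_t key)
{
    assert(((uint64_t)key >> s->width) == 0);   /* key must fit in `width` bits */
    s->words[key >> 6] |= 1ULL << (key & 63);
}

static int keyset_has(const KeySet *s, uint32_t key)
{
    return (int)((s->words[key >> 6] >> (key & 63)) & 1);
}

/* Report `value` if present, then flip up to `budget` more bits at positions
   >= `from`. Flipping positions in increasing order visits every variant
   exactly once. `value` must fit in the set's key width. */
static size_t search(const KeySet *s, uint32_t value, unsigned from, unsigned budget,
                     uint32_t *out, size_t count, size_t cap)
{
    if (keyset_has(s, value) && count < cap)
        out[count++] = value;
    if (budget == 0)
        return count;
    for (unsigned pos = from; pos < s->width; pos++)
        count = search(s, value ^ (1u << pos), pos + 1, budget - 1, out, count, cap);
    return count;
}
```

A query is started as search(&set, query, 0, max_distance, out, 0, cap). The work grows with the number of variants rather than with the size of the input list, which is why this wins for small distances and loses for large ones.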
I used Dietrich's code/solutions as the basis to add my solution and compare it with his. Here are speeds, in queries per second, for maximum distances up to 10:
Dist     BK Tree     VP Tree     Bitset         Linear
1        10,133.83   15,773.69   1,905,202.76   4.73
2        677.78      1,006.95    218,624.08     4.70
3        113.14      173.15      27,022.32      4.76
4        34.06       54.13       4,239.28       4.75
5        15.21       23.81       932.18         4.79
6        8.96        13.23       236.09         4.78
7        6.52        8.37        69.18          4.77
8        5.11        6.15        23.76          4.68
9        4.39        4.83        9.01           4.47
10       3.69        3.94        2.82           4.13
Prepare  4.1s        21.0s       1.52s          0.13s
(Prepare: times for building the data structure before the queries)
For small distances, the bitset solution is by far the fastest of the four. Question author Eric commented below that the largest distance of interest would probably be 4-5. Naturally, my bitset solution becomes slower for larger distances, even slower than the linear search (for distance 32, it would go through 2^32 numbers). But for distance 9 it still easily leads.
I also modified Dietrich's testing. Each of the above results is for letting the algorithm solve at least three queries and as many queries as it can in about 15 seconds (I do rounds with 1, 2, 4, 8, 16, etc queries, until at least 10 seconds have passed in total). That's fairly stable, I even get similar numbers for just 1 second.
My CPU is an i7-6700. My code (based on Dietrich's) is here (ignore the documentation there at least for now, not sure what to do about that, but the tree.c contains all the code and my test.bat shows how I compiled and ran (I used the flags from Dietrich's Makefile)). Shortcut to my solution.
One caveat: My query results contain numbers only once, so if the input list contains duplicate numbers, that may or may not be desired. In question author Eric's case, there were no duplicates (see comment below). In any case, this solution might be good for people who either have no duplicates in the input or don't want or need duplicates in the query results (I think it's likely that the pure query results are only a means to an end and then some other code turns the numbers into something else, for example a map mapping a number to a list of files whose hash is that number).
A common approach (at least common to me) is to divide your bit string into several chunks and query on these chunks for an exact match as a pre-filter step. If you work with files, you create as many files as you have chunks (e.g. 4 here) with each chunk permuted in front and then sort the files. You can use a binary search and you can even expand your search above and below a matching chunk for bonus.
You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.
So to recap: say you have a bunch of 32-bit strings in a DB or in files, and you want to find every hash that is within a Hamming distance of 3 bits or less of your "query" bit string:
create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4. Or, if you use files, create four files, each being a permutation of the slices with a different "islice" at the front of each "row".
slice your query bit string the same way into qslice 1 to 4.
query this table for rows where qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. This gives you every string that shares at least one slice with the query string, which includes every string within 3 bits (4 - 1) of it: three differing bits can touch at most three of the four slices. If using files, do a binary search in each of the four permuted files for the same results.
for each returned bit string, compute the exact Hamming distance pair-wise with your query bit string (reconstructing the index-side bit strings from the four slices, either from the DB or from a permuted file).
The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table and is very efficient in practice.
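The two-stage test behind these steps can be sketched in C. This only shows the predicate, not the index: in the scheme described above, the slice match is found through a table or sorted-file lookup rather than by testing every candidate. The names shares_slice, hamming and near_match are hypothetical. With four slices, any value within 3 bits of the query must agree with it on at least one whole slice, so the prefilter cannot drop a true match.

```c
#include <assert.h>
#include <stdint.h>

/* True when at least one of the four 8-bit slices of a and b is identical. */
static int shares_slice(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    return (diff & 0xFF000000u) == 0 || (diff & 0x00FF0000u) == 0 ||
           (diff & 0x0000FF00u) == 0 || (diff & 0x000000FFu) == 0;
}

static unsigned hamming(uint32_t a, uint32_t b)
{
    uint32_t v = a ^ b;
    unsigned n = 0;
    while (v) { v &= v - 1; n++; }   /* clear the lowest set bit */
    return n;
}

/* Cheap slice prefilter first, exact distance check second. */
static int near_match(uint32_t candidate, uint32_t query, unsigned max_dist)
{
    assert(max_dist <= 3);   /* four slices only guarantee recall up to 3 bits */
    return shares_slice(candidate, query) && hamming(candidate, query) <= max_dist;
}
```

For larger distances you would need more (and therefore narrower) slices, at the cost of more candidates passing the prefilter.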
Furthermore, it is easy to shard the files in smaller files as need for more speed using parallelism.
Now of course in your case, you are looking for a self-join of sorts, that is, all the values that are within some distance of each other. The same approach still works IMHO, though you will have to expand up and down from a starting point for permutations (using files or lists) that share the starting chunk and compute the Hamming distance for the resulting cluster.
If running in memory instead of files, your data set of 100M 32-bit strings would be in the range of 4 GB. Hence the four permuted lists may need about 16 GB+ of RAM. Though I get excellent results with memory-mapped files instead, and much less RAM, for similar size datasets.
There are open source implementations available. The best in the space is IMHO the one done for Simhash by Moz, C++ but designed for 64 bits strings and not 32 bits.
This bounded Hamming distance approach was first described, AFAIK, by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:
APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE
[...]
Given bit vectors consisting of d bits each, we choose
N = O(n^(1/(1+ε))) random permutations of the bits. For each
random permutation σ, we maintain a sorted order O_σ of
the bit vectors, in lexicographic order of the bits permuted
by σ. Given a query bit vector q, we find the approximate
nearest neighbor by doing the following:
For each permutation σ, we perform a binary search on O_σ to locate the
two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of
the sorted orders O_σ examining elements above and below
the position returned by the binary search in order of the
length of the longest prefix that matches q.
Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":
3.3 The Results for Algorithm C
We partitioned the bit string of each page into 12 non-overlapping 4-byte pieces,
creating 20B pieces, and computed the C-similarity of all pages that had at least one
piece in common. This approach is guaranteed to find all
pairs of pages with difference up to 11, i.e., C-similarity 373,
but might miss some for larger differences.
This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:
THE HAMMING DISTANCE PROBLEM
Definition: Given a collection of f-bit fingerprints and a
query fingerprint F, identify whether an existing fingerprint
differs from F in at most k bits. (In the batch-mode version
of the above problem, we have a set of query fingerprints
instead of a single query fingerprint)
[...]
Intuition: Consider a sorted table of 2^d f-bit truly random fingerprints. Focus on just the most significant d bits
in the table. A listing of these d-bit numbers amounts to
“almost a counter” in the sense that (a) quite a few 2^d bit-combinations
exist, and (b) very few d-bit combinations are
duplicated. On the other hand, the least significant f − d
bits are “almost random”.
Now choose d′ such that |d′ − d| is a small integer. Since
the table is sorted, a single probe suffices to identify all fingerprints which match F in d′ most significant bit-positions.
Since |d′ − d| is small, the number of such matches is also
expected to be small. For each matching fingerprint, we can
easily figure out if it differs from F in at most k bit-positions
or not (these differences would naturally be restricted to the
f − d′ least-significant bit-positions).
The procedure described above helps us locate an existing
fingerprint that differs from F in k bit-positions, all of which
are restricted to be among the least significant f − d′ bits of
F. This takes care of a fair number of cases. To cover all
the cases, it suffices to build a small number of additional
sorted tables, as formally outlined in the next Section.
Note: I posted a similar answer to a related DB-only question
You could pre-compute every possible variation of your original list within the specified hamming distance, and store it in a bloom filter. This gives you a fast "NO" but not necessarily a clear answer about "YES."
For YES, store a list of all the original values associated with each position in the bloom filter, and go through them one at a time. Optimize the size of your bloom filter for speed / memory trade-offs.
Not sure if it all works exactly, but seems like a good approach if you've got runtime RAM to burn and are willing to spend a very long time in pre-computation.
How about sorting the list and then doing a binary search in that sorted list on the different possible values within your Hamming Distance?
One possible approach to solve this problem is using a Disjoint-set data structure. The idea is merge list members with Hamming distance <= k in the same set. Here is the outline of the algorithm:
For each list member calculate every possible value with Hamming distance <= k. For k=1, there are 32 values (for 32-bit values). For k=2, 32 + 32*31/2 values.
For each calculated value, test if it is in the original input. You can use an array with size 2^32 or a hash map to do this check.
If the value is in the original input, do a "union" operation with the list member.
Keep the number of union operations executed in a variable.
You start the algorithm with N disjoint sets (where N is the number of elements in the input). Each time you execute a union operation, you decrease by 1 the number of disjoint sets. When the algorithm terminates, the disjoint-set data structure will have all the values with Hamming distance <= k grouped in disjoint sets. This disjoint-set data structure can be calculated in almost linear time.
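The outline above can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: a dict stands in for the 2^32-entry array or hash map, and the names neighbors, find and group_by_hamming are made up for this example.

```python
from itertools import combinations

def neighbors(value, k, bits=32):
    """Yield every value whose Hamming distance from `value` is 1..k."""
    for d in range(1, k + 1):
        for positions in combinations(range(bits), d):
            mask = 0
            for p in positions:
                mask |= 1 << p
            yield value ^ mask

def find(parent, x):
    """Return the representative of x's set (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_by_hamming(values, k, bits=32):
    """Merge values within distance k; return (parent map, number of sets)."""
    parent = {v: v for v in values}   # hash map stands in for the 2^32 array
    sets = len(parent)
    for v in parent:
        for n in neighbors(v, k, bits):
            if n in parent:
                a, b = find(parent, v), find(parent, n)
                if a != b:            # a real union: one fewer disjoint set
                    parent[a] = b
                    sets -= 1
    return parent, sets
```

Note that the sets are connected components: two members of the same set can be more than k bits apart if they are linked through intermediate values.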
Here's a simple idea: do a byte-wise radix sort of the 100m input integers, most significant byte first, keeping track of bucket boundaries on the first three levels in some external structure.
To query, start with a distance budget of d and your input word w. For each bucket in the top level with byte value b, calculate the Hamming distance d_0 between b and the high byte of w. Recursively search that bucket with a budget of d - d_0: that is, for each byte value b', let d_1 be the Hamming distance between b' and the second byte of w. Recursively search into the third layer with a budget of d - d_0 - d_1, and so on.
Note that the buckets form a tree. Whenever your budget becomes negative, stop searching that subtree. If you recursively descend into a leaf without blowing your distance budget, that leaf value should be part of the output.
Here's one way to represent the external bucket boundary structure: have an array of length 16_777_216 (= (2**8)**3 = 2**24), where the element at index i is the starting index of the bucket containing values in range [256*i, 256*i + 255]. To find the index one beyond the end of that bucket, look up at index i+1 (or use the end of the array for i + 1 = 2**24).
Memory budget is 100m * 4 bytes per word = 400 MB for the inputs, and 2**24 * 4 bytes per address = 64 MiB for the indexing structure, or just shy of half a gig in total. The indexing structure is roughly a 17% overhead on the raw data. Of course, once you've constructed the indexing structure you only need to store the lowest byte of each input word, since the other three are implicit in the index into the indexing structure, for a total of ~(64 + 100) MB.
If your input is not uniformly distributed, you could permute the bits of your input words with a (single, universally shared) permutation which puts all the entropy towards the top of the tree. That way, the first level of pruning will eliminate larger chunks of the search space.
I tried some experiments, and this performs about as well as linear search, sometimes even worse. So much for this fancy idea. Oh well, at least it's memory efficient.

Binary search for no uniform distribution

Binary search is highly efficient for uniform distributions. Each member of your list has equal 'hit' probability. That's why you try the center each time.
Is there an efficient algorithm for non-uniform distributions, e.g. a distribution following a 1/x distribution?
There's a deep connection between binary search and binary trees - binary tree is basically a "precalculated" binary search where the cutting points are decided by the structure of the tree, rather than being chosen as the search runs. And as it turns out, dealing with probability "weights" for each key is sometimes done with binary trees.
One reason is because it's a fairly normal binary search tree but known in advance, complete with knowledge of the query probabilities.
Niklaus Wirth covered this in his book "Algorithms and Data Structures", in a few variants (one for Pascal, one for Modula 2, one for Oberon), at least one of which is available for download from his web site.
Binary trees aren't always binary search trees, though, and one use of a binary tree is to derive a Huffman compression code.
Either way, the binary tree is constructed by starting with the leaves separate and, at each step, joining the two least likely subtrees into a larger subtree until there's only one subtree left. To efficiently pick the two least likely subtrees at each step, a priority queue data structure is used - perhaps a binary heap.
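As a rough illustration of that joining step, here is a minimal Python sketch using heapq as the priority queue. It is the Huffman-style construction, so leaves are not kept in key order and the result is a code tree rather than a search tree. The names build_tree and depths are made up for this example, and keys are assumed not to be tuples.

```python
import heapq
from itertools import count

def build_tree(weights):
    """weights: dict key -> weight (non-empty). Returns nested (left, right) tuples."""
    tiebreak = count()  # avoids comparing subtrees when weights are equal
    heap = [(w, next(tiebreak), key) for key, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Join the two least likely subtrees into one larger subtree.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    return heap[0][2]

def depths(tree, depth=0):
    """Map each leaf key to its depth in the tree."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    result = {}
    for child in tree:
        result.update(depths(child, depth + 1))
    return result
```

Likely keys end up near the root and unlikely ones deeper, which is the effect the weighting is after.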
A binary tree that's built once then never modified can have a number of uses, but one that can be efficiently updated is even more useful. There are some weight-balanced binary tree data structures out there, but I'm not familiar with them. Beware - the term "weight balanced" is commonly used where each node always has weight 1, but subtree weights are approximately balanced. Some of these may be adaptable for varied node weights, but I don't know for certain.
Anyway, for a binary search in an array, the problem is that it's possible to use an arbitrary probability distribution, but inefficient. For example, you could have a running-total-of-weights array. For each iteration of your binary search, you want to determine the half-way-through-the-probability distribution point, so you determine the value for that then search the running-total-of-weights array. You get the perfectly weight-balanced next choice for your main binary search, but you had to do a complete binary search into your running total array to do it.
The principle works, however, if you can determine that weighted mid-point without searching for a known probability distribution. The principle is the same - you need the integral of your probability distribution (replacing the running total array) and when you need a mid-point, you choose it to get an exact centre value for the integral. That's more an algebra issue than a programming issue.
One problem with a weighted binary search like this is that the worst-case performance is worse - usually by constant factors but, if the distribution is skewed enough, you may end up with effectively a linear search. If your assumed distribution is correct, the average-case performance is improved despite the occasional slow search, but if your assumed distribution is wrong you could pay for that when many searches are for items that are meant to be unlikely according to that distribution. In the binary tree form, the "unlikely" nodes are further from the root than they would be in a simply balanced (flat probability distribution assumed) binary tree.
A flat probability distribution assumption works very well even when it's completely wrong - the worst case is good, and the best and average cases must be at least that good by definition. The further you move from a flat distribution, the worse things can be if actual query probabilities turn out to be very different from your assumptions.
Let me make it precise. What you want for binary search is:
Given array A which is sorted, but has a non-uniform distribution
Given left & right index L & R of search range
Want to search for a value X in A
To apply binary search, we want to find the index M in [L,R]
as the next position to look at.
Where the value X should have equal chances to be in either range [L,M-1] or [M+1,R]
In general, you of course want to pick M where you think X value should be in A.
Because even if you miss, half the total 'chance' would be eliminated.
So it seems to me you have some expectation about distribution.
If you could tell us what exactly do you mean by '1/x distribution', then
maybe someone here can help build on my suggestion for you.
Let me give a worked example.
I'll use a similar interpretation of '1/x distribution' as @Leonid Volnitsky
Here is some Python code that generates the input array A
from random import uniform
# Generating input
a,b = 10,20
A = [ 1.0/uniform(a,b) for i in range(10) ]
A.sort()
# example input (rounded)
# A = [0.0513, 0.0552, 0.0562, 0.0574, 0.0576, 0.0602, 0.0616, 0.0721, 0.0728, 0.0880]
Let's assume the value to search for is:
X = 0.0553
Then the estimated index of X is:
= total number of items * cumulative probability distribution up to X
= length(A) * P(x <= X)
So how do we calculate P(x <= X)?
In this case it is simple.
We reverse X back to the value between [a,b] which we will call
X' = 1/X ~ 18
Hence
P(x <= X) = (b-X')/(b-a)
= (20-18)/(20-10)
= 2/10
So the expected position of X is:
10*(2/10) = 2
Well, and that's pretty damn accurate!
Repeating the process to predict where X is in each given section of A requires some more work. But I hope this sufficiently illustrates my idea.
I know this might not seem like a binary search anymore
if you can get that close to the answer in just one step.
But admit it, this is what you can do if you know the distribution of input array.
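For what it's worth, the estimate from the worked example can be written as a small function. This is only a sketch for this particular setup (values 1/u with u uniform in [a, b]); the name estimated_index and the clamping are my own additions.

```python
def estimated_index(x, n, a, b):
    """Estimate where x sits in a sorted list of n values 1/u, u ~ uniform(a, b)."""
    x_inv = 1.0 / x                    # map back into [a, b]
    p = (b - x_inv) / (b - a)          # P(value <= x), because 1/u is decreasing
    p = min(1.0, max(0.0, p))          # clamp queries outside [1/b, 1/a]
    return min(n - 1, round(p * n))

# The rounded example array from above.
A = [0.0513, 0.0552, 0.0562, 0.0574, 0.0576, 0.0602, 0.0616, 0.0721, 0.0728, 0.0880]
guess = estimated_index(0.0553, len(A), 10, 20)
```

From the guessed position you would then probe A[guess] and narrow the range to the left or right, as in an ordinary binary search.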
The purpose of a binary search is that, for an array that is sorted, every time you halve the array you are minimizing the worst case, e.g. the worst possible number of checks you can do is log2(entries). If you do some kind of an 'uneven' binary search, where you divide the array into a smaller and a larger part, then if the element is always in the larger part you can have worse worst-case behaviour. So, I think binary search would still be the best algorithm to use regardless of expected distribution, just because it has the best worst-case behaviour.
You have a vector of entries, say [x1, x2, ..., xN], and you're aware of the fact that the distribution of the queries is given with probability 1/x, on the vector you have. This means your queries will take place with that distribution, i.e., on each consult, you'll take element xN with higher probability.
This causes your binary search tree to be balanced considering your labels, but not enforcing any policy on the search. A possible change on this policy would be to relax the constraint of a balanced binary search tree -- smaller to the left of the parent node, greater to the right --, and actually choosing the parent nodes as the ones with higher probabilities, and their child nodes as the two most probable elements.
Notice this is not a binary search tree, as you are not dividing your search space by two in every step, but rather a rebalanced tree, with respect to your search pattern distribution. This means your worst case of search may reach O(N). For example, having v = [10, 20, 30, 40, 50, 60]:
      30
     /  \
   20    50
  /     /  \
10    40    60
Which can be reordered, or, rebalanced, using your function f(x) = 1 / x:
f([10, 20, 30, 40, 50, 60]) = [0.100, 0.050, 0.033, 0.025, 0.020, 0.016]
sort(v, f(v)) = [10, 20, 30, 40, 50, 60]
Into a new search tree, that looks like:
  10            -------------> the most probable of being taken
 /  \             leaving v = [[20, 30], [40, 50, 60]]
20    30        ---------> the most probable of being taken
     /  \         leaving v = [[40, 50], [60]]
   40    50     -------> the most probable of being taken
        /         leaving v = [[60]]
      60
If you search for 10, you only need one comparison, but if you're looking for 60, you'll perform O(N) comparisons, which does not qualify this as a binary search. As pointed out by @Steve314, the farther you go from a fully balanced tree, the worse your worst case of search will be.
I will assume from your description:
X is uniformly distributed
Y=1/X is your data which you want to search and it is stored in sorted table
given value y, you need to binary search it in the above table
Binary search usually uses the value in the center of the range (median). For a uniform distribution it is possible to speed up the search by knowing approximately where in the table we need to look for the searched value.
For example if we have uniformly distributed values in [0,1] range and query is for 0.25, it is best to look not in center of range but in 1st quarter of the range.
To use the same technique for 1/X data, store in table not Y but inverse 1/Y. Search not for y but for inverse value 1/y.
Unweighted binary search isn't even optimal for uniformly distributed keys in expected terms, but it is in worst case terms.
The proportionally weighted binary search (which I have been using for decades) does what you want for uniform data, and by applying an implicit or explicit transform for other distributions. The sorted hash table is closely related (and I've known about this for decades but never bothered to try it).
In this discussion I will assume that the data is uniformly selected from 1..N and in an array of size N indexed by 1..N. If it has a different distribution, e.g. a Zipfian distribution where the value is proportional to 1/index, you can apply an inverse function to flatten the distribution, or the Fisher Transform will often help (see Wikipedia).
Initially you have 1..N as the bounds, but in fact you may know the actual Min..Max. In any case we will assume we always have a closed interval [Min,Max] for the index range [L..R] we are currently searching, and initially this is O(N).
We are looking for key K and want index I so that
[I-R]/[K-Max]=[L-I]/[Min-K]=[L-R]/[Min-Max], i.e. I = [R-L]/[Max-Min]*[K-Min] + L.
Round so that the smaller partition gets larger rather than smaller (to help worst case). The expected absolute and root mean square error is <√[R-L] (based on a Poisson/Skellam or a Random Walk model - see Wikipedia). The expected number of steps is thus O(loglogN).
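A minimal Python sketch of the proportional probe may help here. It assumes integer keys in a sorted list and places each probe at the index proportional to where the key falls between the values at the current bounds, i.e. I = L + (R - L) * (K - Min) / (Max - Min). The function name is made up, and it uses plain floor rounding rather than the rounding rule described above.

```python
def interpolation_search(keys, target):
    """Return an index of `target` in sorted integer `keys`, or -1 if absent."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[lo] == keys[hi]:
            return lo                  # all remaining keys equal the target
        # Proportional probe: I = L + (R - L) * (K - Min) / (Max - Min)
        mid = lo + (hi - lo) * (target - keys[lo]) // (keys[hi] - keys[lo])
        if keys[mid] == target:
            return mid
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

On evenly spread keys this usually lands on or next to the target in the first probe. On something like [1, 2, 4, 8, 16, 1024] it creeps forward one slot at a time, which is the worst case the safeguards discussed next are meant to bound.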
The worst case can be constrained to be O(logN) in several ways. First we can decide what constant we regard as acceptable, perhaps requiring that the number of steps be at most c·logN for some constant c > 1. Proceeding for loglogN steps as above, and then using halving, will achieve this for any such c.
Alternatively we can modify the standard base b=B=2 of the logarithm so b>2. Suppose we take b=8; then effectively c~b/B. We can then modify the rounding above so that at step k the largest partition must be at most N*b^-k, viz. keep track of the size expected if we eliminate 1/b from consideration each step, which leads to worst case b/2 lgN. This will however bring our expected case back to O(log N) as we are only allowed to reduce the small partition by 1/b each time. We can restore the O(loglog N) expectation by using simple uprounding of the small partition for loglogN steps before applying the restricted rounding. This is appropriate because within a burst expected to be local to a particular value, the distribution is approximately uniform (that is, for any smooth distribution function, e.g. in this case Skellam, any sufficiently small segment is approximately linear with slope given by its derivative at the centre of the segment).
As for the sorted hash, I thought I read about this in Knuth decades ago, but can't find the reference. The technique involves pushing rather than probing - (possibly weighted binary) search to find the right place or a gap then pushing aside to make room as needed, and the hash function must respect the ordering. This pushing can wrap around and so a second pass through the table is needed to pick them all up - it is useful to track Min and Max and their indexes (to get forward or reverse ordered listing start at one and track cyclically to the other; they can then also be used instead of 1 and N as initial brackets for the search as above; otherwise 1 and N can be used as surrogates).
If the load factor alpha is close to 1, then insertion is expected O(√N) for expected O(√N) items, which still amortizes to O(1) on average. This cost is expected to decrease exponentially with alpha - I believe (under Poisson assumptions) that μ ~ σ ~ √[Nexp(α)].
The above proportionally weighted binary search can be used to improve on the initial probe.

Given a flat file of IP Ranges and mappings, find a city given an IP

This is the question:
Given a flat text file that contains a range of IP addresses that map
to a location (e.g.
192.168.0.0-192.168.0.255 = Boston, MA), come up with an algorithm that will find a city for a specific ip address if a mapping exists.
My only idea is parse the file, and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits) and place them in a list, while also putting the lower of the ranges into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search. If the index is odd, -1 and look in the hash. If it's even, just look in the hash.
Any faults in my plans, or better solutions?
Your approach seems perfectly reasonable.
If you are interested in doing a bit of research / extra coding, there are algorithms that will asymptotically outperform the standard binary search technique that rely on the fact that your IP addresses can be interpreted as integers in the range from 0 to 2^32 - 1. For example, the van Emde Boas tree and y-Fast Trie data structures can implement the predecessor search operation that you're looking at in time O(log log U), where U is the maximum possible IP address, as opposed to the O(log N) approach that binary search uses. The constant factors are higher, though, which means that there is no guarantee that this approach will be any faster. However, it might be worth exploring as another approach that could potentially be even faster.
Hope this helps!
The problem smells of ranges, and one of the good data-structures for this problem would be a Segment Tree. Some resources to help you get started.
The root of the segment tree can represent the addresses (0.0.0.0 - 255.255.255.255). The left sub-tree would represent the addresses (0.0.0.0 - 127.255.255.255) and the right sub-tree would represent the range (128.0.0.0 - 255.255.255.255), and so on. This will go on till we reach ranges which cannot be further sub-divided. Say, if we have the range 32.0.0.0 - 63.255.255.255, mapped to some arbitrary city, it will be a leaf node, we will not further subdivide that range when we arrive there, and tag it to the specific city.
To search for a specific mapping, we follow the tree, just as we do in a Binary Search Tree. If your IP lies in the range of the left sub-tree, move to the left sub-tree, else move to the right sub-tree.
The good parts:
You need not have all sub-trees, only add the sub-trees which are required. For example, if in your data, there is no city mapped for the range (0.0.0.0 - 127.255.255.255), we will not construct that sub-tree.
We are space efficient. If the entire range is mapped to one city, we will create only the root node!
This is a dynamic data-structure. You can add more cities, split-up ranges later on, etc.
You will be making a constant number of operations, since the maximum depth of the tree would be 4 x log2(256) = 32. For this particular problem it turns out that Segment Trees would be as fast as van-Emde Boas trees, and require less space (O(N)).
This is a simple, but non-trivial data-structure, which is better than sorting, because it is dynamic, and easier to explain to your interviewer than van-Emde Boas trees.
This is one of the easiest non-trivial data-structures to code :)
Please note that in some Segment Tree tutorials, they use arrays to represent the tree. This is probably not what you want, since we would not be populating the entire tree, so dynamically allocating nodes, just like we do in a standard Binary Tree is the best.
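To make the idea concrete, here is a rough Python sketch of such a dynamically allocated tree over the 32-bit address space. The names Node, insert and lookup are made up for this example, addresses are assumed to be already converted to integers, and ranges are assumed not to overlap.

```python
class Node:
    __slots__ = ("left", "right", "city")

    def __init__(self):
        self.left = self.right = None
        self.city = None

def insert(node, lo, hi, start, end, city):
    """Tag the nodes that exactly cover [start, end] within the node range [lo, hi]."""
    if start <= lo and hi <= end:
        node.city = city               # whole node range is mapped: stop here
        return
    mid = (lo + hi) // 2
    if start <= mid:
        if node.left is None:
            node.left = Node()         # allocate sub-trees only when needed
        insert(node.left, lo, mid, start, min(end, mid), city)
    if end > mid:
        if node.right is None:
            node.right = Node()
        insert(node.right, mid + 1, hi, max(start, mid + 1), end, city)

def lookup(node, lo, hi, ip):
    """Walk down toward `ip`; the deepest tagged node on the path wins."""
    best = None
    while node is not None:
        if node.city is not None:
            best = node.city
        mid = (lo + hi) // 2
        if ip <= mid:
            node, hi = node.left, mid
        else:
            node, lo = node.right, mid + 1
    return best
```

With the root covering 0 to 2**32 - 1, a range such as 32.0.0.0 - 63.255.255.255 ends up as a single tagged node three levels down, and a lookup never visits more than 33 nodes.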
My only idea is parse the file, and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits)...
If following this approach, you would probably want to multiply by 256^3, 256^2, 256 and 1 respectively for A, B, C and D in an address A.B.C.D. That effectively recreates the IP address as a 32-bit unsigned number.
... and place them in a list, while also putting the lower of the ranges into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search. If the index is odd, -1 and look in the hash. If it's even, just look in the hash.
I would suggest creating a contiguous array (a std::vector) containing structs with the lower and upper ranges (and location name - discussed below). Then as you say you can binary search for a range including a specific value, without any odd/even hassles.
Using the lower end of the range as a key in a hash is one way to avoid having space for the location names in the array, but given the average number of characters in a city name, the likely size of pointers, a choice between a sparsely populated hash table and lengthy displacement lists to search in successive alternative buckets or further indirection to arbitrary length containers - you'd need to be pretty desperate to bother trying. In the first instance, storing the location in a struct alongside the IP value range seems good.
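The sorted-array approach could look something like this in Python (a sketch rather than a reference implementation: it assumes the "low-high = city" line format from the question, non-overlapping ranges, and uses tuples in place of structs; all function names are made up here).

```python
from bisect import bisect_right

def ip_to_int(address):
    """'A.B.C.D' -> A*256^3 + B*256^2 + C*256 + D."""
    a, b, c, d = (int(part) for part in address.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def parse_line(line):
    """Parse '192.168.0.0-192.168.0.255 = Boston, MA' into (low, high, city)."""
    ranges, city = line.split("=", 1)
    low, high = ranges.strip().split("-")
    return ip_to_int(low), ip_to_int(high), city.strip()

def build_table(lines):
    """Return (sorted lower bounds, sorted (low, high, city) rows)."""
    table = sorted(parse_line(line) for line in lines if line.strip())
    lows = [row[0] for row in table]
    return lows, table

def find_city(lows, table, address):
    ip = ip_to_int(address)
    i = bisect_right(lows, ip) - 1     # last range starting at or before ip
    if i >= 0 and table[i][0] <= ip <= table[i][1]:
        return table[i][2]
    return None                        # no mapping exists
```

Because each row carries both ends of its range, the lookup is one binary search plus one bounds check, with no odd/even bookkeeping.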
Alternatively, you could create a tree based on e.g. the individual 0-255 IP values: each level in the tree could be either an array of 256 values for direct indexing, or a sorted array of populated values. That can reduce the number of IP value comparisons you're likely to need to make (O(log2N) to O(1)).
In your example, 192.168.0.0-192.168.0.255 = Boston, MA.
Will the first three octets (192.168.0) be the same for both IP addresses in the entry?
Also, will the first three octets be unique for a city?
If so, then this problem can be solved more easily

Combinations of binary features (vectors)

The source data for the subject is an m-by-n binary matrix (only 0s and 1s are allowed).
m Rows represent observations, n columns - features. Some observations are marked as targets which need to be separated from the rest.
While it looks like a typical NN, SVM, etc problem, I don't need generalization. What I need is an efficient algorithm to find as many as possible combinations of columns (features) that completely separate targets from other observations, classify, that is.
For example:
f1 f2 f3
o1 1 1 0
t1 1 0 1
o2 0 1 1
Here {f1, f3} is an acceptable combo which separates target t1 from the rest (o1, o2) (btw, {f2} is NOT as by task definition a feature MUST be present in a target). In other words,
t1(f1) & t1(f3) = 1 and o1(f1) & o1(f3) = 0, o2(f1) & o2(f3) = 0
where '&' represents logical conjunction (AND).
The m is about 100,000, n is 1,000. Currently the data is packed into 128bit words along m and the search is optimized with sse4 and whatnot. Yet it takes way too long to obtain those feature combos.
After 2 billion calls to the tree descent routine it has covered about 15% of root nodes. And found about 8,000 combos which is a decent result for my particular application.
I use some empirical criteria to cut off less probable descent paths, with limited success, but is there something radically better? I'm pretty sure there has to be. Any help, in whatever form, reference or suggestion, would be appreciated.
I believe the problem you describe is NP-Hard so you shouldn't expect to find the optimum solution in a reasonable time. I do not understand your current algorithm, but here are some suggestions on the top of my head:
1) Construct a decision tree. Label targets as A and non-targets as B and let the decision tree learn the categorization. At each node select the feature such that a function of P(target | feature) and P(target' | feature') is maximum. (i.e. as many targets as possible fall to positive side and as many non-targets as possible fall to negative side)
2) Use a greedy algorithm. Start from the empty set and at each time step add the feature that kills the most non-target rows.
3) Use a randomized algorithm. Start from a small subset of positive features of some target, use the set as the seed for the greedy algorithm. Repeat many times. Pick the best solution. Greedy algorithm will be fast so it will be ok.
4) Use a genetic algorithm. Generate random seeds for the greedy algorithm as in 3 to generate good solutions and cross-product them (bitwise-and probably) to generate new candidates seeds. Remember the best solution. Keep good solutions as the current population. Repeat for many generations.
You will need to find the answer "how many of the given rows have the given feature f" fast so probably you'll need specialized data structures, perhaps using a BitArray for each feature.
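To illustrate suggestion 2, here is a small Python sketch of the greedy step for a single target row. It is a simplification: rows are plain sets of feature names rather than packed bit arrays, only features present in the target are considered (as the question requires), and the name greedy_combo is made up.

```python
def greedy_combo(target, others):
    """target, others: rows given as sets of the feature names they contain."""
    alive = list(others)               # non-target rows not yet separated
    chosen = []
    candidates = set(target)           # a feature must be present in the target
    while alive:
        # Pick the feature that is absent from the most remaining non-targets.
        best = max(sorted(candidates - set(chosen)),
                   key=lambda f: sum(f not in row for row in alive),
                   default=None)
        if best is None or all(best in row for row in alive):
            return None                # no remaining feature removes any row
        chosen.append(best)
        alive = [row for row in alive if best in row]
    return chosen
```

On the three-row example from the question, greedy_combo({"f1", "f3"}, [{"f1", "f2"}, {"f2", "f3"}]) picks f1 and then f3. Seeding chosen with a random subset, as in suggestion 3, would be a small change.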

Efficiently find binary strings with low Hamming distance in large set

Problem:
Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.
Actual data structure to hold the list is open, performance requirements dictate an in-memory solution, cost to build the data structure is secondary, low cost to query the data structure is critical.
Example:
For a maximum Hamming Distance of 1 (values typically will be quite small)
And input:
00001000100000000000000001111101
The values:
01001000100000000000000001111101
00001000100000000010000001111101
should match because there is only 1 position in which the bits are different.
11001000100000000010000001111101
should not match because 3 bit positions are different.
My thoughts so far:
For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.
If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1?
Question: What do we know about the Hamming distance d(x,y)?
Answer:
It is non-negative: d(x,y) ≥ 0
It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
It is symmetric: d(x,y) = d(y,x)
It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)
Question: Why do we care?
Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.
Metric tree (Wikipedia)
BK-tree (Wikipedia)
M-tree (Wikipedia)
VP-tree (Wikipedia)
Cover tree (Wikipedia)
You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.
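As an illustration of how a metric index uses the triangle inequality, here is a bare-bones BK-tree sketch in Python. It is a toy, not the code benchmarked in this answer; the class layout and names are my own.

```python
def hamming(x, y):
    return bin(x ^ y).count("1")

class BKTree:
    def __init__(self):
        self.root = None               # each node is [value, {distance: child}]

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = hamming(value, node[0])
            if d == 0:
                return                 # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child

    def query(self, value, max_dist):
        found = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node_value, children = stack.pop()
            d = hamming(value, node_value)
            if d <= max_dist:
                found.append(node_value)
            # Triangle inequality: matches can only sit in shells
            # d - max_dist .. d + max_dist, so all other children are skipped.
            for shell, child in children.items():
                if d - max_dist <= shell <= d + max_dist:
                    stack.append(child)
        return found
```

The smaller max_dist is relative to the word size, the more child "shells" get pruned at each node.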
Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:
static inline int distance(unsigned x, unsigned y)
{
    return __builtin_popcount(x ^ y);
}
If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.
Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.
Database: 100M pseudorandom integers
Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
Results: Average # of query hits (very approximate)
Speed: Number of queries per second
Coverage: Average percentage of database examined per query
               -- BK Tree --    -- VP Tree --    -- Linear --
Dist  Results  Speed  Cov       Speed  Cov       Speed  Cov
1     0.90     3800   0.048%    4200   0.048%
2     11       300    0.68%     330    0.65%
3     130      56     3.8%      63     3.4%
4     970      18     12%       22     10%
5     5700     8.5    26%       10     22%
6     2.6e4    5.2    42%       6.0    37%
7     1.1e5    3.7    60%       4.1    54%
8     3.5e5    3.0    74%       3.2    70%
9     1.0e6    2.6    85%       2.7    82%
10    2.5e6    2.3    91%       2.4    90%
any                                              2.2    100%
In your comment, you mentioned:
I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.
I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.
A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
I wrote a solution where I represent the input numbers in a bitset of 2^32 bits, so I can check in O(1) whether a certain number is in the input. Then for a queried number and maximum distance, I recursively generate all numbers within that distance and check them against the bitset.
For example for maximum distance 5, this is 242825 numbers (sum for d = 0 to 5 of (32 choose d)). For comparison, Dietrich Epp's VP-tree solution for example goes through 22% of the 100 million numbers, i.e., through 22 million numbers.
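The idea can be sketched in a few lines of Python. This is not the C code measured below: a Python set stands in for the 2^32-bit bitset, and itertools.combinations replaces the recursive generation. The function names are made up for this example.

```python
from itertools import combinations
from math import comb

def within_distance(stored, query, max_dist, bits=32):
    """Return the stored values reachable from `query` by at most max_dist bit flips."""
    hits = []
    for d in range(max_dist + 1):
        for positions in combinations(range(bits), d):
            candidate = query
            for p in positions:
                candidate ^= 1 << p
            if candidate in stored:    # O(1) membership test
                hits.append(candidate)
    return hits

def candidates_checked(max_dist, bits=32):
    """How many candidates are generated for a given maximum distance."""
    return sum(comb(bits, d) for d in range(max_dist + 1))
```

The work per query depends only on the distance and the word size, not on how many numbers are stored, which is why this wins for small distances and loses for large ones.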
I used Dietrich's code/solutions as the basis to add my solution and compare it with his. Here are speeds, in queries per second, for maximum distances up to 10:
Dist       BK Tree     VP Tree        Bitset   Linear
1        10,133.83   15,773.69  1,905,202.76     4.73
2           677.78    1,006.95    218,624.08     4.70
3           113.14      173.15     27,022.32     4.76
4            34.06       54.13      4,239.28     4.75
5            15.21       23.81        932.18     4.79
6             8.96       13.23        236.09     4.78
7             6.52        8.37         69.18     4.77
8             5.11        6.15         23.76     4.68
9             4.39        4.83          9.01     4.47
10            3.69        3.94          2.82     4.13
Prepare       4.1s       21.0s         1.52s    0.13s
times (for building the data structure before the queries)
For small distances, the bitset solution is by far the fastest of the four. Question author Eric commented below that the largest distance of interest would probably be 4-5. Naturally, my bitset solution becomes slower for larger distances, even slower than the linear search (for distance 32, it would go through 2^32 numbers). But for distance 9 it still easily leads.
I also modified Dietrich's testing. Each of the above results is for letting the algorithm solve at least three queries and as many queries as it can in about 15 seconds (I do rounds with 1, 2, 4, 8, 16, etc queries, until at least 10 seconds have passed in total). That's fairly stable, I even get similar numbers for just 1 second.
My CPU is an i7-6700. My code (based on Dietrich's) is here (ignore the documentation there at least for now, not sure what to do about that, but the tree.c contains all the code and my test.bat shows how I compiled and ran (I used the flags from Dietrich's Makefile)). Shortcut to my solution.
One caveat: My query results contain numbers only once, so if the input list contains duplicate numbers, that may or may not be desired. In question author Eric's case, there were no duplicates (see comment below). In any case, this solution might be good for people who either have no duplicates in the input or don't want or need duplicates in the query results (I think it's likely that the pure query results are only a means to an end and then some other code turns the numbers into something else, for example a map mapping a number to a list of files whose hash is that number).
A common approach (at least common to me) is to divide your bit string into several chunks and query on these chunks for an exact match as a pre-filter step. If you work with files, you create as many files as you have chunks (e.g. 4 here), with each chunk permuted to the front, and then sort the files. You can use a binary search, and you can even expand your search above and below a matching chunk for a bonus.
You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.
So to recap: Say you have a bunch of 32-bit strings in a DB or files and you want to find every hash that is within a Hamming distance of 3 bits or less of your "query" bit string:
1. Create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4. Or if you use files, create four files, each being a permutation of the slices with one "islice" at the front of each "row".
2. Slice your query bit string the same way into qslice 1 to 4.
3. Query this table for rows where qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. By the pigeonhole principle, any string within 3 bits of the query leaves at least one of the four slices unchanged, so this returns every string within distance 3 (along with false positives that differ by more). If using files, do a binary search in each of the four permuted files for the same results.
4. For each returned bit string, compute the exact Hamming distance pair-wise with your query bit string (reconstructing the index-side bit strings from the four slices, either from the DB or from a permuted file).
The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table and is very efficient in practice.
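The steps above can be sketched in memory roughly as follows. This is only an illustration under a few assumptions: Python dicts stand in for the DB columns or sorted files, and the names slices, build_index and query_index are invented here. The pre-filter is only guaranteed to be complete when max_dist is smaller than the number of slices.

```python
from collections import defaultdict

SLICES = 4       # number of chunks
SLICE_BITS = 8   # 4 x 8 = 32-bit hashes


def slices(value):
    """Split a 32-bit value into four 8-bit slices, most significant first."""
    return [(value >> (SLICE_BITS * (SLICES - 1 - i))) & 0xFF
            for i in range(SLICES)]


def build_index(hashes):
    """One dict per slice position: slice value -> list of full hashes."""
    index = [defaultdict(list) for _ in range(SLICES)]
    for h in hashes:
        for pos, s in enumerate(slices(h)):
            index[pos][s].append(h)
    return index


def query_index(index, q, max_dist=3):
    """Pre-filter on exact slice matches, then verify the exact distance."""
    candidates = set()
    for pos, s in enumerate(slices(q)):
        candidates.update(index[pos].get(s, ()))
    # Exact Hamming distance check on the (hopefully small) candidate set.
    return sorted(h for h in candidates if bin(h ^ q).count("1") <= max_dist)
```

Note that the candidate set removes duplicates, so an input hash that appears several times is reported once.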
Furthermore, it is easy to shard the files into smaller files as needed for more speed using parallelism.
Now of course in your case, you are looking for a self-join of sorts, that is, all the values that are within some distance of each other. The same approach still works IMHO, though you will have to expand up and down from a starting point for permutations (using files or lists) that share the starting chunk and compute the Hamming distance for the resulting cluster.
If running in memory instead of files, your 100M 32 bits strings data set would be in the range of 4 GB. Hence the four permuted lists may need about 16GB+ of RAM. Though I get excellent results with memory mapped files instead and much less RAM for similar size datasets.
There are open source implementations available. The best in the space is IMHO the one done for Simhash by Moz, C++ but designed for 64 bits strings and not 32 bits.
This bounded Hamming distance approach was first described AFAIK by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:
APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE
[...]
Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+ε))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:
For each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O_σ examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q.
Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":
3.3 The Results for Algorithm C
We partitioned the bit string of each page into 12 non-
overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one
piece in common. This approach is guaranteed to find all
pairs of pages with difference up to 11, i.e., C-similarity 373,
but might miss some for larger differences.
This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:
THE HAMMING DISTANCE PROBLEM
Definition: Given a collection of f-bit fingerprints and a
query fingerprint F, identify whether an existing fingerprint
differs from F in at most k bits. (In the batch-mode version
of the above problem, we have a set of query fingerprints
instead of a single query fingerprint)
[...]
Intuition: Consider a sorted table of 2^d f-bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to “almost a counter” in the sense that (a) quite a few 2^d bit-combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are “almost random”.
Now choose d′ such that |d′ − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in d′ most significant bit-positions. Since |d′ − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d′ least-significant bit-positions).
The procedure described above helps us locate an existing fingerprint that differs from F in k bit-positions, all of which are restricted to be among the least significant f − d′ bits of F. This takes care of a fair number of cases. To cover all the cases, it suffices to build a small number of additional sorted tables, as formally outlined in the next Section.
Note: I posted a similar answer to a related DB-only question
You could pre-compute every possible variation of your original list within the specified hamming distance, and store it in a bloom filter. This gives you a fast "NO" but not necessarily a clear answer about "YES."
For YES, store a list of all the original values associated with each position in the bloom filter, and go through them one at a time. Optimize the size of your bloom filter for speed / memory trade-offs.
Not sure if it all works exactly, but seems like a good approach if you've got runtime RAM to burn and are willing to spend a very long time in pre-computation.
How about sorting the list and then doing a binary search in that sorted list for each of the possible values within your Hamming distance?
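A rough sketch of what that could look like in Python, using the standard bisect module for the binary search. The function names are hypothetical, and the cost per query is the number of candidate values times one binary search each.

```python
from bisect import bisect_left
from itertools import combinations


def contains(sorted_values, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_values, x)
    return i < len(sorted_values) and sorted_values[i] == x


def within_distance(sorted_values, query, max_dist, bits=32):
    """Probe the sorted list for every value within max_dist of query."""
    found = []
    for d in range(max_dist + 1):
        for positions in combinations(range(bits), d):
            candidate = query
            for p in positions:
                candidate ^= 1 << p  # flip one of the chosen bits
            if contains(sorted_values, candidate):
                found.append(candidate)
    return sorted(found)
```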
One possible approach to solve this problem is using a Disjoint-set data structure. The idea is to merge list members with Hamming distance <= k into the same set. Here is the outline of the algorithm:
For each list member calculate every possible value with Hamming distance <= k. For k=1, there are 32 values (for 32-bit values). For k=2, 32 + 32*31/2 values.
For each calculated value, test if it is in the original input. You can use an array with size 2^32 or a hash map to do this check.
If the value is in the original input, do a "union" operation with the list member.
Keep the number of union operations executed in a variable.
You start the algorithm with N disjoint sets (where N is the number of elements in the input). Each time you execute a union operation, you decrease the number of disjoint sets by 1. When the algorithm terminates, the disjoint-set data structure will have all the values with Hamming distance <= k grouped in disjoint sets. This disjoint-set data structure can be calculated in almost linear time.
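The outline above might look something like this in Python. Treat it as a sketch under assumptions: a dict is used both as the membership test and as the parent table, the find uses path halving only (no union by rank), and a union is counted only when it actually merges two different sets. The names find and cluster are illustrative.

```python
from itertools import combinations


def find(parent, x):
    """Return the representative of x's set, halving the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster(values, k, bits=32):
    """Group values so that members at Hamming distance <= k share a set.

    Returns (sorted list of sorted groups, number of disjoint sets).
    """
    parent = {v: v for v in values}  # also serves as the "is in input" test
    # Every XOR mask with 1..k bits set.
    masks = [sum(1 << p for p in pos)
             for d in range(1, k + 1)
             for pos in combinations(range(bits), d)]
    unions = 0
    for v in parent:
        for m in masks:
            other = v ^ m
            if other in parent:
                ra, rb = find(parent, v), find(parent, other)
                if ra != rb:
                    parent[ra] = rb
                    unions += 1
    groups = {}
    for v in parent:
        groups.setdefault(find(parent, v), []).append(v)
    return sorted(sorted(g) for g in groups.values()), len(parent) - unions
```

Keep in mind that the sets are transitive: two values in the same set can be further apart than k if they are linked through intermediate values.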
Here's a simple idea: do a byte-wise radix sort of the 100m input integers, most significant byte first, keeping track of bucket boundaries on the first three levels in some external structure.
To query, start with a distance budget of d and your input word w. For each bucket in the top level with byte value b, calculate the Hamming distance d_0 between b and the high byte of w. Recursively search that bucket with a budget of d - d_0: that is, for each byte value b', let d_1 be the Hamming distance between b' and the second byte of w. Recursively search into the third layer with a budget of d - d_0 - d_1, and so on.
Note that the buckets form a tree. Whenever your budget becomes negative, stop searching that subtree. If you recursively descend into a leaf without blowing your distance budget, that leaf value should be part of the output.
Here's one way to represent the external bucket boundary structure: have an array of length 16_777_216 (= (2**8)**3 = 2**24), where the element at index i is the starting index of the bucket containing values in range [256*i, 256*i + 255]. To find the index one beyond the end of that bucket, look up at index i+1 (or use the end of the array for i + 1 = 2**24).
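For illustration, here is a small Python sketch of the budgeted descent. It deviates from the description in one respect, which is an assumption made to keep it short: instead of the precomputed 2**24-entry boundary array, bucket boundaries are found on the fly with a binary search over the fully sorted input. The name radix_search is made up.

```python
from bisect import bisect_left


def radix_search(sorted_words, query, budget, bits=32):
    """Return all values in sorted_words within Hamming distance `budget`."""
    out = []
    levels = bits // 8

    def descend(lo, hi, prefix, level, remaining):
        # sorted_words[lo:hi] is the bucket whose top `level` bytes == prefix.
        if lo == hi:
            return  # empty bucket
        if level == levels:
            out.extend(sorted_words[lo:hi])  # leaf reached within budget
            return
        shift = bits - 8 * (level + 1)
        q_byte = (query >> shift) & 0xFF
        for b in range(256):
            cost = bin(b ^ q_byte).count("1")
            if cost > remaining:
                continue  # budget would go negative: prune this subtree
            child = (prefix << 8) | b
            start = bisect_left(sorted_words, child << shift, lo, hi)
            end = bisect_left(sorted_words, (child + 1) << shift, lo, hi)
            descend(start, end, child, level + 1, remaining - cost)

    descend(0, len(sorted_words), 0, 0, budget)
    return out
```

Because byte values are visited in ascending order at every level, the results come back sorted, and duplicate inputs are reported as many times as they occur.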
Memory budget is 100m * 4 bytes per word = 400 MB for the inputs, and 2**24 * 4 bytes per address = 64 MiB for the indexing structure, or just shy of half a gig in total. The indexing structure is roughly a 17% overhead on the raw data. Of course, once you've constructed the indexing structure you only need to store the lowest byte of each input word, since the other three are implicit in the index into the indexing structure, for a total of ~(64 + 100) MB.
If your input is not uniformly distributed, you could permute the bits of your input words with a (single, universally shared) permutation which puts all the entropy towards the top of the tree. That way, the first level of pruning will eliminate larger chunks of the search space.
I tried some experiments, and this performs about as well as linear search, sometimes even worse. So much for this fancy idea. Oh well, at least it's memory efficient.

Resources