I have a need to use a Cuckoo Filter but I'm not sure how to size it. I found a calculator for Bloom Filters (https://hur.st/bloomfilter/) for which I can calculate in a few ways. I can specify the approximate number of items and the desired false positive rate and it will tell me the size and number of hash functions. I'm looking for something similar for a Cuckoo Filter but I haven't found one or other instructions on how to find those numbers.
I'm looking at a Node or Python implementation. It seems the parameters to define the filter are:
filter size or capacity
bucket size
fingerprint size
I want to specify the number of elements (eg 100k) and an FPR (eg .1%) to find out the parameters needed.
Based on information in the original paper (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf), you need to choose bucket size first, which allows you to determine fingerprint size and capacity. Bucket size is based on the desired false positive rate:
"the space-optimal bucket size depends on the target false positive
rate ε: when ε > 0.002, having two entries per bucket yields slightly
better results than using four entries per bucket; when ε decreases to
0.00001 < ε ≤ 0.002, four entries per bucket minimizes space"1
For your suggested 0.1%, that would mean a bucket size of 4.
The fingerprint size depends on bucket size and false positive rate.
"To retain the target false positive rate ε, the filter ensures 2b/2^f
≤ ε, thus the minimal fingerprint size required is approximately: f ≥ log2(1/ε) + log2(2b)"1
With a bucket size of b = 4, an error rate of 0.1% would require ~10 + 3 = 13 bits for a fingerprint.
Finally, capacity is determined by the number of elements divided by the maximum allowable load, which is determined by bucket size.
"With k = 2 hash functions, the load factor α is 50% when the bucket
size b = 1 (i.e., the hash table is directly mapped), but increases to
84%, 95% or 98% respectively using bucket size b = 2, 4 or 8."1
So 100k / 0.95 gives you a capacity of about 105k.
I don't know of any one formula to give you these answers, since they depend on each other, but hopefully each of those steps makes sense.
For 100k elements and 0.1% FPR, that's:
filter size of about 105k
bucket size of 4
fingerprint size of 13 bits
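The three steps above can be sketched as a small Python helper. This is only an illustration of the quoted rules, not code from the paper: the function name `size_cuckoo_filter` is made up, and using a bucket size of 4 for rates below the quoted 0.00001 bound is an assumption.

```python
import math

# Maximum load factors quoted from the paper for k = 2 hash functions.
MAX_LOAD = {2: 0.84, 4: 0.95}

def size_cuckoo_filter(n_items, fpr):
    """Return (bucket_size, fingerprint_bits, capacity) for a target FPR."""
    # Quoted rule: 2 entries per bucket when eps > 0.002, otherwise 4.
    # Assumption: 4 is also used below the quoted 0.00001 lower bound.
    bucket_size = 2 if fpr > 0.002 else 4
    # f >= log2(1/eps) + log2(2b), rounded up to whole bits.
    fingerprint_bits = math.ceil(math.log2(1 / fpr) + math.log2(2 * bucket_size))
    # Capacity = number of items divided by the maximum load factor.
    capacity = math.ceil(n_items / MAX_LOAD[bucket_size])
    return bucket_size, fingerprint_bits, capacity

print(size_cuckoo_filter(100_000, 0.001))  # (4, 13, 105264)
```

Check how your library defines "capacity" (number of fingerprint slots versus number of buckets) before passing the number in.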
1 Bin Fan, Dave G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher, Cuckoo Filter: Practically Better Than Bloom, Proceedings of the 10th ACM International Conference on emerging Networking Experiments and Technologies, December 02-05, 2014, Sydney, Australia [doi>10.1145/2674005.2674994]
According to https://brilliant.org/wiki/cuckoo-filter/ (scroll down to "Space Complexity"), the number of bits per entry is determined by:
bitsPerEntry = (log2(1/fpp) + 2) / load
fpp is your False Positive Probability, load is how full you want the table to be (as a fraction, e.g. 0.95), and the logarithm is base 2.
So just figure out how many items you want to put in the table, multiply by the bitsPerEntry, and divide by 8. That will tell you how many bytes to allocate for your table. By applying some simple algebra, you can structure the equation to solve for any one of the unknowns.
The article says that with a load of 95.5%, you can maintain a stable false positive rate with 7 bits per entry.
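As a rough sketch of that calculation in Python, assuming the logarithm in the formula is base 2 (the function name `cuckoo_table_bytes` is hypothetical):

```python
import math

def cuckoo_table_bytes(n_items, fpp, load):
    """Bytes to allocate, using bitsPerEntry = (log2(1/fpp) + 2) / load."""
    bits_per_entry = (math.log2(1 / fpp) + 2) / load
    return math.ceil(n_items * bits_per_entry / 8)

# 100k items, 0.1% false positives, table 95% full.
print(cuckoo_table_bytes(100_000, 0.001, 0.95))
```

This lands at roughly 157 KB for the example in the question.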
The size of the fingerprint determines your error rate for the most part. As you can see in Figure 3 in the Cuckoo paper, the bucket size does not have a major effect on the accuracy. Bucket size can reduce insert time considerably since it reduces the number of relocations of existing fingerprints in occupied buckets.
I would recommend fingerprint sizes of 7, 15, 23, 31, etc., which will maximize both accuracy and speed. The reason for (8 * n) - 1 is that one bit is used to tell whether the cell is occupied at all, since 0 is a legal fingerprint value.
To answer your question, I would recommend
Capacity - what you need plus 5-10%
FingerPrint - 15 bits
Bucket size - 4
Problem:
Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.
Actual data structure to hold the list is open, performance requirements dictate an in-memory solution, cost to build the data structure is secondary, low cost to query the data structure is critical.
Example:
For a maximum Hamming Distance of 1 (values typically will be quite small)
And input:
00001000100000000000000001111101
The values:
01001000100000000000000001111101
00001000100000000010000001111101
should match because there is only 1 position in which the bits are different.
11001000100000000010000001111101
should not match because 3 bit positions are different.
My thoughts so far:
For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.
If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1?
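For reference, the distance 0 and distance 1 ideas above could look like this in Python. It is a minimal sketch: the names `contains` and `within_distance_1` are made up, and a sorted Python list stands in for whatever in-memory structure is actually used.

```python
from bisect import bisect_left

def contains(sorted_values, x):
    """Binary search for an exact match in a sorted list."""
    i = bisect_left(sorted_values, x)
    return i < len(sorted_values) and sorted_values[i] == x

def within_distance_1(sorted_values, query):
    """Return members at Hamming distance 0 or 1 from a 32-bit query."""
    hits = [query] if contains(sorted_values, query) else []
    for bit in range(32):
        candidate = query ^ (1 << bit)  # flip exactly one bit
        if contains(sorted_values, candidate):
            hits.append(candidate)
    return hits

query = 0b00001000100000000000000001111101
values = sorted([
    0b01001000100000000000000001111101,
    0b00001000100000000010000001111101,
    0b11001000100000000010000001111101,
])
print(len(within_distance_1(values, query)))  # 2
```

Each query costs 33 binary searches; the open question is how to avoid the combinatorial growth for larger distances.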
Question: What do we know about the Hamming distance d(x,y)?
Answer:
It is non-negative: d(x,y) ≥ 0
It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
It is symmetric: d(x,y) = d(y,x)
It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)
Question: Why do we care?
Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.
Metric tree (Wikipedia)
BK-tree (Wikipedia)
M-tree (Wikipedia)
VP-tree (Wikipedia)
Cover tree (Wikipedia)
You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.
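To make the metric-tree idea concrete, here is a minimal BK-tree sketch in Python. It is an illustration under simple assumptions (nodes are `[value, children]` lists, duplicates are dropped), not the implementation that was benchmarked in this answer.

```python
def hamming(x, y):
    """Number of differing bits between two non-negative integers."""
    return bin(x ^ y).count("1")

class BKTree:
    def __init__(self):
        self.root = None  # each node is [value, {distance: child_node}]

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = hamming(value, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child

    def query(self, target, max_dist):
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = hamming(target, value)
            if d <= max_dist:
                results.append(value)
            # Triangle inequality: only subtrees whose edge distance lies
            # in [d - max_dist, d + max_dist] can contain matches.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree()
for v in (0b0000, 0b0001, 0b0111, 0b1111):
    tree.add(v)
print(sorted(tree.query(0b0011, 1)))  # [1, 7]
```

The pruning step is the only place the metric properties are used; everything else is an ordinary tree.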
Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:
static inline int distance(unsigned x, unsigned y)
{
    /* XOR leaves a 1 in every differing bit position; popcount counts them. */
    return __builtin_popcount(x ^ y);
}
If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.
Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
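For readers unfamiliar with the "mask/shift/add" approach, this is the well-known parallel bit count for 32-bit words, transcribed into Python purely to show the steps. It is not the C version that was benchmarked, and in Python it will not be fast.

```python
def popcount32(v):
    """Count set bits in a 32-bit value using mask/shift/add steps."""
    v = v - ((v >> 1) & 0x55555555)                  # 2-bit partial sums
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)   # 4-bit partial sums
    v = (v + (v >> 4)) & 0x0F0F0F0F                  # 8-bit partial sums
    # Multiplying adds the four byte sums into the top byte; the mask
    # emulates 32-bit overflow because Python integers are unbounded.
    return ((v * 0x01010101) & 0xFFFFFFFF) >> 24

def distance(x, y):
    """Hamming distance between two 32-bit unsigned integers."""
    return popcount32((x ^ y) & 0xFFFFFFFF)

print(distance(0b00001000100000000000000001111101,
               0b01001000100000000000000001111101))  # 1
```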
Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.
Database: 100M pseudorandom integers
Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
Results: Average # of query hits (very approximate)
Speed: Number of queries per second
Coverage: Average percentage of database examined per query
                 -- BK Tree --      -- VP Tree --      -- Linear --
Dist   Results   Speed   Cov        Speed   Cov        Speed   Cov
1      0.90      3800    0.048%     4200    0.048%
2      11        300     0.68%      330     0.65%
3      130       56      3.8%       63      3.4%
4      970       18      12%        22      10%
5      5700      8.5     26%        10      22%
6      2.6e4     5.2     42%        6.0     37%
7      1.1e5     3.7     60%        4.1     54%
8      3.5e5     3.0     74%        3.2     70%
9      1.0e6     2.6     85%        2.7     82%
10     2.5e6     2.3     91%        2.4     90%
any                                                    2.2     100%
In your comment, you mentioned:
I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.
I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.
A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
I wrote a solution where I represent the input numbers in a bitset of 2^32 bits, so I can check in O(1) whether a certain number is in the input. Then for a queried number and maximum distance, I recursively generate all numbers within that distance and check them against the bitset.
For example for maximum distance 5, this is 242825 numbers (the sum over d = 0 to 5 of (32 choose d)). For comparison, Dietrich Epp's VP-tree solution for example goes through 22% of the 100 million numbers, i.e., through 22 million numbers.
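The idea can be sketched in a few lines of Python. Two simplifications are assumptions of this sketch rather than part of the original solution: a Python `set` stands in for the 2^32-bit bitset, and `itertools.combinations` replaces the recursive generation.

```python
from itertools import combinations
from math import comb

def neighbours(query, max_dist, bits=32):
    """Yield every value within max_dist bit flips of query."""
    for d in range(max_dist + 1):
        for positions in combinations(range(bits), d):
            mask = 0
            for p in positions:
                mask |= 1 << p
            yield query ^ mask

def search(members, query, max_dist):
    """members: a set used here in place of the 2^32-bit bitset."""
    return [c for c in neighbours(query, max_dist) if c in members]

# Number of candidates checked for maximum distance 5.
print(sum(comb(32, d) for d in range(6)))  # 242825
```

The work per query depends only on the distance, not on the size of the input list, which is why it wins for small distances and loses for large ones.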
I used Dietrich's code/solutions as the basis to add my solution and compare it with his. Here are speeds, in queries per second, for maximum distances up to 10:
Dist      BK Tree      VP Tree      Bitset          Linear
1         10,133.83    15,773.69    1,905,202.76    4.73
2         677.78       1,006.95     218,624.08      4.70
3         113.14       173.15       27,022.32       4.76
4         34.06        54.13        4,239.28        4.75
5         15.21        23.81        932.18          4.79
6         8.96         13.23        236.09          4.78
7         6.52         8.37         69.18           4.77
8         5.11         6.15         23.76           4.68
9         4.39         4.83         9.01            4.47
10        3.69         3.94         2.82            4.13
Prepare   4.1s         21.0s        1.52s           0.13s
("Prepare" is the time for building the data structure before the queries.)
For small distances, the bitset solution is by far the fastest of the four. Question author Eric commented below that the largest distance of interest would probably be 4-5. Naturally, my bitset solution becomes slower for larger distances, even slower than the linear search (for distance 32, it would go through 2^32 numbers). But for distance 9 it still easily leads.
I also modified Dietrich's testing. Each of the above results is for letting the algorithm solve at least three queries and as many queries as it can in about 15 seconds (I do rounds with 1, 2, 4, 8, 16, etc queries, until at least 10 seconds have passed in total). That's fairly stable, I even get similar numbers for just 1 second.
My CPU is an i7-6700. My code (based on Dietrich's) is here (ignore the documentation there at least for now, not sure what to do about that, but the tree.c contains all the code and my test.bat shows how I compiled and ran (I used the flags from Dietrich's Makefile)). Shortcut to my solution.
One caveat: My query results contain numbers only once, so if the input list contains duplicate numbers, that may or may not be desired. In question author Eric's case, there were no duplicates (see comment below). In any case, this solution might be good for people who either have no duplicates in the input or don't want or need duplicates in the query results (I think it's likely that the pure query results are only a means to an end and then some other code turns the numbers into something else, for example a map mapping a number to a list of files whose hash is that number).
A common approach (at least common to me) is to divide your bit string into several chunks and query on these chunks for an exact match as a pre-filter step. If you work with files, you create as many files as you have chunks (e.g. 4 here), each with a different chunk permuted to the front, and then sort the files. You can use a binary search, and you can even expand your search above and below a matching chunk as a bonus.
You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.
So to recap: say you have a bunch of 32-bit strings in a DB or in files, and you want to find every hash that is within a Hamming distance of 3 bits or less of your "query" bit string:
1. Create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4. Or, if you use files, create four files, each being a permutation of the slices with one "islice" at the front of each "row".
2. Slice your query bit string the same way into qslice 1 to 4.
3. Query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4 holds. This gives you every string that is within 7 bits (8 - 1) of the query string. If using a file, do a binary search in each of the four permuted files for the same results.
4. For each returned bit string, compute the exact Hamming distance pair-wise with your query bit string (reconstructing the index-side bit strings from the four slices, either from the DB or from a permuted file).
The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table and is very efficient in practice.
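The recap can be sketched in memory with Python dictionaries standing in for the four-column table or the four permuted files (that substitution, and the names `slices`, `build_index` and `query`, are assumptions of this sketch). For a maximum distance of 3 the pre-filter cannot miss a match: three differing bits can touch at most three of the four slices, so at least one slice is identical.

```python
from collections import defaultdict

def slices(value):
    """Split a 32-bit value into four 8-bit slices, most significant first."""
    return [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]

def build_index(values):
    """One lookup table per slice position: slice value -> full values."""
    index = [defaultdict(list) for _ in range(4)]
    for v in values:
        for i, s in enumerate(slices(v)):
            index[i][s].append(v)
    return index

def query(index, q, max_dist=3):
    # Pre-filter: anything sharing at least one exact slice with q.
    candidates = set()
    for i, s in enumerate(slices(q)):
        candidates.update(index[i].get(s, ()))
    # Exact check on the (hopefully small) candidate set.
    return sorted(v for v in candidates if bin(v ^ q).count("1") <= max_dist)

index = build_index([0x12345678, 0x12345679, 0xFFFFFFFF])
print([hex(v) for v in query(index, 0x1234567B)])  # ['0x12345678', '0x12345679']
```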
Furthermore, it is easy to shard the files into smaller files as needed for more speed using parallelism.
Now of course in your case, you are looking for a self-join of sorts, that is, all the values that are within some distance of each other. The same approach still works IMHO, though you will have to expand up and down from a starting point for permutations (using files or lists) that share the starting chunk and compute the Hamming distance for the resulting cluster.
If running in memory instead of files, your data set of 100M 32-bit strings would be in the range of 4 GB. Hence the four permuted lists may need about 16GB+ of RAM. Though I get excellent results with memory mapped files instead, and much less RAM, for similar size datasets.
There are open source implementations available. The best in the space is IMHO the one done for Simhash by Moz, in C++ but designed for 64-bit strings and not 32-bit ones.
This bounded Hamming distance approach was first described AFAIK by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:
APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE
[...]
Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+ε))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:
For each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O_σ examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q.
Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":
3.3 The Results for Algorithm C
We partitioned the bit string of each page into 12 non-overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11, i.e., C-similarity 373, but might miss some for larger differences.
This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:
THE HAMMING DISTANCE PROBLEM
Definition: Given a collection of f -bit fingerprints and a
query fingerprint F, identify whether an existing fingerprint
differs from F in at most k bits. (In the batch-mode version
of the above problem, we have a set of query fingerprints
instead of a single query fingerprint)
[...]
Intuition: Consider a sorted table of 2^d f-bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to “almost a counter” in the sense that (a) quite a few 2^d bit-combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are “almost random”.
Now choose d′ such that |d′ − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in d′ most significant bit-positions. Since |d′ − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d least-significant bit-positions).
The procedure described above helps us locate an existing
fingerprint that differs from F in k bit-positions, all of which
are restricted to be among the least significant f − d bits of
F. This takes care of a fair number of cases. To cover all
the cases, it suffices to build a small number of additional
sorted tables, as formally outlined in the next Section.
Note: I posted a similar answer to a related DB-only question
You could pre-compute every possible variation of your original list within the specified hamming distance, and store it in a bloom filter. This gives you a fast "NO" but not necessarily a clear answer about "YES."
For YES, store a list of all the original values associated with each position in the bloom filter, and go through them one at a time. Optimize the size of your bloom filter for speed / memory trade-offs.
Not sure if it all works exactly, but seems like a good approach if you've got runtime RAM to burn and are willing to spend a very long time in pre-computation.
How about sorting the list and then doing a binary search in that sorted list on the different possible values within your Hamming Distance?
One possible approach to solve this problem is using a Disjoint-set data structure. The idea is to merge list members with Hamming distance <= k into the same set. Here is the outline of the algorithm:
1. For each list member, calculate every possible value with Hamming distance <= k. For k=1, there are 32 values (for 32-bit values). For k=2, 32 + 32*31/2 values.
2. For each calculated value, test if it is in the original input. You can use an array with size 2^32 or a hash map to do this check.
3. If the value is in the original input, do a "union" operation with the list member.
4. Keep the number of union operations executed in a variable.
You start the algorithm with N disjoint sets (where N is the number of elements in the input). Each time you execute a union operation that merges two sets, you decrease the number of disjoint sets by 1. When the algorithm terminates, the disjoint-set data structure will have all the values with Hamming distance <= k grouped in disjoint sets. This disjoint-set data structure can be calculated in almost linear time.
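A compact Python sketch of this outline follows. It assumes a dictionary serves both as the membership test and as the union-find parent table, and it counts only unions that actually merge two different sets; the function name `group_by_hamming` is made up.

```python
from itertools import combinations

def group_by_hamming(values, k, bits=32):
    """Group values so that members within distance k share a set."""
    parent = {v: v for v in values}  # also serves as the membership test

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    unions = 0
    for v in parent:
        for d in range(1, k + 1):
            for positions in combinations(range(bits), d):
                w = v
                for p in positions:
                    w ^= 1 << p  # flip the chosen bits
                if w in parent:
                    rv, rw = find(v), find(w)
                    if rv != rw:
                        parent[rv] = rw
                        unions += 1
    groups = {}
    for v in parent:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values()), unions

groups, unions = group_by_hamming([0b0000, 0b0001, 0b0011, 0b11110000], k=1)
print(len(groups), unions)  # 2 2
```

Note that this groups values transitively (0b0000 and 0b0011 end up together via 0b0001), so it answers a clustering question rather than a single-query lookup.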
Here's a simple idea: do a byte-wise radix sort of the 100m input integers, most significant byte first, keeping track of bucket boundaries on the first three levels in some external structure.
To query, start with a distance budget of d and your input word w. For each bucket in the top level with byte value b, calculate the Hamming distance d_0 between b and the high byte of w. Recursively search that bucket with a budget of d - d_0: that is, for each byte value b', let d_1 be the Hamming distance between b' and the second byte of w. Recursively search into the third layer with a budget of d - d_0 - d_1, and so on.
Note that the buckets form a tree. Whenever your budget becomes negative, stop searching that subtree. If you recursively descend into a leaf without blowing your distance budget, that leaf value should be part of the output.
Here's one way to represent the external bucket boundary structure: have an array of length 16_777_216 (= (2**8)**3 = 2**24), where the element at index i is the starting index of the bucket containing values in range [256*i, 256*i + 255]. To find the index one beyond the end of that bucket, look up at index i+1 (or use the end of the array for i + 1 = 2**24).
Memory budget is 100m * 4 bytes per word = 400 MB for the inputs, and 2**24 * 4 bytes per address = 64 MiB for the indexing structure, or just shy of half a gig in total. The indexing structure is roughly a 17% overhead on the raw data. Of course, once you've constructed the indexing structure you only need to store the lowest byte of each input word, since the other three are implicit in the index into the indexing structure, for a total of ~(64 + 100) MB.
If your input is not uniformly distributed, you could permute the bits of your input words with a (single, universally shared) permutation which puts all the entropy towards the top of the tree. That way, the first level of pruning will eliminate larger chunks of the search space.
I tried some experiments, and this performs about as well as linear search, sometimes even worse. So much for this fancy idea. Oh well, at least it's memory efficient.
Is there an algorithm that can quickly determine if a number is a factor of a given set of numbers ?
For example, 12 is a factor of [24,33,52] while 5 is not.
Is there a better approach than linear search O(n)? The set will contain a few million elements. I don't need to find the number, just a true or false result.
If a large number of numbers are checked against a constant list one possible approach to speed up the process is to factorize the numbers in the list into their prime factors first. Then put the list members in a dictionary and have the prime factors as the keys. Then when a number (potential factor) comes you first factorize it into its prime factors and then use the constructed dictionary to check whether the number is a factor of the numbers which can be potentially multiples of the given number.
I think in general O(n) search is what you will end up with. However, depending on how large the numbers are in general, you can speed up the search considerably assuming that the set is sorted (you mention that it can be) by observing that if you are searching to find a number divisible by D and you have currently scanned x and x is not divisible by D, the next possible candidate is obviously at floor([x + D] / D) * D. That is, if D = 12 and the list is
5 11 13 19 22 25 27
and you are scanning at 13, the next possible candidate number would be 24. Now depending on the distribution of your input, you can scan forwards using binary search instead of linear search, as you are searching now for the least number not less than 24 in the list, and the list is sorted. If D is large then you might save lots of comparisons in this way.
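A short Python sketch of this skip-ahead scan, assuming the set is available as a sorted list of non-negative integers (the name `has_multiple` is made up):

```python
from bisect import bisect_left

def has_multiple(sorted_values, d):
    """True if some element of the sorted list is divisible by d."""
    i = 0
    while i < len(sorted_values):
        x = sorted_values[i]
        if x % d == 0:
            return True
        # Next possible candidate: the smallest multiple of d above x.
        target = (x // d + 1) * d
        # Jump to the first element >= target using binary search.
        i = bisect_left(sorted_values, target, i + 1)
    return False

values = [5, 11, 13, 19, 22, 25, 27]
print(has_multiple(values, 12))  # False
print(has_multiple(values, 11))  # True
```

For small D this degenerates to nearly a linear scan with extra overhead, so it only pays off when D is large relative to the gaps between elements.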
However from pure computational complexity point of view, sorting and then searching is going to be O(n log n), whereas just a linear scan is O(n).
For testing many potential factors against a constant set you should realize that if one element of the set is just a multiple of two others, it is irrelevant and can be removed. This approach is a variation of an ancient algorithm known as the Sieve of Eratosthenes. Trading start-up time for run-time when testing a huge number of candidates:
1. Pick the smallest number >1 in the set
2. Remove any multiples of that number, except itself, from the set
3. Repeat step 2 for the next smallest number, for a certain number of iterations. The number of iterations will depend on the trade-off with start-up time
You are now left with a much smaller set to exhaustively test against. For this to be efficient you either want a data structure for your set that allows O(1) removal, like a linked-list, or just replace "removed" elements with zero and then copy non-zero elements into a new container.
I'm not sure of the question, so let me ask another: Is 12 a factor of [6,33,52]? It is clear that 12 does not divide 6, 33, or 52. But the factors of 12 are 2*2*3 and the factors of 6, 33 and 52 are 2*2*2*3*3*11*13. All of the factors of 12 are present in the set [6,33,52] in sufficient multiplicity, so you could say that 12 is a factor of [6,33,52].
If you say that 12 is not a factor of [6,33,52], then there is no better solution than testing each number for divisibility by 12; simply perform the division and check the remainder. Thus 6%12=6, 33%12=9, and 52%12=4, so 12 is not a factor of [6,33,52]. But if you say that 12 is a factor of [6,33,52], then to determine if a number f is a factor of a set ns, just multiply the numbers ns together sequentially, after each multiplication take the remainder modulo f, report true immediately if the remainder is ever 0, and report false if you reach the end of the list of numbers ns without a remainder of 0.
Let's take two examples. First, is 12 a factor of [6,33,52]? The first (trivial) multiplication results in 6 and gives a remainder of 6. Now 6*33=198, dividing by 12 gives a remainder of 6, and we continue. Now 6*52=312 and 312/12=26r0, so we have a remainder of 0 and the result is true. Second, is 5 a factor of [24,33,52]? The multiplication chain is 24%5=4, (4*33)%5=2, and (2*52)%5=4, so 5 is not a factor of [24,33,52].
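The multiplication chain is easy to express in Python; this sketch uses a made-up function name, `divides_product`, and mirrors the two worked examples.

```python
def divides_product(f, ns):
    """True if f divides the running product of ns at some point."""
    remainder = 1 % f
    for n in ns:
        # Keep only the remainder so the numbers never grow large.
        remainder = (remainder * n) % f
        if remainder == 0:
            return True
    return False

print(divides_product(12, [6, 33, 52]))  # True
print(divides_product(5, [24, 33, 52]))  # False
```

Keep in mind that this answers the "factor of the product" reading of the question, not "divides at least one element".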
A variant of this algorithm was recently used to attack the RSA cryptosystem; you can read about how the attack worked here.
Since the set to be searched is fixed any time spent organising the set for search will be time well spent. If you can get the set in memory, then I expect that a binary tree structure will suit just fine. On average searching for an element in a binary tree is an O(log n) operation.
If you have reason to believe that the numbers in the set are evenly distributed throughout the range [0..10^12] then a binary search of a sorted set in memory ought to perform as well as searching a binary tree. On the other hand, if the middle element in the set (or any subset of the set) is not expected to be close to the middle value in the range encompassed by the set (or subset) then I think the binary tree will have better (practical) performance.
If you can't get the entire set in memory then decomposing it into chunks which will fit into memory and storing those chunks on disk is probably the way to go. You would store the root and upper branches of the set in memory and use them to index onto the disk. The depth of the part of the tree which is kept in memory is something you should decide for yourself, but I'd be surprised if you needed more than the root and 2 levels of branch, giving 8 chunks on disk.
Of course, this only solves part of your problem, finding whether a given number is in the set; you really want to find whether the given number is the factor of any number in the set. As I've suggested in comments I think any approach based on factorising the numbers in the set is hopeless, giving an expected running time beyond polynomial time.
I'd approach this part of the problem the other way round: generate the multiples of the given number and search for each of them. If your set has 10^7 elements then any given number N will have about (10^7)/N multiples in the set. If the given number is drawn at random from the range [0..10^12] the mean value of N is 0.5*10^12, which suggests (counter-intuitively) that in most cases you will only have to search for N itself.
And yes, I am aware that in many cases you would have to search for many more values.
This approach would parallelise relatively easily.
A fast solution which requires some precomputation:
Organize your set in a binary tree with the following rules:
Numbers of the set are on the leaves.
The root of the tree contains r, the minimum of all prime numbers that divide a number of the set.
The left subtree corresponds to the subset of multiples of r (divided by r so that r won't be repeated infinitely).
The right subtree corresponds to the subset of numbers that are not multiples of r.
If you want to test if a number N divides some element of the set, compute its prime decomposition and go through the tree until you reach a leaf. If the leaf contains a number then N divides it, else if the leaf is empty then N divides no element in the set.
Simply calculate the product of the set and mod the result with the test factor.
In your example
{24,33,52} P=41184
Tf 12: 41184 mod 12 = 0 True
Tf 5: 41184 mod 5 = 4 False
The set can be broken into chunks if calculating the product would overflow the arithmetic of the calculator, but huge numbers are possible by storing them as strings.
I was looking at the best & worst case scenarios for a B+Tree (http://en.wikipedia.org/wiki/B-tree#Best_case_and_worst_case_heights) but I don't know how to use this formula with the information I have.
Let's say I have a tree B with 1,000 records, what is the minimum (and maximum) number of levels B can have?
I can have as many or as few keys on each page as I like. I can also have as many or as few pages as I like.
Any ideas?
(In case you are wondering, this is not a homework question, but it will surely help me understand some stuff for hw.)
I don't have the math handy, but...
Basically, the primary factor to tree depth is the "fan out" of each node in the tree.
Normally, in a simple binary tree, the fan out is 2: each node in the tree has 2 nodes as children.
But with a B+Tree, typically they have a fan out much larger.
One factor that comes in to play is the size of the node on disk.
For example, if you have a 4K page size, and, say, 4000 bytes of free space (not including any other pointers or other meta data related to the node), and let's say that a pointer to any other node in the tree is a 4 byte integer. If your B+Tree is in fact storing 4 byte integers, then the combined size (4 bytes of pointer information + 4 bytes of key information) = 8 bytes. 4000 free bytes / 8 bytes == 500 possible children.
That gives you a fan out of 500 for this contrived case.
So, with one page of index, i.e. the root node, or a height of 1 for the tree, you can reference 500 records. Add another level, and you're at 500*500, so for 501 4K pages, you can reference 250,000 rows.
Obviously, the larger the key size, or the smaller the page size of your node, the lower the fan out that the tree is capable of. If you allow variable length keys in each node, then the fan out can easily vary.
But hopefully you can see the gist of how this all works.
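The arithmetic of the contrived example can be written out as a small Python sketch. The function names are hypothetical, and it assumes every page is completely full, as the example does.

```python
def fan_out(free_bytes, key_bytes, pointer_bytes):
    """Children per node: how many (key, pointer) pairs fit in a page."""
    return free_bytes // (key_bytes + pointer_bytes)

def max_records(fanout, height):
    """Records reachable through `height` full levels of index pages."""
    return fanout ** height

def index_pages(fanout, height):
    """Total index pages across all levels when every page is full."""
    return sum(fanout ** level for level in range(height))

f = fan_out(4000, 4, 4)
print(f)                  # 500
print(max_records(f, 2))  # 250000
print(index_pages(f, 2))  # 501
```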
It depends on the arity of the tree. You have to define this value. If you say that each node can have 4 children and you have 1000 records, then the height is approximately
Best case log_4(1000) ≈ 5
Worst case log_{4/2}(1000) ≈ 10
The arity is m and the number of records is n.
The best and worst case depends on the no. of children each node can have. For the best case, we consider the case, when each node has the maximum number of children (i.e. m for an m-ary tree) with each node having m-1 keys. So,
1st level(or root) has m-1 entries
2nd level has m*(m-1) entries (since the root has m children with m-1 keys each)
3rd level has m^2*(m-1) entries
....
Hth level has m^(H-1)*(m-1) entries
Thus, if H is the height of the tree, the total number of entries is equal to n=m^H-1
which is equivalent to H=log_m(n+1)
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the best case height will be equal to log_m(1000+1)
Similarly, for the worst case scenario:
Level 1(root) has at least 1 entry (and minimum 2 children)
2nd level has at least 2*(d-1) entries (where d=ceil(m/2) is the minimum number of children each internal node (except root) can have)
3rd level has 2d*(d-1) entries
...
Hth level has 2*d^(H-2)*(d-1) entries
Thus, if H is the height of the tree, the total number of entries is n = 1 + 2*(d-1)*(1 + d + ... + d^(H-2)) = 2*d^(H-1) - 1, which is equivalent to H = log_d((n+1)/2) + 1
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the worst case height will be equal to log_d((1000+1)/2) + 1
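As a sanity check, the two bounds can be computed with integer arithmetic in Python. This sketch sums the per-level entry counts listed above: a full tree of height H holds m^H - 1 entries, and a minimally filled one holds 1 + 2*(d-1)*(1 + d + ... + d^(H-2)) = 2*d^(H-1) - 1. The function names are made up.

```python
def best_case_height(n, m):
    """Smallest H such that a full tree (m**H - 1 entries) holds n."""
    h = 1
    while m ** h - 1 < n:
        h += 1
    return h

def worst_case_height(n, m):
    """Largest H whose minimum fill, 2 * d**(H - 1) - 1, is at most n."""
    if m < 3:
        raise ValueError("need m >= 3 so that d = ceil(m / 2) >= 2")
    d = -(-m // 2)  # ceil(m / 2)
    h = 1
    # Height h + 1 is possible only if n >= 2 * d**h - 1.
    while 2 * d ** h - 1 <= n:
        h += 1
    return h

print(best_case_height(1000, 5), worst_case_height(1000, 5))  # 5 6
```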
Problem:
Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.
Actual data structure to hold the list is open, performance requirements dictate an in-memory solution, cost to build the data structure is secondary, low cost to query the data structure is critical.
Example:
For a maximum Hamming Distance of 1 (values typically will be quite small)
And input:
00001000100000000000000001111101
The values:
01001000100000000000000001111101
00001000100000000010000001111101
should match because there is only 1 position in which the bits are different.
11001000100000000010000001111101
should not match because 3 bit positions are different.
My thoughts so far:
For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.
If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1?
Question: What do we know about the Hamming distance d(x,y)?
Answer:
It is non-negative: d(x,y) ≥ 0
It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
It is symmetric: d(x,y) = d(y,x)
It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)
Question: Why do we care?
Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.
Metric tree (Wikipedia)
BK-tree (Wikipedia)
M-tree (Wikipedia)
VP-tree (Wikipedia)
Cover tree (Wikipedia)
You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.
Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:
static inline int distance(unsigned x, unsigned y)
{
    /* XOR leaves a 1 in every differing bit position; popcount counts them. */
    return __builtin_popcount(x ^ y);
}
If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.
Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
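To make the "shells" idea concrete, here is a minimal BK-tree sketch in Python. It is not the code that was profiled; the class name, the list-based node layout and the use of bin(...).count("1") for the popcount are all assumptions chosen for brevity.

```python
class BKTree:
    """Minimal BK-tree over integers under the Hamming distance (illustrative sketch)."""

    def __init__(self):
        self.root = None  # each node is [value, {distance: child_node}]

    @staticmethod
    def distance(x, y):
        return bin(x ^ y).count("1")

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d == 0:
                return  # value already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child

    def query(self, target, max_dist):
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(target, value)
            if d <= max_dist:
                results.append(value)
            # Triangle inequality: only shells d-max_dist .. d+max_dist can hold matches.
            for shell in range(max(1, d - max_dist), d + max_dist + 1):
                child = children.get(shell)
                if child is not None:
                    stack.append(child)
        return results
```

The pruning in query is where the metric properties pay off: every shell outside the window around d is skipped without looking at any of its points.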
The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.
Database: 100M pseudorandom integers
Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
Results: Average # of query hits (very approximate)
Speed: Number of queries per second
Coverage: Average percentage of database examined per query
               -- BK Tree --     -- VP Tree --     -- Linear --
Dist  Results  Speed  Cov        Speed  Cov        Speed  Cov
1     0.90     3800   0.048%     4200   0.048%
2     11       300    0.68%      330    0.65%
3     130      56     3.8%       63     3.4%
4     970      18     12%        22     10%
5     5700     8.5    26%        10     22%
6     2.6e4    5.2    42%        6.0    37%
7     1.1e5    3.7    60%        4.1    54%
8     3.5e5    3.0    74%        3.2    70%
9     1.0e6    2.6    85%        2.7    82%
10    2.5e6    2.3    91%        2.4    90%
any                                                2.2    100%
In your comment, you mentioned:
I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.
I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.
A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
I wrote a solution where I represent the input numbers in a bitset of 2^32 bits, so I can check in O(1) whether a certain number is in the input. Then for a queried number and maximum distance, I recursively generate all numbers within that distance and check them against the bitset.
For example for maximum distance 5, this is 242825 numbers (the sum over d = 0 to 5 of (32 choose d)). For comparison, Dietrich Epp's VP-tree solution for example goes through 22% of the 100 million numbers, i.e., through 22 million numbers.
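The idea can be sketched in a few lines of Python. This is not the C code referred to in this answer: the function names are invented, itertools.combinations replaces the recursive generation, and the bits parameter is an assumption added so the sketch can be tried on a small universe without allocating the full 512 MiB bitset that 2^32 bits require.

```python
from itertools import combinations


def build_bitset(values, bits=32):
    """One bit per possible value; for bits=32 this allocates 512 MiB."""
    bitset = bytearray((1 << bits) // 8)
    for v in values:
        bitset[v >> 3] |= 1 << (v & 7)
    return bitset


def within_distance(bitset, target, max_dist, bits=32):
    """Enumerate every number within max_dist of target and keep those present in the bitset."""
    hits = []
    for d in range(max_dist + 1):
        for positions in combinations(range(bits), d):
            candidate = target
            for p in positions:
                candidate ^= 1 << p  # flip the chosen bit positions
            if bitset[candidate >> 3] & (1 << (candidate & 7)):
                hits.append(candidate)
    return hits
```

The work per query depends only on the number of candidates (242825 for distance 5 on 32 bits), not on the size of the input list.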
I used Dietrich's code/solutions as the basis to add my solution and compare it with his. Here are speeds, in queries per second, for maximum distances up to 10:
Dist     BK Tree     VP Tree     Bitset         Linear
1        10,133.83   15,773.69   1,905,202.76   4.73
2        677.78      1,006.95    218,624.08     4.70
3        113.14      173.15      27,022.32      4.76
4        34.06       54.13       4,239.28       4.75
5        15.21       23.81       932.18         4.79
6        8.96        13.23       236.09         4.78
7        6.52        8.37        69.18          4.77
8        5.11        6.15        23.76          4.68
9        4.39        4.83        9.01           4.47
10       3.69        3.94        2.82           4.13
Prepare  4.1s        21.0s       1.52s          0.13s
(Prepare = time for building the data structure before the queries)
For small distances, the bitset solution is by far the fastest of the four. Question author Eric commented below that the largest distance of interest would probably be 4-5. Naturally, my bitset solution becomes slower for larger distances, even slower than the linear search (for distance 32, it would go through 2^32 numbers). But for distance 9 it still easily leads.
I also modified Dietrich's testing. Each of the above results is for letting the algorithm solve at least three queries and as many queries as it can in about 15 seconds (I do rounds with 1, 2, 4, 8, 16, etc queries, until at least 10 seconds have passed in total). That's fairly stable, I even get similar numbers for just 1 second.
My CPU is an i7-6700. My code (based on Dietrich's) is here (ignore the documentation there at least for now, not sure what to do about that, but the tree.c contains all the code and my test.bat shows how I compiled and ran (I used the flags from Dietrich's Makefile)). Shortcut to my solution.
One caveat: My query results contain numbers only once, so if the input list contains duplicate numbers, that may or may not be desired. In question author Eric's case, there were no duplicates (see comment below). In any case, this solution might be good for people who either have no duplicates in the input or don't want or need duplicates in the query results (I think it's likely that the pure query results are only a means to an end and then some other code turns the numbers into something else, for example a map mapping a number to a list of files whose hash is that number).
A common approach (at least common to me) is to divide your bit string into several chunks and query on these chunks for an exact match as a pre-filter step. If you work with files, you create as many files as you have chunks (e.g. 4 here) with each chunk permuted in front and then sort the files. You can use a binary search and you can even expand your search above and below a matching chunk for a bonus.
You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.
So to recap: Say you have a bunch of 32-bit strings in a DB or files and you want to find every hash that is within a 3-bit Hamming distance or less of your "query" bit string:
1. Create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4. Or if you use files, create four files, each being a permutation of the slices having one "islice" at the front of each "row".
2. Slice your query bit string the same way into qslice 1 to 4.
3. Query this table for rows where qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. This gives you every string that shares at least one 8-bit slice with the query string; by the pigeonhole principle that includes every string within 3 bits of it, because 3 differing bits cannot touch all four slices. If using a file, do a binary search in each of the four permuted files for the same results.
4. For each returned bit string, compute the exact Hamming distance pair-wise with your query bit string (reconstructing the index-side bit strings from the four slices either from the DB or from a permuted file).
The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table and is very efficient in practice.
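Here is a minimal in-memory sketch of the slicing pre-filter, using four dictionaries in place of the SQL table or the sorted permuted files. The function names are invented for illustration, and the sketch assumes 32-bit values cut into four 8-bit slices, so the exact-match filter is only guaranteed to be complete for a maximum distance of 3.

```python
from collections import defaultdict


def build_slice_index(values):
    """index[i][s] lists every value whose i-th byte (counted from the low end) equals s."""
    index = [defaultdict(list) for _ in range(4)]
    for v in values:
        for i in range(4):
            index[i][(v >> (8 * i)) & 0xFF].append(v)
    return index


def find_within(index, target, max_dist=3):
    if max_dist > 3:
        # With four slices, 4 or more differing bits could touch every slice.
        raise ValueError("four slices only guarantee completeness up to distance 3")
    candidates = set()
    for i in range(4):
        # Pre-filter: exact match on one slice.
        candidates.update(index[i].get((target >> (8 * i)) & 0xFF, ()))
    # Exact check: full Hamming distance on the (much smaller) candidate set.
    return sorted(v for v in candidates if bin(v ^ target).count("1") <= max_dist)
```

For larger distances the same scheme works with more, narrower slices, at the cost of more candidates per query.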
Furthermore, it is easy to shard the files into smaller files as needed for more speed using parallelism.
Now of course in your case, you are looking for a self-join of sorts, that is all the values that are within some distance of each other. The same approach still works IMHO, though you will have to expand up and down from a starting point for permutations (using files or lists) that share the starting chunk and compute the Hamming distance for the resulting cluster.
If running in memory instead of files, your 100M 32-bit strings data set would be in the range of 4 GB. Hence the four permuted lists may need about 16GB+ of RAM. Though I get excellent results with memory mapped files instead and much less RAM for similar size datasets.
There are open source implementations available. The best in the space is IMHO the one done for Simhash by Moz, C++ but designed for 64 bits strings and not 32 bits.
This bounded Hamming distance approach was first described AFAIK by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:
APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE
[...]
Given bit vectors consisting of d bits each, we choose
N = O(n^(1/(1+ε))) random permutations of the bits. For each
random permutation σ, we maintain a sorted order O_σ of
the bit vectors, in lexicographic order of the bits permuted
by σ. Given a query bit vector q, we find the approximate
nearest neighbor by doing the following:
For each permutation σ, we perform a binary search on O_σ to locate the
two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of
the sorted orders O_σ examining elements above and below
the position returned by the binary search in order of the
length of the longest prefix that matches q.
Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":
3.3 The Results for Algorithm C
We partitioned the bit string of each page into 12 non-overlapping
4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one
piece in common. This approach is guaranteed to find all
pairs of pages with difference up to 11, i.e., C-similarity 373,
but might miss some for larger differences.
This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:
THE HAMMING DISTANCE PROBLEM
Definition: Given a collection of f-bit fingerprints and a
query fingerprint F, identify whether an existing fingerprint
differs from F in at most k bits. (In the batch-mode version
of the above problem, we have a set of query fingerprints
instead of a single query fingerprint)
[...]
Intuition: Consider a sorted table of 2^d f-bit truly random fingerprints. Focus on just the most significant d bits
in the table. A listing of these d-bit numbers amounts to
“almost a counter” in the sense that (a) quite a few 2^d bit-combinations
exist, and (b) very few d-bit combinations are
duplicated. On the other hand, the least significant f − d
bits are “almost random”.
Now choose d′ such that |d′ − d| is a small integer. Since
the table is sorted, a single probe suffices to identify all fingerprints which match F in d′ most significant bit-positions.
Since |d′ − d| is small, the number of such matches is also
expected to be small. For each matching fingerprint, we can
easily figure out if it differs from F in at most k bit-positions
or not (these differences would naturally be restricted to the
f − d′ least-significant bit-positions).
The procedure described above helps us locate an existing
fingerprint that differs from F in k bit-positions, all of which
are restricted to be among the least significant f − d′ bits of
F. This takes care of a fair number of cases. To cover all
the cases, it suffices to build a small number of additional
sorted tables, as formally outlined in the next Section.
Note: I posted a similar answer to a related DB-only question
You could pre-compute every possible variation of your original list within the specified hamming distance, and store it in a bloom filter. This gives you a fast "NO" but not necessarily a clear answer about "YES."
For YES, store a list of all the original values associated with each position in the bloom filter, and go through them one at a time. Optimize the size of your bloom filter for speed / memory trade-offs.
Not sure if it all works exactly, but seems like a good approach if you've got runtime RAM to burn and are willing to spend a very long time in pre-computation.
How about sorting the list and then doing a binary search in that sorted list on the different possible values within your Hamming Distance?
One possible approach to solve this problem is using a Disjoint-set data structure. The idea is to merge list members with Hamming distance <= k into the same set. Here is the outline of the algorithm:
For each list member calculate every possible value with Hamming distance <= k. For k=1, there are 32 values (for 32-bit values). For k=2, 32 + 32*31/2 values.
For each calculated value, test if it is in the original input. You can use an array with size 2^32 or a hash map to do this check.
If the value is in the original input, do a "union" operation with the list member.
Keep the number of union operations executed in a variable.
You start the algorithm with N disjoint sets (where N is the number of elements in the input). Each time you execute a union operation, you decrease by 1 the number of disjoint sets. When the algorithm terminates, the disjoint-set data structure will have all the values with Hamming distance <= k grouped in disjoint sets. This disjoint-set data structure can be calculated in almost linear time.
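The outline above can be sketched as follows. The function name cluster_by_hamming and the bits parameter are assumptions for illustration, a dict doubles as both the membership test and the parent table, and only unions that actually merge two different sets are counted, so that the number of sets always equals N minus the counter.

```python
from itertools import combinations


def cluster_by_hamming(values, k, bits=32):
    """Group values so that members within Hamming distance k end up in the same set."""
    parent = {v: v for v in values}  # also serves as the "is it in the input?" test

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    unions = 0
    for v in list(parent):
        for d in range(1, k + 1):
            for positions in combinations(range(bits), d):
                w = v
                for p in positions:
                    w ^= 1 << p  # a value at distance d from v
                if w in parent:
                    rv, rw = find(v), find(w)
                    if rv != rw:
                        parent[rv] = rw
                        unions += 1
    groups = {}
    for v in parent:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values()), unions
```

Note that the sets are transitive closures: two members of the same set may be farther apart than k if a chain of close neighbours connects them.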
Here's a simple idea: do a byte-wise radix sort of the 100m input integers, most significant byte first, keeping track of bucket boundaries on the first three levels in some external structure.
To query, start with a distance budget of d and your input word w. For each bucket in the top level with byte value b, calculate the Hamming distance d_0 between b and the high byte of w. Recursively search that bucket with a budget of d - d_0: that is, for each byte value b', let d_1 be the Hamming distance between b' and the second byte of w. Recursively search into the third layer with a budget of d - d_0 - d_1, and so on.
Note that the buckets form a tree. Whenever your budget becomes negative, stop searching that subtree. If you recursively descend into a leaf without blowing your distance budget, that leaf value should be part of the output.
Here's one way to represent the external bucket boundary structure: have an array of length 16_777_216 (= (2**8)**3 = 2**24), where the element at index i is the starting index of the bucket containing values in range [256*i, 256*i + 255]. To find the index one beyond the end of that bucket, look up at index i+1 (or use the end of the array for i + 1 = 2**24).
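A minimal sketch of the budgeted descent, assuming the input is simply a sorted Python list. Instead of the precomputed 2**24-entry boundary array, bucket boundaries are located with bisect at query time, which is slower but keeps the example short; the function name search_sorted is invented for illustration.

```python
from bisect import bisect_left


def search_sorted(sorted_values, word, budget):
    """Return all 32-bit values within Hamming distance `budget` of `word`."""
    out = []

    def descend(prefix, level, lo, hi, remaining):
        # sorted_values[lo:hi] holds exactly the values whose top `level` bytes equal prefix.
        if level == 4:
            out.extend(sorted_values[lo:hi])
            return
        shift = 8 * (3 - level)
        w_byte = (word >> shift) & 0xFF
        for b in range(256):
            cost = bin(b ^ w_byte).count("1")
            if cost > remaining:
                continue  # budget would go negative: prune this subtree
            child = (prefix << 8) | b
            start = bisect_left(sorted_values, child << shift, lo, hi)
            end = bisect_left(sorted_values, (child + 1) << shift, lo, hi)
            if start < end:
                descend(child, level + 1, start, end, remaining - cost)

    descend(0, 0, 0, len(sorted_values), budget)
    return out
```

Because byte values are visited in increasing order at every level, the results come back in ascending order.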
Memory budget is 100m * 4 bytes per word = 400 MB for the inputs, and 2**24 * 4 bytes per address = 64 MiB for the indexing structure, or just shy of half a gig in total. The indexing structure is roughly a 17% overhead on the raw data. Of course, once you've constructed the indexing structure you only need to store the lowest byte of each input word, since the other three are implicit in the index into the indexing structure, for a total of ~(64 + 100) MB.
If your input is not uniformly distributed, you could permute the bits of your input words with a (single, universally shared) permutation which puts all the entropy towards the top of the tree. That way, the first level of pruning will eliminate larger chunks of the search space.
I tried some experiments, and this performs about as well as linear search, sometimes even worse. So much for this fancy idea. Oh well, at least it's memory efficient.