Constant time binning of values - algorithm

Say I have a vector of values that represent the upper boundaries of classes to classify (bin) values in. So e.g. vector { 1, 3, 5, 10 } represents bins [0, 1[, [1, 3[, [3, 5[ and [5,10[. How do I implement classification of a random value V in one of these classes (0,1,2,3) in constant time? It's trivial to walk the list of boundaries and stop once V surpasses the bin's upper boundary; but that's O(n) wrt the number of bins; I'm looking to do this in constant time.
I thought it was trivial before I was actually typing the code, by setting up a lookup table, dividing each V by a certain value depending on the class bounds and then using the (rounded) result of the division to find the bin number in the lookup table. But I'm finding it a lot harder than I thought to made this in a generic way that minimizes the size of the lookup table while still being accurate, regardless of the proportional distance between bin boundaries; and in a way that works for all real values. With Google'ing I only find algorithms that determine the boundaries of the bins, at least using the terms I did.

I doubt there's a way to do this in strictly constant time (and not requiring infinite space) without taking advantage of some property of the given numbers.
A lookup table is a decent idea, but floating point values makes this difficult. If the number of digits is finite, you can consider is having the lookup table represented as essentially a trie (a tree where each level represents a digit).
So for {1, 2.5, 5, 9}, your tree would look something like this:
root
/ / / / | \ \ \ \ \
0 1 2 3 4 5 6 7 8 9
/ | \
2.0 ... 2.5 ... 2.9
Each leaf node would contain a value indicating which interval it belongs to, so
0 will be set to 0,
1, 2.0 - 2.4 will all be set to 1,
2.5 - 2.9, 3 - 4 will be set to 2,
5 - 9 will be set to 3
A query would just involve starting from the root and repeatedly going to the child node corresponding to the next digit in the number we're looking up (if you look up 2.65 in the above tree, you first go to 2, then 2.6, then, since it's a leaf, you stop and return it's value, which is 1).
The time complexity for a query would be O(d), where d is the number of significant digits in your vector, and the space complexity is O(nd).
That might not sound particularly efficient, but keep in mind that d is the number of digits - for example, that would be d = log m with m being the maximum possible value if we're talking about positive integers.
O(log n) is fairly trivial if you just set up a binary search tree (BST) containing all the values in the vector mapped to their original indices.
A lookup would look very similar to how you'd search a BST - start from the root and go either left or right until you find the value, except in this case you note every node you visit and return the mapped index of the closest value that's not bigger. Some API's have methods that basically do this for you (such as std::map in C++).

I think the only way to get O(1) is to create a lookup table so that you can look up all the values directly.
This is only feasable if the boundaries are behaving nicely:
The expected numbers are integers or the boundaries are integers or have limited precision. This allows you to round down (floor) the number before checking against the lookup table and drastically reduces the required entries for the table.
The difference between the max and min boundary cannot be too big. Let's say we know that the precision of the boundaries is 0.5 and the min is 1 and the max is 10, then the lookup table requires (10-1)/0.5 = 18 entries.
The checks for the first and last group (smaller than min and greater than max) is done with simple if checks which doesn't affect the complexity.

Related

Searching in vector of pairs

I have a vector of pairs (datatype=double), where each pair is (a,b) and a less than b.For a number x, I want to find out number of pair in vector, where a<=x<=b.
Consider the vector size about 10^6.
My Approach
Sort the vector pair and perform a lower_bound operation for x over "a" in pair then iterate from start till my lower bound value and check for values of "b" which satisfies condition of x<=b.
Time Complexity
N(LogN) where N is vector size.
Issue
I have to perform this over large queries where this approach becomes inefficient.So is there any better solution to decrease the time complexity.
Sorry for my poor English and question formatting.
In addition to the previous answer, here's a suggestion how to prepare the ranges to optimize the subsequent lookup. The idea boils down to precomputing the result for all significantly different input values, but being smart about when values don't differ significantly.
To illustrate what I mean, let's consider this sequence of ranges:
1, 3
1, 8
2, 4
2, 6
The prepared output structure then looks like this:
1, 2 -> 2
2, 3 -> 4
3, 4 -> 3
4, 6 -> 2
6, 8 -> 1
For any number in the range 1, 2, there are two matching ranges in the initial sequence. For any number in the range 2, 3, there are four matches, etc. Note that there are five ranges here now, because some of the input ranges partially overlapped. Since for every range here the end value is also the start value of the next range, the end value can be optimized out. The result then looks like a simple map:
1 -> 2
2 -> 4
3 -> 3
4 -> 2
6 -> 1
8 -> 0
Note here that the last range didn't have one following, so the explicit zero becomes necessary. For the values before the first, that is implied. In order to find the result for a value, just find the key that is less than or equal to that value. This is a simple O(log n) lookup.
Firstly, if you just did a simple scan over the pairs, you would have O(n) complexity! The O(n log n) comes from sorting and for a one-off operation this is just overhead. This might even be the best way to do it, if you don't reuse the results and even if you just perform a few queries, it might still be better than sorting. Make sure you allow yourself to switch out the algorithm.
Anyhow, let's consider that you need to make many queries. Then, one relatively obvious step to improve things is to not iterate step-by-step after sorting. Instead, you can do a binary search for the lower bound. Simply partition the sequence into halves. The lower bound can be found in either half, which you can determine by looking at the middle element between the partitions. Recurse until you found the first element that can not possibly contain the value you search, because its start value is already greater.
Concerning the other direction, things are not that easy. Just because you sorted the ranges by the start value doesn't imply that the end values are sorted, too. Also, ranges that match and ranges that don't can be mixed in the sequence, so here you will have to perform a linear scan.
Lastly, some notes:
You could parallelize this algorithm using multithreading.
Depending on your number of searches M in your outer loop, you could also switch the outer loop with the inner one. That means that for every pair of the input vector, you check each of the M search values whether they fall within the range. This might be better, in particular when the M searches fit into the CPU cache.
This is a very typical style problem in for segment trees, binary indexed trees, interval trees.
There are two operations that you have to carry out on an array arr.
You have two operations on an array arr:
1. Range update: Add(a, b): for(int i = a; i <= b; ++i) arr[i]++
2. Point query : Query(x): return arr[x]
Alternately, you could formulate your problem slightly cleverly.
1. Point Update: Add(a, b): arr[a]++; arr[b+1]--;
2. Range Query: Query(x): return sum(arr[0], arr[1] ..... arr[x]);
In each of the cases above, you have one O(n) operation and one O(1) operation.
For the second case, the query is essentially a prefix sum calculation. Binary Indexed Trees are especially efficient at this task.
Tutorial for Binary Indexed Trees
IMPORTANT IDEA: ARRAY COMPRESSION
You did mention that the vector size is about 10^6, so there is a chance that you may not be able to create an array that big. If you are able to create a set that consists of all the as and bs and xs beforehand, then you can translate them into numbers from 1 to size of set.
SUPER CLEVER IDEA: MO's ALGORITHM
This is only allowed if you are allowed to solve the problem offline. What that means is that you can take all the query points x as input, solve them in any order as you like and store the solution, and then print the solution in the correct order.
Please mention if this is your situation, and only then will I elaborate further on this. But Binary Indexed Trees are going to be more efficient than Mo's algorithm.
EDIT:
Because your interval values are of type double, you must convert them to integers before you use my solution. Let me give an example,
Intervals = (1.1 to 1.9), (1.4 to 2.1)
Query Points = 1.5, 2.0
Here all the points that are of interest are not all the possible doubles, but just the above numbers = {1.1, 1.4, 1.5, 1.9, 2.0, 2.1}
If we map them into positive integers:
1.1 --> 1
1.4 --> 2
1.5 --> 3
1.9 --> 4
2.0 --> 5
2.1 --> 6
Then you could use segment trees/binary indexed trees.
For each pair a,b you can decompose so that a=+1 and b=-1 for the number of ranges valid for a particular value. Then in becomes a simple O(log n) lookup to see how many ranges encompass the search value.

How to find all the binary string of length 9, having 4 ones and rest zeroes and hamming distance of 4 (if we consider any two strings) [duplicate]

Problem:
Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.
Actual data structure to hold the list is open, performance requirements dictate an in-memory solution, cost to build the data structure is secondary, low cost to query the data structure is critical.
Example:
For a maximum Hamming Distance of 1 (values typically will be quite small)
And input:
00001000100000000000000001111101
The values:
01001000100000000000000001111101
00001000100000000010000001111101
should match because there is only 1 position in which the bits are different.
11001000100000000010000001111101
should not match because 3 bit positions are different.
My thoughts so far:
For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.
If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1.
Question: What do we know about the Hamming distance d(x,y)?
Answer:
It is non-negative: d(x,y) ≥ 0
It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
It is symmetric: d(x,y) = d(y,x)
It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)
Question: Why do we care?
Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.
Metric tree (Wikipedia)
BK-tree (Wikipedia)
M-tree (Wikipedia)
VP-tree (Wikipedia)
Cover tree (Wikipedia)
You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.
Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:
static inline int distance(unsigned x, unsigned y)
{
return __builtin_popcount(x^y);
}
If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.
Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperform's GCC's __builtin_popcount by about 160%.
Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.
Database: 100M pseudorandom integers
Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
Results: Average # of query hits (very approximate)
Speed: Number of queries per second
Coverage: Average percentage of database examined per query
-- BK Tree -- -- VP Tree -- -- Linear --
Dist Results Speed Cov Speed Cov Speed Cov
1 0.90 3800 0.048% 4200 0.048%
2 11 300 0.68% 330 0.65%
3 130 56 3.8% 63 3.4%
4 970 18 12% 22 10%
5 5700 8.5 26% 10 22%
6 2.6e4 5.2 42% 6.0 37%
7 1.1e5 3.7 60% 4.1 54%
8 3.5e5 3.0 74% 3.2 70%
9 1.0e6 2.6 85% 2.7 82%
10 2.5e6 2.3 91% 2.4 90%
any 2.2 100%
In your comment, you mentioned:
I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.
I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.
A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
I wrote a solution where I represent the input numbers in a bitset of 232 bits, so I can check in O(1) whether a certain number is in the input. Then for a queried number and maximum distance, I recursively generate all numbers within that distance and check them against the bitset.
For example for maximum distance 5, this is 242825 numbers (sumd = 0 to 5 {32 choose d}). For comparison, Dietrich Epp's VP-tree solution for example goes through 22% of the 100 million numbers, i.e., through 22 million numbers.
I used Dietrich's code/solutions as the basis to add my solution and compare it with his. Here are speeds, in queries per second, for maximum distances up to 10:
Dist BK Tree VP Tree Bitset Linear
1 10,133.83 15,773.69 1,905,202.76 4.73
2 677.78 1,006.95 218,624.08 4.70
3 113.14 173.15 27,022.32 4.76
4 34.06 54.13 4,239.28 4.75
5 15.21 23.81 932.18 4.79
6 8.96 13.23 236.09 4.78
7 6.52 8.37 69.18 4.77
8 5.11 6.15 23.76 4.68
9 4.39 4.83 9.01 4.47
10 3.69 3.94 2.82 4.13
Prepare 4.1s 21.0s 1.52s 0.13s
times (for building the data structure before the queries)
For small distances, the bitset solution is by far the fastest of the four. Question author Eric commented below that the largest distance of interest would probably be 4-5. Naturally, my bitset solution becomes slower for larger distances, even slower than the linear search (for distance 32, it would go through 232 numbers). But for distance 9 it still easily leads.
I also modified Dietrich's testing. Each of the above results is for letting the algorithm solve at least three queries and as many queries as it can in about 15 seconds (I do rounds with 1, 2, 4, 8, 16, etc queries, until at least 10 seconds have passed in total). That's fairly stable, I even get similar numbers for just 1 second.
My CPU is an i7-6700. My code (based on Dietrich's) is here (ignore the documentation there at least for now, not sure what to do about that, but the tree.c contains all the code and my test.bat shows how I compiled and ran (I used the flags from Dietrich's Makefile)). Shortcut to my solution.
One caveat: My query results contain numbers only once, so if the input list contains duplicate numbers, that may or may not be desired. In question author Eric's case, there were no duplicates (see comment below). In any case, this solution might be good for people who either have no duplicates in the input or don't want or need duplicates in the query results (I think it's likely that the pure query results are only a means to an end and then some other code turns the numbers into something else, for example a map mapping a number to a list of files whose hash is that number).
A common approach (at least common to me) is to divide your bit string in several chunks and query on these chunks for an exact match as pre-filter step. If you work with files, you create as many files as you have chunks (e.g. 4 here) with each chunk permuted in front and then sort the files. You can use a binary search and you can even expand you search above and below a matching chunk for bonus.
You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.
So to recap: Say you have a bunch of 32 bits strings in a DB or files and that you want to find every hash that are within a 3 bits hamming distance or less of your "query" bit string:
create a table with four columns: each will contain an 8 bits (as a string or int) slice of the 32 bits hashes, islice 1 to 4. Or if you use files, create four files, each being a permutation of the slices having one "islice" at the front of each "row"
slice your query bit string the same way in qslice 1 to 4.
query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. This gives you every string that are within 7 bits (8 - 1) of the query string. If using a file, do a binary search in each of the four permuted files for the same results.
for each returned bit string, compute the exact hamming distance pair-wise with you query bit string (reconstructing the index-side bit strings from the four slices either from the DB or from a permuted file)
The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table and is very efficient in practice.
Furthermore, it is easy to shard the files in smaller files as need for more speed using parallelism.
Now of course in your case, you are looking for a self-join of sort, that is all the values that are within some distance of each other. The same approach still works IMHO, though you will have to expand up and down from a starting point for permutations (using files or lists) that share the starting chunk and compute the hamming distance for the resulting cluster.
If running in memory instead of files, your 100M 32 bits strings data set would be in the range of 4 GB. Hence the four permuted lists may need about 16GB+ of RAM. Though I get excellent results with memory mapped files instead and must less RAM for similar size datasets.
There are open source implementations available. The best in the space is IMHO the one done for Simhash by Moz, C++ but designed for 64 bits strings and not 32 bits.
This bounded happing distance approach was first described AFAIK by Moses Charikar in its "simhash" seminal paper and the corresponding Google patent:
APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE
[...]
Given bit vectors consisting of d bits each, we choose
N = O(n 1/(1+ ) ) random permutations of the bits. For each
random permutation σ, we maintain a sorted order O σ of
the bit vectors, in lexicographic order of the bits permuted
by σ. Given a query bit vector q, we find the approximate
nearest neighbor by doing the following:
For each permutation σ, we perform a binary search on O σ to locate the
two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of
the sorted orders O σ examining elements above and below
the position returned by the binary search in order of the
length of the longest prefix that matches q.
Monika Henziger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":
3.3 The Results for Algorithm C
We partitioned the bit string of each page into 12 non-
overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one
piece in common. This approach is guaranteed to find all
pairs of pages with difference up to 11, i.e., C-similarity 373,
but might miss some for larger differences.
This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:
THE HAMMING DISTANCE PROBLEM
Definition: Given a collection of f -bit fingerprints and a
query fingerprint F, identify whether an existing fingerprint
differs from F in at most k bits. (In the batch-mode version
of the above problem, we have a set of query fingerprints
instead of a single query fingerprint)
[...]
Intuition: Consider a sorted table of 2 d f -bit truly random fingerprints. Focus on just the most significant d bits
in the table. A listing of these d-bit numbers amounts to
“almost a counter” in the sense that (a) quite a few 2 d bit-
combinations exist, and (b) very few d-bit combinations are
duplicated. On the other hand, the least significant f − d
bits are “almost random”.
Now choose d such that |d − d| is a small integer. Since
the table is sorted, a single probe suffices to identify all fingerprints which match F in d most significant bit-positions.
Since |d − d| is small, the number of such matches is also
expected to be small. For each matching fingerprint, we can
easily figure out if it differs from F in at most k bit-positions
or not (these differences would naturally be restricted to the
f − d least-significant bit-positions).
The procedure described above helps us locate an existing
fingerprint that differs from F in k bit-positions, all of which
are restricted to be among the least significant f − d bits of
F. This takes care of a fair number of cases. To cover all
the cases, it suffices to build a small number of additional
sorted tables, as formally outlined in the next Section.
Note: I posted a similar answer to a related DB-only question
You could pre-compute every possible variation of your original list within the specified hamming distance, and store it in a bloom filter. This gives you a fast "NO" but not necessarily a clear answer about "YES."
For YES, store a list of all the original values associated with each position in the bloom filter, and go through them one at a time. Optimize the size of your bloom filter for speed / memory trade-offs.
Not sure if it all works exactly, but seems like a good approach if you've got runtime RAM to burn and are willing to spend a very long time in pre-computation.
How about sorting the list and then doing a binary search in that sorted list on the different possible values within you Hamming Distance?
One possible approach to solve this problem is using a Disjoint-set data structure. The idea is merge list members with Hamming distance <= k in the same set. Here is the outline of the algorithm:
For each list member calculate every possible value with Hamming distance <= k. For k=1, there are 32 values (for 32-bit values). For k=2, 32 + 32*31/2 values.
For each calculated value, test if it is in the original input. You can use an array with size 2^32 or a hash map to do this check.
If the value is in the original input, do a "union" operation with the list member.
Keep the number of union operations executed in a variable.
You start the algorithm with N disjoint sets (where N is the number of elements in the input). Each time you execute an union operation, you decrease by 1 the number of disjoint sets. When the algorithm terminates, the disjoint-set data structure will have all the values with Hamming distance <= k grouped in disjoint sets. This disjoint-set data structure can be calculated in almost linear time.
Here's a simple idea: do a byte-wise radix sort of the 100m input integers, most significant byte first, keeping track of bucket boundaries on the first three levels in some external structure.
To query, start with a distance budget of d and your input word w. For each bucket in the top level with byte value b, calculate the Hamming distance d_0 between b and the high byte of w. Recursively search that bucket with a budget of d - d_0: that is, for each byte value b', let d_1 be the Hamming distance between b' and the second byte of w. Recursively search into the third layer with a budget of d - d_0 - d_1, and so on.
Note that the buckets form a tree. Whenever your budget becomes negative, stop searching that subtree. If you recursively descend into a leaf without blowing your distance budget, that leaf value should be part of the output.
Here's one way to represent the external bucket boundary structure: have an array of length 16_777_216 (= (2**8)**3 = 2**24), where the element at index i is the starting index of the bucket containing values in range [256*i, 256*i + 255]. To find the index one beyond the end of that bucket, look up at index i+1 (or use the end of the array for i + 1 = 2**24).
Memory budget is 100m * 4 bytes per word = 400 MB for the inputs, and 2**24 * 4 bytes per address = 64 MiB for the indexing structure, or just shy of half a gig in total. The indexing structure is a 6.25% overhead on the raw data. Of course, once you've constructed the indexing structure you only need to store the lowest byte of each input word, since the other three are implicit in the index into the indexing structure, for a total of ~(64 + 50) MB.
If your input is not uniformly distributed, you could permute the bits of your input words with a (single, universally shared) permutation which puts all the entropy towards the top of the tree. That way, the first level of pruning will eliminate larger chunks of the search space.
I tried some experiments, and this performs about as well as linear search, sometimes even worse. So much for this fancy idea. Oh well, at least it's memory efficient.

Randomly choosing from a list with weighted probabilities

I have an array of N elements (representing the N letters of a given alphabet), and each cell of the array holds an integer value, that integer value meaning the number of occurrences in a given text of that letter. Now I want to randomly choose a letter from all of the letters in the alphabet, based on his number of appearances with the given constraints:
If the letter has a positive (nonzero) value, then it can be always chosen by the algorithm (with a bigger or smaller probability, of course).
If a letter A has a higher value than a letter B, then it has to be more likely to be chosen by the algorithm.
Now, taking that into account, I've come up with a simple algorithm that might do the job, but I was just wondering if there was a better thing to do. This seems to be quite fundamental, and I think there might be more clever things to do in order to accomplish this more efficiently. This is the algorithm i thought:
Add up all the frequencies in the array. Store it in SUM
Choosing up a random value from 0 to SUM. Store it in RAN
[While] RAN > 0, Starting from the first, visit each cell in the array (in order), and subtract the value of that cell from RAN
The last visited cell is the chosen one
So, is there a better thing to do than this? Am I missing something?
I'm aware most modern computers can compute this so fast I won't even notice if my algorithm is inefficient, so this is more of a theoretical question rather than a practical one.
I prefer an explained algorithm rather than just code for an answer, but If you're more comfortable providing your answer in code, I have no problem with that.
The idea:
Iterate through all the elements and set the value of each element as the cumulative frequency thus far.
Generate a random number between 1 and the sum of all frequencies
Do a binary search on the values for this number (finding the first value greater than or equal to the number).
Example:
Element A B C D
Frequency 1 4 3 2
Cumulative 1 5 8 10
Generate a random number in the range 1-10 (1+4+3+2 = 10, the same as the last value in the cumulative list), do a binary search, which will return values as follows:
Number Element returned
1 A
2 B
3 B
4 B
5 B
6 C
7 C
8 C
9 D
10 D
The Alias Method has amortized O(1) time per value generated, but requires two uniforms per lookup. Basically, you create a table where each column contains one of the values to be generated, a second value called an alias, and a conditional probability of choosing between the value and its alias. Use your first uniform to pick any of the columns with equal likelihood. Then choose between the primary value and the alias based on your second uniform. It takes a O(n log n) work to initially set up a valid table for n values, but after the table's built generating values is constant time. You can download this Ruby gem to see an actual implementation.
Two other very fast methods by Marsaglia et al. are described here. They have provided C implementations.

Binary search for no uniform distribution

The binary search is highly efficient for uniform distributions. Each member of your list has equal 'hit' probability. That's why you try the center each time.
Is there an efficient algorithm for no uniform distributions ? e.g. a distribution following a 1/x distribution.
There's a deep connection between binary search and binary trees - binary tree is basically a "precalculated" binary search where the cutting points are decided by the structure of the tree, rather than being chosen as the search runs. And as it turns out, dealing with probability "weights" for each key is sometimes done with binary trees.
One reason is because it's a fairly normal binary search tree but known in advance, complete with knowledge of the query probabilities.
Niklaus Wirth covered this in his book "Algorithms and Data Structures", in a few variants (one for Pascal, one for Modula 2, one for Oberon), at least one of which is available for download from his web site.
Binary trees aren't always binary search trees, though, and one use of a binary tree is to derive a Huffman compression code.
Either way, the binary tree is constructed by starting with the leaves separate and, at each step, joining the two least likely subtrees into a larger subtree until there's only one subtree left. To efficiently pick the two least likely subtrees at each step, a priority queue data structure is used - perhaps a binary heap.
A binary tree that's built once then never modified can have a number of uses, but one that can be efficiently updated is even more useful. There are some weight-balanced binary tree data structures out there, but I'm not familiar with them. Beware - the term "weight balanced" is commonly used where each node always has weight 1, but subtree weights are approximately balanced. Some of these may be adaptable for varied node weights, but I don't know for certain.
Anyway, for a binary search in an array, the problem is that it's possible to use an arbitrary probability distribution, but inefficient. For example, you could have a running-total-of-weights array. For each iteration of your binary search, you want to determine the half-way-through-the-probability distribution point, so you determine the value for that then search the running-total-of-weights array. You get the perfectly weight-balanced next choice for your main binary search, but you had to do a complete binary search into your running total array to do it.
The principle works, however, if you can determine that weighted mid-point without searching for a known probability distribution. The principle is the same - you need the integral of your probability distribution (replacing the running total array) and when you need a mid-point, you choose it to get an exact centre value for the integral. That's more an algebra issue than a programming issue.
One problem with a weighted binary search like this is that the worst-case performance is worse - usually by constant factors but, if the distribution is skewed enough, you may end up with effectively a linear search. If your assumed distribution is correct, the average-case performance is improved despite the occasional slow search, but if your assumed distribution is wrong you could pay for that when many searches are for items that are meant to be unlikely according to that distribution. In the binary tree form, the "unlikely" nodes are further from the root than they would be in a simply balanced (flat probability distribution assumed) binary tree.
A flat probability distribution assumption works very well even when it's completely wrong - the worst case is good, and the best and average cases must be at least that good by definition. The further you move from a flat distribution, the worse things can be if actual query probabilities turn out to be very different from your assumptions.
Let me make it precise. What you want for binary search is:
Given array A which is sorted, but have non-uniform distribution
Given left & right index L & R of search range
Want to search for a value X in A
To apply binary search, we want to find the index M in [L,R]
as the next position to look at.
Where the value X should have equal chances to be in either range [L,M-1] or [M+1,R]
In general, you of course want to pick M where you think X value should be in A.
Because even if you miss, half the total 'chance' would be eliminated.
So it seems to me you have some expectation about distribution.
If you could tell us what exactly do you mean by '1/x distribution', then
maybe someone here can help build on my suggestion for you.
Let me give a worked example.
I'll use similar interpretation of '1/x distribution' as #Leonid Volnitsky
Here is a Python code that generate the input array A
from random import uniform
# Generating input
a,b = 10,20
A = [ 1.0/uniform(a,b) for i in range(10) ]
A.sort()
# example input (rounded)
# A = [0.0513, 0.0552, 0.0562, 0.0574, 0.0576, 0.0602, 0.0616, 0.0721, 0.0728, 0.0880]
Let assume the value to search for is:
X = 0.0553
Then the estimated index of X is:
= total number of items * cummulative probability distribution up to X
= length(A) * P(x <= X)
So how to calculate P(x <= X) ?
It this case it is simple.
We reverse X back to the value between [a,b] which we will call
X' = 1/X ~ 18
Hence
P(x <= X) = (b-X')/(b-a)
= (20-18)/(20-10)
= 2/10
So the expected position of X is:
10*(2/10) = 2
Well, and that's pretty damn accurate!
To repeat the process on predicting where X is in each given section of A require some more work. But I hope this sufficiently illustrate my idea.
I know this might not seems like a binary search anymore
if you can get that close to the answer in just one step.
But admit it, this is what you can do if you know the distribution of input array.
The purpose of a binary search is that, for an array that is sorted, every time you half the array you are minimizing the worst case, e.g. the worst possible number of checks you can do is log2(entries). If you do some kind of an 'uneven' binary search, where you divide the array into a smaller and larger half, if the element is always in the larger half you can have worse worst case behaviour. So, I think binary search would still be the best algorithm to use regardless of expected distribution, just because it has the best worse case behaviour.
You have a vector of entries, say [x1, x2, ..., xN], and you're aware of the fact that the distribution of the queries is given with probability 1/x, on the vector you have. This means your queries will take place with that distribution, i.e., on each consult, you'll take element xN with higher probability.
This causes your binary search tree to be balanced considering your labels, but not enforcing any policy on the search. A possible change on this policy would be to relax the constraint of a balanced binary search tree -- smaller to the left of the parent node, greater to the right --, and actually choosing the parent nodes as the ones with higher probabilities, and their child nodes as the two most probable elements.
Notice this is not a binary search tree, as you are not dividing your search space by two in every step, but rather a rebalanced tree, with respect to your search pattern distribution. This means you're worst case of search may reach O(N). For example, having v = [10, 20, 30, 40, 50, 60]:
30
/ \
20 50
/ / \
10 40 60
Which can be reordered, or, rebalanced, using your function f(x) = 1 / x:
f([10, 20, 30, 40, 50, 60]) = [0.100, 0.050, 0.033, 0.025, 0.020, 0.016]
sort(v, f(v)) = [10, 20, 30, 40, 50, 60]
Into a new search tree, that looks like:
10 -------------> the most probable of being taken
/ \ leaving v = [[20, 30], [40, 50, 60]]
20 30 ---------> the most probable of being taken
/ \ leaving v = [[40, 50], [60]]
40 50 -------> the most probable of being taken
/ leaving v = [[60]]
60
If you search for 10, you only need one comparison, but if you're looking for 60, you'll perform O(N) comparisons, which does not qualifies this as a binary search. As pointed by #Steve314, the farthest you go from a fully balanced tree, the worse will be your worst case of search.
I will assume from your description:
X is uniformly distributed
Y=1/X is your data which you want to search and it is stored in sorted table
given value y, you need to binary search it in the above table
Binary search usually uses value in center of range (median). For uniform distribution it is possible to to speed up search by knowing approximately where in the table to we need to look for searched value.
For example if we have uniformly distributed values in [0,1] range and query is for 0.25, it is best to look not in center of range but in 1st quarter of the range.
To use the same technique for 1/X data, store in table not Y but inverse 1/Y. Search not for y but for inverse value 1/y.
Unweighted binary search isn't even optimal for uniformly distributed keys in expected terms, but it is in worst case terms.
The proportionally weighted binary search (which I have been using for decades) does what you want for uniform data, and by applying an implicit or explicit transform for other distributions. The sorted hash table is closely related (and I've known about this for decades but never bothered to try it).
In this discussion I will assume that the data is uniformly selected from 1..N and in an array of size N indexed by 1..N. If it has a different solution, e.g. a Zipfian distribution where the value is proportional to 1/index, you can apply an inverse function to flatten the distribution, or the Fisher Transform will often help (see Wikipedia).
Initially you have 1..N as the bounds, but in fact you may know the actual Min..Max. In any case we will assume we always have a closed interval [Min,Max] for the index range [L..R] we are currently searching, and initially this is O(N).
We are looking for key K and want index I so that
[I-R]/[K-Max]=[L-I]/[Min-K]=[L-R]/[Min-Max] e.g. I = [R-L]/[Max-Min]*[Max-K] + L.
Round so that the smaller partition gets larger rather than smaller (to help worst case). The expected absolute and root mean square error is <√[R-L] (based on a Poisson/Skellam or a Random Walk model - see Wikipedia). The expected number of steps is thus O(loglogN).
The worst case can be constrained to be O(logN) in several ways. First we can decide what constant we regard as acceptable, perhaps requiring steps 1. Proceeding for loglogN steps as above, and then using halving will achieve this for any such c.
Alternatively we can modify the standard base b=B=2 of the logarithm so b>2. Suppose we take b=8, then effectively c~b/B. we can then modify the rounding above so that at step k the largest partition must be at most N*b^-k. Viz keep track of the size expected if we eliminate 1/b from consideration each step which leads to worst case b/2 lgN. This will however bring our expected case back to O(log N) as we are only allowed to reduce the small partition by 1/b each time. We can restore the O(loglog N) expectation by using simple uprounding of the small partition for loglogN steps before applying the restricted rounding. This is appropriate because within a burst expected to be local to a particular value, the distribution is approximately uniform (that is for any smooth distribution function, e.g. in this case Skellam, any sufficiently small segment is approximately linear with slope given by its derivative at the centre of the segment).
As for the sorted hash, I thought I read about this in Knuth decades ago, but can't find the reference. The technique involves pushing rather than probing - (possibly weighted binary) search to find the right place or a gap then pushing aside to make room as needed, and the hash function must respect the ordering. This pushing can wrap around and so a second pass through the table is needed to pick them all up - it is useful to track Min and Max and their indexes (to get forward or reverse ordered listing start at one and track cyclically to the other; they can then also be used instead of 1 and N as initial brackets for the search as above; otherwise 1 and N can be used as surrogates).
If the load factor alpha is close to 1, then insertion is expected O(√N) for expected O(√N) items, which still amortizes to O(1) on average. This cost is expected to decrease exponentially with alpha - I believe (under Poisson assumptions) that μ ~ σ ~ √[Nexp(α)].
The above proportionally weighted binary search can used to improve on the initial probe.

Quickly determine if number divides any element in a set

Is there an algorithm that can quickly determine if a number is a factor of a given set of numbers ?
For example, 12 is a factor of [24,33,52] while 5 is not.
Is there a better approach than linear search O(n)? The set will contain a few million elements. I don't need to find the number, just a true or false result.
If a large number of numbers are checked against a constant list one possible approach to speed up the process is to factorize the numbers in the list into their prime factors first. Then put the list members in a dictionary and have the prime factors as the keys. Then when a number (potential factor) comes you first factorize it into its prime factors and then use the constructed dictionary to check whether the number is a factor of the numbers which can be potentially multiples of the given number.
I think in general O(n) search is what you will end up with. However, depending on how large the numbers are in general, you can speed up the search considerably assuming that the set is sorted (you mention that it can be) by observing that if you are searching to find a number divisible by D and you have currently scanned x and x is not divisible by D, the next possible candidate is obviously at floor([x + D] / D) * D. That is, if D = 12 and the list is
5 11 13 19 22 25 27
and you are scanning at 13, the next possible candidate number would be 24. Now depending on the distribution of your input, you can scan forwards using binary search instead of linear search, as you are searching now for the least number not less than 24 in the list, and the list is sorted. If D is large then you might save lots of comparisons in this way.
However from pure computational complexity point of view, sorting and then searching is going to be O(n log n), whereas just a linear scan is O(n).
For testing many potential factors against a constant set you should realize that if one element of the set is just a multiple of two others, it is irrelevant and can be removed. This approach is a variation of an ancient algorithm known as the Sieve of Eratosthenes. Trading start-up time for run-time when testing a huge number of candidates:
Pick the smallest number >1 in the set
Remove any multiples of that number, except itself, from the set
Repeat 2 for the next smallest number, for a certain number of iterations. The number of iterations will depend on the trade-off with start-up time
You are now left with a much smaller set to exhaustively test against. For this to be efficient you either want a data structure for your set that allows O(1) removal, like a linked-list, or just replace "removed" elements with zero and then copy non-zero elements into a new container.
I'm not sure of the question, so let me ask another: Is 12 a factor of [6,33,52]? It is clear that 12 does not divide 6, 33, or 52. But the factors of 12 are 2*2*3 and the factors of 6, 33 and 52 are 2*2*2*3*3*11*13. All of the factors of 12 are present in the set [6,33,52] in sufficient multiplicity, so you could say that 12 is a factor of [6,33,52].
If you say that 12 is not a factor of [6,33,52], then there is no better solution than testing each number for divisibility by 12; simply perform the division and check the remainder. Thus 6%12=6, 33%12=9, and 52%12=4, so 12 is not a factor of [6.33.52]. But if you say that 12 is a factor of [6,33,52], then to determine if a number f is a factor of a set ns, just multiply the numbers ns together sequentially, after each multiplication take the remainder modulo f, report true immediately if the remainder is ever 0, and report false if you reach the end of the list of numbers ns without a remainder of 0.
Let's take two examples. First, is 12 a factor of [6,33,52]? The first (trivial) multiplication results in 6 and gives a remainder of 6. Now 6*33=198, dividing by 12 gives a remainder of 6, and we continue. Now 6*52=312 and 312/12=26r0, so we have a remainder of 0 and the result is true. Second, is 5 a factor of [24,33,52]? The multiplication chain is 24%5=5, (5*33)%5=2, and (2*52)%5=4, so 5 is not a factor of [24,33,52].
A variant of this algorithm was recently used to attack the RSA cryptosystem; you can read about how the attack worked here.
Since the set to be searched is fixed any time spent organising the set for search will be time well spent. If you can get the set in memory, then I expect that a binary tree structure will suit just fine. On average searching for an element in a binary tree is an O(log n) operation.
If you have reason to believe that the numbers in the set are evenly distributed throughout the range [0..10^12] then a binary search of a sorted set in memory ought to perform as well as searching a binary tree. On the other hand, if the middle element in the set (or any subset of the set) is not expected to be close to the middle value in the range encompassed by the set (or subset) then I think the binary tree will have better (practical) performance.
If you can't get the entire set in memory then decomposing it into chunks which will fit into memory and storing those chunks on disk is probably the way to go. You would store the root and upper branches of the set in memory and use them to index onto the disk. The depth of the part of the tree which is kept in memory is something you should decide for yourself, but I'd be surprised if you needed more than the root and 2 levels of branch, giving 8 chunks on disk.
Of course, this only solves part of your problem, finding whether a given number is in the set; you really want to find whether the given number is the factor of any number in the set. As I've suggested in comments I think any approach based on factorising the numbers in the set is hopeless, giving an expected running time beyond polynomial time.
I'd approach this part of the problem the other way round: generate the multiples of the given number and search for each of them. If your set has 10^7 elements then any given number N will have about (10^7)/N multiples in the set. If the given number is drawn at random from the range [0..10^12] the mean value of N is 0.5*10^12, which suggests (counter-intuitively) that in most cases you will only have to search for N itself.
And yes, I am aware that in many cases you would have to search for many more values.
This approach would parallelise relatively easily.
A fast solution which requires some precomputation:
Organize your set in a binary tree with the following rules:
Numbers of the set are on the leaves.
The root of the tree contains r the minimum of all prime numbers that divide a number of the set.
The left subtree correspond to the subset of multiples of r (divided by r so that r won't be repeated infinitly).
The right subtree correspond to the subset of numbers not multiple of r.
If you want to test if a number N divides some element of the set, compute its prime decomposition and go through the tree until you reach a leaf. If the leaf contains a number then N divides it, else if the leaf is empty then N divides no element in the set.
Simply calculate the product of the set and mod the result with the test factor.
In your example
{24,33,52} P=41184
Tf 12: 41184 mod 12 = 0 True
Tf 5: 41184 mod 5 = 4 False
The set can be broken into chunks if calculating the product would overflow the arithmetic of the calculator, but huge numbers are possible by storing a strings.

Resources