Binary search for non-uniform distribution - performance

Binary search is highly efficient for uniform distributions: each member of your list has an equal 'hit' probability, which is why you try the center each time.
Is there an efficient algorithm for non-uniform distributions, e.g. a distribution following a 1/x distribution?

There's a deep connection between binary search and binary trees - binary tree is basically a "precalculated" binary search where the cutting points are decided by the structure of the tree, rather than being chosen as the search runs. And as it turns out, dealing with probability "weights" for each key is sometimes done with binary trees.
One example is the optimal binary search tree: it's a fairly normal binary search tree, but the keys are known in advance, complete with knowledge of the query probabilities.
Niklaus Wirth covered this in his book "Algorithms and Data Structures", in a few variants (one for Pascal, one for Modula 2, one for Oberon), at least one of which is available for download from his web site.
Binary trees aren't always binary search trees, though, and one use of a binary tree is to derive a Huffman compression code.
Either way, the binary tree is constructed by starting with the leaves separate and, at each step, joining the two least likely subtrees into a larger subtree until there's only one subtree left. To efficiently pick the two least likely subtrees at each step, a priority queue data structure is used - perhaps a binary heap.
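To make the "join the two least likely subtrees" step concrete, here is a minimal Python sketch using the standard heapq module as the priority queue. The function names build_huffman_tree and depths, and the nested-tuple tree representation, are my own choices for illustration. Note that this builds a Huffman-style tree, which does not preserve key order, so it is not a search tree. It assumes a non-empty dict of weights.

```python
import heapq
from itertools import count

def build_huffman_tree(weights):
    """Build a Huffman-style tree from a non-empty {symbol: weight} dict.

    Leaves are the symbols themselves; internal nodes are (left, right) tuples.
    """
    tiebreak = count()  # avoids comparing subtrees when weights are equal
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pop the two least likely subtrees and join them into one.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    return heap[0][2]

def depths(tree, depth=0, out=None):
    """Return {symbol: depth}; symbols must not themselves be tuples."""
    out = {} if out is None else out
    if isinstance(tree, tuple):
        depths(tree[0], depth + 1, out)
        depths(tree[1], depth + 1, out)
    else:
        out[tree] = depth
    return out

tree = build_huffman_tree({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(depths(tree))
```

The likely symbols end up near the root and the unlikely ones further down, which is the same effect described for weighted trees in general.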
A binary tree that's built once then never modified can have a number of uses, but one that can be efficiently updated is even more useful. There are some weight-balanced binary tree data structures out there, but I'm not familiar with them. Beware - the term "weight balanced" is commonly used where each node always has weight 1, but subtree weights are approximately balanced. Some of these may be adaptable for varied node weights, but I don't know for certain.
Anyway, for a binary search in an array, the problem is that it's possible to use an arbitrary probability distribution, but inefficient. For example, you could have a running-total-of-weights array. For each iteration of your binary search, you want to determine the half-way-through-the-probability distribution point, so you determine the value for that then search the running-total-of-weights array. You get the perfectly weight-balanced next choice for your main binary search, but you had to do a complete binary search into your running total array to do it.
The principle works, however, if you can determine that weighted mid-point without searching for a known probability distribution. The principle is the same - you need the integral of your probability distribution (replacing the running total array) and when you need a mid-point, you choose it to get an exact centre value for the integral. That's more an algebra issue than a programming issue.
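As a rough sketch of the running-total idea, the Python below picks each probe so that the query weight is split as evenly as possible, using a prefix-sum array and the standard bisect module for the inner search. The name weighted_search and the (index, probes) return value are assumptions of mine, not anything from a library. With a closed-form integral of the distribution you would replace the inner bisect call with a direct calculation.

```python
from bisect import bisect_left
from itertools import accumulate

def weighted_search(keys, weights, target):
    """Search sorted `keys`, probing at the weighted midpoint each step.

    weights[i] is the (non-negative) query weight of keys[i].
    Returns (index or -1, number of probes into keys).
    """
    prefix = [0.0] + list(accumulate(weights))  # prefix[i] = sum(weights[:i])
    lo, hi = 0, len(keys) - 1
    probes = 0
    while lo <= hi:
        # Half-way point of the remaining probability mass.
        half = (prefix[lo] + prefix[hi + 1]) / 2.0
        # Inner binary search into the running totals: find the element
        # whose weight interval contains `half`.
        mid = bisect_left(prefix, half, lo + 1, hi + 2) - 1
        mid = min(max(mid, lo), hi)  # guard against rounding surprises
        probes += 1
        if keys[mid] == target:
            return mid, probes
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

keys = [10, 20, 30, 40, 50, 60]
weights = [1.0 / k for k in keys]  # a 1/x-style query distribution
print(weighted_search(keys, weights, 10))
print(weighted_search(keys, weights, 60))
```

The inner bisect is exactly the extra cost complained about above: every outer step pays for a full binary search into the running totals.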
One problem with a weighted binary search like this is that the worst-case performance is worse - usually by constant factors but, if the distribution is skewed enough, you may end up with effectively a linear search. If your assumed distribution is correct, the average-case performance is improved despite the occasional slow search, but if your assumed distribution is wrong you could pay for that when many searches are for items that are meant to be unlikely according to that distribution. In the binary tree form, the "unlikely" nodes are further from the root than they would be in a simply balanced (flat probability distribution assumed) binary tree.
A flat probability distribution assumption works very well even when it's completely wrong - the worst case is good, and the best and average cases must be at least that good by definition. The further you move from a flat distribution, the worse things can be if actual query probabilities turn out to be very different from your assumptions.

Let me make it precise. What you want for binary search is:
Given an array A which is sorted, but has a non-uniform distribution
Given the left & right indices L & R of the search range
Want to search for a value X in A
To apply binary search, we want to find the index M in [L,R]
as the next position to look at,
where the value X has equal chances of being in either range [L,M-1] or [M+1,R].
In general, you of course want to pick M where you think the value X should be in A,
because even if you miss, half the total 'chance' would be eliminated.
So it seems to me you have some expectation about the distribution.
If you could tell us what exactly you mean by '1/x distribution', then
maybe someone here can help build on my suggestion for you.
Let me give a worked example.
I'll use a similar interpretation of '1/x distribution' as @Leonid Volnitsky.
Here is Python code that generates the input array A:
from random import uniform
# Generating input
a,b = 10,20
A = [ 1.0/uniform(a,b) for i in range(10) ]
A.sort()
# example input (rounded)
# A = [0.0513, 0.0552, 0.0562, 0.0574, 0.0576, 0.0602, 0.0616, 0.0721, 0.0728, 0.0880]
Let's assume the value to search for is:
X = 0.0553
Then the estimated index of X is:
= total number of items * cumulative probability distribution up to X
= length(A) * P(x <= X)
So how do we calculate P(x <= X)?
In this case it is simple.
We reverse X back to the value between [a,b] which we will call
X' = 1/X ~ 18
Hence
P(x <= X) = (b-X')/(b-a)
= (20-18)/(20-10)
= 2/10
So the expected position of X is:
10*(2/10) = 2
Well, and that's pretty damn accurate!
Repeating the process to predict where X is within each given section of A requires some more work, but I hope this sufficiently illustrates my idea.
I know this might not seem like a binary search anymore
if you can get that close to the answer in just one step.
But admit it, this is what you can do if you know the distribution of the input array.
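The estimate above can be wrapped in a small helper. This is only a sketch for this particular setup (values generated as 1/uniform(a, b)); the name estimate_index and the rounding and clamping choices are mine.

```python
def estimate_index(x, n, a, b):
    """Estimate where x sits in a sorted array of n values of 1/uniform(a, b)."""
    x_inv = 1.0 / x                    # map back into [a, b]
    p = (b - x_inv) / (b - a)          # P(value <= x) under this model
    p = min(1.0, max(0.0, p))          # clamp queries outside the range
    return min(n - 1, int(round(n * p)))

# Same numbers as the worked example: X = 0.0553, 10 items, a = 10, b = 20
print(estimate_index(0.0553, 10, 10, 20))  # 2
```

To narrow down further you would compare A[estimate] with X and repeat on the remaining section, using the values at the section's ends in place of a and b.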

The purpose of a binary search is that, for an array that is sorted, every time you halve the array you are minimizing the worst case, i.e. the worst possible number of checks you can do is log2(entries). If you do some kind of 'uneven' binary search, where you divide the array into a smaller and a larger part, and the element is always in the larger part, you can have worse worst-case behaviour. So I think binary search would still be the best algorithm to use regardless of the expected distribution, just because it has the best worst-case behaviour.

You have a vector of entries, say [x1, x2, ..., xN], and you know that the queries are distributed with probability proportional to 1/x over that vector. This means that, on each lookup, the smaller elements are requested with higher probability.
A standard binary search tree is balanced with respect to your labels, but it does not take the query distribution into account. A possible change to this policy would be to relax the constraint of a balanced binary search tree -- smaller to the left of the parent node, greater to the right -- and instead choose as parent nodes the elements with the highest probabilities, and as their child nodes the next two most probable elements.
Notice this is not a binary search tree, as you are not dividing your search space by two at every step, but rather a tree rebalanced with respect to your query distribution. This means your worst case of search may reach O(N). For example, having v = [10, 20, 30, 40, 50, 60]:
      30
     /  \
   20    50
  /     /  \
10    40    60
Which can be reordered, or, rebalanced, using your function f(x) = 1 / x:
f([10, 20, 30, 40, 50, 60]) = [0.100, 0.050, 0.033, 0.025, 0.020, 0.017]
sort(v, f(v)) = [10, 20, 30, 40, 50, 60]
Into a new search tree, that looks like:
  10           -------------> the most probable of being taken
 /  \                         leaving v = [[20, 30], [40, 50, 60]]
20   30        ---------> the most probable of being taken
    /  \                  leaving v = [[40, 50], [60]]
  40    50     -------> the most probable of being taken
       /                leaving v = [[60]]
     60
If you search for 10, you only need one comparison, but if you're looking for 60, you'll perform O(N) comparisons, which does not qualify as a binary search. As pointed out by @Steve314, the farther you go from a fully balanced tree, the worse your worst-case search will be.

I will assume from your description:
X is uniformly distributed
Y=1/X is your data which you want to search and it is stored in sorted table
given value y, you need to binary search it in the above table
Binary search usually probes the value in the center of the range (the median). For a uniform distribution it is possible to speed up the search by knowing approximately where in the table we need to look for the searched value.
For example if we have uniformly distributed values in [0,1] range and query is for 0.25, it is best to look not in center of range but in 1st quarter of the range.
To use the same technique for 1/X data, store in table not Y but inverse 1/Y. Search not for y but for inverse value 1/y.

Unweighted binary search isn't even optimal for uniformly distributed keys in expected terms, but it is in worst case terms.
The proportionally weighted binary search (which I have been using for decades) does what you want for uniform data, and by applying an implicit or explicit transform for other distributions. The sorted hash table is closely related (and I've known about this for decades but never bothered to try it).
In this discussion I will assume that the data is uniformly selected from 1..N and in an array of size N indexed by 1..N. If it has a different distribution, e.g. a Zipfian distribution where the value is proportional to 1/index, you can apply an inverse function to flatten the distribution, or the Fisher Transform will often help (see Wikipedia).
Initially you have 1..N as the bounds, but in fact you may know the actual Min..Max. In any case we will assume we always have a closed interval [Min,Max] for the index range [L..R] we are currently searching, and initially this is O(N).
We are looking for key K and want index I so that
[I-R]/[K-Max]=[L-I]/[Min-K]=[L-R]/[Min-Max], i.e. I = [R-L]/[Max-Min]*[K-Min] + L.
Round so that the smaller partition gets larger rather than smaller (to help worst case). The expected absolute and root mean square error is <√[R-L] (based on a Poisson/Skellam or a Random Walk model - see Wikipedia). The expected number of steps is thus O(loglogN).
The worst case can be constrained to be O(logN) in several ways. First we can decide what constant we regard as acceptable, perhaps requiring at most c*logN steps for some constant c > 1. Proceeding for loglogN steps as above, and then using halving will achieve this for any such c.
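Here is a rough Python sketch of that first option: take proportional (interpolated) probes for about loglogN steps, then fall back to plain halving so the worst case stays logarithmic. It uses 0-based indices and integer keys, and it does not implement the special rounding rule described above. The function name and the exact step budget are my own assumptions.

```python
import math

def guarded_interpolation_search(arr, key):
    """Search sorted integer list `arr` for `key`; return its index or -1."""
    n = len(arr)
    lo, hi = 0, n - 1
    # Roughly log2(log2(n)) proportional probes before switching to halving.
    budget = max(1, math.ceil(math.log2(max(2.0, math.log2(max(2, n))))))
    steps = 0
    while lo <= hi:
        if key < arr[lo] or key > arr[hi]:
            return -1                      # key cannot be in this interval
        if steps < budget and arr[hi] != arr[lo]:
            # Proportional probe: I = L + (R-L)*(K-Min)/(Max-Min)
            mid = lo + (hi - lo) * (key - arr[lo]) // (arr[hi] - arr[lo])
        else:
            mid = (lo + hi) // 2           # ordinary halving
        steps += 1
        if arr[mid] == key:
            return mid
        if arr[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(guarded_interpolation_search(list(range(1, 101)), 37))  # 36
```

On badly skewed data the proportional probes are wasted, but there are only about loglogN of them, so the halving phase still bounds the total.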
Alternatively we can modify the standard base b=B=2 of the logarithm so b>2. Suppose we take b=8, then effectively c~b/B. we can then modify the rounding above so that at step k the largest partition must be at most N*b^-k. Viz keep track of the size expected if we eliminate 1/b from consideration each step which leads to worst case b/2 lgN. This will however bring our expected case back to O(log N) as we are only allowed to reduce the small partition by 1/b each time. We can restore the O(loglog N) expectation by using simple uprounding of the small partition for loglogN steps before applying the restricted rounding. This is appropriate because within a burst expected to be local to a particular value, the distribution is approximately uniform (that is for any smooth distribution function, e.g. in this case Skellam, any sufficiently small segment is approximately linear with slope given by its derivative at the centre of the segment).
As for the sorted hash, I thought I read about this in Knuth decades ago, but can't find the reference. The technique involves pushing rather than probing - (possibly weighted binary) search to find the right place or a gap then pushing aside to make room as needed, and the hash function must respect the ordering. This pushing can wrap around and so a second pass through the table is needed to pick them all up - it is useful to track Min and Max and their indexes (to get forward or reverse ordered listing start at one and track cyclically to the other; they can then also be used instead of 1 and N as initial brackets for the search as above; otherwise 1 and N can be used as surrogates).
If the load factor alpha is close to 1, then insertion is expected O(√N) for expected O(√N) items, which still amortizes to O(1) on average. This cost is expected to decrease exponentially with alpha - I believe (under Poisson assumptions) that μ ~ σ ~ √[Nexp(α)].
The above proportionally weighted binary search can be used to improve on the initial probe.

Related

Updating & Querying all elements in array >= X where X is variable fast

Formally we are given an array with some initial values. Then we have 3 types of Queries :-
Point updates : Increment by 1 at a given position
Range Queries : To count number of elements>=x where x is taken as input
Range Updates : To decrement by 1 all elements>=x, where x is given as input.
N=10^5 , Q=10^5 (number of elements in array, number of queries resp.)
I tried doing this with a segment tree, but operations 2 and 3 can be O(n) or worse, since we don't know exactly which 'range' is to be updated, so we may end up traversing the whole segment tree.
NOTE: To be clear, I need all 3 operations to run in logarithmic worst-case time, i.e. O(log n), because only then is this fast enough. A linear approach doesn't work: with Q=10^5 and N=10^5 the worst case could be O(n^2), i.e. 10^10 operations, which is clearly not feasible.
Given that you're talking about 10^5 items, and don't mention needing to add or remove items, it seems to me that the obvious data structure would be a simple sorted vector.
Operation complexities:
point update: O(1) + O(m) (where m is the number of subsequent elements equal to the value before the update).
Range query: O(log n) + O(m) (where n is start of range, m is elements in range).
Range update (same as range query).
It's a little difficult to be sure what "fast" means to you, but the fastest theoretically possible for 1 is O(1), so we're already within some constant factor of optimal.
For 2 and 3, even if we could do the find with constant complexity, we're pretty much stuck with O(m) for the update. Since log2(100000) ≈ 16.6, most of the time the O(m) term is going to dominate (i.e., the update part will involve at least as many operations as the search unless the given x is one of the last 17 items in the collection).
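A minimal sketch of operations 2 and 3 on a sorted Python list, assuming integer values. The helper names are made up for illustration, and the mapping from original array positions to slots in the sorted list (needed for point updates) is left out.

```python
from bisect import bisect_left

def count_ge(sorted_vals, x):
    """Range query: number of elements >= x. O(log n)."""
    return len(sorted_vals) - bisect_left(sorted_vals, x)

def decrement_ge(sorted_vals, x):
    """Range update: subtract 1 from every element >= x. O(log n + m).

    With integer values the list stays sorted: everything below x is
    at most x - 1, and everything decremented is at least x - 1.
    """
    start = bisect_left(sorted_vals, x)
    for i in range(start, len(sorted_vals)):
        sorted_vals[i] -= 1
    return len(sorted_vals) - start

vals = [1, 3, 3, 5, 8]
print(count_ge(vals, 3))   # 4
decrement_ge(vals, 3)
print(vals)                # [1, 2, 2, 4, 7]
```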
I doubt there's any point for this small of a collection, but if you might have to deal with a substantially larger collection and the items in the collection are reasonably predictably distributed, it might be worth considering doing an interpolating search instead of a binary search. With predictable distribution this reduces the expected number of comparisons to approximately O(log log n). In this case, that would be roughly 4 (but normally with a higher constant factor). This might be a win for 10^5 items, but then again it might not. If you might have to deal with a collection of (say) 10^8 items or more, it would be much more likely to be a substantial win.
The following may not be optimal, but is the best I could think of tonight.
Let's start by trying to turn the problem sideways. Instead of a map from indices to values, let's consider a map from values to sets of indices. A point update now involves removing an index from one set and adding it to another. A range update involves either simply moving an index set from one value to another or taking the union of two index sets. A range query involves folding over the sets corresponding to the values in range. A quick peek at Wikipedia suggests a traditional disjoint-set data structure is really great for set unions. Unfortunately, it's no good at all for removing an element from a set.
Fortunately, there is a newer data structure supporting union-find with constant time deletion! That takes care of both point updates and range updates quite naturally. Range queries, unfortunately, will require checking all array elements, even if very few elements are in range.

searching in a sorted array with less complexity than binary search

To search a very large array, I was thinking of an algorithm with complexity less than log n, meaning not of a lower order than log n but absolutely fewer than log n steps. So instead of going to the middle, I just move 1 step forward and check how much further we would have to move if the numbers were evenly distributed, then move to that position. If this is the solution, stop; otherwise calculate how much further we have to move, and repeat iteratively until the solution is found.
Here's working Java code:
public class Search {
    public static void main(String[] args) {
        int a[]={12,15,16,17,19,20,26,27};
        int required=27;
        int pointer=0;
        int n=1;
        int diff;
        int count=0;
        int length=a.length;
        while(a[pointer]!=required){
            count++;
            if ((pointer+n)>(length-1))
                n=length-1-pointer;
            if(n==0)
                n=-1;
            diff=a[pointer+n]-a[pointer];
            pointer=pointer+n;
            n=(required-a[pointer])*n/diff;
        }
        System.out.println(pointer);
        System.out.println(count);
    }
}
P.S- I have an array which is near to evenly distributed.
I want to ask: is it really better than binary search? In which cases will it fail? What are the best, average and worst case complexities?
You are using a heuristic to try to accelerate your search. A heuristic is like a guess. It isn't guaranteed to be right, but a good heuristic can accelerate an algorithm in the general case.
Heuristics generally won't improve worst case running time of an algorithm. That is - it is possible for some inputs for the heuristic to be wrong.
I can see the intuitive appeal of what you are doing - you are "searching" closer to where you think your target might be.
But there are two problems with what you are doing:
Moving the "split" in a binary search closer to the target does not speed up the search. In a binary search you split the search space in half each time. When you move the split point closer to the target, you have not found the target, and it is as likely as not that your target is in the larger of the two unequal spaces.
For example, suppose you have the following array. Y is your target, x is all the other values:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxYxx
In a binary search you would split the space in half and then half again in the first two decisions:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxYxx
                ^       ^
After two decisions your 32 value array is down to a search space of 8 values. But suppose with your heuristic, that after the second choice you put the split after the Y?
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxYxx
                ^             ^
After your second decision you have only reduced the search space a little bit. By adding this heuristic you have degraded the worst-case running time to O(N), because it is possible to construct inputs that will fool your heuristic into making the worst guess every time.
The other problem is that heuristic methods to accelerate searches only help when you know something about what you are searching. Take dictionary searching. You know that z is at the end of alphabet. So when you get a word that starts with z, you know fairly well where in the dictionary the z words are. You don't have to start in the middle of the dictionary.
This is because you know something about the distribution of the words in a dictionary. But if someone made no guarantees about the words in a list - then you can't guarantee that dictionary search is faster - you might for example receive a list of all z words.
In your case your heuristic is not particularly good. You're guessing where the next split is based on the distance between the current split and the previous value. The only time that would be a good guess is if the elements in the list were evenly spaced. If they are unevenly spaced (almost always) then some guesses will overshoot the split and others will undershoot.
In any sorted array of unevenly spaced numbers there will necessarily be intervals that are more tightly spaced than average, and intervals more sparse than average. Your heuristic guesses at the average sparseness of the numbers at the current split to the end of the array. There is no relationship between those two things.
Update:
Your best case time: O(1) - e.g. you guess the index right off.
Worst case: O(N) - e.g. every choice is worst possible.
You added that your array is nearly evenly spaced and very large. My guess as to what in practice would be fastest: look up the first number and last number in the array, and the length of the array. Make an educated guess at the offset of your target:
offset = floor((( target - first ) / ( last - first )) * length );
Choose a reasonable search space around the target:
window_start = floor( offset * ( 1 - alpha ));
window_end = floor( offset * ( 1 + alpha ));
Do a binary search on the sub-array defined by this window.
What you set alpha to will depend on how regular you think your array is. E.g. you can set it to 0.05 to search a window which is roughly 10% of the total search space around your estimated target.
If you can make some guarantees about evenness of the input you might be able to tune alpha optimally.
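A small Python sketch of the window idea, using the standard bisect module for the binary search. Two things here are my own assumptions rather than part of the formulas above: the window radius is taken as alpha times the array length (so alpha = 0.05 really does give about 10% of the array), and if the target turns out to lie outside the window, the search widens to the rest of the array on that side.

```python
from bisect import bisect_left

def windowed_search(arr, target, alpha=0.05):
    """Search sorted `arr` for `target` near its interpolated position."""
    n = len(arr)
    if n == 0 or target < arr[0] or target > arr[-1]:
        return -1
    first, last = arr[0], arr[-1]
    # Educated guess at the offset, assuming roughly even spacing.
    offset = 0 if last == first else int((target - first) / (last - first) * (n - 1))
    radius = max(1, int(alpha * n))
    lo = max(0, offset - radius)
    hi = min(n, offset + radius + 1)       # half-open window [lo, hi)
    # If the guess was off, widen the window on the side that missed.
    if arr[lo] > target:
        lo = 0
    if arr[hi - 1] < target:
        hi = n
    i = bisect_left(arr, target, lo, hi)
    return i if i < n and arr[i] == target else -1

a = [12, 15, 16, 17, 19, 20, 26, 27]
print(windowed_search(a, 27))  # 7
```

The fallback keeps the result correct on uneven data; it just loses the speed-up in those cases.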

Find medians in multiple sub ranges of an unordered list

E.g. given an unordered list of N elements, find the medians for sub ranges 0..100, 25..200, 400..1000, 10..500, ...
I don't see any better way than going through each sub range and run the standard median finding algorithms.
A simple example: [5 3 6 2 4]
The median for 0..3 is 5 . (Not 4, since we are asking the median of the first three elements of the original list)
INTEGER ELEMENTS:
If your elements are integers, then the best way is to have a bucket for each number that lies in any of your sub-ranges, where each bucket counts how many times its associated integer occurs in your input elements (for example, bucket[100] stores how many 100s there are in your input sequence). Basically you can achieve it in the following steps:
create buckets for each number that lies in any of your sub-ranges.
iterate through all elements, for each number n, if we have bucket[n], then bucket[n]++.
compute the medians based on the aggregated values stored in your buckets.
Put another way, suppose you have a sub-range [0, 10], and you would like to compute the median. The bucket approach basically computes how many 0s there are in your inputs, how many 1s, and so on. Suppose there are n numbers that lie in the range [0, 10]; then the median is the n/2-th smallest element, which can be identified by finding the i such that bucket[0] + bucket[1] + ... + bucket[i] is greater than or equal to n/2 but bucket[0] + ... + bucket[i - 1] is less than n/2.
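A short sketch of the bucket approach in Python, treating the sub-range as a range of values as this answer does. collections.Counter stands in for the bucket array, and the function name is mine. For an even count this returns the lower of the two middle elements, which is one possible convention.

```python
from collections import Counter

def bucket_median(values, lo, hi):
    """Median of the integer values that fall in [lo, hi], via counting."""
    buckets = Counter(v for v in values if lo <= v <= hi)
    n = sum(buckets.values())
    if n == 0:
        return None
    target = (n + 1) // 2          # rank of the (lower) median
    running = 0
    for v in range(lo, hi + 1):    # walk the buckets in value order
        running += buckets[v]
        if running >= target:
            return v

print(bucket_median([5, 3, 6, 2, 4, 3, 9, 12], 0, 10))  # 4
```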
The nice thing about this is that even if your input elements are stored on multiple machines (i.e., the distributed case), each machine can maintain its own buckets and only the aggregated values need to be passed over the network.
You can also use hierarchical buckets, which involves multiple passes. In each pass, bucket[i] counts the number of input elements that lie in a specific range (for example, [i * 2^K, (i+1) * 2^K]), then you narrow down the problem space by identifying which bucket the median lies in, decrease K by 1 for the next pass, and repeat until you can correctly identify the median.
FLOATING-POINT ELEMENTS
All the elements fit into memory:
If all your elements fit into memory, first sorting the N elements and then finding the medians for each sub-range is the best option. The linear time heap solution also works well in this case if the number of your sub-ranges is less than logN.
The elements do not fit into memory but are stored on a single machine:
Generally, an external sort requires three disk scans. Therefore, if the number of your sub-ranges is greater than or equal to 3, then first sorting the N elements and then finding the medians for each sub-range by loading only the necessary elements from disk is the best choice. Otherwise, simply performing a scan for each sub-range and picking up the elements in that sub-range is better.
The elements are stored on multiple machines:
Since finding the median is a holistic operation, meaning you cannot derive the final median of the entire input from the medians of several parts of the input, it is a hard problem whose solution cannot be described in a few sentences, but there is research (see this as an example) focused on this problem.
I think that as the number of sub ranges increases you will very quickly find that it is quicker to sort and then retrieve the element numbers you want.
In practice, because there will be highly optimized sort routines you can call.
In theory, and perhaps in practice too, because since you are dealing with integers you need not pay n log n for a sort - see http://en.wikipedia.org/wiki/Integer_sorting.
If your data are in fact floating point and not NaNs then a little bit twiddling will in fact allow you to use integer sort on them - from - http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers - The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign and magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bit is different, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal), otherwise, relative order is the same as lexicographical order but inverted for two negative numbers; endianness issues apply.
So you could check for NaNs and other funnies, pretend the floating point numbers are sign + magnitude integers, subtract when negative to correct the ordering for negative numbers, and then treat as normal 2s complement signed integers, sort, and then reverse the process.
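For illustration, here is one common variant of that bit trick in Python, using the standard struct module. It maps each double to an unsigned 64-bit key rather than the signed form described above: negative numbers have all bits flipped, non-negative numbers have the sign bit set. The name float_key is mine, NaNs are assumed to have been filtered out already, and -0.0 sorts just below +0.0 instead of comparing equal.

```python
import struct

def float_key(x):
    """Map a non-NaN double to an unsigned int with the same ordering."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]  # raw IEEE 754 bits
    if bits & (1 << 63):
        return bits ^ 0xFFFFFFFFFFFFFFFF   # negative: flip every bit
    return bits | (1 << 63)                # non-negative: set the top bit

xs = [3.5, -1.0, 0.0, -2.5, 1e-300, -1e300, 2.0]
print(sorted(xs, key=float_key) == sorted(xs))  # True
```

The keys could then be fed to any integer sort, such as a radix sort, and mapped back by reversing the two cases.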
My idea:
Sort the list into an array (using any appropriate sorting algorithm)
For each range, find the indices of the start and end of the range using binary search
Find the median by simply adding their indices and dividing by 2 (i.e. median of range [x,y] is arr[(x+y)/2])
Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine
Time per query: O(log n)
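A sketch of those three steps in Python, with the ranges interpreted as ranges of values, as this answer does. The closure-based make_range_median helper is just one way to package the one-off sort; the names are mine.

```python
from bisect import bisect_left, bisect_right

def make_range_median(values):
    """Sort once, then answer median-of-value-range queries in O(log n)."""
    arr = sorted(values)                   # preprocessing: O(n log n)

    def range_median(lo, hi):
        x = bisect_left(arr, lo)           # first index with arr[x] >= lo
        y = bisect_right(arr, hi) - 1      # last index with arr[y] <= hi
        if x > y:
            return None                    # no elements in [lo, hi]
        return arr[(x + y) // 2]

    return range_median

median_of = make_range_median([5, 3, 6, 2, 4])
print(median_of(3, 6))  # 4
```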
Dynamic list:
The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.
The answer is ultimately going to be "it depends". There are a variety of approaches, any one of which will probably be suitable under most of the cases you may encounter. The problem is that each is going to perform differently for different inputs. Where one may perform better for one class of inputs, another will perform better for a different class of inputs.
As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

Quickly determine if number divides any element in a set

Is there an algorithm that can quickly determine if a number is a factor of a given set of numbers ?
For example, 12 is a factor of [24,33,52] while 5 is not.
Is there a better approach than linear search O(n)? The set will contain a few million elements. I don't need to find the number, just a true or false result.
If a large number of numbers are checked against a constant list one possible approach to speed up the process is to factorize the numbers in the list into their prime factors first. Then put the list members in a dictionary and have the prime factors as the keys. Then when a number (potential factor) comes you first factorize it into its prime factors and then use the constructed dictionary to check whether the number is a factor of the numbers which can be potentially multiples of the given number.
I think in general O(n) search is what you will end up with. However, depending on how large the numbers are in general, you can speed up the search considerably assuming that the set is sorted (you mention that it can be) by observing that if you are searching to find a number divisible by D and you have currently scanned x and x is not divisible by D, the next possible candidate is obviously at floor([x + D] / D) * D. That is, if D = 12 and the list is
5 11 13 19 22 25 27
and you are scanning at 13, the next possible candidate number would be 24. Now depending on the distribution of your input, you can scan forwards using binary search instead of linear search, as you are searching now for the least number not less than 24 in the list, and the list is sorted. If D is large then you might save lots of comparisons in this way.
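The skipping scan could look roughly like this in Python, assuming a sorted list of positive integers and using the standard bisect module to jump to the next candidate. The function name is made up for illustration.

```python
from bisect import bisect_left

def any_multiple(sorted_vals, d):
    """True if some element of sorted_vals (positive ints) is divisible by d."""
    i, n = 0, len(sorted_vals)
    while i < n:
        x = sorted_vals[i]
        if x % d == 0:
            return True
        nxt = (x + d) // d * d             # next multiple of d above x
        # Jump to the least element not less than that multiple.
        i = bisect_left(sorted_vals, nxt, i + 1, n)
    return False

print(any_multiple([5, 11, 13, 19, 22, 25, 27], 12))  # False
print(any_multiple([24, 33, 52], 12))                 # True
```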
However from pure computational complexity point of view, sorting and then searching is going to be O(n log n), whereas just a linear scan is O(n).
For testing many potential factors against a constant set you should realize that if one element of the set is just a multiple of two others, it is irrelevant and can be removed. This approach is a variation of an ancient algorithm known as the Sieve of Eratosthenes. Trading start-up time for run-time when testing a huge number of candidates:
Pick the smallest number >1 in the set
Remove any multiples of that number, except itself, from the set
Repeat 2 for the next smallest number, for a certain number of iterations. The number of iterations will depend on the trade-off with start-up time
You are now left with a much smaller set to exhaustively test against. For this to be efficient you either want a data structure for your set that allows O(1) removal, like a linked-list, or just replace "removed" elements with zero and then copy non-zero elements into a new container.
I'm not sure of the question, so let me ask another: Is 12 a factor of [6,33,52]? It is clear that 12 does not divide 6, 33, or 52. But the factors of 12 are 2*2*3 and the factors of 6, 33 and 52 are 2*2*2*3*3*11*13. All of the factors of 12 are present in the set [6,33,52] in sufficient multiplicity, so you could say that 12 is a factor of [6,33,52].
If you say that 12 is not a factor of [6,33,52], then there is no better solution than testing each number for divisibility by 12; simply perform the division and check the remainder. Thus 6%12=6, 33%12=9, and 52%12=4, so 12 is not a factor of [6,33,52]. But if you say that 12 is a factor of [6,33,52], then to determine if a number f is a factor of a set ns, just multiply the numbers ns together sequentially, after each multiplication take the remainder modulo f, report true immediately if the remainder is ever 0, and report false if you reach the end of the list of numbers ns without a remainder of 0.
Let's take two examples. First, is 12 a factor of [6,33,52]? The first (trivial) multiplication results in 6 and gives a remainder of 6. Now 6*33=198, dividing by 12 gives a remainder of 6, and we continue. Now 6*52=312 and 312/12=26r0, so we have a remainder of 0 and the result is true. Second, is 5 a factor of [24,33,52]? The multiplication chain is 24%5=4, (4*33)%5=2, and (2*52)%5=4, so 5 is not a factor of [24,33,52].
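The running-product test from the two examples fits in a few lines of Python. This is just a sketch of the second interpretation (f divides the product of the set); the function name is mine.

```python
def divides_product(f, ns):
    """True if f divides the product of ns, checked with running remainders."""
    r = 1
    for n in ns:
        r = (r * n) % f      # keep only the remainder so numbers stay small
        if r == 0:
            return True      # the product so far is already a multiple of f
    return False

print(divides_product(12, [6, 33, 52]))   # True
print(divides_product(5, [24, 33, 52]))   # False
```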
A variant of this algorithm was recently used to attack the RSA cryptosystem; you can read about how the attack worked here.
Since the set to be searched is fixed any time spent organising the set for search will be time well spent. If you can get the set in memory, then I expect that a binary tree structure will suit just fine. On average searching for an element in a binary tree is an O(log n) operation.
If you have reason to believe that the numbers in the set are evenly distributed throughout the range [0..10^12] then a binary search of a sorted set in memory ought to perform as well as searching a binary tree. On the other hand, if the middle element in the set (or any subset of the set) is not expected to be close to the middle value in the range encompassed by the set (or subset) then I think the binary tree will have better (practical) performance.
If you can't get the entire set in memory then decomposing it into chunks which will fit into memory and storing those chunks on disk is probably the way to go. You would store the root and upper branches of the set in memory and use them to index onto the disk. The depth of the part of the tree which is kept in memory is something you should decide for yourself, but I'd be surprised if you needed more than the root and 2 levels of branch, giving 8 chunks on disk.
Of course, this only solves part of your problem, finding whether a given number is in the set; you really want to find whether the given number is the factor of any number in the set. As I've suggested in comments I think any approach based on factorising the numbers in the set is hopeless, giving an expected running time beyond polynomial time.
I'd approach this part of the problem the other way round: generate the multiples of the given number and search for each of them. If your set has 10^7 elements then any given number N will have about (10^7)/N multiples in the set. If the given number is drawn at random from the range [0..10^12] the mean value of N is 0.5*10^12, which suggests (counter-intuitively) that in most cases you will only have to search for N itself.
And yes, I am aware that in many cases you would have to search for many more values.
This approach would parallelise relatively easily.
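As a rough sketch of the "generate the multiples and search for each" idea, assuming the set fits in memory as a sorted Python list (the name has_multiple_in_set and its parameters are hypothetical, and the largest element of the set is used as the upper limit for the multiples):

```python
from bisect import bisect_left


def has_multiple_in_set(n, sorted_values):
    """Return True if some multiple of n (n, 2n, 3n, ...) is in sorted_values.

    sorted_values must be sorted in ascending order; each candidate
    multiple is looked up with a binary search.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if not sorted_values:
        return False
    largest = sorted_values[-1]
    multiple = n
    while multiple <= largest:
        i = bisect_left(sorted_values, multiple)
        if i < len(sorted_values) and sorted_values[i] == multiple:
            return True
        multiple += n
    return False


values = [6, 33, 52]
print(has_multiple_in_set(11, values))  # True, because 33 = 3 * 11
print(has_multiple_in_set(5, values))   # False
```

The loop performs about largest/n binary searches, so it is cheap for large n and expensive for small n. Each lookup is independent of the others, which is why the work is easy to split across threads or machines.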
A fast solution which requires some precomputation:
Organize your set in a binary tree with the following rules:
Numbers of the set are on the leaves.
The root of the tree contains r, the smallest prime number that divides some number of the set.
The left subtree corresponds to the subset of multiples of r (divided by r so that r won't be repeated infinitely).
The right subtree corresponds to the subset of numbers that are not multiples of r.
If you want to test if a number N divides some element of the set, compute its prime decomposition and go through the tree until you reach a leaf. If the leaf contains a number then N divides it, else if the leaf is empty then N divides no element in the set.
Simply calculate the product of the set and mod the result with the test factor.
In your example
{24,33,52} P=41184
Tf 12: 41184 mod 12 = 0 True
Tf 5: 41184 mod 5 = 4 False
The set can be broken into chunks if calculating the product would overflow the arithmetic of the calculator, but huge numbers are possible by storing them as strings.

Reducing the Average Number of Comparisons in Selection

The problem here is to reduce the average number of comparisons needed in a selection sort.
I am reading an article on this and here is text snippet:
More generally, a sample S' of s elements is chosen from the n
elements. Let "delta" be some number, which we will choose later so
as to minimize the average number of comparisons used by the
procedure. We find the (v1 = (k * s)/n - delta)th and (v2 = (k * s)/n + delta)th
smallest elements in S'. Almost certainly, the kth smallest
element in S will fall between v1 and v2, so we are left with a
selection problem on (2 * delta) elements. With low probability, the
kth smallest element does not fall in this range, and we have
considerable work to do. However, with a good choice of s and delta,
we can ensure, by the laws of probability, that the second case does
not adversely affect the total work.
I do not follow the above text. Can anyone please explain it to me with examples? How did the author reduce the problem to 2 * delta elements? And how does he know that there is a low probability that the element does not fall into this range?
Thanks!
The basis for the idea is that the normal selection algorithm has linear runtime complexity, but in practical terms is slow. We need to sort all the elements in groups of five, and recursively do even more work. It is O(n), but with too large a constant. The idea, then, is to reduce the number of comparisons in the selection algorithm (not a selection sort necessarily). Intuitively it is the same as in basic statistics; if I take a sample subspace of large enough proportion, it is likely that the distribution of data in the subspace adequately reflects the data in the whole space.
So if I'm looking for the kth number in a set of size one million, I could instead take say 10 000 (already one hundredth the size), which is still large enough to be a good representation of the global distribution, and look for the k/100th number. That's simple scaling. So if the space was 10 and I was looking for the 3rd, that's like looking for the 30th in 100, or the 300th in 1000, etc. Essentially k/S = k'/S' (where we're looking for the kth number in S, and we translate that to the k'th number in S' our subspace) and therefore k' = k*S'/S which should look familiar, since in the text you quoted S' is denoted by s, and S by n, and that's the same fraction quoted.
Now in order to take statistical fluctuations into account, we don't assume that the subspace will be a perfect representation of the data's distribution, so we allow for some fluctuation, namely, delta. We say let's find the k'th-delta and k'th+delta elements in S', and then we can say with great certainty (i.e. high mathematical probability) that the kth value from S is in the interval (k'th-delta, k'th+delta).
To wrap it all up we perform these two selections on S', then partition S accordingly, and now do [normal] selection on the much smaller interval in the partition. This ends up being almost optimal for the elements outside the interval, because we don't do selection on those, only partition them. So the selection process is faster, because we have reduced the problem size from S to S'.
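Here is a minimal Python sketch of the procedure, under some loud assumptions: the names sampled_select, sample_size, delta and rng are mine, sorting is used in place of the real selection subroutine to keep the code short, and the values of sample_size and delta in the usage lines are arbitrary rather than the tuned choices the article refers to.

```python
import random


def sampled_select(values, k, sample_size, delta, rng=random):
    """Return the k-th smallest element (1-based) of the list values.

    A random sample brackets the answer between v1 and v2, and only the
    elements between them are examined in detail. If the bracket misses
    (the low-probability case), fall back to sorting everything.
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(values)")
    s = min(sample_size, n)
    sample = sorted(rng.sample(values, s))

    k_scaled = k * s // n  # k' = k * s / n, the rank scaled to the sample
    lo_rank = min(max(k_scaled - delta, 1), s)
    hi_rank = min(max(k_scaled + delta, 1), s)
    v1 = sample[lo_rank - 1]
    v2 = sample[hi_rank - 1]

    # Partition the full input around v1 and v2.
    below = sum(1 for x in values if x < v1)
    middle = [x for x in values if v1 <= x <= v2]

    if below < k <= below + len(middle):
        # Likely case: the answer lies between v1 and v2.
        return sorted(middle)[k - below - 1]
    # Unlikely case: the bracket missed, so do the full work.
    return sorted(values)[k - 1]


data = list(range(100_000))
random.shuffle(data)
print(sampled_select(data, 25_000, sample_size=1_000, delta=50))  # 24999
```

The result is always correct because of the fallback; sample_size and delta only affect how often the cheap path is taken and how many elements end up in middle.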

Resources