I know how to calculate the total number of combinations of n different objects taken k at a time, with replacement:
(n+k-1)!/k!/(n-1)!
What I need is a formula or algorithm to recover the i-th such combination from an ordered list.
Say I have an ordered list of all combinations of a,b,c taken 3 at a time (so n=3 and k=3):
1 aaa
2 aab
3 aac
4 abb
5 abc
6 acc
7 bbb
8 bbc
9 bcc
10 ccc
How would I calculate the i-th (say, the 7th) combination in this list without first enumerating them all? Enumerating is very inefficient for any but the simplest cases if I am only interested in a few specific combinations. For instance, there are 119,877,472 combinations of 64 items taken 6 at a time.
Needless to say, I need a solution for arbitrary n, k and i.
The reverse function (given the combination, how to calculate its index) would also be interesting.
I found one similar question, but it was about permutations, not combinations:
I want to get a specific combination of permutation?
And there are many ways to list all the combinations, such as mentioned here:
How to generate all permutations and combinations with/without replacement for distinct items and non distinct items (multisets)
But they don't give the functions I need
The algorithm you are interested in is easy to implement. First, it helps to understand why the formula C(k, n + k - 1) = C(n - 1, n + k - 1) = (n + k - 1)! / k! / (n - 1)! works. Here C(r, t) is the number of ways to choose r items out of t. The first equality is the usual symmetry of binomial coefficients: choosing k items out of m is the same as choosing the m - k items to leave out.
Let's say your objects are balls of some color. There are n different colors numbered from 1 to n. You need to calculate the number of ways to have k balls. Imagine k initially white (uncolored) balls that you need to paint in different ways. Arrange the balls in a row. Choose some k1 ≥ 0 balls from the left to paint in color #1, paint the next k2 ≥ 0 balls in color #2, and so on, so that ∑ki = k. A run of k1 balls painted in color #1 is followed by k2 balls of color #2, then by k3 balls of color #3, etc.
We can do the same painting in a slightly different way, however. To separate the balls of color i-1 from the balls of color i we use delimiters. In total we need n - 1 such delimiters placed among the balls. The delimiters are ordered: the one that separates 1-colored and 2-colored balls appears before the one that separates 2-colored and 3-colored balls. If some ki = 0, the corresponding delimiters appear next to each other. We have to arrange the delimiters and balls in some way.
Interestingly, we can now imagine that the n - 1 delimiters and the k balls are just n + k - 1 objects placed in a row. We either choose n - 1 of them to be delimiters or, equivalently, k of them to be balls. That is where the well-known combination formula applies.
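As a quick sanity check of this counting argument, here is a minimal Python sketch that compares the closed form with a brute-force enumeration. The function name count_multisets is only illustrative and does not come from the question or the answer.

```python
from itertools import combinations_with_replacement
from math import comb

def count_multisets(n, k):
    """Number of ways to pick k items from n types with repetition allowed."""
    # k balls and n - 1 delimiters in a row: choose which k slots hold balls.
    return comb(n + k - 1, k)

# Brute-force check for the small example (n = 3 colors, k = 3 balls).
assert count_multisets(3, 3) == len(list(combinations_with_replacement("abc", 3)))
print(count_multisets(3, 3))   # 10
print(count_multisets(64, 6))  # 119877472
```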
Example for your case:
o - ball
. - delimiter
a, b, c - colors
We have:
ooo.. => aaa
oo.o. => aab
oo..o => aac
o.oo. => abb
o.o.o => abc
o..oo => acc
.ooo. => bbb
.oo.o => bbc
.o.oo => bcc
..ooo => ccc
Notice the pattern how delimiters move from right to left.
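The table above can be reproduced with a short Python sketch. The helper names pattern_to_combo and all_patterns are hypothetical; the sketch assumes that enumerating the ball positions in lexicographic order gives the same order as the list above, which holds for this small case.

```python
from itertools import combinations

def pattern_to_combo(pattern, colors):
    """Translate a ball/delimiter pattern such as '.oo.o' into 'bbc'."""
    out, color = [], 0
    for ch in pattern:
        if ch == ".":
            color += 1                 # a delimiter switches to the next color
        else:
            out.append(colors[color])  # a ball takes the current color
    return "".join(out)

def all_patterns(n, k):
    """Yield every pattern of k balls and n - 1 delimiters, delimiters rightmost first."""
    total = n + k - 1
    for balls in combinations(range(total), k):
        chosen = set(balls)
        yield "".join("o" if i in chosen else "." for i in range(total))

for pattern in all_patterns(3, 3):
    print(pattern, "=>", pattern_to_combo(pattern, "abc"))
```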
Algorithm
Now to the question of how to get the p-th arrangement, where p is zero-based. Remember that we have k balls and nd = n - 1 delimiters. We place the delimiters one by one, trying the rightmost position for each first. With the current delimiter at its current position, calculate the number N of ways to place the remaining objects to its right. If p ≥ N, reduce p by N (p <- p - N) and move the current delimiter left by 1. Otherwise (p < N) leave the current delimiter where it is and proceed to the next one, again starting from its rightmost position.
Having "converted" the i-th object into the j-th delimiter, there are N = C(nd - j, nd + k - i) ways to arrange the remaining k - i + j balls and nd - j delimiters.
Since we refer to binomial coefficients often, it pays to precompute them.
The reverse function can be implemented accordingly. You have the position of every delimiter. Accumulate the number of ways to arrange the remaining objects while moving each delimiter to its place from its rightmost position.
Example:
3 balls, 2 delimiters, find the arrangement with p = 7 (zero-based, so the 8th row of the list above: bbc, or .oo.o)
Place the delimiters at the rightmost positions: ooo.. is the starting arrangement. Let the first delimiter be current.
Calculate N = C(1, 1) = 1. p = 7 ≥ N, so reduce p by N, getting p = 6, and move the current delimiter one position left: oo.o. is the new arrangement.
Calculate N = C(1, 2) = 2. p ≥ N, so reduce p by N, getting p = 6 - 2 = 4, and move: o.oo. is the new arrangement.
Calculate N = C(1, 3) = 3. p ≥ N once again, so reduce p, getting p = 4 - 3 = 1, and move: .ooo. is the new arrangement.
Calculate N = C(1, 4) = 4. p < N, so we have found the final position of the first delimiter. Leave it there and take the second delimiter as current.
Calculate N = C(0, 0) = 1. p ≥ N, so p = 1 - 1 = 0, and move: .oo.o is the new arrangement.
Calculate N = C(0, 1) = 1. p < N, so we have found the final position of the second delimiter. The resulting arrangement is .oo.o => bbc.
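The same idea can be sketched in Python. This is not the delimiter-position formulation used above but an equivalent symbol-by-symbol one (an assumption made for brevity): it picks each element of the combination in turn and skips whole blocks of combinations using binomial coefficients. The names unrank and rank are illustrative, and symbols are the integers 0..n-1 rather than letters.

```python
from math import comb

def unrank(n, k, p):
    """Return the p-th (0-based) k-combination with replacement of range(n)."""
    combo, lowest = [], 0            # lowest keeps the sequence non-decreasing
    for remaining in range(k, 0, -1):
        for symbol in range(lowest, n):
            # Ways to fill the other remaining-1 slots with symbols >= symbol.
            count = comb(n - symbol + remaining - 2, remaining - 1)
            if p < count:
                combo.append(symbol)
                lowest = symbol
                break
            p -= count               # skip this whole block of combinations
        else:
            raise IndexError("p is out of range")
    return combo

def rank(n, combo):
    """Inverse of unrank: 0-based index of a non-decreasing combination."""
    p, lowest, k = 0, 0, len(combo)
    for i, chosen in enumerate(combo):
        remaining = k - i
        for symbol in range(lowest, chosen):
            p += comb(n - symbol + remaining - 2, remaining - 1)
        lowest = chosen
    return p

print(unrank(3, 3, 7))       # [1, 1, 2], i.e. bbc
print(rank(3, [1, 1, 2]))    # 7
```

Both functions use O(n + k) binomial evaluations per call, so nothing close to the full list is ever enumerated.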
EDIT #1. Changed the algo description and added example.
Here is the function in R (not optimized, but working):
findcomb <- function(n, k, p) {
  # n = nr of object types (colors, letters etc), assumed >= 2
  # k = number of objects (balls) to select
  # p = 0-based index of target combination
  # return = positions of delimiters at index p
  nd <- n - 1                        # nr of delimiters: nr of colors - 1
  pos <- seq(n + k - nd, n + k - 1)  # original positions of delimiters, all at right
  for (j in seq_len(nd - 1)) {       # seq_len() skips the loop when nd == 1 (1:0 would not)
    s <- 0                           # cumulative nr of accounted-for combinations with this delimiter
    while (TRUE) {
      N <- choose(nd + k - pos[j], nd - j)
      if (s + N <= p) {
        pos[j] <- pos[j] - 1
        s <- s + N
      } else break
    }
    p <- p - s
  }
  # last delimiter:
  pos[nd] <- pos[nd] - p
  pos
}
I'm stuck on this question on lintcode. I've read two past solutions, but neither of them makes sense to me.
The question is as follows:
There is a fence with n posts, each post can be painted with one of the k colors.
You have to paint all the posts such that no more than two adjacent fence posts have the same color.
Return the total number of ways you can paint the fence.
Given n =3, k=2, return 6.
The part I do understand is that if n = 0 (we have 0 posts) or k = 0 (we have 0 paints), we can't paint anything, so return 0.
And if n == 1, the post can be painted in k ways, so return k.
When n is 2, we can paint it in k*1 ways if the adjacent posts are equal and k*(k-1) ways if the adjacent posts are different.
If n == 3 or more: same adjacent colors would be: k * 1 * (k-1) * 1 * (k-1) ...
And different would be: k * (k-1) * (k-1) ...
How do I proceed from here? I've seen one guy create a matrix with [0, k, k*k, 0] and another guy simplify the "different colors" to (k+k*(k-1)) * (k-1), but I don't know how either of them jumps to that step of their conclusion.
edit: One guy's solution is the following:
class Solution:
    # @param {int} n non-negative integer, n posts
    # @param {int} k non-negative integer, k colors
    # @return {int} an integer, the total number of ways
    def numWays(self, n, k):
        # Write your code here
        table = [0, k, k*k, 0]
        if n <= 2:
            return table[n]
        # recurrence equation
        # table[posts] = (color - 1) * (table[posts - 1] + table[posts - 2])
        for i in range(3, n + 1):
            table[3] = (k - 1) * (table[1] + table[2])
            table[1], table[2] = table[2], table[3]
        return table[3]
but I can't understand why he has [0] at the end of his table, or how he set up the recurrence equation.
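The most difficult part of this problem is setting up the recursion. Let L(n, k) be the number of valid colorings of n posts with k colors. A valid coloring can be extended in two ways:
a. Adding two posts of the same color, different from the color of the last existing post. This contributes to L(n+2,k):
(k-1)*L(n,k)
b. Adding one post of a different color than the last existing post. This contributes to L(n+1,k):
(k-1)*L(n,k)
Every valid coloring of n posts ends in exactly one of these two ways, which gives the formula:
L(n,k) = (k-1)*(L(n-1,k)+L(n-2,k))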
Example
For n=3 and k=2, suppose we already know the number of colorings for the first two fence lengths:
n=1 | k = 2
n=2 | k*k = 4
To solve for n=3 we use the previously calculated values for n=1 and n=2:
a. Different colors: adding one post that differs from the trailing one gives sum[2]*(k-1) = 4
b. Same color: stepping back one post and adding two posts of the same color, different from the post at n=1, gives sum[1]*(k-1) = 2
The total is 4 + 2 = 6.
As for the matrix, it's a matter of taste; variables like current and prev would work just as well.
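A minimal Python sketch of the recurrence with two variables instead of a table, checked against a brute-force count. The names num_ways and brute_force are illustrative; the n = 0 and k = 0 conventions follow the question.

```python
from itertools import product

def num_ways(n, k):
    """Colorings of n posts with k colors, no three adjacent posts alike."""
    if n == 0 or k == 0:
        return 0
    if n == 1:
        return k
    prev, current = k, k * k              # L(1, k) and L(2, k)
    for _ in range(3, n + 1):
        prev, current = current, (k - 1) * (prev + current)
    return current

def brute_force(n, k):
    """Count valid colorings by trying all k**n of them (small inputs only)."""
    return sum(
        1
        for colors in product(range(k), repeat=n)
        if all(not (colors[i] == colors[i + 1] == colors[i + 2]) for i in range(n - 2))
    )

print(num_ways(3, 2))  # 6
```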
Following Dominik G's answer, one can give an explicit formula for
L(n,k) because for fixed k it is a
constant recursive sequence
My result is that for k >= 2, if D = sqrt((k+1)^2 - 4) and
u = (k-1+D)/2 and v = (k-1-D)/2 are the two solutions of
the quadratic equation x^2 = (k-1)(x + 1), then one has for n >= 1
L(n, k) = (k/(k-1)) * (u^(n+1) - v^(n+1)) / D
This makes a fast algorithm if one can compute with sufficient
floating point precision.
Hmm, I'm afraid latex formatting doesn't work here, but the formulas are easy to understand
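The closed form can be compared with the recurrence numerically. The sketch below is only a check under the stated assumptions (k >= 2, n >= 1, and n small enough that double precision is exact after rounding); the function names are illustrative.

```python
from math import sqrt

def num_ways_closed_form(n, k):
    """Closed form for L(n, k); relies on floating point, so keep n small."""
    d = sqrt((k + 1) ** 2 - 4)
    u = (k - 1 + d) / 2
    v = (k - 1 - d) / 2
    return round(k / (k - 1) * (u ** (n + 1) - v ** (n + 1)) / d)

def num_ways_recurrence(n, k):
    """L(n, k) = (k-1) * (L(n-1, k) + L(n-2, k)) with L(1) = k, L(2) = k*k."""
    prev, current = k, k * k
    if n == 1:
        return prev
    for _ in range(3, n + 1):
        prev, current = current, (k - 1) * (prev + current)
    return current

print(num_ways_closed_form(3, 2), num_ways_recurrence(3, 2))  # 6 6
```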
You are given a rectangular grid with n rows and m columns. The rows are numbered 1 to n, from bottom to top, and the columns are numbered 1 to m, from left to right.
You are also given k special fields in the form (row, column). For each i, where 0 <= i <= k, count the number of different paths from (1, 1) to (n, m) that contain exactly i special fields.
There is one rule you must follow. You are only allowed to make moves that are straight up or to the right. In other words, from each field (row, column), you can only move to field (row+1, column) or field (row, column+1).
Output an array of k + 1 elements. The i-th element (0-indexed) must be the number of different paths that contain exactly i special fields. Since the answer can be very large, output it modulo 1000007.
Input:
First line contains three space separated integers, n, m and k. Next k lines, each contain two space separated integers, the coordinates of a special field.
Output:
k + 1 space separated integers, the answer to the question.
Constraints:
1 <= n, m, k <= 100
For all coordinates (r, c) - 1 <= r <= n, 1 <= c <= m
All coordinates are valid and different.
This is a simple DP. Let T[i][j][k] be the number of paths from (1, 1) to (i, j) that contain exactly k special fields.
Initialization:
  T[i][0][k] = 0
  T[0][j][k] = 0
  If grid[1][1] is not special:
    T[1][1][k!=0] = 0
    T[1][1][0] = 1
  Otherwise:
    T[1][1][k!=1] = 0
    T[1][1][1] = 1
Bulk (for every field other than (1, 1)):
  If grid[i][j] is not special:
    T[i][j][k] = (T[i-1][j][k] + T[i][j-1][k]) % 1000007
  Otherwise:
    T[i][j][0] = 0
    T[i][j][k>0] = (T[i-1][j][k-1] + T[i][j-1][k-1]) % 1000007
Answer:
  T[n][m][k], for every possible k.
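The recurrences above can be sketched in Python as follows. The function name count_paths and the choice to pass the special fields as a list of (row, column) tuples are assumptions made for this example; input parsing is left out.

```python
MOD = 1000007

def count_paths(n, m, specials):
    """Return a list whose s-th entry counts paths through exactly s special fields."""
    special = set(specials)
    k = len(special)
    # T[r][c][s]: paths from (1, 1) to (r, c) visiting exactly s special fields.
    # Row 0 and column 0 stay all-zero and act as the border.
    T = [[[0] * (k + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            if r == 1 and c == 1:
                T[1][1][1 if (1, 1) in special else 0] = 1
                continue
            shift = 1 if (r, c) in special else 0   # a special field uses up one count
            for s in range(shift, k + 1):
                T[r][c][s] = (T[r - 1][c][s - shift] + T[r][c - 1][s - shift]) % MOD
    return T[n][m]

print(count_paths(3, 3, [(2, 2)]))  # [2, 4]
```

The table has (n+1)(m+1)(k+1) entries, about a million for the stated limits of 100.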
You are given a total of N items and P groups into which you have to divide the N items.
The condition is that the product of the numbers of items held by the groups should be maximal.
Example: for N=10 and P=3 you can divide the 10 items into {3,4,3}, since 3x3x4=36 is the maximum possible product.
You will want to form P groups of roughly N / P elements. However, this will not always be possible, as N might not be divisible by P, as is the case for your example.
So form groups of floor(N / P) elements initially. For your example, you'd form:
floor(10 / 3) = 3
=> groups = {3, 3, 3}
Now, take the remainder of the division of N by P:
10 mod 3 = 1
This means you have to distribute 1 more item to your groups (you can have up to P - 1 items left to distribute in general):
for i = 0 up to (N mod P) - 1:
groups[i]++
=> groups = {4, 3, 3} for your example
Which is also a valid solution.
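The two steps above fit in a few lines of Python. This is a minimal sketch; split_items is a hypothetical name, and math.prod needs Python 3.8 or later.

```python
from math import prod

def split_items(N, P):
    """Split N items into P groups whose sizes differ by at most one."""
    base, extra = divmod(N, P)        # floor(N / P) and N mod P
    return [base + 1] * extra + [base] * (P - extra)

groups = split_items(10, 3)
print(groups, prod(groups))           # [4, 3, 3] 36
```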
For fun I worked out a proof of the fact that in an optimal solution either all numbers equal N/P or the numbers are some combination of floor(N/P) and ceiling(N/P). The proof is somewhat long, but proving optimality in a discrete context is seldom trivial. I would be interested if anybody can shorten the proof.
Lemma: For P = 2 the optimal way to divide N is into {N/2, N/2} if N is even and {floor(N/2), ceiling(N/2)} if N is odd.
This follows since the constraint that the two numbers sum to N means that the two numbers are of the form x, N-x.
The resulting product is (N-x)x = Nx - x^2. This is a parabola that opens down. Its max is at its vertex at x = N/2. If N is even this max is an integer. If N is odd, then x = N/2 is a fraction, but such parabolas are strictly unimodal, so the closer x gets to N/2 the larger the product. x = floor(N/2) (or ceiling, it doesn't matter by symmetry) is the closest an integer can get to N/2, hence {floor(N/2),ceiling(N/2)} is optimal for integers.
General case: First of all, a global max exists since there are only finitely many integer partitions and a finite list of numbers always has a max. Suppose that {x_1, x_2, ..., x_P} is globally optimal. Claim: given any i, j we have
|x_i - x_j| <= 1
In other words: any two numbers in an optimal solution differ by at most 1. This follows immediately from the P = 2 lemma (applied to N = x_i + x_j): if x_i and x_j differed by 2 or more, replacing them with the floor and ceiling of (x_i + x_j)/2 would keep the sum and strictly increase the product, contradicting optimality.
From this claim it follows that there are at most two distinct numbers among the x_i. If there is only 1 number, that number is clearly N/P. If there are two numbers, they are of the form a and a+1. Let k = the number of x_i which equal a+1, hence P-k of the x_i = a. Hence
(P-k)a + k(a+1) = N, where k is an integer with 1 <= k < P
But simple algebra yields that a = (N-k)/P = N/P - k/P.
Hence a is an integer less than N/P which differs from N/P by less than 1 (since k/P < 1).
Thus a = floor(N/P) and a+1 = ceiling(N/P).
QED
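The statement can also be checked by brute force for small inputs. The sketch below is only a numerical check, not part of the proof; it assumes N >= P and groups of at least one item, and all function names are illustrative.

```python
from math import prod

def partitions(total, parts, largest=None):
    """Yield non-increasing tuples of `parts` positive integers summing to `total`."""
    if largest is None:
        largest = total
    if parts == 0:
        if total == 0:
            yield ()
        return
    for first in range(min(total, largest), 0, -1):
        for rest in partitions(total - first, parts - 1, first):
            yield (first,) + rest

def best_product(N, P):
    """Largest product over all ways to split N into P positive parts."""
    return max(prod(p) for p in partitions(N, P))

def balanced_product(N, P):
    """Product of the floor/ceiling split described in the proof."""
    a, k = divmod(N, P)               # k parts of size a + 1, P - k parts of size a
    return (a + 1) ** k * a ** (P - k)

print(best_product(10, 3), balanced_product(10, 3))  # 36 36
```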
My input is three numbers: a number s and the beginning b and end e of a range, with 0 <= s,b,e <= 10^1000. The task is to find the minimal Levenshtein distance between s and all numbers in the range [b, e]. It is not necessary to find the number minimizing the distance; the minimal distance is sufficient.
Obviously I have to read the numbers as strings, because the standard C++ types cannot handle such large numbers. Calculating the Levenshtein distance for every number in the possibly huge range is not feasible.
Any ideas?
[EDIT 10/8/2013: Some cases considered in the DP algorithm actually don't need to be considered after all, though considering them does not lead to incorrectness :)]
In the following I describe an algorithm that takes O(N^2) time, where N is the largest number of digits in any of b, e, or s. Since all these numbers are limited to 1000 digits, this means at most a few million basic operations, which will take milliseconds on any modern CPU.
Suppose s has n digits. In the following, "between" means "inclusive"; I will say "strictly between" if I mean "excluding its endpoints". Indices are 1-based. x[i] means the ith digit of x, so e.g. x[1] is its first digit.
Splitting up the problem
The first thing to do is to break up the problem into a series of subproblems in which b and e have the same number of digits. Suppose e has k >= 0 more digits than b: break up the problem into k+1 subproblems. E.g. if b = 5 and e = 14032, create the following subproblems:
b = 5, e = 9
b = 10, e = 99
b = 100, e = 999
b = 1000, e = 9999
b = 10000, e = 14032
We can solve each of these subproblems, and take the minimum solution.
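The splitting step can be sketched in Python, whose built-in integers handle 1000-digit values directly. The function name split_by_length is hypothetical.

```python
def split_by_length(b, e):
    """Split [b, e] into sub-ranges whose endpoints have the same number of digits."""
    pieces = []
    lo = b
    while lo <= e:
        # Largest number with as many digits as lo, capped at e.
        hi = min(e, 10 ** len(str(lo)) - 1)
        pieces.append((lo, hi))
        lo = hi + 1
    return pieces

print(split_by_length(5, 14032))
# [(5, 9), (10, 99), (100, 999), (1000, 9999), (10000, 14032)]
```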
The easy cases: the middle
The easy cases are the ones in the middle. Whenever e has k >= 1 more digits than b, there will be k-1 subproblems (e.g. 3 above) in which b is a power of 10 and e is the next power of 10, minus 1. Suppose b is 10^m. Notice that choosing any digit between 1 and 9, followed by any m digits between 0 and 9, produces a number x that is in the range b <= x <= e. Furthermore there are no numbers in this range that cannot be produced this way. The minimum Levenshtein distance between s (or in fact any given length-n digit string that doesn't start with a 0) and any number x in the range 10^m <= x <= 10^(m+1)-1 is necessarily abs(m+1-n), since if m+1 >= n it's possible to simply choose the first n digits of x to be the same as those in s, and delete the remainder, and if m+1 < n then choose the first m+1 to be the same as those in s and insert the remainder.
In fact we can deal with all these subproblems in a single constant-time operation: if the smallest "easy" subproblem has b = 10^m and the largest "easy" subproblem has b = 10^u, then the numbers in these ranges have between m+1 and u+1 digits, so the minimum Levenshtein distance between s and any number in any of these ranges is m+1-n if n < m+1, n-(u+1) if n > u+1, and 0 otherwise.
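Expressed in digit counts, the constant-time rule for the easy subproblems looks like the Python sketch below. It assumes the easy ranges contain every number with between shortest and longest digits, and that s has n digits and no leading zero; the function name is illustrative.

```python
def easy_min_distance(n, shortest, longest):
    """Minimum edit distance between an n-digit s and any number whose
    digit count lies between shortest and longest (inclusive)."""
    # Only the length mismatch costs anything: digits can be chosen to match s.
    return max(0, shortest - n, n - longest)

print(easy_min_distance(5, 2, 3))  # 2: s is two digits longer than any candidate
print(easy_min_distance(1, 2, 3))  # 1: s is one digit shorter
print(easy_min_distance(2, 2, 3))  # 0: s itself lies in the range
```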
The hard cases: the end(s)
The hard cases are when b and e are not restricted to have the form b = 10^m and e = 10^(m+1)-1 respectively. Any master problem can generate at most two subproblems like this: either two "ends" (resulting from a master problem in which b and e have different numbers of digits, such as the example at the top) or a single subproblem (i.e. the master problem itself, which didn't need to be subdivided at all because b and e already have the same number of digits). Note that due to the previous splitting of the problem, we can assume that the subproblem's b and e have the same number of digits, which we will call m.
Super-Levenshtein!
What we will do is design a variation of the Levenshtein DP matrix that calculates the minimum Levenshtein distance between a given digit string (s) and any number x in the range b <= x <= e. Despite this added "power", the algorithm will still run in O(n^2) time :)
First, observe that if b and e have the same number of digits and b != e, then it must be the case that they consist of some number q >= 0 of identical digits at the left, followed by a digit that is larger in e than in b. Now consider the following procedure for generating a random digit string x:
1. Set x to the first q digits of b.
2. Append a randomly-chosen digit d between b[i] and e[i] to x.
3. If d == b[i], we "hug" the lower bound:
   For i from q+1 to m:
     If b[i] == 9 then append b[i]. [EDIT 10/8/2013: Actually this can't happen, because we chose q so that e[i] will be larger than b[i], and there is no digit larger than 9!]
     Otherwise, flip a coin:
       Heads: Append b[i].
       Tails: Append a randomly-chosen digit d > b[i], then goto 6.
   Stop.
4. Else if d == e[i], we "hug" the upper bound:
   For i from q+1 to m:
     If e[i] == 0 then append e[i]. [EDIT 10/8/2013: Actually this can't happen, because we chose q so that b[i] will be smaller than e[i], and there is no digit smaller than 0!]
     Otherwise, flip a coin:
       Heads: Append e[i].
       Tails: Append a randomly-chosen digit d < e[i], then goto 6.
   Stop.
5. Otherwise (if d is strictly between b[i] and e[i]), drop through to step 6.
6. Keep appending randomly-chosen digits to x until it has m digits.
The basic idea is that after including all the digits that you must include, you can either "hug" the lower bound's digits for as long as you want, or "hug" the upper bound's digits for as long as you want, and as soon as you decide to stop "hugging", you can thereafter choose any digits you want. For suitable random choices, this procedure will generate all and only the numbers x such that b <= x <= e.
In the "usual" Levenshtein distance computation between two strings s and x, of lengths n and m respectively, we have a rectangular grid from (0, 0) to (n, m), and at each grid point (i, j) we record the Levenshtein distance between the prefix s[1..i] and the prefix x[1..j]. The score at (i, j) is calculated from the scores at (i-1, j), (i, j-1) and (i-1, j-1) using bottom-up dynamic programming. To adapt this to treat x as one of a set of possible strings (specifically, a digit string corresponding to a number between b and e) instead of a particular given string, what we need to do is record not one but two scores for each grid point: one for the case where we assume that the digit at position j was chosen to hug the lower bound, and one where we assume it was chosen to hug the upper bound. The 3rd possibility (step 5 above) doesn't actually require space in the DP matrix because we can work out the minimal Levenshtein distance for the entire rest of the input string immediately, very similar to the way we work it out for the "easy" subproblems in the first section.
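For reference, the "usual" grid computation described here can be sketched in Python as below. This is the standard two-string Levenshtein DP only, not the range-aware variant; the function name is illustrative.

```python
def levenshtein(s, x):
    """Edit distance between strings s and x via bottom-up dynamic programming."""
    n, m = len(s), len(x)
    # dist[i][j] = distance between the prefixes s[:i] and x[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # delete all i characters of s
    for j in range(m + 1):
        dist[0][j] = j                      # insert all j characters of x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                            # skip s[i]
                dist[i][j - 1] + 1,                            # skip x[j]
                dist[i - 1][j - 1] + (s[i - 1] != x[j - 1]),   # match or substitute
            )
    return dist[n][m]

print(levenshtein("12007", "12297"))  # 2
```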
Super-Levenshtein DP recursion
Call the overall minimal score at grid point (i, j) v(i, j). Let diff(a, b) = 1 if characters a and b are different, and 0 otherwise. Let inrange(a, b..c) be 0 if the character a is in the range b..c, and 1 otherwise; like diff, it is a mismatch cost, because a digit of s that lies inside the allowed range can be matched for free. The calculations are:
# The best Lev distance overall between s[1..i] and x[1..j]
v(i, j) = min(hb(i, j), he(i, j))
# The best Lev distance between s[1..i] and x[1..j] obtainable by
# continuing to hug the lower bound
hb(i, j) = min(hb(i-1, j)+1, hb(i, j-1)+1, hb(i-1, j-1)+diff(s[i], b[j]))
# The best Lev distance between s[1..i] and x[1..j] obtainable by
# continuing to hug the upper bound
he(i, j) = min(he(i-1, j)+1, he(i, j-1)+1, he(i-1, j-1)+diff(s[i], e[j]))
At the point in time when v(i, j) is being calculated, we will also calculate the Levenshtein distance resulting from choosing to "stop hugging", i.e. by choosing a digit that is strictly in between b[j] and e[j] (if j == q) or (if j != q) is either above b[j] or below e[j], and thereafter freely choosing digits to make the suffix of x match the suffix of s as closely as possible:
# The best Lev distance possible between the ENTIRE STRINGS s and x, given that
# we choose to stop hugging at the jth digit of x, and have optimally aligned
# the first i digits of s to these j digits
sh(i, j) = if j >= q then shc(i, j)+abs(n-i-m+j)
           else infinity
shc(i, j) = if j == q then
              min(hb(i, j-1)+1, hb(i-1, j-1)+inrange(s[i], (b[j]+1)..(e[j]-1)))
            else
              min(hb(i, j-1)+1, hb(i-1, j-1)+inrange(s[i], (b[j]+1)..9),
                  he(i, j-1)+1, he(i-1, j-1)+inrange(s[i], 0..(e[j]-1)))
The formula for shc(i, j) doesn't need to consider "downward" moves, since such moves don't involve any digit choice for x.
The overall minimal Levenshtein distance is the minimum of v(n, m) and sh(i, j), for all 0 <= i <= n and 0 <= j <= m.
Complexity
Take N to be the largest number of digits in any of s, b or e. The original problem can be split in linear time into at most 1 set of easy problems that collectively takes O(1) time to solve and 2 hard subproblems that each take O(N^2) time to solve using the super-Levenshtein algorithm, so overall the problem can be solved in O(N^2) time, i.e. time proportional to the square of the number of digits.
A first idea to speed up the computation (works if |e-b| is not too large):
Question: how much can the Levenshtein distance change when we compare s with n and then with n+1?
Answer: not too much!
Let's see the dynamic-programming tables for s = 12007 and two consecutive n
n = 12296
0 1 2 3 4 5
1 0 1 2 3 4
2 1 0 1 2 3
3 2 1 1 2 3
4 3 2 2 2 3
5 4 3 3 3 3
and
n = 12297
0 1 2 3 4 5
1 0 1 2 3 4
2 1 0 1 2 3
3 2 1 1 2 3
4 3 2 2 2 3
5 4 3 3 3 2
As you can see, only the last column changes, since n and n+1 have the same digits, except for the last one.
If you have the dynamic-programming table for the edit distance of s = 12007 and n = 12296, you already have most of the table for n = 12297: you just need to update the last column!
Obviously if n = 12299 then n+1 = 12300 and you need to update the last 3 columns of the previous table, but this happens just once every 100 iterations.
In general, you have to
update the last column on every iteration (so, length(s) cells)
update the second-to-last too, once every 10 iterations
update the third-to-last, too, once every 100 iterations
So let L = length(s) and D = e-b. First you compute the edit distance between s and b. Then you can find the minimum Levenshtein distance over [b,e] by looping over every integer in the interval. There are D of them, and each column update touches L cells, so the execution time is about:
L*D*(1 + 1/10 + 1/100 + ...)
Now since
1 + 1/10 + 1/100 + ... = 10/9
we have an algorithm which is
O(L*D)
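A minimal Python sketch of this incremental update follows. It assumes b and e have the same number of digits (so the table width never changes) and stores the table column by column; the function name min_distance_incremental is illustrative.

```python
def min_distance_incremental(s, b, e):
    """Minimum Levenshtein distance between s and any integer in [b, e]."""
    assert len(str(b)) == len(str(e)), "sketch assumes equal digit counts"
    L = len(s)
    digits = str(b)
    width = len(digits)
    # cols[j][i] = distance between s[:i] and the first j digits of the number
    cols = [list(range(L + 1))] + [None] * width

    def fill(start):
        """Recompute columns start..width from the current digits."""
        for j in range(start, width + 1):
            prev, cur = cols[j - 1], [j] + [0] * L
            for i in range(1, L + 1):
                cur[i] = min(prev[i] + 1, cur[i - 1] + 1,
                             prev[i - 1] + (s[i - 1] != digits[j - 1]))
            cols[j] = cur

    fill(1)
    best = cols[width][L]
    for value in range(b + 1, e + 1):
        new_digits = str(value)
        # Columns before the first changed digit are still valid.
        start = next(j for j in range(width) if new_digits[j] != digits[j]) + 1
        digits = new_digits
        fill(start)
        best = min(best, cols[width][L])
    return best

print(min_distance_incremental("12007", 12296, 12297))  # 2
```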
Here is an interesting programming puzzle I came across. Given an array of positive integers and a number K, we need to find pairs (a, b) from the array such that a % b = K.
I have a naive O(n^2) solution that checks all pairs for a % b = K. It works but is inefficient. We can certainly do better than this, can't we? Are there any efficient algorithms for this? Oh, and it's NOT homework.
Sort your array and binary search or keep a hash table with the count of each value in your array.
For a number x, the largest y such that x mod y = K is y = x - K (provided x - K > K). Binary search for this y or look it up in your hash and increment your count accordingly.
Now, this isn't necessarily the only value that will work. For example, 8 mod 6 = 8 mod 3 = 2. We have:
x mod y = K => x = q*y + K =>
=> y = (x - K)/q =>
=> x = 1*(x - K) + K (q = 1) =>
=> x = 2*((x - K)/2) + K (q = 2) =>
=> ...
This means you will have to test all divisors of x - K as well, keeping only those greater than K (a remainder is always smaller than its divisor). You can find the divisors in O(sqrt(x - K)), giving you a total complexity of O(n log n sqrt(max_value)) if using binary search and O(n sqrt(max_value)) with a hash table (recommended especially if your numbers aren't very large).
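A Python sketch of the hash-table variant follows. It counts ordered pairs of array positions (including a position paired with itself, which only matters when K = 0) and treats the case a == K separately, since then every b > K works; both choices are assumptions of this example rather than something the answer spells out. The function name count_pairs is illustrative.

```python
from collections import Counter
from math import isqrt

def count_pairs(values, K):
    """Count ordered pairs (a, b) of positive integers from values with a % b == K."""
    counts = Counter(values)
    total = 0
    for a, ca in counts.items():
        d = a - K
        if d < 0:
            continue
        if d == 0:
            # a == K: a % b == a == K for every b larger than K.
            total += ca * sum(cb for b, cb in counts.items() if b > K)
            continue
        for i in range(1, isqrt(d) + 1):
            if d % i == 0:
                for b in {i, d // i}:      # both divisors of the pair, deduplicated
                    if b > K:              # the remainder must be smaller than b
                        total += ca * counts.get(b, 0)
    return total

print(count_pairs([8, 6, 3, 2, 5], 2))  # 7
```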
Treat the problem as having two separate arrays as input: one for the a numbers and one for the b numbers in a % b = K. I am going to assume that everything is >= 0.
First of all, you can discard any b <= K.
Now think of every number in b as generating a sequence K, K + b, K + 2b, K + 3b... You can record this using a pair of numbers (pos, b), where pos is incremented by b at each stage. Start with pos = K.
Hold these sequences in a priority queue, so you can find the smallest pos value at any given time. Sort the array of a numbers - in fact you could do this ahead of time and discard any duplicates.
For each a number
  While the smallest pos in the priority queue is <= a
    Add the smallest multiple of b to it to make it >= a
    If it is == a, you have a match
    Update the stored value of pos for that sequence, re-ordering the priority queue
At worst, you end up comparing every number with every other number, which is the same as the simple solution, but with priority queue and sorting overhead. However, large values of b may remain unexamined in the priority queue while several a numbers pass through, in which case this does better - and if there are a lot of numbers to process and they are all different, some of them must be large.
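The priority-queue idea can be sketched with Python's heapq module. The loop is restructured slightly compared with the outline above (sequences below a are advanced first, then all sequences sitting exactly at a are reported), which is an implementation choice of this example; the function name is illustrative.

```python
import heapq

def matches_with_heap(a_values, b_values, K):
    """Return all distinct pairs (a, b) with a % b == K, using one sequence per b."""
    a_sorted = sorted(set(a_values))
    # Each b > K generates K, K + b, K + 2b, ...; store (pos, b) with pos = K.
    heap = [(K, b) for b in set(b_values) if b > K]
    heapq.heapify(heap)
    found = []
    for a in a_sorted:
        # Advance every sequence that is still below a.
        while heap and heap[0][0] < a:
            pos, b = heapq.heappop(heap)
            steps = -(-(a - pos) // b)              # ceiling division
            heapq.heappush(heap, (pos + steps * b, b))
        # Every sequence sitting exactly at a is a match.
        advanced = []
        while heap and heap[0][0] == a:
            pos, b = heapq.heappop(heap)
            found.append((a, b))
            advanced.append((pos + b, b))
        for item in advanced:
            heapq.heappush(heap, item)
    return found

print(sorted(matches_with_heap([8, 6, 3, 2, 5], [8, 6, 3, 2, 5], 2)))
```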
This answer mentions the main points of an algorithm (called DL because it uses “divisor lists” ) and gives details via a program, called amodb.py.
Let B be the input array, containing N positive integers. Without much loss of generality, suppose B[i] > K for all i and that B is in ascending order. (Note that x%B[i] < K if B[i] < K; and where B[i] = K, one can report pairs (B[i], B[j]) for all j>i. If B is not sorted initially, charge a cost of O(N log N) to sort it.)
In algorithm DL and program amodb.py, A is an array with K pre-subtracted from the input array elements, i.e., A[i] = B[i] - K. Note that if a%b == K, then for some j we have a = b*j + K or a-K = b*j. That is, a%b == K iff a-K is a multiple of b (given b > K, as assumed above). Moreover, if a-K = b*j and p is any factor of b, then p is a factor of a-K.
Let the prime numbers from 2 to 97 be called “small factors”. When N numbers are uniformly randomly selected from some interval [X,Y], on the order of N/ln(Y) of the numbers will have no small factors; a similar number will have a greatest small factor of 2; and declining proportions will have successively larger greatest small factors. For example, on the average about N/97 will be divisible by 97, about N/89-N/(89*97) by 89 but not 97, etc. Generally, when members of B are random, lists of members with certain greatest small factors or with no small factors are sub-O(N/ln(Y)) in length.
Given a list Bd containing members of B divisible by largest small factor p, DL tests each element of Bd against elements of list Ad, those elements of A divisible by p. But given a list Bp for elements of B without small factors, DL tests each of Bp's elements against all elements of A. Example: If N=25, p=13, Bd=[18967, 23231], and Ad=[12779, 162383], then DL tests if any of 12779%18967, 162383%18967, 12779%23231, 162383%23231 are zero. Note that it is possible to cut the number of tests in half in this example (and many others) by noticing 12779<18967, but amodb.py does not include that optimization.
DL makes J different lists for J different factors; in one version of amodb.py, J=25 and the factor set is primes less than 100. A larger value of J would increase the O(N*J) time to initialize divisor lists, but would slightly decrease the O(N*len(Bp)) time to process list Bp against elements of A. See results below. Time to process other lists is O((N/logY)*(N/logY)*J), which is in sharp contrast to the O(n*sqrt(Y)) complexity for a previous answer's method.
Shown next is output from two program runs. In each set, the first Found line is from a naïve O(N*N) test, and the second is from DL. (Note, both DL and the naïve method would run faster if too-small A values were progressively removed.) The time ratio in the last line of the first test shows a disappointingly low speedup ratio of 3.9 for DL vs naïve method. For that run, factors included only the 25 primes less than 100. For the second run, with better speedup of ~ 4.4, factors included numbers 2 through 13 and primes up to 100.
$ python amodb.py
N: 10000 K: 59685 X: 100000 Y: 1000000
Found 208 matches in 21.854 seconds
Found 208 matches in 5.598 seconds
21.854 / 5.598 = 3.904
$ python amodb.py
N: 10000 K: 97881 X: 100000 Y: 1000000
Found 207 matches in 21.234 seconds
Found 207 matches in 4.851 seconds
21.234 / 4.851 = 4.377
Program amodb.py:
import random, time
factors = [2,3,4,5,6,7,8,9,10,11,12,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97]
X, N = 100000, 10000
Y, K = 10*X, random.randint(X//2, X)
print("N: ", N, " K: ", K, "X: ", X, " Y: ", Y)
B = sorted([random.randint(X,Y) for i in range(N)])
NP = len(factors); NP1 = NP+1
A, Az, Bz = [], [[] for i in range(NP1)], [[] for i in range(NP1)]
t0 = time.time()
for b in B:
    a, aj, bj = b-K, -1, -1
    A.append(a)                  # Add a to A
    for j,p in enumerate(factors):
        if a % p == 0:
            aj = j
            Az[aj].append(a)     # a goes into the list of every factor that divides it
        if b % p == 0:
            bj = j
    Bz[bj].append(b)             # b goes into the list of its last dividing factor (or the spare list)
Bp = Bz.pop()                    # Get not-factored B-values list into Bp
di = time.time() - t0; t0 = time.time()
c = 0
for a in A:                      # naive O(N*N) test
    for b in B:
        if a%b == 0:
            c += 1
dq = round(time.time() - t0, 3); t0 = time.time()
cq, c = c, 0                     # keep the naive count for the report
for i,Bd in enumerate(Bz):       # DL: test each divisor list against its matching A list
    Ad = Az[i]
    for b in Bd:
        for ak in Ad:
            if ak % b == 0:
                c += 1
for b in Bp:                     # B values without listed factors are tested against all of A
    for ak in A:
        if ak % b == 0:
            c += 1
dr = round(di + time.time() - t0, 3)
print("Found", cq, " matches in", dq, "seconds")
print("Found", c, " matches in", dr, "seconds")
print(dq, "/", dr, "=", round(dq/dr, 3))