Create a 'perfect hash function' for contiguous ranges - algorithm

I'm looking for a way to 'project' a series of 'ranges' into a series of values. Applications would be histograms with uneven bins or creation of lookup tables.
An example:
0 to 14 => 0
15 to 234 => 1
235 => 2
236 to 255 => 3
The actual order of 'result' values (0, 1, 2, 3) on the right don't really matter as long as they are between 0 and 3, so that I can use a small lookup table after. It would be even better if I could make this work on floating point values (on the left).
I know I could use an 8-bit lookup table here and repeat values but I'd like to find a way to 'perfect hash' this: through a series of systematic operations (as small as possible, no branches) compute the right from the left values, to have the smallest possible result space.
I can't seem to find the correct series of magic Google incantations for this kind of algorithm.
The computing duration of this 'hash' or 'projection' function can be in the days if necessary.

If you have n ranges without overlap or gaps you can generate some simple code using O(log2(n)) instructions to to this lookup. I will demonstrate with some Python code.
# Must be a power of two, extend with zeros on the right if needed.
lookup = [0, 15, 235, 236]
def index(x):
i = 0
# ceil(log2(len(lookup))) iterations in the following pattern.
# We only need 2 iterations here.
# ...
# i += (x >= lookup[i+8]) << 4
# i += (x >= lookup[i+4]) << 3
i += (x >= lookup[i+2]) << 2
i += (x >= lookup[i+1]) << 1
return i
This is known as a branchless binary search. E.g. even if you have 216 ranges, this would only require 32 additions and 16 table lookups, comparisons, and bitshifts to compute the exact index.

Related

Quick way to compute n-th sequence of bits of size b with k bits set?

I want to develop a way to be able to represent all combinations of b bits with k bits set (equal to 1). It needs to be a way that given an index, can get quickly the binary sequence related, and the other way around too. For instance, the tradicional approach which I thought would be to generate the numbers in order, like:
For b=4 and k=2:
0- 0011
1- 0101
2- 0110
3- 1001
4-1010
5-1100
If I am given the sequence '1010', I want to be able to quickly generate the number 4 as a response, and if I give the number 4, I want to be able to quickly generate the sequence '1010'. However I can't figure out a way to do these things without having to generate all the sequences that come before (or after).
It is not necessary to generate the sequences in that order, you could do 0-1001, 1-0110, 2-0011 and so on, but there has to be no repetition between 0 and the (combination of b choose k) - 1 and all sequences have to be represented.
How would you approach this? Is there a better algorithm than the one I'm using?
pkpnd's suggestion is on the right track, essentially process one digit at a time and if it's a 1, count the number of options that exist below it via standard combinatorics.
nCr() can be replaced by a table precomputation requiring O(n^2) storage/time. There may be another property you can exploit to reduce the number of nCr's you need to store by leveraging the absorption property along with the standard recursive formula.
Even with 1000's of bits, that table shouldn't be intractably large. Storing the answer also shouldn't be too bad, as 2^1000 is ~300 digits. If you meant hundreds of thousands, then that would be a different question. :)
import math
def nCr(n,r):
return math.factorial(n) // math.factorial(r) // math.factorial(n-r)
def get_index(value):
b = len(value)
k = sum(c == '1' for c in value)
count = 0
for digit in value:
b -= 1
if digit == '1':
if b >= k:
count += nCr(b, k)
k -= 1
return count
print(get_index('0011')) # 0
print(get_index('0101')) # 1
print(get_index('0110')) # 2
print(get_index('1001')) # 3
print(get_index('1010')) # 4
print(get_index('1100')) # 5
Nice question, btw.

Random Numbers based on the ANU Quantum Random Numbers Server

I have been asked to use the ANU Quantum Random Numbers Service to create random numbers and use Random.rand only as a fallback.
module QRandom
def next
RestClient.get('http://qrng.anu.edu.au/API/jsonI.php?type=uint16&length=1'){ |response, request, result, &block|
case response.code
when 200
_json=JSON.parse(response)
if _json["success"]==true && _json["data"]
_json["data"].first || Random.rand(65535)
else
Random.rand(65535) #fallback
end
else
puts response #log problem
Random.rand(65535) #fallback
end
}
end
end
Their API service gives me a number between 0-65535. In order to create a random for a bigger set, like a random number between 0-99999, I have to do the following:
(QRandom.next.to_f*(99999.to_f/65535)).round
This strikes me as the wrong way of doing, since if I were to use a service (quantum or not) that creates numbers from 0-3 and transpose them into space of 0-9999 I have a choice of 4 numbers that I always get. How can I use the service that produces numbers between 0-65535 to create random numbers for a larger number set?
Since 65535 is 1111111111111111 in binary, you can just think of the random number server as a source of random bits. The fact that it gives the bits to you in chunks of 16 is not important, since you can make multiple requests and you can also ignore certain bits from the response.
So after performing that abstraction, what we have now is a service that gives you a random bit (0 or 1) whenever you want it.
Figure out how many bits of randomness you need. Since you want a number between 0 and 99999, you just need to find a binary number that is all ones and is greater than or equal to 99999. Decimal 99999 is equal to binary 11000011010011111, which is 17 bits long, so you will need 17 bits of randomness.
Now get 17 bits of randomness from the service and assemble them into a binary number. The number will be between 0 and 2**17-1 (131071), and it will be evenly distributed. If the random number happens to be greater than 99999, then throw away the bits you have and try again. (The probability of needing to retry should be less than 50%.)
Eventually you will get a number between 0 and 99999, and this algorithm should give you a totally uniform distribution.
How about asking for more numbers? Using the length parameter of that API you can just ask for extra numbers and sum them so you get bigger numbers like you want.
http://qrng.anu.edu.au/API/jsonI.php?type=uint16&length=2
You can use inject for the sum and the modulo operation to make sure the number is not bigger than you want.
json["data"].inject(:+) % MAX_NUMBER
I made some other changes to your code like using SecureRandom instead of the regular Random. You can find the code here:
https://gist.github.com/matugm/bee45bfe637f0abf8f29#file-qrandom-rb
Think of the individual numbers you are getting as 16 bits of randomness. To make larger random numbers, you just need more bits. The tricky bit is figuring out how many bits is enough. For example, if you wanted to generate numbers from an absolutely fair distribution from 0 to 65000, then it should be pretty obvious that 16 bits are not enough; even though you have the range covered, some numbers will have twice the probability of being selected than others.
There are a couple of ways around this problem. Using Ruby's Bignum (technically that happens behind the scenes, it works well in Ruby because you won't overflow your Integer type) it is possible to use a method that simply collects more bits until the result of a division could never be ambiguous - i.e. the difference when adding more significant bits to the division you are doing could never change the result.
This what it might look like, using your QRandom.next method to fetch bits in batches of 16:
def QRandom.rand max
max = max.to_i # This approach requires integers
power = 1
sum = 0
loop do
sum = 2**16 * sum + QRandom.next
power *= 2**16
lower_bound = sum * max / power
break lower_bound if lower_bound == ( (sum + 1) * max ) / power
end
end
Because it costs you quite a bit to fetch random bits from your chosen source, you may benefit from taking this to the most efficient form possible, which is similar in principle to Arithmetic Coding and squeezes out the maximum possible entropy from your source whilst generating unbiased numbers in 0...max. You would need to implement a method QRandom.next_bits( num ) that returned an integer constructed from a bitstream buffer originating with your 16-bit numbers:
def QRandom.rand max
max = max.to_i # This approach requires integers
# I prefer this: start_bits = Math.log2( max ).floor
# But this also works (and avoids suggestions the algo uses FP):
start_bits = max.to_s(2).length
sum = QRandom.next_bits( start_bits )
power = 2 ** start_bits
# No need for fractional bits if max is power of 2
return sum if power == max
# Draw 1 bit at a time to resolve fractional powers of 2
loop do
lower_bound = (sum * max) / power
break lower_bound if lower_bound == ((sum + 1) * max)/ power
sum = 2 * sum + QRandom.next_bits(1) # 0 or 1
power *= 2
end
end
This is the most efficient use of bits from your source possible. It is always as efficient or better than re-try schemes. The expected number of bits used per call to QRandom.rand( max ) is 1 + Math.log2( max ) - i.e. on average this allows you to draw just over the fractional number of bits needed to represent your range.

Picking a random option, where each option has a different probability of being picked

Suppose that you are given three "options", A, B and C.
Your algorithm must pick and return a random one. For this, it is pretty simple to just put them in an array {A,B,C} and generate a random number (0, 1 or 2) which will be the index of the element in the array to be returned.
Now, there is a variation to this algorithm: Suppose that A has a 40% chance of being picked, B a 20%, and C a 40%. If that was the case, you could have a similar approach: generate an array {A,A,B,C,C} and have a random number (0, 1, 2, 3, 4) to pick the element to be returned.
That works. However, I feel that it is very inefficient. Imagine using this algorithm for a large amount of options. You would be creating a somewhat big array, maybe with 100 elements representing a 1% each. Now, that's still not quite big, but supposing that your algorithm is used many times per second, this could be troublesome.
I've considered making a class called Slot, which has two properties: .value and .size. One slot is created for each option, where the .value property is the value of the option, and the .size one is the equivalent to the amount of occurrences of such option in the array. Then generate a random number from 0 to the total amount of occurrences and check on what slot did the number fall on.
I'm more concerned about the algorithm, but here is my Ruby attempt on this:
class Slot
attr_accessor :value
attr_accessor :size
def initialize(value,size)
#value = value
#size = size
end
end
def picker(options)
slots = []
totalSize = 0
options.each do |value,size|
slots << Slot.new(value,size)
totalSize += size
end
pick = rand(totalSize)
currentStack = 0
slots.each do |slot|
if (pick <= currentStack + slot.size)
return slot.value
else
currentStack += slot.size
end
end
return nil
end
50.times do
print picker({"A" => 40, "B" => 20, "C" => 40})
end
Which outputs:
CCCCACCCCAAACABAAACACACCCAABACABABACBAAACACCBACAAB
Is there a more efficient way to implement an algorithm that picks a random option, where each option has a different probability of being picked?
The simplest way is probably to write a case statement:
def get_random()
case rand(100) + 1
when 1..50 then 'A'
when 50..75 then 'B'
when 75..100 then 'C'
end
end
The problem with that is that you cannot pass any options, so you can write a function like this if you want it to be able to take options. The one below is very much like the one you wrote, but a bit shorter:
def picker(options)
current, max = 0, options.values.inject(:+)
random_value = rand(max) + 1
options.each do |key,val|
current += val
return key if random_value <= current
end
end
# A with 25% prob, B with 75%.
50.times do
print picker({"A" => 1, "B" => 3})
end
# => BBBBBBBBBABBABABBBBBBBBABBBBABBBBBABBBBBBABBBBBBBA
# If you add upp to 100, the number represent percentage.
50.times do
print picker({"A" => 40, "T" => 30, "C" => 20, "G" => 10})
end
# => GAAAATATTGTACCTCAATCCAGATACCTTAAGACCATTAAATCTTTACT
As a first approximation to a more efficient algorithm, if you compute the cumulative distribution function (which is just one pass over the distribution function, computing a running sum), then you can find the position of the randomly chosen integer using a binary search instead of a linear search. This will help if you have a lot of options, since it reduces the search time from O(#options) to O(log #options).
There is an O(1) solution, though. Here's the basic outline.
Let's say we have N options, 1...N, with weights ω1...ωN, where all of the ω values are at least 0. For simplicity, we scale the weights so their mean is 1, or in other words, their sum is N. (We just multiply them by N/Σω. We don't actually have to do this, but it makes the next couple of paragraphs easier to type without MathJax.)
Now, create a vector of N elements, where each element has a two option identifiers (lo and hi) and a cutoff p. The option identifiers are just integers 1...N, and p will be computed as a real number in the range (0, 1.0) inclusive.
We proceed to fill in the vector as follows. For each element i in turn:
If some ωj is exactly 1.0, then we set:
loi = j
hii = j
pi = 1.0
And we remove ωj from the list of weights.
Otherwise, there must be some ωj < 1.0 and some ωk > 1.0. (That's because the average weight is 1.0, and none of them have the average value. Some some of them must have less and some of them more, because it is impossible for all elements to be greater than the average or all elements to be less than the average.) Now, we set:
loi = j
hii = k
pi = ωj
ωk = ωk - (1 - ωj)
And once again, we remove ωj from the weights.
Note that in both cases, we have removed one weight, and we have reduced the sum of the weights by 1.0. So the average weight is still 1.0.
We continue in this fashion until the entire vector is filled. (The last element will have p = 1.0).
Given this vector, we can select a weighted random option as follows:
Generate a random integer i in the range 1...N and a random floating point value r in the range (0, 1.0]. If r < pi then we select option loi; otherwise, we select option hii.
It should be clear why this works from the construction of the vector. The weights of each above-average-weight option are distributed amongst the various vector elements, while each below-average-weight option is assigned to one part of some vector element with a corresponding probability of selection.
In a real implementation, we would map the range of weights onto integer values, and make the total weights close to the maximum integer (it has to be a multiple of N, so there will be some slosh.) We can then select a slot and select the weight inside the slot from a single random integer. In fact, we can modify the algorithm to avoid the division by forcing the number of slots to be a power of 2 by adding some 0-weighted options. Because the integer arithmetic will not work out perfectly, a bit of fiddling around will be necessary, but the end result can be made to be statistically correct, modulo the characteristics of the PRNG being used, and it will execute almost as fast as a simple unweighted selection of N options (one shift and a couple of comparisons extra), at the cost of a vector occupying less than 6N storage elements (counting the possibility of having to almost double the number of slots).
While this is not a direct answer I will show you a source for help you outline this problem: http://www.av8n.com/physics/arbitrary-probability.htm.
Edit:
Just found a nice source in ruby for that, pickup gem.
require 'pickup'
headings = {
A: 40,
B: 20,
C: 40,
}
pickup = Pickup.new(headings)
pickup.pick
#=> A
pickup.pick
#=> B
pickup.pick
#=> A
pickup.pick
#=> C
pickup.pick
#=> C

Compute rank of a combination?

I want to pre-compute some values for each combination in a set of combinations. For example, when choosing 3 numbers from 0 to 12, I'll compute some value for each one:
>>> for n in choose(range(13), 3):
print n, foo(n)
(0, 1, 2) 78
(0, 1, 3) 4
(0, 1, 4) 64
(0, 1, 5) 33
(0, 1, 6) 20
(0, 1, 7) 64
(0, 1, 8) 13
(0, 1, 9) 24
(0, 1, 10) 85
(0, 1, 11) 13
etc...
I want to store these values in an array so that given the combination, I can compute its and get the value. For example:
>>> a = [78, 4, 64, 33]
>>> a[magic((0,1,2))]
78
What would magic be?
Initially I thought to just store it as a 3-d matrix of size 13 x 13 x 13, so I can easily index it that way. While this is fine for 13 choose 3, this would have way too much overhead for something like 13 choose 7.
I don't want to use a dict because eventually this code will be in C, and an array would be much more efficient anyway.
UPDATE: I also have a similar problem, but using combinations with repetitions, so any answers on how to get the rank of those would be much appreciated =).
UPDATE: To make it clear, I'm trying to conserve space. Each of these combinations actually indexes into something take up a lot of space, let's say 2 kilobytes. If I were to use a 13x13x13 array, that would be 4 megabytes, of which I only need 572 kilobytes using (13 choose 3) spots.
Here is a conceptual answer and a code based on how lex ordering works. (So I guess my answer is like that of "moron", except that I think that he has too few details and his links have too many.) I wrote a function unchoose(n,S) for you that works assuming that S is an ordered list subset of range(n). The idea: Either S contains 0 or it does not. If it does, remove 0 and compute the index for the remaining subset. If it does not, then it comes after the binomial(n-1,k-1) subsets that do contain 0.
def binomial(n,k):
if n < 0 or k < 0 or k > n: return 0
b = 1
for i in xrange(k): b = b*(n-i)/(i+1)
return b
def unchoose(n,S):
k = len(S)
if k == 0 or k == n: return 0
j = S[0]
if k == 1: return j
S = [x-1 for x in S]
if not j: return unchoose(n-1,S[1:])
return binomial(n-1,k-1)+unchoose(n-1,S)
def choose(X,k):
n = len(X)
if k < 0 or k > n: return []
if not k: return [[]]
if k == n: return [X]
return [X[:1] + S for S in choose(X[1:],k-1)] + choose(X[1:],k)
(n,k) = (13,3)
for S in choose(range(n),k): print unchoose(n,S),S
Now, it is also true that you can cache or hash values of both functions, binomial and unchoose. And what's nice about this is that you can compromise between precomputing everything and precomputing nothing. For instance you can precompute only for len(S) <= 3.
You can also optimize unchoose so that it adds the binomial coefficients with a loop if S[0] > 0, instead of decrementing and using tail recursion.
You can try using the lexicographic index of the combination. Maybe this page will help: http://saliu.com/bbs/messages/348.html
This MSDN page has more details: Generating the mth Lexicographical Element of a Mathematical Combination.
NOTE: The MSDN page has been retired. If you download the documentation at the above link, you will find the article on page 10201 of the pdf that is downloaded.
To be a bit more specific:
When treated as a tuple, you can order the combinations lexicographically.
So (0,1,2) < (0,1,3) < (0,1,4) etc.
Say you had the number 0 to n-1 and chose k out of those.
Now if the first element is zero, you know that it is one among the first n-1 choose k-1.
If the first element is 1, then it is one among the next n-2 choose k-1.
This way you can recursively compute the exact position of the given combination in the lexicographic ordering and use that to map it to your number.
This works in reverse too and the MSDN page explains how to do that.
Use a hash table to store the results. A decent hash function could be something like:
h(x) = (x1*p^(k - 1) + x2*p^(k - 2) + ... + xk*p^0) % pp
Where x1 ... xk are the numbers in your combination (for example (0, 1, 2) has x1 = 0, x2 = 1, x3 = 2) and p and pp are primes.
So you would store Hash[h(0, 1, 2)] = 78 and then you would retrieve it the same way.
Note: the hash table is just an array of size pp, not a dict.
I would suggest a specialised hash table. The hash for a combination should be the exclusive-or of the hashes for the values. Hashes for values are basically random bit-patterns.
You could code the table to cope with collisions, but it should be fairly easy to derive a minimal perfect hash scheme - one where no two three-item combinations give the same hash value, and where the hash-size and table-size are kept to a minimum.
This is basically Zobrist hashing - think of a "move" as adding or removing one item of the combination.
EDIT
The reason to use a hash table is that the lookup performance O(n) where n is the number of items in the combination (assuming no collisions). Calculating lexicographical indexes into the combinations is significantly slower, IIRC.
The downside is obviously the up-front work done to generate the table.
For now, I've reached a compromise: I have a 13x13x13 array which just maps to the index of the combination, taking up 13x13x13x2 bytes = 4 kilobytes (using short ints), plus the normal-sized (13 choose 3) * 2 kilobytes = 572 kilobytes, for a total of 576 kilobytes. Much better than 4 megabytes, and also faster than a rank calculation!
I did this partly cause I couldn't seem to get Moron's answer to work. Also this is more extensible - I have a case where I need combinations with repetitions, and I haven't found a way to compute the rank of those, yet.
What you want are called combinadics. Here's my implementation of this concept, in Python:
def nthresh(k, idx):
"""Finds the largest value m such that C(m, k) <= idx."""
mk = k
while ncombs(mk, k) <= idx:
mk += 1
return mk - 1
def idx_to_set(k, idx):
ret = []
for i in range(k, 0, -1):
element = nthresh(i, idx)
ret.append(element)
idx -= ncombs(element, i)
return ret
def set_to_idx(input):
ret = 0
for k, ck in enumerate(sorted(input)):
ret += ncombs(ck, k + 1)
return ret
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration and it does not use very much memory. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to C++.

How can I randomly iterate through a large Range?

I would like to randomly iterate through a range. Each value will be visited only once and all values will eventually be visited. For example:
class Array
def shuffle
ret = dup
j = length
i = 0
while j > 1
r = i + rand(j)
ret[i], ret[r] = ret[r], ret[i]
i += 1
j -= 1
end
ret
end
end
(0..9).to_a.shuffle.each{|x| f(x)}
where f(x) is some function that operates on each value. A Fisher-Yates shuffle is used to efficiently provide random ordering.
My problem is that shuffle needs to operate on an array, which is not cool because I am working with astronomically large numbers. Ruby will quickly consume a large amount of RAM trying to create a monstrous array. Imagine replacing (0..9) with (0..99**99). This is also why the following code will not work:
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
x = rand(bigint)
redo if tried[x]
tried[x] = true
f(x) # some function
}
This code is very naive and quickly runs out of memory as tried obtains more entries.
What sort of algorithm can accomplish what I am trying to do?
[Edit1]: Why do I want to do this? I'm trying to exhaust the search space of a hash algorithm for a N-length input string looking for partial collisions. Each number I generate is equivalent to a unique input string, entropy and all. Basically, I'm "counting" using a custom alphabet.
[Edit2]: This means that f(x) in the above examples is a method that generates a hash and compares it to a constant, target hash for partial collisions. I do not need to store the value of x after I call f(x) so memory should remain constant over time.
[Edit3/4/5/6]: Further clarification/fixes.
[Solution]: The following code is based on #bta's solution. For the sake of conciseness, next_prime is not shown. It produces acceptable randomness and only visits each number once. See the actual post for more details.
N = size_of_range
Q = ( 2 * N / (1 + Math.sqrt(5)) ).to_i.next_prime
START = rand(N)
x = START
nil until f( x = (x + Q) % N ) == START # assuming f(x) returns x
I just remembered a similar problem from a class I took years ago; that is, iterating (relatively) randomly through a set (completely exhausting it) given extremely tight memory constraints. If I'm remembering this correctly, our solution algorithm was something like this:
Define the range to be from 0 to
some number N
Generate a random starting point x[0] inside N
Generate an iterator Q less than N
Generate successive points x[n] by adding Q to
the previous point and wrapping around if needed. That
is, x[n+1] = (x[n] + Q) % N
Repeat until you generate a new point equal to the starting point.
The trick is to find an iterator that will let you traverse the entire range without generating the same value twice. If I'm remembering correctly, any relatively prime N and Q will work (the closer the number to the bounds of the range the less 'random' the input). In that case, a prime number that is not a factor of N should work. You can also swap bytes/nibbles in the resulting number to change the pattern with which the generated points "jump around" in N.
This algorithm only requires the starting point (x[0]), the current point (x[n]), the iterator value (Q), and the range limit (N) to be stored.
Perhaps someone else remembers this algorithm and can verify if I'm remembering it correctly?
As #Turtle answered, you problem doesn't have a solution. #KandadaBoggu and #bta solution gives you random numbers is some ranges which are or are not random. You get clusters of numbers.
But I don't know why you care about double occurence of the same number. If (0..99**99) is your range, then if you could generate 10^10 random numbers per second (if you have a 3 GHz processor and about 4 cores on which you generate one random number per CPU cycle - which is imposible, and ruby will even slow it down a lot), then it would take about 10^180 years to exhaust all the numbers. You have also probability about 10^-180 that two identical numbers will be generated during a whole year. Our universe has probably about 10^9 years, so if your computer could start calculation when the time began, then you would have probability about 10^-170 that two identical numbers were generated. In the other words - practicaly it is imposible and you don't have to care about it.
Even if you would use Jaguar (top 1 from www.top500.org supercomputers) with only this one task, you still need 10^174 years to get all numbers.
If you don't belive me, try
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
x = rand(bigint)
puts "Oh, no!" if tried[x]
tried[x] = true
}
I'll buy you a beer if you will even once see "Oh, no!" on your screen during your life time :)
I could be wrong, but I don't think this is doable without storing some state. At the very least, you're going to need some state.
Even if you only use one bit per value (has this value been tried yes or no) then you will need X/8 bytes of memory to store the result (where X is the largest number). Assuming that you have 2GB of free memory, this would leave you with more than 16 million numbers.
Break the range in to manageable batches as shown below:
def range_walker range, batch_size = 100
size = (range.end - range.begin) + 1
n = size/batch_size
n.times do |i|
x = i * batch_size + range.begin
y = x + batch_size
(x...y).sort_by{rand}.each{|z| p z}
end
d = (range.end - size%batch_size + 1)
(d..range.end).sort_by{rand}.each{|z| p z }
end
You can further randomize solution by randomly choosing the batch for processing.
PS: This is a good problem for map-reduce. Each batch can be worked by independent nodes.
Reference:
Map-reduce in Ruby
you can randomly iterate an array with shuffle method
a = [1,2,3,4,5,6,7,8,9]
a.shuffle!
=> [5, 2, 8, 7, 3, 1, 6, 4, 9]
You want what's called a "full cycle iterator"...
Here is psudocode for the simplest version which is perfect for most uses...
function fullCycleStep(sample_size, last_value, random_seed = 31337, prime_number = 32452843) {
if last_value = null then last_value = random_seed % sample_size
return (last_value + prime_number) % sample_size
}
If you call this like so:
sample = 10
For i = 1 to sample
last_value = fullCycleStep(sample, last_value)
print last_value
next
It would generate random numbers, looping through all 10, never repeating If you change random_seed, which can be anything, or prime_number, which must be greater than, and not be evenly divisible by sample_size, you will get a new random order, but you will still never get a duplicate.
Database systems and other large-scale systems do this by writing the intermediate results of recursive sorts to a temp database file. That way, they can sort massive numbers of records while only keeping limited numbers of records in memory at any one time. This tends to be complicated in practice.
How "random" does your order have to be? If you don't need a specific input distribution, you could try a recursive scheme like this to minimize memory usage:
def gen_random_indices
# Assume your input range is (0..(10**3))
(0..3).sort_by{rand}.each do |a|
(0..3).sort_by{rand}.each do |b|
(0..3).sort_by{rand}.each do |c|
yield "#{a}#{b}#{c}".to_i
end
end
end
end
gen_random_indices do |idx|
run_test_with_index(idx)
end
Essentially, you are constructing the index by randomly generating one digit at a time. In the worst-case scenario, this will require enough memory to store 10 * (number of digits). You will encounter every number in the range (0..(10**3)) exactly once, but the order is only pseudo-random. That is, if the first loop sets a=1, then you will encounter all three-digit numbers of the form 1xx before you see the hundreds digit change.
The other downside is the need to manually construct the function to a specified depth. In your (0..(99**99)) case, this would likely be a problem (although I suppose you could write a script to generate the code for you). I'm sure there's probably a way to re-write this in a state-ful, recursive manner, but I can't think of it off the top of my head (ideas, anyone?).
[Edit]: Taking into account #klew and #Turtle's answers, the best I can hope for is batches of random (or close to random) numbers.
This is a recursive implementation of something similar to KandadaBoggu's solution. Basically, the search space (as a range) is partitioned into an array containing N equal-sized ranges. Each range is fed back in a random order as a new search space. This continues until the size of the range hits a lower bound. At this point the range is small enough to be converted into an array, shuffled, and checked.
Even though it is recursive, I haven't blown the stack yet. Instead, it errors out when attempting to partition a search space larger than about 10^19 keys. I has to do with the numbers being too large to convert to a long. It can probably be fixed:
# partition a range into an array of N equal-sized ranges
def partition(range, n)
ranges = []
first = range.first
last = range.last
length = last - first + 1
step = length / n # integer division
((first + step - 1)..last).step(step) { |i|
ranges << (first..i)
first = i + 1
}
# append any extra onto the last element
ranges[-1] = (ranges[-1].first)..last if last > step * ranges.length
ranges
end
I hope the code comments help shed some light on my original question.
pastebin: full source
Note: PW_LEN under # options can be changed to a lower number in order to get quicker results.
For a prohibitively large space, like
space = -10..1000000000000000000000
You can add this method to Range.
class Range
M127 = 170_141_183_460_469_231_731_687_303_715_884_105_727
def each_random(seed = 0)
return to_enum(__method__) { size } unless block_given?
unless first.kind_of? Integer
raise TypeError, "can't randomly iterate from #{first.class}"
end
sample_size = self.end - first + 1
sample_size -= 1 if exclude_end?
j = coprime sample_size
v = seed % sample_size
each do
v = (v + j) % sample_size
yield first + v
end
end
protected
def gcd(a,b)
b == 0 ? a : gcd(b, a % b)
end
def coprime(a, z = M127)
gcd(a, z) == 1 ? z : coprime(a, z + 1)
end
end
You could then
space.each_random { |i| puts i }
729815750697818944176
459631501395637888351
189447252093456832526
919263002791275776712
649078753489094720887
378894504186913665062
108710254884732609237
838526005582551553423
568341756280370497598
298157506978189441773
27973257676008385948
757789008373827330134
487604759071646274309
217420509769465218484
947236260467284162670
677052011165103106845
406867761862922051020
136683512560740995195
866499263258559939381
596315013956378883556
326130764654197827731
55946515352016771906
785762266049835716092
515578016747654660267
...
With a good amount of randomness so long as your space is a few orders smaller than M127.
Credit to #nick-steele and #bta for the approach.
This isn't really a Ruby-specific answer but I hope it's permitted. Andrew Kensler gives a C++ "permute()" function that does exactly this in his "Correlated Multi-Jittered Sampling" report.
As I understand it, the exact function he provides really only works if your "array" is up to size 2^27, but the general idea could be used for arrays of any size.
I'll do my best to sort of explain it. The first part is you need a hash that is reversible "for any power-of-two sized domain". Consider x = i + 1. No matter what x is, even if your integer overflows, you can determine what i was. More specifically, you can always determine the bottom n-bits of i from the bottom n-bits of x. Addition is a reversible hash operation, as is multiplication by an odd number, as is doing a bitwise xor by a constant. If you know a specific power-of-two domain, you can scramble bits in that domain. E.g. x ^= (x & 0xFF) >> 5) is valid for the 16-bit domain. You can specify that domain with a mask, e.g. mask = 0xFF, and your hash function becomes x = hash(i, mask). Of course you can add a "seed" value into that hash function to get different randomizations. Kensler lays out more valid operations in the paper.
So you have a reversible function x = hash(i, mask, seed). The problem is that if you hash your index, you might end up with a value that is larger than your array size, i.e. your "domain". You can't just modulo this or you'll get collisions.
The reversible hash is the key to using a technique called "cycle walking", introduced in "Ciphers with Arbitrary Finite Domains". Because the hash is reversible (i.e. 1-to-1), you can just repeatedly apply the same hash until your hashed value is smaller than your array! Because you're applying the same hash, and the mapping is one-to-one, whatever value you end up on will map back to exactly one index, so you don't have collisions. So your function could look something like this for 32-bit integers (pseudocode):
fun permute(i, length, seed) {
i = hash(i, 0xFFFF, seed)
while(i >= length): i = hash(i, 0xFFFF, seed)
return i
}
It could take a lot of hashes to get to your domain, so Kensler does a simple trick: he keeps the hash within the domain of the next power of two, which makes it require very few iterations (~2 on average), by masking out the unnecessary bits. The final algorithm looks like this:
fun next_pow_2(length) {
# This implementation is for clarity.
# See Kensler's paper for one way to do it fast.
p = 1
while (p < length): p *= 2
return p
}
permute(i, length, seed) {
mask = next_pow_2(length)-1
i = hash(i, mask, seed) & mask
while(i >= length): i = hash(i, mask, seed) & mask
return i
}
And that's it! Obviously the important thing here is choosing a good hash function, which Kensler provides in the paper but I wanted to break down the explanation. If you want to have different random permutations each time, you can add a "seed" value to the permute function which then gets passed to the hash function.

Resources