Cryptographic Shuffle and Pick - random

I'm implementing a cryptographically secure shuffle routine and have a couple questions:
The method I'm using is a weighted sort where each weight is a cryptographically strong random number.
I'm computing the number of bits required for each weight by taking the number of items in the list (X) and plugging it into log2(X!) = log10(X!) / log10(2). For example, a 52 card deck would require log10(52!) / log10(2) ≈ 225.581 bits per weight. I'm always rounding this up since fractions of a bit cannot be represented. Am I correct in always rounding up or is that giving me too many bits?
Retrieving bits from a hardware rng is perhaps not so practical so bytes must be retrieved. As in the previous example 226 / 8 = 28.25, so we have 28 full bytes, and an extra byte to get the remaining 2 bits. What I'm doing is discarding the unused upper 6 bits of the last byte such that only 2 more bits are appended to the number. Am I correct in simply discarding these bits or am I destroying the entropy by doing that?
I am sorting by the (left padded, all uppercase, ASCII) hexadecimal strings of weights assigned to each number. This appears to produce the correct sort order. Are there any catches to sorting strings in this manner that I should be aware of?
I should be using a hardware rng which tests the entropy of the numbers it's generating, but I'm stuck using MS RNGCryptoServiceProvider. Are there better Cryptographic RNG's to use with .NET?
To "pick" a number from the cryptographically weighted and sorted list, I'm simply choosing index 0. Is there a better cryptographically random method of choosing an item in the list?
Let me know if I can clarify anything, and if this is the wrong site, please let me know what a better site would be.
Here's my code if it helps illustrate what I'm talking about (VB.NET console application):
Imports System.Security.Cryptography

Module Module1

    Public Class Ball
        Public Weight As String
        Public Value As Integer

        Public Sub New(ByVal _Weight As String, ByVal _Value As Integer)
            Weight = _Weight
            Value = _Value
        End Sub
    End Class

    Public Class BallComparer_Weight
        Implements IComparer(Of Ball)

        Public Function Compare(x As Ball, y As Ball) As Integer Implements System.Collections.Generic.IComparer(Of Ball).Compare
            If x.Weight > y.Weight Then
                Return 1
            ElseIf x.Weight < y.Weight Then
                Return -1
            Else
                Return 0
            End If
        End Function
    End Class

    Public Class BallComparer_Value
        Implements IComparer(Of Ball)

        Public Function Compare(x As Ball, y As Ball) As Integer Implements System.Collections.Generic.IComparer(Of Ball).Compare
            If x.Value > y.Value Then
                Return 1
            ElseIf x.Value < y.Value Then
                Return -1
            Else
                Return 0
            End If
        End Function
    End Class

    Public Function Weight(ByVal rng As RNGCryptoServiceProvider, ByVal bits As Integer) As String
        ' generate a "cryptographically" random string of length 'bits' (should be using hardware rng)
        Dim remainder As Integer = bits Mod 8
        Dim quotient As Integer = bits \ 8
        Dim byteCount As Integer = quotient + If(remainder <> 0, 1, 0)
        Dim bytes() As Byte = New Byte(byteCount - 1) {}
        Dim result As String = String.Empty
        rng.GetBytes(bytes)
        For index As Integer = bytes.Length - 1 To 0 Step -1
            If index = bytes.Length - 1 AndAlso remainder <> 0 Then
                ' keep only the low `remainder` bits of the most significant byte
                Dim value As Byte = (bytes(index) << (8 - remainder)) >> (8 - remainder)
                result &= value.ToString("X2")
            Else
                result &= bytes(index).ToString("X2")
            End If
        Next
        Return result
    End Function

    Public Function ContainsValue(ByVal lst As List(Of Ball), ByVal value As Integer) As Boolean
        For i As Integer = 0 To lst.Count - 1
            If lst(i).Value = value Then
                Return True
            End If
        Next
        Return False
    End Function

    Sub Main()
        Dim valueComparer As New BallComparer_Value()
        Dim weightComparer As New BallComparer_Weight()
        Dim picks As New List(Of Ball)
        Dim balls As New List(Of Ball)
        ' number of bits after each "ball" is drawn
        Dim bits() As Integer = New Integer() {364, 358, 351, 345, 339}
        Using rng As New RNGCryptoServiceProvider
            While True
                picks.Clear()
                ' simulate random balls
                ' log10(75!) / log10(2) = 363.401... -> 364 bits required for the
                ' weighted random shuffle (reduces each time a ball is pulled)
                For i As Integer = 0 To 4
                    balls.Clear()
                    For value As Integer = 1 To 75
                        ' do not add previous picks
                        If Not ContainsValue(picks, value) Then
                            balls.Add(New Ball(Weight(rng, bits(i)), value))
                        End If
                    Next
                    balls.Sort(weightComparer)
                    'For Each x As Ball In balls
                    '    Console.WriteLine(x.Weight)
                    'Next
                    'Console.ReadLine()
                    ' choose first ball in sorted list
                    picks.Add(balls(0))
                Next
                picks.Sort(valueComparer)
                ' simulate the megaball
                ' log10(15!) / log10(2) = 40.25... -> 41 bits required for the megaball
                balls.Clear()
                For value As Integer = 1 To 15
                    balls.Add(New Ball(Weight(rng, 41), value))
                Next
                balls.Sort(weightComparer)
                ' print to stdout
                For i As Integer = 0 To 4
                    Console.Write(picks(i).Value.ToString("D2") & " "c)
                Next
                Console.WriteLine(balls(0).Value.ToString("D2"))
            End While
        End Using
    End Sub
End Module

Your basic idea seems sound. However:
You don't need that many bits in your weights. All you need is enough to make collisions unlikely, i.e. about ⌈log₂ n²⌉ = 2·log₂ n bits per item, plus a few for good measure. For 52 cards, the bare minimum is about 12 bits per card, and 16 bits will get the probability of a collision down to about 4%. That should be plenty, at least as long as you check for collisions explicitly.
You should check for collisions (i.e. two items having the same random sort key), and restart the shuffle if you find one. Alternatively, you can increase the length of the sort keys enough to make the probability of getting a collision negligibly small.
Yes, encoding the sort keys in hexadecimal should be OK. In fact, it doesn't really matter much how you encode them, as long as it's deterministic (i.e. always gives the same encoding for the same random number). That said, since you know the length of the random bitstrings, why not just store them in raw binary? (In particular, if you need less than 64 bits per key, you could just store each key in an appropriately sized integer variable.)
If you want to avoid side channel attacks, you should choose a sorting method that provably runs in constant time, and with constant power consumption, regardless of what the final order will be. This is easier said than done, since most common sorting algorithms are nowhere near constant time. That said, depending on your application, such attacks may or may not matter (but don't rule them out before you've thought about the issue!).
An alternative method of securely shuffling an array would be to use a Fisher–Yates shuffle with a cryptographically secure RNG. This method can be less wasteful of bits and easier to implement in constant time (or at least in time independent of the output; see below), but it does require your generator to be able to return unbiased samples from any integer range, not just from ranges with a power-of-two length. (Rejection sampling is one way to do this — it's not constant-time, but it can be shown that the time needed does not reveal anything about the eventual output, so it's still OK.)
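For concreteness, here is a minimal sketch of that Fisher–Yates variant in Python (not the asker's VB.NET); secrets is Python's CSPRNG interface, and secrets.randbelow performs the unbiased rejection sampling internally:

import secrets

def secure_shuffle(items):
    # Fisher-Yates: walk from the last index down, swapping each element
    # with one at a uniformly chosen index no greater than it
    for i in range(len(items) - 1, 0, -1):
        j = secrets.randbelow(i + 1)  # uniform on 0..i, no modulo bias
        items[i], items[j] = items[j], items[i]
    return items

For a 52-card deck this consumes on the order of log2(52!) ≈ 226 random bits, the same figure the question computes.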
Finally, if you only need one element from the shuffled array, all of this is unnecessary: just pick a random index to the array (uniformly, e.g. using the rejection sampling method mentioned above) and return the corresponding element.

Related

Random Numbers based on the ANU Quantum Random Numbers Server

I have been asked to use the ANU Quantum Random Numbers Service to create random numbers and use Random.rand only as a fallback.
module QRandom
  def self.next
    RestClient.get('http://qrng.anu.edu.au/API/jsonI.php?type=uint16&length=1') { |response, request, result, &block|
      case response.code
      when 200
        _json = JSON.parse(response)
        if _json["success"] == true && _json["data"]
          _json["data"].first || Random.rand(65535)
        else
          Random.rand(65535) # fallback
        end
      else
        puts response # log problem
        Random.rand(65535) # fallback
      end
    }
  end
end
Their API service gives me a number between 0-65535. In order to create a random for a bigger set, like a random number between 0-99999, I have to do the following:
(QRandom.next.to_f*(99999.to_f/65535)).round
This strikes me as the wrong way of doing it, since if I were to use a service (quantum or not) that creates numbers from 0-3 and transpose them into a space of 0-9999, I would always get one of only 4 possible numbers. How can I use the service that produces numbers between 0-65535 to create random numbers for a larger number set?
Since 65535 is 1111111111111111 in binary, you can just think of the random number server as a source of random bits. The fact that it gives the bits to you in chunks of 16 is not important, since you can make multiple requests and you can also ignore certain bits from the response.
So after performing that abstraction, what we have now is a service that gives you a random bit (0 or 1) whenever you want it.
Figure out how many bits of randomness you need. Since you want a number between 0 and 99999, you just need to find a binary number that is all ones and is greater than or equal to 99999. Decimal 99999 is equal to binary 11000011010011111, which is 17 bits long, so you will need 17 bits of randomness.
Now get 17 bits of randomness from the service and assemble them into a binary number. The number will be between 0 and 2**17-1 (131071), and it will be evenly distributed. If the random number happens to be greater than 99999, then throw away the bits you have and try again. (The probability of needing to retry is about 24% here, and by construction always less than 50%.)
Eventually you will get a number between 0 and 99999, and this algorithm should give you a totally uniform distribution.
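A sketch of that procedure in Python (secrets.randbits stands in for the 16-bit quantum service; rand_upto and next_uint16 are hypothetical names, not part of the question's code):

import secrets

def next_uint16():
    # stand-in for the QRandom service; returns an integer in 0..65535
    return secrets.randbits(16)

def rand_upto(limit):
    # uniform integer in [0, limit], assembled from 16-bit chunks
    bits_needed = limit.bit_length()       # 17 bits for limit = 99999
    while True:
        value, collected = 0, 0
        while collected < bits_needed:
            value = (value << 16) | next_uint16()
            collected += 16
        value >>= collected - bits_needed  # discard the surplus bits
        if value <= limit:                 # otherwise throw away and retry
            return value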
How about asking for more numbers? Using the length parameter of that API you can just ask for extra numbers and sum them so you get bigger numbers like you want.
http://qrng.anu.edu.au/API/jsonI.php?type=uint16&length=2
You can use inject for the sum and the modulo operation to make sure the number is not bigger than you want.
json["data"].inject(:+) % MAX_NUMBER
I made some other changes to your code like using SecureRandom instead of the regular Random. You can find the code here:
https://gist.github.com/matugm/bee45bfe637f0abf8f29#file-qrandom-rb
Think of the individual numbers you are getting as 16 bits of randomness. To make larger random numbers, you just need more bits. The tricky bit is figuring out how many bits is enough. For example, if you wanted to generate numbers from an absolutely fair distribution from 0 to 65000, then it should be pretty obvious that 16 bits are not enough; even though you have the range covered, some numbers will have twice the probability of being selected than others.
There are a couple of ways around this problem. Using Ruby's Bignum (technically that happens behind the scenes; it works well in Ruby because you won't overflow your Integer type), it is possible to use a method that simply collects more bits until the result of a division could never be ambiguous - i.e. until adding more significant bits to the division you are doing could never change the result.
This is what it might look like, using your QRandom.next method to fetch bits in batches of 16:
def QRandom.rand max
  max = max.to_i # This approach requires integers
  power = 1
  sum = 0
  loop do
    sum = 2**16 * sum + QRandom.next
    power *= 2**16
    lower_bound = sum * max / power
    break lower_bound if lower_bound == ( (sum + 1) * max ) / power
  end
end
Because it costs you quite a bit to fetch random bits from your chosen source, you may benefit from taking this to the most efficient form possible, which is similar in principle to Arithmetic Coding and squeezes out the maximum possible entropy from your source whilst generating unbiased numbers in 0...max. You would need to implement a method QRandom.next_bits( num ) that returned an integer constructed from a bitstream buffer originating with your 16-bit numbers:
def QRandom.rand max
  max = max.to_i # This approach requires integers
  # I prefer this: start_bits = Math.log2( max ).floor
  # But this also works (and avoids suggestions the algo uses FP):
  start_bits = max.to_s(2).length
  sum = QRandom.next_bits( start_bits )
  power = 2 ** start_bits
  # No need for fractional bits if max is power of 2
  return sum if power == max
  # Draw 1 bit at a time to resolve fractional powers of 2
  loop do
    lower_bound = (sum * max) / power
    break lower_bound if lower_bound == ((sum + 1) * max) / power
    sum = 2 * sum + QRandom.next_bits(1) # 0 or 1
    power *= 2
  end
end
This is the most efficient use of bits from your source possible. It is always as efficient or better than re-try schemes. The expected number of bits used per call to QRandom.rand( max ) is 1 + Math.log2( max ) - i.e. on average this allows you to draw just over the fractional number of bits needed to represent your range.

In random draw: how to ensure that a value is not re-drawn too soon

When drawing at random from a set of values in succession, where a drawn value is allowed to be drawn again, a given value has (of course) a small chance of being drawn twice (or more) in immediate succession, but that causes an issue (for the purposes of a given application) and we would like to eliminate this chance. Any algorithmic ideas on how to do so (simple/efficient)?
Ideally we would like to set a threshold say as a percentage of the size of the data set:
Say the size of the set of values N=100, and the threshold T=10%, then if a given value is drawn in the current draw, it is guaranteed not to show up again in the next N*T=10 draws.
Obviously this restriction introduces bias in the random selection. We don't mind that a proposed algorithm introduces further bias into the randomness of selection; what really matters for this application is that the selection is just random enough to appear so for a human observer.
As an implementation detail, the values are stored as database records, so database table flags/values can be used, or maybe external memory structures. Answers about the abstract case are welcome too.
Edit:
I just hit this other SO question here, which has good overlap with my own. Going through the good points there.
Here's an implementation that does the whole process in O(1) (for a single element) without any bias:
The idea is to treat the last K elements of the array A (which contains all the values) like a queue. We draw a value from the first N-K values in A (that is the random value) and swap it with the element at position N-Pointer, where Pointer represents the head of the queue and resets to 1 after crossing K elements.
To eliminate any bias in the first K draws, the random value is drawn between 1 and N-Pointer instead of N-K, so this virtual queue grows at each draw until reaching size K (e.g. after 3 draws the possible values appear in A between indexes 1 and N-3, and the suspended values occupy indexes N-2 to N).
All operations are O(1) for drawing a single element and there's no bias throughout the entire process.
void DrawNumbers(val[] A, int K)
{
    N = A.size;
    random Rnd = new random;
    int Drawn_Index;
    int Count_To_K = 1;
    int Pointer = K;
    while (stop_drawing_condition)
    {
        if (Count_To_K <= K)
        {
            Drawn_Index = Rnd.NextInteger(1, N - Pointer);
            Count_To_K++;
        }
        else
        {
            Drawn_Index = Rnd.NextInteger(1, N - K);
        }
        Print("drawn value is: " + A[Drawn_Index]);
        Swap(A[Drawn_Index], A[N - Pointer]);
        Pointer--;
        if (Pointer < 1) Pointer = K;
    }
}
My previous suggestion, using a list and an actual queue, depends on the remove method of the list, which I believe can be at best O(log N) (using an array to implement a self-balancing binary tree), as the list has to have direct access to indexes.
void DrawNumbers(list N, int K)
{
    queue Suspended_Values = new queue;
    random Rnd = new random;
    int Drawn_Index;
    while (stop_drawing_condition)
    {
        if (Suspended_Values.count == K)
            N.add(Suspended_Values.Dequeue());
        Drawn_Index = Rnd.NextInteger(1, N.size); // random integer between 1 and the number of values in N
        Print("drawn value is: " + N[Drawn_Index]);
        Suspended_Values.Enqueue(N[Drawn_Index]);
        N.Remove(Drawn_Index);
    }
}
I assume you have an array, A, that contains the items you want to draw. At each time period you randomly select an item from A.
You want to prevent any given item, i, from being drawn again within some k iterations.
Let's say that your threshold is 10% of A.
So create a queue, call it drawn, that can hold threshold items. Also create a hash table that contains the drawn items. Call the hash table hash.
Then:
do
{
    i = Get random item from A
    if (i in hash)
    {
        // we have drawn this item recently. Don't draw it.
        continue;
    }
    draw(i);
    if (drawn.count == k)
    {
        // remove oldest item from queue
        temp = drawn.dequeue();
        // and from the hash table
        hash.remove(temp);
    }
    // add new item to queue and hash table
    drawn.enqueue(i);
    hash.add(i);
} while (forever);
The hash table exists solely to increase lookup speed. You could do without the hash table if you're willing to do a sequential search of the queue to determine if an item has been drawn recently.
Say you have n items in your list, and you don't want any of the k last items to be selected.
Select at random from an array of size n-k, and use a queue of size k to stick the items you don't want to draw (adding to the front and removing from the back).
All operations are O(1).
---- clarification ----
Given n items, and a goal of not redrawing any of the last k draws, create an array and a queue as follows.
Create an array A of size n-k, and put n-k of your items in the list (chosen at random, or seeded however you like).
Create a queue (linked list) Q and populate it with the remaining k items, again in random order or whatever order you like.
Now, each time you want to select an item at random:
Choose a random index from your array, call this i.
Give A[i] to whomever is asking for it, and add it to the front of Q.
Remove the element from the back of Q, and store it in A[i].
Everything is O(1) after the array and linked list are created, which is a one-time O(n) operation.
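Here is a minimal Python sketch of that array-plus-queue scheme (the names are mine, not from the answer):

import random
from collections import deque

def make_delayed_sampler(items, k, rng=random.SystemRandom()):
    # a holds the n-k currently eligible items; q holds the k most recent draws
    items = list(items)
    rng.shuffle(items)
    a, q = items[:-k], deque(items[-k:])
    def draw():
        i = rng.randrange(len(a))  # uniform index into the eligible items
        picked = a[i]
        q.appendleft(picked)       # the fresh draw goes to the front of the queue
        a[i] = q.pop()             # the item drawn k steps ago becomes eligible again
        return picked
    return draw

draw = make_delayed_sampler(range(100), 10)
print([draw() for _ in range(20)])  # no value repeats within 10 draws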
Now, you might wonder, what do we do if we want to change n (i.e. add or remove an element).
Each time we add an element, we either want to grow the size of A or of Q, depending on our logic for deciding what k is (i.e. fixed value, fixed fraction of n, whatever...).
If Q increases then the result is trivial, we just append the new element to Q. In this case I'd probably append it to the end of Q so that it gets in play ASAP. You could also put it in A, kicking some element out of A and appending it to the end of Q.
If A increases, you can use a standard technique for increasing arrays in amortized constant time. E.g., each time A fills up, we double it in size, and keep track of the number of cells of A that are live. (look up 'Dynamic Arrays' in Wikipedia if this is unfamiliar).
Set-based approach:
If the threshold is low (say below 40%), the suggested approach is:
Have a set and a queue of the last N*T generated values.
When generating a value, keep regenerating it until it's not contained in the set.
When pushing to the queue, pop the oldest value and remove it from the set.
Pseudo-code:
generateNextValue:
    // once we've generated more than N*T elements,
    // we need to start removing old elements
    if queue.size >= N*T
        element = queue.pop
        set.remove(element)
    // keep trying to generate random values until one is not contained in the set
    do
        value = getRandomValue()
    while set.contains(value)
    set.add(value)
    queue.push(value)
    return value
If the threshold is high, you can just turn the above on its head:
Have the set represent all values not in the last N*T generated values.
Invert all set operations (replace all set adds with removes and vice versa and replace the contains with !contains).
Pseudo-code:
generateNextValue:
    if queue.size >= N*T
        element = queue.pop
        set.add(element)
    // we can now just get a random value from the set, as it contains all candidates,
    // rather than generating random values until we find one that works
    value = getRandomValueFromSet()
    //do
    //    value = getRandomValue()
    //while !set.contains(value)
    set.remove(value)
    queue.push(value)
    return value
Shuffle-based approach: (somewhat more complicated than the above)
If the threshold is high, the above may take long, as it could keep generating values that already exist.
In this case, some shuffle-based approach may be a better idea.
Shuffle the data.
Repeatedly process the first element.
When doing so, remove it and insert it back at a random position in the range [N*T, N].
Example:
Let's say N*T = 5 and all possible values are [1,2,3,4,5,6,7,8,9,10].
Then we first shuffle, giving us, let's say, [4,3,8,9,2,6,7,1,10,5].
Then we remove 4 and insert it back in some index in the range [5,10] (say at index 5).
Then we have [3,8,9,2,4,6,7,1,10,5].
And continue removing the next element and insert it back, as required.
Implementation:
An array is fine if we don't care about efficiency a whole lot - getting one element will cost O(n) time.
To make this efficient we need to use an ordered data structure that supports efficient random position inserts and first position removals. The first thing that comes to mind is a (self-balancing) binary search tree, ordered by index.
We won't be storing the actual index, the index will be implicitly defined by the structure of the tree.
At each node we will have a count of children (+ 1 for itself) (which needs to be updated on insert / remove).
An insert can be done as follows: (ignoring the self-balancing part for the moment)
// calling function
insert(node, value)
    insert(node, N*T, value)

insert(node, offset, value)
    // node.left / node.right can be defined as 0 if the child doesn't exist
    leftCount = node.left.count - offset
    rightCount = node.right.count
    // Since we're here, it means we're inserting in this subtree,
    // thus update the count
    node.count++
    // Nodes to the left are within N*T, so simply go right.
    // leftCount is the difference between N*T and the number of nodes on the left,
    // so this needs to be the new offset (and +1 for the current node)
    if leftCount < 0
        insert(node.right, -leftCount+1, value)
    else
        // generate a random number,
        //   on [0, leftCount), insert to the left
        //   on [leftCount, leftCount], insert at the current node
        //   on (leftCount, leftCount + rightCount], insert to the right
        sum = leftCount + rightCount + 1
        random = getRandomNumberInRange(0, sum)
        if random < leftCount
            insert(node.left, offset, value)
        else if random == leftCount
            // we don't actually want to update the count here
            node.count--
            newNode = new Node(value)
            newNode.count = node.count + 1
            // TODO: swap node and newNode's data so that node's parent will now point to newNode
            newNode.right = node
            newNode.left = null
        else
            insert(node.right, -leftCount+1, value)
To visualize inserting at the current node:
If we have something like:
    4
   /
  1
 / \
2   3
And we want to insert 5 where 1 is now, it will do this:
    4
   /
  5
   \
    1
   / \
  2   3
Note that when a red-black tree, for example, performs operations to keep itself balanced, none of these involve comparisons, so it doesn't need to know the order (i.e. index) of any already-inserted elements. But it will have to update the counts appropriately.
The overall efficiency will be O(log n) to get one element.
I'd put all "values" into a "list" of size N, then shuffle the list and retrieve values from the top of the list. Then you "insert" the retrieved value at a random position with any index >= N*T.
Unfortunately I'm not truly a math-guy :( So I simply tried it (in VB, so please take it as pseudocode ;) )
Public Class BiasedRandom
    Private prng As New Random
    Private offset As Integer
    Private l As New List(Of Integer)

    Public Sub New(ByVal size As Integer, ByVal threshold As Double)
        If threshold <= 0 OrElse threshold >= 1 OrElse size < 1 Then Throw New System.ArgumentException("Check your params!")
        offset = size * threshold
        ' initial fill
        For i = 0 To size - 1
            l.Add(i)
        Next
        ' shuffle "Algorithm P"
        For i = size - 1 To 1 Step -1
            Dim j = prng.Next(0, i + 1)
            Dim tmp = l(i)
            l(i) = l(j)
            l(j) = tmp
        Next
    End Sub

    Public Function NextValue() As Integer
        Dim tmp = l(0)
        l.RemoveAt(0)
        l.Insert(prng.Next(offset, l.Count + 1), tmp)
        Return tmp
    End Function
End Class
Then a simple check:
Public Class Form1
    Dim z As Integer = 10
    Dim k As BiasedRandom

    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        k = New BiasedRandom(z, 0.5)
    End Sub

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Dim j(z - 1)
        For i = 1 To 10 * 1000 * 1000
            j(k.NextValue) += 1
        Next
        Stop
    End Sub
End Class
And when I check out the distribution, it looks okay enough to the naked eye ;)
EDIT:
After thinking about RonTeller's argument, I have to admit that he is right. I don't think that there is a performance-friendly way to achieve what is wanted while keeping a good (not more biased than required) random order.
I came to the following idea:
Given a list (array whatever) like this:
0123456789 ' not shuffled to make clear what I mean
We return the first element which is 0. This one must not come up again for 4 (as an example) more draws but we also want to avoid a strong bias. Why not simply put it to the end of the list and then shuffle the "tail" of the list, i.e. the last 6 elements?
1234695807
We now return the 1 and repeat the above steps.
2340519786
And so on and so on. Since removing and inserting is kind of unnecessary work, one could use a simple array and a "pointer" to the actual element. I have changed the code from above to give a sample. It's slower than the first one, but should avoid the mentioned bias.
Public Function NextValue() As Integer
    Static current As Integer = 0
    ' only shuffling a part of the list
    For i = current + l.Count - 1 To current + 1 + offset Step -1
        Dim j = prng.Next(current + offset, i + 1)
        Dim tmp = l(i Mod l.Count)
        l(i Mod l.Count) = l(j Mod l.Count)
        l(j Mod l.Count) = tmp
    Next
    current += 1
    Return l((current - 1) Mod l.Count)
End Function
EDIT 2:
Finally (hopefully), I think the solution is quite simple. The below code assumes that there is an array of N elements called TheArray which contains the elements in random order (could be rewritten to work with sorted array). The value DelaySize determines how long a value should be suspended after it has been drawn.
Public Function NextValue() As Integer
    Static current As Integer = 0
    Dim SelectIndex As Integer = prng.Next(0, TheArray.Count - DelaySize)
    Dim ReturnValue = TheArray(SelectIndex)
    TheArray(SelectIndex) = TheArray(TheArray.Count - 1 - current Mod DelaySize)
    TheArray(TheArray.Count - 1 - current Mod DelaySize) = ReturnValue
    current += 1
    Return ReturnValue
End Function

Finding if a random number has occurred before or not

Let me be clear at the start that this is a contrived example and not a real world problem.
If I have the problem of creating a random number between 0 to 10, I do this 11 times, making sure that a previously occurring number is not drawn again; if I get a repeated number,
I create another random number to make sure it has not been seen earlier. So essentially I get a sequence of unique numbers from 0 - 10 in a random order,
e.g. 3 1 2 0 5 9 4 8 10 6 7 and so on
Now to come up with logic to make sure that the random numbers are unique and not one which we have drawn before, we could use many approaches
Use a C++ std::bitset and set the bit at the index equal to the value of each random number, then check it the next time a new random number is drawn.
Or
Use a std::map<int,int> to count the number of occurrences, or even a simple C array with some sentinel values stored in it to indicate whether a number has occurred or not.
If I have to avoid these methods above and use some mathematical/logical/bitwise operation to find whether a random number has been draw before or not, is there a way?
You don't want to do it the way you suggest. Consider what happens when you have already selected 10 of the 11 items; your random number generator will cycle until it finds the missing number, which might take arbitrarily long, depending on your random number generator.
A better solution is to create a list of numbers 0 to 10 in order, then shuffle the list into a random order. The normal algorithm for doing this is due to Knuth, Fisher and Yates: starting at the last element, swap each element with a randomly chosen element at a location no greater than the current one in the array.
function shuffle(a, n)
    for i from n-1 to 1 step -1
        j = randint(i)
        swap(a[i], a[j])
We assume an array with indices 0 to n-1, and a randint function that returns j in the range 0 <= j <= i.
Use an array and add all possible values to it. Then pick one out of the array and remove it. Next time, pick again until the array is empty.
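A short Python sketch of that pick-and-remove idea; swapping the pick with the last element keeps each removal O(1):

import random

pool = list(range(11))  # all possible values 0..10
while pool:
    i = random.randrange(len(pool))
    pool[i], pool[-1] = pool[-1], pool[i]  # move the pick to the end...
    print(pool.pop())                      # ...so removing it is O(1)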
Yes, there is a mathematical way to do it, but it is a bit expensive.
Have an array primes[] where primes[i] = the i'th prime number, so its beginning will be [2,3,5,7,11,...].
Also store a number mult. Now, once you draw a number (call it i), check whether mult % primes[i] == 0. If it is, the number was drawn before; if it wasn't, the number was not - choose it and do mult = mult * primes[i].
However, it is expensive because it might require a lot of space for large ranges (the possible values of mult increase exponentially).
(This is a nice mathematical approach, because we actually look at a set of primes p_i; the array of primes is only the implementation of the abstract set of primes.)
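For illustration, the prime-product bookkeeping in Python, with the primes table hard-coded for the 0..10 example:

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]  # primes[i] = the i'th prime

mult = 1

def drawn_before(i):
    # i was drawn before iff primes[i] divides mult
    return mult % primes[i] == 0

def mark_drawn(i):
    global mult
    mult *= primes[i]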
A bit manipulation alternative for small values is using an int or long as a bitset.
With this approach, to check a candidate i is not in the set you only need to check:
if ((set & (1L << i)) == 0) // not in the set
else // already in the set
To enter an element i to the set:
set = set | (1L << i)
A better approach would be to populate a list with all the numbers, shuffle it with a Fisher–Yates shuffle, and iterate over it for generating new random numbers.
If I have to avoid these methods above and use some
mathematical/logical/bitwise operation to find whether a random number
has been draw before or not, is there a way?
Subject to your contrived constraints yes, you can imitate a small bitset using bitwise operations:
You can choose different integer types on the right according to what size you need.
bitset code           bitwise code
std::bitset<32> x;    unsigned long x = 0;
if (x[i]) { ... }     if (x & (1UL << i)) { ... }
                      // assuming v is 0 or 1
x[i] = v;             x = (x & ~(1UL << i)) | ((unsigned long)v << i);
x[i] = true;          x |= (1UL << i);
x[i] = false;         x &= ~(1UL << i);
For a larger set (beyond the size in bits of unsigned long long), you will need an array of your chosen integer type. Divide the index by the width of each value to know what index to look up in the array, and use the modulus for the bit shifts. This is basically what bitset does.
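A sketch of that divide/modulus indexing in Python (a Python integer could of course hold the whole set directly; the word array mirrors what std::bitset does internally):

WORD = 64  # bits per array element

def make_bitset(nbits):
    return [0] * ((nbits + WORD - 1) // WORD)

def test_bit(words, i):
    # division picks the word, the modulus picks the bit within it
    return (words[i // WORD] >> (i % WORD)) & 1

def set_bit(words, i):
    words[i // WORD] |= 1 << (i % WORD)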
I'm assuming that the various answers that tell you how best to shuffle 10 numbers are missing the point entirely: that your contrived constraints are there because you do not in fact want or need to know how best to shuffle 10 numbers :-)
Keep a variable to map the drawn numbers. The i'th bit of that variable will be 1 if the number was drawn before:
int mapNumbers = 0;

int generateRand() {
    // return -1 once all 11 numbers (0..10) have been generated
    if ((mapNumbers & ((1 << 11) - 1)) == ((1 << 11) - 1)) return -1;
    int x;
    do {
        x = newVal();
    } while (mapNumbers & (1 << x));
    mapNumbers |= (1 << x);
    return x;
}

How to generate a number in arbitrary range using random()={0..1} preserving uniformness and density?

Generate a random number in range [x..y] where x and y are any arbitrary floating point numbers. Use function random(), which returns a random floating point number in range [0..1] from P uniformly distributed numbers (call it "density"). Uniform distribution must be preserved and P must be scaled as well.
I think there is no easy solution to such a problem. To simplify it a bit, I ask you how to generate a number in the interval [-0.5 .. 0.5], then in [0 .. 2], then in [-2 .. 0], preserving uniformness and density? Thus, for [0 .. 2] it must generate a random number from P*2 uniformly distributed numbers.
The obvious simple solution random() * (x - y) + y will not generate all possible numbers because of the lower density for all abs(x-y)>1.0 cases. Many possible values will be missed. Remember that random() returns only a number from P possible numbers. Then, if you multiply such a number by Q, it will give you only one of P possible values, scaled by Q, but you have to scale the density P by Q as well.
If I understand your problem correctly, I will provide you a solution; but I would exclude 1 from the range.
N = numbers_in_your_random // [0, 0.2, 0.4, 0.6, 0.8] will be 5

// This turns your random number generator to return integer values between [0..N[
function randomInt()
{
    return random() * N;
}

// This turns the integer random number generator to return an arbitrary integer
function getRandomInt(maxValue)
{
    if (maxValue < N)
    {
        return randomInt() % maxValue;
    }
    else
    {
        baseValue = randomInt();
        bRate = maxValue DIV N;
        bMod = maxValue % N;
        if (baseValue < bMod)
        {
            bRate++;
        }
        return N * getRandomInt(bRate) + baseValue;
    }
}

// This will return a random number in range [lower, upper[ with the same density as random()
function extendedRandom(lower, upper)
{
    diff = upper - lower;
    ndiff = diff * N;
    baseValue = getRandomInt(ndiff);
    baseValue /= N;
    return lower + baseValue;
}
If you really want to generate all possible floating point numbers in a given range with uniform numeric density, you need to take into account the floating point format. For each possible value of your binary exponent, you have a different numeric density of codes. A direct generation method will need to deal with this explicitly, and an indirect generation method will still need to take it into account. I will develop a direct method; for the sake of simplicity, the following refers exclusively to IEEE 754 single-precision (32-bit) floating point numbers.
The most difficult case is any interval that includes zero. In that case, to produce an exactly even distribution, you will need to handle every exponent down to the lowest, plus denormalized numbers. As a special case, you will need to split zero into two cases, +0 and -0.
In addition, if you are paying such close attention to the result, you will need to make sure that you are using a good pseudorandom number generator with a large enough state space that you can expect it to hit every value with near-uniform probability. This disqualifies the C/Unix rand() and possibly the *rand48() library functions; you should use something like the Mersenne Twister instead.
The key is to dissect the target interval into subintervals, each of which is covered by different combination of binary exponent and sign: within each subinterval, floating point codes are uniformly distributed.
The first step is to select the appropriate subinterval, with probability proportional to its size. If the interval contains 0, or otherwise covers a large dynamic range, this may potentially require a number of random bits up to the full range of the available exponent.
In particular, for a 32-bit IEEE-754 number, there are 256 possible exponent values. Each exponent governs a range which is half the size of the next greater exponent, except for the denormalized case, which is the same size as the smallest normal exponent region. Zero can be considered the smallest denormalized number; as mentioned above, if the target interval straddles zero, the probability of each of +0 and -0 should perhaps be cut in half, to avoid doubling its weight.
If the subinterval chosen covers the entire region governed by a particular exponent, all that is necessary is to fill the mantissa with random bits (23 bits, for 32-bit IEEE-754 floats). However, if the subinterval does not cover the entire region, you will need to generate a random mantissa that covers only that subinterval.
The simplest way to handle both the initial and secondary random steps may be to round the target interval out to include the entirety of all exponent regions partially covered, then reject and retry numbers that fall outside it. This allows the exponent to be generated with simple power-of-2 probabilities (e.g., by counting the number of leading zeroes in your random bitstream), as well as providing a simple and accurate way of generating a mantissa that covers only part of an exponent interval. (This is also a good way of handling the +/-0 special case.)
As another special case: to avoid inefficient generation for target intervals which are much smaller than the exponent regions they reside in, the "obvious simple" solution will in fact generate fairly uniform numbers for such intervals. If you want exactly uniform distributions, you can generate the sub-interval mantissa by using only enough random bits to cover that sub-interval, while still using the aforementioned rejection method to eliminate values outside the target interval.
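To make the power-of-two exponent selection concrete, here is a Python sketch restricted to the interval [0, 1) (the general subinterval logic above is omitted); each coin flip halves the probability of descending into the next-smaller exponent region, and the mantissa is filled with independent random bits:

import secrets
import struct

def uniform_float32_unit():
    # numerically uniform float32 in [0, 1) in which every code can occur:
    # biased exponent 126 covers [0.5, 1), 125 covers [0.25, 0.5), ...,
    # and exponent 0 is the denormal region [0, 2**-126)
    exponent = 126
    while exponent > 0 and secrets.randbits(1):
        exponent -= 1                # each halving is half as likely
    mantissa = secrets.randbits(23)  # uniform within the chosen region
    return struct.unpack('<f', struct.pack('<I', (exponent << 23) | mantissa))[0]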
well, [0..1] * 2 == [0..2] (still uniform)
[0..1] - 0.5 == [-0.5..0.5] etc.
I wonder where you have experienced such an interview?
Update: well, if we want to start caring about losing precision on multiplication (which is weird, because somehow you did not care about that in the original task) and pretend we care about the "number of values", we can start iterating. In order to do that, we need one more function, which would return uniformly distributed random values in [0..1) — which can be done by dropping the 1.0 value should it ever appear. After that, we can slice the whole range into equal parts small enough to not care about losing precision, choose one randomly (we have enough randomness to do that), and choose a number in this bucket using the [0..1) function for all parts but the last one.
Or, you can come up with a way to code enough values to care about—and just generate random bits for this code, in which case you don't really care whether it's [0..1] or just {0, 1}.
Let me rephrase your question:
Let random() be a random number generator with a discrete uniform distribution over [0,1). Let D be the number of possible values returned by random(), each of which is precisely 1/D greater than the previous. Create a random number generator rand(L, U) with a discrete uniform distribution over [L, U) such that each possible value is precisely 1/D greater than the previous.
--
A couple quick notes.
The problem in this form, and as you phrased it, is unsolvable. That is, if D = 1 there is nothing we can do.
I don't require that 0.0 be one of the possible values for random(). If it is not, then it is possible that the solution below will fail when U - L < 1 / D. I'm not particularly worried about that case.
I use all half-open ranges because it makes the analysis simpler. Using your closed ranges would be simple, but tedious.
Finally, the good stuff. The key insight here is that the density can be maintained by independently selecting the whole and fractional parts of the result.
First, note that given random() it is trivial to create randomBit(). That is,
randomBit() { return random() >= 0.5; }
Then, if we want to select one of {0, 1, 2, ..., 2^N - 1} uniformly at random, that is simple using randomBit(), just generate each of the bits. Call this random2(N).
Using random2() we can select one of {0, 1, 2, ..., N - 1}:
randomInt(N) { while ((val = random2(ceil(log2(N)))) >= N); return val; }
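The same three building blocks in Python, for concreteness (random.random stands in for the answer's random()):

import random

def random_bit():
    return 1 if random.random() >= 0.5 else 0

def random2(n_bits):
    # uniform integer in [0, 2**n_bits), assembled one bit at a time
    v = 0
    for _ in range(n_bits):
        v = (v << 1) | random_bit()
    return v

def random_int(n):
    # uniform integer in [0, n) by rejection
    bits = (n - 1).bit_length()  # == ceil(log2(n)) for n >= 2
    while True:
        v = random2(bits)
        if v < n:
            return v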
Now, if D is known, then the problem is trivial as we can reduce it to simply choosing one of floor((U - L) * D) values uniformly at random and we can do that with randomInt().
So, let's assume that D is not known. Now, let's first make a function to generate random values in the range [0, 2^N) with the proper density. This is simple.
rand2D(N) { return random2(N) + random(); }
rand2D() is where we require that the difference between consecutive possible values for random() be precisely 1/D. If not, the possible values here would not have uniform density.
Next, we need a function that selects a value in the range [0, V) with the proper density. This is similar to randomInt() above.
randD(V) { while ((val = rand2D(ceil(log2(V)))) >= V); return val; }
And finally...
rand(L, U) { return L + randD(U - L); }
We now may have offset the discrete positions if L / D is not an integer, but that is unimportant.
--
A last note, you may have noticed that several of these functions may never terminate. That is essentially a requirement. For example, random() may have only a single bit of randomness. If I then ask you to select from one of three values, you cannot do so uniformly at random with a function that is guaranteed to terminate.
Consider this approach:
I'm assuming the base random number generator in the range [0..1] generates among the numbers 0, 1/(p-1), 2/(p-1), ..., (p-2)/(p-1), (p-1)/(p-1).
If the target interval length is less than or equal to 1, return random()*(y-x) + x.
Else, map each number r from the base RNG to an interval in the target range: [r*(p-1)*(y-x)/p, (r+1/(p-1))*(p-1)*(y-x)/p] (i.e. for each of the P numbers assign one of P intervals with length (y-x)/p).
Then recursively generate another random number in that interval and add it to the interval begin.
Pseudocode:
const p;

function rand(x, y)
    r = random()
    if y-x <= 1
        return x + r*(y-x)
    else
        low = r*(p-1)*(y-x)/p
        high = low + (y-x)/p
        return x + low + rand(low, high)
In pure math, the solution is just the one already provided:
return random() * (upper - lower) + lower
The problem is that, even when you have floating point numbers, they only have a certain resolution. So what you can do is apply the above function and then add another random() value scaled to the missing part.
If I make a practical example it becomes clear what I mean:
E.g. take random()'s return value from 0..1 with 2 digits of accuracy, i.e. 0.XY, and take lower as 100 and upper as 1100.
So with the above algorithm you get as a result 0.XY * (1100-100) + 100 = XY0.0 + 100.
You will never see 201 as a result, as the final digit has to be 0.
The solution here would be to generate another random value and add it *10, so you have accuracy of one more digit (here you have to take care that you don't exceed your given range, which can happen; in this case you have to discard the result and generate a new number).
Maybe you have to repeat it; how often depends on how many places the random() function delivers and how many you expect in your final result.
A standard IEEE format has limited precision (e.g. 53 bits for a double). So when you generate a number this way, you never need to generate more than one additional number.
But you have to be careful that when you add the new number, you don't exceed your given upper limit. There are multiple solutions to this: first, if you exceed your limit, you start anew, generating a completely new number (don't cut off or similar, as this changes the distribution).
The second possibility is to check the interval size of the missing lower bit range, find the middle value, and generate an appropriate value that guarantees that the result will fit.
You have to consider the amount of entropy that comes from each call to your RNG. Here is some C# code I just wrote that demonstrates how you can accumulate entropy from low-entropy source(s) and end up with a high-entropy random value.
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

namespace SO_8019589
{
    class LowEntropyRandom
    {
        public readonly double EffectiveEntropyBits;
        public readonly int PossibleOutcomeCount;
        private readonly double interval;
        private readonly Random random = new Random();

        public LowEntropyRandom(int possibleOutcomeCount)
        {
            PossibleOutcomeCount = possibleOutcomeCount;
            EffectiveEntropyBits = Math.Log(PossibleOutcomeCount, 2);
            interval = 1.0 / PossibleOutcomeCount;
        }

        public LowEntropyRandom(int possibleOutcomeCount, int seed)
            : this(possibleOutcomeCount)
        {
            random = new Random(seed);
        }

        public int Next()
        {
            return random.Next(PossibleOutcomeCount);
        }

        public double NextDouble()
        {
            return interval * Next();
        }
    }

    class EntropyAccumulator
    {
        private List<byte> currentEntropy = new List<byte>();

        public double CurrentEntropyBits { get; private set; }

        public void Clear()
        {
            currentEntropy.Clear();
            CurrentEntropyBits = 0;
        }

        public void Add(byte[] entropy, double effectiveBits)
        {
            currentEntropy.AddRange(entropy);
            CurrentEntropyBits += effectiveBits;
        }

        public byte[] GetBytes(int count)
        {
            using (var hasher = new SHA512Managed())
            {
                count = Math.Min(count, hasher.HashSize / 8);
                var bytes = new byte[count];
                var hash = hasher.ComputeHash(currentEntropy.ToArray());
                Array.Copy(hash, bytes, count);
                return bytes;
            }
        }

        public byte[] GetPackagedEntropy()
        {
            // Returns a compact byte array that represents almost all of the entropy.
            return GetBytes((int)(CurrentEntropyBits / 8));
        }

        public double GetDouble()
        {
            // returns a uniformly distributed number on [0-1)
            return (double)BitConverter.ToUInt64(GetBytes(8), 0) / ((double)UInt64.MaxValue + 1);
        }

        public int GetInt(int maxValue)
        {
            // returns a uniformly distributed integer on [0-maxValue)
            return (int)(maxValue * GetDouble());
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var random = new LowEntropyRandom(2); // this only provides 1 bit of entropy per call
            var desiredEntropyBits = 64; // enough for a double
            while (true)
            {
                var adder = new EntropyAccumulator();
                while (adder.CurrentEntropyBits < desiredEntropyBits)
                {
                    adder.Add(BitConverter.GetBytes(random.Next()), random.EffectiveEntropyBits);
                }
                Console.WriteLine(adder.GetDouble());
                Console.ReadLine();
            }
        }
    }
}
Since I'm using a 512-bit hash function, that is the max amount of entropy that you can get out of the EntropyAccumulator. This could be fixed, if necessary.
If I understand your problem correctly, it's that rand() generates finely spaced but ultimately discrete random numbers. And if we multiply it by (y-x), which is large, this spreads these finely spaced floating point values out in a way that misses many of the floating point values in the range [x,y]. Is that right?
If so, I think we have a solution already given by Dialecticus. Let me explain why he is right.
First, we know how to generate a random float and then add another floating point value to it. This may produce a round off error due to addition, but it will be in the last decimal place only. Use doubles or something with finer numerical resolution if you want better precision. So, with that caveat, the problem is no harder than finding a random float in the range [0,y-x] with uniform density. Let's say y-x = z. Obviously, since z is a floating point it may not be an integer.
We handle the problem in two steps: first we generate the random digits to the left of the decimal point and then generate the random digits to the right of it. Doing both uniformly means their sum is uniformly distributed across the range [0,z] too. Let w be the largest integer <= z. To answer our simplified problem, we can first pick a random integer from the range {0,1,...,w}. Then, step #2 is to add a random float from the unit interval to this random number. This isn't multiplied by any possibly large values, so it has as fine a resolution as the numerical type can have. (Assuming you're using an ideal random floating point number generator.)
So what about the corner case where the random integer was the largest one (i.e. w) and the random float we added to it was larger than z - w so that the random number exceeds the allowed maximum? The answer is simple: do all of it again and check the new result. Repeat until you get a digit in the allowed range. It's an easy proof that a uniformly generated random number which is tossed out and generated again if it's outside an allowed range results in a uniformly generated random in the allowed range. Once you make this key observation, you see that Dialecticus met all your criteria.
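A compact Python sketch of that whole-plus-fractional decomposition with the retry step (the function name is mine):

import math
import random

def uniform_in_range(x, y):
    # whole part and fractional part are drawn independently; candidates
    # that overshoot the interval length z are rejected and redrawn,
    # which preserves uniformity on [x, y]
    z = y - x
    w = math.floor(z)  # largest integer <= z
    while True:
        candidate = random.randint(0, w) + random.random()
        if candidate <= z:
            return x + candidate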
When you generate a random number with random(), you get a floating point number between 0 and 1 having an unknown precision (or density, you name it).
And when you multiply it by a number (NUM), you lose this precision by lg(NUM) (10-based logarithm). So if you multiply by 1000 (NUM=1000), you lose the last 3 digits (lg(1000) = 3).
You may correct this by adding a smaller random number to the original, which has the missing 3 digits. But you don't know the precision, so you can't determine where exactly they are.
I can imagine two scenarios:
(X = range start, Y = range end)
1: you define the precision (PREC, eg. 20 digits, so PREC=20), and consider it enough to generate a random number, so the expression will be:
( random() * (Y-X) + X ) + ( random() / 10 ^ (PREC-trunc(lg(Y-X))) )
with numbers: (X = 500, Y = 1500, PREC = 20)
( random() * (1500-500) + 500 ) + ( random() / 10 ^ (20-trunc(lg(1000))) )
( random() * 1000 + 500 ) + ( random() / 10 ^ (17) )
There are some problems with this:
2 phase random generation (how much will it be random?)
the first random returns 1 -> result can be out of range
2: guess the precision by random numbers
you define some tries (e.g. 4) to calculate the precision by generating random numbers and counting the precision every time:
- 0.4663164 -> PREC=7
- 0.2581916 -> PREC=7
- 0.9147385 -> PREC=7
- 0.129141 -> PREC=6 -> 7, correcting by the average of the other tries
That's my idea.

How can I randomly iterate through a large Range?

I would like to randomly iterate through a range. Each value will be visited only once and all values will eventually be visited. For example:
class Array
  def shuffle
    ret = dup
    j = length
    i = 0
    while j > 1
      r = i + rand(j)
      ret[i], ret[r] = ret[r], ret[i]
      i += 1
      j -= 1
    end
    ret
  end
end
(0..9).to_a.shuffle.each{|x| f(x)}
where f(x) is some function that operates on each value. A Fisher-Yates shuffle is used to efficiently provide random ordering.
My problem is that shuffle needs to operate on an array, which is not cool because I am working with astronomically large numbers. Ruby will quickly consume a large amount of RAM trying to create a monstrous array. Imagine replacing (0..9) with (0..99**99). This is also why the following code will not work:
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
  x = rand(bigint)
  redo if tried[x]
  tried[x] = true
  f(x) # some function
}
This code is very naive and quickly runs out of memory as tried obtains more entries.
What sort of algorithm can accomplish what I am trying to do?
[Edit1]: Why do I want to do this? I'm trying to exhaust the search space of a hash algorithm for a N-length input string looking for partial collisions. Each number I generate is equivalent to a unique input string, entropy and all. Basically, I'm "counting" using a custom alphabet.
[Edit2]: This means that f(x) in the above examples is a method that generates a hash and compares it to a constant, target hash for partial collisions. I do not need to store the value of x after I call f(x) so memory should remain constant over time.
[Edit3/4/5/6]: Further clarification/fixes.
[Solution]: The following code is based on #bta's solution. For the sake of conciseness, next_prime is not shown. It produces acceptable randomness and only visits each number once. See the actual post for more details.
N = size_of_range
Q = ( 2 * N / (1 + Math.sqrt(5)) ).to_i.next_prime
START = rand(N)
x = START
nil until f( x = (x + Q) % N ) == START # assuming f(x) returns x
I just remembered a similar problem from a class I took years ago; that is, iterating (relatively) randomly through a set (completely exhausting it) given extremely tight memory constraints. If I'm remembering this correctly, our solution algorithm was something like this:
Define the range to be from 0 to some number N
Generate a random starting point x[0] inside N
Generate an iterator Q less than N
Generate successive points x[n] by adding Q to the previous point and wrapping around if needed. That is, x[n+1] = (x[n] + Q) % N
Repeat until you generate a new point equal to the starting point.
The trick is to find an iterator that will let you traverse the entire range without generating the same value twice. If I'm remembering correctly, any relatively prime N and Q will work (the closer the number to the bounds of the range the less 'random' the input). In that case, a prime number that is not a factor of N should work. You can also swap bytes/nibbles in the resulting number to change the pattern with which the generated points "jump around" in N.
This algorithm only requires the starting point (x[0]), the current point (x[n]), the iterator value (Q), and the range limit (N) to be stored.
Perhaps someone else remembers this algorithm and can verify if I'm remembering it correctly?
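That description matches what a later answer calls a full-cycle iterator; here is a quick Python sketch for verification (drawing the stride Q until it is coprime to N guarantees the walk visits every point exactly once):

import random
from math import gcd

def random_cycle(n):
    # yields every value in range(n) exactly once, in a scrambled order,
    # storing only the start point, the current point, the stride and n
    start = random.randrange(n)
    q = 1
    if n > 1:
        q = random.randrange(1, n)
        while gcd(q, n) != 1:  # coprime stride => full cycle
            q = random.randrange(1, n)
    x = start
    while True:
        x = (x + q) % n
        yield x
        if x == start:
            return

assert sorted(random_cycle(10)) == list(range(10))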
As #Turtle answered, your problem doesn't have a solution. #KandadaBoggu's and #bta's solutions give you random numbers in some ranges which are or are not random. You get clusters of numbers.
But I don't know why you care about a double occurrence of the same number. If (0..99**99) is your range, then even if you could generate 10^10 random numbers per second (if you have a 3 GHz processor and about 4 cores on which you generate one random number per CPU cycle - which is impossible, and Ruby will even slow it down a lot), it would take about 10^180 years to exhaust all the numbers. You also have a probability of about 10^-180 that two identical numbers will be generated during a whole year. Our universe is probably about 10^10 years old, so if your computer could have started calculating when time began, you would have a probability of about 10^-170 that two identical numbers were ever generated. In other words - practically it is impossible and you don't have to care about it.
Even if you would use Jaguar (top 1 from www.top500.org supercomputers) with only this one task, you still need 10^174 years to get all numbers.
If you don't believe me, try
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
  x = rand(bigint)
  puts "Oh, no!" if tried[x]
  tried[x] = true
}
I'll buy you a beer if you will even once see "Oh, no!" on your screen during your life time :)
I could be wrong, but I don't think this is doable without storing some state. At the very least, you're going to need some state.
Even if you only use one bit per value (has this value been tried, yes or no), you will need X/8 bytes of memory to store the result (where X is the largest number). Assuming that you have 2GB of free memory, this would leave you with more than 16 billion numbers.
Break the range in to manageable batches as shown below:
def range_walker range, batch_size = 100
  size = (range.end - range.begin) + 1
  n = size / batch_size
  n.times do |i|
    x = i * batch_size + range.begin
    y = x + batch_size
    (x...y).sort_by{rand}.each{|z| p z}
  end
  d = (range.end - size % batch_size + 1)
  (d..range.end).sort_by{rand}.each{|z| p z}
end
You can further randomize the solution by randomly choosing the batch for processing.
PS: This is a good problem for map-reduce. Each batch can be worked by independent nodes.
Reference:
Map-reduce in Ruby
you can randomly iterate over an array with the shuffle method
a = [1,2,3,4,5,6,7,8,9]
a.shuffle!
=> [5, 2, 8, 7, 3, 1, 6, 4, 9]
You want what's called a "full cycle iterator"...
Here is pseudocode for the simplest version, which is perfect for most uses...
function fullCycleStep(sample_size, last_value, random_seed = 31337, prime_number = 32452843) {
    if last_value = null then last_value = random_seed % sample_size
    return (last_value + prime_number) % sample_size
}
If you call this like so:
sample = 10
For i = 1 to sample
    last_value = fullCycleStep(sample, last_value)
    print last_value
next
It would generate random numbers, looping through all 10, never repeating. If you change random_seed, which can be anything, or prime_number, which must be greater than, and not be evenly divisible by, sample_size, you will get a new random order, but you will still never get a duplicate.
Database systems and other large-scale systems do this by writing the intermediate results of recursive sorts to a temp database file. That way, they can sort massive numbers of records while only keeping limited numbers of records in memory at any one time. This tends to be complicated in practice.
How "random" does your order have to be? If you don't need a specific input distribution, you could try a recursive scheme like this to minimize memory usage:
def gen_random_indices
  # Assume your input range is (0..(10**3))
  (0..9).sort_by{rand}.each do |a|
    (0..9).sort_by{rand}.each do |b|
      (0..9).sort_by{rand}.each do |c|
        yield "#{a}#{b}#{c}".to_i
      end
    end
  end
end

gen_random_indices do |idx|
  run_test_with_index(idx)
end
Essentially, you are constructing the index by randomly generating one digit at a time. In the worst-case scenario, this will require enough memory to store 10 * (number of digits). You will encounter every number in the range (0..(10**3)) exactly once, but the order is only pseudo-random. That is, if the first loop sets a=1, then you will encounter all three-digit numbers of the form 1xx before you see the hundreds digit change.
The other downside is the need to manually construct the function to a specified depth. In your (0..(99**99)) case, this would likely be a problem (although I suppose you could write a script to generate the code for you). I'm sure there's probably a way to re-write this in a state-ful, recursive manner, but I can't think of it off the top of my head (ideas, anyone?).
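On that closing question, one state-ful recursive possibility is a generator that keeps one shuffled digit list per recursion level, i.e. O(base * digits) memory (a Python sketch, not the answer's Ruby):

import random

def gen_random_indices(digits, base=10):
    # yields every index in range(base**digits) exactly once
    if digits == 0:
        yield 0
        return
    order = list(range(base))
    random.shuffle(order)
    for d in order:
        for rest in gen_random_indices(digits - 1, base):
            yield d * base ** (digits - 1) + rest

assert sorted(gen_random_indices(3)) == list(range(1000))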
[Edit]: Taking into account #klew and #Turtle's answers, the best I can hope for is batches of random (or close to random) numbers.
This is a recursive implementation of something similar to KandadaBoggu's solution. Basically, the search space (as a range) is partitioned into an array containing N equal-sized ranges. Each range is fed back in a random order as a new search space. This continues until the size of the range hits a lower bound. At this point the range is small enough to be converted into an array, shuffled, and checked.
Even though it is recursive, I haven't blown the stack yet. Instead, it errors out when attempting to partition a search space larger than about 10^19 keys. It has to do with the numbers being too large to convert to a long. It can probably be fixed:
# partition a range into an array of N equal-sized ranges
def partition(range, n)
  ranges = []
  first = range.first
  last = range.last
  length = last - first + 1
  step = length / n # integer division
  ((first + step - 1)..last).step(step) { |i|
    ranges << (first..i)
    first = i + 1
  }
  # append any extra onto the last element
  ranges[-1] = (ranges[-1].first)..last if last > step * ranges.length
  ranges
end
I hope the code comments help shed some light on my original question.
pastebin: full source
Note: PW_LEN under # options can be changed to a lower number in order to get quicker results.
For a prohibitively large space, like
space = -10..1000000000000000000000
You can add this method to Range.
class Range
  M127 = 170_141_183_460_469_231_731_687_303_715_884_105_727

  def each_random(seed = 0)
    return to_enum(__method__) { size } unless block_given?
    unless first.kind_of? Integer
      raise TypeError, "can't randomly iterate from #{first.class}"
    end
    sample_size = self.end - first + 1
    sample_size -= 1 if exclude_end?
    j = coprime sample_size
    v = seed % sample_size
    each do
      v = (v + j) % sample_size
      yield first + v
    end
  end

  protected

  def gcd(a, b)
    b == 0 ? a : gcd(b, a % b)
  end

  def coprime(a, z = M127)
    gcd(a, z) == 1 ? z : coprime(a, z + 1)
  end
end
You could then
space.each_random { |i| puts i }
729815750697818944176
459631501395637888351
189447252093456832526
919263002791275776712
649078753489094720887
378894504186913665062
108710254884732609237
838526005582551553423
568341756280370497598
298157506978189441773
27973257676008385948
757789008373827330134
487604759071646274309
217420509769465218484
947236260467284162670
677052011165103106845
406867761862922051020
136683512560740995195
866499263258559939381
596315013956378883556
326130764654197827731
55946515352016771906
785762266049835716092
515578016747654660267
...
With a good amount of randomness, so long as your space is a few orders of magnitude smaller than M127.
Credit to #nick-steele and #bta for the approach.
This isn't really a Ruby-specific answer but I hope it's permitted. Andrew Kensler gives a C++ "permute()" function that does exactly this in his "Correlated Multi-Jittered Sampling" report.
As I understand it, the exact function he provides really only works if your "array" is up to size 2^27, but the general idea could be used for arrays of any size.
I'll do my best to sort of explain it. The first part is you need a hash that is reversible "for any power-of-two sized domain". Consider x = i + 1. No matter what x is, even if your integer overflows, you can determine what i was. More specifically, you can always determine the bottom n bits of i from the bottom n bits of x. Addition is a reversible hash operation, as is multiplication by an odd number, as is doing a bitwise xor by a constant. If you know a specific power-of-two domain, you can scramble bits in that domain. E.g. x ^= (x & 0xFF) >> 5 is valid for the 16-bit domain. You can specify that domain with a mask, e.g. mask = 0xFF, and your hash function becomes x = hash(i, mask). Of course you can add a "seed" value into that hash function to get different randomizations. Kensler lays out more valid operations in the paper.
So you have a reversible function x = hash(i, mask, seed). The problem is that if you hash your index, you might end up with a value that is larger than your array size, i.e. your "domain". You can't just modulo this or you'll get collisions.
The reversible hash is the key to using a technique called "cycle walking", introduced in "Ciphers with Arbitrary Finite Domains". Because the hash is reversible (i.e. 1-to-1), you can just repeatedly apply the same hash until your hashed value is smaller than your array! Because you're applying the same hash, and the mapping is one-to-one, whatever value you end up on will map back to exactly one index, so you don't have collisions. So your function could look something like this for 32-bit integers (pseudocode):
fun permute(i, length, seed) {
    i = hash(i, 0xFFFF, seed)
    while(i >= length): i = hash(i, 0xFFFF, seed)
    return i
}
It could take a lot of hashes to get to your domain, so Kensler does a simple trick: he keeps the hash within the domain of the next power of two, which makes it require very few iterations (~2 on average), by masking out the unnecessary bits. The final algorithm looks like this:
fun next_pow_2(length) {
    # This implementation is for clarity.
    # See Kensler's paper for one way to do it fast.
    p = 1
    while (p < length): p *= 2
    return p
}

fun permute(i, length, seed) {
    mask = next_pow_2(length) - 1
    i = hash(i, mask, seed) & mask
    while(i >= length): i = hash(i, mask, seed) & mask
    return i
}
And that's it! Obviously the important thing here is choosing a good hash function, which Kensler provides in the paper but I wanted to break down the explanation. If you want to have different random permutations each time, you can add a "seed" value to the permute function which then gets passed to the hash function.
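For concreteness, a Python sketch of the cycle-walking loop with a toy reversible hash (the mixing steps below are placeholders built from the operations named above, not Kensler's tested constants; see the paper for those):

def reversible_hash(x, mask, seed):
    # each step is a bijection on [0, mask] (mask = 2^k - 1): xor by a
    # constant, multiply by an odd number mod 2^k, xor with a shift of itself
    x ^= seed & mask
    x = (x * 0x9E3779B1) & mask  # odd multiplier => invertible mod 2^k
    x ^= x >> 3
    return x

def permute(i, length, seed=0):
    mask = (1 << (length - 1).bit_length()) - 1  # next power of two, minus one
    i = reversible_hash(i, mask, seed)
    while i >= length:                           # cycle-walk back into the domain
        i = reversible_hash(i, mask, seed)
    return i

# distinct inputs map to distinct outputs, so this prints a permutation of 0..9
print(sorted(permute(i, 10, seed=42) for i in range(10)))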
