Select an item from a stream at random with uniform probability, using constant space
The stream provides the following operations:
class Stream:
    def __init__(self, data):
        self.data = list(data)
    def read(self):
        if not self.data:
            return None
        head, *self.data = self.data
        return head
    def peek(self):
        return self.data[0] if self.data else None
The elements in the stream (i.e. the elements of data) are of constant size and none of them is None, so None signals the end of the stream. The length of the stream can only be learned by consuming the entire stream. Note that counting the number of elements consumes O(log n) space.
I believe there is no way to uniformly choose an item from the stream at random using O(1) space.
Can anyone (dis)prove this?
Generate a random number for each element, and remember the element with the smallest number.
That's the answer I like best, but the answer you're probably looking for is:
If the stream is N items long, then the probability of returning the Nth item is 1/N. Since this probability is different for every N, any machine that can accomplish this task must enter different states after reading streams of different lengths. Since the number of possible lengths is unbounded, the required number of possible states is unbounded, and the machine will require an unbounded amount of memory to distinguish between them.
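The first suggestion above (give every element a random number and remember the element with the smallest one) can be sketched as follows. This is a minimal illustration, not a way around the counting argument: it assumes the random keys have unbounded precision, and with finite-precision floats ties become possible, so the result is only approximately uniform. The name pick_min_key and the rng parameter are assumptions made for this example.

```python
import random

def pick_min_key(stream, rng=random):
    """Return one item of `stream`, remembering only the best (key, item) pair."""
    best_key, best_item = None, None
    for item in stream:
        key = rng.random()  # assumed to behave like a real number in [0, 1)
        if best_key is None or key < best_key:
            best_key, best_item = key, item
    return best_item  # None for an empty stream

if __name__ == "__main__":
    print(pick_min_key(iter("abcdef")))
```

Only one key and one element are kept at any time, but the key itself needs more and more bits as the stream grows if the choice is to stay exactly uniform.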
In constant space? Sure, Reservoir Sampling, constant space, linear time
Some lightly tested code
import numpy as np
def stream(size):
    for k in range(size):
        yield k
def resSample(ni, s):
    ret = np.empty(ni, dtype=np.int64)
    k = 0
    for k, v in enumerate(s):
        if k < ni:
            ret[k] = v
        else:
            idx = np.random.randint(0, k+1)
            if (idx < ni):
                ret[idx] = v
    return ret
SIZE = 12
s = stream(SIZE)
q = resSample(1, s)
print(q)
I see there is a question about the RNG. Suppose I have a true RNG, a hardware device which returns a single bit at a time. We use it only in the code where we get the index.
    if (idx < ni):
When a single element is selected (ni=1), the only way this condition can be triggered is for idx to be zero.
Thus np.random.randint(0, k+1) with such an implementation would be something like
def trng(k):
    for _ in range(k+1):
        if next_true_bit():
            return 1 # as soon as it is not 0, we don't care
    return 0 # all bits are zero, index is zero, proceed with exchange
QED, such realization is possible and therefore this sampling method shall work
UPDATE
@kyrill is probably right: I have to keep a count going (log2(k) storage), and so far I see no way to avoid it. Even with the RNG trick, I have to sample 0 with probability 1/k, and this k grows with the size of the stream.
I was wondering if there are any hashing functions to distribute input over n values. The distribution should of course be fairly uniform. But there is a twist: with small changes of n, few elements should get a new hash. Optimally it should split all k elements uniformly over n values, and if n increases to n+1, only k/n-k/(n+1) values per bucket would have to move to distribute uniformly in the new hash. Obviously a hash which simply creates uniform values and then takes them mod n would work, but that would move a lot of hashes to fill the new node. The goal here is that as few values as possible fall into a new bucket.
Suppose 2^{n-1} < N <= 2^n. Then there is a standard trick for turning a hash function H that produces (at least) n bits into one that produces a number from 0 to N-1.
1. Compute H(v).
2. Keep just the first n bits.
3. If that's smaller than N, stop and output it. Otherwise, start from the top with H(v) instead of v.
Some properties of this technique:
You might worry that you have to repeat the loop many times in some cases. But actually the expected number of loops is at most 2.
If you bump up N and n doesn't have to change, very few things get a new hash: only those ones that had exactly N somewhere in their chain of hashes. (Of course, identifying which elements have this property is kind of hard -- in general it may require rehashing every element!)
If you bump up N and n does have to change, about half of the elements have to be rebucketed. But this happens more and more rarely the bigger N is -- it is an amortized O(1) cost on each bump.
Edit to add an additional comment about the "have to rehash everything" requirement: One might consider modifying step 3 above to "start from the top with the first n bits of H(v)" instead. This reduces the problem with identifying which elements need to be rehashed -- since they'll be in the bucket for the hash of N -- though I'm not confident the resulting hash will have quite as good collision avoidance properties. It certainly makes the process a bit more fragile -- one would want to prove something special about the choice of H (that the bottom few bits aren't "critical" to its collision avoidance properties somehow).
Here is a simple example implementation in Python, together with a short main that shows that most strings do not move when bumping normally, and about half of the strings get moved when bumping across a 2^n boundary. Forgive me for any idiosyncrasies of my code -- Python is a foreign language.
import math
def ilog2(m): return int(math.ceil(math.log(m,2)))
def hash_into(obj, N):
    cur_hash = hash(obj)
    mask = pow(2, ilog2(N)) - 1
    while (cur_hash & mask) >= N:
        # seems Python uses the identity for its hash on integers, which
        # doesn't iterate well; let's use literally any other hash at all
        cur_hash = hash(str(cur_hash))
    return cur_hash & mask
def same_hash(obj, N, N2):
    return hash_into(obj, N) == hash_into(obj, N2)
def bump_stat(objs, N):
    return len([obj for obj in objs if same_hash(obj, N, N+1)])
alphabet = [chr(x) for x in range(ord('a'),ord('z')+1)]
ascending = alphabet + [c1 + c2 for c1 in alphabet for c2 in alphabet]
def main():
    print(len(ascending))
    print(bump_stat(ascending, 10))
    print(float(bump_stat(ascending, 16))/len(ascending))
# prints:
# 702
# 639
# 0.555555555556
Well, when you add a node, you will want it to fill up, so you will actually want k/(n+1) elements to move from their old nodes to the new one.
That is easily accomplished:
Just generate a hash value for each key as you normally would. Then, to assign key k to a node in [0,N):
Let H(k) be the hash of k.
int hash = H(k);
int n;  // declared outside the loop so it is still visible afterwards
for (n=N-1;n>0;--n) {
    if ((mix(hash,n) % (n+1))==0) {
        break;
    }
}
//put it in node n
So, when you add node 1, it steals half the items from node 0.
When you add node 2, it steals 1/3 of the items from the previous 2 nodes.
And so on...
EDIT: added the mix() function, to mix up the hash differently for every n -- otherwise you get non-uniformities when n is not prime.
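As a rough illustration of the scheme above, here is a small Python sketch. The answer does not define mix() or H, so both are stand-ins here: they are built from SHA-256 purely for the example, and the names node_for and _h64 are hypothetical.

```python
import hashlib

def _h64(text):
    """Stand-in hash: first 64 bits of SHA-256 (an assumption, not the answer's H)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

def mix(h, n):
    """Stand-in mixer: re-hash (h, n) so every n sees a different-looking value."""
    return _h64(f"{h}:{n}")

def node_for(key, num_nodes):
    """Assign `key` to a node in [0, num_nodes), testing the newest node first."""
    h = _h64(str(key))
    for n in range(num_nodes - 1, 0, -1):
        if mix(h, n) % (n + 1) == 0:  # node n claims about 1/(n+1) of what reaches it
            return n
    return 0

if __name__ == "__main__":
    keys = range(2000)
    moved = [k for k in keys if node_for(k, 5) != node_for(k, 6)]
    # Every key that moves goes to the new node 5; roughly 1/6 of the keys move.
    print(len(moved), all(node_for(k, 6) == 5 for k in moved))
```

Because the loop for N+1 nodes only adds one extra test in front of the loop for N nodes, a key either lands on the new node or stays exactly where it was.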
Consider a list [1,1,1,...,1,0,0,...,0] (an arbitrary list of zeros and ones). We want all possible permutations of this array; there will be binomial(l,k) of them (l stands for the length of the list and k for the number of ones in the list).
Right now, I have tested three different algorithms to generate all possible permutations: one that uses a recursive function, one that calculates the permutations by walking the number interval from [1,...,1,0,0,...,0] to [0,0,...0,1,1,...,1] (since this can be seen as a binary number interval), and one that calculates the permutations using lexicographic order.
So far, the first two approaches fail in performance when the permutations are approx. 32. The lexicographic technique still works pretty nicely (only a few milliseconds to finish).
My question is, specifically for Julia, which is the best way to calculate permutations as I described earlier? I don't know too much about combinatorics, but I think a decent benchmark would be to generate all permutations from the total binomial(l,l/2).
As you have mentioned yourself in the comments, the case where l >> k is definitely desired. When this is the case, we can substantially improve performance by not handling vectors of length l until we really need them, and instead handle a list of indexes of the ones.
In the RAM-model, the following algorithm will let you iterate over all the combinations in space O(k^2), and time O(k^2 * binom(l,k))
Note however, that every time you generate a bit-vector from an index combination, you incur an overhead of O(l), in which you will also have the lower-bound (for all combinations) of Omega(l*binom(l,k)), and the memory usage grows to Omega(l+k^2).
The algorithm
"""
Produces all `k`-combinations of integers in `1:l` with prefix `current`, in a
lexicographical order.
# Arguments
- `current`: The current combination
- `l`: The parent set size
- `k`: The target combination size
"""
function combination_producer(l, k, current)
    if k == length(current)
        produce(current)
    else
        j = (length(current) > 0) ? (last(current)+1) : 1
        for i=j:l
            # [current; i] concatenates, i.e. appends i to the current prefix
            combination_producer(l, k, [current; i])
        end
    end
end
"""
Produces all combinations of size `k` from `1:l` in a lexicographical order
"""
function combination_producer(l,k)
    combination_producer(l,k, [])
end
Example
You can then iterate over all the combinations as follows:
for c in @task(combination_producer(l, k))
    # do something with c
end
Notice how this algorithm is resumable: You can stop the iteration whenever you want, and continue again:
iter = @task(combination_producer(5, 3))
for c in iter
    println(c)
    if c[1] == 2
        break
    end
end
println("took a short break")
for c in iter
    println(c)
end
This produces the following output:
[1,2,3]
[1,2,4]
[1,2,5]
[1,3,4]
[1,3,5]
[1,4,5]
[2,3,4]
took a short break
[2,3,5]
[2,4,5]
[3,4,5]
If you want to get a bit-vector out of c then you can do e.g.
function combination_to_bitvector(l, c)
    result = zeros(l)
    result[c] = 1
    result
end
where l is the desired length of the bit-vector.
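For comparison, the same idea (iterate over index combinations, and only build a length-l vector when one is needed) can be sketched in Python with the standard library. This is an illustration rather than a translation of the Julia code: indices are 0-based here, and the function name bitvectors is made up for the example.

```python
from itertools import combinations

def bitvectors(l, k):
    """Yield every length-l 0/1 list with exactly k ones, in lexicographic index order."""
    for ones in combinations(range(l), k):  # each combination holds only k indices
        vec = [0] * l                       # the O(l) cost is paid only here
        for i in ones:
            vec[i] = 1
        yield vec

if __name__ == "__main__":
    for v in bitvectors(4, 2):
        print(v)
```

Being a generator, it can be paused and resumed in the same way as the task-based iteration above.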
One way to get that is for the natural numbers (1,..,n) we factorise each and see if they have any repeated prime factors, but that would take a lot of time for large n. So is there any better way to get the square-free numbers from 1,..,n ?
You could use Eratosthenes Sieve's modified version:
Take a bool array 1..n;
Precalc all squares that are less than n; that's O(sqrt(N));
For each square and its multiples make the bool array entry false...
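The three steps above can be sketched in Python as follows. It is a minimal version that crosses out multiples of every square d*d (not only prime squares, which would be enough); the function name squarefree_upto is an assumption made for this example.

```python
def squarefree_upto(n):
    """Return the square-free numbers in 1..n using a sieve over squares."""
    is_squarefree = [True] * (n + 1)
    d = 2
    while d * d <= n:
        sq = d * d
        for m in range(sq, n + 1, sq):  # every multiple of d*d has a repeated factor
            is_squarefree[m] = False
        d += 1
    return [i for i in range(1, n + 1) if is_squarefree[i]]

if __name__ == "__main__":
    print(squarefree_upto(20))
```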
From http://mathworld.wolfram.com/Squarefree.html
There is no known polynomial time algorithm for recognizing squarefree integers or for computing the squarefree part of an integer. In fact, this problem may be no easier than the general problem of integer factorization (obviously, if an integer can be factored completely, it is squarefree iff it contains no duplicated factors). This problem is an important unsolved problem in number theory because computing the ring of integers of an algebraic number field is reducible to computing the squarefree part of an integer (Lenstra 1992, Pohst and Zassenhaus 1997).
The most direct thing that comes to mind is to list the primes up to n and select at most one of each. That's not easy for large n (e.g. here's one algorithm), but I'm not sure this problem is either.
One way to do it is to use a sieve, similar to Eratosthenes'.
@Will_Ness wrote a "quick" prime sieve as follows in Python.
from itertools import count
                                         # ideone.com/
def postponed_sieve():                   # postponed sieve, by Will Ness
    yield 2; yield 3; yield 5; yield 7;  # original code David Eppstein,
    sieve = {}                           # Alex Martelli, ActiveState Recipe 2002
    ps = postponed_sieve()               # a separate base Primes Supply:
    p = next(ps) and next(ps)            # (3) a Prime to add to dict
    q = p*p                              # (9) its sQuare
    for c in count(9,2):                 # the Candidate
        if c in sieve:                   # c's a multiple of some base prime
            s = sieve.pop(c)             # i.e. a composite ; or
        elif c < q:
            yield c                      # a prime
            continue
        else:   # (c==q):                # or the next base prime's square:
            s=count(q+2*p,2*p)           # (9+6, by 6 : 15,21,27,33,...)
            p=next(ps)                   # (5)
            q=p*p                        # (25)
        for m in s:                      # the next multiple
            if m not in sieve:           # no duplicates
                break
        sieve[m] = s                     # original test entry: ideone.com/WFv4f
With a little effort, this can be used to pop out square-free integers, using the postponed_sieve() to serve as a basis for sieving by as few squares as possible:
def squarefree():                        # modified sieve of Will Ness
    yield 1; yield 2; yield 3;           # original code David Eppstein,
    sieve = {}                           # Alex Martelli, ActiveState Recipe 2002
    ps = postponed_sieve()               # a base Primes Supply:
    p = next(ps)                         # (2)
    q = p*p                              # (4)
    for c in count(4):                   # the Candidate
        if c in sieve:                   # c's a multiple of some base square
            s = sieve.pop(c)             # i.e. not square-free ; or
        elif c < q:
            yield c                      # square-free
            continue
        else:   # (c==q):                # or the next base prime's square:
            s=count(2*q,q)               # (4+4, by 4 : 8,12,16,20...)
            p=next(ps)                   # (3)
            q=p*p                        # (9)
        for m in s:                      # the next multiple
            if m not in sieve:           # no duplicates
                break
        sieve[m] = s
It's pretty quick, kicking out the first million in about .8s on my laptop.
Unsurprisingly, this shows that this is effectively the same problem as sieving primes, but with much greater density.
You should probably look into the sieve of Atkin. Of course this eliminates all non-primes (not just perfect squares) so it might be more work than you need.
Googling a little bit, I found this page where a J program is explained. Apart from the complex syntax, the algorithm lets you check whether a number is square-free or not:
generate a list of perfect squares PS,
take your number N and divide it by the numbers in the list PS,
if there is only 1 whole number in the resulting list, then N is square-free.
You could implement the algorithm in your preferred language and iterate it on any number from 1 to n.
http://www.marmet.org/louis/sqfgap/
Check out the section "Basic algorithm: the sieve of Eratosthenes", which is what Armen suggested. The next section is "Improvements of the algorithm".
Also, FWIW, the Moebius function and square-free numbers are related.
I have found a better algorithm to count how many square-free numbers there are in an interval such as [n,m]. Take the primes less than sqrt(m). Subtract the multiples of each prime's square, then add back the multiples of the squared product of every two primes, then subtract those for every three primes, then add those for every four, and so on (inclusion-exclusion). At the end we get the answer. It runs in O(sqrt(m)).
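One way to write this inclusion-exclusion down is through the Moebius function mu: the number of square-free integers up to n equals the sum over d <= sqrt(n) of mu(d) * floor(n / d^2). The sketch below follows that formula; the function names are assumptions for this example, and it is meant as an illustration of the counting idea, not as the poster's exact code.

```python
from math import isqrt

def mobius_upto(limit):
    """Sieve the Moebius function mu(0..limit); mu[0] is unused."""
    mu = [1] * (limit + 1)
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if is_prime[p]:
            for m in range(2 * p, limit + 1, p):
                is_prime[m] = False
            for m in range(p, limit + 1, p):
                mu[m] = -mu[m]              # one more distinct prime factor
            for m in range(p * p, limit + 1, p * p):
                mu[m] = 0                   # repeated prime factor
    return mu

def count_squarefree(n):
    """Number of square-free integers in 1..n."""
    if n < 1:
        return 0
    r = isqrt(n)
    mu = mobius_upto(r)
    return sum(mu[d] * (n // (d * d)) for d in range(1, r + 1))

def count_squarefree_between(lo, hi):
    """Number of square-free integers in the closed interval [lo, hi]."""
    return count_squarefree(hi) - count_squarefree(lo - 1)

if __name__ == "__main__":
    print(count_squarefree(100), count_squarefree_between(10, 20))
```

Only values of d up to sqrt(n) are touched, which is where the square-root behaviour of the count comes from.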
import math
def squarefree(n):
    t=round(math.sqrt(n))
    if n<2:
        return True
    if t*t==n:
        return False
    if t<=2:
        return True
    for i in range(2,t):
        if n%(i*i)==0:
            return False
    else:
        # the loop finished without finding a square divisor
        return True
I would like to randomly iterate through a range. Each value will be visited only once and all values will eventually be visited. For example:
class Array
  def shuffle
    ret = dup
    j = length
    i = 0
    while j > 1
      r = i + rand(j)
      ret[i], ret[r] = ret[r], ret[i]
      i += 1
      j -= 1
    end
    ret
  end
end
(0..9).to_a.shuffle.each{|x| f(x)}
where f(x) is some function that operates on each value. A Fisher-Yates shuffle is used to efficiently provide random ordering.
My problem is that shuffle needs to operate on an array, which is not cool because I am working with astronomically large numbers. Ruby will quickly consume a large amount of RAM trying to create a monstrous array. Imagine replacing (0..9) with (0..99**99). This is also why the following code will not work:
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
  x = rand(bigint)
  redo if tried[x]
  tried[x] = true
  f(x) # some function
}
This code is very naive and quickly runs out of memory as tried obtains more entries.
What sort of algorithm can accomplish what I am trying to do?
[Edit1]: Why do I want to do this? I'm trying to exhaust the search space of a hash algorithm for a N-length input string looking for partial collisions. Each number I generate is equivalent to a unique input string, entropy and all. Basically, I'm "counting" using a custom alphabet.
[Edit2]: This means that f(x) in the above examples is a method that generates a hash and compares it to a constant, target hash for partial collisions. I do not need to store the value of x after I call f(x) so memory should remain constant over time.
[Edit3/4/5/6]: Further clarification/fixes.
[Solution]: The following code is based on @bta's solution. For the sake of conciseness, next_prime is not shown. It produces acceptable randomness and only visits each number once. See the actual post for more details.
N = size_of_range
Q = ( 2 * N / (1 + Math.sqrt(5)) ).to_i.next_prime
START = rand(N)
x = START
nil until f( x = (x + Q) % N ) == START # assuming f(x) returns x
I just remembered a similar problem from a class I took years ago; that is, iterating (relatively) randomly through a set (completely exhausting it) given extremely tight memory constraints. If I'm remembering this correctly, our solution algorithm was something like this:
1. Define the range to be from 0 to some number N
2. Generate a random starting point x[0] inside N
3. Generate an iterator Q less than N
4. Generate successive points x[n] by adding Q to the previous point and wrapping around if needed. That is, x[n+1] = (x[n] + Q) % N
5. Repeat until you generate a new point equal to the starting point.
The trick is to find an iterator that will let you traverse the entire range without generating the same value twice. If I'm remembering correctly, any relatively prime N and Q will work (the closer the number to the bounds of the range the less 'random' the input). In that case, a prime number that is not a factor of N should work. You can also swap bytes/nibbles in the resulting number to change the pattern with which the generated points "jump around" in N.
This algorithm only requires the starting point (x[0]), the current point (x[n]), the iterator value (Q), and the range limit (N) to be stored.
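A minimal Python sketch of this stepping scheme is shown below. It assumes the step Q is coprime to N and checks that with gcd; the function name full_cycle and its parameter names are invented for the example.

```python
from math import gcd

def full_cycle(n, start, step):
    """Visit every value in 0..n-1 exactly once, stepping by `step` modulo n."""
    if gcd(n, step) != 1:
        raise ValueError("step must be coprime to n, otherwise values are skipped")
    x = start % n
    for _ in range(n):       # state kept: x, step, n and a loop counter
        yield x
        x = (x + step) % n

if __name__ == "__main__":
    print(list(full_cycle(10, 3, 7)))  # [3, 0, 7, 4, 1, 8, 5, 2, 9, 6]
```

The order is fully determined by the start and the step, so it looks scattered but is not statistically random.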
Perhaps someone else remembers this algorithm and can verify if I'm remembering it correctly?
As @Turtle answered, your problem doesn't have a solution. @KandadaBoggu's and @bta's solutions give you numbers in some ranges which may or may not be random. You get clusters of numbers.
But I don't know why you care about a double occurrence of the same number. If (0..99**99) is your range, then even if you could generate 10^10 random numbers per second (say you have a 3 GHz processor with about 4 cores, each generating one random number per CPU cycle, which is impossible, and Ruby will slow it down a lot), it would take about 10^180 years to exhaust all the numbers. The probability that two identical numbers are generated during a whole year is about 10^-163 (birthday bound n^2/(2N) with n = about 3*10^17 draws and N = about 4*10^197 values). Our universe is probably about 10^10 years old, so if your computer had started calculating when time began, the probability that two identical numbers were generated would be about 10^-143. In other words, it is practically impossible and you don't have to care about it.
Even if you would use Jaguar (top 1 from www.top500.org supercomputers) with only this one task, you still need 10^174 years to get all numbers.
If you don't believe me, try
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
  x = rand(bigint)
  puts "Oh, no!" if tried[x]
  tried[x] = true
}
I'll buy you a beer if you will even once see "Oh, no!" on your screen during your life time :)
I could be wrong, but I don't think this is doable without storing some state. At the very least, you're going to need some state.
Even if you only use one bit per value (has this value been tried, yes or no), you will need X/8 bytes of memory to store the result (where X is the largest number). Assuming that you have 2GB of free memory, this would leave you with room for only about 17 billion numbers.
Break the range in to manageable batches as shown below:
def range_walker range, batch_size = 100
  size = (range.end - range.begin) + 1
  n = size/batch_size
  n.times do |i|
    x = i * batch_size + range.begin
    y = x + batch_size
    (x...y).sort_by{rand}.each{|z| p z}
  end
  d = (range.end - size%batch_size + 1)
  (d..range.end).sort_by{rand}.each{|z| p z }
end
You can further randomize solution by randomly choosing the batch for processing.
PS: This is a good problem for map-reduce. Each batch can be worked by independent nodes.
Reference:
Map-reduce in Ruby
you can randomly iterate an array with shuffle method
a = [1,2,3,4,5,6,7,8,9]
a.shuffle!
=> [5, 2, 8, 7, 3, 1, 6, 4, 9]
You want what's called a "full cycle iterator"...
Here is pseudocode for the simplest version, which is perfect for most uses...
function fullCycleStep(sample_size, last_value, random_seed = 31337, prime_number = 32452843) {
    if last_value = null then last_value = random_seed % sample_size
    return (last_value + prime_number) % sample_size
}
If you call this like so:
sample = 10
For i = 1 to sample
    last_value = fullCycleStep(sample, last_value)
    print last_value
next
It would generate pseudo-random numbers, looping through all 10, never repeating. If you change random_seed, which can be anything, or prime_number, which must be a prime greater than sample_size (so that the two share no common factor), you will get a new order, but you will still never get a duplicate.
Database systems and other large-scale systems do this by writing the intermediate results of recursive sorts to a temp database file. That way, they can sort massive numbers of records while only keeping limited numbers of records in memory at any one time. This tends to be complicated in practice.
How "random" does your order have to be? If you don't need a specific input distribution, you could try a recursive scheme like this to minimize memory usage:
def gen_random_indices
  # Assume your input range is (0...(10**3)), i.e. three decimal digits
  (0..9).sort_by{rand}.each do |a|
    (0..9).sort_by{rand}.each do |b|
      (0..9).sort_by{rand}.each do |c|
        yield "#{a}#{b}#{c}".to_i
      end
    end
  end
end
gen_random_indices do |idx|
  run_test_with_index(idx)
end
Essentially, you are constructing the index by randomly generating one digit at a time. In the worst-case scenario, this will require enough memory to store 10 * (number of digits). You will encounter every number in the range (0...(10**3)) exactly once, but the order is only pseudo-random. That is, if the first loop sets a=1, then you will encounter all three-digit numbers of the form 1xx before you see the hundreds digit change.
The other downside is the need to manually construct the function to a specified depth. In your (0..(99**99)) case, this would likely be a problem (although I suppose you could write a script to generate the code for you). I'm sure there's probably a way to re-write this in a state-ful, recursive manner, but I can't think of it off the top of my head (ideas, anyone?).
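One possible way to avoid hand-writing the nested loops is a recursive generator that takes the number of digits as a parameter. The Python sketch below is only an illustration of that idea; the function name random_digit_order and its parameters are assumptions, and the ordering has the same limitation described above (all numbers sharing a leading digit come out together).

```python
import random

def random_digit_order(num_digits, base=10, rng=random, prefix=0):
    """Yield every number with `num_digits` digits in `base` exactly once,
    choosing a fresh random digit order at each level of the recursion."""
    if num_digits == 0:
        yield prefix
        return
    digits = list(range(base))
    rng.shuffle(digits)                  # one small shuffled list per level
    for d in digits:
        yield from random_digit_order(num_digits - 1, base, rng, prefix * base + d)

if __name__ == "__main__":
    for idx in random_digit_order(3):
        pass  # call the per-index work here
```

Memory stays proportional to base * num_digits, since only one shuffled digit list is alive per recursion level.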
[Edit]: Taking into account @klew and @Turtle's answers, the best I can hope for is batches of random (or close to random) numbers.
This is a recursive implementation of something similar to KandadaBoggu's solution. Basically, the search space (as a range) is partitioned into an array containing N equal-sized ranges. Each range is fed back in a random order as a new search space. This continues until the size of the range hits a lower bound. At this point the range is small enough to be converted into an array, shuffled, and checked.
Even though it is recursive, I haven't blown the stack yet. Instead, it errors out when attempting to partition a search space larger than about 10^19 keys. It has to do with the numbers being too large to convert to a long. It can probably be fixed:
# partition a range into an array of N equal-sized ranges
def partition(range, n)
  ranges = []
  first = range.first
  last = range.last
  length = last - first + 1
  step = length / n # integer division
  ((first + step - 1)..last).step(step) { |i|
    ranges << (first..i)
    first = i + 1
  }
  # append any extra onto the last element
  ranges[-1] = (ranges[-1].first)..last if last > step * ranges.length
  ranges
end
I hope the code comments help shed some light on my original question.
pastebin: full source
Note: PW_LEN under # options can be changed to a lower number in order to get quicker results.
For a prohibitively large space, like
space = -10..1000000000000000000000
You can add this method to Range.
class Range
  M127 = 170_141_183_460_469_231_731_687_303_715_884_105_727
  def each_random(seed = 0)
    return to_enum(__method__) { size } unless block_given?
    unless first.kind_of? Integer
      raise TypeError, "can't randomly iterate from #{first.class}"
    end
    sample_size = self.end - first + 1
    sample_size -= 1 if exclude_end?
    j = coprime sample_size
    v = seed % sample_size
    each do
      v = (v + j) % sample_size
      yield first + v
    end
  end
  protected
  def gcd(a,b)
    b == 0 ? a : gcd(b, a % b)
  end
  def coprime(a, z = M127)
    gcd(a, z) == 1 ? z : coprime(a, z + 1)
  end
end
You could then
space.each_random { |i| puts i }
729815750697818944176
459631501395637888351
189447252093456832526
919263002791275776712
649078753489094720887
378894504186913665062
108710254884732609237
838526005582551553423
568341756280370497598
298157506978189441773
27973257676008385948
757789008373827330134
487604759071646274309
217420509769465218484
947236260467284162670
677052011165103106845
406867761862922051020
136683512560740995195
866499263258559939381
596315013956378883556
326130764654197827731
55946515352016771906
785762266049835716092
515578016747654660267
...
With a good amount of randomness so long as your space is a few orders smaller than M127.
Credit to @nick-steele and @bta for the approach.
This isn't really a Ruby-specific answer but I hope it's permitted. Andrew Kensler gives a C++ "permute()" function that does exactly this in his "Correlated Multi-Jittered Sampling" report.
As I understand it, the exact function he provides really only works if your "array" is up to size 2^27, but the general idea could be used for arrays of any size.
I'll do my best to explain it. The first part is that you need a hash that is reversible "for any power-of-two sized domain". Consider x = i + 1. No matter what x is, even if your integer overflows, you can determine what i was. More specifically, you can always determine the bottom n bits of i from the bottom n bits of x. Addition is a reversible hash operation, as is multiplication by an odd number, as is doing a bitwise xor by a constant. If you know a specific power-of-two domain, you can scramble bits in that domain. E.g. x ^= (x & 0xFFFF) >> 5 is valid for the 16-bit domain. You can specify that domain with a mask, e.g. mask = 0xFFFF, and your hash function becomes x = hash(i, mask). Of course you can add a "seed" value into that hash function to get different randomizations. Kensler lays out more valid operations in the paper.
So you have a reversible function x = hash(i, mask, seed). The problem is that if you hash your index, you might end up with a value that is larger than your array size, i.e. your "domain". You can't just modulo this or you'll get collisions.
The reversible hash is the key to using a technique called "cycle walking", introduced in "Ciphers with Arbitrary Finite Domains". Because the hash is reversible (i.e. 1-to-1), you can just repeatedly apply the same hash until your hashed value is smaller than your array! Because you're applying the same hash, and the mapping is one-to-one, whatever value you end up on will map back to exactly one index, so you don't have collisions. So your function could look something like this for 32-bit integers (pseudocode):
fun permute(i, length, seed) {
    i = hash(i, 0xFFFFFFFF, seed)
    while(i >= length): i = hash(i, 0xFFFFFFFF, seed)
    return i
}
It could take a lot of hashes to get to your domain, so Kensler does a simple trick: he keeps the hash within the domain of the next power of two, which makes it require very few iterations (~2 on average), by masking out the unnecessary bits. The final algorithm looks like this:
fun next_pow_2(length) {
    # This implementation is for clarity.
    # See Kensler's paper for one way to do it fast.
    p = 1
    while (p < length): p *= 2
    return p
}
fun permute(i, length, seed) {
    mask = next_pow_2(length)-1
    i = hash(i, mask, seed) & mask
    while(i >= length): i = hash(i, mask, seed) & mask
    return i
}
And that's it! Obviously the important thing here is choosing a good hash function, which Kensler provides in the paper but I wanted to break down the explanation. If you want to have different random permutations each time, you can add a "seed" value to the permute function which then gets passed to the hash function.
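To make the pseudocode above concrete, here is a small runnable Python sketch of cycle walking. The hash in it (toy_hash) is NOT Kensler's; it is a stand-in assembled from the reversible operations mentioned earlier (addition, multiplication by an odd constant, xor with a right-shifted copy), with arbitrarily chosen constants, and it will not scramble as well as the one in the paper.

```python
def next_pow_2(length):
    """Smallest power of two that is >= length (simple loop, for clarity)."""
    p = 1
    while p < length:
        p *= 2
    return p

def toy_hash(i, mask, seed):
    """Illustrative bijection on [0, mask]; every step is reversible mod 2^n."""
    i = (i + seed) & mask         # addition modulo a power of two
    i = (i * 0x2545F491) & mask   # odd multiplier, so invertible modulo 2^n
    i ^= i >> 3                   # xor with a right-shifted copy of itself
    i = (i * 0x9E3779B1) & mask   # another odd multiplier
    i ^= i >> 5
    return i

def permute(i, length, seed=0):
    """Map index i in [0, length) to a unique position in [0, length)."""
    mask = next_pow_2(length) - 1
    i = toy_hash(i, mask, seed)
    while i >= length:            # cycle walking: re-hash until back in range
        i = toy_hash(i, mask, seed)
    return i

if __name__ == "__main__":
    print([permute(i, 10, seed=42) for i in range(10)])
```

Because toy_hash is one-to-one on the power-of-two domain and i starts inside [0, length), the walk always returns to that range, and different seeds give different orderings.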