How to generate a random number from a specified discrete distribution?

Let's say we have some discrete distribution with a finite number of possible results. Is it possible to generate a random number from this distribution faster than O(log n), where n is the number of possible results?
How to do it in O(log n) (a short sketch follows the list):
- Build an array of cumulative probabilities (Array[i] = probability that the random number is less than or equal to i).
- Generate a random number from the uniform distribution (let's denote it by k).
- Find the smallest i such that k < Array[i]; this can be done with binary search.
- i is our random number.
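A minimal sketch of this approach in Python (the input p is assumed to be the array of per-result probabilities; bisect performs the binary search):

import bisect
import itertools
import random

def make_sampler(p):
    # One-time O(n) setup: cdf[i] = probability that the result is <= i.
    cdf = list(itertools.accumulate(p))
    def sample():
        k = random.random() * cdf[-1]          # uniform in [0, total)
        return bisect.bisect_right(cdf, k)     # smallest i with k < cdf[i], O(log n)
    return sample

sample = make_sampler([0.1, 0.2, 0.3, 0.4])
print([sample() for _ in range(10)])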

Walker's alias method can draw a sample in constant worst-case time, using some auxiliary arrays of size n which need to be precomputed. This method is described in Chapter 3 of Devroye's book on sampling and is implemented in the R sample() function. You can get code from R's source code or this thread. A 1991 paper by Vose claims to reduce the initialization cost.
Note that your question isn't well-defined unless you specify the exact form of the input and how many random numbers you want to draw. For example, if the input is an array giving the probability of each result, then your algorithm is not O(log n), because it first requires computing the cumulative probabilities, which takes O(n) time from the input array.
If you intend to draw many samples, then the cost of generating a single sample is not so important. What matters instead is the total cost to generate m results, and the peak memory required. In this regard, the alias method is very good. If you want to generate the samples all at once, use the O(n+m) algorithm posted here and then shuffle the results.
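For reference, here is a rough Python sketch of the alias method with Vose's O(n) initialization (my own illustration, not the R implementation; p is assumed to sum to 1):

import random

def build_alias(p):
    # O(n) setup for Walker/Vose alias sampling; p must sum to 1.
    n = len(p)
    prob, alias = [0.0] * n, [0] * n
    scaled = [x * n for x in p]
    small = [i for i, x in enumerate(scaled) if x < 1.0]
    large = [i for i, x in enumerate(scaled) if x >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] = (scaled[l] + scaled[s]) - 1.0     # hand the leftover mass back to l
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                            # numerical leftovers are effectively 1
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    # O(1) worst-case draw: pick a column, then either keep it or take its alias.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias([0.1, 0.2, 0.3, 0.4])
print([alias_draw(prob, alias) for _ in range(10)])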

Related

speed-up computation of sum over several subsets

Let's say I have a huge array of doubles w[] indexed from 0 to n-1.
I also have a list of m subsets of [0;n-1]. For each subset S, I am trying to compute the sums of w[i] over S.
Obviously I can compute this separately for each subset, which is going to be in O(m * n).
However is there any faster way to do this? I'm talking from a practical standpoint, as I think you can't have a lower asymptotic bound. Is it possible to pre-process all the subsets and store them in such a way that computing all the sums is faster?
Thanks!
Edit:
To give an order of magnitude, my n would be around 20 million, and m around 200.
For subsets that are dense (or nearly dense) you may be able to speed up the computation by computing a running sum of the elements. That is, create another array in parallel with w, where each element in the parallel array contains the sum of the elements of w up to that point.
To compute the sum for a dense subset, you take the starting and ending positions of the subset in the parallel array and subtract the running sum at the start from the running sum at the end. The difference between the two is (ignoring rounding errors) the sum for that subset.
For a nearly dense subset, you start by doing the same, then subtract off the values of the (relatively few) items in that range that aren't part of the set.
These may not produce exactly the same result as you'd get by naively summing the subset though. If you need better accuracy, you'd probably want to use Kahan summation for your array of running sums, and possibly preserve its error residual at each point, to be taken into account when doing the subtraction.
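A minimal Python sketch of the running-sum idea (the subset is represented here as a contiguous index range plus a short list of excluded indices; names are illustrative):

import itertools

def make_prefix(w):
    # prefix[i] = w[0] + ... + w[i-1], so any range sum is a single subtraction.
    return [0.0] + list(itertools.accumulate(w))

def range_sum(prefix, lo, hi):
    # Sum of w[lo..hi] inclusive: O(1) per query after O(n) preprocessing.
    return prefix[hi + 1] - prefix[lo]

def nearly_dense_sum(prefix, w, lo, hi, excluded):
    # Range sum minus the few items in [lo, hi] that are not in the subset.
    return range_sum(prefix, lo, hi) - sum(w[i] for i in excluded)

w = [1.5, 2.0, 0.5, 3.0, 1.0]
prefix = make_prefix(w)
print(range_sum(prefix, 1, 3))                 # w[1] + w[2] + w[3]
print(nearly_dense_sum(prefix, w, 0, 4, [2]))  # everything except w[2]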

Hamming numbers for O(N) speed and O(1) memory

Disclaimer: there are many questions about this, but I didn't find any with the requirement of constant memory.
Hamming numbers are numbers of the form 2^i * 3^j * 5^k, where i, j, k are natural numbers.
Is it possible to generate the Nth Hamming number in O(N) time and O(1) (constant) memory? By generate I mean exactly a generator, i.e. you can only output results and not read previously generated numbers (otherwise the memory would not be constant), though you may keep a constant number of them.
The best algorithm with constant memory that I can see is no better than O(N log N), for example one based on a priority queue. But is there a mathematical proof that it is impossible to construct an O(N)-time algorithm?
The first thing to consider here is the direct slice enumeration algorithm, which can be seen e.g. in this SO answer: enumerate the triples (k,j,i) in the vicinity of a given logarithm value (base 2) of a sequence member, so that target - delta < k*log2_5 + j*log2_3 + i < target + delta, progressively calculating the cumulative logarithm while picking j and k so that i is directly known.
It is thus an N^(2/3)-time algorithm producing N^(2/3)-wide slices of the sequence at a time (with k*log2_5 + j*log2_3 + i close to the target value, so these triples form the crust of the tetrahedron filled with the Hamming sequence triples [1]), meaning O(1) time per produced number, thus producing N sequence members in O(N) amortized time and O(N^(2/3)) space. That's no improvement over the baseline Dijkstra's algorithm [2] with the same complexities, even non-amortized and with better constant factors.
To make it O(1)-space, the crust width will need to be narrowed as we progress along the sequence. But the narrower the crust, the more misses there will be when enumerating its triples -- and this is pretty much the proof you asked for. A constant slice size means O(N^(2/3)) work per O(1)-size slice, for an overall O(N^(5/3)) amortized-time, O(1)-space algorithm.
These are the two end points of this spectrum: from N^1 time, N^(2/3) space to N^0 space, N^(5/3) time, amortized.
[1] Here's the image from Wikipedia, with logarithmic vertical scale:
This essentially is a tetrahedron of Hamming sequence triples (i,j,k) stretched in space as (i*log2, j*log3, k*log5), seen from the side. The image is a bit askew, if it's to be a true 3D picture.
Edit: [2] It seems I forgot that the slices have to be sorted, as they are produced out of order by the j,k enumerations. This changes the best complexity for producing the sequence's N numbers in order via the slice algorithm to O(N^(2/3) log N) time and O(N^(2/3)) space, and makes Dijkstra's algorithm a winner there. It doesn't change the top bound of O(N^(5/3)) time for the O(1) slices, though.
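To make the slice idea concrete, here is a rough Python sketch that enumerates the triples of one logarithmic band and sorts them by value (target and delta are assumed inputs; this illustrates the band enumeration only, not a complete Nth-Hamming-number generator):

import math

LOG2_3 = math.log2(3)
LOG2_5 = math.log2(5)

def band(target, delta):
    # All triples (i, j, k) with target - delta < i + j*log2(3) + k*log2(5) <= target + delta,
    # sorted by the logarithm of the corresponding Hamming number 2^i * 3^j * 5^k.
    out = []
    k = 0
    while k * LOG2_5 <= target + delta:
        j = 0
        while k * LOG2_5 + j * LOG2_3 <= target + delta:
            rest = k * LOG2_5 + j * LOG2_3
            i_hi = math.floor(target + delta - rest)            # i is known directly from j and k
            i_lo = max(0, math.floor(target - delta - rest) + 1)
            for i in range(i_lo, i_hi + 1):
                out.append((i + rest, (i, j, k)))
            j += 1
        k += 1
    out.sort()      # this sort is what costs the extra log factor mentioned in the edit above
    return [t for _, t in out]

print(band(10.0, 0.5))   # triples whose Hamming numbers lie near 2^10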

Approximate the typical value of a sample

Say I have a sample of N positive real numbers and I want to find a "typical" value for these numbers. Of course "typical" is not very well defined, but one could think of the following more concrete problem:
The numbers are distributed such that (roughly speaking) a fraction (1-epsilon) of them is drawn from a Gaussian with positive mean m > 0 and mean square deviation sigma << m, and a small fraction epsilon of them is drawn from some other distribution, heavy tailed both for large and small numbers. I want to estimate the mean of the Gaussian to within a few standard deviations.
A solution would be to compute the median, but while it is O(N), the constant factors are not so good for moderate N, and moreover it requires quite a bit of coding. I am ready to give up precision on my estimate in exchange for code simplicity and/or small-N performance (say N is 10 or 20 for instance, and I have at most one or two outliers).
Do you have any suggestion ?
(For instance, if my outliers were only coming from large values, I would compute the average of the log of my values and exponentiate it. Under some further assumptions this generally gives me a good estimate, and I can compute it easily and with a sharp O(N).)
You could take the mean of the numbers excluding the min and max. The formula is (sum - min - max) / (N - 2), and the terms in the numerator can be computed simply with one pass (watch out for floating point issues though).
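A one-pass sketch of that estimator in Python (assuming N >= 3):

def trimmed_mean(xs):
    # Mean of the values excluding a single min and a single max, computed in one pass.
    total, lo, hi, n = 0.0, float("inf"), float("-inf"), 0
    for x in xs:
        total += x
        lo = min(lo, x)
        hi = max(hi, x)
        n += 1
    return (total - lo - hi) / (n - 2)

print(trimmed_mean([9.8, 10.1, 10.0, 9.9, 55.0]))   # the outlier 55.0 is dropped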
I think you should reconsider the median, either using quickselect or Blum-Floyd-Pratt-Rivest-Tarjan (as implemented here by Coetzee). It's fast and robust.
If you need better speed you might consider picking a fixed number of random elements and taking their median. This is sublinear (O(1) or O(log n) depending on the model) and works well for large sets.
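A sketch of the sampled-median idea, assuming the input fits in a list and using the standard-library statistics.median on a fixed-size random subset:

import random
import statistics

def sampled_median(xs, k=25):
    # Median of k randomly chosen elements; the cost no longer grows with len(xs).
    return statistics.median(random.sample(xs, min(k, len(xs))))

data = [random.gauss(10.0, 0.5) for _ in range(100_000)] + [1e9, 1e-9]
print(sampled_median(data))   # close to 10 despite the two extreme outliers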

Generate N quasi random numbers in less than O(N)

This was inspired by a question at a job interview: how do you efficiently generate N unique random numbers? Their security and distribution/bias don't matter.
I proposed a naive way of calling rand() N times and eliminating duplicates by trial and error, thus getting an inefficient and flawed solution. Then I read this SO question; those algorithms are great for getting quality unique numbers, and they are O(N).
But I suspect there are ways to get low-quality unique random numbers for dummy tasks in less than O(N) time complexity. Here are some possible ideas:
Store many precomputed lists, each containing N numbers, and retrieve one list at random. Complexity is O(1) for fixed N. Storage space used is O(N*R), where R is the number of lists.
Generate N/2 unique random numbers and then split each of them into two unequal parts (floor/ceil for odd numbers, n+1/n-1 for even). I know this is flawed (duplicates can pop up), and O(N/2) is still O(N). This is more food for thought.
Generate one big random number and then squeeze more variants from it by some fixed manipulations like bitwise operations, factorization, recursion, MapReduce or something else.
Use a quasi-random sequence somehow (not a math guy, just googled this term).
Your ideas?
Presumably this routine has some kind of output (i.e. the results are written to an array of some kind). Populating an array (or some other data-structure) of size N is at least an O(N) operation, so you can't do better than O(N).
You can, however, generate a random number, and if the result set already contains it, add to it the maximum of the already generated numbers (the sum is larger than anything generated so far, so it is new).
Detecting whether a number has already been generated is O(1) using a hash set, so the whole thing is O(N) with only N random() calls.
Of course, this assumes that we do not overflow the upper limit (or that we use a BigInteger).
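A Python sketch of that approach (hash-set membership plus the add-the-maximum trick; it assumes positive integers, and Python's unbounded ints sidestep the overflow caveat):

import random

def unique_randoms(n, upper=2**32):
    # n unique positive integers with exactly n random draws.
    seen, result, current_max = set(), [], 0
    for _ in range(n):
        x = random.randrange(1, upper)
        if x in seen:                 # O(1) membership test
            x += current_max          # larger than everything so far, hence unique
        seen.add(x)
        result.append(x)
        current_max = max(current_max, x)
    return result

print(unique_randoms(10, upper=16))   # collisions are likely with such a small range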

Does Repeating a Biased Random Shuffle Reduce the Bias?

I'd like to produce fast random shuffles repeatedly with minimal bias.
It's known that the Fisher-Yates shuffle is unbiased as long as the underlying random number generator (RNG) is unbiased.
To shuffle an array a of n elements:
for i from n − 1 downto 1 do
j ← random integer with 0 ≤ j ≤ i
exchange a[j] and a[i]
But what if the RNG is biased (but fast)?
Suppose I want to produce many random permutations of an array of 25 elements. If I use the Fisher-Yates algorithm with a biased RNG, then my permutations will be biased, but I believe this assumes that the 25-element array starts from the same state before each application of the shuffle algorithm. One problem, for example, is that if the RNG only has a period of 2^32 ~ 10^9, we cannot produce every possible permutation of the 25 elements, because there are 25! ~ 10^25 permutations.
My general question is, if I leave the shuffled elements shuffled before starting each new application of the Fisher-Yates shuffle, would this reduce the bias and/or allow the algorithm to produce every permutation?
My guess is it would generally produce better results, but it seems like if the number of elements in the repeatedly shuffled array were related to the underlying RNG, the permutations could actually repeat more often than expected.
Does anyone know of any research that addresses this?
As a sub-question, what if I only want repeated permutations of 5 of the 25 elements in the array, so I use the Fisher-Yates algorithm to select 5 elements and stop before doing a full shuffle? (I use the 5 elements on the end of the array that got swapped.) Then I start over using the previous partially shuffled 25-element array to select another permutation of 5. Again, it seems like this would be better than starting from the original 25-element array if the underlying RNG had a bias. Any thoughts on this?
I think it would be easier to test the partial shuffle case since there are only 6,375,600 possible permutations of 5 out of 25 elements, so are there any simple tests to use to check for biases?
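For concreteness, here is a rough sketch of the kind of frequency test I have in mind (illustrative only; for the chi-square approximation to be meaningful, the number of trials would need to be far larger than the 6,375,600 possible tuples):

import math
import random
from collections import Counter

def partial_shuffle_draw(a, k, rng=random):
    # Fisher-Yates stopped after k swaps; returns the k elements moved to the end of a.
    n = len(a)
    for i in range(n - 1, n - 1 - k, -1):
        j = rng.randint(0, i)
        a[i], a[j] = a[j], a[i]
    return tuple(a[n - k:])

def frequency_test(trials, n=25, k=5):
    a = list(range(n))                       # the array stays shuffled between draws
    counts = Counter(partial_shuffle_draw(a, k) for _ in range(trials))
    tuples = math.perm(n, k)                 # 6,375,600 ordered 5-tuples for n=25, k=5
    expected = trials / tuples
    chi2 = sum((c - expected) ** 2 / expected for c in counts.values())
    chi2 += (tuples - len(counts)) * expected    # tuples never seen contribute (0 - E)^2 / E
    return chi2

print(frequency_test(50_000))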
"if the RNG only has a period of 2^32 ~ 10^9 we cannot produce every possible permutation of the 25 elements because there are 25! ~ 10^25 permutations"
This is only true as long as the seed determines every successive selection. If your RNG can be expected to deliver a precisely even distribution over the range specified for each successive selection, then it can produce every permutation. If your RNG cannot do that, having a larger seed base will not help.
As for your side question, you might as well reseed for every draw. However, reseeding the generator is only useful if the new seed contains enough entropy. Timestamps don't contain much entropy, and neither do algorithmic calculations.
I'm not sure what this solution is part of because you have not listed it, but if you are trying to calculate something from a larger domain using random input, there are probably better methods.
A couple of points:
1) Anyone using the Fisher-Yates shuffle should read this and make doubly sure their implementation is correct.
2) Doesn't repeating the shuffle defeat the purpose of using a faster random number generator? Surely if you have to repeat every shuffle 5 times to get the desired entropy, you're better off using a low-bias generator.
3) Do you have a setup where you can test this? If so, start trying things: Jeff's graphs make it clear that you can easily detect quite a lot of errors by using small decks and visually portraying the results.
My feeling is that with a biased RNG repeated runs of the Knuth shuffle would produce all the permutations, but I'm not able to prove it (it depends on the period of the RNG and on how biased it is).
So let's reverse the question: given an algorithm that requires a random input and a biased RNG, is it easier to de-skew the algorithm's output or to de-skew the RNG's output?
Unsurprisingly, the latter is much easier to do (and is of broader interest): there are several standard techniques to do it. A simple technique, due to Von Neumann, is: given a bitstream from a biased RNG, take bits in pairs, throw away every (0,0) and (1,1) pair, return a 1 for every (1,0) pair and a 0 for every (0,1) pair. This technique assumes that the bits are from a stream where each bit has the same probability of being a 0 or 1 as any other bit in the stream and that bits are not correlated. Elias generalized von Neumann's technique to a more efficient scheme (one where fewer bits are discarded).
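A minimal sketch of the von Neumann extractor in Python, assuming biased_bits yields independent 0/1 values:

import random

def von_neumann(biased_bits):
    # Yield unbiased bits from a stream of independent, identically biased bits.
    it = iter(biased_bits)
    for a, b in zip(it, it):      # consume the stream two bits at a time
        if a != b:                # (1,0) -> 1, (0,1) -> 0; equal pairs are discarded
            yield a

biased = (1 if random.random() < 0.8 else 0 for _ in range(10_000))
unbiased = list(von_neumann(biased))
print(len(unbiased), sum(unbiased))   # roughly half of the surviving bits are 1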
But even strongly biased or correlated bits may contain useful amounts of randomness, which can be extracted for example with a technique based on the Fast Fourier Transform.
Another option is to feed the biased RNG output to a cryptographically strong function, for example a message digest algorithm, and use its output.
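For example, a minimal sketch using SHA-256 from Python's standard library to whiten blocks of biased output (this improves the statistical quality of the stream but, as noted below, cannot add entropy the input never had):

import hashlib
import random

def whitened_bytes(biased_source, block=64):
    # Hash fixed-size blocks of biased bytes and emit the digests as the output stream.
    while True:
        raw = bytes(biased_source() for _ in range(block))
        yield from hashlib.sha256(raw).digest()

biased = lambda: 0 if random.random() < 0.9 else random.randrange(256)   # heavily skewed bytes
gen = whitened_bytes(biased)
print([next(gen) for _ in range(8)])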
For further references on how to de-skew random number generators, I suggest you to read the Randomness Recommendations for Security RFC.
My point is that the quality of the output of a random-based algorithm is upper bounded by the entropy provided by the RNG: if it is extremely biased, the output will be extremely biased, no matter what you do. The algorithm can't squeeze out more entropy than is contained in the biased random bitstream. Worse: it will probably lose some random bits. Even assuming that the algorithm works with a biased RNG, to obtain a good result you'll have to put in a computational effort at least as great as the effort it would take to de-skew the RNG (and it will probably require more, since you'll have to both run the algorithm and "defeat" the biasing at the same time).
If your question is just theoretical, then please disregard this answer. If it is practical, then please seriously think about de-skewing your RNG instead of making assumptions about the output of the algorithm.
I can't completely answer your question, but this observation seemed too long for a comment.
What happens if you ensure that the number of random numbers pulled from your RNG for each iteration of Fisher-Yates has a high least common multiple with the RNG period? That may mean that you "waste" a random integer at the end of the algorithm. When shuffling 25 elements, you need 24 random numbers. If you pull one more random number at the end, making 25 random numbers, you're not guaranteed to see a repetition until well past the RNG period. Now, purely by chance, you could have the same 25 numbers occur in succession before reaching the period, of course. But, as 25 has no common factors other than 1 with 2^32, you wouldn't hit a guaranteed repetition until 25*(2^32) draws. Now, that isn't a huge improvement, but you said this RNG is fast. What if the "waste" value was much larger? It may still not be practical to get every permutation, but you could at least increase the number you can reach.
It depends entirely on the bias. In general I would say "don't count on it".
Biased algorithm that converges to non-biased:
Do nothing half of the time, and do a correct shuffle the other half. This converges towards non-biased exponentially: after n shuffles there is a 1 - 1/2^n chance the shuffle is non-biased and a 1/2^n chance the output is still the original input sequence.
Biased algorithm that stays biased:
Shuffle all elements except the last one. Permanently biased towards not moving the last element.
More General Example:
Think of a shuffle algorithm as a weighted directed graph of permutations, where the weights out of a node correspond to the probability of transitioning from one permutation to another when shuffled. A biased shuffle algorithm will have non-uniform weights.
Now suppose you filled one node in that graph with water, and water flowed from one node to the next based on the weights. The algorithm will converge to non-biased if the distribution of water converges to uniform no matter the starting node.
So in what cases will the water not spread out uniformly? Well, if you have a cycle of above-average weights, nodes in the cycle will tend to feed each other and stay above the average amount of water. They won't take all of it, since as they get more water the amount coming in decreases and the amount going out increases, but it will be above average.
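A small sketch of this water-flow picture: estimate the transition matrix of a biased shuffle over all permutations of a tiny array (the shuffle-all-but-the-last-element example above) and repeatedly push a point distribution through it to see whether it converges to uniform:

import itertools
import random
from collections import Counter

def transition_matrix(shuffle, n, trials=20_000, rng=random):
    # Estimate P(perm -> perm') for a shuffle acting on permutations of range(n).
    perms = list(itertools.permutations(range(n)))
    index = {p: i for i, p in enumerate(perms)}
    m = [[0.0] * len(perms) for _ in perms]
    for src_i, src in enumerate(perms):
        counts = Counter()
        for _ in range(trials):
            a = list(src)
            shuffle(a, rng)
            counts[tuple(a)] += 1
        for dst, c in counts.items():
            m[src_i][index[dst]] = c / trials
    return perms, m

def biased_shuffle(a, rng):
    # Fisher-Yates that never touches the last element -- permanently biased.
    for i in range(len(a) - 2, 0, -1):
        j = rng.randint(0, i)
        a[i], a[j] = a[j], a[i]

def step(dist, m):
    # One "water flow" step: push the probability mass along the weighted edges.
    return [sum(dist[s] * m[s][d] for s in range(len(dist))) for d in range(len(dist))]

perms, m = transition_matrix(biased_shuffle, 3)
dist = [1.0] + [0.0] * (len(perms) - 1)      # all the water starts in one node
for _ in range(20):
    dist = step(dist, m)
print([round(x, 3) for x in dist])           # never uniform: the last element stays fixed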
