I'd like to produce fast random shuffles repeatedly with minimal bias.
It's known that the Fisher-Yates shuffle is unbiased as long as the underlying random number generator (RNG) is unbiased.
To shuffle an array a of n elements:
for i from n − 1 downto 1 do
j ← random integer with 0 ≤ j ≤ i
exchange a[j] and a[i]
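For reference, here is the same pseudocode as a minimal runnable C++ sketch (the standard Mersenne Twister is used purely as a stand-in for whatever RNG is under discussion):

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Fisher-Yates shuffle: for i from n-1 down to 1, swap a[i] with a[j],
// where j is drawn uniformly from 0..i. The result is unbiased exactly
// when those draws are unbiased.
void fisher_yates(std::vector<int>& a, std::mt19937& rng) {
    if (a.empty()) return;
    for (std::size_t i = a.size() - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> dist(0, i);
        std::swap(a[dist(rng)], a[i]);
    }
}
```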
But what if the RNG is biased (but fast)?
Suppose I want to produce many random permutations of an array of 25 elements. If I use the Fisher-Yates algorithm with a biased RNG, my permutations will be biased, but I believe this assumes the 25-element array starts from the same state before each application of the shuffle. One problem, for example, is that if the RNG only has a period of 2^32 ~ 10^9, we cannot produce every possible permutation of the 25 elements, because there are 25! ~ 10^25 permutations.
My general question is: if I leave the elements in their shuffled state before starting each new application of the Fisher-Yates shuffle, would this reduce the bias and/or allow the algorithm to produce every permutation?
My guess is that it would generally produce better results, but it seems that if the number of elements in the repeatedly shuffled array were related to the underlying RNG's period, the permutations could actually repeat more often than expected.
Does anyone know of any research that addresses this?
As a sub-question, what if I only want repeated permutations of 5 of the 25 elements in the array, so I use the Fisher-Yates algorithm to select 5 elements and stop before doing a full shuffle? (I use the 5 elements on the end of the array that got swapped.) Then I start over using the previous partially shuffled 25-element array to select another permutation of 5. Again, it seems like this would be better than starting from the original 25-element array if the underlying RNG had a bias. Any thoughts on this?
I think it would be easier to test the partial shuffle case since there are only 6,375,600 possible permutations of 5 out of 25 elements, so are there any simple tests to use to check for biases?
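As a hedged sketch of the partial-shuffle idea (the names draw5 and chi_squared are mine, and the chi-squared statistic is just one standard way to check counts against a uniform expectation): run only the last 5 Fisher-Yates steps, deliberately reuse the array state between calls, and tally how often each ordered selection appears.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <random>
#include <utility>
#include <vector>

// Partial Fisher-Yates: perform only the last 5 swap steps, so positions
// n-1 .. n-5 end up holding a random ordered selection of 5 elements.
// The array (n >= 5 assumed) is intentionally left shuffled between calls.
template <class Rng>
std::array<int, 5> draw5(std::vector<int>& a, Rng& rng) {
    const std::size_t n = a.size();
    for (std::size_t step = 0; step < 5; ++step) {
        std::size_t i = n - 1 - step;
        std::uniform_int_distribution<std::size_t> dist(0, i);
        std::swap(a[dist(rng)], a[i]);
    }
    return {a[n - 1], a[n - 2], a[n - 3], a[n - 4], a[n - 5]};
}

// Toy bias check: compare observed counts of each selection against the
// uniform expectation trials / 6,375,600 with a chi-squared statistic.
double chi_squared(const std::map<std::array<int, 5>, std::uint64_t>& counts,
                   std::uint64_t trials) {
    const double cells = 6375600.0;          // 25 * 24 * 23 * 22 * 21
    const double expected = trials / cells;
    double chi2 = 0.0;
    std::uint64_t seen = 0;
    for (const auto& kv : counts) {
        double d = kv.second - expected;
        chi2 += d * d / expected;
        ++seen;
    }
    chi2 += (cells - seen) * expected;       // cells that never appeared
    return chi2;
}
```

For the test to mean anything you need far more trials than the 6,375,600 cells, and in practice a flat array indexed by a mixed-radix encoding of the selection would be leaner than a map; the sketch just shows the shape of the check.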
if the RNG only has a period of 2^32 ~ 10^9 we cannot produce every possible permutation of the 25 elements because there are 25! ~ 10^25 permutations
This is only true as long as the seed determines every successive selection. As long as your RNG can be expected to deliver a precisely even distribution over the range specified for each selection, it can produce every permutation. If your RNG cannot do that, having a larger seed base will not help.
As for your side question, you might as well reseed for every draw. However, reseeding the generator is only useful if the new seed contains enough entropy. Time stamps don't contain much entropy, and neither do algorithmic calculations.
I'm not sure what larger solution this is part of, because you have not described it, but if you are trying to calculate something from a larger domain using random input, there are probably better methods.
A couple of points:
1) Anyone using the Fisher-Yates shuffle should read this and make doubly sure their implementation is correct.
2) Doesn't repeating the shuffle defeat the purpose of using a faster random number generator? Surely, if you're going to have to repeat every shuffle five times to get the desired entropy, you're better off using a low-bias generator.
3) Do you have a setup where you can test this? If so, start trying things - Jeff's graphs make it clear that you can easily detect quite a lot of errors by using small decks and visually portraying the results.
My feeling is that with a biased RNG, repeated runs of the Knuth shuffle would produce all the permutations, but I'm not able to prove it (it depends on the period of the RNG and on how biased it is).
So let's reverse the question: given an algorithm that requires a random input and a biased RNG, is it easier to de-skew the algorithm's output or to de-skew the RNG's output?
Unsurprisingly, the latter is much easier to do (and is of broader interest): there are several standard techniques to do it. A simple technique, due to Von Neumann, is: given a bitstream from a biased RNG, take bits in pairs, throw away every (0,0) and (1,1) pair, return a 1 for every (1,0) pair and a 0 for every (0,1) pair. This technique assumes that the bits are from a stream where each bit has the same probability of being a 0 or 1 as any other bit in the stream and that bits are not correlated. Elias generalized von Neumann's technique to a more efficient scheme (one where fewer bits are discarded).
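A minimal sketch of the von Neumann extractor just described (the interface is mine; biased_bit stands for whatever biased source you have): pull bits in pairs, drop (0,0) and (1,1), and emit 1 for (1,0) and 0 for (0,1).

```cpp
#include <functional>

// Von Neumann extractor: turns a stream of independent, identically biased
// bits into unbiased bits. Pairs (0,0) and (1,1) are discarded; (1,0) and
// (0,1) each occur with probability p*(1-p), so the output bit is fair.
int von_neumann_bit(const std::function<int()>& biased_bit) {
    for (;;) {
        int a = biased_bit();
        int b = biased_bit();
        if (a != b)
            return a;   // (1,0) -> 1, (0,1) -> 0
    }
}
```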
But even strongly biased or correlated bits may contain useful amounts of randomness, which can be extracted, for example, with a technique based on the Fast Fourier Transform.
Another option is to feed the biased RNG output to a cryptographically strong function, for example a message digest algorithm, and use its output.
For further references on how to de-skew random number generators, I suggest you read the Randomness Recommendations for Security RFC.
My point is that the quality of the output of a random-based algorithm is upper bounded by the entropy provided by the RNG: if the RNG is extremely biased, the output will be extremely biased, no matter what you do. The algorithm can't squeeze out more entropy than is contained in the biased random bitstream. Worse: it will probably lose some random bits. Even assuming that the algorithm works with a biased RNG, to obtain good results you'll have to expend at least as much computational effort as it would take to de-skew the RNG (and probably more, since you'll have to both run the algorithm and "defeat" the bias at the same time).
If your question is purely theoretical, then please disregard this answer. If it is practical, then please seriously consider de-skewing your RNG instead of making assumptions about the output of the algorithm.
I can't completely answer your question, but this observation seemed too long for a comment.
What happens if you ensure that the number of random numbers pulled from your RNG for each iteration of Fisher-Yates has a high least common multiple with the RNG period? That may mean "wasting" a random integer at the end of the algorithm. When shuffling 25 elements, you need 24 random numbers. If you pull one more random number at the end, making 25 random numbers, then a repetition isn't guaranteed until much longer than the RNG period. Now, by chance, you could still see the same 25 numbers occur in succession before reaching the period, of course. But, as 25 has no common factor other than 1 with 2^32, you wouldn't hit a guaranteed repetition until 25*(2^32) draws. That isn't a huge improvement, but you said this RNG is fast. What if the "waste" were much larger? It may still not be practical to reach every permutation, but you could at least increase the number of permutations you can reach.
It depends entirely on the bias. In general I would say "don't count on it".
Biased algorithm that converges to non-biased:
Do nothing half of the time, and a correct shuffle the other half. This converges towards unbiased exponentially: after n shuffles there is a 1 − 1/2^n chance the result is unbiased (at least one correct shuffle has been applied) and a 1/2^n chance the output is still the original input sequence.
Biased algorithm that stays biased:
Shuffle all elements except the last one. Permanently biased towards not moving the last element.
More General Example:
Think of a shuffle algorithm as a weighted directed graph of permutations, where the weights out of a node correspond to the probability of transitioning from one permutation to another when shuffled. A biased shuffle algorithm will have non-uniform weights.
Now suppose you filled one node in that graph with water, and water flowed from one node to the next based on the weights. The algorithm will converge to non-biased if the distribution of water converges to uniform no matter the starting node.
So in what cases will the water not spread out uniformly? Well, if you have a cycle of above-average weights, nodes in the cycle will tend to feed each other and stay above the average amount of water. They won't take all of it, since as they get more water the amount coming in decreases and the amount going out increases, but it will be above average.
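To make the water picture concrete, here is an illustrative sketch (the bias numbers are invented) that builds the exact 6x6 transition matrix of a Fisher-Yates shuffle on 3 elements driven by skewed draws, and power-iterates a point distribution to see where the "water" settles:

```cpp
#include <algorithm>
#include <array>
#include <cstdio>
#include <utility>
#include <vector>

using Perm = std::array<int, 3>;

int main() {
    // The 6 nodes of the graph: all permutations of {0,1,2}.
    std::vector<Perm> perms;
    Perm p = {0, 1, 2};
    do { perms.push_back(p); } while (std::next_permutation(p.begin(), p.end()));

    auto index_of = [&](const Perm& q) {
        return static_cast<int>(std::find(perms.begin(), perms.end(), q) - perms.begin());
    };

    // Invented bias: the "uniform on 0..i" draws of Fisher-Yates are skewed.
    const double q2[3] = {0.5, 0.3, 0.2};  // draw for i = 2 (should be 1/3 each)
    const double q1[2] = {0.7, 0.3};       // draw for i = 1 (should be 1/2 each)

    // Exact transition matrix T[from][to]: enumerate every (j2, j1) outcome.
    double T[6][6] = {};
    for (int from = 0; from < 6; ++from)
        for (int j2 = 0; j2 < 3; ++j2)
            for (int j1 = 0; j1 < 2; ++j1) {
                Perm q = perms[from];
                std::swap(q[j2], q[2]);
                std::swap(q[j1], q[1]);
                T[from][index_of(q)] += q2[j2] * q1[j1];
            }

    // "Pour water" into one node and let it flow: d <- d * T, repeatedly.
    double d[6] = {1, 0, 0, 0, 0, 0};
    for (int step = 0; step < 50; ++step) {
        double next[6] = {};
        for (int from = 0; from < 6; ++from)
            for (int to = 0; to < 6; ++to)
                next[to] += d[from] * T[from][to];
        std::copy(next, next + 6, d);
    }
    for (int k = 0; k < 6; ++k)
        std::printf("permutation %d: %.4f (uniform would be %.4f)\n", k, d[k], 1.0 / 6);
}
```

In this particular construction the water does level out to 1/6 everywhere: each (j2, j1) outcome composes the current permutation with a fixed permutation, so every column of T also sums to 1 (the matrix is doubly stochastic), and the chain is irreducible and aperiodic. Re-run it with the "shuffle everything except the last element" rule from above and the water instead stays trapped among the permutations sharing the starting arrangement's last element, which is exactly the kind of above-average-weight cycle described here.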
Related
More specifically, is there an algorithm that can generate, deterministically, given a seed, n integers from 0 to n−1, with no duplicates or missing numbers, in linear or sub-linear time and constant space?
All the answers I've found or seen online require linear space, as they need to store information about every element of the sequence before they can give the first number at all. This becomes an unreasonable amount of memory when there are millions or trillions of possible numbers, which is the case for random id generation. Is there an algorithm, say an iterative formula, which nicely spits out one number after another, without having to know anything about the numbers before or after it? Or am I living in a pipe dream right now?
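For what it's worth, one classic construction that fits these constraints is to walk the indices through a bijection of {0, ..., n−1}: f(i) = (a·i + b) mod n is a permutation whenever gcd(a, n) = 1, needs O(1) space, and is fully determined by the seed. The sketch below (names are mine; it assumes n ≥ 2 and n small enough to fit in 32 bits so the 64-bit arithmetic cannot overflow) is illustrative only; its output is far from statistically random, and a format-preserving cipher over the index space is the heavier-duty version of the same idea.

```cpp
#include <cstdint>
#include <numeric>

// Deterministic, constant-space enumeration of 0..n-1 with no repeats:
// f(i) = (a*i + b) mod n is a bijection on {0,...,n-1} whenever gcd(a, n) == 1.
struct AffinePermutation {
    std::uint64_t n, a, b;

    AffinePermutation(std::uint64_t n, std::uint64_t seed) : n(n) {
        a = seed % (n - 1) + 1;                            // candidate multiplier in 1..n-1
        while (std::gcd(a, n) != 1) a = a % (n - 1) + 1;   // bump until coprime with n
        b = (seed / n) % n;                                // additive offset
    }

    // i-th element of the sequence, computed in O(1) time and space.
    std::uint64_t operator()(std::uint64_t i) const {
        return (a * i + b) % n;
    }
};
```

Calling the object on i = 0, 1, ..., n−1 visits every value in 0..n−1 exactly once, with constant memory per call.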
My program is supposed to loop forever and print every prime number it comes across. I'm doing this in x86-NASM, by the way.
My first attempt divided it by EVERY previous number until either the Carry is 0 (not a prime) or the result is 1.
My second attempt improved this by only testing every second number, so only odd divisors.
The third thing I am currently implementing is to not divide by EVERY previous number but only by those up to half of the candidate, since dividing a number by anything bigger than its half can never give a whole result.
Another thing that might help is to test only with odd numbers, in the spirit of the sieve of Eratosthenes but excluding only the even numbers.
Anyway, if there is another thing I can do, all help welcome.
edit:
If you need to test a handful of numbers, possibly only one, for primality, the AKS primality test is polynomial in the length of n.
If you want to find a very big prime, of cryptographic size, then select a random range of odd numbers and sieve out all the numbers whose factors are small primes (e.g. less than or equal to 64K-240K), then test the remaining numbers for primality.
If you want to find the primes in a range, then use a sieve; the sieve of Eratosthenes is very easy to implement but runs slower and requires more memory.
The sieve of Atkin is faster; the wheel sieve requires far less memory.
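For reference, a minimal sieve of Eratosthenes sketch in C++ that stores odd numbers only (the same even-number exclusion mentioned in the question), which roughly halves the memory:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Sieve of Eratosthenes over odd numbers only: index i represents 2*i + 3,
// so even numbers are never stored at all.
std::vector<std::uint32_t> primes_up_to(std::uint32_t limit) {
    std::vector<std::uint32_t> primes;
    if (limit >= 2) primes.push_back(2);
    if (limit < 3) return primes;

    std::vector<bool> composite((limit - 3) / 2 + 1, false);
    for (std::uint64_t i = 0; i < composite.size(); ++i) {
        if (composite[i]) continue;
        std::uint64_t p = 2 * i + 3;
        primes.push_back(static_cast<std::uint32_t>(p));
        // Cross off odd multiples starting at p*p; even multiples are not stored.
        for (std::uint64_t m = p * p; m <= limit; m += 2 * p)
            composite[(m - 3) / 2] = true;
    }
    return primes;
}

int main() {
    for (std::uint32_t p : primes_up_to(100))
        std::printf("%u\n", p);
}
```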
The size of the problem grows exponentially if approached naively, so before micro-optimising it is mandatory to macro-optimise first.
More or less all prime-number algorithms require familiarity with number theory, so pay particular attention to the group/ring/field the algorithm is working on, because mathematicians write operations like the inverse or the multiplication with the same symbol for all the algebraic structures.
Once you have a fast algorithm, you can start micro-optimising.
At this level it's really impossible to answer how to proceed with such optimisations.
I was going through lecture videos by Robert Sedgewick on algorithms, and he explains that random shuffling ensures we don't encounter the worst-case quadratic-time scenario in quicksort. But I am unable to understand how.
It's really an admission that although we often talk about average case complexity, we don't in practice expect every case to turn up with the same probability.
Sorting an already sorted array is the worst case for quicksort (with a naive pivot choice such as always taking the first or last element), because whenever you pick a pivot, you discover that all the elements get placed on the same side of it, so you don't split into two roughly equal halves at all. And in practice this already-sorted case will turn up more often than other cases.
Randomly shuffling the data first is a quick way of ensuring that you really do end up with all cases turning up with equal probability, and therefore that this worst case will be as rare as any other case.
It's worth noting that there are other strategies that deal well with already sorted data, such as choosing the middle element as the pivot.
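As a small self-contained illustration (a sketch, not code from the lectures; the quadratic behaviour comes from the naive last-element pivot):

```cpp
#include <algorithm>
#include <random>
#include <utility>
#include <vector>

// Textbook quicksort that always pivots on the last element: quadratic on
// already-sorted input. Shuffling once up front makes that input no worse
// than an average one.
void quicksort(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; ++j)
        if (a[j] < pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);
    quicksort(a, lo, i - 1);
    quicksort(a, i + 1, hi);
}

void randomized_sort(std::vector<int>& a) {
    static std::mt19937 rng(std::random_device{}());
    std::shuffle(a.begin(), a.end(), rng);   // defuse sorted / adversarial inputs
    quicksort(a, 0, static_cast<int>(a.size()) - 1);
}
```

Note that this two-way partition is still quadratic when there are many equal keys, even after shuffling, which is exactly the issue raised in the timing experiment further down.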
The assumption is that the worst case -- everything already sorted -- is frequent enough to be worth worrying about, and a shuffle is a black-magic, least-effort, sloppy way to avoid that case without having to admit that by improving that case you're moving the problem to another one, namely the input that happens to get randomly shuffled into sorted order. Hopefully that bad case is a much rarer situation, and even if it does come up, the randomness means the problem can't easily be reproduced and blamed on this cheat.
The concept of improving a common case at the expense of a rare one is fine. The randomness as an alternative to actually thinking about which cases will be more or less common is somewhat sloppy.
In the case of randomized QuickSort, since the pivot element is randomly chosen, we can expect the split of the input array to be reasonably well balanced on average - as opposed to the 1 and (n-1) split of a non-randomized version of the algorithm. This helps prevent the worst-case behavior of QuickSort, which occurs with unbalanced partitioning.
Hence, the average-case running time of the randomized version of QuickSort is O(n log n) and not O(n^2).
What does a random shuffle do to the distribution on the input space? To understand this, let's look at a probability distribution, P, defined over a set S, where P is not in our control. Let us create a probability distribution P' by applying a random shuffle over S to P. In other words, every time we get a sample from P, we map it, uniformly at random, to an element of S. What can you say about this resulting distribution P'?
P'(x) = Σ_{s ∈ S} P(s) · (1/|S|) = 1/|S|
Thus, P' is just the uniform distribution over S. A random shuffle gives us control over the input probability distribution.
How is this relevant to quicksort? Well, we know the average complexity of quicksort. This is computed with respect to the uniform probability distribution over inputs, and that is the property we want to maintain on our input distribution, irrespective of what it really is. To achieve that, we do a random shuffle of our input array, ensuring that the distribution is not adversarial in any way.
Is the video on Coursera?
Unfortunately, shuffling still leaves O(N^2) performance on data like n,n,...,n,1,1,...,1.
I have inspected Quick.java with nn11.awk, which generates such data.
$ for N in 10000 20000 30000 40000; do time ./nn11.awk $N | java Quick; done | awk 'NF>1'
N = 10000:
real 0m10.732s
user 0m10.295s
sys 0m0.948s

N = 20000:
real 0m48.057s
user 0m44.968s
sys 0m3.193s

N = 30000:
real 1m52.109s
user 1m48.158s
sys 0m3.634s

N = 40000:
real 3m38.336s
user 3m31.475s
sys 0m6.253s
I am starting to learn CUDA and I think calculating long digits of pi would be a nice, introductory project.
I have already implemented the simple Monte Carlo method, which is easily parallelizable. I simply have each thread randomly generate points on the unit square, figure out how many lie within the unit circle, and tally up the results using a reduction operation.
But that is certainly not the fastest algorithm for calculating the constant. Before, when I did this exercise on a single threaded CPU, I used Machin-like formulae to do the calculation for far faster convergence. For those interested, this involves expressing pi as the sum of arctangents and using Taylor series to evaluate the expression.
An example of such a formula is Machin's original one: π/4 = 4·arctan(1/5) − arctan(1/239).
Unfortunately, I found that parallelizing this technique to thousands of GPU threads is not easy. The problem is that the majority of the operations are simply doing high precision math as opposed to doing floating point operations on long vectors of data.
So I'm wondering, what is the most efficient way to calculate arbitrarily long digits of pi on a GPU?
You should use the Bailey–Borwein–Plouffe formula
Why? First of all, you need an algorithm that can be broken down. So, the first thing that came to my mind is having a representation of pi as an infinite sum. Then, each processor just computes one term, and you sum them all in the end.
Then, it is preferable that each processor manipulates small-precision values, as opposed to very high precision ones. For example, if you want one billion decimals and you use one of the expressions mentioned here, like the Chudnovsky algorithm, each of your processors will need to manipulate a number that is a billion digits long. That's simply not the appropriate method for a GPU.
So, all in all, the BBP formula will allow you to compute the digits of pi separately (the algorithm is very cool), and with "low precision" processors! Read the "BBP digit-extraction algorithm for π"
Advantages of the BBP algorithm for computing π
This algorithm computes π without requiring custom data types having thousands or even millions of digits. The method calculates the nth digit without calculating the first n − 1 digits, and can use small, efficient data types.
The algorithm is the fastest way to compute the nth digit (or a few digits in a neighborhood of the nth), but π-computing algorithms using large data types remain faster when the goal is to compute all the digits from 1 to n.
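To make the "small data types" point concrete, here is a single-threaded C++ sketch of BBP hex-digit extraction (the function names are mine); on a GPU each thread would simply evaluate pi_hex_digit for a different n:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// (base^exp) mod m by square-and-multiply.
static std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    std::uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

// Fractional part of sum_{k >= 0} 16^(n-k) / (8k + j).
static double bbp_series(int j, int n) {
    double s = 0.0;
    for (int k = 0; k <= n; ++k) {                 // head: kept mod 1 via pow_mod
        std::uint64_t denom = 8ULL * k + j;
        s += static_cast<double>(pow_mod(16, n - k, denom)) / denom;
        s -= std::floor(s);
    }
    for (int k = n + 1; ; ++k) {                   // tail: rapidly shrinking terms
        double term = std::pow(16.0, n - k) / (8.0 * k + j);
        if (term < 1e-17) break;
        s += term;
    }
    return s - std::floor(s);
}

// n-th hexadecimal digit of pi after the point (n = 0 gives the first digit, 2).
int pi_hex_digit(int n) {
    double x = 4 * bbp_series(1, n) - 2 * bbp_series(4, n)
             - bbp_series(5, n) - bbp_series(6, n);
    x -= std::floor(x);
    return static_cast<int>(x * 16.0);
}

int main() {
    // pi = 3.243F6A88... in hexadecimal; print the first 12 fractional digits.
    for (int n = 0; n < 12; ++n) std::printf("%X", pi_hex_digit(n));
    std::printf("\n");
}
```

Double precision limits how far out this naive version stays accurate; serious digit-extraction codes carry extra guard precision.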
I am trying to write a demo for an embedded processor, which is a multicore architecture and is very fast at floating-point calculations. The problem is that the current hardware I have is the processor connected through an evaluation board where the DRAM-to-chip rate is somewhat limited, and the board-to-PC rate is very slow and inefficient.
Thus, when demonstrating big matrix multiplication, I can do, say, 128x128 matrices in a couple of milliseconds, but the I/O takes (many) seconds and kills the demo.
So, I am looking for some kind of calculation with higher complexity than n^3 (the more the better, but preferably easy to program and to explain/understand) to make the computation part more dominant in the time budget, where the dataset is preferably bounded to about 16KB per thread (core).
Any suggestion?
PS: I think it is very similar to this question in its essence.
You could generate large (256-bit) numbers and factor them; that's commonly used in "stress-test" tools. If you specifically want to exercise floating point computation, you can build a basic n-body simulator with a Runge-Kutta integrator and run that.
What you can do is:
Declare a std::vector of int.
Populate it with 0 to N-1 in sorted order.
Now keep calling std::next_permutation repeatedly until the elements are back in sorted order, i.e. next_permutation returns false.
With N integers this needs O(N!) calculations, and it is also deterministic.
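A minimal sketch of that suggestion (the checksum is just token work so the loop has an observable result):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int N = 10;                      // O(N!) work from O(N) data
    std::vector<int> v(N);
    std::iota(v.begin(), v.end(), 0);      // 0, 1, ..., N-1 in sorted order

    std::uint64_t count = 0, checksum = 0;
    do {
        ++count;
        checksum += v[0];                  // token per-permutation work
    } while (std::next_permutation(v.begin(), v.end()));

    std::printf("%llu permutations, checksum %llu\n",
                static_cast<unsigned long long>(count),
                static_cast<unsigned long long>(checksum));
}
```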
PageRank may be a good fit. Articulated as a linear algebra problem, one repeatedly squares a certain floating-point matrix of controllable size until convergence. In the graphical metaphor, one "ripples" change coming into each node onto the other edges. Both treatments can be made parallel.
You could do a least trimmed squares fit. One use of this is to identify outliers in a data set. For example you could generate samples from some smooth function (a polynomial say) and add (large) noise to some of the samples, and then the problem is to find a subset H of the samples of a given size that minimises the sum of the squares of the residuals (for the polynomial fitted to the samples in H). Since there are a large number of such subsets, you have a lot of fits to do! There are approximate algorithms for this, for example here.
Well, one way to go would be to implement a brute-force solver for the Traveling Salesman problem in some M-dimensional space (with M > 1).
The brute-force solution is to just try every possible permutation and then calculate the total distance for each permutation, without any optimizations (including no dynamic programming tricks like memoization).
For N points, there are (N!) permutations (with a redundancy factor of at least (N-1), but remember, no optimizations). Each pair of points requires (M) subtractions, (M) multiplications and one square-root operation to determine their Pythagorean distance apart. Each permutation has (N-1) pairs of points whose distances must be calculated and added to the total.
So order of computation is O(M((N+1)!)), whereas storage space is only O(N).
Also, this should be neither too hard nor too costly to parallelize across the cores, though it does take some overhead. (I can demonstrate, if needed.)
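A rough sketch of that brute-force approach (random coordinates, the path interpretation with N−1 edges, and deliberately no pruning or memoization):

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    constexpr int N = 10;                      // points: N! permutations to try
    constexpr int M = 3;                       // dimensions
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> coord(0.0, 1.0);

    std::vector<std::array<double, M>> pts(N);
    for (auto& p : pts)
        for (double& x : p) x = coord(rng);

    auto dist = [&](int a, int b) {
        double s = 0.0;
        for (int d = 0; d < M; ++d) {
            double diff = pts[a][d] - pts[b][d]; // M subtractions and multiplications
            s += diff * diff;
        }
        return std::sqrt(s);                     // one square root per pair
    };

    std::vector<int> order(N);
    std::iota(order.begin(), order.end(), 0);

    double best = 1e300;
    do {                                         // brute force: every permutation
        double total = 0.0;
        for (int i = 0; i + 1 < N; ++i)          // N-1 pairs along the path
            total += dist(order[i], order[i + 1]);
        best = std::min(best, total);
    } while (std::next_permutation(order.begin(), order.end()));

    std::printf("shortest path over %d points: %f\n", N, best);
}
```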
Another idea might be to compute a fractal map. Basically, choose a grid of whatever dimensionality you want. Then, for each grid point, do the fractal iteration to get the value. Some points might require only a few iterations; I believe some will iterate forever (chaos; of course, this can't really happen when you have a finite number of floating-point values, but still). The ones that don't stop will have to be "cut off" after a certain number of iterations... just make this preposterously high, and you should be able to demonstrate a high-quality fractal map.
Another benefit of this is that grid cells are processed completely independently, so you will never need to do communication (not even at boundaries, as in stencil computations, and definitely not O(pairwise) as in direct N-body simulations). You can usefully use O(gridcells) number of processors to parallelize this, although in practice you can probably get better utilization by using gridcells/factor processors and dynamically scheduling grid points to processors on an as-ready basis. The computation is basically all floating-point math.
Mandelbrot/Julia and Lyapunov fractals come to mind as potential candidates, but any should do.
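For example, a compact escape-time Mandelbrot sketch over an independent grid of cells, with the high iteration cut-off suggested above:

```cpp
#include <complex>
#include <cstdio>
#include <vector>

int main() {
    const int W = 64, H = 32, MAX_ITER = 100000;    // preposterously high cut-off
    std::vector<int> iters(W * H);

    // Each grid cell is completely independent: ideal for per-core scheduling.
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            std::complex<double> c(-2.0 + 3.0 * x / W, -1.0 + 2.0 * y / H);
            std::complex<double> z = 0;
            int i = 0;
            while (i < MAX_ITER && std::norm(z) <= 4.0) {
                z = z * z + c;                      // pure floating-point work
                ++i;
            }
            iters[y * W + x] = i;
        }

    for (int y = 0; y < H; ++y) {                   // crude ASCII rendering
        for (int x = 0; x < W; ++x)
            std::putchar(iters[y * W + x] == MAX_ITER ? '#' : '.');
        std::putchar('\n');
    }
}
```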