The famous Fisher-Yates shuffle algorithm can be used to randomly permute an array A of length N:
For k = 1 to N
Pick a random integer j from k to N
Swap A[k] and A[j]
A common mistake that I've been told over and over again not to make is this:
For k = 1 to N
Pick a random integer j from 1 to N
Swap A[k] and A[j]
That is, instead of picking a random integer from k to N, you pick a random integer from 1 to N.
What happens if you make this mistake? I know that the resulting permutation isn't uniformly distributed, but I don't know what guarantees there are on what the resulting distribution will be. In particular, does anyone have an expression for the probability distributions over the final positions of the elements?
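To make the difference concrete, here is a minimal sketch (not from the original post) that enumerates every equally likely sequence of picks for both loops and counts the resulting arrangements. The function name `enumerate_shuffle` and the 0-based indexing are assumptions made for illustration. For N = 3 the broken loop has 3^3 = 27 pick sequences, which cannot be split evenly over 3! = 6 permutations.

```python
from collections import Counter
from itertools import product


def enumerate_shuffle(n, naive):
    """Count how often each final arrangement arises over all equally likely pick sequences."""
    # Step k (0-based) picks j from 0..n-1 (naive) or from k..n-1 (Fisher-Yates).
    choices = [range(0 if naive else k, n) for k in range(n)]
    counts = Counter()
    for picks in product(*choices):
        a = list(range(1, n + 1))
        for k, j in enumerate(picks):
            a[k], a[j] = a[j], a[k]
        counts[tuple(a)] += 1
    return counts


if __name__ == "__main__":
    print(enumerate_shuffle(3, naive=False))  # Fisher-Yates: every permutation counted once
    print(enumerate_shuffle(3, naive=True))   # naive: 27 sequences, unevenly spread
```

Exhaustive enumeration only works for small N, but it avoids sampling noise entirely.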
An Empirical Approach.
Let's implement the erroneous algorithm in Mathematica:
p = 10; (* Range *)
s = {}
For[l = 1, l <= 30000, l++, (*Iterations*)
a = Range[p];
For[k = 1, k <= p, k++,
i = RandomInteger[{1, p}];
temp = a[[k]];
a[[k]] = a[[i]];
a[[i]] = temp
];
AppendTo[s, a];
]
Now get the number of times each integer is in each position:
r = SortBy[#, #[[1]] &] & /@ Tally /@ Transpose[s]
Let's take three positions in the resulting arrays and plot the frequency distribution for each integer in that position:
For position 1 the freq distribution is:
For position 5 (middle)
And for position 10 (last):
and here you have the distribution for all positions plotted together:
Here are better statistics over 8 positions:
Some observations:
For all positions the probability of "1" is the same (1/n).
The probability matrix is symmetrical with respect to the big anti-diagonal.
So, the probability for any number in the last position is also uniform (1/n).
You can see these properties in the plots: all lines start from the same point (first property), and the line for the last position is horizontal (third property).
The second property can be seen from the following matrix representation example, where the rows are the positions, the columns are the occupant number, and the color represents the experimental probability:
For a 100x100 matrix:
Edit
Just for fun, I calculated the exact formula for the second diagonal element (the first is 1/n). The rest can be done, but it's a lot of work.
h[n_] := (n-1)/n^2 + (n-1)^(n-2) n^(-n)
Values verified from n=3 to 6 ( {8/27, 57/256, 564/3125, 7105/46656} )
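The closed form can be checked against exact arithmetic. The following is a sketch under the step rule described in this thread: at step k, an element sitting at position k is sent to a uniformly random position, and any other element is pulled into position k with probability 1/n. The names `naive_shuffle_matrix` and `h` are illustrative, and indices are 0-based, so the "second diagonal element" is `m[1][1]`.

```python
from fractions import Fraction


def naive_shuffle_matrix(n):
    """m[i][j] = exact probability that the element starting at i ends at j (0-based)."""
    a, b = Fraction(1, n), Fraction(n - 1, n)
    # Before any swap every element is in its own position: the identity matrix.
    m = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        new = []
        for row in m:
            at_k = row[k]  # probability the element currently sits at position k
            # Elsewhere: stay with probability b, or receive the element leaving k with probability a.
            updated = [b * row[j] + a * at_k for j in range(n)]
            # Position k receives any element (wherever it is) with probability a.
            updated[k] = a * sum(row)
            new.append(updated)
        m = new
    return m


def h(n):
    """Closed form quoted above for the second diagonal element."""
    return Fraction(n - 1, n**2) + Fraction((n - 1) ** (n - 2), n**n)


if __name__ == "__main__":
    for n in range(3, 7):
        print(n, naive_shuffle_matrix(n)[1][1], h(n))
```

Using `Fraction` keeps every entry exact, so the comparison with the formula is an equality rather than a floating-point approximation.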
Edit
Working out the general explicit calculation in @wnoise's answer a little further, we can get a bit more information.
Replacing 1/n by p[n], so that the calculations are held unevaluated, we get for example for the first part of the matrix with n=7:
Comparing this with the results for other values of n lets us identify some known integer sequences in the matrix:
{{ 1/n, 1/n , ...},
{... .., A007318, ....},
{... .., ... ..., ..},
... ....,
{A129687, ... ... ... ... ... ... ..},
{A131084, A028326 ... ... ... ... ..},
{A028326, A131084 , A129687 ... ....}}
You may find those sequences (in some cases with different signs) in the wonderful http://oeis.org/
Solving the general problem is more difficult, but I hope this is a start
The "common mistake" you mention is shuffling by random transpositions. This problem was studied in full detail by Diaconis and Shahshahani in Generating a random permutation with random transpositions (1981). They do a complete analysis of stopping times and convergence to uniformity. If you cannot get a link to the paper, then please send me an e-mail and I can forward you a copy. It's actually a fun read (as are most of Persi Diaconis's papers).
If the array has repeated entries, then the problem is slightly different. As a shameless plug, this more general problem is addressed by myself, Diaconis and Soundararajan in Appendix B of A Rule of Thumb for Riffle Shuffling (2011).
Let's say
a = 1/N
b = 1-a
Bi(k) is the probability matrix after i swaps for the kth element. i.e the answer to the question "where is k after i swaps?". For example B0(3) = (0 0 1 0 ... 0) and B1(3) = (a 0 b 0 ... 0). What you want is BN(k) for every k.
Ki is an NxN matrix with 1s in the i-th column and i-th row, zeroes everywhere else, e.g:
Ii is the identity matrix but with the element x=y=i zeroed. E.g for i=2:
Ai is the transition matrix of the i-th swap: Ai = b·Ii + a·Ki
Then, BN(k) = B0(k)·A1·A2·...·AN
But because B0(k=1..N) forms the identity matrix, the probability that any given element i will at the end be at position j is given by the matrix element (i,j) of the matrix A1·A2·...·AN:
For example, for N=4:
As a diagram for N = 500 (color levels are 100*probability):
The pattern is the same for all N>2:
The most probable ending position for k-th element is k-1.
The least probable ending position is k for k < N*ln(2), position 1 otherwise
I knew I had seen this question before...
" why does this simple shuffle algorithm produce biased results? what is a simple reason? " has a lot of good stuff in the answers, especially a link to a blog by Jeff Atwood on Coding Horror.
As you may have already guessed, based on the answer by @belisarius, the exact distribution is highly dependent on the number of elements to be shuffled. Here's Atwood's plot for a 6-element deck:
What a lovely question! I wish I had a full answer.
Fisher-Yates is nice to analyze because once it decides on the first element, it leaves it alone. The biased one can repeatedly swap an element in and out of any place.
We can analyze this the same way we would a Markov chain, by describing the actions as stochastic transition matrices acting linearly on probability distributions. Most elements get left alone, so the diagonal is usually (n-1)/n. On pass k, when they don't get left alone, they get swapped with element k (or with a random element if they are element k). This is 1/n in either row or column k. The element in both row and column k is also 1/n. It's easy enough to multiply these matrices together for k going from 1 to n.
We do know that the element in last place will be equally likely to have originally been anywhere, because the last pass swaps the last place equally likely with any other. Similarly, the first element will be equally likely to be placed anywhere. This symmetry is because the transpose reverses the order of matrix multiplication. In fact, the matrix is symmetric in the sense that row i is the same as column (n+1 - i). Beyond that, the numbers don't show much apparent pattern. These exact solutions do show agreement with the simulations run by belisarius: in slot i, the probability of getting j decreases as j rises to i, reaching its lowest value at i-1, and then jumping up to its highest value at i, and decreasing until j reaches n.
In Mathematica I generated each step with
step[k_, n_] := Normal[SparseArray[{{k, i_} -> 1/n,
{j_, k} -> 1/n, {i_, i_} -> (n - 1)/n} , {n, n}]]
(I haven't found it documented anywhere, but the first matching rule is used.)
The final transition matrix can be calculated with:
Fold[Dot, IdentityMatrix[n], Table[step[m, n], {m, s}]]
ListDensityPlot is a useful visualization tool.
Edit (by belisarius)
Just a confirmation. The following code gives the same matrix as in @Eelvex's answer:
step[k_, n_] := Normal[SparseArray[{{k, i_} -> (1/n),
{j_, k} -> (1/n), {i_, i_} -> ((n - 1)/n)}, {n, n}]];
r[n_, s_] := Fold[Dot, IdentityMatrix[n], Table[step[m, n], {m, s}]];
Last@Table[r[4, i], {i, 1, 4}] // MatrixForm
Wikipedia's page on the Fisher-Yates shuffle has a description and example of exactly what will happen in that case.
You can compute the distribution using stochastic matrices. Let the matrix A(i,j) describe the probability of the card originally at position i ending up in position j. Then the kth swap has a matrix Ak given by Ak(i,j) = 1/N if i == k or j == k, (the card in position k can end up anywhere and any card can end up at position k with equal probability), Ak(i,i) = (N - 1)/N for all i != k (every other card will stay in the same place with probability (N-1)/N) and all other elements zero.
The result of the complete shuffle is then given by the product of the matrices AN ... A1.
I expect you're looking for an algebraic description of the probabilities; you can get one by expanding out the above matrix product, but I imagine it will be fairly complex!
UPDATE: I just spotted wnoise's equivalent answer above! oops...
I've looked into this further, and it turns out that this distribution has been studied at length. The reason it's of interest is because this "broken" algorithm is (or was) used in the RSA chip system.
In Shuffling by semi-random transpositions, Elchanan Mossel, Yuval Peres, and Alistair Sinclair study this and a more general class of shuffles. The upshot of that paper appears to be that it takes log(n) broken shuffles to achieve near random distribution.
In The bias of three pseudorandom shuffles (Aequationes Mathematicae, 22, 1981, 268-292), Ethan Bolker and David Robbins analyze this shuffle and determine that the total variation distance to uniformity after a single pass is 1, indicating that it is not very random at all. They give asymptotic analyses as well.
Finally, Laurent Saloff-Coste and Jessica Zuniga found a nice upper bound in their study of inhomogeneous Markov chains.
This question is begging for an interactive visual matrix diagram analysis of the broken shuffle mentioned. Such a tool is on the page Will It Shuffle? - Why random comparators are bad by Mike Bostock.
Bostock has put together an excellent tool that analyzes random comparators. In the dropdown on that page, choose naïve swap (random ↦ random) to see the broken algorithm and the pattern it produces.
His page is informative as it allows one to see the immediate effects a change in logic has on the shuffled data. For example:
This matrix diagram using a non-uniform and very-biased shuffle is produced using a naïve swap (we pick from "1 to N") with code like this:
function shuffle(array) {
  var n = array.length, i = -1, j, t; // declare t so it does not leak as a global
  while (++i < n) {
    j = Math.floor(Math.random() * n); // naive: j ranges over the whole array
    t = array[j];
    array[j] = array[i];
    array[i] = t;
  }
}
But if we implement a non-biased shuffle, where we pick from "k to N" we should see a diagram like this:
where the distribution is uniform, and is produced from code such as:
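function FisherYatesDurstenfeldKnuthshuffle( array ) {
  var pickIndex, temp, arrayPosition = array.length;
  // Walk from the last index down to 1; the "> 0" test also terminates on an empty array.
  while( --arrayPosition > 0 ) {
    // Pick uniformly from the not-yet-fixed positions 0..arrayPosition.
    pickIndex = Math.floor( Math.random() * ( arrayPosition + 1 ) );
    temp = array[ arrayPosition ];
    array[ arrayPosition ] = array[ pickIndex ];
    array[ pickIndex ] = temp;
  }
}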
The excellent answers given so far are concentrating on the distribution, but you have asked also "What happens if you make this mistake?" - which is what I haven't seen answered yet, so I'll give an explanation on this:
The Knuth-Fisher-Yates shuffle algorithm picks 1 out of n elements, then 1 out of n-1 remaining elements and so forth.
You can implement it with two arrays a1 and a2, where you remove one element from a1 and insert it into a2, but the algorithm does it in place (which means that it needs only one array). This is explained very well here (Google: "Shuffling Algorithms Fisher-Yates DataGenetics").
If you don't remove the elements, they can be randomly chosen again, which produces the biased randomness. This is exactly what the 2nd example you are describing does. The first example, the Knuth-Fisher-Yates algorithm, uses a cursor variable k running from 1 to N and only picks from positions k to N. The cursor remembers which elements have already been taken, so no element is picked more than once.
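The two-array view and the in-place view can be put side by side. This is a minimal sketch, not taken from the linked article. The function names are made up for illustration, and `rng` is an assumed parameter that accepts either the `random` module or a seeded `random.Random` instance.

```python
import random


def shuffle_two_arrays(items, rng=random):
    """Draw without replacement: move a random remaining element from a1 to a2."""
    a1 = list(items)
    a2 = []
    while a1:
        j = rng.randrange(len(a1))  # pick among the elements not yet taken
        a2.append(a1.pop(j))
    return a2


def shuffle_in_place(a, rng=random):
    """Same idea in one array: positions before k hold the elements already taken."""
    for k in range(len(a)):
        j = rng.randrange(k, len(a))  # only the untaken positions k..N-1 are eligible
        a[k], a[j] = a[j], a[k]
    return a


if __name__ == "__main__":
    print(shuffle_two_arrays(range(10)))
    print(shuffle_in_place(list(range(10))))
```

In the in-place version the prefix `a[0..k-1]` plays the role of a2 and the suffix `a[k..N-1]` plays the role of a1, which is why no second array is needed.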
Given N points on a 2D plane, determine if there is a line that divides them into two sets of N / 2 points each.
There are two more rules:
The sum of the distances of each set of points to this line should be the same.
The line can't pass through any of the points.
Extras (not sure if helps):
We can assume that N is large (~100k); -2000 <= x[i], y[i] <= 2000
Do you folks have any insights into this problem? I have tried many things, but I believe that I should use some sort of equality, or prove something like: sum(distancesSet1[i]) = sum(distancesSet2[i]).
If you want, I can also post here the stuff that I tried and failed (or I think it failed), but before I'd like to see your suggestions.
Thank you so much!
Edit:
What I need to know for this problem is to exactly say whether it's possible or not given the set of N points.
Update: This was an attempt to answer the initial, more general question of whether it was possible to divide the points or not.
The problem as defined by your constraints is mathematically unsolvable. You can't guarantee that the sums of the distances will be equal for both sets.
All you need as proof is a counterexample:
S = [[-1000,0], [0,0], [1,0], [2,0]]
There is only one possible combination to separate the pairs:
S1 = [[-1000,0], [0,0]]
S2 = [[1,0], [2,0]]
All points are on a line L1. Given your bullet #2 we can conclude that any line L2 that separates those points will form an angle t with respect to L1. The sums of the distances are then:
sum1 = a*sin(t) :: 1000 < a < 1002
sum2 = b*sin(t) :: 1 < b < 3
t != 0
sum1 > sum2
QED
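The inequality can be spot-checked numerically. This is a small sketch under the assumption that the separating line crosses the x-axis at (c, 0) with 0 < c < 1 and makes an angle t (in radians, 0 < t < pi) with it. The function name and the parametrization are illustrative, not part of the original answer.

```python
import math


def distance_sums(c, t):
    """Sums of distances from each pair to the line through (c, 0) at angle t to the x-axis."""
    # For a point (x, 0) on the x-axis, the distance to that line is |x - c| * sin(t).
    left = [-1000, 0]
    right = [1, 2]
    sum1 = sum(abs(x - c) * math.sin(t) for x in left)
    sum2 = sum(abs(x - c) * math.sin(t) for x in right)
    return sum1, sum2


if __name__ == "__main__":
    for c in (0.01, 0.5, 0.99):
        for t in (0.1, math.pi / 2, 3.0):
            print(c, t, distance_sums(c, t))
```

With this parametrization sum1 = (1000 + 2c)·sin(t) and sum2 = (3 - 2c)·sin(t), which matches the ranges 1000 < a < 1002 and 1 < b < 3 given above.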
Given two sorted arrays of integers A1, A2, with the same length n and an integer x, I need to write an algorithm that runs in O(nlog(n)) that determines whether there exist two elements a1, a2 (one element in each array) that make a1+a2=x.
At first I thought about having two index iterators i1=0, i2=0 (one for each array) that start from 0 and increase one at a time, depending on the next element of A1 being bigger/smaller than the next element of A2. But after testing it on two arrays I found out that it might miss some possible solutions...
Well, as they are both sorted already, the algorithm should be O(n) (sorting would be O(n * log(n))):
i1 = 0
i2 = A2.size - 1
while i1 < A1.size and i2 >= 0
    if A1[i1] + A2[i2] < x
        ++i1
    else if A1[i1] + A2[i2] > x
        --i2
    else
        return true
return false
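The same two-index scan can be written as a runnable Python sketch. It assumes both lists are sorted in ascending order, and the function name is chosen for illustration.

```python
def has_pair_with_sum(a1, a2, x):
    """Return True if some a1[i] + a2[j] == x, for a1 and a2 sorted ascending."""
    i, j = 0, len(a2) - 1
    while i < len(a1) and j >= 0:
        s = a1[i] + a2[j]
        if s < x:
            i += 1      # sum too small: only a larger a1 element can help
        elif s > x:
            j -= 1      # sum too large: only a smaller a2 element can help
        else:
            return True
    return False


if __name__ == "__main__":
    print(has_pair_with_sum([1, 3, 5], [2, 4, 6], 9))
    print(has_pair_with_sum([1, 3, 5], [2, 4, 6], 10))
```

Each iteration advances one of the two indices, so the loop runs at most 2n times.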
This is a strange question because there is an inelegant solution in time O(N Lg N) (for every element a1 of A1, look up x-a1 in A2 by binary search), and a nice one requiring only O(N) operations.
Start from the left of A1 and the right of A2 and move left in A2 as long as a1+a2≥x. Then move right one position in A1 and update in A2 if needed...
Start one index i at 0 in the first array, and the other index j at the end of the second array, running in reverse.
At the first step you have the smallest element of list A and the biggest element of list B, so subtract A[i] from x, then move the j index down until B[j] <= x - A[i].
Basically you have 2 indexes moving towards each other.
If either index runs off the end of its array, then there is no pair that sums to x.
I have to design an algorithm with running time O(nlogn) for the following problem:
Given a set P of n points, determine a value A > 0 such that the shear transformation (x,y) -> (x+Ay,y) does not change the order (in x direction) of points with unequal x-coordinates.
I am having a lot of difficulty even figuring out where to begin.
Any help with this would be greatly appreciated!
Thank you!
I think y = 0.
When x = 0, A > 0
(x,y) -> (x+Ay,y)
-> (0+(A*0),0) = (0,0)
When x = 1, A > 0
(x,y) -> (x+Ay,y)
-> (1+(A*0),0) = (1,0)
with unequal x-coordinates, (2,0), (3,0), (4,0)...
So, I think that the begin point may be (0,0), x=0.
Suppose all x,y coordinates are positive numbers. (Without loss of generality, one can add offsets.) In time O(n log n), sort a list L of the points, primarily in ascending order by x coordinates and secondarily in ascending order by y coordinates. In time O(n), process point pairs (in L order) as follows. Let p, q be any two consecutive points in L, and let px, qx, py, qy denote their x and y coordinate values. From there you just need to consider several cases and it should be obvious what to do: If px=qx, do nothing. Else, if py<=qy, do nothing. Else (px<qx, py>qy) require that px + A*py < qx + A*qy, i.e. (qx-px)/(py-qy) > A.
So: Go through L in order, and find the largest bound A' that respects this requirement for all point pairs where px<qx and py>qy, i.e. the minimum of (qx-px)/(py-qy) over those pairs. Then choose a value of A that's a little less than A', for example, A'/2. (Or, if the object of the problem is to find the largest such A, just report the A' value as the bound.)
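A sketch of this procedure in Python, under the reading that a consecutive pair p, q with px < qx and py > qy requires A < (qx - px)/(py - qy). The names `max_shear` and `preserves_order` are illustrative. The second function is only a brute-force O(n^2) checker for testing, not part of the O(n log n) algorithm.

```python
def max_shear(points):
    """Return the bound A' (any 0 < A < A' is valid), or None if every A > 0 works."""
    pts = sorted(points)  # ascending by x, ties broken by ascending y
    bound = None
    for (px, py), (qx, qy) in zip(pts, pts[1:]):
        if px < qx and py > qy:
            # Need px + A*py < qx + A*qy, i.e. A < (qx - px) / (py - qy).
            limit = (qx - px) / (py - qy)
            bound = limit if bound is None else min(bound, limit)
    return bound


def preserves_order(points, a):
    """Brute-force check that the shear keeps the x-order of every pair with unequal x."""
    return all(
        px + a * py < qx + a * qy
        for px, py in points
        for qx, qy in points
        if px < qx
    )


if __name__ == "__main__":
    pts = [(0, 10), (1, 0), (1, 50), (2, 0), (3, 7)]
    bound = max_shear(pts)
    print(bound, preserves_order(pts, bound / 2))
```

Sorting ties by ascending y is what makes consecutive pairs sufficient: within a group of equal x, the lowest point is compared with the left neighbour and the highest point with the right neighbour, which are the tightest cases.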
Ok, here's a rough stab at a method.
Sort the list of points by x order. (This gives the O(nlogn)--all the following steps are O(n).)
Generate a new list of dx_i = x_(i+1) - x_i, the differences between the x coordinates. As the x_i are ordered, all of these dx_i >= 0.
Now for some A, the transformed dx_i(A) will be x_(i+1) - x_i + A * (y_(i+1) - y_i). There will be an order change if this is negative or zero (x_(i+1)(A) <= x_i(A)).
So for each dx_i, find the value of A that would make dx_i(A) zero, namely
A_i = - (x_(i+1) - x_i)/(y_(i+1) - y_i). You now have a list of coefficients that would 'cause' an order swap between a consecutive (in x-order) pair of points. Watch for division by zero, but that's the case where two points have the same y, these points will not change order. Some of the A_i will be negative, discard these as you want A>0. (Negative A_i will also induce an order swap, so the A>0 requirement is a little arbitrary.)
Find the smallest A_i > 0 in the list. Then any A with 0 < A < A_i(min) will be a shear that does not change the order of your points. Picking A_i(min) itself would bring two points to the same x, but not past each other.
There are two integer sequences A[] and B[] of length N,both unsorted.
Requirement: by swapping elements between A[] and B[] (any elements may be exchanged, not only those with the same index), make the difference between {the sum of all elements in A[]} and {the sum of all elements in B[]} as small as possible.
PS: actually,it is an interview question I encountered.
Many thanks
This is going to be NP-hard! I believe you can do a reduction from Subset Sum to this.
As per BlueRaja/polygene's comments, I will try to provide a full reduction from Subset Sum.
Here is a reduction:
Subset Sum problem: Given integers x1, x2, ..., xn, is there some non-empty subset which sums to zero?
Our problem: Given two integer arrays of size k, find the minimum possible difference of the sum of the two arrays, assuming we can shuffle around the integers in the arrays, treating both arrays as one array.
Say we had a polynomial time algo for our problem.
Say now you are given integers T = {x1,x2, ...,xn} (multiset)
Let Si = x1 + x2 + ...+ xn + xi.
Let Ti = {x1, x2, ..., xi-1, xi+1, ..., xn } ( = T - xi)
Define
Ai = Array formed using Ti
Bi = [Si, 0, ..., 0] (i.e one element is Si and rest are zeroes).
Let mi = the min difference found by our problem for arrays Ai and Bi
(we run our problem n times).
Claim: Some non-empty subset of T sums to zero if and only if, there is some i, for which mi = 0.
Proof: (wlog) say x1 + x2 + .. + xk = 0
Then
A = [xk+1, ..., xn, 0, ...0]
B = [x2, x3, ..., xk, S1, 0, ..0]
gives the minimum difference m1 to be |x2 + .. + xk + (x1 + ... + xn) + x1 - (xk+1 + .. + xn)| = |2(x1+ x2 + .. xk)| = 0.
Similarly the if part can be proved.
In fact, this actually also follows (more easily) from Partition too: just create new array with all zeroes.
Hopefully I haven't made any mistakes.
Take any instance of the NP-complete partition problem:
Partition a multiset A of positive integers into two multisets B and C with the same sum
like {a1,a2,...,an}. Add n zeroes {0,0,0...,0,a1,...,an} and ask if the set can be partitioned into two multisets A and B with the same sum and same number of elements. I claim these two conditions are equivalent:
If A and B are a solution to the problem, then you can strike out the zeroes and get a solution of the partition problem.
If there is a solution to the partition problem, for example ai1 + ai2 + ... aik = aj1 + ... +ajl where {ai1, ai2, aik, aj1, ..., ajl} = {a1, ... , an} then obviously k+l = n. Add l zeroes to the left side and k zeroes to the right side and you'll get 0 + ... + 0 + ai1 + ai2 + ... aik = 0 + ... + 0 + aj1 + ... +ajl, which is a solution of your problem.
So, this is a reduction (so the problem is NP-hard) and the problem is NP, so it is NP-complete.
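Given the hardness result, an exhaustive search is a reasonable baseline for small N. The following sketch (the function name is illustrative, and both arrays are assumed to have the same length) tries every way of choosing which N of the 2N values end up in A.

```python
from itertools import combinations


def min_sum_difference(a, b):
    """Smallest |sum(A) - sum(B)| over all splits of the 2N values into two arrays of size N."""
    pool = list(a) + list(b)
    total = sum(pool)
    n = len(a)
    # Choosing which N values go to A fixes B; the difference is |total - 2 * sum(A)|.
    return min(
        abs(total - 2 * sum(pool[i] for i in idx))
        for idx in combinations(range(len(pool)), n)
    )


if __name__ == "__main__":
    print(min_sum_difference([1, 2, 3], [10, 20, 30]))
```

The number of splits grows as C(2N, N), so this is only practical for checking heuristics on tiny inputs.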
"sequences A[] and B[] of length N" -> does this mean both A and B are each of length N?
(For the purpose of clarity I am using 1-based arrays below).
If so, how about this:
Assume A[1..N] and B[1..N]
1. Concatenate A and B into a new array C of length 2N: C[1..N] <- A[1..N]; C[N+1 .. 2N] <- B[1..N]
2. Sort C in ascending order.
3. Take the first pair of numbers from C; send the first element (C[1]) to A[1] and the second element (C[2]) to B[1]
4. Take the second pair of numbers from C; this time send the second element (C[4]) to A[2] and the first element (C[3]) to B[2] (the order of elements in the pair sent to A and B is the opposite of step 3)
5. ... repeat steps 3 and 4 until C is exhausted
The observation here is that, in a sorted array, an adjacent pair of numbers will have the smallest difference (compared to a pair of numbers from non-adjacent positions). Step 3 ensures that A[1] and B[1] consists of a pair of numbers with the least possible difference. Step 4 ensures that (a) A[2] and B[2] consist of a pair of numbers with the least possible difference (from the available numbers) and also (b) that the difference is opposite in sign from step 3. By continuing like this, we are ensuring that A[i] and B[i] contain numbers with the least possible difference. Also, by flipping the order in which we send elements to A and B, we are ensuring that the difference changes sign for each successive i.
Try being greedy about it. Given such limited information, I'm not sure what else one could put out there.
I'm not sure that this will ensure the minimum possible distance, but the first thing that comes to my mind is something like this:
int diff = 0;
for (int i = 0; i < len; i++) {
    int x = a[i] - b[i];
    // Swap when subtracting x leaves the running difference closer to zero.
    if (abs(diff - x) < abs(diff + x)) {
        swap(a, b, i);
        diff -= x;
    } else {
        diff += x;
    }
}
assuming that you have a swap function which takes the two arrays and exchanges the items at position i :)
By computing and adding the difference between the two values at position i, you get the incremental difference between the sums of the elements of the two arrays.
At each step you check whether it's better to add (a[i]-b[i]) or (b[i]-a[i]). If b[i]-a[i] is the better choice, you swap the elements at position i in the arrays.
Maybe this will not be the best way, but it should be a start :)
The problem is NP-Complete.
We can reduce the partition problem to the decision version of this problem, i.e. given two arrays of ints of the same size, determine whether items can be swapped so that the sums are equal.
The input to the partition problem: a set S of integers, of size N
In order to transform this input into an input to our problem, we define A to be an array of all items in S, and B an array of the same size, with B[i]=0 for all i. This transformation is linear in the input size.
It is clear that our algorithm applied on A and B returns true if and only if there is a partition of S into 2 subsets such that the sums are equal.