Let's say I have a given length c and I need to cut out several pieces of different lengths a{i}, where i is the index of a specific piece. The length of every piece is smaller or equal to the length c. I need to find all possible permutations of cutting patterns.
Does someone has a smart approach for such tasks or an algorithm to solve this?
The function could look something similar to this:
Pattern[] getPatternList(double.. a, double c);
The input is hence a list of different sizes and the total available space. My goal is to optimize/minimize the trim loss.
I'll use the simplex algorithm for that but to create an linear programming model, I need a smart way to determine all the cutting patterns.
There are exponentially many cutting-patterns in general. So it might not be feasible to construct them all (time and memory)
If you need to optimize some cutting based on some objective, enumerating all possible cuttings is a bad approach (like #harold mentioned)
A bad analogy (which does not exactly apply here as your base-problem is np-hard):
solving 2-SAT is possible in polynomial-time
enumerating all 2-SAT solutions is Sharp-P-complete (an efficient algorithm would imply P=NP, so there might be none!)
A simple approach (to generate all valid cutting-patterns):
Generate all permutations if items = ordering of items (bounded by !n)
Place them one after another and stop if c is exceeded
(It would be a good idea to do this incrementally; build one permutation after another)
Assumption: each item can only be selected once
Assumption: moving/shifting a cut within a free range does not generate a new solution. It it would: solution-space is possibly an uncountably infinite set
edit
Code
Here is a more powerful approach handling the problem with the same assumptions as described above. It uses integer-programming to minimize the trim-loss, implemented in python with the use of cvxpy (and a commercial-solver; can be replaced by an open-source solver like cbc):
import numpy as np
from cvxpy import *
np.random.seed(1)
# random problem
SPACE = 25000
N_ITEMS = 10000
items = np.random.randint(0, 10, size=N_ITEMS)
def minimize_loss(items, space):
N = items.shape[0]
X = Bool(N)
constraint = [sum_entries(mul_elemwise(items, X)) <= space]
objective = Minimize(space - sum_entries(mul_elemwise(items, X)))
problem = Problem(objective, constraint)
problem.solve(solver=GUROBI, verbose=True)
print('trim-loss: ', problem.value)
print('validated trim-loss: ', space - sum(np.dot(X.value.flatten(), items)))
print('# selected items: ', np.count_nonzero(np.round(X.value)))
print('items: ', items)
print('space: ', SPACE)
minimize_loss(items, SPACE)
Output
items: [5 8 9 ..., 5 3 5]
space: 25000
Parameter OutputFlag unchanged
Value: 1 Min: 0 Max: 1 Default: 1
Changed value of parameter QCPDual to 1
Prev: 0 Min: 0 Max: 1 Default: 0
Optimize a model with 1 rows, 10000 columns and 8987 nonzeros
Coefficient statistics:
Matrix range [1e+00, 9e+00]
Objective range [1e+00, 9e+00]
Bounds range [1e+00, 1e+00]
RHS range [2e+04, 2e+04]
Found heuristic solution: objective -25000
Presolve removed 1 rows and 10000 columns
Presolve time: 0.01s
Presolve: All rows and columns removed
Explored 0 nodes (0 simplex iterations) in 0.01 seconds
Thread count was 1 (of 4 available processors)
Optimal solution found (tolerance 1.00e-04)
Best objective -2.500000000000e+04, best bound -2.500000000000e+04, gap 0.0%
trim-loss: 0.0
validated trim-loss: [[ 0.]]
# selected items: 6516
edit v2
After read your new comments, it is clear, that your model-description was incomplete/imprecise and nothing above tackles the problem you want to solve. It's a bit sad.
You will need to enumerate all permutations of a, and then take the longest prefix that has length less than or equal to c.
This sounds like a version of the knapsack problem (https://en.wikipedia.org/wiki/Knapsack_problem), and nobody knows an efficient way to do this.
Related
I have this question :
Airline company has N different planes and T pilots. Every pilot has a list of planes he can fly. Every flight needs 2 pilots. The company want to have as much flights simultaneously as possible. Find an algorithm that finds if you can have all the flights simultaneously.
This is the solution I thought about is finding max flow on this graph:
I am just not sure what the capacity should be. Can you help me with that?
Great idea to find the max flow.
For each edge from source --> pilot, assign a capacity of 1. Each pilot can only fly one plane at a time since they are running simultaneously.
For each edge from pilot --> plane, assign a capacity of 1. If this edge is filled with flow of 1, it represents that the given pilot is flying that plane.
For each edge from plane --> sink, assign a capacity of 2. This represents that each plane must be supplied by exactly 2 pilots.
Now, find a maximum flow. If the resulting maximum flow is two times the number of planes, then it's possible to satisfy the constraints. In this case, the edges between planes and pilots that are at capacity represent the matching.
The other answer is fine but you don't really need to involve flow as this can be reduced just as well to ordinary maximum bipartite matching:
For each plane, add another auxiliary plane to the plane partition with edges to the same pilots as the first plane.
Find a maximum bipartite matching M.
The answer is now true if and only if M = 2 N.
If you like, you can think of this as saying that each plane needs a pilot and a co-pilot, and the two vertices associated to each plane now represents those two roles.
The reduction to maximum bipartite matching is linear time, so using e.g. the Hopcroft–Karp algorithm to find the matching, you can solve the problem in O(|E| √|V|) where E is the number of edges between the partitions, and V = T + N.
In practice, the improvement over using a maximum flow based approach should depend on the quality of your implementations as well as the particular choice of representation of the graph, but chances are that you're better off this way.
Implementation example
To illustrate the last point, let's give an idea of how the two reductions could look in practice. One representation of a graph that's often useful due to its built-in memory locality is that of a CSR matrix, so let us assume that the input is such a matrix, whose rows correspond to the planes, and whose columns correspond to the pilots.
We will use the Python library SciPy which comes with algorithms for both maximum bipartite matching and maximum flow, and which works with CSR matrix representations for graphs under the hood.
In the algorithm given above, we will then need to construct the biadjacency matrix of the graph with the additional vertices added. This is nothing but the result of stacking the input matrix on top of itself, which is straightforward to phrase in terms of the CSR data structures: Following Wikipedia's notation, COL_INDEX should just be repeated, and ROW_INDEX should be replaced with ROW_INDEX concatenated with a copy of ROW_INDEX in which all elements are increased by the final element of ROW_INDEX.
In SciPy, a complete implementation which answers yes or no to the problem in OP would look as follows:
import numpy as np
from scipy.sparse.csgraph import maximum_bipartite_matching
def reduce_to_max_matching(a):
i, j = a.shape
data = np.ones(a.nnz * 2, dtype=bool)
indices = np.concatenate([a.indices, a.indices])
indptr = np.concatenate([a.indptr, a.indptr[1:] + a.indptr[-1]])
graph = csr_matrix((data, indices, indptr), shape=(2*i, j))
return (maximum_bipartite_matching(graph) != -1).sum() == 2 * i
In the maximum flow approach given by #HeatherGuarnera's answer, we will need to set up the full adjacency matrix of the new graph. This is also relatively straightforward; the input matrix will appear as a certain submatrix of the adjacency matrix, and we need to add a row for the source vertex and a column for the target. The example section of the documentation for SciPy's max flow solver actually contains an illustration of what this looks like in practice. Adopting this, a complete solution looks as follows:
import numpy as np
from scipy.sparse.csgraph import maximum_flow
def reduce_to_max_flow(a):
i, j = a.shape
n = a.nnz
data = np.concatenate([2*np.ones(i, dtype=int), np.ones(n + j, dtype=int)])
indices = np.concatenate([np.arange(1, i + 1),
a.indices + i + 1,
np.repeat(i + j + 1, j)])
indptr = np.concatenate([[0],
a.indptr + i,
np.arange(n + i + 1, n + i + j + 1),
[n + i + j]])
graph = csr_matrix((data, indices, indptr), shape=(2+i+j, 2+i+j))
flow = maximum_flow(graph, 0, graph.shape[0]-1)
return flow.flow_value == 2*i
Let us compare the timings of the two approaches on a single example consisting of 40 planes and 100 pilots, on a graph whose edge density is 0.1:
from scipy.sparse import random
inp = random(40, 100, density=.1, format='csr', dtype=bool)
%timeit reduce_to_max_matching(inp) # 191 µs ± 3.57 µs per loop
%timeit reduce_to_max_flow(inp) # 1.29 ms ± 20.1 µs per loop
The matching-based approach is faster, but not by a crazy amount. On larger problems, we'll start to see the advantages of using matching instead; with 400 planes and 1000 pilots:
inp = random(400, 1000, density=.1, format='csr', dtype=bool)
%timeit reduce_to_max_matching(inp) # 473 µs ± 5.52 µs per loop
%timeit reduce_to_max_flow(inp) # 68.9 ms ± 555 µs per loop
Again, this exact comparison relies on the use of specific predefined solvers from SciPy and how those are implemented, but if nothing else, this hints that simpler is better.
I have a problem statement which says: if you have an array of elements {x1,x2,x3,...x10}, find the combination of elements such that it just sums up above a threshold value (say the threshold value is 100).
So if there exists a combination like x2+x5+x8 = 105, x3+x5+x8=103, and x4+x5 = 101, then the algorithm should output X4, X5.
The knapsack algorithm emits a value that is near but on the lesser side of the threshold (which is 100 here). I want the opposite, that is the smallest sum of selected elements that is greater than 100.
Is there any set of algorithms or any special case of any algorithm which might solve this problem?
I'll start out by noting that you are asking for the smallest value strictly greater than some target. In general "strictly greater than" and "strictly less than" constraints are much harder than "greater than or equal to" or "less than or equal to" constraints. If you have all integer values, then you could simply translate your constraint "the sum exceeds 100" to "the sum is greater than or equal to 101". I'll assume that you've made such a transformation for the rest of the problem.
One approach would be to treat this as an integer optimization problem, in which the binary decision variable y_i for each number is whether or not we include it. Then our goal is to minimize the sum of the numbers, which can be modeled as:
min x_1*y_1 + x_2*y_2 + ... + x_n*y_n
The constraint in this case is that the sum of the numbers is at least 100:
x_1*y_1 + x_2*y_2 + ... + x_n*y_n >= 100
In general this is a hard problem (note that it is at least as hard as the subset sum problem, which is NP-complete). However modern optimization solvers may be efficient enough for your problem instances.
To test the scalability of a free solver for this problem, consider the following implementation with the lpSolve package in R (it returns the selected subset if the problem is feasible and NA otherwise):
library(lpSolve)
min.subset <- function(x, min.sum) {
mod <- lp("min", x, matrix(x, nrow=1), ">=", min.sum, all.bin=TRUE)
if (mod$status == 0) {
which(mod$solution >= 0.999)
} else {
NA
}
}
min.subset(1:10, 43.5)
# [1] 2 3 4 5 6 7 8 9
min.subset(1:10, 88)
# [1] NA
To test the scalability, I'll select n elements randomly from [1, 2, ..., 1000], setting the target sum to be half the sum of the elements. The runtimes were:
With n=100, it ran in 0.01 seconds
With n=1000, it ran in 0.1 seconds
With n=10000, it ran in 8.7 seconds
It appears you can solve this problem for more than 10k elements (with the selected distribution) without too many computational challenges. If your problem is too big for the free solver I've used here, you might consider Gurobi or cplex, two commercial solvers that are free for academic use but otherwise not free.
Suppose X is the sum of all x_i. Then equivalently, you are asking for a minimum subset of your x_i that sum up to at most X - 100 (as the complement of these x_i will be the optimum solution to your problem). So all Knapsack theory can be applied here.
In practice (really large instances), I'd suggest this form of Nemhauser-Ullman generalization which can solve instances with millions of objects.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
Improve this question
I'm trying to find a way to find similarities in two arrays of different points. I drew circles around points that have similar patterns and I would like to do some kind of auto comparison in intervals of let's say 100 points and tell what coefficient of similarity is for that interval. As you can see it might not be perfectly aligned also so point-to-point comparison would not be a good solution also (I suppose). Patterns that are slightly misaligned could also mean that they are matching the pattern (but obviously with a smaller coefficient)
What similarity could mean (1 coefficient is a perfect match, 0 or less - is not a match at all):
Points 640 to 660 - Very similar (coefficient is ~0.8)
Points 670 to 690 - Quite similar (coefficient is ~0.5-~0.6)
Points 720 to 780 - Let's say quite similar (coefficient is ~0.5-~0.6)
Points 790 to 810 - Perfectly similar (coefficient is 1)
Coefficient is just my thoughts of how a final calculated result of comparing function could look like with given data.
I read many posts on SO but it didn't seem to solve my problem. I would appreciate your help a lot. Thank you
P.S. Perfect answer would be the one that provides pseudo code for function which could accept two data arrays as arguments (intervals of data) and return coefficient of similarity.
Click here to see original size of image
I also think High Performance Mark has basically given you the answer (cross-correlation). In my opinion, most of the other answers are only giving you half of what you need (i.e., dot product plus compare against some threshold). However, this won't consider a signal to be similar to a shifted version of itself. You'll want to compute this dot product N + M - 1 times, where N, M are the sizes of the arrays. For each iteration, compute the dot product between array 1 and a shifted version of array 2. The amount you shift array 2 increases by one each iteration. You can think of array 2 as a window you are passing over array 1. You'll want to start the loop with the last element of array 2 only overlapping the first element in array 1.
This loop will generate numbers for different amounts of shift, and what you do with that number is up to you. Maybe you compare it (or the absolute value of it) against a threshold that you define to consider two signals "similar".
Lastly, in many contexts, a signal is considered similar to a scaled (in the amplitude sense, not time-scaling) version of itself, so there must be a normalization step prior to computing the cross-correlation. This is usually done by scaling the elements of the array so that the dot product with itself equals 1. Just be careful to ensure this makes sense for your application numerically, i.e., integers don't scale very well to values between 0 and 1 :-)
i think HighPerformanceMarks's suggestion is the standard way of doing the job.
a computationally lightweight alternative measure might be a dot product.
split both arrays into the same predefined index intervals.
consider the array elements in each intervals as vector coordinates in high-dimensional space.
compute the dot product of both vectors.
the dot product will not be negative. if the two vectors are perpendicular in their vector space, the dot product will be 0 (in fact that's how 'perpendicular' is usually defined in higher dimensions), and it will attain its maximum for identical vectors.
if you accept the geometric notion of perpendicularity as a (dis)similarity measure, here you go.
caveat:
this is an ad hoc heuristic chosen for computational efficiency. i cannot tell you about mathematical/statistical properties of the process and separation properties - if you need rigorous analysis, however, you'll probably fare better with correlation theory anyway and should perhaps forward your question to math.stackexchange.com.
My Attempt:
Total_sum=0
1. For each index i in the range (m,n)
2. sum=0
3. k=Array1[i]*Array2[i]; t1=magnitude(Array1[i]); t2=magnitude(Array2[i]);
4. k=k/(t1*t2)
5. sum=sum+k
6. Total_sum=Total_sum+sum
Coefficient=Total_sum/(m-n)
If all values are equal, then sum would return 1 in each case and total_sum would return (m-n)*(1). Hence, when the same is divided by (m-n) we get the value as 1. If the graphs are exact opposites, we get -1 and for other variations a value between -1 and 1 is returned.
This is not so efficient when the y range or the x range is huge. But, I just wanted to give you an idea.
Another option would be to perform an extensive xnor.
1. For each index i in the range (m,n)
2. sum=1
3. k=Array1[i] xnor Array2[i];
4. k=k/((pow(2,number_of_bits))-1) //This will scale k down to a value between 0 and 1
5. sum=(sum+k)/2
Coefficient=sum
Is this helpful ?
You can define a distance metric for two vectors A and B of length N containing numbers in the interval [-1, 1] e.g. as
sum = 0
for i in 0 to 99:
d = (A[i] - B[i])^2 // this is in range 0 .. 4
sum = (sum / 4) / N // now in range 0 .. 1
This now returns distance 1 for vectors that are completely opposite (one is all 1, another all -1), and 0 for identical vectors.
You can translate this into your coefficient by
coeff = 1 - sum
However, this is a crude approach because it does not take into account the fact that there could be horizontal distortion or shift between the signals you want to compare, so let's look at some approaches for coping with that.
You can sort both your arrays (e.g. in ascending order) and then calculate the distance / coefficient. This returns more similarity than the original metric, and is agnostic towards permutations / shifts of the signal.
You can also calculate the differentials and calculate distance / coefficient for those, and then you can do that sorted also. Using differentials has the benefit that it eliminates vertical shifts. Sorted differentials eliminate horizontal shift but still recognize different shapes better than sorted original data points.
You can then e.g. average the different coefficients. Here more complete code. The routine below calculates coefficient for arrays A and B of given size, and takes d many differentials (recursively) first. If sorted is true, the final (differentiated) array is sorted.
procedure calc(A, B, size, d, sorted):
if (d > 0):
A' = new array[size - 1]
B' = new array[size - 1]
for i in 0 to size - 2:
A'[i] = (A[i + 1] - A[i]) / 2 // keep in range -1..1 by dividing by 2
B'[i] = (B[i + 1] - B[i]) / 2
return calc(A', B', size - 1, d - 1, sorted)
else:
if (sorted):
A = sort(A)
B = sort(B)
sum = 0
for i in 0 to size - 1:
sum = sum + (A[i] - B[i]) * (A[i] - B[i])
sum = (sum / 4) / size
return 1 - sum // return the coefficient
procedure similarity(A, B, size):
sum a = 0
a = a + calc(A, B, size, 0, false)
a = a + calc(A, B, size, 0, true)
a = a + calc(A, B, size, 1, false)
a = a + calc(A, B, size, 1, true)
return a / 4 // take average
For something completely different, you could also run Fourier transform using FFT and then take a distance metric on the returning spectra.
Say S = 5 and N = 3 the solutions would look like - <0,0,5> <0,1,4> <0,2,3> <0,3,2> <5,0,0> <2,3,0> <3,2,0> <1,2,2> etc etc.
In the general case, N nested loops can be used to solve the problem. Run N nested loop, inside them check if the loop variables add upto S.
If we do not know N ahead of time, we can use a recursive solution. In each level, run a loop starting from 0 to N, and then call the function itself again. When we reach a depth of N, see if the numbers obtained add up to S.
Any other dynamic programming solution?
Try this recursive function:
f(s, n) = 1 if s = 0
= 0 if s != 0 and n = 0
= sum f(s - i, n - 1) over i in [0, s] otherwise
To use dynamic programming you can cache the value of f after evaluating it, and check if the value already exists in the cache before evaluating it.
There is a closed form formula : binomial(s + n - 1, s) or binomial(s+n-1,n-1)
Those numbers are the simplex numbers.
If you want to compute them, use the log gamma function or arbitrary precision arithmetic.
See https://math.stackexchange.com/questions/2455/geometric-proof-of-the-formula-for-simplex-numbers
I have my own formula for this. We, together with my friend Gio made an investigative report concerning this. The formula that we got is [2 raised to (n-1) - 1], where n is the number we are looking for how many addends it has.
Let's try.
If n is 1: its addends are o. There's no two or more numbers that we can add to get a sum of 1 (excluding 0). Let's try a higher number.
Let's try 4. 4 has addends: 1+1+1+1, 1+2+1, 1+1+2, 2+1+1, 1+3, 2+2, 3+1. Its total is 7.
Let's check with the formula. 2 raised to (4-1) - 1 = 2 raised to (3) - 1 = 8-1 =7.
Let's try 15. 2 raised to (15-1) - 1 = 2 raised to (14) - 1 = 16384 - 1 = 16383. Therefore, there are 16383 ways to add numbers that will equal to 15.
(Note: Addends are positive numbers only.)
(You can try other numbers, to check whether our formula is correct or not.)
This can be calculated in O(s+n) (or O(1) if you don't mind an approximation) in the following way:
Imagine we have a string with n-1 X's in it and s o's. So for your example of s=5, n=3, one example string would be
oXooXoo
Notice that the X's divide the o's into three distinct groupings: one of length 1, length 2, and length 2. This corresponds to your solution of <1,2,2>. Every possible string gives us a different solution, by counting the number of o's in a row (a 0 is possible: for example, XoooooX would correspond to <0,5,0>). So by counting the number of possible strings of this form, we get the answer to your question.
There are s+(n-1) positions to choose for s o's, so the answer is Choose(s+n-1, s).
There is a fixed formula to find the answer. If you want to find the number of ways to get N as the sum of R elements. The answer is always:
(N+R-1)!/((R-1)!*(N)!)
or in other words:
(N+R-1) C (R-1)
This actually looks a lot like a Towers of Hanoi problem, without the constraint of stacking disks only on larger disks. You have S disks that can be in any combination on N towers. So that's what got me thinking about it.
What I suspect is that there is a formula we can deduce that doesn't require the recursive programming. I'll need a bit more time though.
You have a biased random number generator that produces a 1 with a probability p and 0 with a probability (1-p). You do not know the value of p. Using this make an unbiased random number generator which produces 1 with a probability 0.5 and 0 with a probability 0.5.
Note: this problem is an exercise problem from Introduction to Algorithms by Cormen, Leiserson, Rivest, Stein.(clrs)
The events (p)(1-p) and (1-p)(p) are equiprobable. Taking them as 0 and 1 respectively and discarding the other two pairs of results you get an unbiased random generator.
In code this is done as easy as:
int UnbiasedRandom()
{
int x, y;
do
{
x = BiasedRandom();
y = BiasedRandom();
} while (x == y);
return x;
}
The procedure to produce an unbiased coin from a biased one was first attributed to Von Neumann (a guy who has done enormous work in math and many related fields). The procedure is super simple:
Toss the coin twice.
If the results match, start over, forgetting both results.
If the results differ, use the first result, forgetting the second.
The reason this algorithm works is because the probability of getting HT is p(1-p), which is the same as getting TH (1-p)p. Thus two events are equally likely.
I am also reading this book and it asks the expected running time. The probability that two tosses are not equal is z = 2*p*(1-p), therefore the expected running time is 1/z.
The previous example looks encouraging (after all, if you have a biased coin with a bias of p=0.99, you will need to throw your coin approximately 50 times, which is not that many). So you might think that this is an optimal algorithm. Sadly it is not.
Here is how it compares with the Shannon's theoretical bound (image is taken from this answer). It shows that the algorithm is good, but far from optimal.
You can come up with an improvement if you will consider that HHTT will be discarded by this algorithm, but in fact it has the same probability as TTHH. So you can also stop here and return H. The same is with HHHHTTTT and so on. Using these cases improves the expected running time, but are not making it theoretically optimal.
And in the end - python code:
import random
def biased(p):
# create a biased coin
return 1 if random.random() < p else 0
def unbiased_from_biased(p):
n1, n2 = biased(p), biased(p)
while n1 == n2:
n1, n2 = biased(p), biased(p)
return n1
p = random.random()
print p
tosses = [unbiased_from_biased(p) for i in xrange(1000)]
n_1 = sum(tosses)
n_2 = len(tosses) - n_1
print n_1, n_2
It is pretty self-explanatory, and here is an example result:
0.0973181652114
505 495
As you see, nonetheless we had a bias of 0.097, we got approximately the same number of 1 and 0
The trick attributed to von Neumann of getting two bits at a time, having 01 correspond to 0 and 10 to 1, and repeating for 00 or 11 has already come up. The expected value of bits you need to extract to get a single bit using this method is 1/p(1-p), which can get quite large if p is especially small or large, so it is worthwhile to ask whether the method can be improved, especially since it is evident that it throws away a lot of information (all 00 and 11 cases).
Googling for "von neumann trick biased" produced this paper that develops a better solution for the problem. The idea is that you still take bits two at a time, but if the first two attempts produce only 00s and 11s, you treat a pair of 0s as a single 0 and a pair of 1s as a single 1, and apply von Neumann's trick to these pairs. And if that doesn't work either, keep combining similarly at this level of pairs, and so on.
Further on, the paper develops this into generating multiple unbiased bits from the biased source, essentially using two different ways of generating bits from the bit-pairs, and giving a sketch that this is optimal in the sense that it produces exactly the number of bits that the original sequence had entropy in it.
You need to draw pairs of values from the RNG until you get a sequence of different values, i.e. zero followed by one or one followed by zero. You then take the first value (or last, doesn't matter) of that sequence. (i.e. Repeat as long as the pair drawn is either two zeros or two ones)
The math behind this is simple: a 0 then 1 sequence has the very same probability as a 1 then zero sequence. By always taking the first (or the last) element of this sequence as the output of your new RNG, we get an even chance to get a zero or a one.
Besides the von Neumann procedure given in other answers, there is a whole family of techniques, called randomness extraction (also known as debiasing, deskewing, or whitening), that serve to produce unbiased random bits from random numbers of unknown bias. They include Peres's (1992) iterated von Neumann procedure, as well as an "extractor tree" by Zhou and Bruck (2012). Both methods (and several others) are asymptotically optimal, that is, their efficiency (in terms of output bits per input) approaches the optimal limit as the number of inputs gets large (Pae 2018).
For example, the Peres extractor takes a list of bits (zeros and ones with the same bias) as input and is described as follows:
Create two empty lists named U and V. Then, while two or more bits remain in the input:
If the next two bits are 0/0, append 0 to U and 0 to V.
Otherwise, if those bits are 0/1, append 1 to U, then write a 0.
Otherwise, if those bits are 1/0, append 1 to U, then write a 1.
Otherwise, if those bits are 1/1, append 0 to U and 1 to V.
Run this algorithm recursively, reading from the bits placed in U.
Run this algorithm recursively, reading from the bits placed in V.
This is not to mention procedures that produce unbiased random bits from biased dice or other biased random numbers (not just biased bits); see, e.g., Camion (1974).
I discuss more on randomness extractors in a note on randomness extraction.
REFERENCES:
Peres, Y., "Iterating von Neumann's procedure for extracting random bits", Annals of Statistics 1992,20,1, p. 590-597.
Zhou, H. And Bruck, J., "Streaming algorithms for optimal generation of random bits", arXiv:1209.0730 [cs.IT], 2012.
S. Pae, "Binarization Trees and Random Number Generation", arXiv:1602.06058v2 [cs.DS].
Camion, Paul, "Unbiased die rolling with a biased die", North Carolina State University. Dept. Of Statistics, 1974.
Here's one way, probably not the most efficient. Chew through a bunch of random numbers until you get a sequence of the form [0..., 1, 0..., 1] (where 0... is one or more 0s). Count the number of 0s. If the first sequence is longer, generate a 0, if the second sequence is longer, generate a 1. (If they're the same, try again.)
This is like what HotBits does to generate random numbers from radioactive particle decay:
Since the time of any given decay is random, then the interval between two consecutive decays is also random. What we do, then, is measure a pair of these intervals, and emit a zero or one bit based on the relative length of the two intervals. If we measure the same interval for the two decays, we discard the measurement and try again
HotBits: How It Works
I'm just explaining the already proposed solutions with some running proof. This solution will be unbiased, no matter how many times we change the probability. In a head n tail toss, the exclusivity of consecutive head n tail or tail n head is always unbiased.
import random
def biased_toss(probability):
if random.random() > probability:
return 1
else:
return 0
def unbiased_toss(probability):
x = biased_toss(probability)
y = biased_toss(probability)
while x == y:
x = biased_toss(probability)
y = biased_toss(probability)
else:
return x
# results with contain counts of heads '0' and tails '1'
results = {'0':0, '1':0}
for i in range(1000):
# on every call we are changing the probability
p = random.random()
results[str(unbiased_toss(p))] += 1
# it still return unbiased result
print(results)