Generating uniform random bits using a source of biased bits [duplicate] - random

This question already has answers here:
Unbiased random number generator using a biased one
(7 answers)
Closed 2 years ago.
We have a source of biased random bits, namely a source that produces 0 with probability p, or 1 with probability 1 - p.
How can we use this source to build a generator that produces 0 or 1 with equal probability?

You throw the biased coin two times. There are four possible outcomes:
01 - Pick the number 0 as the return value.
10 - Pick the number 1 as the return value.
00 - Start again throwing two coins.
11 - Start again throwing two coins.
When you calculate the probabilities of getting 01 and 10, you will see they are the same: both are p(1 - p). So the sequences 01 and 10 appear with equal probability, regardless of the value of p.
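A minimal sketch of this procedure in Python; biased_bit is a hypothetical stand-in for the biased source (the name is mine, not from the question):
import random

def biased_bit(p=0.7):
    # hypothetical stand-in for the source: 0 with probability p, 1 with probability 1 - p
    return int(random.random() >= p)

def unbiased_bit(p=0.7):
    # keep drawing pairs until the two bits differ, then return the first one
    while True:
        a, b = biased_bit(p), biased_bit(p)
        if a != b:
            return a
On average this consumes 1/(p(1 - p)) biased bits per output bit.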

One way would be to call your biased generator N times, sum up the results, and then take your unbiased sample to be 0 if the sum is less than the median of the sum, 1 otherwise. The only trick is knowing how to pick N and what the median is. In the wiki article they discuss finding the median. (Note they have p the other way round; in the article it's the probability of 1.)
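A rough sketch of that idea, assuming p is known and scipy is available to compute the binomial median; note the result is only approximately unbiased, since a binomial median rarely splits the probability mass exactly in half:
import random
from scipy.stats import binom

def biased_bit(p=0.7):
    # hypothetical biased source: 0 with probability p, 1 with probability 1 - p
    return int(random.random() >= p)

def unbiased_bit_via_median(p=0.7, N=101):
    q = 1 - p                  # probability of a 1 (the convention used in the article)
    m = binom.median(N, q)     # median of the sum of N biased bits
    s = sum(biased_bit(p) for _ in range(N))
    return 0 if s < m else 1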

Well, there is an approach, called an entropy extractor, which allows you to get (good) random numbers from not-quite-random source(s).
If you have three independent but somewhat low-quality (biased) RNGs, you could combine them into a uniform source.
Suppose you have three generators giving you a single byte each; then the uniform output would be
t = X*Y + Z
where addition and multiplication are done over the finite field GF(2^8).
Code, Python
import numpy as np
import matplotlib.pyplot as plt
from pyfinite import ffield

def RNG(p):
    return np.random.binomial(1, p) + \
           np.random.binomial(1, p)*2 + \
           np.random.binomial(1, p)*4 + \
           np.random.binomial(1, p)*8 + \
           np.random.binomial(1, p)*16 + \
           np.random.binomial(1, p)*32 + \
           np.random.binomial(1, p)*64 + \
           np.random.binomial(1, p)*128

def muRNG(p):
    X = RNG(p)
    Y = RNG(p)
    Z = RNG(p)
    GF = ffield.FField(8)
    return GF.Add(GF.Multiply(X, Y), Z)

N = 100000

hist = np.zeros(256, dtype=np.int32)
for k in range(0, N):
    q = muRNG(0.7)
    hist[q] += 1

x = np.arange(0, 256, dtype=np.int32)
fig, ax = plt.subplots()
ax.stem(x, hist, markerfmt=' ')
plt.show()
will produce a graph of the byte distribution, which looks reasonably uniform for values in [0...256). I could find the paper where this idea was proposed.
Just for illustration, this is how it looks when we collect bytes without entropy extraction; the code becomes
...
    q = RNG(0.7) # just a byte value built from bits with p=0.7
...
and the resulting graph is visibly non-uniform.


What is the most rng-efficient uniform random integer algorithm? [duplicate]

This question already has answers here:
How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?
(5 answers)
Closed last year.
This is not a duplicate of #11766794 “What is the optimal algorithm for generating an unbiased random integer within a range?”. Both its top-upvoted answer and its Accepted answer have to do with floating-point extrapolation, which will not even yield perfectly uniform integers. That question is asking after ways to quickly get good approximations of a uniform random integer given a serviceably uniform floating-point rand() value; I am asking this question in the context of a perfectly uniform random integer algorithm given a true random bit generator (deterministic or otherwise; the question applies equally either way).
I am asking, specifically, about theoretical optimality in terms only of efficiency w/r/t random bits used: given a random bit stream, what is the algorithm which consumes the fewest number of bits from it in the process of generating a perfectly uniform random integer within a given range?
For instance, CPython 3.9.0's random.randbelow has at least one trivial inefficiency—it wastes a random bit when called over any power of 2 (including the trivial range):
def randbelow(n):
    "Return a random int in the range [0,n). Returns 0 if n==0."
    if not n:
        return 0
    k = n.bit_length()  # don't use (n-1) here because n can be 1
    r = getrandbits(k)  # 0 <= r < 2**k
    while r >= n:
        r = getrandbits(k)
    return r
While this is easily enough patched by replacing "not n" with "n <= 1" and "n.bit_length()" with "(n-1).bit_length()", a little analysis shows it leaves even more to be desired:
Say one is generating an integer over the range [0, 4097): nearly half of all calls to getrandbits(13) will overshoot the range. And if the first bit and, say, the second bit are high, the algorithm consumes 11 more bits anyway and then discards them, when it seemingly didn't need to. So it would seem that this algorithm is clearly non-optimal.
The best I could come up with in an hour this evening was the following algorithm:
def randbelow(n):
    if n <= 1:
        return 0
    k = (n - 1).bit_length()  # this is POPCNT for bignums
    while True:
        r = 0
        for i in reversed(range(k)):
            r |= getrandbits(1) << i
            if r >= n:
                break
        else:
            return r
However, I am no mathematician, and just because I fixed the inefficiencies that I could immediately see doesn't give me any confidence that I have just instantly stumbled in an afternoon on the most efficient possible uniform integer selection algorithm.
Say, for instance, the bits are being purchased from a quantum or atmospheric RNG service; or as part of a multi-party protocol in which every individual bit generation takes several round-trips; or on an embedded device without any hardware RNG support… whatever the case may be, I'm only asking the direct question: what algorithm for generating (perfectly) uniformly random integers from a true random bit stream is the most efficient with respect to random bits consumed? (Or, if not known with certainty, what is the best current candidate?)
(I have used Python in these examples because it's what I am working in primarily this season, but the question is by no means specific to any language, except insofar as the algorithm itself must generalize to numbers above 2^64.)
The Python below implements arithmetic coding in exact arithmetic. This becomes very expensive computationally but achieves entropy + O(1) bits in expectation, which is basically optimal.
from fractions import Fraction
from math import floor, log2
import collections
import random

meter = 0

def metered_random_bits():
    global meter
    while True:
        yield bool(random.randrange(2))
        meter += 1

class ArithmeticDecoder:
    def __init__(self, bits):
        self._low = Fraction(0)
        self._width = Fraction(1)
        self._bits = bits

    def randrange(self, n):
        self._low *= n
        self._width *= n
        while True:
            f = floor(self._low)
            if self._low + self._width <= f + 1:
                self._low -= f
                return f
            self._width /= 2
            if next(self._bits):
                self._low += self._width

if __name__ == "__main__":
    k = 3000
    n = 7
    decoder = ArithmeticDecoder(metered_random_bits())
    print(collections.Counter(decoder.randrange(n) for i in range(k)))
    print("used", meter, "bits")
    print("entropy", k * log2(n), "bits")

Conditional sampling of binary vectors (?)

I'm trying to find a name for my problem, so I don't have to re-invent the wheel when coding an algorithm which solves it...
I have say 2,000 binary (row) vectors and I need to pick 500 from them. In the picked sample I do column sums and I want my sample to be as close as possible to a pre-defined distribution of the column sums. I'll be working with 20 to 60 columns.
A tiny example:
Out of the vectors:
110
010
011
110
100
I need to pick 2 to get column sums 2, 1, 0. The solution (exact in this case) would be
110
100
My ideas so far
one could maybe call this a binary multidimensional knapsack, but I did not find any algos for that
Linear Programming could help, but I'd need some step by step explanation as I got no experience with it
as an exact solution is not always feasible, something like simulated annealing or brute force could work well
a hacky way using constraint solvers comes to mind - first set the constraints tight and gradually loosen them until some solution is found - given that CSP should be much faster than ILP...?
My concrete, practical suggestion (if the approximation guarantee works out for you) would be to apply the maximum entropy method (Chapter 7 of Boyd and Vandenberghe's book Convex Optimization; you can probably find several implementations with your favorite search engine) to find the maximum-entropy probability distribution on row indexes such that (1) no row index is more likely than 1/500 and (2) the expected value of the row vector chosen is 1/500th of the predefined distribution. Given this distribution, choose each row independently with probability 500 times its likelihood under the distribution, which will give you 500 rows on average. If you need exactly 500, repeat until you get exactly 500 (this shouldn't take too many tries, due to concentration bounds).
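A rough sketch of that recipe, assuming scipy is available; the names (rows, target, m_out) are placeholders of mine, the equality constraints may be infeasible for some targets, and a generic solver like this will be slow for 2,000 rows:
import numpy as np
from scipy.optimize import minimize

def max_entropy_pick(rows, target, m_out, rng=None):
    rows = np.asarray(rows, dtype=float)        # shape (N, C), 0/1 entries
    target = np.asarray(target, dtype=float)    # desired column sums
    N = len(rows)
    rng = np.random.default_rng() if rng is None else rng

    def neg_entropy(q):
        q = np.clip(q, 1e-12, None)
        return float(np.sum(q * np.log(q)))    # minimizing this maximizes entropy

    constraints = [
        {"type": "eq", "fun": lambda q: q.sum() - 1.0},               # q is a distribution
        {"type": "eq", "fun": lambda q: rows.T @ q - target / m_out}, # expected row = target / m_out
    ]
    bounds = [(0.0, 1.0 / m_out)] * N          # (1) no row more likely than 1/m_out
    res = minimize(neg_entropy, np.full(N, 1.0 / N), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    q = np.clip(res.x, 0.0, 1.0 / m_out)

    # pick each row independently with probability m_out * q_i,
    # repeating until exactly m_out rows come out
    while True:
        picked = np.flatnonzero(rng.random(N) < m_out * q)
        if len(picked) == m_out:
            return picked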
Firstly I will make some assumptions regarding this problem:
Regardless of whether the column sum of the selected solution is over or under the target, it is weighted the same.
The sums of the first, second, and third columns are equally weighted in the solution (i.e. if there's a solution where the first column sum is off by 1, and another where the third column sum is off by 1, the solutions are equally good).
The closest problem I can think of to this problem is the subset sum problem, which itself can be thought of as a special case of the knapsack problem.
However, both of these problems are NP-complete. This means that no polynomial-time algorithm is known that can solve them, even though it is easy to verify a solution.
If I were you, the two arguably most practical approaches to this problem are linear programming and machine learning.
Depending on how many columns you are optimising over, linear programming lets you control how finely tuned you want the solution to be, in exchange for time. You should read up on this, because it is fairly simple and efficient.
With machine learning, you need a lot of data (sets of vectors and corresponding solutions). You don't even need to specify what you want; many machine learning algorithms can generally deduce what you want them to optimise based on your data set.
Both approaches have pros and cons; you should decide which one to use based on your circumstances and problem set.
This can definitely be modeled as an (integer!) linear program (many problems can). Once you have the model, you can use a program such as lp_solve to solve it.
We model "vector i is selected" as x_i, which can be 0 or 1.
Then for each column c, we have a constraint:
sum of all (x_i * value of i in column c) = target for column c
Taking your example, in lp_solve this could look like:
min: ;
+x1 +x4 +x5 >= 2;
+x1 +x4 +x5 <= 2;
+x1 +x2 +x3 +x4 <= 1;
+x1 +x2 +x3 +x4 >= 1;
+x3 <= 0;
+x3 >= 0;
bin x1, x2, x3, x4, x5;
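For reference, here is the same tiny model written in Python with the PuLP library (an assumption of mine; any MILP front end would do). For the full problem you would also add a cardinality constraint such as pulp.lpSum(x) == 500.
import pulp

vectors = [(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
target = (2, 1, 0)

prob = pulp.LpProblem("pick_rows", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(vectors))]
prob += 0  # pure feasibility problem, no real objective
for c, t in enumerate(target):
    prob += pulp.lpSum(x[i] * vectors[i][c] for i in range(len(vectors))) == t
prob.solve()
print([i for i, xi in enumerate(x) if xi.value() == 1])  # e.g. rows 0 and 4 (110 and 100)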
If you are fine with a heuristic based search approach, here is one.
Go over the list and find the minimum squared sum of the digit-wise differences between each bit string and the goal. For example, if we are looking for 2, 1, 0, and we are scoring 0, 1, 1, we would do it in the following way:
Take the digit-wise difference:
2, 0, 1
Square the digit-wise difference:
4, 0, 1
Sum:
5
As a side note, squaring the difference when scoring is a common method when doing heuristic search. In your case, it makes sense because bit strings that have a 1 as the first digit are a lot more interesting to us. In your case this simple algorithm would pick first 110, then 100, which is the best solution.
In any case, there are some optimizations that could be made to this, I will post them here if this kind of approach is what you are looking for, but this is the core of the algorithm.
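Here is a small sketch of how that scoring could be turned into a selection loop (my own fleshing-out, not the answer's exact method): repeatedly pick the vector whose squared digit-wise difference from the remaining target is smallest, then subtract it from the target.
def pick_greedy(vectors, target, m):
    remaining = list(target)
    chosen = []
    pool = list(vectors)
    for _ in range(m):
        # score each candidate against what is still missing from the target
        def score(v):
            return sum((r - b) ** 2 for r, b in zip(remaining, v))
        best = min(pool, key=score)
        pool.remove(best)
        chosen.append(best)
        remaining = [r - b for r, b in zip(remaining, best)]
    return chosen

print(pick_greedy([(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 0, 0)], (2, 1, 0), 2))
# -> [(1, 1, 0), (1, 0, 0)]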
You have a given target binary vector. You want to select M vectors out of N whose sum is closest to the target. Let's say you use the Euclidean distance to measure whether one selection is better than another.
If you want an exact sum, have a look at the k-sum problem, which is a generalization of the 3SUM problem. The problem is harder than the subset sum problem, because you want an exact number of elements to add up to a target value. There is a solution in O(N^(M/2) * lg N), but that means more than 2000^250 * 7.6 > 10^826 operations in your case (in the favorable case where vector operations have a cost of 1).
First conclusion: do not try to get an exact result unless your vectors have some characteristics that may reduce the complexity.
Here's a hill climbing approach:
sort the vectors by number of 1's: 111... first, 000... last;
use the polynomial time approximate algorithm for the subset sum;
you have an approximate solution with K elements. Because of the order of the elements (the big ones come first), K should be as small as possible:
if K >= M, you take the M first vectors of the solution and that's probably near the best you can do.
if K < M, you can remove the first vector and try to replace it with 2 or more vectors from the rest of the N vectors, using the same technique, until you have M vectors. To summarize: split the big vectors into smaller ones until you reach the correct number of vectors.
Here's a proof of concept with numbers, in Python:
import random

def distance(x, y):
    return abs(x-y)

def show(ls):
    if len(ls) < 10:
        return str(ls)
    else:
        return ", ".join(map(str, ls[:5]+("...",)+ls[-5:]))

def find(is_xs, target):
    # see https://en.wikipedia.org/wiki/Subset_sum_problem#Pseudo-polynomial_time_dynamic_programming_solution
    S = [(0, ())]  # we store indices along with values to get the path
    for i, x in is_xs:
        T = [(x + t, js + (i,)) for t, js in S]
        U = sorted(S + T)
        y, ks = U[0]
        S = [(y, ks)]
        for z, ls in U:
            if z == target:  # use the euclidean distance here if you want an approximation
                return ls
            if z != y and z < target:
                y, ks = z, ls
                S.append((z, ls))
    ls = S[-1][1]  # take the closest element to target
    return ls

N = 2000
M = 500
target = 1000

xs = [random.randint(0, 10) for _ in range(N)]
print("Take {} numbers out of {} to make a sum of {}", M, xs, target)
xs = sorted(xs, reverse=True)
is_xs = list(enumerate(xs))
print("Sorted numbers: {}".format(show(tuple(is_xs))))

ls = find(is_xs, target)
print("FIRST TRY: {} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))

splits = 0
while len(ls) < M:
    first_x = xs[ls[0]]
    js_ys = [(i, x) for i, x in is_xs if i not in ls and x != first_x]
    replace = find(js_ys, first_x)
    splits += 1
    if len(replace) < 2 or len(replace) + len(ls) - 1 > M or sum(xs[i] for i in replace) != first_x:
        print("Give up: can't replace {}.\nAdd the lowest elements.")
        ls += tuple([i for i, x in is_xs if i not in ls][len(ls)-M:])
        break
    print("Replace {} (={}) by {} (={})".format(ls[:1], first_x, replace, sum(xs[i] for i in replace)))
    ls = tuple(sorted(ls[1:] + replace))  # use a heap?
    print("{} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))

print("AFTER {} splits, {} -> {}".format(splits, ls, sum(x for i, x in is_xs if i in ls)))
The result is obviously not guaranteed to be optimal.
Remarks:
Complexity: find has a polynomial time complexity (see the Wikipedia page) and is called at most M^2 times, hence the complexity remains polynomial. In practice, the process is reasonably fast (split calls have a small target).
Vectors: to ensure that you reach the target with the minimum number of elements, you can improve the order of the elements. Your target is (t_1, ..., t_c): if you sort the t_j from max to min, you get the most important columns first. You can then sort the vectors: by number of 1s, and then by the presence of a 1 in the most important columns. E.g. target = 4 8 6 => 1 1 1 > 0 1 1 > 1 1 0 > 1 0 1 > 0 1 0 > 0 0 1 > 1 0 0 > 0 0 0.
find (vectors): if the current sum exceeds the target in all the columns, then you will never reach the target (any vector you add to the current sum will bring you farther from it): don't add that sum to S (the z >= target case for numbers).
I propose a simple ad hoc algorithm, which, broadly speaking, is a kind of gradient descent algorithm. It seems to work relatively well for input vectors which have a distribution of 1s “similar” to the target sum vector, and probably also for all “nice” input vectors, as defined in a comment of yours. The solution is not exact, but the approximation seems good.
The distance between the sum vector of the output vectors and the target vector is taken to be Euclidean. Minimizing it means minimizing the sum of the squared differences of the sum vector and the target vector (the square root is not needed, because it is monotonic). The algorithm does not guarantee to yield the sample that minimizes the distance from the target, but it makes a serious attempt at doing so, by always moving in some locally optimal direction.
The algorithm can be split into 3 parts.
First of all the first M candidate output vectors out of the N input vectors (e.g., N=2000, M=500) are put in a list, and the remaining vectors are put in another.
Then "approximately optimal" swaps between vectors in the two lists are done, until either the distance would not decrease any more, or a predefined maximum number of iterations is reached. An approximately optimal swap is one where removing the first vector from the list of output vectors causes a maximal decrease or minimal increase of the distance, and then, after the removal of the first vector, adding the second vector to the same list causes a maximal decrease of the distance. The whole swap is avoided if the net result is not a decrease of the distance.
Then, as a last phase, "optimal" swaps are done, again stopping on no decrease in distance or maximum number of iterations reached. Optimal swaps cause a maximal decrease of the distance, without requiring the removal of the first vector to be optimal in itself. To find an optimal swap all vector pairs have to be checked. This phase is much more expensive, being O(M(N-M)), while the previous "approximate" phase is O(M+(N-M))=O(N). Luckily, when entering this phase, most of the work has already been done by the previous phase.
from typing import List, Tuple


def get_sample(vects: List[Tuple[int]], target: Tuple[int], n_out: int,
               max_approx_swaps: int = None, max_optimal_swaps: int = None,
               verbose: bool = False) -> List[Tuple[int]]:
    """
    Get a sample of the input vectors having a sum close to the target vector.
    Closeness is measured in Euclidean metrics. The output is not guaranteed to be
    optimal (minimum square distance from target), but a serious attempt is made.
    The max_* parameters can be used to avoid too long execution times,
    tune them to your needs by setting verbose to True, or leave them None (∞).
    :param vects: the list of vectors (tuples) with the same number of "columns"
    :param target: the target vector, with the same number of "columns"
    :param n_out: the requested sample size
    :param max_approx_swaps: the max number of approximately optimal vector swaps,
                             None means unlimited (default: None)
    :param max_optimal_swaps: the max number of optimal vector swaps,
                              None means unlimited (default: None)
    :param verbose: print some info if True (default: False)
    :return: the sample of n_out vectors having a sum close to the target vector
    """
    def square_distance(v1, v2):
        return sum((e1 - e2) ** 2 for e1, e2 in zip(v1, v2))

    n_vec = len(vects)
    assert n_vec > 0
    assert n_out > 0
    n_rem = n_vec - n_out
    assert n_rem > 0
    output = vects[:n_out]
    remain = vects[n_out:]
    n_col = len(vects[0])
    assert n_col == len(target) > 0
    sumvect = (0,) * n_col
    for outvect in output:
        sumvect = tuple(map(int.__add__, sumvect, outvect))
    sqdist = square_distance(sumvect, target)
    if verbose:
        print(f"sqdist = {sqdist:4} after"
              f" picking the first {n_out} vectors out of {n_vec}")
    if max_approx_swaps is None:
        max_approx_swaps = sqdist
    n_approx_swaps = 0
    while sqdist and n_approx_swaps < max_approx_swaps:
        # find the best vect to subtract (the square distance MAY increase)
        sqdist_0 = None
        index_0 = None
        sumvect_0 = None
        for index in range(n_out):
            tmp_sumvect = tuple(map(int.__sub__, sumvect, output[index]))
            tmp_sqdist = square_distance(tmp_sumvect, target)
            if sqdist_0 is None or sqdist_0 > tmp_sqdist:
                sqdist_0 = tmp_sqdist
                index_0 = index
                sumvect_0 = tmp_sumvect
        # find the best vect to add,
        # but only if there is a net decrease of the square distance
        sqdist_1 = sqdist
        index_1 = None
        sumvect_1 = None
        for index in range(n_rem):
            tmp_sumvect = tuple(map(int.__add__, sumvect_0, remain[index]))
            tmp_sqdist = square_distance(tmp_sumvect, target)
            if sqdist_1 > tmp_sqdist:
                sqdist_1 = tmp_sqdist
                index_1 = index
                sumvect_1 = tmp_sumvect
        if sumvect_1:
            tmp = output[index_0]
            output[index_0] = remain[index_1]
            remain[index_1] = tmp
            sqdist = sqdist_1
            sumvect = sumvect_1
            n_approx_swaps += 1
        else:
            break
    if verbose:
        print(f"sqdist = {sqdist:4} after {n_approx_swaps}"
              f" approximately optimal swap{'s'[n_approx_swaps == 1:]}")
    diffvect = tuple(map(int.__sub__, sumvect, target))
    if max_optimal_swaps is None:
        max_optimal_swaps = sqdist
    n_optimal_swaps = 0
    while sqdist and n_optimal_swaps < max_optimal_swaps:
        # find the best pair to swap,
        # but only if the square distance decreases
        best_sqdist = sqdist
        best_diffvect = diffvect
        best_pair = None
        for i0 in range(M):
            tmp_diffvect = tuple(map(int.__sub__, diffvect, output[i0]))
            for i1 in range(n_rem):
                new_diffvect = tuple(map(int.__add__, tmp_diffvect, remain[i1]))
                new_sqdist = sum(d * d for d in new_diffvect)
                if best_sqdist > new_sqdist:
                    best_sqdist = new_sqdist
                    best_diffvect = new_diffvect
                    best_pair = (i0, i1)
        if best_pair:
            tmp = output[best_pair[0]]
            output[best_pair[0]] = remain[best_pair[1]]
            remain[best_pair[1]] = tmp
            sqdist = best_sqdist
            diffvect = best_diffvect
            n_optimal_swaps += 1
        else:
            break
    if verbose:
        print(f"sqdist = {sqdist:4} after {n_optimal_swaps}"
              f" optimal swap{'s'[n_optimal_swaps == 1:]}")
    return output
from random import randrange
C = 30 # number of columns
N = 2000 # total number of vectors
M = 500 # number of output vectors
F = 0.9 # fill factor of the target sum vector
T = int(M * F)  # maximum value + 1 that can appear in the target sum vector
A = 10000 # maximum number of approximately optimal swaps, may be None (∞)
B = 10 # maximum number of optimal swaps, may be None (unlimited)
target = tuple(randrange(T) for _ in range(C))
vects = [tuple(int(randrange(M) < t) for t in target) for _ in range(N)]
sample = get_sample(vects, target, M, A, B, True)
Typical output:
sqdist = 2639 after picking the first 500 vectors out of 2000
sqdist = 9 after 27 approximately optimal swaps
sqdist = 1 after 4 optimal swaps
P.S.: As it stands, this algorithm is not limited to binary input vectors, integer vectors would work too. Intuitively I suspect that the quality of the optimization could suffer, though. I suspect that this algorithm is more appropriate for binary vectors.
P.P.S.: Execution times with your kind of data are probably acceptable with standard CPython, but get better (like a couple of seconds, almost a factor of 10) with PyPy. To handle bigger sets of data, the algorithm would have to be translated to C or some other language, which should not be difficult at all.

Algorithm that increases randomness?

Suppose I provide you with random seeds between 0 and 1, but after some observations you find out that my seeds are not properly distributed and most of them are less than 0.5. Would you still be able to use this source, by applying an algorithm that makes the seeds more uniformly distributed?
If yes, please provide me with the necessary sources.
It really depends on how the numbers are distributed in the interval [0...1]. In general, you need the CDF (cumulative distribution function) to map an arbitrary distribution on [0...1] into the uniform distribution on [0...1]. But for some particular cases you can do a simpler transformation. The code below (in Python) first constructs a simple unfair RNG which generates 60% of its numbers below 0.5 and 40% above.
import random

def unfairRng():
    q = random.random()
    if q < 0.6:  # result is skewed toward [0...0.5] interval
        return 0.5*random.random()
    return 0.5 + 0.5*random.random()

random.seed(312345)

nof_trials = 100000
h = [0, 0]
for k in range(0, nof_trials):
    q = unfairRng()
    h[0 if q < 0.5 else 1] += 1
print(h)
I then count the numbers below and above 0.5, and the output on my machine is
[60086, 39914]
which is quite close to the 60/40 split I described.
Ok, let's "fix" RNG by taking numbers from unfairRNG and alternating just returning value and next time returning 1-value. Again, Python code
def fairRng():
    if fairRng.even == 0:
        fairRng.even = 1
        return unfairRng()
    else:
        fairRng.even = 0
        return 1.0 - unfairRng()

fairRng.even = 0

h = [0, 0]
for k in range(0, nof_trials):
    q = fairRng()
    h[0 if q < 0.5 else 1] += 1
print(h)
Again, counting the histogram, the result is
[49917, 50083]
which "fix" unfair RNG and make it fair.
Getting a fair coin flip out of an unfair coin is done by flipping twice and, if the results differ, using the first result; otherwise, discarding both and flipping twice again.
This results in a coin with exactly 50/50 chance, but it's not guaranteed to run in finite time.
Random number sequences generated by any algorithm will have no more entropy ("randomness") than the seeds themselves. For instance, if each seed has an entropy of only 1 bit for every 64 bits, they can each be transformed, at least in theory, to a 1 bit random number with full entropy. However, measuring the entropy of those seeds is nontrivial (entropy estimation). Moreover, not every algorithm is suitable in all cases for extracting the entropy of random seeds (entropy extraction, randomness extraction).

How can I efficiently calculate the binomial cumulative distribution function?

Let's say that I know the probability of a "success" is P. I run the test N times, and I see S successes. The test is akin to tossing an unevenly weighted coin (perhaps heads is a success, tails is a failure).
I want to know the approximate probability of seeing either S successes, or a number of successes less likely than S successes.
So for example, if P is 0.3, N is 100, and I get 20 successes, I'm looking for the probability of getting 20 or fewer successes.
If, on the other hand, P is 0.3, N is 100, and I get 40 successes, I'm looking for the probability of getting 40 or more successes.
I'm aware that this problem relates to finding the area under a binomial curve, however:
My math-fu is not up to the task of translating this knowledge into efficient code
While I understand a binomial curve would give an exact result, I get the impression that it would be inherently inefficient. A fast method to calculate an approximate result would suffice.
I should stress that this computation has to be fast, and should ideally be determinable with standard 64 or 128 bit floating point computation.
I'm looking for a function that takes P, S, and N - and returns a probability. As I'm more familiar with code than mathematical notation, I'd prefer that any answers employ pseudo-code or code.
Exact Binomial Distribution
from functools import reduce

def factorial(n):
    if n < 2:
        return 1
    return reduce(lambda x, y: x*y, range(2, int(n)+1))

def prob(s, p, n):
    x = 1.0 - p
    a = n - s
    b = s + 1
    c = a + b - 1
    prob = 0.0
    for j in range(a, c + 1):
        prob += factorial(c) / (factorial(j)*factorial(c-j)) \
                * x**j * (1 - x)**(c-j)
    return prob
>>> prob(20, 0.3, 100)
0.016462853241869437
>>> 1-prob(40-1, 0.3, 100)
0.020988576003924564
Normal Estimate, good for large n
import math

def erf(z):
    t = 1.0 / (1.0 + 0.5 * abs(z))
    # use Horner's method
    ans = 1 - t * math.exp(-z*z - 1.26551223 +
                           t * (1.00002368 +
                           t * (0.37409196 +
                           t * (0.09678418 +
                           t * (-0.18628806 +
                           t * (0.27886807 +
                           t * (-1.13520398 +
                           t * (1.48851587 +
                           t * (-0.82215223 +
                           t * (0.17087277))))))))))
    if z >= 0.0:
        return ans
    else:
        return -ans

def normal_estimate(s, p, n):
    u = n * p
    o = (u * (1-p)) ** 0.5
    return 0.5 * (1 + erf((s-u)/(o*2**0.5)))
>>> normal_estimate(20, 0.3, 100)
0.014548164531920815
>>> 1-normal_estimate(40-1, 0.3, 100)
0.024767304545069813
Poisson Estimate: Good for large n and small p
import math

def poisson(s, p, n):
    L = n*p
    sum = 0
    for i in range(0, s+1):
        sum += L**i/factorial(i)   # factorial() as defined above
    return sum*math.e**(-L)
>>> poisson(20, 0.3, 100)
0.013411150012837811
>>> 1-poisson(40-1, 0.3, 100)
0.046253037645840323
I was on a project where we needed to be able to calculate the binomial CDF in an environment that didn't have a factorial or gamma function defined. It took me a few weeks, but I ended up coming up with the following algorithm which calculates the CDF exactly (i.e. no approximation necessary). Python is basically as good as pseudocode, right?
import numpy as np

def binomial_cdf(x, n, p):
    cdf = 0
    b = 0
    for k in range(x+1):
        if k > 0:
            b += np.log(n-k+1) - np.log(k)
        log_pmf_k = b + k * np.log(p) + (n-k) * np.log(1-p)
        cdf += np.exp(log_pmf_k)
    return cdf
Performance scales with x. For small values of x, this solution is about an order of magnitude faster than scipy.stats.binom.cdf, with similar performance at around x=10,000.
I won't go into a full derivation of this algorithm because stackoverflow doesn't support MathJax, but the thrust of it is first identifying the following equivalence:
For all k > 0, sp.misc.comb(n, k) == np.prod([(n - j + 1) / j for j in range(1, k + 1)])
Which we can rewrite as:
sp.misc.comb(n,k) == sp.misc.comb(n,k-1) * (n-k+1)/k
or in log space:
np.log( sp.misc.comb(n,k) ) == np.log(sp.misc.comb(n,k-1)) + np.log(n-k+1) - np.log(k)
Because the CDF is a summation of PMFs, we can use this formulation to calculate the binomial coefficient (the log of which is b in the function above) for PMF_{x=i} from the coefficient we calculated for PMF_{x=i-1}. This means we can do everything inside a single loop using accumulators, and we don't need to calculate any factorials!
The reason most of the calculations are done in log space is to improve the numerical stability of the polynomial terms, i.e. p^k and (1-p)^(n-k) have the potential to be extremely large or extremely small, which can cause computational errors.
EDIT: Is this a novel algorithm? I've been poking around on and off since before I posted this, and I'm increasingly wondering if I should write this up more formally and submit it to a journal.
I think you want to evaluate the incomplete beta function.
There's a nice implementation using a continued fraction representation in "Numerical Recipes In C", chapter 6: 'Special Functions'.
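If scipy is acceptable, here is a short illustration (my own sketch) of the identity this answer points at: the binomial CDF is a regularized incomplete beta function, P(X <= s) = I_{1-p}(n - s, s + 1).
from scipy.special import betainc

def binom_cdf(s, p, n):
    # regularized incomplete beta: P(X <= s) for X ~ Binomial(n, p)
    return betainc(n - s, s + 1, 1 - p)

print(binom_cdf(20, 0.3, 100))  # ~0.01646, matching the exact computation above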
I can't totally vouch for the efficiency, but Scipy has a module for this
from scipy.stats.distributions import binom
binom.cdf(successes, attempts, chance_of_success_per_attempt)
An efficient and, more importantly, numerically stable algorithm exists in the domain of Bezier curves used in Computer Aided Design. It is called de Casteljau's algorithm, used to evaluate the Bernstein polynomials that define Bezier curves.
I believe that I am only allowed one link per answer so start with Wikipedia - Bernstein Polynomials
Notice the very close relationship between the Binomial Distribution and the Bernstein Polynomials. Then click through to the link on de Casteljau's algorithm.
Let's say I know the probability of throwing a heads with a particular coin is P. What is the probability of me throwing the coin T times and getting at least S heads?
Set n = T
Set beta[i] = 0 for i = 0, ... S - 1
Set beta[i] = 1 for i = S, ... T
Set t = p
Evaluate B(t) using de Casteljau
or at most S heads?
Set n = T
Set beta[i] = 1 for i = 0, ... S
Set beta[i] = 0 for i = S + 1, ... T
Set t = p
Evaluate B(t) using de Casteljau
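A small sketch of that recipe (my own code, not from a CAD library): evaluate the Bernstein polynomial with those beta coefficients using de Casteljau's recurrence, which only ever forms convex combinations and is therefore numerically stable.
def de_casteljau(beta, t):
    # repeatedly replace adjacent pairs by their convex combination until one value remains
    beta = list(beta)
    for _ in range(len(beta) - 1):
        beta = [(1 - t) * b0 + t * b1 for b0, b1 in zip(beta, beta[1:])]
    return beta[0]

def prob_at_least(S, p, T):
    # beta[i] = 0 for i < S, 1 for i >= S, evaluated at t = p
    return de_casteljau([0] * S + [1] * (T - S + 1), p)

print(prob_at_least(40, 0.3, 100))  # ~0.021, matching the exact computation earlier in this thread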
Open source code probably exists already. NURBS Curves (Non-Uniform Rational B-spline Curves) are a generalization of Bezier Curves and are widely used in CAD. Try openNurbs (the license is very liberal) or failing that Open CASCADE (a somewhat less liberal and opaque license). Both toolkits are in C++, though, IIRC, .NET bindings exist.
If you are using Python, no need to code it yourself. Scipy got you covered:
from scipy.stats import binom
# probability that you get 20 or less successes out of 100, when p=0.3
binom.cdf(20, 100, 0.3)
>>> 0.016462853241869434
# probability that you get exactly 20 successes out of 100, when p=0.3
binom.pmf(20, 100, 0.3)
>>> 0.0075756449257260777
From the portion of your question "getting at least S heads" you want the cumulative binomial distribution function. See http://en.wikipedia.org/wiki/Binomial_distribution for the equation, which is described as being in terms of the "regularized incomplete beta function" (as already answered). If you just want to calculate the answer without having to implement the entire solution yourself, the GNU Scientific Library provides the functions gsl_cdf_binomial_P and gsl_cdf_binomial_Q.
The DCDFLIB Project has C# functions (wrappers around C code) to evaluate many CDF functions, including the binomial distribution. You can find the original C and FORTRAN code here. This code is well tested and accurate.
If you want to write your own code to avoid being dependent on an external library, you could use the normal approximation to the binomial mentioned in other answers. Here are some notes on how good the approximation is under various circumstances. If you go that route and need code to compute the normal CDF, here's Python code for doing that. It's only about a dozen lines of code and could easily be ported to any other language. But if you want high accuracy and efficient code, you're better off using third party code like DCDFLIB. Several man-years went into producing that library.
Try this one, used in GMP. Another reference is this.
import numpy as np
np.random.seed(1)
x = np.random.binomial(20, 0.6, 10000)  # 20 coin flips with a 60% chance of heads, repeated 10000 times
sum(x > 12) / len(x)
The output is that about 41% of the time we got more than 12 heads.

Split square into small squares

I have a big square, and I would like to split this square into small squares. I need all possible combinations. I know that there is an infinite number of combinations, but I have one limitation: I have a fixed size for the smallest square.
I can implement it using brute force, but it is too slow.
Is any preferable algorithm for this task?
Thanks!
Well, this problem only has a solution if we make two assumptions (otherwise there are infinitely many combinations):
1) the smallest square has a fixed size;
2) the way to cut the big square is also fixed, as if you put the square onto a grid whose lines are separated by the size of the small square.
There is also a third assumption that would make the problem a bit easier:
3) the big square has a side K times the side of the small square, with K being an integer.
If both assumptions are true, we can proceed:
For each integer N, count the squares whose side is N*small-size:
if ((big-size % N*small-size) == 0)
    Number += (big-size / N*small-size)^2
else
    Number += ((big-size / N*small-size)^2) * (big-size % N*small-size)
The factor (big-size % N*small-size) in the else branch is there because, if the big square is not evenly divided by N, then when "gridding" the big square with a grid width of N we will have a fractional part left over. We can then start gridding again, but with an offset of 1, 2, ... small steps; the number of possible offsets is (big-size % N*small-size).
Again, these steps only hold true if the three assumptions above are made.
There are no "infinite" combinations. In fact, the number may be large but is bounded.
Moreover, if you need strict squares (width = height, as opposed to just rectangles) there are even fewer, since you have to divide the original square (of side L) by the same integer on both sides; otherwise you'll be getting rectangles as well.
If you're working with integers, I would recommend just dividing L by 2, 3, ... M (L/M = minimum inner square length).
My math is a bit fuzzy, but if you have a square (of area n^2), then you have the length of one side (n).
From n you can calculate all the factors for that number, and use them as the sides for the smaller squares...
E.g.
n^2 = 44100
n = 210
Factors of n: x = 2, x = 3, x = 5, x = 7, ... and so on.
So your smaller squares for each factor are:
x=2 : x^2 = 4 : 44100 / 4 = 11025 small squares
x=3 : x^2 = 9 : 44100 / 9 = 4900 small squares
x=5 : x^2 = 25 : 44100 / 25 = 1764 small squares
x=7 : x^2 = 49 : 44100 / 49 = 900 small squares
and so on
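A tiny sketch of that counting (illustrative only): for each divisor x of the side length n, tiling with x-by-x squares gives (n/x)^2 pieces.
n = 210
for x in range(2, n + 1):
    if n % x == 0:
        print(f"x = {x}: {(n // x) ** 2} small squares")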
I propose the following Python code to randomly split a big square (512x512) into smaller squares (min. 32x32).
import random
import numpy as np

n = 16
cellSize = 32
smallSizes = [1, 2, 4, 8]

def is_cell_available(grid, indexRow, indexCol):
    return not bool(grid[indexRow, indexCol])

def try_square_size(grid, indexRow, indexCol):
    smSizes = smallSizes.copy()
    while True:
        potentialSize = random.choice(smSizes)
        if indexRow + potentialSize <= n and indexCol + potentialSize <= n:
            if np.count_nonzero(grid[indexRow:indexRow+potentialSize, indexCol:indexCol+potentialSize]) == 0:
                return potentialSize
        else:
            smSizes.remove(potentialSize)

grid = np.zeros((n, n), np.uint8)
bigMat = np.zeros((n*cellSize, n*cellSize), np.uint8)

for indexRow in range(n):
    for indexCol in range(n):
        if is_cell_available(grid, indexRow, indexCol):
            blockSize = try_square_size(grid, indexRow, indexCol)
            val = random.randint(1, 255)
            grid[indexRow:indexRow+blockSize, indexCol:indexCol+blockSize] = val
            bigMat[indexRow*cellSize:(indexRow+blockSize)*cellSize, indexCol*cellSize:(indexCol+blockSize)*cellSize] = val
        else:
            pass
