Number of items necessary to exceed a given collision probability for large spaces - algorithm

(This is not a homework problem. If there is a class that offers this question as homework, please tell me as I would love to take it.)
This is related to the birthday problem.
I'm looking for a practical algorithm to calculate the number of items necessary to exceed a collision probability of p for large spaces. I need this for evaluating the suitability of hashing algorithms for storing large numbers of items.
For example, f(365, .5) should return 23, the number of people needed to exceed a 0.5 probability that some two of them share a birthday.
I have created a simple implementation using an exact collision probability calculation:
def _items_for_p(buckets, p):
    """Return the number of items for chance of collision to exceed p."""
    logger.debug('_items_for_p(%r, %r)', buckets, p)
    up = buckets
    down = 1
    while up > (down + 1):
        n = (up + down) // 2
        logger.debug('up=%r, down=%r, n=%r', up, down, n)
        if _collision_p(buckets, n) > p:
            logger.debug('Lowering up to %r', n)
            up = n
        else:
            logger.debug('Raising down to %r', n)
            down = n
    return up

def _collision_p(buckets, items):
    """Return the probability of a collision."""
    return 1 - _no_collision_p(buckets, items)

def _no_collision_p(buckets, items):
    """Return the probability of no collision."""
    logger.debug('_no_collision_p(%r, %r)', buckets, items)
    fac = math.factorial
    return fac(buckets) / ((buckets ** items) * fac(buckets - items))
Needless to say, this does not work for the large spaces I want to work with (2^256, 2^512, etc).
I am looking for an algorithm that can calculate this in a reasonable amount of time with reasonable accuracy. The Wikipedia page provides mathematical approximations, but admittedly my math is a bit rusty, and I don't want to spend a lot of time investigating one approximation only to find that I cannot both generalize it and implement it quickly.

Solution to the generalised birthday problem for probability p = 0.5:
As noted by Wikipedia there is no proven formula that is quick to compute, but there is a formula that is conjectured to be exact. The formula involves computing square roots, natural logarithms, and basic arithmetic:
Sqrt(2*d*ln 2) + (3 - 2 * ln 2)/6 + (9 - 4(ln 2)^2)/(72 + Sqrt(2*d*ln 2) ) - 2 ln(2)^2/(135* d)
so you can feed in your d=2^256 and find out the answer that is conjectured to be exact.
Here's a quick attempt at implementing it, limited to the accuracy of python floats:
def solve_birthday_problem(d):
    ln2 = math.log(2)
    term1 = (2 * d * ln2) ** 0.5
    term2 = (3 - 2 * ln2) / 6.0
    term3 = (9 - 4 * (ln2) ** 2) / (72 + (2 * d * ln2) ** 0.5)
    term4 = 2 * ln2 ** 2 / (135.0 * d)
    return math.ceil(term1 + term2 + term3 - term4)
You will need to fix it up to get an accurate precision integer result. The decimal library may be what is needed to fix this.
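For what it's worth, here is a hedged sketch (my addition, not part of the answer above) of evaluating the same expression with Python's decimal module, so that a space like d = 2**256 does not lose precision to floats; the function name and the default of 200 digits are arbitrary choices.

from decimal import Decimal, getcontext, ROUND_CEILING

def solve_birthday_problem_decimal(d, digits=200):
    """Evaluate the conjectured p = 0.5 formula with arbitrary precision.

    digits should comfortably exceed the number of digits in sqrt(d).
    """
    getcontext().prec = digits
    d = Decimal(d)
    ln2 = Decimal(2).ln()
    root = (2 * d * ln2).sqrt()
    result = root + (3 - 2 * ln2) / 6 + (9 - 4 * ln2 ** 2) / (72 + root) - 2 * ln2 ** 2 / (135 * d)
    return int(result.to_integral_value(rounding=ROUND_CEILING))

# solve_birthday_problem_decimal(365)      # 23
# solve_birthday_problem_decimal(2**256)   # roughly 4.0e38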

Related

What is the terminology for reducing inputs to only distinct elements to speed computation?

Is there a term for leveraging the fact that data is comprised of a few much-repeated values to speed computation?
As an example, when trying to compute Sample Entropy on a long discrete sequence (length = 64,000,000,000, distinct elements = 11, substring length = 3) I found the running time too long (over 10 minutes). I realised that I should be able to exploit the relatively few distinct elements to speed up the computation, but was unable to find any literature on doing this (I suspect because I don't know what to Google).
The algorithm for Sample Entropy involves counting the pairs of substrings that are within a certain tolerance. This was the computationally expensive aspect of the algorithm O(n^2). By taking only the distinct substrings (of which there were at most 1331) I was able to find the pairs of distinct substrings within the tolerance, I then used the counts of each distinct substring to find the total number of pairs of (non-distinct) substrings that are within a certain tolerance. This method substantially sped up my computation.
Do algorithms that make use of the property of relatively few, much-repeated elements have a specific name?
def sampen(L, m, r):
    N = len(L)
    B = 0.0
    A = 0.0
    # Split time series and save all templates of length m
    xmi = np.array([L[i : i + m] for i in range(N - m)])
    xmj = np.array([L[i : i + m] for i in range(N - m + 1)])
    # Save all matches minus the self-match, compute B
    B = np.sum([np.sum(np.abs(xmii - xmj).max(axis=1) <= r) - 1 for xmii in xmi])
    # Similar for computing A
    m += 1
    xm = np.array([L[i : i + m] for i in range(N - m + 1)])
    A = np.sum([np.sum(np.abs(xmi - xm).max(axis=1) <= r) - 1 for xmi in xm])
    # Return SampEn
    return -np.log(A / B)
def sampen2(L, m, r):
    N = L.shape[0]
    # Split time series and save all templates of length m
    xmi = np.array([L[i : i + m] for i in range(N - m)])
    xmj = np.array([L[i : i + m] for i in range(N - m + 1)])
    # Find the unique subsequences and their counts
    uni_xmi, uni_xmi_counts = np.unique(xmi, axis=0, return_counts=True)
    uni_xmj, uni_xmj_counts = np.unique(xmj, axis=0, return_counts=True)
    # Save all matches minus the self-match, compute B
    B = np.sum(np.array([np.sum((np.abs(unii - uni_xmj).max(axis=1) <= r) * uni_xmj_counts) - 1 for unii in uni_xmi]) * uni_xmi_counts)
    # Similar for computing A
    m += 1
    xm = np.array([L[i : i + m] for i in range(N - m + 1)])
    uni_xm, uni_xm_counts = np.unique(xm, axis=0, return_counts=True)
    A = np.sum(np.array([np.sum((np.abs(unii - uni_xm).max(axis=1) <= r) * uni_xm_counts) - 1 for unii in uni_xm]) * uni_xm_counts)
    return -np.log(A / B)
It's a broad concept with several related terms.
A common, closely related term is Memoization, wherein the results of computing a subproblem for different inputs are stored, and reused when a previously-seen input is re-encountered. That's slightly different from what you're doing here, since memoization is a form of lazy evaluation where values are recognized opportunistically rather than the code performing an up-front exhaustive enumeration of the inputs which will be processed.
Materialization is also worth mentioning. It's encountered in the context of databases, and refers to the results of a query (a.k.a. tabular processing including possible filtering and/or reduction) being stored for reuse. The active concerns with materialization are largely around long-term considerations like dynamic updates, so it's not a perfect match for a run-and-forget algorithm.
Speaking of 'dynamic', one could also maybe describe this as a form of dynamic programming, with a problem solved by exhaustively enumerating and solving a sequence of subproblems. In dynamic programming, though, one expects those subproblems to have a more regular and inductive form, so I think that one's a stretch.
I would describe the precise strategy here as a sort of "eager memoization", to contrast with the lazy-evaluation assumption normally inherent with memoization.
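To make the contrast concrete, here is a small hedged sketch (mine, not code from the question): pair_count_lazy relies on functools.lru_cache to memoize a made-up expensive_match predicate lazily, while pair_count_eager does the eager pre-aggregation, collapsing the input to distinct values and counts up front in the spirit of the question's sampen2. Both return the same count of matching ordered pairs.

import functools
import numpy as np

# Lazy memoization: each distinct input pair is computed once, when first seen.
@functools.lru_cache(maxsize=None)
def expensive_match(a, b):
    # stand-in for a costly pairwise computation
    return abs(a - b) <= 1

def pair_count_lazy(values):
    return sum(expensive_match(a, b) for a in values for b in values)

# Eager pre-aggregation: enumerate the distinct inputs up front, weight by counts.
def pair_count_eager(values):
    uniq, counts = np.unique(values, return_counts=True)
    return sum(cu * cv
               for u, cu in zip(uniq.tolist(), counts.tolist())
               for v, cv in zip(uniq.tolist(), counts.tolist())
               if expensive_match(u, v))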

How to test if a function grows logarithmically?

I have function in my model that counts user's score:
def score
  (MULTIPLER * Math::log10(bets.count * summary_value ** accuracy + 1)).floor
end
How can I test that it grows logarithmically?
The point of a test isn't to prove it always works (that is the area for static typing/proofs), but to check that it is probably working. This is normally good enough. I'm guessing you are doing it for a game, and want to ensure the function doesn't "grow" too quickly.
A way we could do that is to try a number of values, and check whether they are following a general logarithmic pattern.
For example, consider a pure logarithmic function f(x) = log(x) (any base):
If f(x) = y, then f(x^n) = f(x) * n.
So, if f(x^n) == (f(x) * n), then the function is logarithmic.
Compare that to a linear function, e.g. f(x) = x * 2. Then f(x^n) = x^n * 2, which is roughly x^(n - 1) times bigger than f(x) * n (a lot bigger).
You may have a more complex logarithmic function, eg f(x) = log(x + 7) + 3456. The pattern still holds though, just less accurately. So what I did was:
1. Attempt to calculate the constant value, by using x = 1.
2. Find the difference f(x^n) - f(x) * n.
3. Find the absolute difference of f((x*100)^n) - f(100x) * n.
If (3)/(2) is less than 10, it is almost certainly not linear, and probably logarithmic. The 10 is just an arbitrary number. Most linear functions will be different by a factor of more than a billion. Even functions like sqrt(x) will have a bigger difference than 10.
My example code will just have the score method take a parameter, and test against that (to keep it simple + I don't have your supporting code).
require 'rspec'
require 'rspec/mocks/standalone'

def score(input)
  Math.log2(input * 3 + 1000 * 3) * 3 + 100 + Math.sin(input)
end

describe "score" do
  it "grows logarithmically based on input" do
    x = 50
    n = 8
    c = score(1)
    result1 = (score(x ** n) - c) / ((score(x) - c) * n)
    x *= 100
    result2 = (score(x ** n) - c) / ((score(x) - c) * n)
    (result2 / result1).abs.should be < 10
  end
end
Though I've almost forgotten my advanced math, it seems that can't stop me from answering the question. My suggestion is as follows:
describe "#score" do
it "grows logarithmically" do
round_1 = FactoryGirl.create_list(:bet, 10, value: 5).score
round_2 = FactoryGirl.create_list(:bet, 11, value: 5).score
# Then expect some math relation between round_1 and round_2,
# calculated by you manually.
end
end
If you need to treat this function like a black box, the only true solution is to get a bunch of values and see if their curve is well-approximated by a logarithmic curve, focusing on large n. If you could put reasonable bounds on it, like a*log(n) < score(n) < b*log(n) for some values a and b, then you could just check that.
Generally speaking, the best way to see if a function grows is to plot some data on a graph. Just use some graph plotting gem and evaluate the result.
A logarithmic function will always look like this:
[Image of a typical logarithmic curve omitted; source: sosmath.com]
You can then adjust how fast it grows through your parameters, and replot the graph, until you find yourself happy with the result.
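If you want something more quantitative than eyeballing a plot, one hedged option (a sketch of my own, in Python rather than Ruby, and not part of the answers above) is to fit score(n) against a*log(n) + b by least squares and check that the worst residual stays small relative to the scores:

import numpy as np

def looks_logarithmic(score, ns, rel_tol=0.05):
    """Fit score(n) ~ a*log(n) + b by least squares and check the residuals."""
    ys = np.array([score(n) for n in ns], dtype=float)
    xs = np.log(np.array(ns, dtype=float))
    a, b = np.polyfit(xs, ys, 1)
    worst = np.abs(ys - (a * xs + b)).max()
    return worst <= rel_tol * np.abs(ys).max()

# e.g. looks_logarithmic(lambda n: 3 * np.log2(3 * n + 3000) + 100,
#                        [10**k for k in range(3, 10)])   # True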

Fast solution to Subset sum

Consider this way of solving the Subset sum problem:
def subset_summing_to_zero (activities):
    subsets = {0: []}
    for (activity, cost) in activities.iteritems():
        old_subsets = subsets
        subsets = {}
        for (prev_sum, subset) in old_subsets.iteritems():
            subsets[prev_sum] = subset
            new_sum = prev_sum + cost
            new_subset = subset + [activity]
            if 0 == new_sum:
                new_subset.sort()
                return new_subset
            else:
                subsets[new_sum] = new_subset
    return []
I have it from here:
http://news.ycombinator.com/item?id=2267392
There is also a comment which says that it is possible to make it "more efficient".
How?
Also, are there any other ways to solve the problem which are at least as fast as the one above?
Edit
I'm interested in any kind of idea which would lead to speed-up. I found:
https://en.wikipedia.org/wiki/Subset_sum_problem#cite_note-Pisinger09-2
which mentions a linear time algorithm. But I don't have the paper, perhaps you, dear people, know how it works? An implementation perhaps? Completely different approach perhaps?
Edit 2
There is now a follow-up:
Fast solution to Subset sum algorithm by Pisinger
I respect the alacrity with which you're trying to solve this problem! Unfortunately, you're trying to solve a problem that's NP-complete, meaning that any further improvement that breaks the polynomial time barrier will prove that P = NP.
The implementation you pulled from Hacker News appears to be consistent with the pseudo-polytime dynamic programming solution, where any additional improvements must, by definition, progress the state of current research into this problem and all of its algorithmic isoforms. In other words: while a constant speedup is possible, you're very unlikely to see an algorithmic improvement to this solution to the problem in the context of this thread.
However, you can use an approximate algorithm if you require a polytime solution with a tolerable degree of error. In pseudocode blatantly stolen from Wikipedia, this would be:
initialize a list S to contain one element 0.
for each i from 1 to N do
    let T be a list consisting of xi + y, for all y in S
    let U be the union of T and S
    sort U
    make S empty
    let y be the smallest element of U
    add y to S
    for each element z of U in increasing order do
        // trim the list by eliminating numbers close to one another
        // and throw out elements greater than s
        if y + cs/N < z ≤ s, set y = z and add z to S
if S contains a number between (1 − c)s and s, output yes, otherwise no
Python implementation, preserving the original terms as closely as possible:
from bisect import bisect

def ssum(X, c, s):
    """ Simple impl. of the polytime approximate subset sum algorithm
    Returns True if the subset exists within our given error; False otherwise
    """
    S = [0]
    N = len(X)
    for xi in X:
        T = [xi + y for y in S]
        U = set().union(T, S)
        U = sorted(U)  # Coercion to list
        S = []
        y = U[0]
        S.append(y)
        for z in U:
            if y + (c*s)/N < z and z <= s:
                y = z
                S.append(z)
    if not c:  # For zero error, check equivalence
        return S[bisect(S, s) - 1] == s
    return bisect(S, (1-c)*s) != bisect(S, s)
... where X is your bag of terms, c is your precision (between 0 and 1), and s is the target sum.
For more details, see the Wikipedia article.
(Additional reference, further reading on CSTheory.SE)
While my previous answer describes the polytime approximate algorithm to this problem, a request was specifically made for an implementation of Pisinger's polytime dynamic programming solution when all xi in x are positive:
from bisect import bisect

def balsub(X, c):
    """ Simple impl. of Pisinger's generalization of KP for subset sum problems
    satisfying xi >= 0, for all xi in X. Returns the state array "st", which may
    be used to determine if an optimal solution exists to this subproblem of SSP.
    """
    if not X:
        return False
    X = sorted(X)
    n = len(X)
    b = bisect(X, c)
    r = X[-1]
    w_sum = sum(X[:b])
    stm1 = {}
    st = {}
    for u in range(c - r + 1, c + 1):
        stm1[u] = 0
    for u in range(c + 1, c + r + 1):
        stm1[u] = 1
    stm1[w_sum] = b
    for t in range(b, n + 1):
        for u in range(c - r + 1, c + r + 1):
            st[u] = stm1[u]
        for u in range(c - r + 1, c + 1):
            u_tick = u + X[t - 1]
            st[u_tick] = max(st[u_tick], stm1[u])
        for u in reversed(range(c + 1, c + X[t - 1] + 1)):
            for j in reversed(range(stm1[u], st[u])):
                u_tick = u - X[j - 1]
                st[u_tick] = max(st[u_tick], j)
    return st
Wow, that was headache-inducing. This needs proofreading, because, while it implements balsub, I can't define the right comparator to determine if the optimal solution to this subproblem of SSP exists.
I don't know much python, but there is an approach called meet in the middle.
Pseudocode:
Divide activities into two subarrays, A1 and A2
for both A1 and A2, calculate subsets hashes, H1 and H2, the way you do it in your question.
for each (cost, a1) in H1
    if (H2.contains(-cost))
        return a1 + H2[-cost];
This will allow you to double the number of elements of activities you can handle in reasonable time.
I apologize for "discussing" the problem, but a "Subset Sum" problem where the x values are bounded is not the NP version of the problem. Dynamic programming solutions are known for bounded-x-value problems. That is done by representing the x values as the sum of unit lengths. The dynamic programming solutions have a number of fundamental iterations that is linear in that total length of the x's. However, Subset Sum is in NP when the precision of the numbers equals N, that is, when the number of base-2 place values needed to state the x's is = N. For N = 40, the x's have to be in the billions. In the NP problem the unit length of the x's increases exponentially with N. That is why the dynamic programming solutions are not a polynomial time solution to the NP Subset Sum problem. That being the case, there are still practical instances of the Subset Sum problem where the x's are bounded and the dynamic programming solution is valid.
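As a hedged illustration of that bounded case (my addition, not from the answer above), here is a minimal sketch of the textbook pseudo-polynomial dynamic programming over achievable sums; its running time grows with the range of possible sums rather than with 2**n:

def bounded_subset_sum(xs, target):
    """Return True if some subset of the non-negative integers xs sums to target.

    Pseudo-polynomial DP: O(len(xs) * target) time, practical only when the
    values (and hence target) are bounded.
    """
    reachable = [False] * (target + 1)
    reachable[0] = True                      # the empty subset
    for x in xs:
        # iterate downwards so each x is used at most once
        for s in range(target, x - 1, -1):
            if reachable[s - x]:
                reachable[s] = True
    return reachable[target]

# bounded_subset_sum([3, 34, 4, 12, 5, 2], 9)   # True (4 + 5)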
Here are three ways to make the code more efficient:
1. The code stores a list of activities for each partial sum. It is more efficient in terms of both memory and time to just store the most recent activity needed to make the sum, and work out the rest by backtracking once a solution is found.
2. For each activity the dictionary is repopulated with the old contents (subsets[prev_sum] = subset). It is faster to simply grow a single dictionary.
3. Splitting the values in two and applying a meet in the middle approach.
Applying the first two optimisations results in the following code which is more than 5 times faster:
def subset_summing_to_zero2 (activities):
    subsets = {0: -1}
    for (activity, cost) in activities.iteritems():
        for prev_sum in subsets.keys():
            new_sum = prev_sum + cost
            if 0 == new_sum:
                new_subset = [activity]
                while prev_sum:
                    activity = subsets[prev_sum]
                    new_subset.append(activity)
                    prev_sum -= activities[activity]
                return sorted(new_subset)
            if new_sum in subsets: continue
            subsets[new_sum] = activity
    return []
Also applying the third optimisation results in something like:
def subset_summing_to_zero3 (activities):
    A = activities.items()
    mid = len(A) // 2

    def make_subsets(A):
        subsets = {0: -1}
        for (activity, cost) in A:
            for prev_sum in subsets.keys():
                new_sum = prev_sum + cost
                if new_sum and new_sum in subsets: continue
                subsets[new_sum] = activity
        return subsets

    subsets = make_subsets(A[:mid])
    subsets2 = make_subsets(A[mid:])

    def follow_trail(new_subset, subsets, s):
        while s:
            activity = subsets[s]
            new_subset.append(activity)
            s -= activities[activity]

    new_subset = []
    for s in subsets:
        if -s in subsets2:
            follow_trail(new_subset, subsets, s)
            follow_trail(new_subset, subsets2, -s)
            if len(new_subset):
                break
    return sorted(new_subset)
Define bound to be the largest absolute value of the elements.
The algorithmic benefit of the meet in the middle approach depends a lot on bound.
For a low bound (e.g. bound=1000 and n=300) the meet in the middle only gets a factor of about 2 improvement over the first improved method. This is because the dictionary called subsets is densely populated.
However, for a high bound (e.g. bound=100,000 and n=30) the meet in the middle takes 0.03 seconds compared to 2.5 seconds for the first improved method (and 18 seconds for the original code)
For high bounds, the meet in the middle will take about the square root of the number of operations of the normal method.
It may seem surprising that meet in the middle is only twice as fast for low bounds. The reason is that the number of operations in each iteration depends on the number of keys in the dictionary. After adding k activities we might expect there to be 2**k keys, but if bound is small then many of these keys will collide, so we will only have O(bound*k) keys instead.
Thought I'd share my Scala solution for the discussed pseudo-polytime algorithm described on Wikipedia. It's a slightly modified version: it figures out how many unique subsets there are. This is very much related to a HackerRank problem described at https://www.hackerrank.com/challenges/functional-programming-the-sums-of-powers. Coding style might not be excellent, I'm still learning Scala :) Maybe this is still helpful for someone.
import java.io.ByteArrayInputStream

object Solution extends App {
  var input = "1000\n2"
  System.setIn(new ByteArrayInputStream(input.getBytes()))
  println(calculateNumberOfWays(readInt, readInt))

  def calculateNumberOfWays(X: Int, N: Int) = {
    val maxValue = Math.pow(X, 1.0 / N).toInt
    val listOfValues = (1 until maxValue + 1).toList
    val listOfPowers = listOfValues.map(value => Math.pow(value, N).toInt)
    val lists = (0 until maxValue).toList.foldLeft(List(List(0)): List[List[Int]])((newList, i) =>
      newList :+ (newList.last union (newList.last.map(y => y + listOfPowers.apply(i)).filter(z => z <= X)))
    )
    lists.last.count(_ == X)
  }
}

How can I efficiently calculate the binomial cumulative distribution function?

Let's say that I know the probability of a "success" is P. I run the test N times, and I see S successes. The test is akin to tossing an unevenly weighted coin (perhaps heads is a success, tails is a failure).
I want to know the approximate probability of seeing either S successes, or a number of successes less likely than S successes.
So for example, if P is 0.3, N is 100, and I get 20 successes, I'm looking for the probability of getting 20 or fewer successes.
If, on the other hand, P is 0.3, N is 100, and I get 40 successes, I'm looking for the probability of getting 40 or more successes.
I'm aware that this problem relates to finding the area under a binomial curve, however:
My math-fu is not up to the task of translating this knowledge into efficient code
While I understand a binomial curve would give an exact result, I get the impression that it would be inherently inefficient. A fast method to calculate an approximate result would suffice.
I should stress that this computation has to be fast, and should ideally be determinable with standard 64 or 128 bit floating point computation.
I'm looking for a function that takes P, S, and N - and returns a probability. As I'm more familiar with code than mathematical notation, I'd prefer that any answers employ pseudo-code or code.
Exact Binomial Distribution
def factorial(n):
    if n < 2: return 1
    return reduce(lambda x, y: x*y, xrange(2, int(n)+1))

def prob(s, p, n):
    x = 1.0 - p
    a = n - s
    b = s + 1
    c = a + b - 1
    prob = 0.0
    for j in xrange(a, c + 1):
        prob += factorial(c) / (factorial(j)*factorial(c-j)) \
            * x**j * (1 - x)**(c-j)
    return prob
>>> prob(20, 0.3, 100)
0.016462853241869437
>>> 1-prob(40-1, 0.3, 100)
0.020988576003924564
Normal Estimate, good for large n
import math

def erf(z):
    t = 1.0 / (1.0 + 0.5 * abs(z))
    # use Horner's method
    ans = 1 - t * math.exp(-z*z - 1.26551223 +
            t * (1.00002368 +
            t * (0.37409196 +
            t * (0.09678418 +
            t * (-0.18628806 +
            t * (0.27886807 +
            t * (-1.13520398 +
            t * (1.48851587 +
            t * (-0.82215223 +
            t * (0.17087277))))))))))
    if z >= 0.0:
        return ans
    else:
        return -ans

def normal_estimate(s, p, n):
    u = n * p
    o = (u * (1 - p)) ** 0.5
    return 0.5 * (1 + erf((s - u) / (o * 2 ** 0.5)))
>>> normal_estimate(20, 0.3, 100)
0.014548164531920815
>>> 1-normal_estimate(40-1, 0.3, 100)
0.024767304545069813
Poisson Estimate: Good for large n and small p
import math

def poisson(s, p, n):
    L = n * p
    sum = 0
    for i in xrange(0, s + 1):
        sum += L**i / factorial(i)
    return sum * math.e**(-L)
>>> poisson(20, 0.3, 100)
0.013411150012837811
>>> 1-poisson(40-1, 0.3, 100)
0.046253037645840323
I was on a project where we needed to be able to calculate the binomial CDF in an environment that didn't have a factorial or gamma function defined. It took me a few weeks, but I ended up coming up with the following algorithm which calculates the CDF exactly (i.e. no approximation necessary). Python is basically as good as pseudocode, right?
import numpy as np

def binomial_cdf(x, n, p):
    cdf = 0
    b = 0
    for k in range(x + 1):
        if k > 0:
            b += np.log(n - k + 1) - np.log(k)
        log_pmf_k = b + k * np.log(p) + (n - k) * np.log(1 - p)
        cdf += np.exp(log_pmf_k)
    return cdf
Performance scales with x. For small values of x, this solution is about an order of magnitude faster than scipy.stats.binom.cdf, with similar performance at around x=10,000.
I won't go into a full derivation of this algorithm because stackoverflow doesn't support MathJax, but the thrust of it is first identifying the following equivalence:
For all k > 0, sp.misc.comb(n,k) == np.prod([(n-i+1)/i for i in range(1,k+1)])
Which we can rewrite as:
sp.misc.comb(n,k) == sp.misc.comb(n,k-1) * (n-k+1)/k
or in log space:
np.log( sp.misc.comb(n,k) ) == np.log(sp.misc.comb(n,k-1)) + np.log(n-k+1) - np.log(k)
Because the CDF is a summation of PMFs, we can use this formulation to calculate the binomial coefficient (the log of which is b in the function above) for PMF_{x=i} from the coefficient we calculated for PMF_{x=i-1}. This means we can do everything inside a single loop using accumulators, and we don't need to calculate any factorials!
The reason most of the calculations are done in log space is to improve the numerical stability of the polynomial terms, i.e. p^x and (1-p)^(n-x) have the potential to be extremely large or extremely small, which can cause computational errors.
EDIT: Is this a novel algorithm? I've been poking around on and off since before I posted this, and I'm increasingly wondering if I should write this up more formally and submit it to a journal.
I think you want to evaluate the incomplete beta function.
There's a nice implementation using a continued fraction representation in "Numerical Recipes In C", chapter 6: 'Special Functions'.
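If SciPy is available, a hedged sketch of that route (my addition, not part of the answer above) is to call scipy.special.betainc directly, using the identity P(X <= s) = I_{1-p}(n - s, s + 1), where I is the regularized incomplete beta function:

from scipy.special import betainc

def binom_cdf_via_beta(s, n, p):
    """P(X <= s) for X ~ Binomial(n, p), via the regularized incomplete beta.

    Assumes 0 <= s < n; for s >= n the answer is simply 1.
    """
    return betainc(n - s, s + 1, 1 - p)

# binom_cdf_via_beta(20, 100, 0.3)   # ~0.01646, matching the exact sum above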
I can't totally vouch for the efficiency, but Scipy has a module for this
from scipy.stats.distributions import binom
binom.cdf(successes, attempts, chance_of_success_per_attempt)
An efficient and, more importantly, numerically stable algorithm exists in the domain of Bezier curves used in Computer Aided Design. It is called de Casteljau's algorithm, used to evaluate the Bernstein polynomials that define Bezier curves.
I believe that I am only allowed one link per answer so start with Wikipedia - Bernstein Polynomials
Notice the very close relationship between the Binomial Distribution and the Bernstein Polynomials. Then click through to the link on de Casteljau's algorithm.
Let's say I know the probability of throwing a heads with a particular coin is P. What is the probability of me throwing the coin T times and getting at least S heads?
Set n = T
Set beta[i] = 0 for i = 0, ... S - 1
Set beta[i] = 1 for i = S, ... T
Set t = p
Evaluate B(t) using de Casteljau
or at most S heads?
Set n = T
Set beta[i] = 1 for i = 0, ... S
Set beta[i] = 0 for i = S + 1, ... T
Set t = p
Evaluate B(t) using de Casteljau
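A hedged sketch of that recipe (mine, not taken from the answer or the toolkits it mentions): the Bernstein coefficients beta are set to 0/1 as described, and B(P) is evaluated with de Casteljau's recurrence, which only ever forms convex combinations and is therefore numerically stable:

def de_casteljau(beta, t):
    """Evaluate the Bernstein polynomial with coefficients beta at t."""
    b = list(beta)
    n = len(b) - 1
    for r in range(1, n + 1):
        for i in range(n - r + 1):
            b[i] = (1 - t) * b[i] + t * b[i + 1]
    return b[0]

def prob_at_least_s_heads(P, T, S):
    """P(at least S heads in T tosses), following the beta setup above."""
    beta = [0.0] * S + [1.0] * (T - S + 1)   # beta[i] = 1 for i = S .. T
    return de_casteljau(beta, P)

# prob_at_least_s_heads(0.3, 100, 40)   # ~0.021, matching 1 - prob(39, 0.3, 100) above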
Open source code probably exists already. NURBS Curves (Non-Uniform Rational B-spline Curves) are a generalization of Bezier Curves and are widely used in CAD. Try openNurbs (the license is very liberal) or failing that Open CASCADE (a somewhat less liberal and opaque license). Both toolkits are in C++, though, IIRC, .NET bindings exist.
If you are using Python, no need to code it yourself. Scipy got you covered:
from scipy.stats import binom
# probability that you get 20 or less successes out of 100, when p=0.3
binom.cdf(20, 100, 0.3)
>>> 0.016462853241869434
# probability that you get exactly 20 successes out of 100, when p=0.3
binom.pmf(20, 100, 0.3)
>>> 0.0075756449257260777
From the portion of your question "getting at least S heads" you want the cumulative binomial distribution function. See http://en.wikipedia.org/wiki/Binomial_distribution for the equation, which is described as being in terms of the "regularized incomplete beta function" (as already answered). If you just want to calculate the answer without having to implement the entire solution yourself, the GNU Scientific Library provides the functions gsl_cdf_binomial_P and gsl_cdf_binomial_Q.
The DCDFLIB Project has C# functions (wrappers around C code) to evaluate many CDF functions, including the binomial distribution. You can find the original C and FORTRAN code here. This code is well tested and accurate.
If you want to write your own code to avoid being dependent on an external library, you could use the normal approximation to the binomial mentioned in other answers. Here are some notes on how good the approximation is under various circumstances. If you go that route and need code to compute the normal CDF, here's Python code for doing that. It's only about a dozen lines of code and could easily be ported to any other language. But if you want high accuracy and efficient code, you're better off using third party code like DCDFLIB. Several man-years went into producing that library.
Try this one, used in GMP. Another reference is this.
import numpy as np

np.random.seed(1)
# 20 flips of a coin with 0.6 probability of heads, repeated 10000 times
x = np.random.binomial(20, 0.6, 10000)
sum(x > 12) / len(x)

The output shows that in about 41% of the runs we got more than 12 heads.

"Approximate" greatest common divisor

Suppose you have a list of floating point numbers that are approximately multiples of a common quantity, for example
2.468, 3.700, 6.1699
which are approximately all multiples of 1.234. How would you characterize this "approximate gcd", and how would you proceed to compute or estimate it?
Strictly related to my answer to this question.
You can run Euclid's gcd algorithm with anything smaller than 0.01 (or a small number of your choice) being a pseudo 0. With your numbers:
3.700 = 1 * 2.468 + 1.232,
2.468 = 2 * 1.232 + 0.004.
So the pseudo gcd of the first two numbers is 1.232. Now you take the gcd of this with your last number:
6.1699 = 5 * 1.232 + 0.0099.
So 1.232 is the pseudo gcd, and the multiples are 2, 3, 5. To improve this result, you may take the linear regression on the data points:
(2,2.468), (3,3.7), (5,6.1699).
The slope is the improved pseudo gcd.
Caveat: the first part of this algorithm is numerically unstable - if you start with very dirty data, you are in trouble.
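A hedged Python sketch of that recipe (my addition; the function names and the 0.01 threshold are illustrative):

def approx_gcd(a, b, eps=0.01):
    """Euclid's algorithm with any remainder below eps treated as zero."""
    while b > eps:
        a, b = b, a % b
    return a

def approx_gcd_list(values, eps=0.01):
    g = values[0]
    for v in values[1:]:
        g = approx_gcd(g, v, eps)
    return g

# g = approx_gcd_list([2.468, 3.700, 6.1699])                 # ~1.232
# multiples = [round(v / g) for v in [2.468, 3.700, 6.1699]]  # [2, 3, 5]
# A least-squares slope through the (multiple, value) points then refines g.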
Express your measurements as multiples of the lowest one. Thus your list becomes 1.00000, 1.49919, 2.49996. The fractional parts of these values will be very close to 1/Nths, for some value of N dictated by how close your lowest value is to the fundamental frequency. I would suggest looping through increasing N until you find a sufficiently refined match. In this case, for N=1 (that is, assuming X=2.468 is your fundamental frequency) you would find a standard deviation of 0.3333 (two of the three values are .5 off of X * 1), which is unacceptably high. For N=2 (that is, assuming 2.468/2 is your fundamental frequency) you would find a standard deviation of virtually zero (all three values are within .001 of a multiple of X/2), thus 2.468/2 is your approximate GCD.
The major flaw in my plan is that it works best when the lowest measurement is the most accurate, which is likely not the case. This could be mitigated by performing the entire operation multiple times, discarding the lowest value on the list of measurements each time, then using the list of results of each pass to determine a more precise result. Another way to refine the results would be to adjust the GCD to minimize the standard deviation between integer multiples of the GCD and the measured values.
This reminds me of the problem of finding good rational-number approximations of real numbers. The standard technique is a continued-fraction expansion:
def rationalizations(x):
    assert 0 <= x
    ix = int(x)
    yield ix, 1
    if x == ix: return
    for numer, denom in rationalizations(1.0/(x - ix)):
        yield denom + ix * numer, numer
We could apply this directly to Jonathan Leffler's and Sparr's approach:
>>> a, b, c = 2.468, 3.700, 6.1699
>>> b/a, c/a
(1.4991896272285252, 2.4999594813614263)
>>> list(itertools.islice(rationalizations(b/a), 3))
[(1, 1), (3, 2), (925, 617)]
>>> list(itertools.islice(rationalizations(c/a), 3))
[(2, 1), (5, 2), (30847, 12339)]
picking off the first good-enough approximation from each sequence. (3/2 and 5/2 here.) Or instead of directly comparing 3.0/2.0 to 1.499189..., you could notice that 925/617 uses much larger integers than 3/2, making 3/2 an excellent place to stop.
It shouldn't much matter which of the numbers you divide by. (Using a/b and c/b you get 2/3 and 5/3, for instance.) Once you have integer ratios, you could refine the implied estimate of the fundamental using shsmurfy's linear regression. Everybody wins!
I'm assuming all of your numbers are multiples of integer values. For the rest of my explanation, A will denote the "root" frequency you are trying to find and B will be an array of the numbers you have to start with.
What you are trying to do is superficially similar to linear regression. You are trying to find a linear model y=mx+b that minimizes the average distance between a linear model and a set of data. In your case, b=0, m is the root frequency, and y represents the given values. The biggest problem is that the independent variables X are not explicitly given. The only thing we know about X is that all of its members must be integers.
Your first task is trying to determine these independent variables. The best method I can think of at the moment assumes that the given frequencies have nearly consecutive indexes (x_1 = x_0 + n). So B_0/B_1 = (x_0)/(x_0 + n) for a (hopefully) small integer n. You can then take advantage of the fact that x_0 = n*B_0/(B_1 - B_0), start with n = 1, and keep ratcheting it up until the computed x_0 is within a certain threshold of an integer. After you have x_0 (the initial index), you can approximate the root frequency (A = B_0/x_0). Then you can approximate the other indexes by finding x_n = rnd(B_n/A). This method is not very robust and will probably fail if the error in the data is large.
If you want a better approximation of the root frequency A, you can use linear regression to minimize the error of the linear model now that you have the corresponding dependent variables. The easiest method to do so uses least squares fitting. Wolfram's Mathworld has a in-depth mathematical treatment of the issue, but a fairly simple explanation can be found with some googling.
Interesting question...not easy.
I suppose I would look at the ratios of the sample values:
3.700 / 2.468 = 1.499...
6.1699 / 2.468 = 2.4999...
6.1699 / 3.700 = 1.6675...
And I'd then be looking for a simple ratio of integers in those results.
1.499 ~= 3/2
2.4999 ~= 5/2
1.6675 ~= 5/3
I haven't chased it through, but somewhere along the line, you decide that an error of 1:1000 or something is good enough, and you back-track to find the base approximate GCD.
The solution which I've seen and used myself is to choose some constant, say 1000, multiply all numbers by this constant, round them to integers, find the GCD of these integers using the standard algorithm and then divide the result by the said constant (1000). The larger the constant, the higher the precision.
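A hedged sketch of that approach (my addition; the scale of 1000, the function name, and the example values are illustrative):

from functools import reduce
from math import gcd

def approx_gcd_by_scaling(values, scale=1000):
    """Scale to integers, take the exact integer gcd, then scale back down.

    Assumes the inputs are multiples of the answer to within the chosen
    precision; with noisier data the integer gcd collapses towards 1/scale.
    """
    ints = [round(v * scale) for v in values]
    return reduce(gcd, ints) / scale

# approx_gcd_by_scaling([2.468, 3.702, 6.170])   # 1.234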
This is a reformulation of shsmurfy's solution when you a priori choose 3 positive tolerances (e1, e2, e3).
The problem is then to search smallest positive integers (n1,n2,n3) and thus largest root frequency f such that:
f1 = n1*f +/- e1
f2 = n2*f +/- e2
f3 = n3*f +/- e3
We assume 0 <= f1 <= f2 <= f3
If we fix n1, then we get these relations:
f is in interval I1=[(f1-e1)/n1 , (f1+e1)/n1]
n2 is in interval I2=[n1*(f2-e2)/(f1+e1) , n1*(f2+e2)/(f1-e1)]
n3 is in interval I3=[n1*(f3-e3)/(f1+e1) , n1*(f3+e3)/(f1-e1)]
We start with n1 = 1, then increment n1 until the intervals I2 and I3 each contain an integer - that is, until floor(I2max) differs from floor(I2min), and the same for I3.
We then choose smallest integer n2 in interval I2, and smallest integer n3 in interval I3.
Assuming normal distribution of floating point errors, the most probable estimate of root frequency f is the one minimizing
J = (f1/n1 - f)^2 + (f2/n2 - f)^2 + (f3/n3 - f)^2
That is
f = (f1/n1 + f2/n2 + f3/n3)/3
If there are several integers n2,n3 in intervals I2,I3 we could also choose the pair that minimize the residue
min(J)*3/2=(f1/n1)^2+(f2/n2)^2+(f3/n3)^2-(f1/n1)*(f2/n2)-(f1/n1)*(f3/n3)-(f2/n2)*(f3/n3)
Another variant could be to continue iteration and try to minimize another criterium like min(J(n1))*n1, until f falls below a certain frequency (n1 reaches an upper limit)...
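A hedged sketch of that search (my addition; it stops at the smallest n1 whose two intervals both contain an integer and returns the averaged estimate of f together with the multipliers):

import math

def approx_gcd_with_tolerances(f1, f2, f3, e1, e2, e3, n1_max=10000):
    """Search for the smallest (n1, n2, n3) with fi ~ ni*f within tolerance ei.

    Assumes 0 <= f1 <= f2 <= f3, as in the description above.
    """
    for n1 in range(1, n1_max + 1):
        # Intervals I2 and I3 implied by this choice of n1
        i2_lo, i2_hi = n1 * (f2 - e2) / (f1 + e1), n1 * (f2 + e2) / (f1 - e1)
        i3_lo, i3_hi = n1 * (f3 - e3) / (f1 + e1), n1 * (f3 + e3) / (f1 - e1)
        n2, n3 = math.ceil(i2_lo), math.ceil(i3_lo)
        if n2 <= i2_hi and n3 <= i3_hi:
            # Most probable root frequency under the least-squares criterion J
            return (f1 / n1 + f2 / n2 + f3 / n3) / 3, (n1, n2, n3)
    return None

# approx_gcd_with_tolerances(2.468, 3.700, 6.1699, 0.01, 0.01, 0.01)
# -> (~1.2338, (2, 3, 5))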
I found this question while looking for answers to mine on Math StackExchange (here and here).
So far I've only managed to measure the appeal of a fundamental frequency given a list of harmonic frequencies (following the sound/music nomenclature), which can be useful if you have a reduced number of options and it is feasible to compute the appeal of each one and then choose the best fit.
C&P from my question on MSE (there the formatting is prettier):
with v being the list {v_1, v_2, ..., v_n}, ordered from lower to higher:
mean_sin(v, x) = sum(sin(2*pi*v_i/x), for i in {1, ...,n})/n
mean_cos(v, x) = sum(cos(2*pi*v_i/x), for i in {1, ...,n})/n
gcd_appeal(v, x) = 1 - sqrt(mean_sin(v, x)^2 + (mean_cos(v, x) - 1)^2)/2, which yields a number in the interval [0,1].
The goal is to find the x that maximizes the appeal. Here is the (gcd_appeal) graph for your example [2.468, 3.700, 6.1699], where you find that the optimum GCD is at x = 1.2337899957639993
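A hedged Python sketch of those formulas (mine; the brute-force grid scan, its range, and its resolution are arbitrary choices - note that the appeal also tends to 1 for x far above all the inputs, so the scan range should not extend far beyond them):

import numpy as np

def gcd_appeal(v, x):
    """gcd_appeal(v, x) = 1 - sqrt(mean_sin^2 + (mean_cos - 1)^2) / 2."""
    v = np.asarray(v, dtype=float)
    mean_sin = np.mean(np.sin(2 * np.pi * v / x))
    mean_cos = np.mean(np.cos(2 * np.pi * v / x))
    return 1 - np.sqrt(mean_sin ** 2 + (mean_cos - 1) ** 2) / 2

def best_gcd(v, lo=0.5, hi=3.0, steps=100000):
    xs = np.linspace(lo, hi, steps)
    appeals = [gcd_appeal(v, x) for x in xs]
    return xs[int(np.argmax(appeals))]

# best_gcd([2.468, 3.700, 6.1699])   # close to the 1.2338 reported above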
Edit:
You may find this Java code handy for calculating the (fuzzy) divisibility (aka gcd_appeal) of a divisor relative to a list of dividends; you can use it to test which of your candidates makes the best divisor. The code looks ugly because I tried to optimize it for performance.
//returns the mean divisibility of dividend/divisor as a value in the range [0 and 1]
// 0 means no divisibility at all
// 1 means full divisibility
public double divisibility(double divisor, double... dividends) {
    double n = dividends.length;
    double factor = 2.0 / divisor;
    double sum_x = -n;
    double sum_y = 0.0;
    double[] coord = new double[2];
    for (double v : dividends) {
        coordinates(v * factor, coord);
        sum_x += coord[0];
        sum_y += coord[1];
    }
    double err = 1.0 - Math.sqrt(sum_x * sum_x + sum_y * sum_y) / (2.0 * n);
    // Might happen due to approximation error
    return err >= 0.0 ? err : 0.0;
}

private void coordinates(double x, double[] out) {
    // Bhaskara performant approximation to
    // out[0] = Math.cos(Math.PI*x);
    // out[1] = Math.sin(Math.PI*x);
    long cos_int_part = (long) (x + 0.5);
    long sin_int_part = (long) x;
    double rem = x - cos_int_part;
    if (cos_int_part != sin_int_part) {
        double common_s = 4.0 * rem;
        double cos_rem_s = common_s * rem - 1.0;
        double sin_rem_s = cos_rem_s + common_s + 1.0;
        out[0] = (((cos_int_part & 1L) * 8L - 4L) * cos_rem_s) / (cos_rem_s + 5.0);
        out[1] = (((sin_int_part & 1L) * 8L - 4L) * sin_rem_s) / (sin_rem_s + 5.0);
    } else {
        double common_s = 4.0 * rem - 4.0;
        double sin_rem_s = common_s * rem;
        double cos_rem_s = sin_rem_s + common_s + 3.0;
        double common_2 = ((cos_int_part & 1L) * 8L - 4L);
        out[0] = (common_2 * cos_rem_s) / (cos_rem_s + 5.0);
        out[1] = (common_2 * sin_rem_s) / (sin_rem_s + 5.0);
    }
}
