Given a set of 2D points (black dots in the picture), I need to choose two lines to somehow represent those points. I probably want to minimize the sum, over all points, of [distance from the point to the closer of the two lines]^2, although if some other metric makes this easier, that is fine too.
The obvious but impractical approach is to run least squares over all 2^n partitions. A more practical approach is probably iterative improvement, maybe starting with a random partition.
Is there any research on algorithms to handle this problem?
I think this can be formulated as an explicit optimization problem:
min sum(j, r1(j)^2 + r2(j)^2) (quadratic)
subject to
r1(j) = (y(j) - a0 - a1*x(j)) * δ(j) (quadratic)
r2(j) = (y(j) - b0 - b1*x(j)) * (1-δ(j)) (quadratic)
δ(j) ∈ {0,1}
We do the assignment of points to a line and the regression (minimization of the sum of the squared residuals) at the same time.
This is a non-convex quadratically constrained mixed-integer quadratic programming problem. Solvers that can handle this include Gurobi, Baron, Couenne, Antigone.
We can reformulate this a bit:
min sum(j, r(j)^2) (convex quadratic)
subject to
r(j) = y(j) - a0 - a1*x(j) + s1(j) (one of these will be relaxed)
r(j) = y(j) - b0 - b1*x(j) + s2(j) (all linear)
-U*δ(j) <= s1(j) <= U*δ(j)
-U*(1-δ(j)) <= s2(j) <= U*(1-δ(j))
δ(j) ∈ {0,1}
s1(j),s2(j) ∈ [-U,U]
U = 1000 (reasonable bound, constant)
This makes it a straight convex MIQP model, which allows more solvers to be used (e.g. Cplex) and is much easier to solve. Some other formulations are here. Some of the models mentioned do not need the bounds I had to use in the big-M formulation above. Note that these models deliver proven optimal solutions (for the non-convex models this would require a global solver; the convex models are easier and don't need this). Furthermore, instead of a least-squares objective, we can also use an L1 objective, in which case we end up with a linear MIP model.
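For illustration only, here is a rough sketch of this big-M model in Python with cvxpy (not part of the original answer); it assumes a mixed-integer-QP-capable solver such as Gurobi or SCIP is installed and reuses the same ad-hoc bound U:

import numpy as np
import cvxpy as cp

def fit_two_lines_miqp(x, y, U=1000.0):
    n = len(x)
    a = cp.Variable(2)                    # intercepts of the two lines
    b = cp.Variable(2)                    # slopes of the two lines
    r = cp.Variable(n)                    # residual charged in the objective
    s1 = cp.Variable(n)                   # slack relaxing the line-1 equation
    s2 = cp.Variable(n)                   # slack relaxing the line-2 equation
    d = cp.Variable(n, boolean=True)      # d[j] = 1: point j belongs to line 2
    cons = [
        r == y - a[0] - b[0] * x + s1,
        r == y - a[1] - b[1] * x + s2,
        cp.abs(s1) <= U * d,              # big-M switches: at most one slack
        cp.abs(s2) <= U * (1 - d),        # per point may be nonzero
    ]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(r)), cons)
    prob.solve()                          # e.g. prob.solve(solver=cp.GUROBI)
    return a.value, b.value, d.value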
A small test confirms this works. The test problem has 50 points and needed 1.25 seconds using Cplex's MIQP solver on a slow laptop. This may be a case of a statistical problem where MIP/MIQP methods have something to offer.
Two related concepts from the literature come to mind.
First, my intuition is that there should be a way to interpret this problem as estimating the parameters of a mixture model. I haven't worked out the details, since parameter estimation typically uses expectation-maximization, but I can describe how the analogous alternating scheme would work (a rough sketch follows): initialize a partition into two parts, then alternately run a regression on each part and reassign points based on their distance to the new regression lines.
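As an illustration only (not part of the original answer), here is a rough numpy sketch of that alternating scheme; it uses vertical residuals and np.polyfit, and is only a local search, so it may need several random restarts:

import numpy as np

def two_lines_alternating(x, y, iters=50, seed=0):
    # Random initial partition, then repeat: fit a line to each part,
    # reassign every point to the line with the smaller vertical residual.
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, 2, size=len(x))
    coeffs = [np.array([0.0, 0.0]), np.array([0.0, 0.0])]
    for _ in range(iters):
        for k in (0, 1):
            m = assign == k
            if m.sum() >= 2:                         # keep old fit if part degenerates
                coeffs[k] = np.polyfit(x[m], y[m], 1)   # slope, intercept
        res = np.stack([np.abs(y - np.polyval(c, x)) for c in coeffs])
        new_assign = res.argmin(axis=0)
        if np.array_equal(new_assign, assign):       # converged
            break
        assign = new_assign
    return coeffs, assign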
Second, assuming that the input is relatively clean, you should be able to get a good initial partition using RANSAC. For some small k, take two disjoint random samples of k points and fit lines through them, then assign every other point to its nearer line. For a (100 − x)% chance of success you'll want to repeat this about ln(100/x) × 2^(2k−1) times and take the best one (a sketch follows).
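A similarly hedged sketch of that RANSAC-style initialization (illustrative only; the value of k, the trial count, and the use of vertical distances are assumptions):

import numpy as np

def ransac_two_lines(x, y, k=2, trials=200, seed=0):
    # Fit lines through two disjoint random k-point samples, score by the
    # sum of squared distances to the nearer line, keep the best pair.
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(trials):
        idx = rng.choice(len(x), size=2 * k, replace=False)
        c1 = np.polyfit(x[idx[:k]], y[idx[:k]], 1)
        c2 = np.polyfit(x[idx[k:]], y[idx[k:]], 1)
        res = np.minimum(np.abs(y - np.polyval(c1, x)),
                         np.abs(y - np.polyval(c2, x)))
        cost = float(np.sum(res ** 2))
        if cost < best_cost:
            best, best_cost = (c1, c2), cost
    return best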
In OPL CPLEX, if I start with the curve-fitting example from Model Building
Let me add a few points to the .dat first
n=24;
x = [1,2,3,4,5,0.0, 0.5, 1.0, 1.5, 1.9, 2.5, 3.0, 3.5, 4.0, 4.5,
5.0, 5.5, 6.0, 6.6, 7.0, 7.6, 8.5, 9.0, 10.0];
y = [10,20,30,40,50,1.0, 0.9, 0.7, 1.5, 2.0, 2.4, 3.2, 2.0, 2.7, 3.5,
1.0, 4.0, 3.6, 2.7, 5.7, 4.6, 6.0, 6.8, 7.3];
then .mod with MIP and absolute value for distance
execute
{
cplex.tilim=10;
}
int n=...;
range points=1..n;
float x[points]=...;
float y[points]=...;
int nbLines=2;
range lines=1..nbLines;
dvar float a[lines];
dvar float b[lines];
// distance between a point and a line
dvar float dist[points][lines];
// minimal distance to the closest line
dvar float dist2[points];
dvar float+ obj;
minimize obj;
subject to
{
obj==sum(i in points) dist2[i];
forall(i in points,j in lines) dist[i][j]==abs(b[j]*x[i]+a[j]-y[i]);
forall(i in points) dist2[i]==min(l in lines ) dist[i][l];
}
// which line for each point ?
int whichline[p in points]=first({l | l in lines : dist2[p]==dist[p][l]});
execute
{
writeln("b = ",b);
writeln("a = ",a);
writeln("which line = ",whichline);
}
gives
b = [0.6375 10]
a = [0.58125 0]
which line = [2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
With a quadratic objective and some reformulation:
int n=...;
range points=1..n;
float x[points]=...;
float y[points]=...;
int nbLines=2;
range lines=1..nbLines;
dvar float a[lines];
dvar float b[lines];
dvar float distance[points][lines];
dvar boolean which[points]; // 1 means line 1, 0 means line 2
minimize sum(i in points,j in lines) distance[i][j]^2;
subject to
{
forall(i in points,j in 0..1) (which[i]==j) => (distance[i][2-j]==b[2-j]*x[i]+a[2-j]-y[i]);
}
execute
{
writeln("b = ",b);
writeln("a = ",a);
writeln("which = ",which);
}
gives
b = [0.61077 10]
a = [0.42613 0]
which = [0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Since this is about remapping a uniform distribution to another with a different range, this is not a PHP question specifically although I am using PHP.
I have a cryptographically secure random number generator that gives me evenly distributed integers (uniform discrete distribution) between 0 and PHP_INT_MAX.
How do I remap these results to fit into a different range in an efficient manner?
Currently I am using $mappedRandomNumber = $randomNumber % ($range + 1) + $min where $range = $max - $min, but that obviously doesn't work, since the first PHP_INT_MAX % $range integers in the range have a higher chance of being picked, breaking the uniformity of the distribution.
Well, having zero knowledge of PHP definitely qualifies me as an expert, so
mentally converting to a float U[0,1)
f = r / PHP_INT_MAX
then doing
mapped = min + f*(max - min)
and going back to integers
mapped = min + (r * max - r * min)/PHP_INT_MAX
If the computation is done with 64-bit math, and PHP_INT_MAX being about 2^31, it should work.
This is what I ended up doing. PRNG 101 (if it does not fit, ignore and generate again). Not very sophisticated, but simple:
public function rand($min = 0, $max = null){
// pow(2,$numBits-1) calculated as (pow(2,$numBits-2)-1) + pow(2,$numBits-2)
// to avoid overflow when $numBits is the number of bits of PHP_INT_MAX
$maxSafe = (int) floor(
((pow(2,8*$this->intByteCount-2)-1) + pow(2,8*$this->intByteCount-2))
/
($max - $min)
) * ($max - $min);
// discards anything above the last interval N * {0 .. max - min -1}
// that fits in {0 .. 2^(intBitCount-1)-1}
do {
$chars = $this->getRandomBytesString($this->intByteCount);
$n = 0;
for ($i=0;$i<$this->intByteCount;$i++) {$n|=(ord($chars[$i])<<(8*($this->intByteCount-$i-1)));}
} while (abs($n)>$maxSafe);
return (abs($n)%($max-$min+1))+$min;
}
Any improvements are welcomed.
(Full code on https://github.com/elcodedocle/cryptosecureprng/blob/master/CryptoSecurePRNG.php)
Here is the sketch how I would do it:
Consider you have uniform random integer distribution in range [A, B) that's what your random number generator provide.
Let L = B - A.
Let P be the highest power of 2 such that P <= L.
Let X be a sample from this range.
First calculate Y = X - A.
If Y >= P, discard it and start with new X until you get an Y that fits.
Now Y contains log2(P) uniformly random bits - zero extend it up to log2(P) bits.
Now we have uniform random bit generator that can be used to provide arbitrary number of random bits as needed.
To generate a number in the target range, let [A_t, B_t) be the target range. Let L_t = B_t - A_t.
Let P_t be the smallest power of 2 such that P_t >= L_t.
Read log2(P_t) random bits and make an integer from it, let's call it X_t.
If X_t >= L_t, discard it and try again until you get a number that fits.
Your random number in the desired range will be X_t + A_t.
Implementation considerations: if your L_t and L are powers of 2, you never have to discard anything. If not, then even in the worst case you should get the right number in less than 2 trials on average.
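Here is a rough Python sketch of that two-stage scheme (illustrative only; rand_source is a hypothetical stand-in for the CSPRNG returning integers in [A, B)):

def bit_stream(rand_source, A, B):
    # Turn uniform integers in [A, B) into a stream of unbiased bits:
    # discard samples >= the largest power of two P <= B - A.
    nbits = (B - A).bit_length() - 1
    P = 1 << nbits
    while True:
        Y = rand_source() - A
        if Y < P:
            for i in range(nbits):
                yield (Y >> i) & 1

def rand_in_range(bits, a_t, b_t):
    # Assemble ceil(log2(L_t)) bits and reject values >= L_t.
    L_t = b_t - a_t
    nbits = (L_t - 1).bit_length()
    while True:
        X_t = 0
        for _ in range(nbits):
            X_t = (X_t << 1) | next(bits)
        if X_t < L_t:
            return a_t + X_t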
I have to do optimization in supervised learning to get my weights.
I have to learn the values (w1,w2,w3,w4) such that whenever my vector A = [a1 a2 a3 a4] has label 1, the sum w1*a1 + w2*a2 + w3*a3 + w4*a4 becomes greater than 0.5, and when its label is -1, it becomes less than 0.5.
Can somebody tell me how I can approach this problem in Matlab? One way that I know is to use evolutionary algorithms, taking a random value vector and then changing it to pick the best n values.
Is there any other way that this can be approached ?
You can do it using linprog.
Let A be a matrix of size n by 4 consisting of all n training 4-vectors you have. You should also have a vector y with n elements (each either plus or minus 1), representing the label of each training 4-vector.
Using A and y we can write a linear program (look at the doc for the names of the parameters I'm using). Now, you do not have an objective function, so you can simply set f to be f = zeros(4,1);.
The only thing you have is an inequality constraint (< a_i , w > - .5) * y_i >= 0 (where <.,.> is a dot-product between 4-vector a_i and weight vector w).
If my calculations are correct, this constraint can be written as
cmat = bsxfun( @times, A, y );
Overall you get
w = linprog( zeros(4,1), -cmat, -.5*y );
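For reference, a hedged Python translation of the same idea using scipy.optimize.linprog (A is the n-by-4 data matrix and y the ±1 label vector; the function name is an assumption):

import numpy as np
from scipy.optimize import linprog

def learn_weights(A, y):
    cmat = A * y[:, None]                      # row i is y_i * a_i
    # Feasibility only: y_i*(a_i . w) >= 0.5*y_i  <=>  -cmat @ w <= -0.5*y
    res = linprog(c=np.zeros(A.shape[1]),
                  A_ub=-cmat, b_ub=-0.5 * y,
                  bounds=[(None, None)] * A.shape[1])  # allow negative weights
    return res.x if res.success else None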
Suppose I have a real number. I want to approximate it with something of the form a+sqrt(b) for integers a and b. But I don't know the values of a and b. Of course I would prefer to get a good approximation with small values of a and b. Let's leave it undefined for now what is meant by "good" and "small". Any sensible definitions of those terms will do.
Is there a sane way to find them? Something like the continued fraction algorithm for finding fractional approximations of decimals. For more on the fractions problem, see here.
EDIT: To clarify, it is an arbitrary real number. All I have are a bunch of its digits. So depending on how good an approximation we want, a and b might or might not exist. The best I can think of would be to start adding integers to my real, squaring the result, and seeing if I come close to an integer; that is pretty much brute force, and not a particularly good algorithm. But if nothing better exists, that would itself be interesting to know.
EDIT: Obviously b has to be zero or positive. But a could be any integer.
No need for continued fractions; just calculate the square-root of all "small" values of b (up to whatever value you feel is still "small" enough), remove everything before the decimal point, and sort/store them all (along with the b that generated it).
Then when you need to approximate a real number, find the radical whose decimal portion is closest to the real number's decimal portion. This gives you b; choosing the correct a is then a simple matter of subtraction.
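A small Python sketch of that table idea (illustrative only; the cut-off bmax is an arbitrary choice):

import math

def build_table(bmax):
    # fractional part of sqrt(b) for every "small" b, paired with the b that made it
    return sorted((math.sqrt(b) % 1.0, b) for b in range(1, bmax + 1))

def approx_a_plus_sqrt_b(r, table):
    frac = r % 1.0
    # closest stored fractional part (distance wraps around at 1)
    _, b = min(table, key=lambda e: min(abs(e[0] - frac), 1 - abs(e[0] - frac)))
    a = round(r - math.sqrt(b))
    return a, b

With a table up to b = 100 and r = pi this picks b = 51 and a = -4, matching the sqrt(51) - 4 approximation mentioned in the next answer.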
This is actually more of a math problem than a computer problem, but to answer the question I think you are right that you can use continued fractions. What you do is first represent the target number as a continued fraction. For example, if you want to approximate pi (3.14159265) then the CF is:
3; 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, ...
The next step is to create a table of CFs for square roots, then compare the values in the table to the fractional part of the target value (here: 7, 15, 1, 292, ...). For example, let's say your table has square roots for 1-99 only. Then you would find that the closest match is sqrt(51), which has a CF of 7; 7, 14 repeating. The 7, 14 is the closest to pi's 7, 15. Thus your answer would be:
sqrt(51)-4
as the closest approximation given b < 100; it is off by about 0.00016. If you allow larger values of b then you could get a better approximation.
The advantage of using CFs is that it is faster than working in, say, doubles or using floating point. For example, in the above case you only have to compare two integers (7 and 15), and you can also use indexing to make finding the closest entry in the table very fast.
This can be done using mixed integer quadratic programming very efficiently (though there are no run-time guarantees, as MIQP is NP-hard in general).
Define:
d := the real number you wish to approximate
b, a := two integers such that a + sqrt(b) is as "close" to d as possible
r := (d - a)^2 - b, is the residual of the approximation
The goal is to minimize r. Setup your quadratic program as:
x := [ s b t ]
D := | 1 0 0 |
| 0 0 0 |
| 0 0 0 |
c := [0 -1 0]^T
with the constraint that s - t = f (where f is the fractional part of d)
and b,t are integers (s is not)
This is a convex (therefore optimally solvable) mixed integer quadratic program since D is positive semi-definite.
Once s, b, t are computed, b is the answer directly and a = d - s; t can be ignored.
Your problem may be NP-complete; it would be interesting to prove whether it is.
Some of the previous answers use methods that are of time or space complexity O(n), where n is the largest “small number” that will be accepted. By contrast, the following method is O(sqrt(n)) in time, and O(1) in space.
Suppose that positive real number r = x + y, where x=floor(r) and 0 ≤ y < 1. We want to approximate r by a number of the form a + √b. If x+y ≈ a+√b then x+y-a ≈ √b, so √b ≈ h+y for some integer offset h, and b ≈ (h+y)^2. To make b an integer, we want to minimize the fractional part of (h+y)^2 over all eligible h. There are at most √n eligible values of h. See following python code and sample output.
import math, random
def findb(y, rhi):
bestb = loerror = 1;
for r in range(2,rhi):
v = (r+y)**2
u = round(v)
err = abs(v-u)
if round(math.sqrt(u))**2 == u: continue
if err < loerror:
bestb, loerror = u, err
return bestb
#random.seed(123456) # set a seed if testing repetitively
f = [math.pi-3] + sorted([random.random() for i in range(24)])
print (' frac sqrt(b) error b')
for frac in f:
b = findb(frac, 12)
r = math.sqrt(b)
t = math.modf(r)[0] # Get fractional part of sqrt(b)
print ('{:9.5f} {:9.5f} {:11.7f} {:5.0f}'.format(frac, r, t-frac, b))
(Note 1: This code is in demo form; the parameters to findb() are y, the fractional part of r, and rhi, the square root of the largest small number. You may wish to change usage of parameters. Note 2: The
if round(math.sqrt(u))**2 == u: continue
line of code prevents findb() from returning perfect-square values of b, except for the value b=1, because no perfect square can improve upon the accuracy offered by b=1.)
Sample output follows. About a dozen lines have been elided in the middle. The first output line shows that this procedure yields b=51 to represent the fractional part of pi, which is the same value reported in some other answers.
frac sqrt(b) error b
0.14159 7.14143 -0.0001642 51
0.11975 4.12311 0.0033593 17
0.12230 4.12311 0.0008085 17
0.22150 9.21954 -0.0019586 85
0.22681 11.22497 -0.0018377 126
0.25946 2.23607 -0.0233893 5
0.30024 5.29150 -0.0087362 28
0.36772 8.36660 -0.0011170 70
0.42452 8.42615 0.0016309 71
...
0.93086 6.92820 -0.0026609 48
0.94677 8.94427 -0.0024960 80
0.96549 11.95826 -0.0072333 143
0.97693 11.95826 -0.0186723 143
With the following code added at the end of the program, the output shown below also appears. This shows closer approximations for the fractional part of pi.
frac, rhi = math.pi-3, 16
print (' frac sqrt(b) error b bMax')
while rhi < 1000:
b = findb(frac, rhi)
r = math.sqrt(b)
t = math.modf(r)[0] # Get fractional part of sqrt(b)
print ('{:11.7f} {:11.7f} {:13.9f} {:7.0f} {:7.0f}'.format(frac, r, t-frac, b,rhi**2))
rhi = 3*rhi/2
frac sqrt(b) error b bMax
0.1415927 7.1414284 -0.000164225 51 256
0.1415927 7.1414284 -0.000164225 51 576
0.1415927 7.1414284 -0.000164225 51 1296
0.1415927 7.1414284 -0.000164225 51 2916
0.1415927 7.1414284 -0.000164225 51 6561
0.1415927 120.1415831 -0.000009511 14434 14641
0.1415927 120.1415831 -0.000009511 14434 32761
0.1415927 233.1415879 -0.000004772 54355 73441
0.1415927 346.1415895 -0.000003127 119814 164836
0.1415927 572.1415909 -0.000001786 327346 370881
0.1415927 911.1415916 -0.000001023 830179 833569
I do not know if there is any kind of standard algorithm for this kind of problem, but it does intrigue me, so here is my attempt at developing an algorithm that finds the needed approximation.
Call the real number in question r. First, since a may be any integer (including negative ones), the problem reduces to finding a b such that the decimal part of sqrt(b) is a good approximation of the decimal part of r. Let us write that decimal part as .y, so we are looking for an integer x and a b with sqrt(b) ≈ x + .y.
Now, for any integer x:
(x + .y)^2 = x^2 + 2 * x * .y + .y^2
           = 2 * x * .y + .y^2 (mod 1)
We now only have to find an x such that 0 = .y^2 + 2 * x * .y (mod 1) (approximately).
Filling that x into the formula above we get b = round((x + .y)^2), and can then calculate a as a = round(r - sqrt(b)). (All of these calculations have to be carefully rounded, of course.)
Now, for the time being I am not sure if there is a way to find this x without brute forcing it. But even then, one can simply use a loop to find an x that is good enough.
I am thinking of something like this (in Python, with r being the real number to approximate):
max_diff_low = 0.01   # arbitrary accuracy
max_diff_high = 1 - max_diff_low
y = r % 1
v = y * y
addend = 2 * y
x = 0
while max_diff_low < v < max_diff_high:
    x += 1
    v = (v + addend) % 1
c = (x + y) ** 2
b = round(c)
a = round(r - (x + y))   # a + sqrt(b) is then approximately r
Now, I think this algorithm is fairly efficient, while even allowing you to specify the desired accuracy of the approximation. One thing that could be done to turn it into an O(1) algorithm is calculating all the x values and putting them into a lookup table. If one only cares about the first three decimal digits of r (for example), the lookup table would only have 1000 values, which is only 4 KB of memory (assuming 32-bit integers are used).
Hope this is helpful at all. If anyone finds anything wrong with the algorithm, please let me know in a comment and I will fix it.
EDIT:
Upon reflection I retract my claim of efficiency. There is in fact as far as I can tell no guarantee that the algorithm as outlined above will ever terminate, and even if it does, it might take a long time to find a very large x that solves the equation adequately.
One could maybe keep track of the best x found so far and relax the accuracy bounds over time to make sure the algorithm terminates quickly, at the possible cost of accuracy.
These problems are of course non-existent, if one simply pre-calculates a lookup table.
Let's say that I know the probability of a "success" is P. I run the test N times, and I see S successes. The test is akin to tossing an unevenly weighted coin (perhaps heads is a success, tails is a failure).
I want to know the approximate probability of seeing either S successes, or a number of successes less likely than S successes.
So for example, if P is 0.3, N is 100, and I get 20 successes, I'm looking for the probability of getting 20 or fewer successes.
If, on the other hand, P is 0.3, N is 100, and I get 40 successes, I'm looking for the probability of getting 40 or more successes.
I'm aware that this problem relates to finding the area under a binomial curve, however:
My math-fu is not up to the task of translating this knowledge into efficient code
While I understand a binomial curve would give an exact result, I get the impression that it would be inherently inefficient. A fast method to calculate an approximate result would suffice.
I should stress that this computation has to be fast, and should ideally be determinable with standard 64 or 128 bit floating point computation.
I'm looking for a function that takes P, S, and N - and returns a probability. As I'm more familiar with code than mathematical notation, I'd prefer that any answers employ pseudo-code or code.
Exact Binomial Distribution
from functools import reduce

def factorial(n):
    if n < 2: return 1
    return reduce(lambda x, y: x*y, range(2, int(n)+1))

def prob(s, p, n):
    x = 1.0 - p
    a = n - s
    b = s + 1
    c = a + b - 1
    prob = 0.0
    for j in range(a, c + 1):
        prob += factorial(c) / (factorial(j)*factorial(c-j)) \
                * x**j * (1 - x)**(c-j)
    return prob
>>> prob(20, 0.3, 100)
0.016462853241869437
>>> 1-prob(40-1, 0.3, 100)
0.020988576003924564
Normal Estimate, good for large n
import math
def erf(z):
t = 1.0 / (1.0 + 0.5 * abs(z))
# use Horner's method
ans = 1 - t * math.exp( -z*z - 1.26551223 +
t * ( 1.00002368 +
t * ( 0.37409196 +
t * ( 0.09678418 +
t * (-0.18628806 +
t * ( 0.27886807 +
t * (-1.13520398 +
t * ( 1.48851587 +
t * (-0.82215223 +
t * ( 0.17087277))))))))))
if z >= 0.0:
return ans
else:
return -ans
def normal_estimate(s, p, n):
u = n * p
o = (u * (1-p)) ** 0.5
return 0.5 * (1 + erf((s-u)/(o*2**0.5)))
>>> normal_estimate(20, 0.3, 100)
0.014548164531920815
>>> 1-normal_estimate(40-1, 0.3, 100)
0.024767304545069813
Poisson Estimate: Good for large n and small p
import math
def poisson(s, p, n):
    L = n * p
    sum = 0
    for i in range(0, s + 1):
        sum += L**i / factorial(i)
    return sum * math.e**(-L)
>>> poisson(20, 0.3, 100)
0.013411150012837811
>>> 1-poisson(40-1, 0.3, 100)
0.046253037645840323
I was on a project where we needed to be able to calculate the binomial CDF in an environment that didn't have a factorial or gamma function defined. It took me a few weeks, but I ended up coming up with the following algorithm which calculates the CDF exactly (i.e. no approximation necessary). Python is basically as good as pseudocode, right?
import numpy as np
def binomial_cdf(x, n, p):
    cdf = 0
    b = 0
    for k in range(x + 1):
        if k > 0:
            b += np.log(n - k + 1) - np.log(k)
        log_pmf_k = b + k * np.log(p) + (n - k) * np.log(1 - p)
        cdf += np.exp(log_pmf_k)
    return cdf
Performance scales with x. For small values of x, this solution is about an order of magnitude faster than scipy.stats.binom.cdf, with similar performance at around x=10,000.
I won't go into a full derivation of this algorithm because stackoverflow doesn't support MathJax, but the thrust of it is first identifying the following equivalence:
For all k > 0, sp.misc.comb(n,k) == np.prod([(n-i+1)/i for i in range(1,k+1)])
Which we can rewrite as:
sp.misc.comb(n,k) == sp.misc.comb(n,k-1) * (n-k+1)/k
or in log space:
np.log( sp.misc.comb(n,k) ) == np.log(sp.misc.comb(n,k-1)) + np.log(n-k+1) - np.log(k)
Because the CDF is a summation of PMFs, we can use this formulation to calculate the binomial coefficient (the log of which is b in the function above) for PMF_{x=i} from the coefficient we calculated for PMF_{x=i-1}. This means we can do everything inside a single loop using accumulators, and we don't need to calculate any factorials!
The reason most of the calculations are done in log space is to improve the numerical stability of the polynomial terms, i.e. p^x and (1-p)^(n-x) have the potential to be extremely large or extremely small, which can cause computational errors.
EDIT: Is this a novel algorithm? I've been poking around on and off since before I posted this, and I'm increasingly wondering if I should write this up more formally and submit it to a journal.
I think you want to evaluate the incomplete beta function.
There's a nice implementation using a continued fraction representation in "Numerical Recipes In C", chapter 6: 'Special Functions'.
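If SciPy happens to be available, here is a minimal sketch using its regularized incomplete beta function (scipy.special.betainc); the standard identity P(X <= s) = I_{1-p}(n-s, s+1) gives the binomial CDF directly:

from scipy.special import betainc

def binom_cdf(s, n, p):
    # P(X <= s) for X ~ Binomial(n, p), via the regularized incomplete beta
    return betainc(n - s, s + 1, 1 - p)

This should reproduce the exact 0.01646... value computed above for s = 20, n = 100, p = 0.3.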
I can't totally vouch for the efficiency, but Scipy has a module for this
from scipy.stats.distributions import binom
binom.cdf(successes, attempts, chance_of_success_per_attempt)
An efficient and, more importantly, numerically stable algorithm exists in the domain of Bezier curves used in computer-aided design. It is called de Casteljau's algorithm, and it is used to evaluate the Bernstein polynomials that define Bezier curves.
I believe that I am only allowed one link per answer so start with Wikipedia - Bernstein Polynomials
Notice the very close relationship between the Binomial Distribution and the Bernstein Polynomials. Then click through to the link on de Casteljau's algorithm.
Let's say I know the probability of throwing heads with a particular coin is P. What is the probability of throwing the coin T times and getting at least S heads?
Set n = T
Set beta[i] = 0 for i = 0, ... S - 1
Set beta[i] = 1 for i = S, ... T
Set t = p
Evaluate B(t) using de Casteljau
or at most S heads?
Set n = T
Set beta[i] = 1 for i = 0, ... S
Set beta[i] = 0 for i = S + 1, ... T
Set t = p
Evaluate B(t) using de Casteljau
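A short Python sketch of that recipe (illustrative only; beta holds the control values described above, and de Casteljau evaluates the Bernstein-form polynomial at t = p):

def de_casteljau(beta, t):
    # Evaluate sum_i beta[i] * C(n,i) * t^i * (1-t)^(n-i) without factorials.
    b = list(beta)
    n = len(b) - 1
    for r in range(1, n + 1):
        for i in range(n - r + 1):
            b[i] = (1 - t) * b[i] + t * b[i + 1]
    return b[0]

def prob_at_most_s_heads(s, n, p):
    beta = [1.0 if i <= s else 0.0 for i in range(n + 1)]
    return de_casteljau(beta, p)

def prob_at_least_s_heads(s, n, p):
    beta = [1.0 if i >= s else 0.0 for i in range(n + 1)]
    return de_casteljau(beta, p)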
Open source code probably exists already. NURBS Curves (Non-Uniform Rational B-spline Curves) are a generalization of Bezier Curves and are widely used in CAD. Try openNurbs (the license is very liberal) or failing that Open CASCADE (a somewhat less liberal and opaque license). Both toolkits are in C++, though, IIRC, .NET bindings exist.
If you are using Python, no need to code it yourself. Scipy got you covered:
from scipy.stats import binom
# probability that you get 20 or less successes out of 100, when p=0.3
binom.cdf(20, 100, 0.3)
>>> 0.016462853241869434
# probability that you get exactly 20 successes out of 100, when p=0.3
binom.pmf(20, 100, 0.3)
>>> 0.0075756449257260777
From the portion of your question "getting at least S heads" you want the cumulative binomial distribution function. See http://en.wikipedia.org/wiki/Binomial_distribution for the equation, which is described as being in terms of the "regularized incomplete beta function" (as already answered). If you just want to calculate the answer without having to implement the entire solution yourself, the GNU Scientific Library provides the functions gsl_cdf_binomial_P and gsl_cdf_binomial_Q.
The DCDFLIB Project has C# functions (wrappers around C code) to evaluate many CDF functions, including the binomial distribution. You can find the original C and FORTRAN code here. This code is well tested and accurate.
If you want to write your own code to avoid being dependent on an external library, you could use the normal approximation to the binomial mentioned in other answers. Here are some notes on how good the approximation is under various circumstances. If you go that route and need code to compute the normal CDF, here's Python code for doing that. It's only about a dozen lines of code and could easily be ported to any other language. But if you want high accuracy and efficient code, you're better off using third party code like DCDFLIB. Several man-years went into producing that library.
Try this one, used in GMP. Another reference is this.
import numpy as np
np.random.seed(1)
x = np.random.binomial(20, 0.6, 10000)  # 10000 runs of 20 coin flips, P(heads) = 0.6
sum(x > 12) / len(x)
The output is about 0.41, i.e. in roughly 41% of the runs we got more than 12 heads.
Suppose you have a list of floating point numbers that are approximately multiples of a common quantity, for example
2.468, 3.700, 6.1699
which are approximately all multiples of 1.234. How would you characterize this "approximate gcd", and how would you proceed to compute or estimate it?
Strictly related to my answer to this question.
You can run Euclid's gcd algorithm with anything smaller than 0.01 (or a small number of your choice) being a pseudo 0. With your numbers:
3.700 = 1 * 2.468 + 1.232,
2.468 = 2 * 1.232 + 0.004.
So the pseudo gcd of the first two numbers is 1.232. Now you take the gcd of this with your last number:
6.1699 = 5 * 1.232 + 0.0099.
So 1.232 is the pseudo gcd, and the multiples are 2, 3, 5. To improve this result, you may take the linear regression on the data points:
(2,2.468), (3,3.7), (5,6.1699).
The slope is the improved pseudo gcd.
Caveat: the first part of this algorithm is numerically unstable - if you start with very dirty data, you are in trouble.
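A rough Python sketch of this recipe, including the regression refinement (the eps threshold is the same arbitrary "pseudo zero" as above):

import numpy as np

def approx_gcd(values, eps=0.01):
    def gcd2(a, b):
        while b > eps:                 # anything below eps acts as zero
            a, b = b, a % b
        return a
    g = values[0]
    for v in values[1:]:
        g = gcd2(g, v)
    multiples = [round(v / g) for v in values]
    slope = np.polyfit(multiples, values, 1)[0]   # refined pseudo gcd
    return slope

# approx_gcd([2.468, 3.700, 6.1699]) -> roughly 1.234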
Express your measurements as multiples of the lowest one. Thus your list becomes 1.00000, 1.49919, 2.49996. The fractional parts of these values will be very close to 1/Nths, for some value of N dictated by how close your lowest value is to the fundamental frequency. I would suggest looping through increasing N until you find a sufficiently refined match. In this case, for N=1 (that is, assuming X=2.468 is your fundamental frequency) you would find a standard deviation of 0.3333 (two of the three values are .5 off of X * 1), which is unacceptably high. For N=2 (that is, assuming 2.468/2 is your fundamental frequency) you would find a standard deviation of virtually zero (all three values are within .001 of a multiple of X/2), thus 2.468/2 is your approximate GCD.
The major flaw in my plan is that it works best when the lowest measurement is the most accurate, which is likely not the case. This could be mitigated by performing the entire operation multiple times, discarding the lowest value on the list of measurements each time, then using the list of results of each pass to determine a more precise result. Another way to refine the results would be to adjust the GCD to minimize the standard deviation between integer multiples of the GCD and the measured values.
This reminds me of the problem of finding good rational-number approximations of real numbers. The standard technique is a continued-fraction expansion:
def rationalizations(x):
assert 0 <= x
ix = int(x)
yield ix, 1
if x == ix: return
for numer, denom in rationalizations(1.0/(x-ix)):
yield denom + ix * numer, numer
We could apply this directly to Jonathan Leffler's and Sparr's approach:
>>> import itertools
>>> a, b, c = 2.468, 3.700, 6.1699
>>> b/a, c/a
(1.4991896272285252, 2.4999594813614263)
>>> list(itertools.islice(rationalizations(b/a), 3))
[(1, 1), (3, 2), (925, 617)]
>>> list(itertools.islice(rationalizations(c/a), 3))
[(2, 1), (5, 2), (30847, 12339)]
picking off the first good-enough approximation from each sequence. (3/2 and 5/2 here.) Or instead of directly comparing 3.0/2.0 to 1.499189..., you could notice that 925/617 uses much larger integers than 3/2, making 3/2 an excellent place to stop.
It shouldn't much matter which of the numbers you divide by. (Using a/b and c/b you get 2/3 and 5/3, for instance.) Once you have integer ratios, you could refine the implied estimate of the fundamental using shsmurfy's linear regression. Everybody wins!
I'm assuming all of your numbers are multiples of integer values. For the rest of my explanation, A will denote the "root" frequency you are trying to find and B will be an array of the numbers you have to start with.
What you are trying to do is superficially similar to linear regression. You are trying to find a linear model y=mx+b that minimizes the average distance between a linear model and a set of data. In your case, b=0, m is the root frequency, and y represents the given values. The biggest problem is that the independent variables X are not explicitly given. The only thing we know about X is that all of its members must be integers.
Your first task is trying to determine these independent variables. The best method I can think of at the moment assumes that the given frequencies have nearly consecutive indexes (x_1 = x_0 + n). So B_0/B_1 = x_0/(x_0 + n) for a (hopefully) small integer n. You can then take advantage of the fact that x_0 = n*B_0/(B_1 - B_0): start with n = 1 and keep ratcheting it up until the computed x_0 is within a certain threshold of an integer. After you have x_0 (the initial index), you can approximate the root frequency (A = B_0/x_0). Then you can approximate the other indexes by finding x_n = rnd(B_n/A). This method is not very robust and will probably fail if the error in the data is large.
If you want a better approximation of the root frequency A, you can use linear regression to minimize the error of the linear model now that you have the corresponding dependent variables. The easiest method to do so uses least-squares fitting. Wolfram's MathWorld has an in-depth mathematical treatment of the issue, but a fairly simple explanation can be found with some googling. A tiny sketch of that step follows.
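A tiny sketch of that final least-squares step (assuming the integer indexes x have already been recovered as described; with b = 0 the model is B_i ≈ A·x_i, so the fit is a regression through the origin):

import numpy as np

def refine_root_frequency(B, x):
    # least-squares A for the through-origin model B ~ A * x
    B, x = np.asarray(B, float), np.asarray(x, float)
    return float(x @ B / (x @ x))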
Interesting question...not easy.
I suppose I would look at the ratios of the sample values:
3.700 / 2.468 = 1.499...
6.1699 / 2.468 = 2.4999...
6.1699 / 3.700 = 1.6675...
And I'd then be looking for a simple ratio of integers in those results.
1.499 ~= 3/2
2.4999 ~= 5/2
1.6675 ~= 5/3
I haven't chased it through, but somewhere along the line, you decide that an error of 1:1000 or something is good enough, and you back-track to find the base approximate GCD.
The solution which I've seen and used myself is to choose some constant, say 1000, multiply all numbers by this constant, round them to integers, find the GCD of these integers using the standard algorithm and then divide the result by the said constant (1000). The larger the constant, the higher the precision.
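A minimal Python sketch of this approach (the constant 1000 is the arbitrary scale from the answer):

from functools import reduce
from math import gcd

def approx_gcd_scaled(values, scale=1000):
    # scale, round to integers, take the exact integer gcd, scale back
    return reduce(gcd, (round(v * scale) for v in values)) / scale

Note that this relies on the rounded integers sharing their common divisor exactly, so the choice of constant has to match the noise level in the data.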
This is a reformulation of shsmurfy's solution where you choose the three positive tolerances (e1, e2, e3) a priori.
The problem is then to search for the smallest positive integers (n1, n2, n3), and thus the largest root frequency f, such that:
f1 = n1*f +/- e1
f2 = n2*f +/- e2
f3 = n3*f +/- e3
We assume 0 <= f1 <= f2 <= f3
If we fix n1, then we get these relations:
f is in interval I1=[(f1-e1)/n1 , (f1+e1)/n1]
n2 is in interval I2=[n1*(f2-e2)/(f1+e1) , n1*(f2+e2)/(f1-e1)]
n3 is in interval I3=[n1*(f3-e3)/(f1+e1) , n1*(f3+e3)/(f1-e1)]
We start with n1 = 1, then increment n1 until the intervals I2 and I3 each contain an integer - that is, until floor(I2max) differs from floor(I2min), and similarly for I3.
We then choose the smallest integer n2 in interval I2, and the smallest integer n3 in interval I3.
Assuming normal distribution of floating point errors, the most probable estimate of root frequency f is the one minimizing
J = (f1/n1 - f)^2 + (f2/n2 - f)^2 + (f3/n3 - f)^2
That is
f = (f1/n1 + f2/n2 + f3/n3)/3
If there are several integers n2, n3 in the intervals I2, I3 we could also choose the pair that minimizes the residue
min(J)*3/2=(f1/n1)^2+(f2/n2)^2+(f3/n3)^2-(f1/n1)*(f2/n2)-(f1/n1)*(f3/n3)-(f2/n2)*(f3/n3)
Another variant could be to continue the iteration and try to minimize another criterion like min(J(n1))*n1, until f falls below a certain frequency (n1 reaches an upper limit)...
I found this question while looking for answers to mine on Math StackExchange (here and here).
So far I've only managed to measure the appeal of a fundamental frequency given a list of harmonic frequencies (following the sound/music nomenclature), which can be useful if you have a reduced number of candidates and it is feasible to compute the appeal of each one and then choose the best fit.
C&P from my question in MSE (there the formatting is prettier):
with v being the list {v_1, v_2, ..., v_n}, ordered from lowest to highest:
mean_sin(v, x) = sum(sin(2*pi*v_i/x), for i in {1, ...,n})/n
mean_cos(v, x) = sum(cos(2*pi*v_i/x), for i in {1, ...,n})/n
gcd_appeal(v, x) = 1 - sqrt(mean_sin(v, x)^2 + (mean_cos(v, x) - 1)^2)/2, which yields a number in the interval [0,1].
The goal is to find the x that maximizes the appeal. For your example [2.468, 3.700, 6.1699], plotting gcd_appeal shows that the optimum GCD is at x = 1.2337899957639993.
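For illustration, here is a small Python sketch (not part of the original answer) that evaluates gcd_appeal on a grid of candidate divisors for that example; the maximum lands near the value quoted above:

import numpy as np

def gcd_appeal(x, v):
    # 1 - half the distance of the mean of exp(2*pi*i*v/x) from the point 1+0i
    ang = 2 * np.pi * np.asarray(v) / x
    return 1 - np.hypot(np.mean(np.sin(ang)), np.mean(np.cos(ang)) - 1) / 2

v = [2.468, 3.700, 6.1699]
xs = np.linspace(1.0, 3.0, 20001)
best = xs[np.argmax([gcd_appeal(x, v) for x in xs])]   # about 1.2338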
Edit:
You may find this Java code handy for calculating the (fuzzy) divisibility (aka gcd_appeal) of a divisor relative to a list of dividends; you can use it to test which of your candidates makes the best divisor. The code looks ugly because I tried to optimize it for performance.
//returns the mean divisibility of dividend/divisor as a value in the range [0 and 1]
// 0 means no divisibility at all
// 1 means full divisibility
public double divisibility(double divisor, double... dividends) {
double n = dividends.length;
double factor = 2.0 / divisor;
double sum_x = -n;
double sum_y = 0.0;
double[] coord = new double[2];
for (double v : dividends) {
coordinates(v * factor, coord);
sum_x += coord[0];
sum_y += coord[1];
}
double err = 1.0 - Math.sqrt(sum_x * sum_x + sum_y * sum_y) / (2.0 * n);
//Might happen due to approximation error
return err >= 0.0 ? err : 0.0;
}
private void coordinates(double x, double[] out) {
//Bhaskara performant approximation to
//out[0] = Math.cos(Math.PI*x);
//out[1] = Math.sin(Math.PI*x);
long cos_int_part = (long) (x + 0.5);
long sin_int_part = (long) x;
double rem = x - cos_int_part;
if (cos_int_part != sin_int_part) {
double common_s = 4.0 * rem;
double cos_rem_s = common_s * rem - 1.0;
double sin_rem_s = cos_rem_s + common_s + 1.0;
out[0] = (((cos_int_part & 1L) * 8L - 4L) * cos_rem_s) / (cos_rem_s + 5.0);
out[1] = (((sin_int_part & 1L) * 8L - 4L) * sin_rem_s) / (sin_rem_s + 5.0);
} else {
double common_s = 4.0 * rem - 4.0;
double sin_rem_s = common_s * rem;
double cos_rem_s = sin_rem_s + common_s + 3.0;
double common_2 = ((cos_int_part & 1L) * 8L - 4L);
out[0] = (common_2 * cos_rem_s) / (cos_rem_s + 5.0);
out[1] = (common_2 * sin_rem_s) / (sin_rem_s + 5.0);
}
}