HMM for solving given coin output

I have got this assignment question on HMM and I have solved it. I would like to know if I am correct. The problem is:
Suppose a dishonest dealer has two coins, one fair and one biased; the biased coin
has heads probability 1/4. Assume that the dealer never switches the coins. Which
coin is more likely to have generated the sequence HTTTHHHTTTTHTHHTT? It may
be useful to know that log2(3) = 1.585
I calculated the P for fair coin and biased coin.
The P for the fair coin is 7.6*10^-6, whereas the P for the biased coin is 3.43*10^-6. I didn't use the log term, which can be used if I solve it the other way. So, I concluded that it is more likely that the given sequence was generated by the fair coin.
Am I right?
Any help is greatly appreciated.

So you are given the following.
P(H|Fake) = 1/4 P(T|Fake) = 3/4
P(H|Fair) = 1/2 P(T|Fair) = 1/2
P(Fair) = 1/2 P(Fake) = 1/2
To answer the question you need to compute P(Fake|HTTTHHHTTTTHTHHTT) and P(Fair|HTTTHHHTTTTHTHHTT), for which you need to apply Bayes' theorem:
Let X be HTTTHHHTTTTHTHHTT
P(Fake|X) = (P(X|Fake) * P(Fake)) / P(X)
P(Fair|X) = (P(X|Fair) * P(Fair)) / P(X)
Where
P(X) = P(X|Fake) * P(Fake) + P(X|Fair) * P(Fair)
P(X) = (3.43710e-6 * 0.5) + (7.629e-6 * 0.5) = 5.533e-6
And therefore
P(Fake|X) = (3.43710e-6 * 0.5) / 5.533e-6 = 0.3106
P(Fair|X) = (7.629e-6 * 0.5) / 5.533e-6 = 0.6894
Therefore, it is more likely that the coin used is the FAIR one. Although intuitively one might think that the selected coin is the fake one, this is not the case. The observed frequencies are closer to 0.5 heads and 0.5 tails than to 0.25 heads and 0.75 tails. For example, the fraction of tails is 10/17 ≈ 0.59, which is closer to P(T|Fair) = 0.5 than to P(T|Fake) = 0.75.
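The Bayes computation above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the equal priors of 1/2 used in this answer; the variable names are illustrative. It also prints the log2 likelihood ratio, which is where the hint log2(3) = 1.585 from the question comes in.

```python
from math import log2

sequence = "HTTTHHHTTTTHTHHTT"
heads = sequence.count("H")
tails = sequence.count("T")

# Likelihood of the exact sequence under each coin
likelihood_fair = 0.5 ** len(sequence)
likelihood_fake = 0.25 ** heads * 0.75 ** tails

# Equal priors, as assumed in the answer above
prior_fair = prior_fake = 0.5
evidence = likelihood_fair * prior_fair + likelihood_fake * prior_fake

posterior_fair = likelihood_fair * prior_fair / evidence
posterior_fake = likelihood_fake * prior_fake / evidence

# Positive value means the fair coin explains the data better (in bits)
log2_ratio = log2(likelihood_fair) - log2(likelihood_fake)

print(heads, tails)
print(posterior_fair, posterior_fake)
print(log2_ratio)
```

With the log form, log2 P(X|Fair) = -17 and log2 P(X|Fake) = 10*log2(3) - 34, so no small floating-point products are needed to compare the two coins.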

An HMM is a bit of overkill for this example. The number of heads is binomially distributed, with p = 0.5 for the fair coin and p = 0.25 for the other one. For both of them, the number of trials is n = 17 (if my counting is correct). From the 17 samples you got 7 successes (7 heads). Using Wolfram Alpha, the probability of the fair coin generating 7 heads is approx 0.15, as opposed to approx 0.07 for the unfair coin. Note I did not bother calculating the exact numbers, just looked at the plots. The formula is there for you to work with if you want to.
EDIT
If you absolutely must use an HMM, let the set of hidden states be {fair; unfair}. The transition probabilities are: from hidden state "fair" to hidden state "fair" = 1, from "fair" to "unfair" = 0, etc., because the dealer is not allowed to change coins halfway through the trial. The emission probabilities from hidden state "fair" are 0.5 for observable state "heads" and 0.5 for observable state "tails" (0.25 and 0.75 from "unfair"). You can assume that at time t=0 the hidden states "fair" and "unfair" are equally likely.
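The HMM described above can be sketched with the forward algorithm. This is an illustrative implementation, not a reference one: the dictionary layout and the function name `forward` are assumptions made for the example. Because the transition matrix is the identity, the forward variables simply accumulate each coin's likelihood.

```python
def forward(observations, initial, transition, emission):
    """Return the forward variables alpha[state] after the last observation."""
    alpha = {s: initial[s] * emission[s][observations[0]] for s in initial}
    for obs in observations[1:]:
        alpha = {
            s: emission[s][obs] * sum(alpha[r] * transition[r][s] for r in alpha)
            for s in alpha
        }
    return alpha

initial = {"fair": 0.5, "unfair": 0.5}
# The dealer never switches coins, so each state only transitions to itself
transition = {
    "fair": {"fair": 1.0, "unfair": 0.0},
    "unfair": {"fair": 0.0, "unfair": 1.0},
}
emission = {
    "fair": {"H": 0.5, "T": 0.5},
    "unfair": {"H": 0.25, "T": 0.75},
}

alpha = forward("HTTTHHHTTTTHTHHTT", initial, transition, emission)
total = sum(alpha.values())
posterior = {s: a / total for s, a in alpha.items()}
print(posterior)
```

Normalizing the final forward variables gives the posterior over the hidden state, which agrees with the direct Bayes computation (about 0.69 for the fair coin).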

Related

Better than brute force algorithms for a coin-flipping game

I have a problem and I feel like there should be a well-known algorithm for solving it that's better than just brute force, but I can't think of one, so I'm asking here.
The problem is as follows: given n sorted (from low to high) lists containing m probabilities, choose one index for each list such that the sum of the chosen indexes is less than m. Then, for each list, we flip a coin, where the chance of it landing heads is equal to the probability at the chosen index for that list. Maximize the chance of the coin landing heads at least once.
Are there any algorithms for solving this problem that are better than just brute force?
This problem seems most similar to the knapsack problem, except the value of the items in the knapsack isn't merely a sum of the items in the knapsack. (Written in Python, instead of sum(p for p in chosen_probabilities) it's 1 - math.prod([1 - p for p in chosen_probabilities])) Also, there are restrictions on what items you can add given what items are already in the knapsack. For example, if the index = 3 item for a particular list is already in the knapsack, then adding the item with index = 2 for that same list isn't allowed (since you can only pick one index for each list). So there are certain items that can and can't be added to the knapsack based on what items are already in it.
At first I thought linear optimization wouldn't work, because the values in the lists don't increase linearly, the final coin probability isn't linear with respect to the chosen probabilities, and our constraint is on the sum of the indexes, rather than the values in the lists themselves. However, as David has pointed out, linear optimization will work if you use binary variables to pick out the indexes and a logarithm to deal with the non-linearity.
EDIT:
I've found that explaining the motivation behind this problem can be helpful for understanding it. Imagine you have 10 seconds to solve a problem, and three different ways to solve it. You have models of how likely it is that each method will solve the problem, given how many seconds you try that method for, but if you switch methods, you lose all progress on the one you were previously trying. What methods should you try and for how long?

Maximizing 1 - math.prod([1 - p for p in chosen_probabilities]) is equivalent to minimizing math.prod([1 - p for p in chosen_probabilities]), which is equivalent to minimizing the log of this objective, which is a linear function of 0-1 indicator variables, so you could do an integer programming formulation this way.
I can't promise that this will be much better than brute force. The problem is that math.log(1 - p) is well approximated by -p when p is close to zero. My intuition is that for nontrivial instances it will be qualitatively similar to using integer programming to solve subset sum, which doesn't go particularly well.
If you're willing to settle for a bicriteria approximation scheme (get an answer such that the sum of the chosen indexes is less than m, that is at least as good as the best answer summing to less than (1 − ε) m) then you can round up the probability to multiples of ε and use dynamic programming to get an algorithm that runs in time polynomial in n, m, 1/ε.
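As an illustration of the dynamic programming idea, here is a minimal sketch that runs a DP directly over the integer index budget instead of over rounded probabilities. It is not the approximation scheme described above; it assumes the constraint is "sum of chosen indexes <= budget" with 0-indexed lists, and the function name is hypothetical.

```python
def best_success_probability(lists, budget):
    """Maximize 1 - prod(1 - p) picking one index per list, index sum <= budget."""
    # best[b] = smallest achievable product of failure probabilities
    # for the lists processed so far, using a total index of at most b
    best = [1.0] * (budget + 1)
    for row in lists:
        new_best = []
        for b in range(budget + 1):
            candidates = [
                best[b - j] * (1.0 - row[j])
                for j in range(min(b, len(row) - 1) + 1)
            ]
            new_best.append(min(candidates))
        best = new_best
    return 1.0 - best[budget]

# The motivating example: 10 seconds, three methods
m = 11
A_list = [0, 0, 0] + [0.3] * (m - 3)
B_list = [0, 0, 0] + [0.3] * (m - 3)
C_list = [0.3, 0.3, 0.3, 0.3] + [0.4] * (m - 4)
print(best_success_probability([A_list, B_list, C_list], m - 1))
```

The table has (budget + 1) entries per list and each entry looks at up to m candidates, so the running time of this sketch is O(n * m^2).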

Here is working code for David Eisenstat's solution.
To understand the implementation, I think it helps to go through the math first.
As a reminder, there are n lists, each with m options. (In the motivating example at the bottom of the question, each list represents a method for solving the problem, and you are given m-1 seconds to solve the problem. Each list is such that list[index] gives the chance of solving the problem with that method if the method is run for index seconds.)
We let the lists be stored in a matrix called d (named data in the code), where each row in the matrix is a list. (And thus each column represents an index, or, if following the motivating example, an amount of time.)
The probability of at least one coin landing heads, given that we chose index j_i^* for each list i, is computed as
$$P = 1 - \prod_{i=1}^{n} \left(1 - d_{i,j_i^*}\right)$$
We would like to maximize this.
(To explain the stats behind this equation, we're computing 1 minus the probability that no coin lands on heads. The probability that no coin lands on heads is the probability that each flip doesn't land on heads. The probability that a single flip doesn't land on heads is just 1 minus the probability that it does land on heads. And the probability it does land on heads is the number we've chosen, d[i][j*]. Thus, the total probability that all the flips land on tails is just the product of the probability that each one lands on tails. And then the probability that at least one coin lands on heads is just 1 minus the probability that all the flips land on tails.)
Which, as David pointed out, is the same as minimizing:
$$\prod_{i=1}^{n} \left(1 - d_{i,j_i^*}\right)$$
Which is the same as minimizing:
$$\log \prod_{i=1}^{n} \left(1 - d_{i,j_i^*}\right)$$
Which is equivalent to:
$$\sum_{i=1}^{n} \log\left(1 - d_{i,j_i^*}\right)$$
Then, since this is a linear sum, we can turn it into an integer program.
We'll be minimizing:
$$\sum_{i=1}^{n} \sum_{j=0}^{m-1} x_{i,j} \log\left(1 - d_{i,j}\right)$$
This lets the computer choose the indexes by allowing it to create an n by m matrix of 1s and 0s called x where the 1s pick out particular indexes. We'll then define rules so that it doesn't pick out invalid sets of indexes.
The first rule is that you have to pick out an index for each list:
$$\sum_{j=0}^{m-1} x_{i,j} = 1 \quad \text{for each } i$$
The second rule is that you have to respect the constraint on the sum of the chosen indexes (with 0-indexed columns j = 0, ..., m-1, the bound is m - 1, matching the time constraint in the code):
$$\sum_{i=1}^{n} \sum_{j=0}^{m-1} j \, x_{i,j} \le m - 1$$
And that's it! Then we can just tell the computer to minimize that sum according to those rules. It will spit out an x matrix with a single 1 on each row to tell us which index it has picked for the list on that row.
In code (using the motivating example), this is implemented as:
'''
Requirements:
cvxopt==1.2.6
cvxpy==1.1.10
ecos==2.0.7.post1
numpy==1.20.1
osqp==0.6.2.post0
qdldl==0.1.5.post0
scipy==1.6.1
scs==2.1.2
'''
import math
import cvxpy as cp
import numpy as np
# number of methods
n = 3
# if you have 10 seconds, there are 11 options for each method (0 seconds, 1 second, ..., 10 seconds)
m = 11
# method A has 30% chance of working if run for at least 3 seconds
# equivalent to [0, 0, 0, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
A_list = [0, 0, 0] + [0.3] * (m - 3)
# method B has 30% chance of working if run for at least 3 seconds
# equivalent to [0, 0, 0, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
B_list = [0, 0, 0] + [0.3] * (m - 3)
# method C has 40% chance of working if run for 4 seconds, 30% otherwise
# equivalent to [0.3, 0.3, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
C_list = [0.3, 0.3, 0.3, 0.3] + [0.4] * (m - 4)
data = [A_list, B_list, C_list]
# do the logarithm
log_data = []
for row in data:
    log_row = []
    for col in row:
        # deal with domain exception
        if col == 1:
            new_col = float('-inf')
        else:
            new_col = math.log(1 - col)
        log_row.append(new_col)
    log_data.append(log_row)
log_data = np.array(log_data)
x = cp.Variable((n, m), boolean=True)
objective = cp.Minimize(cp.sum(cp.multiply(log_data, x)))
# the current solver doesn't work with equalities, so each equality must be split into two inequalities.
# see https://github.com/cvxgrp/cvxpy/issues/1112
one_choice_per_method_constraint = [cp.sum(x[i]) <= 1 for i in range(n)] + [cp.sum(x[i]) >= 1 for i in range(n)]
# constrain the solution to not use more time than is allowed
# note that the time allowed is (m - 1), not m, because time is 1-indexed and the lists are 0-indexed
js = np.tile(np.array(list(range(m))), (n, 1))
time_constraint = [cp.sum(cp.multiply(js, x)) <= m - 1, cp.sum(cp.multiply(js, x)) >= m - 1]
constraints = one_choice_per_method_constraint + time_constraint
prob = cp.Problem(objective, constraints)
result = prob.solve()
def compute_probability(data, choices):
    # compute 1 - ((1 - p1) * (1 - p2) * ...)
    return 1 - np.prod(np.add(1, -np.multiply(data, choices)))
print("Choices:")
print(x.value)
'''
Choices:
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]
'''
print("Chance of success:")
print(compute_probability(data, x.value))
'''
Chance of success:
0.7060000000000001
'''
And there we have it! The computer has correctly determined that running method A for 3 seconds, method B for 3 seconds, and method C for 4 seconds is optimal. (Remember that the x matrix is 0-indexed, while the times are 1-indexed.)
Thank you, David, for the suggestion!

Q learning - epsilon greedy update

I am trying to understand the epsilon - greedy method in DQN. I am learning from the code available in https://github.com/karpathy/convnetjs/blob/master/build/deepqlearn.js
Following is the update rule for epsilon which changes with age as below:
this.epsilon = Math.min(1.0, Math.max(this.epsilon_min, 1.0-(this.age - this.learning_steps_burnin)/(this.learning_steps_total - this.learning_steps_burnin)));
Does this mean that epsilon starts at epsilon_min (chosen by the user) and then increases with age, passing the burn-in steps and eventually reaching 1? Or does epsilon start around 1 and then decay to epsilon_min?
Either way, then the learning almost stops after this process. So, do we need to choose the learning_steps_burnin and learning_steps_total carefully enough? Any thoughts on what value needs to be chosen?

Since epsilon denotes the amount of randomness in your policy (action is greedy with probability 1-epsilon and random with probability epsilon), you want to start with a fairly randomized policy and later slowly move towards a deterministic policy. Therefore, you usually start with a large epsilon (like 0.9, or 1.0 in your code) and decay it to a small value (like 0.1). Most common and simple approaches are linear decay and exponential decay. Usually, you have an idea of how many learning steps you will perform (what in your code is called learning_steps_total) and tune the decay factor (your learning_steps_burnin) such that in this interval epsilon goes from 0.9 to 0.1.
Your code is an example of linear decay.
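To see the linear decay concretely, the update rule from the question can be ported to Python and evaluated at a few ages. This is a sketch: the function name and the parameter values (epsilon_min = 0.05, 3000 burn-in steps, 100000 total steps) are illustrative assumptions, not values taken from the question.

```python
def linear_epsilon(age, epsilon_min, burnin, total):
    """Python port of the epsilon update rule quoted in the question."""
    return min(1.0, max(epsilon_min, 1.0 - (age - burnin) / (total - burnin)))

# Illustrative parameter values (assumptions for this example)
EPSILON_MIN = 0.05
BURNIN = 3000
TOTAL = 100000

for age in (0, 3000, 51500, 100000, 200000):
    print(age, linear_epsilon(age, EPSILON_MIN, BURNIN, TOTAL))
```

Under this formula epsilon is clamped to 1.0 during the burn-in steps, then falls linearly, and stays at epsilon_min once the age approaches learning_steps_total. So it starts around 1 and decays to epsilon_min, not the other way round.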
An example of exponential decay is
epsilon = 0.9
decay = 0.9999
min_epsilon = 0.1
n = 100000  # number of learning steps (illustrative value)
for i in range(n):
    epsilon = max(min_epsilon, epsilon * decay)

Personally I recommend an epsilon decay such that you reach the minimum value of epsilon (I suggest between 0.05 and 0.0025) after about 50 to 75% of the training; from then on, you only have the improvement of the policy itself.
I wrote a small script in which you set the various parameters, and it reports at which point of the training the decay stops (i.e. when the minimum value is reached).
import matplotlib.pyplot as plt
import numpy as np
eps_start = 1.0
eps_min = 0.05
eps_decay = 0.9994
epochs = 10000
stop = epochs  # fallback if eps_min is never reached within the training
df = np.zeros(epochs)
for i in range(epochs):
    if i == 0:
        df[i] = eps_start
    else:
        df[i] = df[i-1] * eps_decay
    if df[i] <= eps_min:
        print(i)
        stop = i
        break
print("With this parameter you will stop epsilon decay after {}% of training".format(stop/epochs*100))
plt.plot(df)
plt.show()

genetic algorithm handling negative fitness values

I am trying to implement genetic algorithm for maximizing a function of n variables. However the problem is that the fitness values can be negative and I am not sure about how to handle negative values while doing selection. I read this article Linear fitness scaling in Genetic Algorithm produces negative fitness values
but it's not clear to me how the negative fitness values were taken care of and how scaling factors a and b were calculated.
Also, from the article I know that roulette wheel selection only works for positive fitness value. Is it the same for tournament selection as well ?

When you have negative values, you could try to find the smallest fitness value in your population and add its opposite to every value. This way you will no longer have negative values, while the differences between fitness values will remain the same.

Tournament selection is not affected by this problem. It simply compares the fitness values of a uniformly sampled subset of size n of the population and takes the one with the best value. Still of course this means that, if you sample without repetition then the worst n-1 individuals will never get selected. If you sample with repetition they have a chance of being selected.
As for proportional selection: it doesn't work with negative fitness values. You first have to apply "windowing" or "scaling" to your fitness values, after which it works again.
I once programmed some sampling methods as extension methods for C#'s IEnumerable among them is a SampleProportional and SampleProportionalWithoutRepetition extension method. They're part of HeuristicLab under GPL license.
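Tournament selection as described in this answer can be sketched in a few lines of Python. This is a minimal illustration, assuming maximization and sampling without repetition; the function name and the example fitness values are hypothetical.

```python
import random

def tournament_select(fitness, tournament_size, rng):
    """Return the index of the best individual in a random tournament."""
    # Sample distinct contenders (without repetition)
    contenders = rng.sample(range(len(fitness)), tournament_size)
    # Only the ordering of fitness values matters, so negatives are fine
    return max(contenders, key=lambda i: fitness[i])

rng = random.Random(0)
fitness = [-12, -45, 0, 72.1, -32.3]
winners = [tournament_select(fitness, 2, rng) for _ in range(1000)]
print(sorted(set(winners)))
```

With a tournament size of 2 and no repetition, the worst individual (index 1 here) can never win, which matches the remark about the worst n-1 individuals above.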

Okay, it's late to answer, but still someone could google it.
First of all: yes, you can use negative fitness. But I strongly suggest you not do it, because I did and ran into a lot of problems (it is doable, but really not recommended). Here's the explanation:
Say you have a population of N creatures. After the simulation they all have fitness values f(n), where n is the creature number. You then want to build a probability distribution to determine which creatures should be killed (of course you could simply delete, say, the worst 40% of creatures, but using a distribution is better). How do you build such a distribution? Say f(a) = 50 and f(b) = 100, so creature b is 2 times better than creature a, and you probably want the survival probability of creature b to be 2 times higher than that of creature a (this makes great sense if your fitness value is linear). In case you wonder how to do it:
Let sum( f(n) ) be the sum of all fitness values. Then the survival probability p(a) of creature a is:
p(a) = f(a) / sum( f(n) )
This will do the trick.
But now let's allow negative fitness. Say f(a) = 50, f(b) = 100, f(c) = -1000. b is again 2 times better than a, which makes sense, but is it -10 times better than c? That doesn't make sense. The answer above suggested adding the opposite of the worst fitness value, which can kind of "fix" the situation, but really it doesn't (I made the same mistake before). Okay, let's add 1000 to all fitness values:
f(a) = 1050, f(b) = 1100, f(c) = 0, so the survival probability of c is now zero; okay, we can live with that. But let's compare a and b now:
b is now only about 1.05 times better than a, which means the fitness of a and b is almost the same. That is totally unacceptable, because b clearly was 2 times better than a (assuming, of course, that fitness is linear, but this will mess up nonlinear fitness as well)! You can't escape this problem; it will constantly get in your way, because a probability can't be negative. So you can either remove the probability from the evolution (which is not a very good thing to do) or use some exceptions and tricks.
Since it was too late in my scenario to remove negative fitness, here's my way in order to fix things up:
Once again, you have a population of N creatures. Say neg(N) gives you all negative-fitness creatures and pos(N) all positive-fitness creatures (whether you count zero as negative or positive is your call, it doesn't matter here). And let's say you need D creatures to die. Now here's the trick:
the higher f(c) is (c is a pos creature), the better the creature, so we can use its fitness to determine the probability of survival. But the lower (more negative) f(m) is (m is a neg creature), the worse the creature, so we can use its fitness to determine the probability of dying.
Now, if D > |neg(N)| (the number of negative-fitness creatures), then all of neg(N) die, and D - |neg(N)| creatures from pos(N) die according to a probability distribution based on the positive creatures' fitness (probability of survival p(a) = f(a) / sum( pos(n) )). But if D < |neg(N)|, then all of pos(N) survive, and D creatures from neg(N) die according to a probability distribution based on the negative creatures' fitness (probability of dying p(a) = f(a) / sum( neg(n) ); f(a) is negative, but sum( neg(n) ) is negative as well, so the probability is positive).

I know this question has been here for a long time, but in case newcomers want a way to deal with negative values when the problem is a minimization problem, here is the code for it.
from numpy import min, sum, ptp, array
from numpy.random import uniform
list_fitness1 = array([-12, -45, 0, 72.1, -32.3])
list_fitness2 = array([0.5, 6.32, 988.2, 1.23])
def get_index_roulette_wheel_selection(list_fitness=None):
    """ It can handle negative also. Make sure your list fitness is 1D-numpy array"""
    scaled_fitness = (list_fitness - min(list_fitness)) / ptp(list_fitness)
    minimized_fitness = 1.0 - scaled_fitness
    total_sum = sum(minimized_fitness)
    r = uniform(low=0, high=total_sum)
    for idx, f in enumerate(minimized_fitness):
        r = r + f
        if r > total_sum:
            return idx
get_index_roulette_wheel_selection(list_fitness1)
get_index_roulette_wheel_selection(list_fitness2)
Make sure your fitness list is a 1D numpy array.
Scale the fitness list to the range [0, 1].
Transform the maximization problem into a minimization problem with 1.0 - scaled_fitness.
Draw a random number between 0 and sum(minimized_fitness).
Keep adding elements of the minimized fitness list until the value becomes greater than the total sum.
You can see that if a fitness is small, it has a bigger value in minimized_fitness, so it has a bigger chance of pushing the running value over the total sum and being selected.

I think that the main issue people are running into here is that they're treating the fitness score improperly. Let's think about an example fitness score as the temperature inside of a truck shipping frozen goods. The truck's internal temperature should be -2 C... but that's also 28.4 F. They are the same exact fitness relative to the food staying frozen, but 2 * -2 = -4, and 2 * 28.4 is 56.8. "Two times colder" doesn't really make any sense here (-4 C != 14.2 F either). Same with fitness scores.
In the case of -1000 in Volot's example, the difference between 50 and 100 is actually comparatively low: the important thing is that you'd pick either / both of those over the -1000, which you will definitely do if you just subtract -1000 from everything. Then the next generation of children may have fitness scores of 50, 100, 200, and 10, let's say. Now the difference between 50 and 100 is much more pronounced, and 50 will have a much lower chance of getting picked. Remember genetic algorithms are iterative. It also reminds me of a saying: You don’t have to run faster than the bear to get away. You just have to run faster than the guy next to you. 50 just needs to outrun -1000 to survive to reproduce.
The problem of subtracting the min resulting in 0 can also be avoided. When estimating probability distributions, people will add 1 occurrence to every (known) possible outcome so that extremely rare events are still captured. That gets somewhat trickier with fitness scores. You can't just add 1. What if your fitness scores are 0.01, 0.02, and -0.01? Subtracting the min and adding 1 gives 1.02, 1.03, and 1.00, which will result in picking a low relative fitness a lot. You can instead add the lowest non-zero shifted value to everything, resulting in 0.04, 0.05, and 0.02. For the -1000 case (50, 100, -1000), it results in 2100, 2150, and 1050 (so everything that used to be 0 will always be half as likely to get picked as the next lowest fitness).
Still, to make things as consistent as possible with what is a more typical GA sampling method, I would only subtract the min and add back in a small amount of fitness when there are negative values. When everything is positive, there's no reason to do it.
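The shifting scheme described in this answer can be sketched as follows. This is one possible reading of it, not a definitive implementation: the function name is hypothetical, and the fallback offset of 1.0 when all values are equal is an assumption added so the function never returns all zeros.

```python
def shift_fitness(fitness):
    """Make fitness values usable for proportional selection.

    Only shifts when a negative value is present: subtract the minimum,
    then add back the lowest non-zero shifted value so nothing ends at 0.
    """
    lowest = min(fitness)
    if lowest >= 0:
        return list(fitness)  # everything is non-negative already
    shifted = [f - lowest for f in fitness]
    positive = [s for s in shifted if s > 0]
    # Assumption: if all values were equal, fall back to an offset of 1.0
    offset = min(positive) if positive else 1.0
    return [s + offset for s in shifted]

print(shift_fitness([50, 100, -1000]))
print(shift_fitness([0.01, 0.02, -0.01]))
```

The resulting values can then be divided by their sum to obtain selection probabilities.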

Algorithm for Finding a possible profit

I'm searching for an algorithm (apart from the naive brute-force solution, I've had no luck) that efficiently (preferably in O(n^2)) does the following:
Supposing I’m playing a game and in this game I’ll have to answer n questions (each question from a different category). For each category “i” i=1,...,n I’ve calculated the probability p_i to give a correct answer.
For each consecutive k correct answers I’m getting k^4 points. What is the expected average profit?
I will clarify what I mean by expected profit in the following example:
In the case n=3 and p_1=0.2,p_2=0.3,p_3=0.4
The expected profit is
EP = (0.2*0.3*0.4)*3^4 (I get all 3 answers correct)
+ (0.2*0.3*0.6)*2^4 + (0.8*0.3*0.4)*2^4 + (0.2*0.7*0.4)*2 (2 answers correct)
+ (0.2*0.7*0.6) + (0.8*0.3*0.6) + (0.8*0.7*0.4) (1 answer correct)
Clearly, for each possible outcome I'm calculating the probability and multiplying it by the points gained, and then taking the sum of all those.
Any ideas?
I'm only interested in the sum itself.
Thank you!

Let A[t] be the expected profit after t questions given that either t = 0, t = n, or the t'th question was answered wrong. Then you can compute
A[0] = 0
A[t] = sum(i = 0..t-1) (probability of getting questions i .. t-2 right and t-1 wrong) * ((t-i-1)^4 + A[i]) when 0 < t < n.
A[n] is computed similarly to the general case above, except you should also add a term for when all questions after the ith are answered correctly.
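One way to sketch this recurrence in Python is shown below. It is an interpretation rather than the answer's exact formulation: A[t] is taken as the expected profit over the first t questions conditional on question t-1 (0-indexed) being wrong, the weight of each term is the probability that the previous wrong answer was question i-1 and questions i .. t-2 were right, and the final case is handled by appending a virtual question that is always wrong. The function name is hypothetical.

```python
def expected_profit(p):
    """Expected profit where each run of k consecutive correct answers pays k**4."""
    # Append a virtual question that is always wrong, so the last run is closed
    probs = list(p) + [0.0]
    m = len(probs)
    A = [0.0] * (m + 1)  # A[0] = 0
    for t in range(1, m + 1):
        total = 0.0
        run = 1.0  # probability that questions i .. t-2 are all right
        for i in range(t - 1, -1, -1):
            # Question i-1 must be wrong for the run to start at i (or i == 0)
            prev_wrong = 1.0 if i == 0 else 1.0 - probs[i - 1]
            total += prev_wrong * run * ((t - 1 - i) ** 4 + A[i])
            if i > 0:
                run *= probs[i - 1]
        A[t] = total
    return A[m]

print(expected_profit([0.2, 0.3, 0.4]))
```

The two nested loops give O(n^2) time, and for the example in the question the result agrees with the sum written out by hand (about 4.62).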

How can I efficiently calculate the binomial cumulative distribution function?

Let's say that I know the probability of a "success" is P. I run the test N times, and I see S successes. The test is akin to tossing an unevenly weighted coin (perhaps heads is a success, tails is a failure).
I want to know the approximate probability of seeing either S successes, or a number of successes less likely than S successes.
So for example, if P is 0.3, N is 100, and I get 20 successes, I'm looking for the probability of getting 20 or fewer successes.
If, on the other hand, P is 0.3, N is 100, and I get 40 successes, I'm looking for the probability of getting 40 or more successes.
I'm aware that this problem relates to finding the area under a binomial curve, however:
My math-fu is not up to the task of translating this knowledge into efficient code
While I understand a binomial curve would give an exact result, I get the impression that it would be inherently inefficient. A fast method to calculate an approximate result would suffice.
I should stress that this computation has to be fast, and should ideally be determinable with standard 64 or 128 bit floating point computation.
I'm looking for a function that takes P, S, and N - and returns a probability. As I'm more familiar with code than mathematical notation, I'd prefer that any answers employ pseudo-code or code.

Exact Binomial Distribution
from functools import reduce
def factorial(n):
    if n < 2: return 1
    return reduce(lambda x, y: x*y, range(2, int(n)+1))
def prob(s, p, n):
    x = 1.0 - p
    a = n - s
    b = s + 1
    c = a + b - 1
    prob = 0.0
    for j in range(a, c + 1):
        # integer division keeps the binomial coefficient exact
        prob += factorial(c) // (factorial(j)*factorial(c-j)) \
                * x**j * (1 - x)**(c-j)
    return prob
>>> prob(20, 0.3, 100)
0.016462853241869437
>>> 1-prob(40-1, 0.3, 100)
0.020988576003924564
Normal Estimate, good for large n
import math
def erf(z):
    t = 1.0 / (1.0 + 0.5 * abs(z))
    # use Horner's method
    ans = 1 - t * math.exp( -z*z - 1.26551223 +
                            t * ( 1.00002368 +
                            t * ( 0.37409196 +
                            t * ( 0.09678418 +
                            t * (-0.18628806 +
                            t * ( 0.27886807 +
                            t * (-1.13520398 +
                            t * ( 1.48851587 +
                            t * (-0.82215223 +
                            t * ( 0.17087277))))))))))
    if z >= 0.0:
        return ans
    else:
        return -ans
def normal_estimate(s, p, n):
    u = n * p
    o = (u * (1-p)) ** 0.5
    return 0.5 * (1 + erf((s-u)/(o*2**0.5)))
>>> normal_estimate(20, 0.3, 100)
0.014548164531920815
>>> 1-normal_estimate(40-1, 0.3, 100)
0.024767304545069813
Poisson Estimate: Good for large n and small p
import math
def poisson(s,p,n):
    L = n*p
    sum = 0
    for i in range(0, s+1):
        sum += L**i/factorial(i)
    return sum*math.e**(-L)
>>> poisson(20, 0.3, 100)
0.013411150012837811
>>> 1-poisson(40-1, 0.3, 100)
0.046253037645840323

I was on a project where we needed to be able to calculate the binomial CDF in an environment that didn't have a factorial or gamma function defined. It took me a few weeks, but I ended up coming up with the following algorithm which calculates the CDF exactly (i.e. no approximation necessary). Python is basically as good as pseudocode, right?
import numpy as np
def binomial_cdf(x,n,p):
    cdf = 0
    b = 0
    for k in range(x+1):
        if k > 0:
            b += np.log(n-k+1) - np.log(k)
        log_pmf_k = b + k * np.log(p) + (n-k) * np.log(1-p)
        cdf += np.exp(log_pmf_k)
    return cdf
Performance scales with x. For small values of x, this solution is about an order of magnitude faster than scipy.stats.binom.cdf, with similar performance at around x=10,000.
I won't go into a full derivation of this algorithm because stackoverflow doesn't support MathJax, but the thrust of it is first identifying the following equivalence:
For all k > 0, sp.misc.comb(n,k) == np.prod([(n-k+1)/k for k in range(1,k+1)])
Which we can rewrite as:
sp.misc.comb(n,k) == sp.misc.comb(n,k-1) * (n-k+1)/k
or in log space:
np.log( sp.misc.comb(n,k) ) == np.log(sp.misc.comb(n,k-1)) + np.log(n-k+1) - np.log(k)
Because the CDF is a summation of PMFs, we can use this formulation to calculate the binomial coefficient (the log of which is b in the function above) for PMF_{x=i} from the coefficient we calculated for PMF_{x=i-1}. This means we can do everything inside a single loop using accumulators, and we don't need to calculate any factorials!
The reason most of the calculations are done in log space is to improve numerical stability: the binomial coefficient has the potential to be extremely large, while p^x and (1-p)^(n-x) have the potential to be extremely small, which can cause computational errors.
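The recurrence can be checked against a direct summation using only the standard library. This is a small sketch under the assumption 0 < p < 1 (so the logarithms are defined); the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_cdf_log(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), accumulating log C(n, k) incrementally."""
    cdf = 0.0
    log_coeff = 0.0  # log C(n, 0) = 0
    for k in range(x + 1):
        if k > 0:
            # C(n, k) = C(n, k-1) * (n - k + 1) / k, in log space
            log_coeff += math.log(n - k + 1) - math.log(k)
        cdf += math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1 - p))
    return cdf

def binomial_cdf_direct(x, n, p):
    """Reference implementation that uses exact binomial coefficients."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

print(binomial_cdf_log(20, 100, 0.3))
print(binomial_cdf_direct(20, 100, 0.3))
```

Both functions should agree to within floating-point rounding; the log-space version avoids forming huge coefficients explicitly.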
EDIT: Is this a novel algorithm? I've been poking around on and off since before I posted this, and I'm increasingly wondering if I should write this up more formally and submit it to a journal.

I think you want to evaluate the incomplete beta function.
There's a nice implementation using a continued fraction representation in "Numerical Recipes In C", chapter 6: 'Special Functions'.

I can't totally vouch for the efficiency, but Scipy has a module for this
from scipy.stats.distributions import binom
binom.cdf(successes, attempts, chance_of_success_per_attempt)

An efficient and, more importantly, numerically stable algorithm exists in the domain of Bezier curves used in computer-aided design. It is called de Casteljau's algorithm, and it is used to evaluate the Bernstein polynomials that define Bezier curves.
I believe that I am only allowed one link per answer so start with Wikipedia - Bernstein Polynomials
Notice the very close relationship between the Binomial Distribution and the Bernstein Polynomials. Then click through to the link on de Casteljau's algorithm.
Let's say I know the probability of throwing a heads with a particular coin is P. What is the probability of me throwing the coin T times and getting at least S heads?
Set n = T
Set beta[i] = 0 for i = 0, ... S - 1
Set beta[i] = 1 for i = S, ... T
Set t = p
Evaluate B(t) using de Casteljau
or at most S heads?
Set n = T
Set beta[i] = 1 for i = 0, ... S
Set beta[i] = 0 for i = S + 1, ... T
Set t = p
Evaluate B(t) using de Casteljau
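The two recipes above can be sketched in Python as follows. This is a minimal illustration of de Casteljau's algorithm applied to 0/1 Bernstein coefficients; the function names are hypothetical, and the evaluation costs O(T^2) operations.

```python
def bernstein_de_casteljau(beta, t):
    """Evaluate sum_i beta[i] * C(n, i) * t**i * (1 - t)**(n - i) with n = len(beta) - 1."""
    values = list(beta)
    n = len(values) - 1
    for level in range(n):
        # Each pass replaces neighbouring pairs by their convex combination
        values = [(1 - t) * values[i] + t * values[i + 1] for i in range(n - level)]
    return values[0]

def prob_at_least(s, trials, p):
    """P(heads >= s): beta[i] = 0 for i < s and 1 for i >= s."""
    beta = [0.0] * s + [1.0] * (trials - s + 1)
    return bernstein_de_casteljau(beta, p)

def prob_at_most(s, trials, p):
    """P(heads <= s): beta[i] = 1 for i <= s and 0 for i > s."""
    beta = [1.0] * (s + 1) + [0.0] * (trials - s)
    return bernstein_de_casteljau(beta, p)

print(prob_at_most(20, 100, 0.3))
print(prob_at_least(40, 100, 0.3))
```

Because every step is a convex combination of values in [0, 1], no large binomial coefficients or tiny powers are ever formed, which is where the numerical stability comes from.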
Open source code probably exists already. NURBS Curves (Non-Uniform Rational B-spline Curves) are a generalization of Bezier Curves and are widely used in CAD. Try openNurbs (the license is very liberal) or failing that Open CASCADE (a somewhat less liberal and opaque license). Both toolkits are in C++, though, IIRC, .NET bindings exist.

If you are using Python, there is no need to code it yourself. SciPy has you covered:
from scipy.stats import binom
# probability that you get 20 or less successes out of 100, when p=0.3
binom.cdf(20, 100, 0.3)
>>> 0.016462853241869434
# probability that you get exactly 20 successes out of 100, when p=0.3
binom.pmf(20, 100, 0.3)
>>> 0.0075756449257260777

From the portion of your question "getting at least S heads", you want the cumulative binomial distribution function. See http://en.wikipedia.org/wiki/Binomial_distribution for the equation, which is described as being in terms of the "regularized incomplete beta function" (as already answered). If you just want to calculate the answer without having to implement the entire solution yourself, the GNU Scientific Library provides the functions gsl_cdf_binomial_P and gsl_cdf_binomial_Q.

The DCDFLIB Project has C# functions (wrappers around C code) to evaluate many CDF functions, including the binomial distribution. You can find the original C and FORTRAN code here. This code is well tested and accurate.
If you want to write your own code to avoid being dependent on an external library, you could use the normal approximation to the binomial mentioned in other answers. Here are some notes on how good the approximation is under various circumstances. If you go that route and need code to compute the normal CDF, here's Python code for doing that. It's only about a dozen lines of code and could easily be ported to any other language. But if you want high accuracy and efficient code, you're better off using third party code like DCDFLIB. Several man-years went into producing that library.

Try this one, used in GMP. Another reference is this.

import numpy as np
np.random.seed(1)
x = np.random.binomial(20, 0.6, 10000)  # 20 coin flips, probability of heads 0.6, repeated 10000 times
sum(x > 12) / len(x)
The output is about 0.41, i.e. in roughly 41% of the runs we got more than 12 heads.
