Design an experiment to find the unfair coin from 100 coins - probability

Suppose there are 100 coins, exactly one of which is unfair. The unfair coin comes up heads with probability less than 0.5. How do I design an experiment to find it?
I think one way is to flip each coin a large number of times (for example, 10000 times) and pick the coin with the fewest heads as the unfair one. But is there a smarter way to do this?

That would probably work, but it would be nice to have some measure of how confident you are that you have selected the right coin. If the coin is only very slightly unfair, you might need a very large number of trials to say confidently which coin is biased. It doesn't sound like the degree of bias was given to you, so there is no particular reason to suspect that 10000 trials is enough: many more (or many fewer) trials might be necessary. For example, if the biased coin has P(h) = 0.49999 (with P(h) = 0.5 for all other coins) and the coin with the fewest heads after 10000 trials produced only 1 fewer head than the runner-up, then you would probably want to factor this information in and keep going until you had mathematical reason to be confident you had found the right coin.
To achieve this you could take a Bayesian approach and do trials in batches of 100 or 1000, updating after each batch a posterior probability that each coin is fair according to Bayes' rule (giving all coins a prior probability of fairness of 99/100). You could continue this until one coin has a posterior probability of fairness of < 0.05, perhaps taking the additional step of verifying that every other coin has a posterior probability of >= .95. This way your experiment is agnostic about the number of trials necessary to find the biased coin and just proceeds until one is clearly distinguishable from the rest.
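The batch-and-update procedure described above could be sketched in Python roughly as follows. This is only a sketch under assumptions that are not in the original answer: the unknown bias of the unfair coin is given a uniform prior on (0, 0.5), that prior is integrated numerically on a fixed grid, and the names `log_marginal_unfair`, `posterior_unfair` and `find_unfair` are made up for illustration. Because exactly one coin is unfair, the posterior probability that coin i is the unfair one is proportional to that coin's Bayes factor (unfair versus fair).

```python
import math

def log_marginal_unfair(heads, tails, grid=200):
    """Log of p^heads * (1-p)^tails averaged over p uniform on (0, 0.5).
    The uniform prior on the bias and the grid size are assumptions."""
    logs = []
    for i in range(grid):
        p = 0.5 * (i + 0.5) / grid            # midpoint of the i-th grid cell
        logs.append(heads * math.log(p) + tails * math.log(1.0 - p))
    top = max(logs)                            # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(x - top) for x in logs) / grid)

def posterior_unfair(heads, flips):
    """P(coin i is the unfair one | data), assuming exactly one unfair coin
    and a prior of 1/len(heads) for each coin."""
    log_bf = [log_marginal_unfair(h, n - h) - n * math.log(0.5)
              for h, n in zip(heads, flips)]
    top = max(log_bf)
    weights = [math.exp(x - top) for x in log_bf]
    total = sum(weights)
    return [w / total for w in weights]

def find_unfair(flip, n_coins, batch=100, threshold=0.95, max_batches=1000):
    """flip(i) returns True for heads. Flip every coin in batches until one
    coin's posterior probability of being unfair reaches the threshold."""
    heads = [0] * n_coins
    flips = [0] * n_coins
    for _ in range(max_batches):
        for i in range(n_coins):
            heads[i] += sum(1 for _ in range(batch) if flip(i))
            flips[i] += batch
        post = posterior_unfair(heads, flips)
        best = max(range(n_coins), key=post.__getitem__)
        if post[best] >= threshold:
            return best, post[best], flips[best]
    return None  # undecided within the flip budget
```

A posterior of at least 0.95 that a coin is unfair corresponds to a posterior probability of fairness below 0.05, the stopping rule suggested above. The fixed grid becomes too coarse once the number of flips per coin is very large, so a real implementation would probably want an incomplete beta function instead.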

Related

How to efficiently sample the continuous negative binomial distribution?

First, for context, I am working on a game where you earn positive credits when you do something good and negative credits when you do something bad. Each credit corresponds to flipping a biased coin: if you get heads then something happens (good if it's a positive credit, bad if it's a negative credit), and otherwise nothing happens.
The deal is that I want to handle the case of multiple credits and fractional credits, and I would like to have flips use up credits so that if something good/bad happens then the leftover credits carry over. A straightforward way of doing this is to just perform a bunch of trials, and in particular for the case of fractional credits we can multiply the number of credits by X and the likelihood of something happening by 1/X (the distribution has the same expectation but slightly different weights); unfortunately, this places a practical limit on how many credits the user can get and also how many decimal places can be in the number of credits since this results in an unbounded amount of work.
What I would like to do is to take advantage of the fact that I am sampling the continuous negative binomial distribution, which is the distribution of how many trials it takes to get heads, i.e. so that if f(X) is the distribution then f(X) gives the probability that there will be X tails before we run into a heads, where X need not be an integer. If I can sample this distribution, then what I can do is that if X is the number of tails then I can see if X is greater or less than the number of credits; if it is greater than then we use up all of the credits but nothing happens, and if it is less than or equal to then something good happens and we subtract X from the number of credits. Furthermore, because the distribution is continuous I can easily handle fractional credits.
Does anyone know of a way for me to be able to efficiently sample the continuous negative binomial distribution (that is, a function that generates random numbers from this distribution)?
This question may be better answered on StatsExchange, but here I will take a stab at it.
You are correct that trying to compute this directly will be computationally expensive as you cannot avoid the beta and/or gamma function dependencies. The only statistically valid approximation I'm aware of is if the number of successes s required is large, and p is neither very small nor very large, then you can approximate it with a normal distribution with special values for the mean and variance. You can read more here but I'm guessing this approximation will not be generally applicable for you.
The negative binomial distribution can also be approximated as a mixture of Poisson distributions, but this doesn't save you from the gamma function dependency.
The only efficient class of negative binomial samplers that I'm aware of use optimized accept-reject techniques. Pages 10-11 of this PDF here describe the concept behind the method. Page 6 (page 295 internally) of this PDF here contains source code for sampling binomial deviates using related techniques. Note that even these methods still require random uniform deviates as well as sqrt(), log(), and gammln() calls. For small numbers of trials (less than 100 maybe?) I wouldn't be surprised at all if just simulating the trials with fast random number generator is faster than even the accept-reject techniques. Definitely start by getting a fast PRNG; they are not all created equal.
Edit:
The following pseudo-code would probably be fairly efficient to draw a random discrete negative binomial-distributed value as long as p is not very large (too close to 1.0). It will return the number of trials required before reaching your first "desired" outcome (which is actually the first "failure" in terms of the distribution):
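// Assumes p and r are the parameters of the negative binomial distribution:
//   r = number of failures (you'll set it to one for your purpose)
//   p = probability of a "success"
// _rnd is assumed to be a java.util.Random instance.
int sampleNegativeBinomial(double p, int r) {
    double rnd = _rnd.nextDouble();       // uniform in [0.0, 1.0)
    int k = 0;                            // # of successes that occur before the r-th failure
    double lastPmf = Math.pow(1 - p, r);  // P(K = 0); note that ^ is XOR in Java, not a power
    double cdf = lastPmf;
    while (cdf < rnd) {
        // recurrence: P(K = k+1) = P(K = k) * p * (k + r) / (k + 1)
        lastPmf *= p * (k + r) / (k + 1);
        cdf += lastPmf;
        k++;
    }
    return k;  // or return (k + 1) to also count the trial on which the failure occurred
}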
Using the recurrence relationship saves over repeating the factorial independently at each step. I think using this, combined with limiting your fractional precision to 1 or 2 decimal places (so you only need to multiply by 10 or 100 respectively) might work for your purposes. You are drawing only one random number and the rest is just multiplications--it should be quite fast.
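For comparison, here is a Python sketch of the same CDF walk, together with the closed-form inversion that is available in the r = 1 (geometric) case the question cares about. The function names are made up for illustration, the uniform draw `u` is passed in so the behaviour is reproducible, and the `pmf > 0.0` guard is an addition that is not in the pseudo-code above: it stops the loop if rounding keeps the accumulated CDF just below `u`.

```python
import math

def sample_neg_binomial(p, r, u):
    """Number of successes before the r-th failure, found by walking the CDF
    until it reaches the uniform draw u in [0, 1)."""
    pmf = (1.0 - p) ** r          # P(K = 0)
    cdf = pmf
    k = 0
    while cdf < u and pmf > 0.0:  # second condition guards against rounding stalls
        pmf *= p * (k + r) / (k + 1)
        cdf += pmf
        k += 1
    return k

def sample_geometric_closed_form(p, u):
    """r = 1 special case: P(K <= k) = 1 - p**(k + 1), so the CDF can be
    inverted directly instead of being walked term by term."""
    return max(0, math.ceil(math.log(1.0 - u) / math.log(p)) - 1)
```

In use, `u` would come from something like `random.random()`. For r = 1 the closed form costs two logarithms regardless of how large the result is, which may matter when p is close to 1 and the loop would otherwise run for a long time.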

Most optimal match-up

Let's assume you're a baseball manager. You have N pitchers in your bullpen (N<=14) and they have to face M batters (M<=100). You also know the strength of each pitcher and each batter. For those who are not familiar with baseball: once you bring in a relief pitcher he can pitch to k consecutive batters, but once he's taken out of the game he cannot come back.
For each pitcher, the probability that he loses his match-ups is given by (sum of the strengths of all batters he faces)/(his strength). Try to minimize these probabilities, i.e. try to maximize your chances of winning the game.
For example, we have 3 pitchers and they have to face 3 batters. The batters' strengths are:
10 40 30
While the strength of your pitchers is:
40 30 3
The most optimal solution would be to bring the strongest pitcher to face the first 2 batters and the second to face the third batter. Then the probability of every pitcher losing his game will be:
50/40 = 1.25 and 30/30 = 1
So the probability of losing the game would be 1.25 (this number can be bigger than 1, i.e. 100%).
How can you find the optimal number? I was thinking of taking a greedy approach, but I doubt that it always gives the optimum. Also, the fact that a pitcher can face an unlimited number of batters (limited only by M) is the main difficulty for me.
Probabilities must be in the range [0.0, 1.0] so what you call a probability can't be a probability. I'm just going to call it a score and minimize it.
I'm going to assume for now that you somehow know the order in which the pitchers should play.
Given the order, what is left to decide is how long each pitcher plays. I think you can find this out using dynamic programming. Consider the batters to be faced in order. Build an NxM table best[pitchers, batter] where best[i, j] is the best score you can make considering just the first j batters using the first i pitchers, or HUGE if it does not make sense.
best[1, j] is just the score for the first pitcher against the first j batters, i.e. (sum of the first j batter strengths)/(his strength).
For larger values of i you work out best[i, j] by considering when the last change of pitcher could have happened, trying all possibilities t = 1, 2, 3, ..., j-1. If the i-th pitcher came in after batter t, look up best[i-1, t] to get the score up to just before that change, and then calculate the a/b value for the sum of batter strengths from batter t+1 to batter j against the i-th pitcher's strength. The score for that choice of t is the larger of the two numbers. When you have considered all possible times, take the best (smallest) score and use it as the value for best[i, j]. Note down enough info (such as the time of the last pitcher change that turned out to be best) so that once you have calculated best[N, M], you can trace back to find the best schedule.
You don't actually know the order, and because the final score is the maximum of the a/b value for each pitcher, the order does matter. However, given a separation of players into groups, the best way to assign pitchers to groups is to assign the best pitcher to the group with the highest total score, the next best pitcher to the group with the next best total score, and so on. So you could alternate between dividing batters into groups, as described above, and then assigning pitchers to groups to work out the order the pitchers really should be in - keep doing this until the answer stops changing and hope the result is a global optimum. Unfortunately there is no guarantee of this.
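A minimal Python sketch of the fixed-order dynamic program, plus a brute force over pitcher orders for small examples, might look like this. The function names are hypothetical, and the sketch assumes that a pitcher may be skipped entirely (as pitcher 3 is in the question's example) and that all strengths are positive. It keeps only one row of the table at a time, so it returns the score but not the schedule.

```python
import math
from itertools import permutations

def best_score_for_order(batters, pitchers):
    """Minimise the worst (batter strength sum)/(pitcher strength) when the
    pitchers must appear in the given order. A pitcher may face nobody."""
    m = len(batters)
    prefix = [0]
    for b in batters:
        prefix.append(prefix[-1] + b)
    # best[j] = best score covering the first j batters with the pitchers seen so far
    best = [0.0] + [math.inf] * m
    for strength in pitchers:
        new = best[:]                      # option: this pitcher faces nobody
        for j in range(1, m + 1):
            for t in range(j):             # this pitcher faces batters t+1..j
                score = max(best[t], (prefix[j] - prefix[t]) / strength)
                if score < new[j]:
                    new[j] = score
        best = new
    return best[m]

def best_score_brute_force(batters, pitchers):
    """Try every pitcher order; only feasible for a handful of pitchers."""
    return min(best_score_for_order(batters, order)
               for order in permutations(pitchers))
```

On the question's example, `best_score_for_order([10, 40, 30], [40, 30, 3])` gives 1.25, matching the hand-computed answer. The brute force is factorial in the number of pitchers, so it is only useful for checking small cases, not for N = 14.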
I'm not convinced that your score is a good model for baseball, especially since it started out as a probability but can't be. Perhaps you should work out a few examples (maybe even solving small examples by brute force) and see if the results look reasonable.
Another way to approach this problem is via http://en.wikipedia.org/wiki/Branch_and_bound.
With branch and bound you need some way to describe partial answers, and you need some way to work out a value V for a given partial answer, such that no way of extending that partial answer can possibly produce a better answer than V. Then you run a tree search, extending partial answers in every possible way, but discarding partial answers which can't possibly be any better than the best answer found so far. It is good if you can start off with at least a guess at the best answer, because then you can discard poor partial answers from the start. My other answer might provide a way of getting this.
Here a partial answer is a selection of pitchers, in the order they should play, together with the number of batters they should pitch to. The first partial answer would have 0 pitchers, and you could extend this by choosing each possible pitcher, pitching to each possible number of batters, giving a list of partial answers each mentioning just one pitcher, most of which you could hopefully discard.
Given a partial answer, you can compute the (total batter strength)/(Pitcher strength) for each pitcher in its selection. The maximum found here is one possible way of working out V. There is another calculation you can do. Sum up the total strengths of all the batters left and divide by the total strengths of all the pitchers left. This would be the best possible result you could get for the pitchers left, because it is the result you get if you somehow manage to allocate pitchers to batters as evenly as possible. If this value is greater than the V you have calculated so far, use this instead of V to get a less optimistic (but more accurate) measure of how good any descendant of that partial answer could possibly be.
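The search just described could be sketched in Python as below. This is an illustrative sketch rather than a tuned implementation: the name `branch_and_bound` is made up, it starts without an initial guess (the best score so far is infinity), and it assumes all strengths are positive. The bound V is the larger of the worst ratio so far and (total strength of remaining batters)/(total strength of unused pitchers).

```python
import math

def branch_and_bound(batters, pitchers):
    m = len(batters)
    prefix = [0]
    for b in batters:
        prefix.append(prefix[-1] + b)
    best = [math.inf]                     # best complete score found so far

    def bound(score, pos, unused):
        """Optimistic value V: no extension of this partial answer can beat it."""
        if pos == m:
            return score
        strength_left = sum(pitchers[i] for i in unused)
        if strength_left == 0:
            return math.inf               # batters remain but no pitchers do
        return max(score, (prefix[m] - prefix[pos]) / strength_left)

    def search(score, pos, unused):
        if bound(score, pos, unused) >= best[0]:
            return                        # cannot improve on the best answer
        if pos == m:
            best[0] = score
            return
        for i in sorted(unused):          # choose the next pitcher
            rest = unused - {i}
            for end in range(pos + 1, m + 1):   # he faces batters pos+1..end
                s = max(score, (prefix[end] - prefix[pos]) / pitchers[i])
                if s >= best[0]:
                    break                 # longer stints only raise his ratio
                search(s, end, rest)

    search(0.0, 0, frozenset(range(len(pitchers))))
    return best[0]
```

Seeding `best[0]` with the score of any valid schedule, for instance one found by the dynamic-programming approach, would let the search discard poor partial answers from the start.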

Is there an optimal way to find the best division of an interval of some positive integers?

I am struggling with a conceptual problem.
I have positive integers from an interval [1800, 1850].
For every integer in that interval, say 1820 (without loss of generality), I have about 3000 horses; the number is the horse's year of birth. Those horses were fed traditional food, and some of them were also fed experimental food (there were 29 different types of experimental food). For every feeding of every horse, a variable called goodness of sneeze was recorded (the higher the goodness, the better). Let's assume a horse sneezed after every feeding. A horse could be fed a different type of food each time it came to a feeding (chosen with uniform distribution). Let us assume that the sneeze variable follows a Poisson distribution with parameter lambda = 1.
Now I am looking for the best [1800,1850] interval division on intervals like:
[1800,1810), [1810,1826), [1826,1850]
to say: for every subinterval this or that experimental food (or maybe traditional in some cases) gave best average sneeze for horses born in that interval.
I do not know if it matters, but let's assume that horses do not come to feedings regularly: some come more often than others. The experiment took 20 days.
Is there a good way of finding the best division relatively quickly?
I tried a loop for i in 1 to 50, where i is the number of cut points dividing the [1800,1850] interval.
If i=1
I check:
[1800,1801],(1801,1850]
[1800,1802],(1802,1850]
...
[1800,1849],(1849,1850]
and check which experimental food gave the highest mean sneeze in each subinterval, and answer the problem as in this example:
[1800,1807], (1807,1850]
is the best division among divisions with 1 cut point: for horses born in [1800,1807] the best food is experimentalFoodnr25, and for horses born in (1807,1850] the best food is experimentalFoodnr14.
Compared with traditional food, they give a 0.04 higher mean sneeze. (0.04 is of course a weighted mean with respect to the number of horses in both intervals.)
Then I can go on to i=2, and so on, but the higher i is, the fewer horses there are in each subinterval, and the estimate of the average sneeze has a greater standard error.
So I thought about choosing the division of [1800,1850] that has the biggest
weighted mean of a's, where a is calculated for each subinterval by the formula:
$a = \Phi^{-1}(1-p) \times \sqrt{ Var(X)/n_{X} + Var(Y)/n_{Y} } + \mu_{X} - \mu_{Y}$
where $X$ are the records for horses treated with the experimental food giving the highest average sneeze in that subinterval, and $Y$ are the records for horses treated with traditional food in that subinterval. $\mu$ are the means of those records, $Var$ are the variances, p is the probability such that $P( \mu_{X}-\mu_{Y}>a)=p$ (where I assume the means are normally distributed), $\Phi^{-1}$ is the inverse of the standard normal distribution function, and the n's are the numbers of records.
Does anyone have an idea for a relatively fast algorithm for this problem?
If the problem is not clear please tell me what to specify.
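The single-cut search from the question (the i=1 case) can be sketched in Python as follows. The data layout is an assumption made for illustration: `stats[(year, food)]` holds a pair (sum of sneeze goodness, number of records), and `baseline` names the traditional food. The score is the record-weighted mean gain over the baseline, as in the 0.04 example; it does not include the standard-error term from the formula for a.

```python
def best_single_split(stats, years, foods, baseline):
    """Return (score, cut, best_food_left, best_food_right) for the best cut,
    where the left side is years <= cut and the right side is years > cut."""
    def side_gain(side_years):
        totals = {f: [0.0, 0] for f in foods}
        for y in side_years:
            for f in foods:
                s, n = stats.get((y, f), (0.0, 0))
                totals[f][0] += s
                totals[f][1] += n
        means = {f: t[0] / t[1] for f, t in totals.items() if t[1] > 0}
        if baseline not in means:
            return None                      # nothing to compare against
        best_food = max(means, key=means.get)
        weight = sum(t[1] for t in totals.values())
        return best_food, means[best_food] - means[baseline], weight

    best = None
    for cut in years[:-1]:
        left = side_gain([y for y in years if y <= cut])
        right = side_gain([y for y in years if y > cut])
        if left is None or right is None:
            continue
        score = (left[1] * left[2] + right[1] * right[2]) / (left[2] + right[2])
        if best is None or score > best[0]:
            best = (score, cut, left[0], right[0])
    return best
```

With only 51 years the exhaustive scan over cuts is cheap. Picking the food with the highest observed mean in each subinterval tends to overstate the gain, which is the standard-error concern raised in the question.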

Simple algorithm to estimate probability based on past occurences?

Suppose that in N occurrences an event happens P times. The "naive" approach to estimating the probability of that event happening again the next time is P/N, but obviously the higher N is, the better our estimate.
What is a practical approach to model that "sureness" in the real world? I don't need something mathematically perfect, just something to make it a little bit more realistic. For example:
if a footballer scores 9 goals in 40 matches then I want the algorithm to rate him higher than a footballer who scores 1 goal in 4 matches
a movie with a rating of 8.0 with 100k votes should be placed higher than a 8.2 movie with 2k votes
etc...
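This looks like a job for the Wilson score interval: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval. Ranking by the lower bound of the Wilson score interval solves this kind of sorting problem: among items with similar observed proportions, the ones backed by more observations rank higher.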
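As a rough illustration, the lower bound of the Wilson score interval can be computed like this. The function name and the default z = 1.96 (roughly a 95% interval) are choices made for the sketch, and it treats the footballer example as "matches in which he scored" out of matches played, which is a simplification.

```python
import math

def wilson_lower_bound(successes, trials, z=1.96):
    """Lower end of the Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    phat = successes / trials
    z2 = z * z
    centre = phat + z2 / (2 * trials)
    margin = z * math.sqrt(phat * (1 - phat) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / (1 + z2 / trials)
```

With these settings, 9 out of 40 scores about 0.12 while 1 out of 4 scores about 0.05, so the first footballer ranks higher even though his raw rate (0.225) is lower than the second's (0.25). Star ratings such as 8.0 versus 8.2 are not binomial proportions, so the movie example would need a different (for instance Bayesian average) treatment.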

Guessing an unbounded integer

If I say to you:
"I am thinking of a number between 0 and n, and I will tell you if your guess is high or low", then you will immediately reach for binary search.
What if I remove the upper bound? i.e. I am thinking of a positive integer, and you need to guess it.
One possible method would be for you to guess 2, 4, 8, ..., until you guess 2**k for some k and I say "lower". Then you can apply binary search.
Is there a quicker method?
EDIT:
Clearly, any solution is going to take time proportional to the size of the target number. If I chuck Graham's number through the Ackermann function, we'll be waiting a while whatever strategy you pursue.
I could offer this algorithm too: Guess each integer in turn, starting from 1.
It's guaranteed to finish in a finite amount of time, but yet it's clearly much worse than my "powers of 2" strategy. If I can find a worse algorithm (and know that it is worse), then maybe I could find a better one?
For example, instead of powers of 2, maybe I can use powers of 10. Then I find the upper bound in log_10(n) steps, instead of log_2(n) steps. But I then have to search a bigger space. Say k = ceil(log_10(n)). Then I need log_2(10**k - 10**(k-1)) steps for my binary search, which is about 3.3*k, i.e. roughly log_2(n). For powers of 2, the search phase covers 2**k - 2**(k-1) values with k = ceil(log_2(n)), which is also roughly log_2(n) steps. Which wins?
What if I search upwards using n**n? Or some other sequence? Does the prize go to whoever can find the sequence that grows the fastest? Is this a problem with an answer?
Thank you for your thoughts. And my apologies to those of you suggesting I start at MAX_INT or 2**32-1, since I'm clearly drifting away from the bounds of practicality here.
FINAL EDIT:
Hi all,
Thank you for your responses. I accepted the answer by Norman Ramsey (and commenter onebyone) for what I understood to be the following argument: for a target number n, any strategy must be capable of distinguishing between (at least) the numbers from 0..n, which means you need at least on the order of log(n) comparisons.
However, several of you also pointed out that the problem is not well-defined in the first place, because it's not possible to pick a "random positive integer" under the uniform probability distribution (or, rather, a uniform probability distribution cannot exist over an infinite set). And once I give you a nonuniform distribution, you can split it in half and apply binary search as normal.
This is a problem that I've often pondered as I walk around, so I'm pleased to have two conclusive answers for it.
If there truly is no upper bound, and all numbers all the way to infinity are equally likely, then there is no optimum way to do this. For any finite guess G, the probability that the number is lower than G is zero and the probability that it is higher is 1 - so there is no finite guess that has an expectation of being higher than the number.
RESPONSE TO JOHN'S EDIT:
By the same reasoning that powers of 10 are expected to be better than powers of 2 (there's only a finite number of possible Ns for which powers of 2 are better, and an infinite number where powers of 10 are better), powers of 20 can be shown to be better than powers of 10.
So basically, yes, the prize goes to fastest-growing sequence (and for the same sequence, the highest starting point) - for any given sequence, it can be shown that a faster growing sequence will win in infinitely more cases. And since for any sequence you name, I can name one that grows faster, and for any integer you name, I can name one higher, there's no answer that can't be bettered. (And every algorithm that will eventually give the correct answer has an expected number of guesses that is infinite, anyway).
People (who have never studied probability) tend to think that "pick a number from 1 to N" means "with equal probability of each", and they act according to their intuitive understanding of probability.
Then when you say "pick any positive integer", they still think it means "with equal probability of each".
This is of course impossible - there exists no discrete probability distribution with domain the positive integers, where p(n) == p(m) for all n, m.
So, the person picking the number must have used some other probability distribution. If you know anything at all about that distribution, then you must base your guessing scheme on that knowledge in order to have the "fastest" solution.
The only way to calculate how "fast" a given guessing scheme is, is to calculate its expected number of guesses to find the answer. You can only do this by assuming a probability distribution for the target number. For example, if they have picked n with probability (1/2) ^ n, then I think your best guessing scheme is "1", "2", "3",... (average 2 guesses). I haven't proved it, though, maybe it's some other sequence of guesses. Certainly the guesses should start small and grow slowly. If they have picked 4 with probability 1 and all other numbers with probability 0, then your best guessing scheme is "4" (average 1 guess). If they have picked a number from 1 to a trillion with uniform distribution, then you should binary search (average about 40 guesses).
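The figures quoted in this answer can be checked with a few lines of arithmetic, written here in LaTeX. The only assumption is the standard series identity $\sum_{n \ge 1} n x^n = x/(1-x)^2$ for $|x| < 1$.

```latex
% No uniform distribution on the positive integers: if p(n) = c for all n, then
\sum_{n=1}^{\infty} c =
  \begin{cases} 0 & \text{if } c = 0 \\ \infty & \text{if } c > 0 \end{cases}
  \;\neq\; 1 .

% Guessing 1, 2, 3, ... when P(n) = (1/2)^n: a target of n takes n guesses, so
\mathbb{E}[\text{guesses}]
  = \sum_{n=1}^{\infty} n \left(\tfrac{1}{2}\right)^{n}
  = \frac{1/2}{\left(1 - 1/2\right)^{2}} = 2 .

% Binary search for a uniform target between 1 and 10^{12}:
\log_2 10^{12} = 12 \log_2 10 \approx 39.9 .
```

This confirms the "average 2 guesses" and "about 40 guesses" figures, but it does not prove that sequential guessing is optimal for the $(1/2)^n$ case.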
I said that is the only way to define "fast", but you could also look at the worst case. You have to assume a bound on the target, to prevent all schemes having the exact same speed, namely "no bound on the worst case". But you don't have to assume a distribution, and the answer for the "fastest" algorithm under this definition is obvious: binary search starting at the bound you selected. So I'm not sure this definition is terribly interesting...
In practice, you don't know the distribution, but can make a few educated guesses based on the fact that the picker is a human being, and what numbers humans are capable of conceiving. As someone says, if the number they picked is the Ackermann function for Graham's number, then you're probably in trouble. But if you know that they are capable of representing their chosen number in digits, then that actually puts an upper limit on the number they could have chosen. But it still depends what techniques they might have used to generate and record the number, and hence what your best knowledge is of the probability of the number being of each particular magnitude.
Worst case, you can find it in time logarithmic in the size of the answer using exactly the methods you describe. You might use Ackermann's function to find an upper bound faster than logarithmic time, but then the binary search between the number guessed and the previous guess will require time logarithmic in the size of the interval, which (if guesses grow very quickly) is close to logarithmic in the size of the answer.
It would be interesting to try to prove that there is no faster algorithm (e.g., O(log log n)), but I have no idea how to do it.
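The two-phase method from the question (grow the guess geometrically until told "lower", then binary search the last gap) can be sketched in Python like this. The function name and the three-valued `oracle` interface are assumptions made for the example; the target is taken to be a positive integer.

```python
def guess_unbounded(oracle, base=2):
    """oracle(g) returns -1 if g is too low, 0 if correct, 1 if too high.
    Returns (target, number_of_guesses)."""
    guesses = 0
    lo, hi = 1, base
    # Phase 1: guess base, base**2, base**3, ... until a guess is not too low.
    while True:
        guesses += 1
        answer = oracle(hi)
        if answer == 0:
            return hi, guesses
        if answer > 0:
            break
        lo, hi = hi + 1, hi * base
    # Phase 2: ordinary binary search strictly between the last two guesses.
    hi -= 1
    while lo <= hi:
        mid = (lo + hi) // 2
        guesses += 1
        answer = oracle(mid)
        if answer == 0:
            return mid, guesses
        if answer < 0:
            lo = mid + 1
        else:
            hi = mid - 1
    raise ValueError("oracle answers were inconsistent")
```

For example, `guess_unbounded(lambda g: (g > 1000) - (g < 1000))` finds 1000. With `base=2` the total is at most about 2*log_2(n) guesses; a larger base shortens the first phase and lengthens the second, so the total stays logarithmic in n.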
Mathematically speaking:
You cannot ever correctly find this integer. In fact, strictly speaking, the statement "pick any positive integer" is meaningless as it cannot be done: although you as a person may believe you can do it, you are actually picking from a bounded set - you are merely unconscious of the bounds.
Computationally speaking:
Computationally, we never deal with infinites, as we would have no way of storing or checking against any number larger than, say, the theoretical maximum number of electrons in the universe. As such, if you can estimate a maximum based on the number of bits used in a register on the device in question, you can carry out a binary search.
Binary search can be generalized: each time, the set of possible choices should be divided into two subsets of probability 0.5 each. In this form it is still applicable to infinite sets, but it requires knowledge of the distribution (for finite sets this requirement is quite often forgotten)...
My main refinement is that I'd start with a higher first guess instead of 2, around the average of what I'd expect them to choose. Starting with 64 would save 5 guesses vs starting with 2 when the number's over 64, at the cost of 1-5 more when it's less. 2 makes sense if you expect the answer to be around 1 or 2 half the time. You could even keep a memory of past answers to decide the best first guess. Another improvement could be to try negatives when they say "lower" on 0.
If this is guessing the upper bound of a number being generated by a computer, I'd start with 2**[number of bits/2], then scale up or down by powers of two. This, at least, gets you the closest to the possible values in the least number of jumps.
However, if this is a purely mathematical number, you can start with any value, since you have an infinite range of values, so your approach would be fine.
Since you do not specify any probability distribution of the numbers (as others have correctly mentioned, there is no uniform distribution over all the positive integers), the No Free Lunch Theorem gives the answer: any method (that does not repeat the same number twice) is as good as any other.
Once you start making assumptions about the distribution (e.g. that it is a human being or a binary computer that chooses the number) this of course changes, but as the problem is stated any algorithm is as good as any other when averaged over all possible distributions.
Use binary search starting with MAX_INT/2, where MAX_INT is the biggest number your platform can handle.
No point in pretending we can actually have infinite possibilities.
UPDATE: Given that you insist on entering the realms of infinity, I'll just vote to close your question as not programming related :-)
The standard default assumption of a uniform distribution for all positive integers doesn't lead to a solution, so you should start by defining the probability distribution of the numbers to guess.
I'd probably start my guessing with Graham's Number.
The practical answer within a computing context would be to start with whatever is the highest number that can (realistically) be represented by the type you are using. In case of some BigInt type you'd probably want to make a judgement call about what is realistic... obviously ultimately the bound in that case is the available memory... but performance-wise something smaller may be more realistic.
Your starting point should be the largest number you can think of plus 1.
There is no 'efficient search' for a number in an infinite range.
EDIT: Just to clarify, for any number you can think of there are still infinitely more numbers that are 'greater' than your number, compared to a finite collection of numbers that are 'less' than your number. Therefore, assuming the chosen number is randomly selected from all positive numbers, you have zero (or approaching zero) chance of being 'above' the chosen number.
I gave an answer to a similar question "Optimal algorithm to guess any random integer without limits?"
Actually, the algorithm provided there does not just search for the conceived number: it estimates the median of the distribution of the number, which you may re-conceive at each step! And the number could even be from the real domain ;)
