Simple algorithm to estimate probability based on past occurrences? - algorithm

Suppose that after N occurrences, an event has happened P times. The "naive" approach to estimating the probability of that event happening again next time is P/N, but obviously the higher N is, the better our estimate.
What is a practical approach to model that "sureness" in the real world? I don't need something mathematically perfect, just something to make it a little bit more realistic. For example:
if a footballer scores 9 goals in 40 matches then I want the algorithm to rate him higher than a footballer who scores 1 goal in 4 matches
a movie with a rating of 8.0 with 100k votes should be placed higher than an 8.2 movie with 2k votes
etc...

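This looks like a job for the Wilson score interval: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval. Ranking by the lower bound of the Wilson interval solves the problem of sorting on two dimensions at once (the observed proportion and the number of observations behind it).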
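As a minimal sketch of that idea, the lower bound of the Wilson score interval can be computed directly from P and N. The function name `wilson_lower_bound` and the default z = 1.96 (roughly a 95% interval) are assumptions made for illustration, and the formula only applies to success/failure counts such as the footballer example, not to averaged ratings.

```python
from math import sqrt

def wilson_lower_bound(successes, trials, z=1.96):
    """Lower end of the Wilson score interval for a binomial proportion.

    z is the normal quantile for the chosen confidence level
    (1.96 is an assumed default, about 95%).
    """
    if trials == 0:
        return 0.0  # no data: rank at the bottom
    p_hat = successes / trials
    z2 = z * z
    centre = p_hat + z2 / (2 * trials)
    margin = z * sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / (1 + z2 / trials)

# The two footballers from the question: 9 goals in 40 matches vs 1 goal in 4.
print(wilson_lower_bound(9, 40))
print(wilson_lower_bound(1, 4))
```

Sorting by this value rather than by P/N penalises small samples, so the 9-in-40 player ranks above the 1-in-4 player even though his raw rate is lower.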

Related

Programming a probability to allow an AI decide when to discard a card or not in 5 card poker

I am writing an AI to play 5 card poker, where you are allowed to discard a card from your hand and swap it for another randomly dealt one if you wish. My AI can value every possible poker hand as shown in the answer to my previous question. In short, it assigns a unique value to each possible hand where a higher value correlates to a better/winning hand.
My task is to now write a function, int getDiscardProbability(int cardNumber) that gives my AI a number from 0-100 relating to whether or not it should discard this card (0 = definitely do not discard, 100 = definitely discard).
The approach I have thought up of was to compute every possible hand by swapping this card for every other card in the deck (assume there are still 47 left, for now), then compare each of their values with the current hand, count how many are better and so (count / 47) * 100 is my probability.
However, this solution is simply looking for any better hand, and not distinguishing between how much better one hand is. For example, if my AI had the hand 23457, it could discard the 7 for a K, producing a very slightly better hand (better high card), or it could exchange the 7 for an A or a 6, completing the Straight - a much better hand (much higher value) than a High King.
So, when my AI is calculating this probability, it would be increased by the same amount when it sees that the hand could be improved by getting the K as it would when it sees that the hand could be improved by getting an A or 6. Because of this, I somehow need to factor in the difference in value between my hand and each of the possible hands when calculating this probability. What would be a good approach to achieve this?
Games in general have a chicken-egg problem: you want to design an AI that can beat a good player, but you need a good AI to train your AI against. I'll assume you're making an AI for a 2-player version of poker that has antes but no betting.
First, I'd note that if I had a table of probabilities for win-rate for each possible poker hand (of which there are surprisingly few really different ones), one can write a function that tells you the expected value from discarding a set of cards from your hand: simply enumerate all possible replacement cards and average the probability of winning with the hands. There aren't that many cards to evaluate -- even if you don't ignore suits, and you're replacing the maximum 3 cards, you have only 47 * 46 * 45 / 6 = 16215 possibilities. In practice, there are far fewer interesting possibilities -- for example, if the cards you don't discard aren't all of the same suit, you can ignore suits completely, and if they are of the same suit, you only need to distinguish "same suit" replacements from "different suit" replacements. This is slightly trickier than I describe it, since you've got to be careful to count possibilities right.
Then your AI can work by enumerating all the possible sets of cards to discard of which there are (5 choose 0) + (5 choose 1) + (5 choose 2) + (5 choose 3) = 1 + 5 + 10 + 10 = 26, and pick the one with the highest expectation, as computed above.
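A rough sketch of that enumeration, assuming you already have some `win_prob(hand)` function that returns the win-rate of a five-card hand (hypothetical here, as are the names `discard_sets`, `expected_win` and `best_discard`). It does not use the suit-based shortcuts mentioned above; it simply tries every replacement.

```python
from itertools import combinations

def discard_sets(hand, max_discards=3):
    """Yield every set of 0..max_discards cards that could be thrown away."""
    for k in range(max_discards + 1):
        yield from combinations(hand, k)

def expected_win(hand, discards, deck, win_prob):
    """Average win probability over all equally likely replacement draws."""
    kept = [card for card in hand if card not in discards]
    remaining = [card for card in deck if card not in hand]
    total, count = 0.0, 0
    for draw in combinations(remaining, len(discards)):
        total += win_prob(kept + list(draw))
        count += 1
    return total / count

def best_discard(hand, deck, win_prob):
    """Return the discard set with the highest expected win probability."""
    return max(discard_sets(hand),
               key=lambda d: expected_win(hand, d, deck, win_prob))
```

Cards can be any hashable values, for example `(rank, suit)` tuples. Because the average is taken over win probabilities rather than over a count of "better" hands, a draw that completes a straight contributes more than one that merely improves the high card.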
The chicken-egg problem is that you don't have a table of win-rate probabilities per hand. I describe an approach for a different poker-related game here, but the idea is the same: http://paulhankin.github.io/ChinesePoker/ . This approach is not my idea, and essentially the same idea is used for example in game-theory-optimal solvers for real poker variants like piosolver.
Here's the method.
Start with a table of probabilities made up somehow. Perhaps you just start assuming the highest rank hand (AKQJTs) wins 100% of the time and the worst hand (75432) wins 0% of the time, and that probabilities are linear in between. It won't matter much.
Now, simulate tens of thousands of hands with your AI and count how often each hand rank is played. You can use this to construct a new table of win-rate probabilities. This new table of win-rate probabilities is (ignoring some minor theoretical issues) an optimal counter-strategy to your AI in that an AI that uses this table knows how likely your original AI is to end up with each hand, and plays optimally against that.
The natural idea is now to repeat the process again, and hope this yields better and better AIs. However, the process will probably oscillate and not settle down. For example, if at one stage of your training your AI tends to draw to big hands, the counter AI will tend to play very conservatively, beating your AI when it misses its draw. And against a very conservative AI, a slightly less conservative AI will do better. So you'll tend to get a sequence of less and less conservative AIs, and then a tipping point where your AI is beaten again by an ultra-conservative one.
But the fix for this is relatively simple -- just blend the old table and the new table in some way (one standard way is to, at step i, replace the table with a weighted average of 1/i of the new table and (i-1)/i of the old table). This has the effect of not over-adjusting to the most recent iteration. And ignoring some minor details that occur because of assumptions (for example, ignoring replacement effects from the original cards in your hand), this approach will give you a game-theoretically optimal AI, as described in: "An iterative method of solving a game, Julia Robinson (1950)."
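The blending step might look like the following sketch, where a table is assumed to be a dict mapping each hand rank to a win-rate and `blend_tables` is a name chosen for illustration.

```python
def blend_tables(old, new, step):
    """Weighted average: 1/step of the new table, (step-1)/step of the old one."""
    weight_new = 1.0 / step
    return {hand: (1 - weight_new) * old[hand] + weight_new * new[hand]
            for hand in old}

# At step 2 the old and new tables are weighted equally.
print(blend_tables({"AKQJT": 1.0, "75432": 0.0},
                   {"AKQJT": 0.9, "75432": 0.2}, step=2))
```

With these weights the table after step i is the plain average of all the tables produced so far, which is what damps the oscillation.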
A simple (but not so simple) way would be to use some kind of database with the hand combination probabilities (maybe University of Alberta Computer Poker Research Group Database).
The idea is to find out the winning percentage of each combination, and then to compare those percentages across every hand you could keep.
For instance, you have 5 cards, AAAKJ, and it's time to discard (or not).
AAAKJ has a winning percentage (which I don't know, let's say 75)
AAAK (discarding J) has a 78 percentage (let's say).
AAAJ (discarding K) has x.
AAA (discarding KJ) has y.
AA (discarding AKJ) has z.
KJ (discarding AAA) has 11 (?)..
etc..
And the AI would keep the combination with the highest rate of success.
Instead of counting how many are better you might compute a sum of probabilities Pi that the new hand (with swapped card) will win, i = 1, ..., 47.
This might be a tough call because of other players as you don't know their cards, and thus, their current chances to win. To make it easier, maybe an approximation of some sort can be applied.
For example, Pi = N_lose / N, where N_lose is the number of hands that would lose to the new hand with the ith card, and N is the total number of possible hands that exclude the 5 cards the AI is holding. Finally, you use the sum of Pi instead of count.

Most optimal match-up

Let's assume you're a baseball manager. You have N pitchers in your bullpen (N<=14) and they have to face M batters (M<=100). You also know the strength of each of the pitchers and each of the batters. For those who are not familiar with baseball: once you bring in a relief pitcher he can pitch to k consecutive batters, but once he's taken out of the game he cannot come back.
For each pitcher, the probability that he's going to lose his match-ups is given by (sum of the strengths of all batters he will face)/(his strength). Try to minimize these probabilities, i.e. try to maximize your chances of winning the game.
For example, we have 3 pitchers and they have to face 3 batters. The batters' strengths are:
10 40 30
While the strength of your pitchers is:
40 30 3
The most optimal solution would be to bring the strongest pitcher to face the first 2 batters and the second to face the third batter. Then the probability of every pitcher losing his game will be:
50/40 = 1.25 and 30/30 = 1
So the probability of losing the game would be 1.25 (This number can be bigger than 100).
How can you find the optimal number? I was thinking of taking a greedy approach, but I doubt it will always hold. Also, the fact that a pitcher can face an unlimited number of batters (I mean it's only limited by M) poses the major problem for me.
Probabilities must be in the range [0.0, 1.0] so what you call a probability can't be a probability. I'm just going to call it a score and minimize it.
I'm going to assume for now that you somehow know the order in which the pitchers should play.
Given the order, what is left to decide is how long each pitcher plays. I think you can find this out using dynamic programming. Consider the batters to be faced in order. Build an MxN table best[batters, pitchers] where best[i, j] is the best score you can make considering just the first i batters using the first j pitchers, or HUGE if it does not make sense.
best[1,1] is just the score for the best pitcher against the first batter, and best[1,j] doesn't make sense for any other values of j.
For larger values of i you work out best[i,j] by considering when the last change of pitcher could be, considering all possibilities (so 1, 2, 3...i). If the last change of pitcher was at time t, then look up best[t, j-1] to get the score up to the time just before that change, and then calculate the a/b value to take account of the sum of batter strengths between time t+1 and time i. When you have considered all possible times, take the best score and use it as the value for best[i, j]. Note down enough info (such as the last time of pitcher change that turned out to be best) so that once you have calculated best[N, M], you can trace back to find the best schedule.
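Here is one way the dynamic programme could be sketched for a fixed pitcher order. In this sketch `best[i][j]` is the smallest achievable worst score for the first i batters using the first j pitchers, and, as a simplifying assumption not stated in the answer, a pitcher is allowed to face zero batters (that is, to be skipped). The function name `best_schedule` is made up for illustration.

```python
def best_schedule(batters, pitchers):
    """pitchers must be listed in the order they would be brought in.

    Returns (worst score, number of batters faced by each pitcher).
    """
    m, n = len(batters), len(pitchers)
    prefix = [0]
    for strength in batters:
        prefix.append(prefix[-1] + strength)
    huge = float("inf")
    best = [[huge] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        best[0][j] = 0.0  # no batters to face: score 0
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            # Pitcher j faces batters t+1..i (nobody when t == i).
            for t in range(i + 1):
                load = (prefix[i] - prefix[t]) / pitchers[j - 1]
                score = max(best[t][j - 1], load)
                if score < best[i][j]:
                    best[i][j] = score
                    cut[i][j] = t
    # Trace back the number of batters each pitcher faces.
    counts = [0] * n
    i = m
    for j in range(n, 0, -1):
        t = cut[i][j]
        counts[j - 1] = i - t
        i = t
    return best[m][n], counts

# Example from the question: batters 10 40 30, pitchers 40 30 3.
print(best_schedule([10, 40, 30], [40, 30, 3]))
```

The triple loop does on the order of N * M * M steps, which is small for N <= 14 and M <= 100.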
You don't actually know the order, and because the final score is the maximum of the a/b value for each pitcher, the order does matter. However, given a separation of players into groups, the best way to assign pitchers to groups is to assign the best pitcher to the group with the highest total score, the next best pitcher to the group with the next best total score, and so on. So you could alternate between dividing batters into groups, as described above, and then assigning pitchers to groups to work out the order the pitchers really should be in - keep doing this until the answer stops changing and hope the result is a global optimum. Unfortunately there is no guarantee of this.
I'm not convinced that your score is a good model for baseball, especially since it started out as a probability but can't be. Perhaps you should work out a few examples (maybe even solving small examples by brute force) and see if the results look reasonable.
Another way to approach this problem is via http://en.wikipedia.org/wiki/Branch_and_bound.
With branch and bound you need some way to describe partial answers, and you need some way to work out a value V for a given partial answer, such that no way of extending that partial answer can possibly produce a better answer than V. Then you run a tree search, extending partial answers in every possible way, but discarding partial answers which can't possibly be any better than the best answer found so far. It is good if you can start off with at least a guess at the best answer, because then you can discard poor partial answers from the start. My other answer might provide a way of getting this.
Here a partial answer is a selection of pitchers, in the order they should play, together with the number of batters they should pitch to. The first partial answer would have 0 pitchers, and you could extend this by choosing each possible pitcher, pitching to each possible number of batters, giving a list of partial answers each mentioning just one pitcher, most of which you could hopefully discard.
Given a partial answer, you can compute the (total batter strength)/(Pitcher strength) for each pitcher in its selection. The maximum found here is one possible way of working out V. There is another calculation you can do. Sum up the total strengths of all the batters left and divide by the total strengths of all the pitchers left. This would be the best possible result you could get for the pitchers left, because it is the result you get if you somehow manage to allocate pitchers to batters as evenly as possible. If this value is greater than the V you have calculated so far, use this instead of V to get a less optimistic (but more accurate) measure of how good any descendant of that partial answer could possibly be.

Is there an optimal way to find the best division of an interval of some positive integers?

I am struggling with a conceptual problem.
I have positive integers from an interval [1800, 1850].
For every integer from that interval, let's say (without loss of generality) 1820, I have about 3000 horses. The number 1820 is a horse's year of birth. Those horses were fed traditional food, and some of them were fed experimental food (there were 29 different types of experimental food). For every horse, a variable named goodness of sneeze was recorded at each feeding (the higher the goodness variable, the better). Let's assume a horse sneezed after every feeding. Every horse could be fed a different type of food each time it came to feed (chosen with uniform distribution). Let us assume that the sneeze goodness comes from a Poisson distribution with parameter lambda=1.
Now I am looking for the best [1800,1850] interval division on intervals like:
[1800,1810), [1810,1826), [1826,1850]
to say: for every subinterval this or that experimental food (or maybe traditional in some cases) gave best average sneeze for horses born in that interval.
I do not know if it is needed, but let's assume that horses do not come to feed regularly. Some of them come more often than others. The experiment took 20 days.
Is there a good way of generating the best division relatively fast?
I tried a loop for i in 1 to 50, where i is the number of division points of the [1800,1850] interval.
If i=1
I check:
[1800,1801],(1801,1850]
[1800,1802],(1802,1850]
...
[1800,1849],(1849,1850]
and check which experimental food gave the biggest mean sneeze in that subinterval and answer the problem as this example:
[1800,1807], (1807,1850]
is the best division among those with one division point: for horses born in [1800,1807] the best food is experimentalFoodnr25, and for horses born in (1807,1850] the best food is experimentalFoodnr14.
Compared with traditional food, they give a 0.04 higher mean sneeze. (0.04 is, of course, a weighted mean with respect to the number of horses in both intervals.)
Then I can go on to i=2, and so on, but the higher i is, the fewer horses there are in the subintervals and the greater the standard error of the estimated average sneeze.
So I thought about choosing the division of [1800,1850] that has the biggest
weighted mean of a's, where a is calculated for each subinterval by the formula:
$a = \Phi^{-1}( 1- p ) \times \sqrt{ Var(X)/n_{x} + Var(Y)/n_{y} } + \mu_{X} - \mu_{Y}$
where $X$ are the records for horses fed the experimental food giving the highest average sneeze in that subinterval, and $Y$ are the records for horses fed traditional food in that subinterval. $\mu$ are the means of those records, $Var$ are the variances, $n$ are the numbers of records, $p$ is the probability such that $P( \mu_{X}-\mu_{Y}>a)=p$ (where I assume $\mu_{X}$ is normally distributed), and $\Phi$ is the standard normal cumulative distribution function.
Does anyone have an idea for a relatively fast algorithm for this problem?
If the problem is not clear please tell me what to specify.

How can I measure trends in certain words, like Twitter?

I have a corpus of newspaper articles organized by day. Each word in the corpus has a frequency count for each day. I have been toying with finding an algorithm that captures the break-away words, similar to the way Twitter measures Trends in people's tweets.
For instance, say the word 'recession' appears with the following frequency in the same group of newspapers:
Day 1 | recession | 456
Day 2 | recession | 2134
Day 3 | recession | 3678
While 'europe'
Day 1 | europe | 67895
Day 2 | europe | 71999
Day 3 | europe | 73321
I was thinking of taking the % growth per day and multiplying it by the log of the sum of frequencies. Then I would take the average to score and compare various words.
In this case:
recession = (3.68*8.74+0.72*8.74)/2 = 19.23
europe = (0.06*12.27+0.02*12.27)/2 = 0.49
Is there a better way to capture the explosive growth? I'm trying to mine the daily corpus to find terms that are more and more mentioned in a specific time period across time. PLEASE let me know if there is a better algorithm. I want to be able to find words with high non-constant acceleration. Maybe taking the second derivative would be more effective. Or maybe I'm making this way too complex and watched too much physics programming on the discovery channel. Let me know with a math example if possible Thanks!
First thing to notice is that this can be approximated by a local problem. That is to say, a "trending" word really depends only upon recent data. So immediately we can truncate our data to the most recent N days where N is some experimentally determined optimal value. This significantly cuts down on the amount of data we have to look at.
In fact, the NPR article suggests this.
Then you need to somehow look at growth. And this is precisely what the derivative captures. First thing to do is normalize the data. Divide all your data points by the value of the first data point. This makes it so that the large growth of an infrequent word isn't drowned out by the relatively small growth of a popular word.
For the first derivative, do something like this:
d[i] = (data[i] - data[i+k])/k
for some experimentally determined value of k (which, in this case, is a number of days). Similarly, the second derivative can be expressed as:
d2[i] = (data[i] - 2*data[i+k] + data[i+2k])/(k^2)
Higher derivatives can also be expressed like this. Then you need to assign some kind of weighting system for these derivatives. This is a purely experimental procedure which really depends on what you want to consider "trending." For example, you might want to give acceleration of growth half as much weight as the velocity. Another thing to note is that you should try your best to remove noise from your data because derivatives are very sensitive to noise. You do this by carefully choosing your value for k as well as discarding words with very low frequencies altogether.
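A small sketch of this scoring, under a few assumptions: the counts are listed oldest first (so the differences look backwards from the newest day), the second difference is divided by k squared (the usual finite-difference scaling), and acceleration gets half the weight of velocity as in the example above. The name `trend_score` and its parameters are illustrative only.

```python
def trend_score(counts, k=1, accel_weight=0.5):
    """Score the newest day of a series of daily counts (oldest first)."""
    if len(counts) < 2 * k + 1:
        raise ValueError("need at least 2*k + 1 days of data")
    base = counts[0]
    data = [c / base for c in counts]  # normalise by the first data point
    newest = len(data) - 1
    velocity = (data[newest] - data[newest - k]) / k
    acceleration = (data[newest] - 2 * data[newest - k]
                    + data[newest - 2 * k]) / (k * k)
    return velocity + accel_weight * acceleration

# The two words from the question.
print(trend_score([456, 2134, 3678]))       # recession
print(trend_score([67895, 71999, 73321]))   # europe
```

With only three days of data the result is very noisy; in practice a larger k, or smoothing the counts first, would be needed as discussed above.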
I also notice that you multiply by the log sum of the frequencies. I presume this is to give the growth of popular words more weight (because more popular words are less likely to trend in the first place). The standard way of measuring how popular a word is is by looking at its inverse document frequency (IDF).
I would divide by the IDF of a word to give the growth of more popular words more weight.
IDF[word] = log(D/df[word])
where D is the total number of documents (e.g. for Twitter it would be the total number of tweets) and df[word] is the number of documents containing word (e.g. the number of tweets containing a word).
A high IDF corresponds to an unpopular word whereas a low IDF corresponds to a popular word.
The problem with your approach (measuring daily growth in percentage) is that it disregards the usual "background level" of the word, as your example shows; 'europe' grows more quickly than 'recession' in absolute counts, yet it has a much lower score.
If the background level of words has a well-behaved distribution (Gaussian, or something else that doesn't wander too far from the mean) then I think a modification of CanSpice's suggestion would be a good idea. Work out the mean and standard deviation for each word, using days C-N+1-T to C-T, where C is the current date, N is the number of days to take into account, and T is the number of days that define a trend.
Say for instance N=90 and T=3, so we use about three months for the background, and say a trend is defined by three peaks in a row. In that case, for example, you can rank the words according to their chi-squared p-value, calculated like so:
(mu, sigma) = fitGaussian(word='europe', startday=C-N+1-3, endday=C-3)
X1 = count(word='europe', day=C-2)
X2 = count(word='europe', day=C-1)
X3 = count(word='europe', day=C)
S = ((X1-mu)/sigma)^2 + ((X2-mu)/sigma)^2 + ((X3-mu)/sigma)^2
p = pval.chisq(S, df=3)
Essentially then, you can get the words which over the last three days are the most extreme compared to their background level.
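The pseudocode above could be made concrete roughly as follows, for the T=3 case only. Instead of a statistics library, this sketch uses the closed-form upper tail of the chi-squared distribution with 3 degrees of freedom; `trend_p_value` and `chi2_sf_3df` are illustrative names, and the counts are assumed to be listed oldest first.

```python
from math import erfc, exp, pi, sqrt
from statistics import mean, pstdev

def chi2_sf_3df(s):
    """P(X > s) for a chi-squared variable with 3 degrees of freedom."""
    return erfc(sqrt(s / 2)) + sqrt(2 * s / pi) * exp(-s / 2)

def trend_p_value(counts, background_days=90):
    """counts: daily counts for one word, oldest first.

    The last 3 days are tested against the background_days before them.
    """
    recent = counts[-3:]
    background = counts[-(background_days + 3):-3]
    mu = mean(background)
    sigma = pstdev(background)
    if sigma == 0:
        raise ValueError("background has no variation")
    s = sum(((x - mu) / sigma) ** 2 for x in recent)
    return chi2_sf_3df(s)
```

Words with the smallest p-value are the most extreme relative to their own background. Note that the statistic squares the deviations, so a sudden drop scores as highly as a sudden rise; checking the sign of the deviations separately would filter those out.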
I would first try a simple solution. A simple weighted difference between adjacent days should probably work. Maybe take the log before that. You might have to experiment with the weights. For example, (-2,-1,1,2) would give you points where the data is exploding.
If this is not enough, you can try slope filtering ( http://www.claysturner.com/dsp/fir_regression.pdf ). Since the algorithm is based on linear regression, it should be possible to modify it for other types of regression (for example quadratic).
All attempts using filtering techniques such as these also have the advantage that they can be made to run very fast, and you should be able to find libraries that provide fast filtering.

Translating a score into a probability

People visit my website, and I have an algorithm that produces a score between 0 and 1. The higher the score, the greater the probability that this person will buy something, but the score isn't a probability, and it may not have a linear relationship with the purchase probability.
I have a bunch of data about what scores I gave people in the past, and whether or not those people actually made a purchase.
Using this data about what happened with scores in the past, I want to be able to take a score and translate it into the corresponding probability based on this past data.
Any ideas?
edit: A few people are suggesting bucketing, and I should have mentioned that I had considered this approach, but I'm sure there must be a way to do it "smoothly". A while ago I asked a question about a different but possibly related problem here, I have a feeling that something similar may be applicable but I'm not sure.
edit2: Let's say I told you that of the 100 customers with a score above 0.5, 12 of them purchased, and of the 25 customers with a score below 0.5, 2 of them purchased. What can I conclude, if anything, about the estimated purchase probability of someone with a score of 0.5?
Draw a chart - plot the fraction of visitors who bought on the Y axis and the score on the X axis - fit a curve - then for a given score you can get the probability from the height of the curve.
(you don't need to physically create a chart - but the algorithm should be evident from the exercise)
Simples.
That is what logistic regression, probit regression, and company were invented for. Nowadays most people would use logistic regression, but fitting involves iterative algorithms - there are, of course, lots of implementations, but you might not want to write one yourself. Probit regression has an approximate explicit solution described at the link that might be good enough for your purposes.
A possible way to assess whether logistic regression would work for your data would be to look at a plot of each score versus the logit of the probability of purchase (log(p/(1-p))), and see whether these form a straight line.
I eventually found exactly what I was looking for, an algorithm called “pair-adjacent violators”. I initially found it in this paper, however be warned that there is a flaw in their description of the implementation.
I describe the algorithm, this flaw, and the solution to it on my blog.
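For readers who want a feel for the technique, here is a generic sketch of pair-adjacent violators (isotonic regression) applied to score/outcome data. It is not taken from the paper or the blog post mentioned above; the names `pav_calibrate` and `lookup` are made up, and tied scores get no special treatment.

```python
from bisect import bisect_left

def pav_calibrate(scores, outcomes):
    """Fit a non-decreasing step function from score to purchase probability.

    Returns a list of (highest score in block, probability) pairs.
    """
    pairs = sorted(zip(scores, outcomes))
    blocks = []  # each block: [sum of outcomes, count, highest score]
    for score, outcome in pairs:
        blocks.append([float(outcome), 1, score])
        # Merge while the previous block's mean exceeds the current one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]

def lookup(calibration, score):
    """Probability for a new score, using the block it falls into."""
    tops = [top for top, _ in calibration]
    index = min(bisect_left(tops, score), len(calibration) - 1)
    return calibration[index][1]

calibration = pav_calibrate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                            [0, 1, 0, 0, 1, 1])
print(calibration)
print(lookup(calibration, 0.25))
```

The result is a step function rather than a smooth curve, but unlike fixed buckets the step boundaries are chosen by the data, and the output is guaranteed never to decrease as the score increases.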
Well, the straightforward way to do this would be to calculate which percentage of people in a score interval purchased something and do this for all intervals (say, every .05 points).
Have you noticed an actual correlation between a higher score and an increased likelihood of purchases in your data?
I'm not an expert in statistics and there might be a better answer though.
You could divide the scores into a number of buckets, e.g. 0.0-0.1, 0.1-0.2,... and count the number of customers who purchased and did not purchase something for each bucket.
Alternatively, you may want to plot each score against the amount spent (as a scattergram) and see if there is any obvious relationship.
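The bucketing suggested in the answers above takes only a few lines. This sketch assumes scores lie in [0, 1] and uses ten equal-width buckets by default; `bucket_rates` is an illustrative name.

```python
def bucket_rates(scores, purchased, n_buckets=10):
    """Return {bucket index: purchase rate} for buckets that contain data."""
    buyers = [0] * n_buckets
    totals = [0] * n_buckets
    for score, bought in zip(scores, purchased):
        # A score of exactly 1.0 goes into the last bucket.
        index = min(int(score * n_buckets), n_buckets - 1)
        totals[index] += 1
        buyers[index] += 1 if bought else 0
    return {i: buyers[i] / totals[i] for i in range(n_buckets) if totals[i]}

print(bucket_rates([0.05, 0.15, 0.12, 0.95, 0.91, 1.0],
                   [0, 1, 0, 1, 1, 0]))
```

Buckets with few customers give unreliable rates, so it is worth looking at the counts alongside the rates before trusting any one bucket.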
You could use exponential decay to produce a weighted average.
Take your users, arrange them in order of scores (break ties randomly).
Working from left to right, start with a running average of 0. Each user you get, change the average to average = (1-p) * average + p * (sale ? 1 : 0). Do the same thing from the right to the left, except start with 1.
The smaller you make p, the smoother your curve will become. Play around with your data until you have a value of p that gives you results that you like.
Incidentally this is the key idea behind how load averages get calculated by Unix systems.
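A sketch of the two passes described in this answer. The answer does not say how to combine the left-to-right and right-to-left averages, so taking their mean is an assumption here; ties are also left in sorted order rather than broken randomly. The name `smoothed_rates` is illustrative.

```python
def smoothed_rates(scores, purchased, p=0.05):
    """Return (score, smoothed purchase rate) pairs in increasing score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)

    # Left-to-right pass, starting from a running average of 0.
    left = [0.0] * n
    average = 0.0
    for pos, i in enumerate(order):
        average = (1 - p) * average + p * (1.0 if purchased[i] else 0.0)
        left[pos] = average

    # Right-to-left pass, starting from a running average of 1.
    right = [0.0] * n
    average = 1.0
    for pos in range(n - 1, -1, -1):
        i = order[pos]
        average = (1 - p) * average + p * (1.0 if purchased[i] else 0.0)
        right[pos] = average

    # Assumption: combine the two passes by averaging them.
    return [(scores[order[pos]], (left[pos] + right[pos]) / 2)
            for pos in range(n)]
```

Smaller values of p average over more neighbouring users, giving a smoother but slower-reacting curve.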
Based on your edit2 comment, you do not have enough data to make a statement. Your overall purchase rate is 11.2%. That is not statistically different from your two purchase rates above and below 0.5. Additionally, to validate your score, you would have to ensure that the purchase percentages increase monotonically as your score increases. You could bucket, but you would need to check your results against a probability calculator to make sure they did not occur by chance.
http://stattrek.com/Tables/Binomial.aspx

Resources