Is there an optimal way to find the best division of an interval of some positive integers? - algorithm

I am struggling with a conceptual problem.
I have positive integers from an interval [1800, 1850].
For every integer from that interval, let's say (without loss of generality) 1820, I have about 3000 horses. The number 1820 is the year of birth of a horse. Those horses were fed with a traditional food, and some of them were fed with experimental food (there were 29 different types of experimental food). For every horse, a variable called goodness of sneeze was recorded at each feeding (the higher the goodness variable, the better). Let's assume that after every feeding a horse did sneeze. Every single horse could be fed a different type of food every time it came to a feeding (chosen with uniform distribution). Let us assume that the sneeze for horses comes from a Poisson distribution with parameter lambda=1.
Now I am looking for the best division of the interval [1800,1850] into subintervals like:
[1800,1810), [1810,1826), [1826,1850]
so that I can say: for every subinterval, this or that experimental food (or maybe the traditional one in some cases) gave the best average sneeze for horses born in that subinterval.
I do not know if it is needed, but let's assume that horses do not come to feedings regularly; some of them come more often than others. The experiment took 20 days.
Is there a good way of generating the best division in a relatively fast way?
I tried to make a loop for i in 1 to 50, where i is the number of split points dividing [1800,1850].
For i=1
I check:
[1800,1801],(1801,1850]
[1800,1802],(1802,1850]
...
[1800,1849],(1849,1850]
and check which experimental food gave the biggest mean sneeze in each subinterval, and answer the problem as in this example:
[1800,1807], (1807,1850]
is the best division among the divisions with 1 split point; for horses born in [1800,1807] the best food is experimentalFoodnr25 and for horses born in (1807,1850] the best food is experimentalFoodnr14.
Compared to the traditional food, they give a 0.04 higher mean sneeze for the horses (0.04 is of course a weighted mean with respect to the number of horses in the two intervals).
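In code, that i=1 pass looks roughly like this (a sketch in Python, assuming a hypothetical list horses of (birth_year, food, sneeze) records; it only picks the best food on each side and uses a crude unweighted score, skipping the comparison with the traditional food for brevity):

from collections import defaultdict

def best_single_split(horses, lo=1800, hi=1850):
    # horses: iterable of (birth_year, food, sneeze) records (hypothetical format)
    best = None
    for c in range(lo + 1, hi):                      # candidate division [lo, c], (c, hi]
        split_result = []
        for left in (True, False):
            by_food = defaultdict(list)
            for year, food, sneeze in horses:
                if (year <= c) == left:
                    by_food[food].append(sneeze)
            # food with the highest mean sneeze on this side of the split
            food, values = max(by_food.items(), key=lambda kv: sum(kv[1]) / len(kv[1]))
            split_result.append((food, sum(values) / len(values)))
        score = sum(mean for _, mean in split_result)    # crude score; the weighting comes next
        if best is None or score > best[0]:
            best = (score, c, split_result)
    return best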
Then I can go on to i=2, and so on, but the higher i is, the fewer horses there are in each subinterval, and the estimate of the average sneeze has a greater standard error.
So I thought about choosing the best division of [1800,1850] as the one that has the biggest
weighted mean of the a's, where a is calculated for each subinterval by the formula:
$a = \Phi^{-1}(1-p) \times \sqrt{ Var(X)/n_{X} + Var(Y)/n_{Y} } + \mu_{X} - \mu_{Y}$
where $X$ are the records for horses treated with the experimental food giving the highest average sneeze in that subinterval, $Y$ are the records for horses treated with the traditional food in that subinterval, $\mu$ are the means of those records, $Var$ are the variances, $p$ is the probability such that $P( \mu_{X}-\mu_{Y}>a)=p$ (where I assume $\mu_{X}$ and $\mu_{Y}$ are normally distributed), $\Phi^{-1}$ is the inverse of the standard normal distribution function, and the $n$'s are the numbers of records.
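For a single subinterval, this a can be computed like so (a minimal sketch, assuming arrays x and y hold the sneeze records for the best experimental food and for the traditional food):

import numpy as np
from scipy.stats import norm

def subinterval_a(x, y, p=0.05):
    # x: records for the best experimental food, y: records for the traditional food
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    # a such that P(mu_X - mu_Y > a) = p under the normal approximation
    return norm.ppf(1 - p) * se + (x.mean() - y.mean())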
Does anyone have an idea for a relatively fast algorithm for this problem?
If the problem is not clear please tell me what to specify.

Related

How to shuffle eight items to approximate maximum entropy?

I need to analyze 8 chemical samples repeatedly over 5 days (each sample is analyzed exactly once every day). I'd like to generate pseudo-random sample sequences for each day which achieve the following:
avoid bias in the daily sequence position (e.g., avoid some samples being processed mostly in the morning)
avoid repeating sample pairs over different days (e.g. 12345678 on day 1 and 87654321 on day 2)
generally randomize the distance between two given samples from one day to the other
I may have poorly phrased the conditions above, but the general idea is to minimize systematic effects like sample cross-contamination and/or analytical drift over each day. I could just shuffle each sequence randomly, but because the number of sequences generated is small (N=5 versus 40,320 possible combinations), I'm unlikely to approach something like maximum entropy.
Any ideas? I suspect this is a common problem in analytical science which has been solved, but I don't know where to look.
Just thinking out loud:
The base metric that you might use is the Levenshtein distance, or some slight modification of it (maybe
myDist(w1, w2) = min(levD(w1, w2), levD(w1.reversed(), w2))
)
Since you want to avoid small distances between any pair of days,
the overall metric can be the sum, over all pairs of days, of the distance between their sample orders.
Similarity = myDist(day1, day2)
+ myDist(day1, day3)
+ myDist(day1, day4)
+ myDist(day1, day5)
+ myDist(day2, day3)
+ myDist(day2, day4)
+ myDist(day2, day5)
+ myDist(day3, day4)
+ myDist(day3, day5)
+ myDist(day4, day5)
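A sketch of that metric and the pairwise sum (the quantity called Similarity above, which is really a total distance to be maximized); a plain Levenshtein implementation is included so nothing external is assumed:

from itertools import combinations

def levD(a, b):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def myDist(w1, w2):
    return min(levD(w1, w2), levD(w1[::-1], w2))

def total_distance(days):
    # days: list of sample-order strings such as "12345678"
    return sum(myDist(a, b) for a, b in combinations(days, 2))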
What is still missing is a heuristic for how to create the sample orders.
Your problem reminds me of a shortest-path-finding problem, but with the further difficulty that each selected node influences the weights of the whole graph. So it is much harder.
Maybe a table with all myDist distances between each pair of the 8! permutations can be created (the distance is symmetric, so only a triangular matrix without the identity diagonal is needed, requiring roughly 1 GB of memory). This may speed things up very much.
Maybe take the maximum from this matrix and consider each permutation with a value below some threshold as equally worthless, to reduce the search space.
Build a starting set (see the sketch below):
Use 12345678 as the fixed day 1, since the first day does not matter. Never change this.
Then repeat until n days are chosen:
add the order most distant from the current one;
if there are multiple equal possibilities, use the one that is also most distant from the previous days.
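A sketch of that greedy construction, reusing myDist from the sketch above and brute-forcing over all 8! orders (the precomputed distance table is skipped here just to keep it short):

from itertools import permutations

def build_start_set(n_days=5, items="12345678"):
    days = [items]                                   # fixed day 1, never changed
    candidates = ["".join(p) for p in permutations(items)]
    while len(days) < n_days:
        best = max((c for c in candidates if c not in days),
                   key=lambda c: (myDist(days[-1], c),                 # most distant from the current day
                                  sum(myDist(d, c) for d in days)))    # tie-break: distance to all previous days
        days.append(best)
    return days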
Now iteratively improve the solution, maybe with some ruin-and-recreate approach. You should always back up the absolute maximum you found, and you can run as many iterations as you want (and have time for):
choose the (one or two) day(s) with the smallest distance sums to the other days;
maybe brute force an optimal (in terms of overall distance) combination for these two days;
repeat.
If the optimization gets stuck (only the same 2 days are chosen, or the distance is not improving at all),
randomly change one or two days to random orders.
Maybe totally random starting sets (apart from day 1) can also be tried.

Programming a probability to allow an AI decide when to discard a card or not in 5 card poker

I am writing an AI to play 5 card poker, where you are allowed to discard a card from your hand and swap it for another randomly dealt one if you wish. My AI can value every possible poker hand as shown in the answer to my previous question. In short, it assigns a unique value to each possible hand where a higher value correlates to a better/winning hand.
My task is to now write a function, int getDiscardProbability(int cardNumber), that gives my AI a number from 0-100 relating to whether or not it should discard this card (0 = definitely do not discard, 100 = definitely discard).
The approach I thought of was to compute every possible hand by swapping this card with every other card in the deck (assume there are still 47 left, for now), then compare each of their values with the current hand, count how many are better, and so (count / 47) * 100 is my probability.
However, this solution simply looks for any better hand, without distinguishing how much better one hand is. For example, if my AI had the hand 23457, it could discard the 7 for a K, producing a very slightly better hand (better high card), or it could exchange the 7 for an A or a 6, completing the Straight - a much better hand (much higher value) than king high.
So, when my AI calculates this probability, it is increased by the same amount when it sees that the hand could be improved by getting the K as when it sees that the hand could be improved by getting an A or 6. Because of this, I somehow need to factor in the difference in value between my hand and each of the possible hands when calculating this probability. What would be a good approach to achieve this?
Games in general have a chicken-egg problem: you want to design an AI that can beat a good player, but you need a good AI to train your AI against. I'll assume you're making an AI for a 2-player version of poker that has antes but no betting.
First, I'd note that if I had a table of win-rate probabilities for each possible poker hand (of which there are surprisingly few really different ones), one could write a function that tells you the expected value of discarding a set of cards from your hand: simply enumerate all possible replacement cards and average the probability of winning over the resulting hands. There's not that many cards to evaluate -- even if you don't ignore suits, and you're replacing the maximum 3 cards, you have only 47 * 46 * 45 / 6 = 16215 possibilities. In practice, there are many fewer interesting possibilities -- for example, if the cards you don't discard aren't all of the same suit, you can ignore suits completely, and if they are of the same suit, you only need to distinguish "same suit" replacements from "different suit" replacements. This is slightly trickier than I describe it, since you've got to be careful to count possibilities right.
Then your AI can work by enumerating all the possible sets of cards to discard, of which there are (5 choose 0) + (5 choose 1) + (5 choose 2) + (5 choose 3) = 1 + 5 + 10 + 10 = 26, and picking the one with the highest expectation, as computed above (see the sketch below).
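A sketch of that enumeration in Python (win_prob(hand) is the assumed table lookup discussed below, deck is the list of 47 unseen cards, and the suit bucketing is ignored for brevity):

from itertools import combinations
from statistics import mean

def best_discard(hand, deck, win_prob):
    # hand: the 5 held cards, deck: the 47 unseen cards, win_prob: hand -> win probability
    best = (win_prob(hand), ())                      # the "discard nothing" option
    for k in (1, 2, 3):
        for discard in combinations(hand, k):
            kept = [c for c in hand if c not in discard]
            ev = mean(win_prob(kept + list(replacement))
                      for replacement in combinations(deck, k))
            if ev > best[0]:
                best = (ev, discard)
    return best                                       # (expected win probability, cards to discard)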
The chicken-egg problem is that you don't have a table of win-rate probabilities per hand. I describe an approach for a different poker-related game here, but the idea is the same: http://paulhankin.github.io/ChinesePoker/ . This approach is not my idea, and essentially the same idea is used for example in game-theory-optimal solvers for real poker variants like piosolver.
Here's the method.
Start with a table of probabilities made up somehow. Perhaps you just start assuming the highest rank hand (AKQJTs) wins 100% of the time and the worst hand (75432) wins 0% of the time, and that probabilities are linear in between. It won't matter much.
Now, simulate tens of thousands of hands with your AI and count how often each hand rank is played. You can use this to construct a new table of win-rate probabilities. This new table of win-rate probabilities is (ignoring some minor theoretical issues) an optimal counter-strategy to your AI in that an AI that uses this table knows how likely your original AI is to end up with each hand, and plays optimally against that.
The natural idea is now to repeat the process again, and hope this yields better and better AIs. However, the process will probably oscillate and not settle down. For example, if at one stage of your training your AI tends to draw to big hands, the counter AI will tend to play very conservatively, beating your AI when it misses its draw. And against a very conservative AI, a slightly less conservative AI will do better. So you'll tend to get a sequence of less and less conservative AIs, and then a tipping point where your AI is beaten again by an ultra-conservative one.
But the fix for this is relatively simple -- just blend the old table and the new table in some way (one standard way is to, at step i, replace the table with a weighted average of 1/i of the new table and (i-1)/i of the old table). This has the effect of not over-adjusting to the most recent iteration. And ignoring some minor details that occur because of assumptions (for example, ignoring replacement effects from the original cards in your hand), this approach will give you a game-theoretically optimal AI, as described in: "An iterative method of solving a game, Julia Robinson (1950)."
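The blending step itself is tiny; a sketch (with the table as a hypothetical dict mapping hand rank to win probability):

def blend(old_table, new_table, i):
    # step i: weight the newly measured table by 1/i and the previous table by (i-1)/i
    return {rank: new_table[rank] / i + old_table[rank] * (i - 1) / i
            for rank in old_table}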
A simple (but not so simple) way would be to use some kind of database with the hand combination probabilities (maybe University of Alberta Computer Poker Research Group Database).
The idea is to know, for each combination, what percentage chance of winning it has, and then to compare that percentage across each possible combination.
For instance, you have 5 cards, AAAKJ, and it's time to discard (or not).
AAAKJ has a winning percentage (which I don't know, let's say 75).
AAAK (discarding J) has a winning percentage of 78 (let's say).
AAAJ (discarding K) has x.
AAA (discarding KJ) has y.
AA (discarding AKJ) has z.
KJ (discarding AAA) has 11 (?).
etc...
And the AI would keep the combination which had the highest rate of success.
Instead of counting how many hands are better, you might compute a sum of probabilities Pi that the new hand (with the swapped card) will win, i = 1, ..., 47.
This might be a tough call because of the other players, as you don't know their cards and thus their current chances to win. To make it easier, maybe an approximation of some sort can be applied.
For example, Pi = N_lose / N, where N_lose is the number of hands that would lose to the new hand with the ith card, and N is the total possible number of hands without the 5 that the AI is holding. Finally, you use the sum of the Pi instead of the count.

"Time Aware" Exponential Moving Average

I am trying to figure out an online algorithm for "time aware" exponential moving average, sampled at varying times. By "time aware" I mean something like "giving more weight to data sampled at similar time of day", but (a) I'll give a more precise definition and (b) this is only an example for something more general that interests me.
I'll start by defining "time aware" by giving a precise example that assumes that data is sampled in constant intervals during the day; say, every 1 hour. In that case, I keep 24 different EMAs, and whenever data is sampled, I put it into the relevant EMA, taking its result and putting it in a general EMA of the results. So, at 12:00, Tuesday, I get the result of the EMA of the EMA results for 12:00, 11:00, 10:00, etc. where the EMA result for 12:00 is the EMA of some typical period of x days of data sampled at 12:00, etc.
This is an online algorithm which works well and provides reasonable results for the case where the data is sampled in constant time intervals. Without that assumption, its results become meaningless, or perhaps it is not even well defined.
The more general case can be described so: at a given moment I have a set of samples, each is a tuple (x,v) where x is some sample invariant (can be thought of as the sampling "location") and v is the sampling "value", and I would like to find out the (weighted) average at some "location" y, where the weights have negative correlation to the distances of y from x. This generalizes the previous problem by letting x be the pair (t,d) where t is the sampling time and d is the time-of-day (hour, in our case), and by defining some metric on the set of all such tuples which will describe well our needs. A reasonable demand would be to decide that if d is constant, the weight function on the distances will be similar to that of exponentially moving average (perhaps a continuous version of it).
The main problem is finding an efficient online algorithm that does the work in the general case, or define a specific metric which allows such an efficient online algorithm, or show that in almost any interesting case it is impossible.
EMA is essentially a weighted average. When you combine several weighted averages with some weights, you get a new weighted average with weights equal to the products. This is exactly what you get with your "time aware" EMA.
Of course, you can generalize it widely by assigning an (almost arbitrary) weight as a function of "t".
As for the online algorithm, you apparently want to add new points with very little effort. EMA works nicely in this respect because EMA(x_1, ..., x_{n+1}) = a*EMA(x_1, ..., x_n) + (1-a)*x_{n+1}. You can find a lot of similar formulas for cases where the weights have some symmetries or recursions (aka a "group property"). Most likely, your recursive formula will have more summands in this case.
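For irregularly spaced samples, one natural choice (my assumption, not something fixed by the problem) is to make the weight a a function of the elapsed time, e.g. a = exp(-dt/tau), which reduces to the usual EMA when the samples are equally spaced:

import math

class TimeAwareEMA:
    def __init__(self, tau):
        self.tau = tau            # time scale of the decay
        self.t = None
        self.value = None

    def update(self, t, x):
        if self.value is None:
            self.value = x        # the first sample initializes the average
        else:
            a = math.exp(-(t - self.t) / self.tau)   # weight of the old average
            self.value = a * self.value + (1 - a) * x
        self.t = t
        return self.value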

Simple algorithm to estimate probability based on past occurrences?

Suppose that, out of N occasions, an event happens P times. The "naive" approach to estimating the probability that the event happens again the next time is P/N, but obviously the higher N is, the better our estimate.
What is a practical approach to model that "sureness" in the real world? I don't need something mathematically perfect, just something to make it a little bit more realistic. For example:
if a footballer scores 9 goals in 40 matches then I want the algorithm to rate him higher than a footballer who scores 1 goal in 4 matches
a movie with a rating of 8.0 with 100k votes should be placed higher than an 8.2 movie with 2k votes
etc...
This looks like the Wilson score interval: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval. The Wilson score solves the problem of how to rank such (successes, trials) pairs.
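A sketch of the lower bound of that interval, which is the quantity usually used for ranking (z = 1.96 for 95% confidence):

import math

def wilson_lower_bound(p_times, n, z=1.96):
    # p_times successes out of n trials; returns the lower end of the Wilson score interval
    if n == 0:
        return 0.0
    phat = p_times / n
    centre = phat + z * z / (2 * n)
    spread = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - spread) / (1 + z * z / n)

With the examples above, wilson_lower_bound(9, 40) is about 0.12 while wilson_lower_bound(1, 4) is about 0.05, so the footballer with 9 goals in 40 matches is ranked higher.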

How can I measure trends in certain words, like Twitter?

I have a corpus of newspaper articles by day. Each word in the corpus has a frequency count for each day it is present. I have been toying with finding an algorithm that captures the break-away words, similar to the way Twitter measures Trends in people's tweets.
For Instance, say the word 'recession' appears with the following frequency in the same group of newspapers:
Day 1 | recession | 456
Day 2 | recession | 2134
Day 3 | recession | 3678
While 'europe'
Day 1 | europe | 67895
Day 2 | europe | 71999
Day 3 | europe | 73321
I was thinking of taking the % growth per day and multiplying it by the log of the sum of frequencies. Then I would take the average to score and compare various words.
In this case:
recession = (3.68*8.74+0.72*8.74)/2 = 19.23
europe = (0.06*12.27+0.02*12.27)/2 = 0.49
Is there a better way to capture the explosive growth? I'm trying to mine the daily corpus to find terms that are mentioned more and more in a specific time period across time. PLEASE let me know if there is a better algorithm. I want to be able to find words with high non-constant acceleration. Maybe taking the second derivative would be more effective. Or maybe I'm making this way too complex and have watched too much physics programming on the Discovery Channel. Let me know with a math example if possible. Thanks!
First thing to notice is that this can be approximated by a local problem. That is to say, a "trending" word really depends only upon recent data. So immediately we can truncate our data to the most recent N days where N is some experimentally determined optimal value. This significantly cuts down on the amount of data we have to look at.
In fact, the NPR article suggests this.
Then you need to somehow look at growth. And this is precisely what the derivative captures. First thing to do is normalize the data. Divide all your data points by the value of the first data point. This makes it so that the large growth of an infrequent word isn't drowned out by the relatively small growth of a popular word.
For the first derivative, do something like this:
d[i] = (data[i] - data[i+k])/k
for some experimentally determined value of k (which, in this case, is a number of days). Similarly, the second derivative can be expressed as:
d2[i] = (data[i] - 2*data[i+k] + data[i+2k])/(k*k)
Higher derivatives can also be expressed like this. Then you need to assign some kind of weighting system for these derivatives. This is a purely experimental procedure which really depends on what you want to consider "trending." For example, you might want to give acceleration of growth half as much weight as the velocity. Another thing to note is that you should try your best to remove noise from your data because derivatives are very sensitive to noise. You do this by carefully choosing your value for k as well as discarding words with very low frequencies altogether.
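Putting those pieces together, a sketch of a per-word score (the half weight for acceleration is just the example weighting mentioned above; tune both k and the weight experimentally):

def trend_score(data, k=1, accel_weight=0.5):
    # data: daily counts for one word, ordered so that data[i + k] is k days older than data[i]
    base = data[0] or 1
    data = [x / base for x in data]                            # normalize by the first data point
    d1 = (data[0] - data[k]) / k                               # first derivative (velocity)
    d2 = (data[0] - 2 * data[k] + data[2 * k]) / (k * k)       # second derivative (acceleration)
    return d1 + accel_weight * d2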
I also notice that you multiply by the log sum of the frequencies. I presume this is to give the growth of popular words more weight (because more popular words are less likely to trend in the first place). The standard way of measuring how popular a word is, is to look at its inverse document frequency (IDF).
I would divide by the IDF of a word to give the growth of more popular words more weight.
IDF[word] = log(D / df[word])
where D is the total number of documents (e.g. for Twitter it would be the total number of tweets) and df[word] is the number of documents containing word (e.g. the number of tweets containing a word).
A high IDF corresponds to an unpopular word whereas a low IDF corresponds to a popular word.
The problem with your approach (measuring daily growth in percentage) is that it disregards the usual "background level" of the word, as your example shows; 'europe' grows more quickly than 'recession', yet it has a much lower score.
If the background level of words has a well-behaved distribution (Gaussian, or something else that doesn't wander too far from the mean) then I think a modification of CanSpice's suggestion would be a good idea. Work out the mean and standard deviation for each word, using days C-N+1-T to C-T, where C is the current date, N is the number of days to take into account, and T is the number of days that define a trend.
Say for instance N=90 and T=3, so we use about three months for the background, and say a trend is defined by three peaks in a row. In that case, for example, you can rank the words according to their chi-squared p-value, calculated like so:
(mu, sigma) = fitGaussian(word='europe', startday=C-N+1-3, endday=C-3)
X1 = count(word='europe', day=C-2)
X2 = count(word='europe', day=C-1)
X3 = count(word='europe', day=C)
S = ((X1-mu)/sigma)^2 + ((X2-mu)/sigma)^2 + ((X3-mu)/sigma)^2
p = pval.chisq(S, df=3)
Essentially then, you can get the words which over the last three days are the most extreme compared to their background level.
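In Python that ranking score could look like this (a sketch; scipy's chi2.sf gives the p-value, the Gaussian fit is just the sample mean and standard deviation, and counts is assumed to be one word's daily counts with the oldest day first):

import statistics
from scipy.stats import chi2

def trend_pvalue(counts, T=3):
    # counts: daily counts for one word, oldest first; the last T days are the candidate trend
    background = counts[:-T]
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    S = sum(((x - mu) / sigma) ** 2 for x in counts[-T:])
    return chi2.sf(S, df=T)          # smaller p-value = more extreme vs. the background level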
I would first try a simple solution. A simple weighted difference between adjacent days should probably work, maybe taking the log first. You might have to experiment with the weights. For example, (-2,-1,1,2) would give you points where the data is exploding.
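For instance (a sketch; the +1 inside the log just guards against zero counts):

import math

def explosion_score(counts, weights=(-2, -1, 1, 2), use_log=True):
    # counts: daily frequencies, oldest first; the filter looks at the last len(weights) days
    window = counts[-len(weights):]
    if use_log:
        window = [math.log(x + 1) for x in window]
    return sum(w * x for w, x in zip(weights, window))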
If this is not enough, you can try slope filtering ( http://www.claysturner.com/dsp/fir_regression.pdf ). Since the algorithm is based on linear regression, it should be possible to modify it for other types of regression (for example quadratic).
All attempts using filtering techniques such as these also have the advantage that they can be made to run very fast, and you should be able to find libraries that provide fast filtering.
