I've been self-studying the Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins, 0, 1 and 2, with probabilities P0, P1 and P2 of landing on heads when tossed. Toss coin 0; if the result is heads, toss coin 1 three times, otherwise toss coin 2 three times. The observed data, produced by coins 1 and 2, looks like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's result. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins, A and B, with PA and PB being their respective probabilities of landing on heads when tossed. Each round, select one coin at random, toss it 10 times and record the results. The observed data is the toss results from these two coins; however, we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can follow the calculations, I can't relate the way they are solved to the original EM theory. Specifically, during the M-step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow the new parameters are better than the old ones. Moreover, the two E-steps don't even look similar to each other, nor to the E-step of the original theory.
So how exactly do these examples work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three coins example this is the result of tossing coin 0. If you knew the outcome of that you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not - the results from coin 0. For a coin that gets A heads and B tails you find that the maximum likelihood for the probability of A heads is A/(A+B) - it might be worth you working this out in detail, because it is the building block for the M step.
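To spell that building block out (a standard derivation, in the same notation): for a coin with unknown heads probability p that produced A heads and B tails,

log L(p) = A*log(p) + B*log(1-p)
d/dp log L(p) = A/p - B/(1-p) = 0  =>  A*(1-p) = B*p  =>  p = A/(A+B)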
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails produced by each coin. In the PDF they work out P(Y=H|X=HHH) = 0.6967. This means that you use weight 0.6967 for the case Y=H, so you increment the counts for Y=H by 0.6967 and increment the counts for X=H in coin 1 by 3*0.6967, and you increment the counts for Y=T by 0.3033 and increment the counts for X=H in coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is the maximum likelihood estimate of a coin probability in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
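If it helps, here is a minimal sketch of one such EM step for the three-coins example, with made-up starting guesses (the 0.6967 in the PDF comes from its particular starting values, so the numbers here will differ):

observations = [3, 0, 3, 0, 3]   # number of heads in HHH, TTT, HHH, TTT, HHH
p0, p1, p2 = 0.3, 0.3, 0.6       # arbitrary initial guesses

# E step: posterior probability that coin 0 came up heads, for each triple
def weight(h):
    like1 = p0 * p1**h * (1 - p1)**(3 - h)        # coin 0 = H, coin 1 tossed 3 times
    like2 = (1 - p0) * p2**h * (1 - p2)**(3 - h)  # coin 0 = T, coin 2 tossed 3 times
    return like1 / (like1 + like2)

w = [weight(h) for h in observations]

# M step: weighted head/tail counting, i.e. the A/(A+B) rule with soft counts
p0 = sum(w) / len(w)
p1 = sum(wi * h for wi, h in zip(w, observations)) / sum(3 * wi for wi in w)
p2 = sum((1 - wi) * h for wi, h in zip(w, observations)) / sum(3 * (1 - wi) for wi in w)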
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct, algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian) but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihoods of the processes. (Typically we do at least know the number of processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data point is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
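To make the contrast concrete, here is a small illustration I put together (not from the question: it assumes a two-component 1-D Gaussian mixture with equal weights and known unit variances, and that NumPy/SciPy are available):

import numpy as np
from scipy.stats import norm

data = np.concatenate([np.random.normal(0, 1, 500), np.random.normal(5, 1, 500)])
mu = np.array([-1.0, 6.0])   # initial guesses for the two means

for _ in range(20):
    # per-point likelihood under each component (equal mixing weights, unit variance)
    lik = np.vstack([norm.pdf(data, m, 1.0) for m in mu])   # shape (2, n)

    # classify-maximize: hard-assign each point to its most likely component
    hard = lik.argmax(axis=0)
    mu_hard = np.array([data[hard == k].mean() for k in range(2)])  # shown only for contrast

    # EM: soft responsibilities, then a weighted mean
    resp = lik / lik.sum(axis=0)   # each column sums to one
    mu = (resp * data).sum(axis=1) / resp.sum(axis=1)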
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the code below in Python; it works through the example given in the second paper you cite, by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer: the code works, but I am certain that it is not coded optimally. I am not normally a Python coder and only started using it two weeks ago.
import numpy as np
import math

#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou ####

def get_mn_log_likelihood(obs, probs):
    """Return the log likelihood of obs, given the probs."""
    # Multinomial distribution log PMF:
    # ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
    multinomial_coeff_denom = 0
    prod_probs = 0
    for x in range(0, len(obs)):  # loop through state counts in each observation
        multinomial_coeff_denom = multinomial_coeff_denom + math.log(math.factorial(obs[x]))
        prod_probs = prod_probs + obs[x] * math.log(probs[x])

    multinomial_coeff = math.log(math.factorial(sum(obs))) - multinomial_coeff_denom
    likelihood = multinomial_coeff + prod_probs
    return likelihood

# 1st: Coin B, {HTTTHHTHTH}, 5H, 5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H, 1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H, 2T
# 4th: Coin B, {HTHTTTHHTT}, 4H, 6T
# 5th: Coin A, {THHHTHHHTH}, 7H, 3T
# so, from MLE: pA(heads) = 0.80 and pB(heads) = 0.45

# represent the experiments
head_counts = np.array([5, 9, 8, 4, 7])
tail_counts = 10 - head_counts
experiments = list(zip(head_counts, tail_counts))  # list() so it can be indexed in Python 3

# initialise pA(heads) and pB(heads)
pA_heads = np.zeros(100)
pA_heads[0] = 0.60
pB_heads = np.zeros(100)
pB_heads[0] = 0.50

# E-M begins!
delta = 0.001
j = 0  # iteration counter
improvement = float('inf')
while improvement > delta:
    expectation_A = np.zeros((5, 2), dtype=float)
    expectation_B = np.zeros((5, 2), dtype=float)
    for i in range(0, len(experiments)):
        e = experiments[i]  # i'th experiment
        ll_A = get_mn_log_likelihood(e, np.array([pA_heads[j], 1 - pA_heads[j]]))  # loglikelihood of e given coin A
        ll_B = get_mn_log_likelihood(e, np.array([pB_heads[j], 1 - pB_heads[j]]))  # loglikelihood of e given coin B

        weightA = math.exp(ll_A) / (math.exp(ll_A) + math.exp(ll_B))  # weight of A proportional to likelihood of A
        weightB = math.exp(ll_B) / (math.exp(ll_A) + math.exp(ll_B))  # weight of B proportional to likelihood of B

        expectation_A[i] = np.dot(weightA, e)
        expectation_B[i] = np.dot(weightB, e)

    pA_heads[j + 1] = sum(expectation_A)[0] / sum(sum(expectation_A))
    pB_heads[j + 1] = sum(expectation_B)[0] / sum(sum(expectation_B))

    improvement = max(abs(np.array([pA_heads[j + 1], pB_heads[j + 1]]) -
                          np.array([pA_heads[j], pB_heads[j]])))
    j = j + 1
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly, the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables, which indicate whether coin 1 was used or coin 2. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of the parameters would be trivial: p1 would be the proportion of heads in samples where c_i1=1, likewise p2 for samples where c_i2=1, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.
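For the record, the updates for the first example look roughly like this (my transcription, with h_i the number of heads in the i-th group of three tosses, so do check it against the PDF):

E step:  E[c_i1] = \lambda * p1^h_i * (1-p1)^(3-h_i)
                   / ( \lambda * p1^h_i * (1-p1)^(3-h_i) + (1-\lambda) * p2^h_i * (1-p2)^(3-h_i) )
         E[c_i2] = 1 - E[c_i1]

M step:  \lambda = (1/n) * sum_i E[c_i1]
         p_1     = sum_i E[c_i1] * h_i / ( 3 * sum_i E[c_i1] )
         p_2     = sum_i E[c_i2] * h_i / ( 3 * sum_i E[c_i2] )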
In a log-linear model, we can find the maximum entropy solution using IIS (Improved Iterative Scaling).
We update the parameters by finding an update that makes the model expectation of each feature match its empirical expectation.
However, there is an exp(sum of all feature values) term in the equations.
My question is: when the number of features is large (say 10,000), won't the exponential of the sum over all the features overflow easily? How can we solve this problem numerically? To me it seems impossible, since even computing exp(50) already gives a huge number.
Do computations in log-space, and use a logsumexp operation (borrowed from scikit-learn):
# 1-d version of logsumexp (the idea used in scikit-learn):
# computes log(sum(exp(x) for x in a)) in a numerically stable way.
from math import exp, log

def logsumexp(a):
    amax = max(a)
    total = 0.0
    for x in a:
        total += exp(x - amax)
    return log(total) + amax
This summation can be done once, before the start of the main loop, because the feature values don't change while you're optimizing.
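For instance (a hypothetical usage with made-up scores), the log of a log-linear model's partition function can then be computed entirely in log space:

scores = [800.0, 798.5, 650.0]   # raw feature sums w . f(x, y); exp(800) overflows a double
log_Z = logsumexp(scores)        # stable log(sum(exp(s) for s in scores))
log_p0 = scores[0] - log_Z       # log P(y = 0 | x) without ever forming exp(score)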
Side remark: IIS is quite old-fashioned. For the last ten years or so, almost everyone has been using L-BFGS-B, OWL-QN or (A)SGD to fit log-linear models.
Sorry, this post is not related to coding but more to data structures and algorithms.
I have a large amount of data, each value occurring with a different frequency. The approximate plot of the frequencies looks like a bell curve. I now want to display the data in ranges (buckets) that most precisely describe the frequency of those ranges.
For example, the entire range of data has some total number of frequencies, but a single bucket of that size is not precise; if some data is concentrated in a particular frequency zone, we may build a bucket with a smaller data range but more closely related frequencies.
Any help regarding a suitable algorithm would be appreciated.
I thought of an algorithm related to binary search.
Any ideas, folks?
Not sure I am following, but it seems you are looking for k bins, where for any two bins, the probability of a data point falling into one is the same as falling into the other.
From your description, your data seems to be normally distributed, or T-distributed.
One can evaluate the mean and standard deviation of the data, let the extracted S.D. be s and the mean be u.
The standard formulas for evaluating the mean and S.D. from the sample are (1):
u = (x1 + x2 + ... + xn) / n (simple average)
s^2 = Sigma((xi - u)^2)/(n-1)
Given this information, you can evaluate the distribution of your data, which is N(u, s^2), and create a random variable X ~ N(u, s^2) (2).
Now all is left is finding the a,b,... as follows (assuming 10 buckets, this can obviously be modified as you wish):
P(X<a) = 0.1
P(X<b) = 0.2
P(X<c) = 0.3
...
After finding a, b, c, ... you have your bins: (-infinity, a], (a, b], (b, c], ...
(1) evaluating variance: http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance
(2) The real distribution of this variable is actually a t-distribution, since the variance is unknown and extracted from the data. However, for large enough n, the t-distribution converges to the normal distribution.
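A rough sketch of this recipe (the variable names are mine; it assumes NumPy/SciPy are available and the data really is approximately normal):

import numpy as np
from scipy.stats import norm

data = np.random.normal(50, 10, size=100000)   # placeholder data
u = data.mean()
s = data.std(ddof=1)                            # sample S.D., i.e. the (n-1) formula

k = 10                                          # number of buckets
# cut points a, b, c, ... chosen so that P(X < a) = 0.1, P(X < b) = 0.2, ...
cuts = norm.ppf(np.linspace(0, 1, k + 1)[1:-1], loc=u, scale=s)

bucket_of = np.searchsorted(cuts, data)         # bucket index 0..k-1 for every point
counts = np.bincount(bucket_of, minlength=k)    # roughly equal counts per bucket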
First count all the indexes, then subtract the repeating values; this will give you the optimal number of buckets, but only at a small scale.
I have a cost optimization problem that I don't know whether there is literature on. It is a bit hard to explain, so I apologize in advance for the length of the question.
There is a server I am accessing that works this way:
a request is made on records (r1, ..., rn) and fields (f1, ..., fp)
you can only request the Cartesian product (r1, ..., rn) x (f1, ..., fp)
The cost (time and money) associated with such a request is affine in the size of the request:
T((r1, ..., rn) x (f1, ..., fp)) = a + b * n * p
Without loss of generality (just by normalizing), we can assume that b=1 so the cost is:
T((r1, ...,rn)x(f1,...fp)) = a + n * p
I need only to request a subset of pairs (r1, f(r1)), ... (rk, f(rk)), a request which comes from the users. My program acts as a middleman between the user and the server (which is external). I have a lot of requests like this that come in (tens of thousands a day).
Graphically, we can think of it as an n x p sparse matrix, whose nonzero values I want to cover with rectangular submatrices:

    r1 r2 r3 ... rn
    ------       ---
f1 |x  x |      |x|
f2 |x    |       ---
    ------
f3
..          ------
fp         |x   x|
            ------
The constraints are:
the number of submatrices must be kept reasonable, because of the constant cost,
all the 'x' must lie within a submatrix,
the total area covered must not be too large, because of the linear cost.
I will call g the sparseness coefficient of my problem (number of needed pairs over total possible pairs, g = k / (n * p)). I know the coefficient a.
There are some obvious observations:
if a is small, the best solution is to request each (record, field) pair independently, and the total cost is: k * (a + 1) = g * n * p * (a + 1)
if a is large, the best solution is to request the whole Cartesian product, and the total cost is : a + n * p
the second solution is better as soon as g > g_min = (1 + a / (n * p)) / (a + 1)
of course the orders in the Cartesian products are unimportant, so I can transpose the rows and the columns of my matrix to make it more easily coverable, for example:
   f1 f2 f3
r1  x     x
r2     x
r3  x     x
can be reordered as
   f1 f3 f2
r1  x  x
r3  x  x
r2        x
And there is an optimal solution which is to request (f1,f3) x (r1,r3) + (f2) x (r2)
Trying all the solutions and looking for the lower cost is not an option, because the combinatorics explode:
for each permutation on rows: (n!)
for each permutation on columns: (p!)
for each possible covering of the n x p matrix: (time unknown, but large...)
compute cost of the covering
so I am looking for an approximate solution. I already have some kind of greedy algorithm that finds a covering given a matrix (it begins with unitary cells, then merges them if the proportion of empty cell in the merge is below some threshold).
To put some numbers in minds, my n is somewhere between 1 and 1000, and my p somewhere between 1 and 200. The coverage pattern is really 'blocky', because the records come in classes for which the fields asked are similar. Unfortunately I can't access the class of a record...
Question 1: Has someone an idea, a clever simplification, or a reference for a paper that could be useful? As I have a lot of requests, an algorithm that works well on average is what I am looking for (but I can't afford it to work very poorly on some extreme case, for example requesting the whole matrix when n and p are large, and the request is indeed quite sparse).
Question 2: In fact, the problem is even more complicated: the cost actually has the form a + n * (p^b) + c * n' * p', where b is a constant < 1 (once a record is asked for one field, it is not too costly to ask for other fields) and n' * p' = n * p * (1 - g) is the number of cells I don't want to request (because they are invalid, and there is an additional cost in requesting invalid things). I can't even dream of a rapid solution to this problem, but still... any ideas?
Selecting the submatrices to cover the requested values is a form of the set covering problem and hence NP-complete. On top of this already hard problem, your version adds the twist that the costs of the sets differ.
That you allow permuting the rows and columns is not such a big problem, because you can just consider disconnected submatrices. Row one, columns four to seven, together with row five, columns four to seven, form a valid set, because you can just swap row two and row five and obtain the connected submatrix spanning row one, column four to row two, column seven. Of course this adds some constraints - not all sets are valid under all permutations - but I don't think this is the biggest problem.
The Wikipedia article gives the inapproximability result that the problem cannot be solved in polynomial time to better than a factor of 0.5 * log2(n), where n is the number of sets. In your case 2^(n * p) is a (quite pessimistic) upper bound for the number of sets, which yields that you can only find a solution to within a factor of 0.5 * n * p in polynomial time (unless P = NP, and ignoring the varying costs).
An optimistic lower bound for the number of sets, ignoring permutations of rows and columns, is 0.5 * n^2 * p^2, yielding a much better factor of log2(n) + log2(p) - 0.5. In consequence, in your worst case of n = 1000 and p = 200 you can only expect to find a solution to within a factor of about 17 in the optimistic case, and up to a factor of about 100,000 in the pessimistic case (still ignoring the varying costs).
So the best you can do is to use a heuristic algorithm (the Wikipedia article mentions an almost optimal greedy algorithm) and accept that there will be cases where the algorithm performs (very) badly. Or you go the other way and use an optimization algorithm and try to find a good solution by using more time. In this case I would suggest trying A* search.
I'm sure there's a really good algorithm for this out there somewhere, but here are my own intuitive ideas:
Toss-some-rectangles approach:
Determine a "roughly optimal" rectangle size based on a.
Place these rectangles (perhaps randomly) over your required points, until all points are covered.
Now take each rectangle and shrink it as much as possible without "losing" any data points.
Find rectangles close to each other and decide whether combining them would be cheaper than keeping them separate.
Grow
Start off with each point in its own 1x1 rectangle.
Locate all rectangles within n rows/columns (where n may be based on a); see if you can combine them into one rectangle for no cost (or negative cost :D).
Repeat.
Shrink
Start off with one big rectangle, that covers ALL points.
Look for a sub-rectangle which shares a pair of sides with the big one, but contains very few points.
Cut it out of the big one, producing two smaller rectangles.
Repeat.
Quad
Divide the plane into 4 rectangles. For each of these, see if you get a better cost by recursing further, or by just including the whole rectangle.
Now take your rectangles and see if you can merge any of them with little/no cost.
Also: keep in mind that sometimes it will be better to have two overlapping rectangles than one large rectangle which is a superset of them. E.g. the case when two rectangles just overlap in one corner.
Ok, my understanding of the question has changed. New ideas:
Store each row as a long bit-string. AND pairs of bit-strings together, trying to find pairs that maximise the number of 1 bits. Grow these pairs into larger groups (sort and try to match the really big ones with each other). Then construct a request that will hit the largest group and then forget about all those bits. Repeat until everything done. Maybe switch from rows to columns sometimes.
Look for all rows/columns with zero, or few, points in them. "Delete" them temporarily. Now you are looking at what would be covered by a request that leaves them out. Now perhaps apply one of the other techniques, and deal with the ignored rows/cols afterwards. Another way of thinking about this is: deal with denser points first, and then move on to sparser ones.
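A tiny sketch of the bit-string idea (Python integers as per-row bit masks over the columns; the example masks are made up):

rows_needed = {0: 0b10100, 1: 0b10110, 2: 0b00100}   # record index -> mask of requested fields

def common_fields(i, j):
    # AND two row masks and count the shared 1 bits
    return bin(rows_needed[i] & rows_needed[j]).count("1")

# greedily pick the pair of rows sharing the most fields, request that block,
# clear those bits, and repeat (switching to column masks from time to time)
best_pair = max(
    ((i, j) for i in rows_needed for j in rows_needed if i < j),
    key=lambda p: common_fields(*p),
)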
Since your values are sparse, could it be that many users are asking for similar values? Is caching within your application an option? The requests could be indexed by a hash that is a function of (x,y) position, so that you can easily identify cached sets that fall within the correct area of the grid. Storing the cached sets in a tree, for example, would allow you to find minimum cached subsets that cover the request range very quickly. You can then do a linear lookup on the subset, which is small.
I'd consider the n records (rows) and p fields (cols) mentioned in the user request set as n points in p-dimensional space ({0,1}^p), with the ith coordinate being 1 iff it has an X, and identify a hierarchy of clusters, with the coarsest cluster at the root including all the X. For each node in the clustering hierarchy, consider the product that covers all the columns needed (this is rows(any subnode) x cols(any subnode)). Then decide, from the bottom up, whether to merge the child coverings (paying for the whole covering) or keep them as separate requests. (The coverings are not of contiguous columns, but exactly the ones needed; i.e., think of a bit vector.)
I agree with Artelius that overlapping product-requests could be cheaper; my hierarchical approach would need improvement to incorporate that.
I've worked on it a bit, and here is an obvious, O(n^3), greedy, symmetry-breaking algorithm (records and fields are treated separately) in Python-like pseudo-code.
The idea is trivial: we start with one request per record, and we perform the most worthwhile merge until there is nothing left worth merging. This algorithm has the obvious disadvantage that it does not allow overlapping requests, but I expect it to work quite well on real-life cases (with the a + n * (p^b) + c * n * p * (1 - g) cost function):
# given are
# a function cost request -> positive real
# a merge function that takes two pairs of sets (f1, r1) and (f2, r2)
# and returns ((f1 U f2), (r1 U r2))
# initialize with a request per record
requests = [({record},{field if (record, field) is needed}) for all needed records]
costs = [cost(request) for request in requests]
finished = False
while not finished: # there might be something to gain
maximum_gain = 0
finished = True
this_step_merge = empty
# loop onto all pairs of request
for all (request1, request2) in (requests x requests) such that request1 != request2:
merged_request = merge(request1, request2)
gain = cost(request1) + cost(request2) - cost(merged_request)
if gain > maximum_gain:
maximum_gain = gain
this_step_merge = (request1, request2, merged_request)
# if we found at least something to merge, we should continue
if maximum_gain > 0:
# so update the list of requests...
request1, request2, merged_request = this_step_merge
delete request1 from requests
delete request2 from requests
# ... and we are not done yet
insert merged_request into requests
finished = False
output requests
This is O(n^3 * p) because:
after initialization we start with n requests
the while loop removes exactly one request from the pool at each iteration.
the inner for loop iterates on the (ni^2 - ni) / 2 distinct pairs of requests, with ni going from n to one in the worst case (when we merge everything into one big request).
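For concreteness, here is a runnable sketch of the same greedy merge, assuming the simple cost model cost = a + |records| * |fields| (the helper names and the value of a are my own):

a = 2.0  # the constant per-request cost from the question (value assumed)

def cost(request):
    records, fields = request
    return a + len(records) * len(fields)

def merge(req1, req2):
    return (req1[0] | req2[0], req1[1] | req2[1])

def greedy_cover(needed_pairs):
    # needed_pairs: iterable of (record, field) pairs that must be covered
    # start with one request per record
    by_record = {}
    for r, f in needed_pairs:
        rec, fld = by_record.setdefault(r, (frozenset([r]), frozenset()))
        by_record[r] = (rec, fld | {f})
    requests = list(by_record.values())

    while True:
        best_gain, best = 0, None
        for i in range(len(requests)):
            for j in range(i + 1, len(requests)):
                merged = merge(requests[i], requests[j])
                gain = cost(requests[i]) + cost(requests[j]) - cost(merged)
                if gain > best_gain:
                    best_gain, best = gain, (i, j, merged)
        if best is None:
            return requests
        i, j, merged = best
        del requests[j]   # delete j first so index i stays valid
        del requests[i]
        requests.append(merged)

pairs = [("r1", "f1"), ("r1", "f3"), ("r2", "f2"), ("r3", "f1"), ("r3", "f3")]
for records, fields in greedy_cover(pairs):
    print(sorted(records), "x", sorted(fields))

With a = 2 this reproduces the (f1,f3) x (r1,r3) + (f2) x (r2) covering from the example above.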
Can someone help me point out the very bad cases of this algorithm? Does it sound reasonable to use it?
It is O(n^3), which is way too costly for large inputs. Any ideas on how to optimize it?
Thanks in advance!
I'm trying to calculate the median of a set of values, but I don't want to store all the values as that could blow memory requirements. Is there a way of calculating or approximating the median without storing and sorting all the individual values?
Ideally I would like to write my code a bit like the following
var medianCalculator = new MedianCalculator();
foreach (var value in SourceData)
{
medianCalculator.Add(value);
}
Console.WriteLine("The median is: {0}", medianCalculator.Median);
All I need is the actual MedianCalculator code!
Update: Some people have asked if the values I'm trying to calculate the median for have known properties. The answer is yes. One value is in 0.5 increments from about -25 to -0.5. The other is also in 0.5 increments from -120 to -60. I guess this means I can use some form of histogram for each value.
Thanks
Nick
If the values are discrete and the number of distinct values isn't too high, you could just accumulate the number of times each value occurs in a histogram, then find the median from the histogram counts (just add up counts from the top and bottom of the histogram until you reach the middle). Or if they're continuous values, you could distribute them into bins - that wouldn't tell you the exact median but it would give you a range, and if you need to know more precisely you could iterate over the list again, examining only the elements in the central bin.
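Given the 0.5-increment values in the update, such a histogram stays tiny. A minimal sketch (the class and property names just mirror the pseudocode in the question):

from collections import Counter

class MedianCalculator:
    def __init__(self):
        self.counts = Counter()
        self.n = 0

    def add(self, value):
        self.counts[value] += 1
        self.n += 1

    @property
    def median(self):
        # walk the sorted histogram until half the mass has been passed
        # (for an even count this returns the lower of the two middle values)
        middle = (self.n - 1) / 2.0
        seen = 0
        for value in sorted(self.counts):
            seen += self.counts[value]
            if seen > middle:
                return value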
There is the 'remedian' statistic. It works by first setting up k arrays, each of length b. Data values are fed into the first array and, when it is full, its median is calculated and stored in the first position of the next array, after which the first array is re-used. When the second array is full, the median of its values is stored in the first position of the third array, and so on. You get the idea :)
It's simple and pretty robust. The reference is here...
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Remedian.pdf
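For concreteness, a rough sketch of the scheme (the choices of b and k, and the fallback when the top array fills, are mine; the paper assumes the stream fits in about b^k values):

import statistics

class Remedian:
    # k buffers of length b => O(k*b) memory, independent of the stream length
    def __init__(self, b=11, k=4):
        self.b = b
        self.buffers = [[] for _ in range(k)]

    def add(self, value, level=0):
        buf = self.buffers[level]
        buf.append(value)
        if len(buf) == self.b:
            med = statistics.median(buf)
            del buf[:]
            if level + 1 < len(self.buffers):
                self.add(med, level + 1)   # push this buffer's median one level up
            else:
                buf.append(med)            # top buffer full: crudely collapse it in place

    def estimate(self):
        # the paper takes a weighted median of everything left in the buffers;
        # the median of the highest non-empty buffer is a simpler approximation
        for buf in reversed(self.buffers):
            if buf:
                return statistics.median(buf)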
Hope this helps
Michael
I use these incremental/recursive mean and median estimators, which both use constant storage:
mean += eta * (sample - mean)
median += eta * sgn(sample - median)
where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.
Also, I modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
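A small sketch of the quantile estimator above (eta and the initial estimate are values you have to choose):

def sgn(x):
    return (x > 0) - (x < 0)

def running_quantile(samples, p, eta=0.001, initial=0.0):
    q = initial
    for sample in samples:
        q += eta * (sgn(sample - q) + 2.0 * p - 1.0)
    return q

# p = 0.5 gives the running median estimate:
# median_estimate = running_quantile(stream, 0.5)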
I would love to see an incremental mode estimator of a similar form...
(Note: I also posted this to a similar topic here: "On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?)
Here is a crazy approach that you might try. This is a classical problem in streaming algorithms. The rules are
You have limited memory, say O(log n), where n is the number of items you want to process.
You can look at each item once and make a decision then and there what to do with it, if you store it, it costs memory, if you throw it away it is gone forever.
The idea for finding a median is simple. Sample O(1 / a^2 * log(1 / p)) * log(n) elements from the list at random; you can do this via reservoir sampling (see a previous question). Now simply return the median of your sampled elements, using a classical method.
The guarantee is that the rank of the item returned, as a fraction of the list, will be (1 +/- a) / 2 with probability at least 1 - p. So there is a probability p of failing; you can reduce it by sampling more elements. It won't return the median, or guarantee that the value of the item returned is anywhere close to the median - just that, when you sort the list, the item returned will be close to the half-way point.
This algorithm uses O(log n) additional space and runs in linear time.
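A compact sketch of that approach: reservoir-sample k elements in one pass, then take the sample's median (k trades memory for accuracy/confidence):

import random
import statistics

def approx_median(stream, k=1000):
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = random.randint(0, i)   # classic reservoir-sampling step
            if j < k:
                reservoir[j] = x
    return statistics.median(reservoir)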
This is tricky to get right in general, especially to handle degenerate series that are already sorted, or have a bunch of values at the "start" of the list but the end of the list has values in a different range.
The basic idea of making a histogram is most promising. This lets you accumulate distribution information and answer queries (like median) from it. The median will be approximate since you obviously don't store all values. The storage space is fixed so it will work with whatever length sequence you have.
But you can't just build a histogram from say the first 100 values and use that histogram continually.. the changing data may make that histogram invalid. So you need a dynamic histogram that can change its range and bins on the fly.
Make a structure which has N bins. You'll store the X value of each slot transition (N+1 values total) as well as the population of the bin.
Stream in your data. Record the first N+1 values. If the stream ends before this, great, you have all the values loaded and you can find the exact median and return it. Else use the values to define your first histogram. Just sort the values and use those as bin definitions, each bin having a population of 1. It's OK to have dupes (0 width bins).
Now stream in new values. For each one, binary search to find the bin it belongs to.
In the common case, you just increment the population of that bin and continue.
If your sample is beyond the histogram's edges (highest or lowest), just extend the end bin's range to include it.
When your stream is done, you find the median sample value by finding the bin which has equal population on both sides of it, and linearly interpolating the remaining bin-width.
But that's not enough.. you still need to ADAPT the histogram to the data as it's being streamed in. When a bin gets over-full, you're losing information about that bin's sub distribution.
You can fix this by adapting based on some heuristic... The easiest and most robust one is if a bin reaches some certain threshold population (something like 10*v/N where v=# of values seen so far in the stream, and N is the number of bins), you SPLIT that overfull bin. Add a new value at the midpoint of the bin, give each side half of the original bin's population. But now you have too many bins, so you need to DELETE a bin. A good heuristic for that is to find the bin with the smallest product of population and width. Delete it and merge it with its left or right neighbor (whichever one of the neighbors itself has the smallest product of width and population.). Done!
Note that merging or splitting bins loses information, but that's unavoidable.. you only have fixed storage.
This algorithm is nice in that it will deal with all types of input streams and give good results. If you have the luxury of choosing sample order, a random sample is best, since that minimizes splits and merges.
The algorithm also allows you to query any percentile, not just median, since you have a complete distribution estimate.
I use this method in my own code in many places, mostly for debugging logs.. where some stats that you're recording have unknown distribution. With this algorithm you don't need to guess ahead of time.
The downside is the unequal bin widths means you have to do a binary search for each sample, so your net algorithm is O(NlogN).
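Here is a condensed sketch of this scheme (with my own simplifications: the split threshold follows the 10*v/N heuristic above, the deleted bin is merged into an adjacent neighbour, and the median is linearly interpolated within its bin):

import bisect

class StreamingHistogram:
    def __init__(self, n_bins=64):
        self.n_bins = n_bins
        self.edges = []    # n_bins + 1 boundaries once primed
        self.counts = []   # population of each bin
        self.seen = 0

    def add(self, value):
        self.seen += 1
        if len(self.edges) < self.n_bins + 1:        # priming: collect the first N+1 values
            bisect.insort(self.edges, value)
            if len(self.edges) == self.n_bins + 1:
                self.counts = [1.0] * self.n_bins
            return
        if value < self.edges[0]:                    # extend the lowest bin
            self.edges[0] = value
        if value > self.edges[-1]:                   # extend the highest bin
            self.edges[-1] = value
        i = min(bisect.bisect_right(self.edges, value) - 1, self.n_bins - 1)
        self.counts[i] += 1
        if self.counts[i] > 10.0 * self.seen / self.n_bins:
            self._split(i)

    def _split(self, i):
        mid = (self.edges[i] + self.edges[i + 1]) / 2.0
        self.edges.insert(i + 1, mid)
        self.counts[i] /= 2.0
        self.counts.insert(i + 1, self.counts[i])
        # one bin too many now: merge the bin with the smallest population*width product
        j = min(range(len(self.counts)),
                key=lambda k: self.counts[k] * (self.edges[k + 1] - self.edges[k]))
        lo, hi = (j - 1, j) if j > 0 else (j, j + 1)
        self.counts[lo] += self.counts[hi]
        del self.counts[hi]
        del self.edges[hi]    # removing the shared boundary merges bins lo and hi

    def median(self):
        if not self.counts:                          # still priming: we have every value
            return self.edges[len(self.edges) // 2] if self.edges else None
        target = sum(self.counts) / 2.0
        acc = 0.0
        for i, c in enumerate(self.counts):
            if acc + c >= target:
                frac = (target - acc) / c if c else 0.5
                return self.edges[i] + frac * (self.edges[i + 1] - self.edges[i])
            acc += c
        return self.edges[-1]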
David's suggestion seems like the most sensible approach for approximating the median.
A running mean for the same problem is a much easier to calculate:
M_n = M_(n-1) + ((V_n - M_(n-1)) / n)
where M_n is the mean of the first n values, M_(n-1) is the previous mean, and V_n is the new value.
In other words, the new mean is the existing mean plus the difference between the new value and the mean, divided by the number of values.
In code this would look something like:
new_mean = prev_mean + ((value - prev_mean) / count)
though obviously you may want to consider language-specific stuff like floating-point rounding errors etc.
I don't think it is possible to do without having the list in memory. You can obviously approximate with
average if you know that the data is symmetrically distributed
or calculate a proper median of a small subset of data (that fits in memory) - if you know that your data has the same distribution across the sample (e.g. that the first item has the same distribution as the last one)
Find Min and Max of the list containing N items through linear search and name them as HighValue and LowValue
Let MedianIndex = (N+1)/2
1st Order Binary Search:
Repeat the following 4 steps while LowValue < HighValue:
Get MedianValue approximately = ( HighValue + LowValue ) / 2
Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
If K = MedianIndex, then return MedianValue
If K > MedianIndex then HighValue = MedianValue, else LowValue = MedianValue
It will be faster without consuming memory
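A sketch of the first-order search in Python, assuming the data can be re-scanned on each pass (e.g. it lives in a file); the tolerance is my own addition, since with continuous values the loop needs a stopping point:

def median_by_value_bisection(scan, n, tol=1e-9):
    # scan() must return a fresh iterable over the data each time it is called
    low = min(scan())
    high = max(scan())
    median_index = (n + 1) // 2
    while high - low > tol:
        mid = (low + high) / 2.0
        k = sum(1 for x in scan() if x <= mid)   # one extra pass to count items <= mid
        if k >= median_index:
            high = mid
        else:
            low = mid
    return (low + high) / 2.0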
2nd Order Binary Search:
LowIndex=1
HighIndex=N
Repeat the following 5 steps while (LowIndex < HighIndex):
Get approximate DistributionPerUnit = (HighValue - LowValue) / (HighIndex - LowIndex)
Get approximate MedianValue = LowValue + (MedianIndex - LowIndex) * DistributionPerUnit
Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
If (K = MedianIndex), return MedianValue
If (K > MedianIndex) then HighIndex = K and HighValue = MedianValue, else LowIndex = K and LowValue = MedianValue
It will be faster than 1st order without consuming memory
We can also think of fitting HighValue, LowValue and MedianValue with HighIndex, LowIndex and MedianIndex to a Parabola, and can get ThirdOrder Binary Search which will be faster than 2nd order without consuming memory and so on...
Usually if the input is within a certain range, say 1 to 1 million, it's easy to create an array of counts: read the code for "quantile" and "ibucket" here: http://code.google.com/p/ea-utils/source/browse/trunk/clipper/sam-stats.cpp
This solution can be generalized as an approximation by coercing the input into an integer within some range using a function that you then reverse on the way out: IE: foo.push((int) input/1000000) and quantile(foo)*1000000.
If your input is an arbitrary double precision number, then you've got to autoscale your histogram as values come in that are out of range (see above).
Or you can use the median-triplets method described in this paper: http://web.cs.wpi.edu/~hofri/medsel.pdf
I picked up the idea of iterative quantile calculation. It is important to have a good value for the starting point and for eta; these may come from the mean and sigma. So I programmed this:
Function QuantileIterative(Var x : Array of Double; n : Integer; p, mean, sigma : Double) : Double;
Var
  eta, quantile, q1, dq : Double;
  i : Integer;
Begin
  quantile := mean + 1.25*sigma*(p - 0.5);
  q1 := quantile;
  eta := 0.2*sigma/xy(1 + n, 0.75); // should not be too large! sets accuracy
  For i := 1 to n Do
    quantile := quantile + eta * (signum_smooth(x[i] - quantile, eta) + 2*p - 1);
  dq := abs(q1 - quantile);
  If dq > eta then
  Begin
    If dq < 3*eta then
      eta := eta/4;
    For i := 1 to n Do
      quantile := quantile + eta * (signum_smooth(x[i] - quantile, eta) + 2*p - 1);
  end;
  QuantileIterative := quantile
end;
As the median of two elements would be their mean, I used a smoothed signum function, and xy() is x^y. Are there ideas to make it better? Of course, if we have some more a-priori knowledge, we can add code using the min and max of the array, the skew, etc. For big data you would perhaps not use an array, but for testing it is easier.
For a homogeneous, randomly ordered and big enough list, this pseudo-code can work:
# find min on the fly
if minDataPoint > dataPoint:
    minDataPoint = dataPoint

# find max on the fly
if maxDataPoint < dataPoint:
    maxDataPoint = dataPoint

# estimate the median based on the current data
estimate_mid = (maxDataPoint + minDataPoint) / 2

# if the new dataPoint is closer to the estimated mid than the stored one, store it
if abs(midDataPoint - estimate_mid) > abs(dataPoint - estimate_mid):
    midDataPoint = dataPoint
Inspired by #lakshmanaraj