In the book Computer Systems: A Programmer's Perspective, Exercise 5.5 shows a piece of code to compute the value of a polynomial:
double poly(double a[], double x, int degree)
{
    long int i;
    double result = a[0];
    double xpwr = x;
    for (i = 1; i <= degree; i++) {
        result += a[i] * xpwr;
        xpwr = x * xpwr;
    }
    return result;
}
The exercise assumes that double-precision floating-point addition and multiplication take 3 and 5 clock cycles, respectively. The reader is asked to explain why the measured CPE (Cycles Per Element) is 5.
According to the exercise's answer, each iteration must update the variables xpwr and result, which requires a floating-point addition (for result) and a floating-point multiplication (for xpwr); the multiplication dominates the latency, so the resulting CPE is 5.
But I think the data flow should be something like this:
xpwr result
| |
+-----+ +--[load] |
| | | |
[mul] [mul] |
| | |
| +---+ +-----+
| | |
| [add]
| |
| +------+
| |
xpwr result
So the longest path is from the previous value of xpwr to the new value of result, going through the execution units [mul] and [add]. Therefore the longest time should be 8 cycles.
I want to ask:
What exactly is the meaning of a critical path, and how is it determined?
Which answer (mine or the book's) is more reasonable?
Any explanation in terms of the CPU, architecture, execution units, pipeline, or floating-point unit will be appreciated.
I know I'm a bit late to the party, but the book is absolutely right. As you can verify for yourself by timing the code, the CPE is indeed 5, so the second answer is wrong.
But the first one is also wrong. It says that the MULs must be performed at the same time, which is not possible on the Nehalem architecture (and, I suspect, on most modern processors). Remember that there is only one FP MUL unit and a separate FP ADD unit (as shown in the book, ed. 2011 and later).
What happens instead is this:
(LOADs are assumed always present, just 1 cycle if in cache)
First we feed xpwr *= x into the MUL unit. Immediately after that we feed xpwr * a[i] (remember the pipeline!).
... after 5 cycles we'll get the new value of xpwr, and after 6 cycles we'll have the result of xpwr * a[i]. At that point, a new computation of xpwr *= x will be at stage 1 of the MUL unit. So we have only 4 more cycles in which to do the rest of the ops if we don't want to be limited by them.
Of course, that is easy, as we only need 3 cycles for the FP ADD to get the new result.
So it becomes obvious that the limiting factor is the computation of xpwr, which means that when looking for the critical path (whatever that is) we have to look specifically at the paths from the old values to the new ones. In this case, the path for result consists of only one FP ADD! (That's what threw me off at first too.)
The critical path (in the sense of the longest path through one iteration) is indeed 8 cycles, but the question asks for the CPE, which is the average time per iteration once the loop reaches steady state.
Other than in the first and last iterations, the processor can perform the addition from the previous iteration of the loop and the current multiplications at the same time, because their operands do not depend on each other. The first iteration takes the full 8 cycles, but every iteration after that adds only 5 cycles, making the measured CPE 5.
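To make the overlap concrete, here is a small back-of-the-envelope model (my own sketch, assuming fully pipelined FP units with a 3-cycle add and a 5-cycle multiply, ignoring loads and issue-port contention). It only tracks when each value becomes available; it is not a real CPU simulation.

ADD, MUL = 3, 5

xpwr_ready = 0      # cycle at which the current xpwr value is available
result_ready = 0    # cycle at which the current result value is available
finish_times = []

for i in range(1, 11):                    # 10 iterations of the loop body
    prod_ready = xpwr_ready + MUL         # a[i] * xpwr (uses the old xpwr)
    xpwr_ready = xpwr_ready + MUL         # xpwr = x * xpwr (uses the old xpwr)
    result_ready = max(result_ready, prod_ready) + ADD   # result += product
    finish_times.append(result_ready)

# finish_times[0] == 8: the first iteration pays the full mul+add chain.
# After that the gap between iterations settles at 5 cycles, the mul latency:
print([b - a for a, b in zip(finish_times, finish_times[1:])])   # [5, 5, 5, ...]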
P.S. I do agree that the book's way of presenting the critical path is confusing. Its critical path is not simply the longest path through the graph; the path must also consist of operations whose operands depend on the results of previous operations, so they have to execute in order. This definition makes finding the critical path rather unintuitive.
A1: According to the book, the critical path is the longest chain in the data-flow graph that lies on a straight line and feeds back into a single register from one iteration to the next. You do not add up the 'mul' and the 'add' when the multiplication's result is only an intermediate operand for an operation on a different register.
This will become clear if you continue reading the rest of the chapter. In particular, comparing combine7's data-flow graph with combine5's is helpful.
A2: Once A1 is understood, question 2 is clear: the book's answer is the reasonable one.
In one iteration, there are three operations executed in parallel:
result += PREV, where PREV equals a[i] * xpwr as calculated in the previous iteration, requiring a floating-point addition (3 clock cycles).
a[i] * xpwr, whose value will be used in the next iteration, requiring a floating-point multiplication (5 clock cycles).
xpwr * x, whose value will be used in the next iteration, requiring a floating-point multiplication (5 clock cycles).
As you can see, there are no data dependencies among the three operations, so on a superscalar, out-of-order CPU they can execute in parallel, which results in a CPE of 5.
They don't necessarily start in the same clock cycle (and in fact they can't all because the CPU doesn't have 3 FP execution units), but one can start before another finishes, so their latencies can overlap. They're in flight at the same time, in the pipelined FP execution units.
The critical path is the longest path through the graph, in this case eight clocks. This is what the Dragon Book has to say about critical paths (10.3.3 Prioritized Topological Orders):
Without resource constraints, the shortest schedule is given by the
critical path, the longest path through the data-dependence graph. A
metric useful as a priority function is the height of the node, which
is the length of a longest path in the graph originating from the
node.
I think you found an error in the book. You should consider contacting the authors, so that they can correct it in future printings.
I have been given the following problem: there are n files with lengths z1, ..., zn and usages u1, ..., un, where the sum u1 + ... + un equals 1 and 0 < u_i < 1.
The task is to find an ordering of the files in storage such that the expected time to fetch a file is minimal. For example, if z1 = 12, z2 = 3, u1 = 0.9 and u2 = 0.1, and file 1 is stored first, the expected access time is 12 * 0.9 + 15 * 0.1.
My task: Prove that this (greedy) algorithm is optimal.
My Question: Is my answer to that question correct or what should I improve?
My answer:
Suppose the algorithm is not optimal; then there must exist an ordering that is more efficient. Two factors matter here: the usage and the length. The more a file is used, the shorter its access time has to be, so the files placed before it should be as short as possible. If the files were sorted by z_i / u_i in descending order, the files with high usage would be placed last; since a file's access time is the sum of the lengths of all files stored before it (plus its own), weighted by its usage, frequently used files would then be accessed slowly, which contradicts efficiency. Now suppose the ratio z_i / u_i were an inefficient criterion. Dividing by u_i means that a higher usage makes the term smaller, so more frequently used files are placed earlier and accessed faster (recall that 0 < u_i < 1). Deviating from this division would mean that files with higher usage are no longer preferred, which contradicts efficiency. Likewise, because z_i is in the numerator, shorter files are preferred first; deviating from this would mean preferring longer files, and taking longer files first contradicts efficiency. Since every alternative ordering leads to a contradiction, the ordering by z_i / u_i is optimal and correct.
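One standard way to tighten this (not in your write-up) is an exchange argument: if file i immediately precedes file j, swapping them changes the expected access time by u_i * z_j - u_j * z_i, so the swap helps exactly when z_j / u_j < z_i / u_i; if no adjacent swap helps, the ordering is the ascending z_i / u_i order. Below is a small brute-force sanity check of the rule (my own sketch, not a proof), assuming the expected access time is the sum over files of u_i times the total length of everything stored up to and including file i.

from itertools import permutations
import random

def expected_access_time(order, z, u):
    total, prefix = 0.0, 0.0
    for i in order:
        prefix += z[i]          # file i sits after all files placed before it
        total += u[i] * prefix  # its access time is the prefix sum of lengths
    return total

random.seed(0)
for _ in range(200):
    n = 5
    z = [random.randint(1, 20) for _ in range(n)]
    u = [random.random() for _ in range(n)]
    s = sum(u); u = [x / s for x in u]          # usages sum to 1
    greedy = sorted(range(n), key=lambda i: z[i] / u[i])
    best = min(expected_access_time(list(p), z, u) for p in permutations(range(n)))
    assert abs(expected_access_time(greedy, z, u) - best) < 1e-9
print("greedy matched the optimum on all random instances")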
Consider the following linked list:
1->2->3->4->5->6->7->8->9->4->...->9->4.....
The above list has a loop as follows:
[4->5->6->7->8->9->4]
Drawing the linked list on a whiteboard, I tried manually solving it for different pointer steps, to see how the pointers move around -
(slow_pointer_increment, fast_pointer_increment)
So, the pointers for different cases are as follows:
(1,2), (2,3), (1,3)
The first two pairs of increments, (1,2) and (2,3), worked fine, but when I use the pair (1,3), the algorithm does not seem to work. Is there a rule for how much we need to increment the pointers by for this algorithm to hold true?
Although I searched for various increment steps for the slower and the faster pointer, I haven't so far found a single relevant answer as to why it is not working for the increment (1,3) on this list.
The algorithm can easily be shown to be guaranteed to find a cycle starting from any position if the difference between the pointer increments and the cycle length are coprime (i.e. their greatest common divisor is 1).
For the general case, this means the difference between the increments must be 1 (because that's the only positive integer that's coprime to all other positive integers).
For any given pointer increments, if the values aren't coprime, it may still be guaranteed to find a cycle, but one would need to come up with a different way to prove that it will find a cycle.
For the example in the question, with pointer increments of (1,3), the difference is 3-1=2, and the cycle length is 6. 2 and 6 are not coprime, thus it's not known whether the algorithm is guaranteed to find the cycle in general. It does seem like this might actually be guaranteed to find the cycle (including for the example in the question), even though it doesn't reach every position (which it would with coprime increments, as explained below), but I don't have a proof for this at the moment.
The key to understanding this is that, at least for the purposes of checking whether the pointers ever meet, the slow and fast pointers' positions within the cycle only matters relative to each other. That is, these two can be considered equivalent: (the difference is 1 for both)
slow fast slow fast
↓ ↓ ↓ ↓
0→1→2→3→4→5→0 0→1→2→3→4→5→0
So we can think of this in terms of the position of slow remaining constant and fast moving at an increment of fastIncrement-slowIncrement, at which point the problem becomes:
Starting at any position, can we reach a specific position moving at some speed (mod cycle length)?
Or, more generally:
Can we reach every position moving at some speed (mod cycle length)?
Which will only be true if the speed and the cycle length are coprime.
For example, look at a speed of 4 and a cycle of length 6 - starting at 0, we visit:
0, 4, 8%6=2, 6%6=0, 4, 2, 0, ... - GCD(4,6) = 2, and we can only visit every second element.
To see this in action, consider a relative speed of 4 on a cycle of length 6 with a starting offset of 1 between the two pointers: the offset cycles through 1, 5, 3, 1, ... and never reaches 0, so the pointers never meet. (Whether a particular list actually produces such an offset depends on where the two pointers are relative to each other once both have entered the cycle.)
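Here is a tiny sketch of that relative-position view (my own illustration, not part of the argument above): it simply walks the offset between the two pointers around the cycle and records which offsets are ever visited.

def visited_offsets(start, speed, cycle_len):
    # offsets (mod cycle_len) visited when fast gains `speed` per step
    seen, pos = set(), start % cycle_len
    while pos not in seen:
        seen.add(pos)
        pos = (pos + speed) % cycle_len
    return sorted(seen)

print(visited_offsets(1, 4, 6))  # [1, 3, 5] -> offset 0 never reached: no meeting
print(visited_offsets(0, 4, 6))  # [0, 2, 4] -> offset 0 is reached: they do meet
print(visited_offsets(1, 1, 6))  # [0, 1, 2, 3, 4, 5] -> coprime: always meets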
I should note that, to my knowledge at least, the (1,2) increment is considered a fundamental part of the algorithm.
Using different increments (as per the above constraints) might work, but it would be a move away from the "official" algorithm and would involve more work (since a pointer to a linked-list must be incremented iteratively, you can't increment it by more than 1 in a single step) without any clear advantage for the general case.
Bernhard Barker's explanation is spot on; I am simply adding to it.
Why should the difference of speeds between the pointers and the cycle length be coprime?
Take a scenario where the difference of speeds between the pointers (say v) and the cycle length (say L) are not coprime.
So there exists a GCD(v, L) greater than 1 (say G).
Therefore, we have
v=difference of speeds between pointers
L=Length of the cycle(i.e. the number of nodes in the cycle)
G=GCD(v,L)
Since we are considering only relative positions, essentially the slow is stationary and the fast is moving at a relative speed v.
Let fast be at some node in the cycle.
Since G is a divisor of L, we can divide the cycle into L/G parts, each of length G. Start dividing from where fast is located.
Now, v is a multiple of G (say v=nG).
Every time the fast pointer moves, it will jump across n parts, so in each part the pointer arrives at only a single node (basically the last node of a part). Each and every time, the fast pointer will land on the ending node of a part. Refer to the image below.
Example image
As mentioned above by Bernhard, the question we need to answer is
Can we reach every position moving at some speed?
The answer is no if the GCD is greater than 1. As we have seen, the fast pointer will only ever land on the last node of each part.
I was going through my Data Structures and Algorithms notes, and came across the following examples regarding Time Complexity and Big-O Notation: The columns on the left count the number of operations carried out in each line. I didn't understand why almost all the lines in the first example have a multiple of 2 in front of them, whereas the other two examples don't. Obviously this doesn't affect the resulting O(n), but I would still like to know where the 2 came from.
I can only find one explanation for this: the sloppiness of the author of the slides.
In a proper analysis, one has to explain what kinds of operations are counted, at which point, and for what input (like, for example, this book does on page 21). Without this you cannot even be sure whether multiplying 2 numbers counts as 1 operation, 2, or something else.
These slides are inconsistent. For example:
In slide 1, currentMax = A[0] takes 2 operations. That kind of makes sense if you count fetching the 0-th element of the array as 1 operation and the assignment as another. But in slide 3, n iterations of s = s + X[i] take n operations, which means that s = s + X[i] takes 1 operation. On its own that also kind of makes sense: we just increase one counter.
But the two conventions are inconsistent with each other, because it doesn't make sense that a = X[0] is 2 operations while a = a + X[0], which does more work, takes only 1.
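For illustration (my own sketch, not taken from the slides), here is one consistent convention applied to a find-the-maximum loop like the one on slide 1. A different convention changes the constants, but never the O(n) bound.

def find_max(A):
    current_max = A[0]          # 1 array access + 1 assignment        -> 2 ops
    for i in range(1, len(A)):  # roughly 2 ops per iteration (compare, increment)
        if A[i] > current_max:  # 1 array access + 1 comparison        -> 2 ops
            current_max = A[i]  # 1 array access + 1 assignment        -> at most 2 ops
    return current_max          # 1 op
# Total: at most c*n + d operations for constants c and d, i.e. O(n).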
I am writing an AI for a text-based game and the problem I have is that I am stuck on how to determine the thinnest section of wall. For example the following represents the 2D map, where '^' is the character who wants to get through a wall represented by '*' characters to the location marked 'X':
------------------
| X * |
|***** |
|**** |
|*** |
|** ^ |
| |
| |
| |
------------------
I've been thinking about this one for a couple of days straight now and I've run out of ideas. I've tried using the A* algorithm, making the g-cost very high whenever a step goes through a wall character. Unfortunately, the algorithm then decides never to path-find through the wall.
The agent can only move left-right-up-down and not diagonally, one space at a time.
The shortest path in the above example through the wall is one, as it only has to travel through one '*' character.
I just need a couple simple ideas :)
All of the popular graph search algorithms are typically formulated with real-number (i.e. float/double) costs. But this isn't a requirement. All you really need for a cost is something that has a strict ordering and an addition-like operation.
You could apply standard A* to this.
Define the costs to have the form (a,b)
a is number of moves on wall cells
b is number of moves on normal cells.
Define the ordering on these costs as follows:
[(a1,b1) < (a2,b2)] == [(a1<a2) || (a1==a2 && b1<b2)]
This is just a dictionary ordering where we always prefer fewer moves on wall cells before we prefer fewer moves on normal cells.
Define the addition operation on these costs as follows:
(a1,b1) + (a2,b2) == (a1+a2,b1+b2)
Define the heuristic (i.e. the lower bound estimate on the remaining distance to goal) to be (0,b) where b is the Manhattan distance to the goal
One immediate objection might be "With that heuristic, the entire space outside of the wall will have to be explored before ever trying to pass through the wall!" -- But that's exactly what was asked for.
With the information and requirements you've given, that's actually the optimal A* heuristic.
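Here is a minimal Python sketch of that tuple-cost A* (my own, with a simplified version of the question's map hard-coded; the grid, start and goal coordinates are assumptions for the demo). Python's built-in tuple comparison happens to implement exactly the dictionary ordering defined above.

import heapq

def astar(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    INF = (float('inf'), float('inf'))

    def h(pos):  # heuristic: (0 wall moves, Manhattan distance)
        return (0, abs(pos[0] - goal[0]) + abs(pos[1] - goal[1]))

    best = {start: (0, 0)}
    frontier = [(h(start), (0, 0), start)]
    while frontier:
        _, cost, pos = heapq.heappop(frontier)
        if pos == goal:
            return cost                      # (walls crossed, normal moves)
        if cost > best.get(pos, INF):
            continue                         # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = pos[0] + dr, pos[1] + dc
            if 0 <= r < rows and 0 <= c < cols:
                step = (1, 0) if grid[r][c] == '*' else (0, 1)
                new = (cost[0] + step[0], cost[1] + step[1])
                if new < best.get((r, c), INF):
                    best[(r, c)] = new
                    est = h((r, c))
                    heapq.heappush(frontier, ((new[0] + est[0], new[1] + est[1]), new, (r, c)))
    return None

grid = ["  X *   ",
        "*****   ",
        "****    ",
        "***     ",
        "**   ^  "]
print(astar(grid, (4, 5), (0, 2)))   # -> (1, 6): one wall crossed, six normal moves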
A more complicated approach that could give significantly better best-case performance would be to combine the above with a bidirectional search. If your goal is inside of a tiny walled area, the bidirectional search can find some candidate "cheapest paths through the wall" very early on in the search.
Just make it a weighted graph, and give all the "walls" an absurdly large weight.
Be careful not to overflow your integers.
Assuming any number of normal moves is always cheaper than going through the wall (meaning 10000000000 non-through-wall moves are cheaper than 1 through-wall move; otherwise setting the cost appropriately would work), I can see a bad special case for any algorithm I can think of that does not involve searching practically the whole map, so...
Do an exhaustive A* from the source, stopping at walls, recording the shortest path at each position.
For each explored position next to a wall, step through the wall. Then do a combined A* of all of them (where applicable), again stopping at walls.
For each of these newly explored positions next to walls, step through the wall and continue the A*.
And so on... until we find the target.
Skip already explored positions - the new path should always be longer than the one already there.
The only reason you really want to do an A* instead of a BFS is that it allows for less exploring once you reach the area containing the target. Whether this is more efficient will depend on the map.
As Sint mentioned, if the start is always in a wide open area and the finish is in a small area, reversing this search would be more efficient. But that is only really applicable if you know in advance which case you are in. Detecting it is unlikely to be efficient, and once you have detected it, you've done most of the work already, so you would lose out if both endpoints are in reasonably sized areas.
An example:
X** |
** |
** |
** ^|
Initial BFS:
X**3
**32
**21
**10
Step through the wall and BFS (no BFS happens since they have nowhere to go but through walls):
(the ones we can ignore are marked with %)
X*4%
*4%%
*3%%
*2%%
Step into wall and BFS (BFS onto target):
65%%
5%%%
4%%%
3%%%
Use Dijkstra.
Since you're dealing with a text-based game, I find it highly unlikely that you're talking about maps larger than 1000 by 1000 characters. This will give you the guaranteed best answer at a very low cost of O(n log n), with very simple and straightforward code.
Basically, each search state will have to keep track of two things: How many walls have you stepped through so far, and how many regular empty spaces. This can be encoded into a single integer, for both the search and the mark matrix, by assuming for instance that each wall has a cost of 2^16 to be traversed. Thus, the natural ordering of Dijkstra will ensure the paths with the least walls get tried first, and that after traversing a wall, you do not repeat paths that you had already reached without going through as many walls.
Basically, assuming a 32-bit integer, a state that has passed through 5 empty spaces and 3 walls, will look like this:
0000000000000011 0000000000000101. If your maps are really huge, maze-like, full of walls, nearly empty, or whatnot, you can tweak this representation to use more or fewer bits for each count, or even use a longer integer if you feel more comfortable with that, since this specific encoding would "overflow" if there existed a shortest path that required walking through more than 65 thousand empty spaces.
The main advantage of using a single integer rather than two (for walls/ empty spaces) is that you can have a single, simple int mark[MAXN][MAXM]; matrix to keep track of the search. If you have reached a specific square while walking through 5 walls, you do not have to check whether you could have reached it with 4, 3 or less walls to prevent the propagation of a useless state - this information will automatically be embedded into your integer, as long as you store the amount of walls into the higher bits, you will never repeat a path while having a higher "wall cost".
Here is a fully implemented algorithm in C++, consider it pseudo-code for better visualization and understanding of the idea presented above :)
#include <cstdint>
#include <queue>
#include <vector>
using namespace std;

int rx[4] = {1,0,-1,0};
int ry[4] = {0,1,0,-1};

int text[height][width];   // your map
int mark[height][width];   // set every position to "infinite" cost
int parent[height][width]; // to recover the final path

priority_queue<int64_t, vector<int64_t>, greater<int64_t> > q;

// state layout: cost (walls<<16 + steps) in the high 32 bits, then y, then x
int64_t state = ((int64_t)initial_y<<16) + initial_x;
q.push(state);
while(!q.empty()){
    state = q.top(); q.pop();
    int x = state & 0xFFFF;          // low 16 bits: column
    int y = (state>>16) & 0xFFFF;    // next 16 bits: row
    int cost = state>>32;            // high 32 bits: walls<<16 + steps
    if(cost > mark[y][x]) continue;  // stale queue entry
    if(text[y][x] == 'X') break;     // reached the target
    for(int i = 0; i < 4; ++i){
        int xx = x+rx[i];
        int yy = y+ry[i];
        if(xx > -1 && xx < width && yy > -1 && yy < height){
            int newcost = cost;
            if(text[yy][xx] == ' ') newcost += 1;  // empty space: cheap
            else newcost += 1<<16;                 // anything else: dominant cost
            if(newcost < mark[yy][xx]){
                mark[yy][xx] = newcost;
                parent[yy][xx] = i; // you know which direction you came from
                q.push( ((int64_t)newcost << 32) + ((int64_t)yy<<16) + xx );
            }
        }
    }
}
// The number of walls in the final answer:
// walls = state>>48;
// steps = (state>>32) & 0xFFFF; // (non-wall moves)
// you can recover the exact path by traversing back using the information in parent[][]
I've been self-studying the Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins 0, 1 and 2 with P0, P1 and P2 probability landing on Head when tossed. Toss coin 0, if the result is Head, toss coin 1 three times else toss coin 2 three times. The observed data produced by coin 1 and 2 is like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's result. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins A and B with PA and PB being the probability landing on Head when tossed. Each round, select one coin at random and toss it 10 times then record the results. The observed data is the toss results provided by these two coins. However, we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can get the calculations, I can't relate the ways they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.
So how exactly do these example work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three coins example this is the result of tossing coin 0. If you knew the outcome of that you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not - the results from coin 0. For a coin that gets A heads and B tails you find that the maximum likelihood for the probability of A heads is A/(A+B) - it might be worth you working this out in detail, because it is the building block for the M step.
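As a quick sketch of that building block (the standard coin-flip MLE, nothing specific to either PDF): for a coin that shows A heads and B tails, the log likelihood of a head-probability p is
log L(p) = A*log(p) + B*log(1-p) + constant,
and setting the derivative A/p - B/(1-p) to zero gives p = A/(A+B).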
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails produced by each coin. In the PDF they work out P(Y=H|X=) = 0.6967. This means that you use weight 0.6967 for the case Y=H, which means that you increment the counts for Y=H by 0.6967 and increment the counts for X=H in coin 1 by 3*0.6967, and you increment the counts for Y=T by 0.3033 and increment the counts for X=H in coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is a maximum likelihood of coin probabilities in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian) but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihood of the processes. (Typically we do at least know the number of the processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
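As a concrete toy illustration of the hard-vs-soft distinction (my own sketch, not something from the references below): a 1D mixture of two unit-variance Gaussians with equal mixing proportions, where only the two means are estimated.

import math, random

random.seed(1)
data = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(4, 1) for _ in range(100)]

def density(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2)   # unnormalised N(mu, 1) density

mu1, mu2 = -1.0, 1.0                        # arbitrary initial guesses
for _ in range(20):
    # E-step (soft classification): responsibility of component 1 for each point
    r1 = [density(x, mu1) / (density(x, mu1) + density(x, mu2)) for x in data]
    # M-step: responsibility-weighted means maximise the expected log likelihood
    mu1 = sum(r * x for r, x in zip(r1, data)) / sum(r1)
    mu2 = sum((1 - r) * x for r, x in zip(r1, data)) / sum(1 - r for r in r1)

# The hard classify-maximize variant would instead round each responsibility to
# 0 or 1 (assign each point to its closer mean) before recomputing the means.
print(round(mu1, 2), round(mu2, 2))   # ends up near 0 and 4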
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the code below in Python; it works through the example given in the second paper you cite, by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer: the code works, but I am certain it is not coded optimally. I am not normally a Python coder and only started using it two weeks ago.
import numpy as np
import math

#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou* ####

def get_mn_log_likelihood(obs, probs):
    """ Return the (log)likelihood of obs, given the probs"""
    # Multinomial Distribution Log PMF
    # ln (pdf) = multinomial coeff * product of probabilities
    # ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
    multinomial_coeff_denom = 0
    prod_probs = 0
    for x in range(0, len(obs)):  # loop through state counts in each observation
        multinomial_coeff_denom = multinomial_coeff_denom + math.log(math.factorial(obs[x]))
        prod_probs = prod_probs + obs[x] * math.log(probs[x])

    multinomial_coeff = math.log(math.factorial(sum(obs))) - multinomial_coeff_denom
    likelihood = multinomial_coeff + prod_probs
    return likelihood

# 1st: Coin B, {HTTTHHTHTH}, 5H,5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H,1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H,2T
# 4th: Coin B, {HTHTTTHHTT}, 4H,6T
# 5th: Coin A, {THHHTHHHTH}, 7H,3T
# so, from MLE: pA(heads) = 0.80 and pB(heads)=0.45

# represent the experiments
head_counts = np.array([5,9,8,4,7])
tail_counts = 10 - head_counts
experiments = list(zip(head_counts, tail_counts))  # list() so it can be indexed and len()-ed in Python 3

# initialise the pA(heads) and pB(heads)
pA_heads = np.zeros(100); pA_heads[0] = 0.60
pB_heads = np.zeros(100); pB_heads[0] = 0.50

# E-M begins!
delta = 0.001
j = 0  # iteration counter
improvement = float('inf')
while (improvement > delta):
    expectation_A = np.zeros((5,2), dtype=float)
    expectation_B = np.zeros((5,2), dtype=float)
    for i in range(0, len(experiments)):
        e = experiments[i]  # i'th experiment
        ll_A = get_mn_log_likelihood(e, np.array([pA_heads[j], 1-pA_heads[j]]))  # loglikelihood of e given coin A
        ll_B = get_mn_log_likelihood(e, np.array([pB_heads[j], 1-pB_heads[j]]))  # loglikelihood of e given coin B

        weightA = math.exp(ll_A) / (math.exp(ll_A) + math.exp(ll_B))  # corresponding weight of A proportional to likelihood of A
        weightB = math.exp(ll_B) / (math.exp(ll_A) + math.exp(ll_B))  # corresponding weight of B proportional to likelihood of B

        expectation_A[i] = np.dot(weightA, e)
        expectation_B[i] = np.dot(weightB, e)

    pA_heads[j+1] = sum(expectation_A)[0] / sum(sum(expectation_A))
    pB_heads[j+1] = sum(expectation_B)[0] / sum(sum(expectation_B))

    improvement = max(abs(np.array([pA_heads[j+1], pB_heads[j+1]]) - np.array([pA_heads[j], pB_heads[j]])))
    j = j + 1
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly; the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables, which indicate whether coin 1 was used or coin 2. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of parameters would be trivial: p1 would be the proportion of heads in samples where c_i1=1, likewise for c_i2, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.
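For reference, here is roughly what those updates look like in the notation above (this is the standard binomial-mixture form written out by me, not taken from the question's PDFs), with h_i heads and t_i = 3 - h_i tails in the i-th observed triple, and \lambda = P(coin 1 is used):

E step, for each i:
E[c_i1] = \lambda * p_1^h_i * (1-p_1)^t_i / ( \lambda * p_1^h_i * (1-p_1)^t_i + (1-\lambda) * p_2^h_i * (1-p_2)^t_i )
E[c_i2] = 1 - E[c_i1]

M step:
\lambda = (1/n) * sum_i E[c_i1]
p_1 = sum_i E[c_i1] * h_i / sum_i E[c_i1] * (h_i + t_i)
p_2 = sum_i E[c_i2] * h_i / sum_i E[c_i2] * (h_i + t_i)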