Markov entropy when probabilities are uneven - probability

I've been thinking about information entropy in terms of the Markov equation:
H = -SUM(p(i)lg(p(i)), where lg is the base 2 logarithm.
This assumes that all selections i have equal probability. But what if the probability in the given set of choices is unequal? For example, let's say that StackExchange has 20 sites, and that the probability of a user visiting any StackExchange site except StackOverflow is p(i). But, the probability of a user visiting StackExchange is 5 times p(i).
Would the Markov equation not apply in this case? Or is there an advanced Markov variation that I'm unaware of?

I believe you are mixing up 2 concepts: entropy and the Markov equation. Entropy measures the "disorder" of a distribution on states, using the equation you gave: H = -SUM(p(i)lg(p(i)), where p(i) is the probability of observing each state i.
The Markov property does not imply that every state has the same probability. Roughly, a system is said to exhibit the Markov property if the probability to observe a state depends only on observing a few previous states - after a certain limit, the extra states you observe add no information to predicting the next state.
The prototypical Markov model is known as a Markov chain. It says that from each state i, you can move to any state with another probability, represented as a probability matrix:
0 1 2
0 0.2 0.5 0.3
1 0.8 0.1 0.1
2 0.3 0.3 0.4
In this example, the probability of moving from state 0 to 1 is 0.5, and depends only on being in state 0 (knowing more about the previous states would not change that probability).
As long as all states can be visited starting from any state, no matter what the initial distribution, the probability of being in each state converges to a stable, long-term distribution, and on a "long series" you'll observe each state with a stable probability, which is not necessarily equal for each states.
In our example, we would end up having probabilities p(0), p(1) and p(2), and you could then compute the entropy of that chain using your formula.

From your example are you thinking of Markov Chains?

Related

A sudoku problem: Efficiently find or approximate probability distribution over chosen numbers at each index of an array with no repeats

I'm looking for an efficient algorithm to generate or iteratively approximate a solution to the problem described below.
You are given an array of length N and a finite set of numbers Si for each index i of the array. Now, if we are to place a number from Si at each index i to fill the entire array, while ensuring that the number is unique across the entire array; given all the possible arrays, what is the probability ditribution over each number at each index?
Here I give an example:
Assuming we have the following array of length 3 with each column representing Si at the index of the column
4 4 4
   2  2
1  1  1
We will have the following possible arrays:
421
412
124
142
And the following probability distribution: (over 1 2 4 at each index respectively)
0.5 0.25 0.25
      0.5   0.5
0.5 0.25 0.25
Brute forcing this problem is obviously doable but I have a gut feeling that there must be some more efficient algorithms for this.
The reason why I think so is due to the fact that one can derive the probability distribution from the set of all possibilities but not the other way around, so the distribution itself must contain less information then the set of all possibilities have. Therefore, I believe that we do not need to generate all possibilites just to obtain the probability distribution.
Hence, I am wondering if there is any smart matrix operation we could use for this problem or even fixed-point iteration/density evolution to approximate the end probability distribution? Some other potentially more efficient approaches to this problem are also appreciated.
(p.s. The reason why I am interested in this problem is because I wanted to generate probability distribution over candidate numbers for the empty cells in a sudoku and other sudoku-like games without a unique answers by only applying all the standard rules)
Sudoku is a combinatorial problem. It is easy to show that the probability of any independent cell is uniform (because you can relabel a configuration to put any number at a given position). The joint probabilities are more complicated.
If the game is partially filled you have constraints that will affect this distribution.
You must devise an algorithm to calculate the number of solutions from a given initial configuration. Then you compute the fraction of the total solutions are will have a specific value at the position of interest.
counts = {}
for i in range(1, 10):
board[cell] = i;
counts[i] = countSolutions(board);
prob = {i: counts[i] / sum(counts[i] for i in range(1, 10))}
The same approach works for joint probabilities but in some cases the number of possibilities may be too high.

Sampling from Geometric distribution in constant time

I would like to know if there is any method to sample form the Geometric distribution in constant time without using log which can be hard to approximate. Thanks.
Without relying on logarithms, there is no algorithm to sample from a geometric(p) distribution in constant expected time. Rather, on a realistic computing model, such an algorithm's expected running time must grow at least as fast as 1 + log(1/p)/w, where w is the word size of the computer in bits (Bringmann and Friedrich 2013). The following algorithm, which is equivalent to one in the Bringmann paper, generates a geometric(px/py) random number without relying on logarithms, and when px/py is very small, the algorithm is considerably faster than the trivial algorithm of generating trials until a success:
Set pn to px, k to 0, and d to 0.
While pn*2 <= py, add 1 to k and multiply pn by 2.
With probability (1− px /py)2k, add 1 to d and repeat this step.
Generate a uniform random integer in [0, 2k), call it m, then with probability (1− px /py)m, return d*2k+m. Otherwise, repeat this step.
(The actual algorithm described in the Bringmann paper is in fact much more involved than this; see my note "On a Geometric Sampler".)
REFERENCES:
Bringmann, K., and Friedrich, T., 2013, July. Exact and efficient generation of geometric random variates and random graphs, in International Colloquium on Automata, Languages, and Programming (pp. 267-278).

Expectation Maximization coin toss examples

I've been self-studying the Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins 0, 1 and 2 with P0, P1 and P2 probability landing on Head when tossed. Toss coin 0, if the result is Head, toss coin 1 three times else toss coin 2 three times. The observed data produced by coin 1 and 2 is like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's result. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins A and B with PA and PB being the probability landing on Head when tossed. Each round, select one coin at random and toss it 10 times then record the results. The observed data is the toss results provided by these two coins. However, we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can get the calculations, I can't relate the ways they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.
So how exactly do these example work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three coins example this is the result of tossing coin 0. If you knew the outcome of that you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not - the results from coin 0. For a coin that gets A heads and B tails you find that the maximum likelihood for the probability of A heads is A/(A+B) - it might be worth you working this out in detail, because it is the building block for the M step.
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails produced by each coin. In the PDF they work out P(Y=H|X=) = 0.6967. This means that you use weight 0.6967 for the case Y=H, which means that you increment the counts for Y=H by 0.6967 and increment the counts for X=H in coin 1 by 3*0.6967, and you increment the counts for Y=T by 0.3033 and increment the counts for X=H in coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is a maximum likelihood of coin probabilities in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian) but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihood of the processes. (Typically we do at least know the number of the processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the below code in Python which explains the example given in your second example paper by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer : The code does work but I am certain that it is not coded optimally. I am not a Python coder normally and have started using it two weeks ago.
import numpy as np
import math
#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou* ####
def get_mn_log_likelihood(obs,probs):
""" Return the (log)likelihood of obs, given the probs"""
# Multinomial Distribution Log PMF
# ln (pdf) = multinomial coeff * product of probabilities
# ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
multinomial_coeff_denom= 0
prod_probs = 0
for x in range(0,len(obs)): # loop through state counts in each observation
multinomial_coeff_denom = multinomial_coeff_denom + math.log(math.factorial(obs[x]))
prod_probs = prod_probs + obs[x]*math.log(probs[x])
multinomial_coeff = math.log(math.factorial(sum(obs))) - multinomial_coeff_denom
likelihood = multinomial_coeff + prod_probs
return likelihood
# 1st: Coin B, {HTTTHHTHTH}, 5H,5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H,1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H,2T
# 4th: Coin B, {HTHTTTHHTT}, 4H,6T
# 5th: Coin A, {THHHTHHHTH}, 7H,3T
# so, from MLE: pA(heads) = 0.80 and pB(heads)=0.45
# represent the experiments
head_counts = np.array([5,9,8,4,7])
tail_counts = 10-head_counts
experiments = zip(head_counts,tail_counts)
# initialise the pA(heads) and pB(heads)
pA_heads = np.zeros(100); pA_heads[0] = 0.60
pB_heads = np.zeros(100); pB_heads[0] = 0.50
# E-M begins!
delta = 0.001
j = 0 # iteration counter
improvement = float('inf')
while (improvement>delta):
expectation_A = np.zeros((5,2), dtype=float)
expectation_B = np.zeros((5,2), dtype=float)
for i in range(0,len(experiments)):
e = experiments[i] # i'th experiment
ll_A = get_mn_log_likelihood(e,np.array([pA_heads[j],1-pA_heads[j]])) # loglikelihood of e given coin A
ll_B = get_mn_log_likelihood(e,np.array([pB_heads[j],1-pB_heads[j]])) # loglikelihood of e given coin B
weightA = math.exp(ll_A) / ( math.exp(ll_A) + math.exp(ll_B) ) # corresponding weight of A proportional to likelihood of A
weightB = math.exp(ll_B) / ( math.exp(ll_A) + math.exp(ll_B) ) # corresponding weight of B proportional to likelihood of B
expectation_A[i] = np.dot(weightA, e)
expectation_B[i] = np.dot(weightB, e)
pA_heads[j+1] = sum(expectation_A)[0] / sum(sum(expectation_A));
pB_heads[j+1] = sum(expectation_B)[0] / sum(sum(expectation_B));
improvement = max( abs(np.array([pA_heads[j+1],pB_heads[j+1]]) - np.array([pA_heads[j],pB_heads[j]]) ))
j = j+1
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly, the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables, which indicate whether coin 1 was used or coin 2. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of parameters would be trivial: p1 would be the proportion of heads in samples where c_i1=1, likewise for c_i2, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.

How do I determine the bias of an algorithm?

Let's say I have an algorithm that is supposed to represent a coin flip. How do I determine the bias of this coin? Specifically, I have written the algorithm in this JSFiddle.
The fiddle runs a series of 20 tests. Each test flips the coin 100 times and tallies the results. At the end of the series it reports Heads/Tails for the total number of flips across all tests. This result seems to be approaching 1 (from both sides), but I have not done any rigorous testing of this.
Note, this is not homework. This is purely a personal interest.
You can't come up with a way to guarantee detecting a bias, but you can determine it to a certain degree of certainty (say 95%). What you do is test n times and count how many times you get heads, call this variable h.
Then if h / n < 0.5 - 1.96 * sqrt(0.25 / n) then the coin is biased towards tails (with a 95% probability) and if h / n > 0.5 + 1.96 * sqrt(0.25 / n) then the coin is biased towards heads.
This decision is based on something called normal approximation to a binomial distribution, you can read more about it here: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Normal_approximation_interval
George Marsaglia Diehard Tests
if you consider your Head/Tail as generating 0 1 bits. Using them you can generate random numbers and test their randomness

Algorithm for modeling expanding gases on a 2D grid

I have a simple program, at it's heart is a two dimensional array of floats, supposedly representing gas concentrations, I have been trying to come up with a simple algorithm that will model the gas expanding outwards, like a cloud, eventually ending up with the same concentration of the gas everywhere across the grid.
For example a given state progression could be:
(using ints for simplicity)
starting state
00000
00000
00900
00000
00000
state after 1 pass of algorithm
00000
01110
01110
01110
00000
one more pas should give a 5x5 grid all containing the value 0.36 (9/25).
I've tried it out on paper but no matter how I try, I cant get my head around the algorithm to do this.
So my question is, how should I set about trying to code this algorithm? I've tried a few things, applying a convolution, trying to take each grid cell in turn and distributing it to its neighbours, but they all end up having undesirable effects, such as ending up eventually with less gas than I originally started with, or all of gas movement being in one direction instead of expanding outwards from the centre. I really can't get my head around it at all and would appreciate any help at all.
It's either a diffusion problem if you ignore convection or a fluid dynamics/mass transfer problem if you don't. You would start with equations for conservation of mass and momentum for an Eulerian (fixed control volume) viewpoint if you were solving from scratch.
It's a transient problem, so you need to perform an integration to advance the state from time t(n) to t(n+1). You show a grid, but nothing about how you're solving in time. What integration scheme have you tried? Explicit? Implicit? Crank-Nicholson? If you don't know, you're not approaching the problem correctly.
One book that I really liked on this subject was S.W. Patankar's "Numerical Heat Transfer and Fluid Flow". It's a little dated now, but I liked the treatment. It's still good after 29 years, but there might be better texts since I was reading on the subject. I think it's approachable for somebody looking into it for the first time.
In the example you give, your second stage has a core of 1's. Usually diffusion requires a concentration gradient, so most diffusion related techniques won't change the 1 in the middle on the next iteration (nor would they have got to that state after the first one, but it's a bit easier to see once you've got a block of equal values). But as the commenters on your post say, that's not likely to be the cause of a net movement. Reducing the gas may be edge effects, but can also be a question of rounding errors - set the cpu to round half even, and total the gas and apply a correction now and again.
It looks like you're trying to implement a finite difference solver for the heat equation with Neumann boundary conditions (insulation at the edges). There's a lot of literature on this kind of thing. The Wikipedia page on finite difference method describes a simple but stable method, but for Dirichlet boundary conditions (constant density at edges). Modifying the handling of the boundary conditions shouldn't be too difficult.
It looks like what you want is something like a smoothing algorithm, often used in programs like Photoshop, or old school demo effects, like this simple Flame Effect.
Whatever algorithm you use, it will probably help you to double buffer your array.
A typical smoothing effect will be something like:
begin loop forever
For every x and y
{
b2[x,y] = (b1[x,y] + (b1[x+1,y]+b1[x-1,y]+b1[x,y+1]+b1[x,y-1])/8) / 2
}
swap b1 and b2
end loop forever
See Tom Forsyth's Game Programming Gems article. Looks like it fulfils your requirements, but if not then it should at least give you some ideas.
Here's a solution in 1D for simplicity:
The initial setup is with a concentration of 9 at the origin (), and 0 at all other positive and negative coordinates.
initial state:
0 0 0 0 (9) 0 0 0 0
The algorithm to find next iteration values is to start at the origin and average current concentrations with adjacent neighbors. The origin value is a boundary case and the average is done considering the origin value, and its two neighbors simultaneously, i.e. average among 3 values. All other values are effectively averaged among 2 values.
after iteration 1:
0 0 0 3 (3) 3 0 0 0
after iteration 2:
0 0 1.5 1.5 (3) 1.5 1.5 0 0
after iteration 3:
0 .75 .75 2 (2) 2 .75 .75 0
after iteration 4:
.375 .375 1.375 1.375 (2) 1.375 1.375 .375 .375
You do these iterations in a loop. Outputting the state every n number of iterations. You may introduce a time constant to control how many iterations represent one second of clock-on-the-wall time. This is also a function of what length units the integer coordinates represent. For a given H/W system, you can tune this value empirically. You may also introduce a steady state tolerance value to control when the program says " all neighbor values are within this tolerance" or "no value changed between iterations by more than this tolerance" and so the algorithm has reached a steady state solution.
The concentration for each iteration given a starting concentration can be obtained by the equation:
concentration = startingConcentration/(2*iter + 1)**2
iter is the time iteration. So for your example.
startingConcentration = 9
iter = 0
concentration = 9/(2*0 + 1)**2 = 9
iter = 1
concentration = 9/(2*1 + 1)**2 = 1
iter = 2
concentration = 9/(2*2 + 1)**2 = 9/25 = .35
you can set the value of the array after each "time step"

Resources