Tensorflow: running same computational graph with different random samples efficiently

In Tensorflow, the computational graph can be made to depend on random variables. In scenarios where the random variable represents a single sample from a distribution, it can be of interest to compute the quantity with N separate samples to e.g. make a sample estimate with less variance.
Is there a way to run the same graph with different random samples, reusing as many of the intermediate calculations as possible?
Possible solutions:
Create the graph and the random variable inside a loop. Con: makes redundant copies of quantities that don't depend on the random variable.
Extend the random variable with a batch dimension. Con: a bit cumbersome. Seems like something TF should be able to do automatically. (A sketch of this approach is given after the looping example below.)
Maybe the graph editor (in .contrib) can be used to make copies with different noise, but I'm unsure whether this is any better than the looping.
Ideally, there should just be an op that reevaluates the random variable or marks the dependency as unfulfilled, forcing it to sample a new quantity. But this might very well not be possible.
Example of the looping strategy:
import tensorflow as tf

x = tf.Variable(1.)
g = []
f = lambda x: tf.identity(x)  # op not depending on noise

for i in range(10):
    a = tf.random_normal(())      # random variable
    y = tf.pow(f(x) + a*x, 2)     # quantity to be repeated for diff samples
    g += [y]

# here, the mean is the quantity of interest
m = tf.reduce_mean(g)
# the variance demonstrates that the samples are different
v = tf.reduce_mean(tf.map_fn(lambda x: tf.square(x - m), tf.stack(g)))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(g))
    print('variance: {}'.format(sess.run(v)))
If f(x) is an expensive function, I can only assume that the loop makes a lot of redundant calculations.
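For reference, here is a rough sketch of the batch-dimension idea from the list above (untested; it uses the same TF 1.x API as the example and assumes f broadcasts over a leading sample dimension), so that f(x) is built and evaluated only once:
import tensorflow as tf

N = 10                                 # number of samples
x = tf.Variable(1.)
f = lambda x: tf.identity(x)           # op not depending on noise

a = tf.random_normal((N,))             # N samples of the random variable
y = tf.pow(f(x) + a * x, 2)            # f(x) is built once and broadcast over the N samples
m = tf.reduce_mean(y)                  # sample mean (quantity of interest)
v = tf.reduce_mean(tf.square(y - m))   # sample variance, to check the samples differ

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([m, v]))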

Related

How can I generate a random sparse matrix with a specific probability of symmetric entries?

I'm working on a program that sorts individuals into teams based on a sparse matrix with binary entries, each entry corresponding to whether or not i is willing to work with j and so on. I have the program running, but I need to be able to test it on random matrices to observe some relationships between the results and the parameters.
What I'd like to find is some way to generate a matrix that has a certain number of non-zero entries per row and a certain probability of symmetric entries. That is, I want to be able to assign a specific number for P(w_ji = 1 | w_ij = 1) and use that to generate a matrix. I don't want symmetric matrices, but modeling this with completely random matrices would be inaccurate, since a real-world willingness matrix tends to be at least somewhat symmetric.
Does anyone know of anything I could use to generate such a matrix? I generally use python (with gurobi) and am open to installing any number of other libraries to help if I have to. If anyone else here uses gurobi, I would appreciate input on whether or not I could model matrix generation like this as an optimization problem using something like this for an objective function:
min <= sum(w[i,j] * w[j,i] for i in... for j in...) <= max
Thank you!
If all you want is a coefficient matrix with random distribution of 0 and 1 values, the easiest option is to pick a probability and do Bernoulli trials as to whether the value is 1. (If it is zero, omit the element for sparseness).
Alternately, if you need a random permutation of a fixed number of 0's and 1's, then try something like:
import random
n = 50
k = 10
positions = sorted(random.sample(range(n), k))
The list positions represents the nonzero elements you need.
With a matrix representation, this would be a good candidate for the Gurobi matrix variable object, MVar.
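None of this is Gurobi-specific, but here is a rough sketch of how the per-row sampling above could be combined with the P(w_ji = 1 | w_ij = 1) requirement from the question (gen_willingness, k and p_sym are made-up names; the matrix is stored as a dict of nonzero entries for sparseness):
import random

def gen_willingness(n, k, p_sym, seed=None):
    # each row i picks k partners uniformly at random; every chosen entry (i, j)
    # is mirrored to (j, i) with probability p_sym.
    # note: the realised P(w_ji = 1 | w_ij = 1) ends up slightly above p_sym,
    # because row j may also pick i on its own.
    rng = random.Random(seed)
    w = {}
    for i in range(n):
        for j in rng.sample([c for c in range(n) if c != i], k):
            w[(i, j)] = 1
            if rng.random() < p_sym:
                w[(j, i)] = 1
    return w

w = gen_willingness(n=50, k=10, p_sym=0.7, seed=0)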

MATLAB optimization: speed up computation on large matrices

I am using the following function:
kernel = @(X,Y,sigma) exp((-pdist2(X,Y,'euclidean').^2)./(2*sigma^2));
to compute a series of kernels, in the following way:
K = [(1:size(featureVectors,1))', kernel(featureVectors,featureVectors, sigma)];
However, since featureVectors is a huge matrix (something like 10000x10000), it takes a really long time to compute the kernels (e.g., K).
Is it possible to somehow speed up the computation?
EDIT: Context
I am using a classifier via libsvm, with a gaussian kernel, as you may have noticed from the variable names and semantics.
I am using now (more or less) #terms~=10000 and #docs~=10000. This #terms resulted after stopwords removal and stemming. This course indicates that having 10000 features makes sense.
Unfortunately, libsvm does not automatically implement the Gaussian kernel, so it has to be computed by hand. I took the idea from here, but the kernel computation (as suggested by the referenced question) is really slow.
You are using pdist2 with two equal input arguments (X and Y are equal when you call kernel). You could save half the time by computing each pair only once. You do that using pdist and then squareform:
kernel = @(X,sigma) exp((-squareform(pdist(X,'euclidean')).^2)./(2*sigma^2));
K = [(1:size(featureVectors,1))', kernel(featureVectors, sigma)];
The exponential function decays very fast: for distances of several sigma, the kernel is essentially zero. We can sort those cases out and save time.
function z = kernel(X, Y, sigma)
    d = pdist2(X,Y,'euclidean');
    z = zeros(size(d)); % start with zeros
    m = d < 3 * sigma;
    z(m) = exp(-d(m).^2/(2*sigma^2));
end

Expectation Maximization coin toss examples

I've been self-studying the Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins 0, 1 and 2 with P0, P1 and P2 probability landing on Head when tossed. Toss coin 0, if the result is Head, toss coin 1 three times else toss coin 2 three times. The observed data produced by coin 1 and 2 is like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's result. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins A and B with PA and PB being the probability landing on Head when tossed. Each round, select one coin at random and toss it 10 times then record the results. The observed data is the toss results provided by these two coins. However, we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can get the calculations, I can't relate the ways they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.
So how exactly do these example work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three coins example this is the result of tossing coin 0. If you knew the outcome of that you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not - the results from coin 0. For a coin that gets A heads and B tails you find that the maximum likelihood for the probability of A heads is A/(A+B) - it might be worth you working this out in detail, because it is the building block for the M step.
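A quick sketch of that calculation, for reference: with A heads and B tails the log likelihood is log L(p) = A*log(p) + B*log(1-p), its derivative is A/p - B/(1-p), and setting that to zero gives p = A/(A+B).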
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails produced by each coin. In the PDF they work out a posterior weight P(Y=H|X) = 0.6967 for an observed triple. This means that you use weight 0.6967 for the case Y=H, which means that you increment the counts for Y=H by 0.6967 and increment the counts for X=H in coin 1 by 3*0.6967, and you increment the counts for Y=T by 0.3033 and increment the counts for X=H in coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is the maximum likelihood estimate of a coin's probability in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
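As a purely illustrative sketch of that bookkeeping for the three-coin example (m_step and triples are made-up names; each observed triple is summarised as (w, h, t), where w is the E-step weight P(Y=H|X) and h, t are its head and tail counts):
def m_step(triples):
    # accumulate fractional (weighted) counts over all observed triples;
    # e.g. an observed HHH triple with weight 0.6967 contributes (0.6967, 3, 0)
    heads0 = sum(w for w, h, t in triples)        # coin 0 "heads" with weight w
    tails0 = sum(1 - w for w, h, t in triples)
    heads1 = sum(w * h for w, h, t in triples)    # coin 1 is tossed only when coin 0 was heads
    tails1 = sum(w * t for w, h, t in triples)
    heads2 = sum((1 - w) * h for w, h, t in triples)
    tails2 = sum((1 - w) * t for w, h, t in triples)
    # plain A/(A+B) estimates, applied to the weighted counts
    P0 = heads0 / (heads0 + tails0)
    P1 = heads1 / (heads1 + tails1)
    P2 = heads2 / (heads2 + tails2)
    return P0, P1, P2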
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian) but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihood of the processes. (Typically we do at least know the number of the processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
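Here is a toy sketch of that soft-assignment loop for a 1-D mixture of two Gaussians with known, equal (unit) variances; all numbers and names are made up, and the mixing weights are re-estimated as well:
import math
import random

random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(4.0, 1.0) for _ in range(200)])

mu = [-1.0, 1.0]     # initial guesses for the two means
pi = [0.5, 0.5]      # initial guesses for the mixing weights

def dens(x, m):      # N(m, 1) density
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

for _ in range(50):
    # E-step: soft "explanatory power" of each component for each point
    r = []
    for x in data:
        w = [pi[k] * dens(x, mu[k]) for k in range(2)]
        s = sum(w)
        r.append([wk / s for wk in w])
    # M-step: weighted maximum-likelihood updates
    for k in range(2):
        nk = sum(ri[k] for ri in r)
        mu[k] = sum(ri[k] * xi for ri, xi in zip(r, data)) / nk
        pi[k] = nk / len(data)

print(mu, pi)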
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the code below in Python; it works through the example given in the second paper you cite, by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer : The code does work but I am certain that it is not coded optimally. I am not a Python coder normally and have started using it two weeks ago.
import numpy as np
import math
#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou* ####
def get_mn_log_likelihood(obs, probs):
    """ Return the (log)likelihood of obs, given the probs"""
    # Multinomial Distribution Log PMF
    # ln (pdf)      = multinomial coeff * product of probabilities
    # ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
    multinomial_coeff_denom = 0
    prod_probs = 0
    for x in range(0, len(obs)):  # loop through state counts in each observation
        multinomial_coeff_denom = multinomial_coeff_denom + math.log(math.factorial(obs[x]))
        prod_probs = prod_probs + obs[x] * math.log(probs[x])

    multinomial_coeff = math.log(math.factorial(sum(obs))) - multinomial_coeff_denom
    likelihood = multinomial_coeff + prod_probs
    return likelihood
# 1st: Coin B, {HTTTHHTHTH}, 5H,5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H,1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H,2T
# 4th: Coin B, {HTHTTTHHTT}, 4H,6T
# 5th: Coin A, {THHHTHHHTH}, 7H,3T
# so, from MLE: pA(heads) = 0.80 and pB(heads)=0.45
# represent the experiments
head_counts = np.array([5,9,8,4,7])
tail_counts = 10-head_counts
experiments = list(zip(head_counts, tail_counts))
# initialise the pA(heads) and pB(heads)
pA_heads = np.zeros(100); pA_heads[0] = 0.60
pB_heads = np.zeros(100); pB_heads[0] = 0.50
# E-M begins!
delta = 0.001
j = 0 # iteration counter
improvement = float('inf')
while (improvement > delta):
    expectation_A = np.zeros((5,2), dtype=float)
    expectation_B = np.zeros((5,2), dtype=float)
    for i in range(0, len(experiments)):
        e = experiments[i]  # i'th experiment
        ll_A = get_mn_log_likelihood(e, np.array([pA_heads[j], 1-pA_heads[j]]))  # loglikelihood of e given coin A
        ll_B = get_mn_log_likelihood(e, np.array([pB_heads[j], 1-pB_heads[j]]))  # loglikelihood of e given coin B

        weightA = math.exp(ll_A) / (math.exp(ll_A) + math.exp(ll_B))  # corresponding weight of A proportional to likelihood of A
        weightB = math.exp(ll_B) / (math.exp(ll_A) + math.exp(ll_B))  # corresponding weight of B proportional to likelihood of B

        expectation_A[i] = np.dot(weightA, e)
        expectation_B[i] = np.dot(weightB, e)

    pA_heads[j+1] = sum(expectation_A)[0] / sum(sum(expectation_A))
    pB_heads[j+1] = sum(expectation_B)[0] / sum(sum(expectation_B))

    improvement = max(abs(np.array([pA_heads[j+1], pB_heads[j+1]]) - np.array([pA_heads[j], pB_heads[j]])))
    j = j + 1
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly, the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables, which indicate whether coin 1 or coin 2 was used. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of parameters would be trivial: p1 would be the proportion of heads in samples where c_i1=1, likewise for c_i2, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.
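For reference, with h_i and t_i denoting the number of heads and tails in sequence i, those updates take roughly this form (the standard two-component mixture updates, written in the notation above):
E step: g_i = E[c_i1] = \lambda * p1^h_i * (1-p1)^t_i / ( \lambda * p1^h_i * (1-p1)^t_i + (1-\lambda) * p2^h_i * (1-p2)^t_i ), and E[c_i2] = 1 - g_i.
M step: \lambda = (1/N) * sum_i g_i, p_1 = sum_i g_i*h_i / sum_i g_i*(h_i+t_i), p_2 = sum_i (1-g_i)*h_i / sum_i (1-g_i)*(h_i+t_i).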

Is there an O(1) algorithm for generating the result of a series of random events?

Let's say I've got a routine that, when invoked, will use a RNG and return True 30% of the time, or False otherwise. That's fairly simple. But what if I wanted to simulate how many True results I'd get if I called that routine 10 billion times?
Calling it 10 billion times in a loop would take too long. Multiplying 10 billion by 30% would yield the statistically expected result of 3 billion, but there would be no actual randomness involved. (And the odds that the result would be exactly 3 billion aren't all that great.)
Is there an algorithm for simulating the aggregate result of such a series of random events, such that if it were called multiple times, the results it gave would show the same distribution curve as actually running the random series it's simulating multiple times, that runs in O(1) time (ie. does not take longer to run as the length of the series to be simulated increases)?
I would say - it can be done in O(1)!
The binomial distribution which describes your situation can (in some circumstances) be approximated by a normal distribution. This can be done when both n*p and n*(1-p) are greater than 5, so for p = 0.3 it can be done for all n > 17. When n gets really big (like millions) the approximation gets better and better.
A random number with a normal distribution can be easily calculated using the Box–Muller transform. All you need for that are two random numbers between 0 and 1. The Box–Muller transform gives two random numbers from the N(0,1) distribution, called standard normal. N(μ, σ²) can then be achieved using the formula X = μ + σZ, where Z is standard normal.
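A minimal sketch of that approach in Python (assuming n*p and n*(1-p) are both large, and clamping the rounded result to the valid range):
import math
import random

def approx_binomial(n, p):
    # O(1) approximate Binomial(n, p) draw via the normal approximation
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    sample = int(round(random.gauss(mu, sigma)))   # one normal draw (Box-Muller is one way to get these)
    return min(max(sample, 0), n)                  # clamp to [0, n]

print(approx_binomial(10 * 10**9, 0.3))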
After deeper thought I can present this Python solution, which works in O(log(n)) and does not use any approximation. However, for large n the solution of @MarcinJuraszek is more suitable.
First I generate a Python list with the values of the cumulative binomial distribution function.
Then I use inverse transform sampling.
The cost of the first step is O(n) -- but you have to do it only once. The cost of the second step is just O(log(n)) -- essentially the cost of a binary search. As the code has many dependencies, you can take a look at this screenshot:
import numpy.random as random
import matplotlib.pyplot as pyplot
import scipy.stats as stats
import bisect
# This is the number of trials.
size = 6;
# this generates in memory an object, which contains
# a full information on desired binomial
# distribution. The object has to be generated only once.
# THIS WORKS IN O(n).
binomialInstance = stats.binom(size, 0.3)
# this pulls a probability mass function in the form of a python list
binomialTable = [binomialInstance.pmf(i) for i in range(size + 1)]
# this pulls a python list from binomialInstance, first
# processing it to produce a cumulative distribution function.
binomialCumulative = [binomialInstance.cdf(i) for i in range(size + 1)]
# this produces a plot of dots: first argument is x-axis (just
# subsequent integers), second argument is our table.
pyplot.plot([i for i in range(len(binomialTable))], binomialTable, 'ro')
pyplot.figure()
pyplot.plot([i for i in range(len(binomialCumulative))], binomialCumulative, 'ro')
# now, we can cheaply draw a sample from our distribution.
# we can use bisect to draw a random answer.
# THIS WORKS IN log(n).
cutOff = random.random()
print "this is our cut-off value: " + str(cutOff)
print "this is a number of successful trials: " + str(bisect.bisect(binomialCumulative, cutOff))
pyplot.show()
As others have mentioned, you can use the binomial distribution. But since you are dealing with a very large number of samples, you should consider using the normal approximation.

Biasing random number generator to some integer n with deviation b

Given an integer range R = [a, b] (where a >= 0 and b <= 100), a bias integer n in R, and some deviation d, what formula can I use to skew a random number generator towards n?
So for example if I had the numbers 1 through 10 inclusively and I don't specify a bias number, then I should in theory have equal chances of randomly drawing one of them.
But if I do give a specific bias number (say, 3), then the number generator should draw 3 more frequently than the other numbers.
And if I specify a deviation of say 2 in addition to the bias number, then the number generator should draw 1 through 5 more frequently than 6 through 10.
What algorithm can I use to achieve this?
I'm using Ruby if it makes it any easier/harder.
i think the simplest route is to sample from a normal (aka gaussian) distribution with the properties you want, and then transform the result:
generate a normal value with given mean and sd
round to nearest integer
if outside the given range (a normal can generate values over the entire range from -infinity to +infinity), discard and repeat
if you need to generate a normal from a uniform the simplest transform is "box-muller".
there are some details you may need to worry about. in particular, box muller is limited in range (it doesn't generate extremely unlikely values, ever). so if you give a very narrow range then you will never get the full range of values. other transforms are not as limited - i'd suggest using whatever ruby provides (look for "normal" or "gaussian").
also, be careful to round the value. 2.6 to 3.4 should all become 3, for example. if you simply discard the decimal (so 3.0 to 3.999 become 3) you will be biased.
if you're really concerned with efficiency, and don't want to discard values, you can simply invent something. one way to cheat is to mix a uniform variate with the bias value (so 9/10 times generate the uniform, 1/10 times return 3, say). in some cases, where you only care about average of the sample, that can be sufficient.
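a minimal sketch of that recipe (in python rather than ruby; bias is the target number, deviation the standard deviation):
import random

def biased_draw(a, b, bias, deviation):
    # integer in [a, b], concentrated around bias with spread deviation
    while True:
        x = int(round(random.gauss(bias, deviation)))  # normal draw, rounded to nearest integer
        if a <= x <= b:                                # discard values outside the range and repeat
            return x

print(biased_draw(1, 10, bias=3, deviation=2))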
For the first part "But if I do give a specific bias number (say, 3), then the number generator should be drawing 3 a more frequently than the other numbers.", a very easy solution:
import random

def randBias(a, b, biasedNum=None, bias=0):
    x = random.randint(a, b + bias)
    if x <= b:
        return x
    else:
        return biasedNum
For the second part, I would say it depends on the task. In a case where you need to generate a billion random numbers from the same distribution, I would calculate the probability of the numbers explicitly and use weighted random number generator (see Random weighted choice )
If you want an unimodal distribution (where the bias is just concentrated in one particular value of your range of number, for example, as you state 3), then the answer provided by andrew cooke is good---mostly because it allows you to fine tune the deviation very accurately.
If however you wish to have several biases---for instance you want a trimodal distribution, with the numbers a, (a+b)/2 and b occurring more frequently than others---then you would do well to implement weighted random selection.
A simple algorithm for this was given in a recent question on StackOverflow; its complexity is linear. Using such an algorithm, you would simply maintain a list, initially containing {a, a+1, a+2, ..., b-1, b} (so of size b-a+1), and when you want to add a bias towards X, you would add several copies of X to the list, depending on how much you want to bias it. Then you pick a random item from the list.
If you want something more efficient, the most efficient method is called the "Alias method", which was implemented very clearly in Python by Denis Bzowy; once your array has been preprocessed, it runs in constant time (but that means that you can't update the biases anymore once you've done the preprocessing---or you would have to reprocess the table).
The downside of both techniques is that, unlike with the Gaussian distribution, biasing towards X will not also bias somewhat towards X-1 and X+1. To simulate this effect you would have to do something such as
def addBias(x, L):
    # add extra copies of x and its neighbours, so that a uniform draw from L
    # favours values at and around x
    L = L + [x] * 5
    L = L + [x + 2]
    L = L + [x + 1] * 2
    L = L + [x - 1] * 3
    L = L + [x - 2]
    return L
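For completeness, a quick usage sketch (assuming the addBias above and the 1 through 10 range from the question):
import random

L = addBias(3, list(range(1, 11)))   # 1..10 with extra weight on and around 3
print(random.choice(L))              # values near 3 come up more often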
