Inverse inference on bayesian piece wise linear regression model in pymc - pymc

I am trying to perform inverse inference on a simple bayesian network for piece wise linear regression. That is, y is a piece wise linear function of x :Plot of Y vs X
and the Bayesian network looks like this: Bayesian Network Model
Here, X has a normal distribution, K is a discrete node that has a softmax distribution conditioned on X and Y is a mixture of linear gaussians based on the value of K (i.e. Pr(Y | K=i, X=x) ~ N(mu=w_i*x+b_i, s_i)).
I have learned the parameters of this model using EM algorithm. (The actual relationship of Y and X has five linear pieces, but I have learnt using 8 levels for the discrete node). And formed the pymc model using those parameters. Here is the code:
x=pymc.Normal('x', mu=0.5, tau=1.0/0.095)
#The probabilities of discrete node given x=x; Softmax distribution
epower = [-11.818,54.450,29.270,-13.038,73.541,28.466,-57.530,-101.568]
bias = [7.8228,-35.3859,-12.9512,12.8004,-48.1097,-13.2229,30.6079,39.3811]
#pymc.deterministic(plot=False)
def prob(epower=epower,bias=bias,x=x):
pr=[np.exp(ep*x+bb) for ep, bb in zip(epower, bias)]
return [pri/np.sum(pr) for pri in pr]
knode=pymc.Categorical('knode', p=prob)
#The weights of regression
wtsY=[15.022, -70.000, -14.996, 15.026, -70.000, -14.996, 34.937, 15.027]
#The unconditional means of Y
meansY=[5.9881,68.0000,23.9973,5.9861,68.0000,23.9972,-1.9809,1.9982]
sigmasY=[0.010189,0.010000,0.010033,0.010211,0.010000,0.010036,0.010380,0.010167]
#pymc.deterministic(plot=False)
def condmeanY(knode=knode, x=x,wtsY=wtsY, meansY=meansY):
return wtsY[knode]*x + meansY[knode]
#pymc.deterministic(plot=False)
def condsigmaY(knode=knode, sigmasY=sigmasY):
return sigmasY[knode]
y=pymc.Normal('y', mu=condmeanY, tau=1.0/condsigmaY, value=13.5, observed=True)
I want to predict x, when y is observed (inverse inference). As y is (approximately) non-linear in x, there will be multiple solutions for a given value of y. I expect that the obtained trace of x should show those multiple solutions. I have ensured that autocorrelation is very low (sample=2000, burn=1000). But I am not able to see multiple solutions. In the above example, for y=13.5, there are two possible solutions, x=0.5 and x=0.7. But the chain only wanders near 0.5. The histogram has only one peak, at 0.5.
Am I missing something?
EDIT: I came across this very relevant question:Solving inverse problems with PyMC. What I learned from the answer is that the prior of x, which I am assuming to be uni-modal Gaussian here, should have a non-parametric distribution and then the obtained samples after first iteration can be used to update it. Kernel density estimation (with gaussian kernel) has been suggested to obtain non-parametric stochastic from data. I incorporated this in my model but still there is no difference. One thing I noted is that if I do the inference multiple times, approx 50% of the times, I get 0.5, and 50% of the times, I get 0.7 (I am not sure if this was the case earlier as well, because I had not run that model many times to observe this.) But still, should I not see two peaks in the trace after first iteration only?
I also tried with a modified version of this model, where the edge from X to K is reversed. This is a classical conditional linear Gaussian model. Even with this model, I could not get multiple solutions visible in the trace. I am sort of stuck here. Please help.

Related

Where is the gaussian distribution function in the pseudocode below?

I was working on my final assignment, and I raised Box Muller Gaussian Distribution method to look for random numbers in unity software.
I am very confused about the gaussian distribution function on the pseudocode that I found in one of the journals.
Pseudocode algoritma Box-Muller(Sukajaya dkk., 2012) :
a. Generate uniform random number u, v in range [-1, 1]
b. Calculate s = u2 + v2
c. Looping step 2 until s < 1
d. Find normal random numbers `z0 = u. √((-2lns)/s)` and z1 = v . √(- (-2lns)/s)
I think the pseudocode only talks about the Box Muller and the Gaussian Distribution function is only for displaying diagrams of randomized numbers.
The Box-Muller algorithm does not contain a direct implementation of the Gaussian density formula. Instead, it produces outcomes which (cumulatively) follow that density. The results z0 and z1 produced by the algorithm are two independent Gaussian random values. If you iterate the algorithm hundreds or thousands of times and build a histogram of all the z values, it will start looking like the bell-shaped curve of a Gaussian distribution. The math behind it is beyond the scope of a StackOverflow post, so I'm going to advise that you just push the "I believe!" button, or see the Wikipedia article if you want more explanation and links to various original sources.
I'm not sure what you mean when you say "the Gaussian Distribution function is only for displaying diagrams of randomized numbers." The Gaussian is one of the most important modeling distributions out there because sums of values from all other distributions with finite variance will converge to the Gaussian in distribution. That means if you're studying averages (which are built from sums) or aggregates of lots of little errors, the Gaussian distribution does a great job of characterizing the results.

How is the decision boundary piloted after the parameters theta are updated

I have been learning about Machine learning algorithms this semester but I cant seem to understand how the parameters theta are used once Gradient decent is ran and they are updated, specifically in Logistic regression, In short my question is how is the decision boundary piloted after the parameters theta are updated.
After you use gradient descent to estimate your parameters theta, you can use those calculated parameters to make predictions.
For any input x, you can now calculate an predicted outcome y.
Ultimately the goal of machine learning is to make predictions.
So you take a whole bunch of observations x and y. Where x is your input and y is your output. In case of logistic regression, y is one of two values. For example, take a bunch of emails (x) that are labeled spam or no spam (y is 1 for spam and 0 for no spam). Or take a bunch of medical images that are labeled healthy or non healthy. ...
Feed all that data in your machine learning algorithm. Your algorithm (gradient descent for example), will calculate the theta coefficients.
Now you can use these theta coefficient to make predictions for new values of x. For example a new email that the system has never seen, using the theta coefficient, you can predict whether it is spam or not.
As far a plotting the decision boundary. This is probably feasible when you have two dimensions for x. You can have one dimension on each axis. And the resulting dots in your graph would be your y values. You could color them differently or show a different shape whether the result is one way or the other (i.e. your y is 0 or 1).
In practicality, these plots are useful during a lecture to get a general gist of what you're trying to do or accomplish. In reality, every input X would probably be a vector of many values (way more than 2). And thus it becomes impossible to plot a decision boundary.
Typically, logistic regression is parametrized in a following way:
cl(x|theta) = 1 / (1 + exp(-SUM_{i=1}^d theta_i x_i + theta_0 )) ) > 0.5
which is equivalent to
cl(x|theta) = sign(SUM_{i=1}^d theta_i x_i + theta_0 )
so once you get your theta, you use it to make a prediction by computing a simple weighted sum of your data representation and you check the sign of such number.

Two different Covariance Matrices?

I am a little bit confused!
Assume we have observed the Data X = [x1,..,xn] and they are vectors in R^d (with zero mean)
X^T denotes the transposed of X
Sometimes i see that the covariance matrix is in the form of 1/n * X*X^T (e.g. Principal Component Analysis) and sometimes is see it in the form 1/n * X^T*X (e.g. Kernel-Covariance matrix with kernel k(x,y) = x^T*y)
So why are 2 different ways or am i mixing up some things? Thank you for your help.
Well, the results differ in their dimension. One is a nxn-matrix, the other is a dxd-matrix.
I don't know the application for nxn-result, but when I used the covariance matrix to denote the variation of a vector in R^d (with measurements X = [x1,..,xn]) the result has to be a dxd-matrix, whose eigenvectors and -values indicate the main axes and extends of an "variance ellipsoid" (which must be given in dxd)
PS: Only half an answer, I know
Addendum:
Kernels are used for creating inner products of pairwise features, thus reducing the dimension to 1 to find patterns more easily. Have a look at
http://en.wikipedia.org/wiki/Kernel_principal_component_analysis#Introduction_of_the_Kernel_to_PCA
to get an impression, what the kernel covariance matrix is used for

Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate at running competitions of 10 km and 20 km with hilly courses i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
Comp1 Comp2 Comp3
User1 20min ?? 10min
User2 25min 20min 12min
User3 30min 25min ??
User4 30min ?? ??
I would like to complete the matrix above which has the size 1000x20 and a sparseness of 8 % (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of distributions). Moreover the correlation between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets
you most of the way to the solution. I'll codify your intuition that
the combination of ability of a user and difficulty of the course
yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users so that u_i is user i's
speed. Let the vector v denote the difficulty of the courses so
that v_j is course j's difficulty. Also when available, let t_ij be user i's time on
course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible
model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the
prediction error among the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian
algorithm. It
minimizes the above cost function by gradient descent. The gradient of
f wrt to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a Conjugate Gradient solver or BFGS
optimizer, like MATLAB's fmin_unc or scipy's optimize.fmin_ncg or
optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been
proposed. The resulting algorithms are just as simple to code up and seem to
work very well. Check out, for example Collaborative Filtering in a Non-Uniform World:
Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this, perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step normalize your data into a uniform function with 0 mean and 1 std deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the std deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is your loss function is:
total_error = 0
forall i,j
if (Participant i participated in Race j)
actual = ActualRaceResult(i,j)
predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term on the loss function, for example square of the lengths of the feature vectors)
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical task of missing data recovery. There exist some different methods. One of them which I can suggest is based on Self Organizing Feature Map (Kohonen's Map).
Below it's assumed that every athlet record is a pattern, and every competition data is a feature.
Basically, you should divide your data into 2 sets: first - with fully defined patterns, and second - patterns with partially lost features. I assume this is eligible because sparsity is 8%, that is you have enough data (92%) to train net on undamaged records.
Then you feed first set to the SOM and train it on this data. During this process all features are used. I'll not copy algorithm here, because it can be found in many public sources, and even some implementations are available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU its weigths, corresponding to missing features.
As alternative, you could not divide the whole data into 2 sets, but train the net on all patterns including the ones with missing features. But for such patterns learning process should be altered in the similar way, that is BMU should be calculated only on existing features in every pattern.
I think you can have a look at the recent low rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This algorithm minimizes the rank of your matrix M. P in the constraint is an operator that takes the known terms of your matrix M', and constraint those terms in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex hull ||M||_* Then you trade off the two terms by controlling the parameter lambda.

Is there a special type of multivariate regression for multiple-parameter predictions?

I am trying using multivariate regression to play basketball. Specificlly, I need to, based on X, Y, and distance from the target, predict the pitch, yaw, and cannon strength. I was thinking of using multivariate regression with multipule variables for each of the output parameter. Is there a better way to do this?
Also, should I use solve directly for the best fit, or use gradient descent?
ElKamina's answer is correct but one thing to note about this is that it is identical to doing k independent ordinary least squares regressions. That is, the same as doing a separate linear regression from X to pitch, from X to yaw, and from X to strength. This means, you are not taking advantage of correlations between the output variables. This may be fine for your application, but one alternative that does take advantage of correlations in the output is reduced rank regression(a matlab implementation here), or somewhat related, you can explicitly uncorrelate y by projecting it onto its principle components (see PCA, also called PCA whitening in this case since you aren't reducing the dimensionality).
I highly recommend chapter 6 of Izenman's textbook "Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning" for a fairly high level overview of these techniques. If you're at a University it may be available online through your library.
If those alternatives don't perform well, there are many sophisticated non-linear regression methods that have multiple output versions (although most software packages don't have the multivariate modifications) such as support vector regression, Gaussian process regression, decision tree regression, or even neural networks.
Multivariate regression is equivalent to doing the inverse of the covariance of the input variable set. Since there are many solutions to inverting the matrix (if the dimensionality is not very high. Thousand should be okay), you should go directly for the best fit instead of gradient descent.
n be the number of samples, m be the number of input variables and k be the number of output variables.
X be the input data (n,m)
Y be the target data (n,k)
A be the coefficients you want to estimate (m,k)
XA = Y
X'XA=X'Y
A = inverse(X'X)X'Y
X' is the transpose of X.
As you can see, once you find the inverse of X'X you can calculate the coefficients for any number of output variables with just a couple of matrix multiplications.
Use any simple math tools to solve this (MATLAB/R/Python..).

Resources