How is the decision boundary plotted after the parameters theta are updated - algorithm

I have been learning about machine learning algorithms this semester, but I can't seem to understand how the parameters theta are used once gradient descent has run and they are updated, specifically in logistic regression. In short, my question is: how is the decision boundary plotted after the parameters theta are updated?

After you use gradient descent to estimate your parameters theta, you can use those calculated parameters to make predictions.
For any input x, you can now calculate a predicted outcome y.
Ultimately the goal of machine learning is to make predictions.
So you take a whole bunch of observations x and y, where x is your input and y is your output. In the case of logistic regression, y is one of two values. For example, take a bunch of emails (x) that are labeled spam or not spam (y is 1 for spam and 0 for not spam). Or take a bunch of medical images that are labeled healthy or unhealthy. ...
Feed all that data into your machine learning algorithm. Your algorithm (gradient descent, for example) will calculate the theta coefficients.
Now you can use these theta coefficients to make predictions for new values of x. For example, for a new email that the system has never seen, you can use the theta coefficients to predict whether or not it is spam.
As far as plotting the decision boundary goes, this is feasible when x has two dimensions: you put one dimension on each axis, and the resulting dots in your graph are your observations, colored differently or drawn with a different marker depending on whether y is 0 or 1.
In practice, these plots are useful during a lecture to get a general gist of what you're trying to accomplish. In reality, every input x will usually be a vector of many values (far more than 2), and then it becomes impossible to plot the decision boundary directly.
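To make the two-dimensional case concrete, here is a minimal plotting sketch (the data, theta values and variable names are made up for illustration; theta is assumed to have already been estimated by gradient descent):

import numpy as np
import matplotlib.pyplot as plt

# Made-up 2D data: X is (n, 2), y holds the 0/1 labels, theta = [theta0, theta1, theta2]
X = np.array([[1.0, 2.0], [2.0, 3.5], [3.0, 1.0], [4.0, 4.5]])
y = np.array([0, 0, 1, 1])
theta = np.array([-5.0, 1.0, 0.5])

# Scatter the two classes with different markers
plt.scatter(X[y == 0, 0], X[y == 0, 1], marker='o', label='y = 0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], marker='x', label='y = 1')

# The decision boundary is the line where theta0 + theta1*x1 + theta2*x2 = 0
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(theta[0] + theta[1] * x1) / theta[2]
plt.plot(x1, x2, label='decision boundary')
plt.legend()
plt.show()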

Typically, logistic regression is parametrized in the following way:
cl(x|theta) = 1 / (1 + exp(-(SUM_{i=1}^d theta_i x_i + theta_0))) > 0.5
which is equivalent to
cl(x|theta) = sign(SUM_{i=1}^d theta_i x_i + theta_0 )
so once you get your theta, you make a prediction by computing a simple weighted sum of your data representation and checking the sign of that number.
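As a rough sketch (not anything from the answer itself), the prediction step in Python could look like this, assuming theta_0 is the intercept and theta is the weight vector produced by gradient descent:

import numpy as np

def predict(x, theta, theta_0):
    # Weighted sum of the features plus the intercept
    z = np.dot(theta, x) + theta_0
    # Logistic (sigmoid) probability of the positive class
    p = 1.0 / (1.0 + np.exp(-z))
    # Thresholding at 0.5 is the same as checking the sign of z
    return 1 if p > 0.5 else 0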

Related

Algorithm to approximate non-linear equation system solution

I'm looking for an algorithm to approximate the solution of the following equation system:
The equations have to be solved on an embedded system, in C++.
Background:
We measure the 2 variables X_m and Y_m, so they are known
We want to compute the real values: X_r and Y_r
X and Y are real numbers
We measured the functions f_xy and f_yx during calibration. We have at most 18 points for each function.
It's possible to store the functions as a look-up table
I tried to approximate the functions with 2nd order polynomials and compute the solution, but it was not accurate enough, because of the fitting error.
I am looking for an algorithm to approximate the results in an embedded system in C++, but I don't even know what to search for. I found some papers on the theory link, but I think there must be an easier way to do it in my case.
Also: how can I determine during calibration, whether the functions can be solved with the algorithm?
Fitting a second-order polynomial through f_xy? That's generally not viable. The go-to solution would be Runge-Kutta interpolation: you pick two known values to the left and two to the right of your argument, with weights 1, 2, 2, 1. This gives you an estimate of d(f_xy)/dx which you can then use for interpolation.
The normal way is Newton's iterations, starting from the initial approximation (Xm, Ym) [assuming that the f are mere corrections]. Due to the particular shape of the equations, you can reduce the problem to twice a single equation in a single unknown.
Xr = Xm - Fyx(Ym - Fxy(Xr))
Yr = Ym - Fxy(Xm - Fyx(Yr))
The iterations read
Xr <-- Xr - (Xm - Fyx(Ym - Fxy(Xr))) / (1 + Fxy'(Ym - Fxy(Xr)).Fxy'(Xr))
Yr <-- Yr - (Ym - Fxy(Xm - Fyx(Yr))) / (1 + Fyx'(Xm - Fyx(Yr)).Fyx'(Yr))
So you should tabulate the derivatives of f as well, though accuracy there is not as critical as for the computation of the f themselves.
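A minimal sketch of one of these scalar solves, assuming Fxy and Fyx are available as callables (e.g. interpolants built from the calibration table); here the derivative of the residual is estimated by a finite difference instead of tabulated derivatives, which is just one possible choice:

def solve_xr(Xm, Ym, Fxy, Fyx, tol=1e-9, max_iter=20, h=1e-6):
    # Solve Xr = Xm - Fyx(Ym - Fxy(Xr)) for Xr with Newton's method
    Xr = Xm  # start from the measured value
    for _ in range(max_iter):
        residual = Xr - (Xm - Fyx(Ym - Fxy(Xr)))
        residual_h = (Xr + h) - (Xm - Fyx(Ym - Fxy(Xr + h)))
        slope = (residual_h - residual) / h   # finite-difference derivative of the residual
        step = residual / slope
        Xr -= step
        if abs(step) < tol:
            break
    return Xr

The Yr equation is handled symmetrically by swapping the roles of Fxy and Fyx.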
If the calibration points aren't too noisy, I would recommend cubic spline interpolation, for which you can precompute all coefficients. At the same time these coefficients allow you to estimate the derivative (as the corresponding quadratic interpolant, which is continuous).
In principle (unless the points are uniformly spaced), you need to perform a dichotomic search to determine the interval in which the argument lies. But here you will evaluate the functions at nearby values, so that a linear search from the previous location should be better.
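For instance, with scipy (just a sketch of the precomputation; on the embedded target you would precompute the same spline coefficients offline and evaluate them in C++), using made-up calibration points:

import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical calibration data: at most 18 points per function
x_cal = np.array([0.0, 1.0, 2.5, 4.0, 6.0])
fxy_cal = np.array([0.00, 0.12, 0.31, 0.52, 0.80])

Fxy = CubicSpline(x_cal, fxy_cal)   # piecewise cubic interpolant
dFxy = Fxy.derivative()             # its derivative, a continuous piecewise quadratic

print(Fxy(1.7), dFxy(1.7))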
A different way to address the problem is to consider the bivariate solution surfaces Xr = G(Xm, Ym) and Yr = H(Xm, Ym), which you compute on a grid of points. If the surfaces are smooth enough, you can use a coarse grid.
So by any method (such as the one above), you precompute the solutions at each grid node, as well as the coefficients of some interpolant in the X and Y directions. I recommend a cubic spline, again.
Now, to interpolate inside a grid cell, you combine the two univariate interpolants into a bivariate one by means of the Coons formula: https://en.wikipedia.org/wiki/Coons_patch.
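If hand-coding the Coons formula is not essential, a bicubic spline over the precomputed grid is a ready-made alternative; here is a sketch with made-up grid data (RectBivariateSpline stands in for the Coons-patch blending described above):

import numpy as np
from scipy.interpolate import RectBivariateSpline

# Hypothetical coarse grid of measured values and precomputed Xr solutions
xm_grid = np.linspace(0.0, 10.0, 6)
ym_grid = np.linspace(0.0, 10.0, 6)
Xr_grid = np.add.outer(xm_grid, 0.05 * ym_grid)   # stand-in for the solutions at the grid nodes

G = RectBivariateSpline(xm_grid, ym_grid, Xr_grid, kx=3, ky=3)
print(G.ev(3.3, 7.1))   # Xr estimate at an arbitrary (Xm, Ym)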

How can I encode double values in genetic algorithm?

I want to use a neural network to teach cars to drive around a racetrack. In my opinion the best way to train the net is with a genetic algorithm, but in every tutorial the genotype is encoded with 0s and 1s (binary values). In my net the weights are double values, so the genotype looks like 3.12; 9.12; 0.83; -0.73, etc.
So my question is:
Should I encode each weight as a binary value? I think I can use double values, but then I don't know how to mutate them. A binary value I can flip from 0 to 1 and from 1 to 0, but what about a double?
From a theoretical point of view yes, you can.
The condition, anyhow, is that you correctly define all the operations (like crossover, mutation, etc.) for continuous values too.
The answer then is yes, if your software implementation enables you to do so.
Let me draw a simplified example.
If the algorithm aims at identifying the best fit for the sine function, and you can use shapes [triangle, square, half-circle], y magnitude and x displacement, you can have a chromosome of let's say N shapes to be summed together.
In such a case x and y must both be doubles: you can mutate them, e.g., by adding a random number in a sensible range, and perform crossover by exchanging x or y with the partner, or even a full tuple (shape-x-y).
I would say that the mantra is to keep things coherent and let individuals mutate in a way that is sensible for your model (a bad choice would be to cross x with y).
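A small sketch of what mutation and crossover can look like with real-valued genes (the Gaussian "creep" mutation and uniform crossover below are common choices, not the only ones):

import random

def mutate(genome, sigma=0.1, rate=0.2):
    # Perturb each double with probability `rate` by adding Gaussian noise
    return [g + random.gauss(0.0, sigma) if random.random() < rate else g
            for g in genome]

def crossover(parent_a, parent_b):
    # Uniform crossover: each gene is taken from one of the two parents
    return [a if random.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]

weights = [3.12, 9.12, 0.83, -0.73]
child = mutate(crossover(weights, [0.5, -1.2, 2.0, 0.1]))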

Inverse inference on bayesian piece wise linear regression model in pymc

I am trying to perform inverse inference on a simple Bayesian network for piecewise linear regression. That is, y is a piecewise linear function of x (see the plot of Y vs X),
and the Bayesian network looks like this (see the Bayesian network model diagram):
Here, X has a normal distribution, K is a discrete node that has a softmax distribution conditioned on X and Y is a mixture of linear gaussians based on the value of K (i.e. Pr(Y | K=i, X=x) ~ N(mu=w_i*x+b_i, s_i)).
I have learned the parameters of this model using the EM algorithm. (The actual relationship of Y and X has five linear pieces, but I have learned it using 8 levels for the discrete node.) I then formed the pymc model using those parameters. Here is the code:
import numpy as np
import pymc

x = pymc.Normal('x', mu=0.5, tau=1.0/0.095)

# The probabilities of the discrete node given x; softmax distribution
epower = [-11.818, 54.450, 29.270, -13.038, 73.541, 28.466, -57.530, -101.568]
bias = [7.8228, -35.3859, -12.9512, 12.8004, -48.1097, -13.2229, 30.6079, 39.3811]

@pymc.deterministic(plot=False)
def prob(epower=epower, bias=bias, x=x):
    pr = [np.exp(ep*x + bb) for ep, bb in zip(epower, bias)]
    return [pri/np.sum(pr) for pri in pr]

knode = pymc.Categorical('knode', p=prob)

# The weights of the regression
wtsY = [15.022, -70.000, -14.996, 15.026, -70.000, -14.996, 34.937, 15.027]
# The unconditional means of Y
meansY = [5.9881, 68.0000, 23.9973, 5.9861, 68.0000, 23.9972, -1.9809, 1.9982]
sigmasY = [0.010189, 0.010000, 0.010033, 0.010211, 0.010000, 0.010036, 0.010380, 0.010167]

@pymc.deterministic(plot=False)
def condmeanY(knode=knode, x=x, wtsY=wtsY, meansY=meansY):
    return wtsY[knode]*x + meansY[knode]

@pymc.deterministic(plot=False)
def condsigmaY(knode=knode, sigmasY=sigmasY):
    return sigmasY[knode]

y = pymc.Normal('y', mu=condmeanY, tau=1.0/condsigmaY, value=13.5, observed=True)
I want to predict x, when y is observed (inverse inference). As y is (approximately) non-linear in x, there will be multiple solutions for a given value of y. I expect that the obtained trace of x should show those multiple solutions. I have ensured that autocorrelation is very low (sample=2000, burn=1000). But I am not able to see multiple solutions. In the above example, for y=13.5, there are two possible solutions, x=0.5 and x=0.7. But the chain only wanders near 0.5. The histogram has only one peak, at 0.5.
Am I missing something?
EDIT: I came across this very relevant question: Solving inverse problems with PyMC. What I learned from the answer is that the prior of x, which I am assuming here to be a unimodal Gaussian, should instead be a non-parametric distribution, and the samples obtained after the first iteration can then be used to update it. Kernel density estimation (with a Gaussian kernel) has been suggested to obtain a non-parametric stochastic from data. I incorporated this in my model, but there is still no difference. One thing I noted is that if I run the inference multiple times, approximately 50% of the time I get 0.5 and 50% of the time I get 0.7. (I am not sure whether this was the case earlier as well, because I had not run that model many times.) But still, should I not see two peaks in the trace after the first iteration alone?
I also tried with a modified version of this model, where the edge from X to K is reversed. This is a classical conditional linear Gaussian model. Even with this model, I could not get multiple solutions visible in the trace. I am sort of stuck here. Please help.

Optimize integral of f(x)exp(-x) from x=0 to infinity

I need a robust integration algorithm for f(x)exp(-x) between x=0 and infinity, with f(x) a positive, differentiable function.
I do not know the array x a priori (it's an intermediate output of my routine). The x array is typically ~log-equispaced, but highly irregular.
Currently, I'm using the Simpson algorithm, but my problem is that the domain is often highly undersampled by the x array, which produces unrealistic values for the integral.
On each run of my code I need to do this integration thousands of times (each with a different set of x values), so I need to find an efficient and robust way to integrate this function.
More details:
The x array can have between 2 and N points (N known). The first value is always x[0] = 0.0. The last point is always greater than a tunable threshold x_max (chosen such that exp(-x_max) is approximately 0). I only know the values of f at the points x[i] (though f is a smooth function).
My first idea was to do a Laguerre-Gauss quadrature integration. However, this algorithm seems to be highly unreliable when one does not use the optimal quadrature points.
My current idea is to add a set of auxiliary points, interpolating f, such that the Simpson algorithm becomes more stable. If I do this, is there an optimal selection of auxiliary points?
I'd appreciate any advice,
Thanks.
Set t=1-exp(-x), then dt = exp(-x) dx and the integral value is equal to
integral[ f(-log(1-t)) , t=0..1 ]
which you can evaluate with the standard Simpson formula and hopefully get good results.
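A sketch of that substitution applied directly to the sampled points, assuming scipy is available and that x[0] = 0 and the last x is large enough that the dropped tail beyond it is negligible (the f used here is just a stand-in):

import numpy as np
from scipy.integrate import simpson

def integrate_f_exp(x, f):
    # x: increasing sample points with x[0] = 0 and a large last value; f: the values f(x[i])
    t = 1.0 - np.exp(-x)      # maps [0, infinity) onto [0, 1)
    return simpson(f, x=t)    # Simpson's rule on the transformed integrand f(-log(1-t))

x = np.array([0.0, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0])
f = np.sqrt(x) + 1.0
print(integrate_f_exp(x, f))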
Note that piecewise linear interpolation will always result in an order-2 error for the integral, as the result amounts to the trapezoid formula even if the method is Simpson's. For better accuracy with the Simpson method you will need higher interpolation degrees, ideally cubic splines. Cubic Bezier polynomials with estimated derivatives to compute the control points could be a fast compromise.

Number of Passes for Perceptron

I am trying to implement the Perceptron algorithm, but I am not able to figure out the following points:
what is the ideal value for the number of iterations?
is this algorithm suitable for large volumes of data?
does the threshold change with iterations?
if yes, what difference does it make in the final output?
The Perceptron is not a specific algorithm; it is the name of a cluster of algorithms. There are two major differences between these algorithms.
1. Integrate and fire rule
Let the input vector be x, the weight vector be w, the threshold be t and the output value be P(x). There are various functions to calculate P(x):
binary: P(x) = 1 (if w * x >= t) or 0 (otherwise)
semi-linear: P(x) = w * x (if w * x >= t) or 0 (otherwise)
hard limit: P(x) = t (if w * x >= t) or w * x (if 0 < w * x < t) or 0 (otherwise)
sigmoid: P(x) = 1 / (1 + e^(-w * x))
and many others. So it's hard to say what difference the threshold makes in the final output, because it depends on which integrate-and-fire function you use.
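In code, the variants above could be written roughly as follows (w and x as numpy arrays, t a scalar threshold; this is only a sketch):

import numpy as np

def binary(w, x, t):
    return 1.0 if np.dot(w, x) >= t else 0.0

def semi_linear(w, x, t):
    s = np.dot(w, x)
    return s if s >= t else 0.0

def hard_limit(w, x, t):
    s = np.dot(w, x)
    return t if s >= t else (s if s > 0 else 0.0)

def sigmoid(w, x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))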
2. Learning rule
The learning rule of the Perceptron varies too. The simplest and most common one is
w -> w + a * x * (D(x) - P(x))
where a is the size of a learning step and D(x) is the expected output for x. So it's also hard to say what the ideal number of iterations should be, because it depends on the value of a and on how many training samples you have.
Therefore, does the threshold change with iterations? It also depends. The simple and common learning rule above doesn't modify the threshold during training, but there are other learning rules that do modify it.
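A minimal training-loop sketch using that rule with the binary fire function (variable names and data layout are illustrative, not from the question):

import numpy as np

def train_perceptron(X, D, a=0.1, epochs=10, t=0.0):
    # X: (n_samples, n_features), D: desired 0/1 outputs, a: learning step, t: threshold
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                          # number of passes over the data
        for x, d in zip(X, D):
            p = 1.0 if np.dot(w, x) >= t else 0.0    # binary integrate-and-fire
            w = w + a * x * (d - p)                  # w -> w + a * x * (D(x) - P(x))
    return w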
By the way, you also asked whether this algorithm is suitable for large volumes of data. The main metric for the suitability of a classifier for a certain dataset is the linear separability of the dataset, not its scale. Keep in mind that the single-layer Perceptron performs very badly on datasets that are not linearly separable. The scale of the dataset does not matter that much.
