Understanding the Gradient Descent Algorithm

I'm learning machine learning. While reading about linear regression with one variable, I got confused about the gradient descent algorithm.
Suppose we are given a problem with a training set in which each pair $(x^{(i)},y^{(i)})$ represents (feature/input variable, target/output variable). Our goal is to create a hypothesis function for this training set that can make predictions.
Hypothesis Function:
$$h_{\theta}(x)=\theta_0 + \theta_1 x$$
Our target is to choose $(\theta_0,\theta_1)$ so that $h_{\theta}(x)$ best approximates the target values on the training set.
Cost Function:
$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2$$
$$J(\theta_0,\theta_1)=\frac{1}{2}\times \text{Mean Squared Error}$$
We have to minimize $J(\theta_0,\theta_1)$ to get the values $(\theta_0,\theta_1)$ to put into our hypothesis function. We can do that by applying the gradient descent algorithm on the plot $(\theta_0,\theta_1,J(\theta_0,\theta_1))$.
My question is how we can choose $(\theta_0,\theta_1)$ and plot the surface $(\theta_0,\theta_1,J(\theta_0,\theta_1))$. In the online lecture I was watching, the instructor explained everything else but didn't mention where the plot comes from.

At each iteration you will have some $h_\theta$, and you will calculate the value of $\frac{1}{2n}\sum_{i=1}^{n}(h_\theta(x^{(i)})-y^{(i)})^2$, summing over every sample in the training set (here $n$ is the training set size, written $m$ in the question).
At each iteration $h_\theta$ is known, and the values $(x,y)$ of each training sample are known, so it is easy to calculate the above.
For each iteration, you have a new value for $\theta$, and you can calculate the new MSE.
The plot itself will have the iteration number on the x axis and the MSE on the y axis.
As a side note, while you can use gradient descent here, there is no need to. This cost function is convex and, provided $X^TX$ is invertible, has a single minimum with a well-known closed form: $\theta = (X^TX)^{-1}X^Ty$, where $y$ is the vector of target values of the training set ($n \times 1$ for a training set of size $n$) and $X$ is the $n \times 2$ matrix whose rows are $X_i=(1,x_i)$.
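To make the two points above concrete, here is a minimal sketch in Python with NumPy. The toy data, the learning rate `alpha`, the iteration count, and the zero starting point are all assumptions chosen for illustration. It records the cost at every iteration (the values you would plot against the iteration number) and compares the result with the closed-form solution.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2n) * sum((X @ theta - y)^2)."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent; returns the final theta and the cost per iteration."""
    theta = np.zeros(X.shape[1])          # arbitrary starting point (assumption)
    history = []
    for _ in range(iterations):
        grad = X.T @ (X @ theta - y) / len(y)   # gradient of J at the current theta
        theta = theta - alpha * grad
        history.append(cost(theta, X, y))       # one point of the cost-vs-iteration plot
    return theta, history

# Made-up training set lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])       # n x 2 matrix with rows (1, x_i)

theta_gd, history = gradient_descent(X, y)
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)  # normal equation, no iterations
```

Plotting `history` against its index gives the iteration-versus-cost curve. To draw the surface $(\theta_0,\theta_1,J)$ instead, you could evaluate `cost` on a grid of $(\theta_0,\theta_1)$ values.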

Related

Inverse of Laplacian and Gaussian Noise

Given a set of data points, I modify the data points by adding Laplacian or Gaussian noise to them.
I am wondering whether there exist mathematical inverse functions able to recover the original data points from the noisy ones.
My understanding is that we can reconstruct only an estimate of the original data points, which has a certain probability p of being equal to the original data points.
If this is the case, how can I calculate such a probability p?

Uncertainty on pose estimate when minimizing measurement errors

Let's say I want to estimate the camera pose for a given image I and I have a set of measurements (e.g. 2D points ui and their associated 3D coordinates Pi) for which I want to minimize the error (e.g. the sum of squared reprojection errors).
My question is: How do I compute the uncertainty on my final pose estimate?
To make my question more concrete, consider an image I from which I extracted 2D points ui and matched them with 3D points Pi. Let Tw denote the camera pose for this image, which I will be estimating, and piT the transformation mapping the 3D points to their projected 2D points. Here is a little drawing to clarify things:
My objective statement is as follows:
There exist several techniques to solve the corresponding non-linear least squares problem. Suppose I use the following (approximate pseudo-code for the Gauss-Newton algorithm):
I read in several places that JrT.Jr could be considered an estimate of the covariance matrix for the pose estimate. Here is a list of more specific questions:
Can anyone explain why this is the case and/or point to a scientific document explaining this in detail?
Should I be using the value of Jr on the last iteration, or should the successive JrT.Jr be somehow combined?
Some people say that this is actually an optimistic estimate of the uncertainty, so what would be a better way to estimate the uncertainty?
Thanks a lot, any insight on this will be appreciated.
The full mathematical argument is rather involved, but in a nutshell it goes like this:
The product (Jt * J) of the transposed Jacobian matrix of the reprojection error at the optimum with the Jacobian itself is an approximation of the Hessian matrix of the least squares error. The approximation ignores terms of order three and higher in the Taylor expansion of the error function at the optimum. See here (pp. 800-801) for proof.
The inverse of the Hessian matrix is an approximation of the covariance matrix of the reprojection errors in a neighborhood of the optimal values of the parameters, under a local linear approximation of the parameters-to-errors transformation (p. 814 of the above reference).
I do not know where the "optimistic" comment comes from. The main assumption underlying the approximation is that the behavior of the cost function (the reprojection error) in a small neighborhood of the optimum is approximately quadratic.
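As a rough illustration of how the inverse of Jt * J is typically turned into a covariance estimate for the fitted parameters, here is a sketch in Python with NumPy. The scaling by the residual variance assumes independent measurement noise of equal variance, which is a standard assumption but is not stated in the answer above. The function name `covariance_from_jacobian` and the line-fit data are hypothetical. For a line fit the Jacobian of the residuals is simply the design matrix, which keeps the example short; for a camera pose you would pass the Jacobian of the reprojection residuals evaluated at the final estimate.

```python
import numpy as np

def covariance_from_jacobian(J, residuals):
    """Approximate parameter covariance as sigma^2 * (J^T J)^{-1}.

    J         : (m, p) Jacobian of the residuals at the optimum
    residuals : (m,) residual vector at the optimum
    Assumes i.i.d. noise; sigma^2 is estimated from the residuals.
    """
    m, p = J.shape
    sigma2 = float(residuals @ residuals) / (m - p)   # residual variance estimate
    return sigma2 * np.linalg.inv(J.T @ J)

# Hypothetical example: noisy samples of y = 1 + 2x
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)

theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares optimum
residuals = X @ theta - y
cov = covariance_from_jacobian(X, residuals)    # 2 x 2 covariance of (theta_0, theta_1)
std_errors = np.sqrt(np.diag(cov))              # one-sigma uncertainty per parameter
```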

Build a linear approximation for an unknown function

I have some unknown function f(x), and I am using MATLAB to calculate 2000 points on its graph. I need a piecewise linear function g with 20 to 30 segments that best fits the original function. How could I do this in an acceptable way? The solution space is too large to traverse exhaustively, and I can't think of a good heuristic to shrink it effectively.
Here is the code from which the function is derived:
x = sym('x', 'real');
inventory = sym('inventory', 'real');
demand = sym('demand', 'real');
f1 = 1/(sqrt(2*pi))*(-x)*exp(-(x - (demand - inventory)).^2./2);
f2 = 20/(sqrt(2*pi))*(x)*exp(-(x - (demand - inventory)).^2./2);
expectation_expression = int(f1, x, -inf, 0) + int(f2, x, 0, inf);
Depending on what your idea of a good approximation is, there may be a dynamic programming solution for this.
For example, given 2000 points and corresponding values, we wish to find the piecewise linear approximation with 20 segments which minimizes the sum of squared deviations between the true value at each point and the result of the linear approximation.
Work along the 2000 points from left to right, and at each point calculate for i=1 to 20 the total error from the far left to that point for the best piecewise linear approximation using i segments.
You can work out the values at position n+1 using the values calculated for points to the left of that position, that is, points 1..n. For each value of i, consider all points to its left, say points j < n+1. Work out the error contribution resulting from a linear segment running from point j to point n+1. Add to that the value you have worked out for the best possible error using i-1 segments at point j (or possibly point j-1, depending on exactly how you define your piecewise linear approximation). If you now take the minimum such value over all possible j, you have calculated the error from the best possible piecewise linear approximation using i segments for the first n+1 points.
When you have worked out the best value for the first 2000 points using 20 segments you have solved the problem, and you can work back along this table to find out where the segments are - or, if this is inconvenient, you can save extra information as you go along to make this easier.
I believe similar approaches will minimize the sum of absolute deviations, or minimize the maximum deviation at any point, provided you can solve the corresponding problems for a single line. I have implicitly assumed you can fit a straight line to minimize the sum of squared errors, which is of course a standard least-squares line fit. Minimizing the absolute deviations from a straight line is an exercise in convex optimization which I would attempt with iteratively reweighted least squares. Minimizing the maximum absolute deviation is linear programming.
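The dynamic program described above can be sketched as follows in Python with NumPy. This is one possible reading, not a definitive implementation: it assumes each segment is an independent least-squares line over a contiguous run of at least two points (so neighbouring segments need not meet), and the names `segment_sse` and `best_segmentation` are made up for this example. Prefix sums make the error of any candidate segment available in constant time, so the total work is O(k * n^2).

```python
import numpy as np

def segment_sse(prefix, i, j):
    """Squared error of the least-squares line through points i..j-1."""
    n = j - i
    sx, sy, sxx, sxy, syy = prefix[j] - prefix[i]
    cxx = sxx - sx * sx / n          # centred sums
    cxy = sxy - sx * sy / n
    cyy = syy - sy * sy / n
    if cxx <= 1e-12:                 # all x equal: best fit is a constant
        return max(cyy, 0.0)
    return max(cyy - cxy * cxy / cxx, 0.0)

def best_segmentation(x, y, k):
    """Return (total SSE, boundaries) for the best fit with k segments."""
    n = len(x)
    cols = np.column_stack([x, y, x * x, x * y, y * y])
    prefix = np.vstack([np.zeros(5), np.cumsum(cols, axis=0)])
    inf = float("inf")
    best = np.full((k + 1, n + 1), inf)   # best[s, e]: first e points, s segments
    back = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for s in range(1, k + 1):
        for e in range(2 * s, n + 1):
            for j in range(2 * (s - 1), e - 1):   # last segment covers points j..e-1
                if best[s - 1, j] == inf:
                    continue
                total = best[s - 1, j] + segment_sse(prefix, j, e)
                if total < best[s, e]:
                    best[s, e] = total
                    back[s, e] = j
    bounds, e = [n], n                    # walk the table back to recover boundaries
    for s in range(k, 0, -1):
        e = int(back[s, e])
        bounds.append(e)
    return float(best[k, n]), bounds[::-1]

# Made-up data: two exact lines that change at index 6
x = np.arange(12, dtype=float)
y = np.where(x < 6, x, 20.0 - 2.0 * x)
sse, bounds = best_segmentation(x, y, k=2)
```

For 2000 points and 20 to 30 segments the pure-Python inner loop will be slow; vectorizing the loop over `j` would be the first thing to try.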

Find the diameter of a set of n points in d-dimensional space

I am interested in finding the diameter of two point sets in 128 dimensions. The first has 10000 points and the second 1000000. For that reason I would like to do something better than the naive approach, which takes O(n²). The algorithm should be able to handle any number of points and dimensions, but I am currently most interested in these two particular data sets.
I care more about speed than accuracy, so my idea is to find the (approximate) bounding box of the point set by computing the min and max value per coordinate, which takes O(n*d) time. Then, if I find the diameter of this box, the problem is solved.
In the 3D case, I could find the diagonal of one face, since I know its two edges, and then apply the Pythagorean theorem with the remaining edge, which is perpendicular to this face. I am not sure about this, however, and I can't see how to generalize it to d dimensions.
An interesting answer can be found here, but it seems to be specific to 3 dimensions and I want a method for d dimensions.
Interesting paper: On computing the diameter of a point set in high dimensional Euclidean space. However, implementing the algorithm seems too much for me at this stage.
The classic 2-approximation algorithm for this problem, with running time O(nd), is to choose an arbitrary point and then return the maximum distance to another point. The diameter is no smaller than this value and no larger than twice this value.
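A minimal sketch of this 2-approximation in Python with NumPy follows. The function name `diameter_2approx` and the random test data are assumptions for illustration, and the first point is used as the "arbitrary" one.

```python
import numpy as np

def diameter_2approx(points):
    """Return r such that r <= diameter <= 2 * r, in O(n * d) time."""
    anchor = points[0]                                   # any point will do
    return float(np.max(np.linalg.norm(points - anchor, axis=1)))

# Small random set so the exact diameter can be brute-forced for comparison
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 128))
r = diameter_2approx(pts)

# O(n^2) reference value, only feasible because n is small here
diffs = pts[:, None, :] - pts[None, :, :]
exact = float(np.max(np.linalg.norm(diffs, axis=2)))
```

The guarantee follows from the triangle inequality: any two points are each within `r` of the anchor, so they are at most `2 * r` apart.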
I would like to add a comment, but not enough reputation for that...
I just want to warn other readers that the "bounding box" solution is very inaccurate. Take for example the Euclidean ball of radius one. This set has diameter two, but its bounding box is [-1, 1]^d, which has diameter twice the square root of d. For d = 128, this is already a very bad approximation.
For a crude estimate, I would stay with David Eisenstat's answer.
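The bounding-box warning above can be checked numerically. The sketch below (Python with NumPy; sample size and seed are arbitrary choices) draws points on the unit sphere in 128 dimensions, whose diameter cannot exceed 2, and measures the diagonal of their bounding box.

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)

# Points on the unit sphere: any two of them are at most 2 apart
pts = rng.normal(size=(2000, d))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

# Diagonal of the axis-aligned bounding box of the sample, O(n * d)
box_diag = float(np.linalg.norm(pts.max(axis=0) - pts.min(axis=0)))

# Diagonal of [-1, 1]^d, the bounding box of the full unit ball
full_ball_box_diag = 2.0 * np.sqrt(d)
```

Even for this finite sample the box diagonal is well above the true upper bound of 2, and for the full ball it is `2 * sqrt(128)`, roughly 22.6.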
There is a precision-based algorithm that performs very well in any dimension; it is based on computing the dimensions of an axial bounding box.
The idea is that it's possible to find lower and upper bounds of the axis bounding box length function, since its partial derivatives are bounded and depend on the angle between the axes.
The limit of the local maxima derivatives between two axes in 2D space can be computed as:
sin(a/2)*(1 + tan(a/2))
That means that, for example, for 90 degrees between the axes the bound is about 1.41 (sqrt(2)).
This reduces to a/2 as a -> 0, so the upper bound is proportional to the angle.
For the multidimensional case the formula varies slightly, but it is still easy to compute.
So the search for local maxima converges in logarithmic time.
The good news is that we can run the search for such local maxima in parallel.
Also, based on the best result achieved so far, we can filter out both regions of the search and the points themselves that are below the lower limit of the search in the worst region.
The worst case for the algorithm is when all of the points lie on the surface of a sphere.
This can be further improved: when we detect a local search that operates on just a few points, we switch to brute force for that particular axis. It works fast, because we need only the points that are subject to that particular local search, which can be determined as the points bounded by two opposite spherical cones of a particular angle sharing the same axis.
It's hard to work out the big O complexity, because it depends on the desired precision and the distribution of the points (bad when most of the points are on a sphere's surface).
The algorithm I use is as follows:
1. Set the initial angle a = pi/2.
2. Take one axis for each dimension. The angle and the axes form the initial 'bucket'.
3. For each axis, compute the span on that axis by projecting all the points onto the axis and finding the min and max of the coordinates on the axis.
4. Compute the upper and lower bounds of the diameter of interest. They are based on the formula sin(a/2)*(1 + tan(a/2)), multiplied by an asymmetry coefficient computed from the lengths of the current axis projections.
5. For the next step, kill all of the points that fall under the lower bound in every dimension at the same time.
6. For each axis, if the number of points above the upper bound is less than some reasonable threshold (determined experimentally), compute the diameter by brute force (N^2) on the set of points in question, adjust the lower bound, and kill the axis for the next step.
7. For the next step, kill all of the axes that have all of their points under the lower bound.
8. If the precision is satisfactory, (upper bound - lower bound) < epsilon, return the upper bound as the result.
9. For each surviving axis, there is a virtual cone on that axis (actually two opposite cones), which covers some area on a virtual sphere that encloses a face of the cube. If I'm not mistaken, its angle would be a * sqrt(2). Set the new angle to a / sqrt(2). Create a whole bucket of new axes (2 * number of dimensions), so that the new cone areas cover the initial cone area. This is the hard part for me, as I do not have enough imagination for the n>3-dimensional case.
10. Continue from step (3).
You can parallelize the procedure, synchronizing the limits computed so far for the points from (5) through (7).
I'm going to summarize the algorithm proposed by Timothy Shields.
1. Pick a random point x.
2. Pick the point y furthest from x.
3. If not done, let x = y, and go to step 2.
The more times you repeat, the more accurate the result will be... ??
EDIT: actually this algorithm is not very good. Think about a 2D rectangle with vertices ABCD. There are two maxima: between AC and BD, which are separated by a sizable valley. This algorithm will get stuck at one or the other 50/50. If AC is slightly larger than BD, you'll be getting the wrong answer 50% of the time no matter how many times you iterate. Other regular polygons have the same issue, and in higher dimensions it is even worse.
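The failure mode described in the EDIT can be reproduced with a small sketch in Python with NumPy. The quadrilateral coordinates are made up so that diagonal AC is slightly longer than BD, and "done" is interpreted here as "the furthest distance stopped improving", which is an assumption about the stopping rule.

```python
import numpy as np

def farthest_point_iteration(points, start, max_iters=100):
    """Repeatedly jump to the furthest point; return the best distance found."""
    x = start
    best = 0.0
    for _ in range(max_iters):
        dists = np.linalg.norm(points - points[x], axis=1)
        y = int(np.argmax(dists))
        if dists[y] <= best:          # no improvement: the iteration is stuck
            break
        best = float(dists[y])
        x = y
    return best

# Near-rectangle ABCD with |AC| = sqrt(5.41) slightly larger than |BD| = sqrt(5)
quad = np.array([[0.0, 0.0],    # A
                 [2.0, 0.0],    # B
                 [2.1, 1.0],    # C
                 [0.0, 1.0]])   # D

from_a = farthest_point_iteration(quad, start=0)   # bounces between A and C
from_b = farthest_point_iteration(quad, start=1)   # bounces between B and D
```

Starting at A or C finds the true diameter AC, while starting at B or D settles on the shorter diagonal BD and never leaves it, however many iterations are allowed.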

Particle Filter Resampling

I implemented a bootstrap particle filter in C++ after reading a few papers. I first implemented a 1D mouse tracker, which performed really well. I used a normal (Gaussian) distribution for weighting in this example.
I then extended the algorithm to track a face using 2 features: local motion and a 32-bin HSV histogram. In this example my weighting function becomes the probability of motion x probability of histogram. (Is this correct?)
If that is correct, then I am confused about the resampling function. At the moment my resampling function is as follows:
For each particle (N = 50):
Compute the CDF
Generate a random number X (via a Gaussian)
Update the particle at index X
Repeat for all N particles.
This is my resampling function at the moment. Note: in the second step I am using a random number drawn from a Gaussian distribution to get the index, while my weighting function is the probability of motion and histogram.
My question is: should I generate the random number using the probability of motion and histogram, or is a random number drawn from a Gaussian OK?
In the SIR (Sequential Importance Resampling) particle filter, resampling aims to replicate particles that have gained high weight, while removing those with low weight.
So, once you have your particles weighted (typically with the likelihood you have used), one way to do resampling is to build the cumulative distribution of the weights, then generate a random number from a uniform distribution and pick the particle corresponding to that slot of the CDF. This way a particle with more weight has a higher probability of being selected.
Also, don't forget to add some noise after generating replicas of particles, otherwise your point estimate might be biased for a period of time.
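The resampling step described above can be sketched as follows. The question uses C++, but this sketch is in Python with NumPy for brevity; the function name `resample` and the `jitter_std` parameter are hypothetical, and the amount of noise to add is something you would tune for your tracker.

```python
import numpy as np

def resample(particles, weights, rng, jitter_std=0.0):
    """Multinomial resampling: draw N particles with probability proportional to weight."""
    weights = np.asarray(weights, dtype=float)
    cdf = np.cumsum(weights / weights.sum())   # cumulative distribution of the weights
    cdf[-1] = 1.0                              # guard against floating-point round-off
    u = rng.uniform(size=len(particles))       # uniform draws, NOT Gaussian
    idx = np.searchsorted(cdf, u, side="right")  # slot of the CDF each draw falls into
    new_particles = particles[idx].astype(float)
    if jitter_std > 0.0:
        # Spread the replicas out so they do not all sit on the same state
        new_particles += rng.normal(scale=jitter_std, size=new_particles.shape)
    return new_particles

rng = np.random.default_rng(0)
particles = np.array([0.0, 10.0, 20.0])
weights = np.array([0.0, 0.0, 1.0])            # all weight on the last particle
resampled = resample(particles, weights, rng)
```

After resampling, the weights are normally reset to 1/N, since the weight information is now carried by how often each particle was duplicated.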
