Difference between a linear problem and a non-linear problem? Essence of Dot-Product and Kernel trick - algorithm

The kernel trick maps a non-linear problem into a linear problem.
My questions are:
1. What is the main difference between a linear and a non-linear problem? What is the intuition behind the difference between these two classes of problems? And how does the kernel trick help us use linear classifiers on a non-linear problem?
2. Why is the dot product so important in the two cases?
Thanks.

When people say linear problem with respect to a classification problem, they usually mean a linearly separable problem. Linearly separable means that there is some function that can separate the two classes and that is a linear combination of the input variables. For example, if you have two input variables, x1 and x2, there are some numbers theta1 and theta2 such that the function theta1*x1 + theta2*x2 is sufficient to predict the output (typically by comparing its value against a threshold). In two dimensions the decision boundary is a straight line, in 3D it becomes a plane, and in higher-dimensional spaces it becomes a hyperplane.
You can get some kind of intuition about these concepts by thinking about points and lines in 2D/3D. Here's a very contrived pair of examples...
This is a plot of a linearly inseparable problem. There is no straight line that can separate the red and blue points.
However, if we give each point an extra coordinate (specifically 1 - sqrt(x*x + y*y)... I told you it was contrived), then the problem becomes linearly separable since the red and blue points can be separated by a 2-dimensional plane going through z=0.
Hopefully, these examples demonstrate part of the idea behind the kernel trick:
Mapping a problem into a space with a larger number of dimensions makes it more likely that the problem will become linearly separable.
The second idea behind the kernel trick (and the reason why it is so tricky) is that it is usually very awkward and computationally expensive to work in a very high-dimensional space. However, if an algorithm only uses the dot products between points (which you can think of as distances), then you only have to work with a matrix of scalars. You can implicitly perform the calculations in the higher-dimensional space without ever actually having to do the mapping or handle the higher-dimensional data.
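To make the contrived example concrete, here is a minimal sketch of the lifting step. The data points are made up, and I'm assuming the red points sit inside the unit circle and the blue points outside it (the original plot isn't reproduced here); only the extra coordinate 1 - sqrt(x*x + y*y) comes from the answer.

```python
import math

def lift(point):
    """Append the extra coordinate z = 1 - sqrt(x^2 + y^2) to a 2-D point."""
    x, y = point
    return (x, y, 1.0 - math.sqrt(x * x + y * y))

# Made-up data (assumption): "red" points inside the unit circle, "blue" outside.
red = [(0.1, 0.2), (-0.3, 0.4), (0.5, -0.5), (0.0, -0.8)]
blue = [(1.5, 0.0), (-1.2, 1.1), (0.9, -1.4), (0.0, 2.0)]

red_z = [lift(p)[2] for p in red]
blue_z = [lift(p)[2] for p in blue]

# After lifting, the plane z = 0 has all red points above it and all blue below.
print(all(z > 0 for z in red_z))   # True
print(all(z < 0 for z in blue_z))  # True
```

No straight line in the (x, y) plane separates these two sets, but once the third coordinate is added a single flat plane does.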

Many classifiers, among them the linear Support Vector Machine (SVM), can only solve problems that are linearly separable, i.e. where the points belonging to class 1 can be separated from the points belonging to class 2 by a hyperplane.
In many cases, a problem that is not linearly separable can be solved by applying a transform phi() to the data points; this transform is said to transform the points to feature space. The hope is that, in feature space, the points will be linearly separable. (Note: This is not the kernel trick yet... stay tuned.)
It can be shown that, the higher the dimension of the feature space, the greater the number of problems that are linearly separable in that space. Therefore, one would ideally want the feature space to be as high-dimensional as possible.
Unfortunately, as the dimension of feature space increases, so does the amount of computation required. This is where the kernel trick comes in. Many machine learning algorithms (among them the SVM) can be formulated in such a way that the only operation they perform on the data points is a scalar product between two data points. (I will denote a scalar product between x1 and x2 by <x1, x2>.)
If we transform our points to feature space, the scalar product now looks like this:
<phi(x1), phi(x2)>
The key insight is that there exists a class of functions called kernels that can be used to optimize the computation of this scalar product. A kernel is a function K(x1, x2) that has the property that
K(x1, x2) = <phi(x1), phi(x2)>
for some function phi(). In other words: We can evaluate the scalar product in the low-dimensional data space (where x1 and x2 "live") without having to transform to the high-dimensional feature space (where phi(x1) and phi(x2) "live") -- but we still get the benefits of transforming to the high-dimensional feature space. This is called the kernel trick.
Many popular kernels, such as the Gaussian kernel, actually correspond to a transform phi() that maps into an infinite-dimensional feature space. The kernel trick allows us to compute scalar products in this space without having to represent points in this space explicitly (which, obviously, is impossible on computers with finite amounts of memory).

The main difference (for practical purposes) is: A linear problem either does have a solution (and then it's easily found), or you get a definite answer that there is no solution at all. You do know this much, before you even know the problem at all. As long as it's linear, you'll get an answer; quickly.
The intuition behind this is the fact that if you have two straight lines in some space, it's pretty easy to see whether they intersect or not, and if they do, it's easy to know where.
If the problem is not linear -- well, it can be anything, and you know just about nothing.
The dot product of two vectors just means the following: The sum of the products of the corresponding elements. So if your problem is
c1 * x1 + c2 * x2 + c3 * x3 = 0
(where you usually know the coefficients c, and you're looking for the variables x), the left hand side is the dot product of the vectors (c1,c2,c3) and (x1,x2,x3).
The above equation is (pretty much) the very definition of a linear problem, so there's your connection between the dot product and linear problems.

Linear equations are homogeneous, and superposition applies. You can create solutions using combinations of other known solutions; this is one reason why Fourier transforms work so well. Non-linear equations are not homogeneous, and superposition does not apply. Non-linear equations usually have to be solved numerically using iterative, incremental techniques.
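A quick numerical sketch of superposition, under my own choice of example functions: a function built from a dot product with fixed coefficients respects linear combinations of its inputs, while squaring that same expression breaks the property. The coefficient and input values are arbitrary.

```python
def dot(c, x):
    """Sum of the products of corresponding elements."""
    return sum(ci * xi for ci, xi in zip(c, x))

c = (2.0, -1.0, 0.5)  # arbitrary fixed coefficients

def linear(x):
    return dot(c, x)

def nonlinear(x):
    return dot(c, x) ** 2

u = (1.0, 0.0, 2.0)
v = (0.0, 3.0, -2.0)
alpha, beta = 2.0, -1.0
combo = tuple(alpha * ui + beta * vi for ui, vi in zip(u, v))

# Superposition: f(alpha*u + beta*v) == alpha*f(u) + beta*f(v)
print(linear(combo) == alpha * linear(u) + beta * linear(v))           # True
print(nonlinear(combo) == alpha * nonlinear(u) + beta * nonlinear(v))  # False
```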
I'm not sure how to express the importance of the dot product, but it does take two vectors and returns a scalar. Certainly a solution to a scalar equation is less work than solving a vector or higher-order tensor equation, simply because there are fewer components to deal with.
My intuition in this matter is based more on physics, so I'm having a hard time translating to AI.

I think the following link is also useful:
http://www.simafore.com/blog/bid/113227/How-support-vector-machines-use-kernel-functions-to-classify-data

Related

Obtaining the functional form of a curve

The following is the plot of a curve f(r), where r is the radial coordinate, and plotted for different values of a parameter as shown:
However, I don't know the functional form of the curve and I am interested to find the same. Are there any numerical methods which can be used to find the functional form of f(r) in terms of the radial coordinate and the parameter?
I had found a solution of the problem based on the suggestion by ja72 to use the Eureqa software which churns through the data to create accurate predictive models using evolutionary search algorithm.
In the question, the different curves correspond to different values of the parameter. So, initially I obtained the best-fit equation for different values of the parameter and found that the following model equation is suitable for my purpose:
Then, I repeated the process for a large number of values of the parameter, calculated the values of the four functions for each of those values, and then individually fitted these four functions. The following are the results that I obtained:
N.B.: Eureqa gave several other better fitting formulas than those mentioned in the answer. But the formulas that I mentioned are sufficiently accurate for my purpose and have minimum complexity.
A blind curve fit without an underlying model is a dangerous thing.
You need to have an understanding of the physical model behind the data to create a successful fit. The reason is that if r is a distance and the best-fit curve uses r^0.4072, for example, that dimension raised to a decimal power bears no meaning, and it hides any underlying assumptions. For instance, there may be some other length l not included in the model, in which case only the dimensionless quantity (r/l) would make sense to raise to the decimal power.
From a function analysis standpoint
These curves are not the result of any standard math function. Well, I am not that familiar with Bessel functions, gamma functions and Legendre polynomials. But none of the standard functions you find in a scientific calculator jumps out here.
If r is assumed to be dimensionless, then you try to match the asymptotic behavior when r -> 0 and when r -> ∞. That would be the baseline curve. To me it does not look hyperbolic, but rather close to 1/LN(1+r).
So change the variables: set g = 1/LN(1+r), plot f(r) against g(r), and see what that looks like. Then try another round of curve fitting on the new curves ... and so on.
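The change of variables can be sketched as follows. The data here is synthetic: I generate f from a made-up relation f = 2*g + 0.5 purely so the script has something to fit, and `fit_line` is a hypothetical helper doing ordinary least squares. With the real data you would substitute your measured (r, f) pairs and inspect how straight the f-versus-g plot is.

```python
import math

def g(r):
    """Candidate baseline curve suggested above: 1 / ln(1 + r)."""
    return 1.0 / math.log(1.0 + r)

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic stand-in for the measured curve (assumption, not the real data).
rs = [0.5, 1.0, 2.0, 4.0, 8.0]
fs = [2.0 * g(r) + 0.5 for r in rs]

# Re-express the data in the new variable and fit a straight line there.
gs = [g(r) for r in rs]
slope, intercept = fit_line(gs, fs)
print(round(slope, 6), round(intercept, 6))
```

If the points in the (g, f) plane still show curvature, the residual is what the next round of fitting should target.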
Nobody can answer this question
Nobody else could effectively answer this question but you, because a) you have the data, and b) you need to make assumptions about what region is important or not, and what is acceptable deviation.

Optimization algorithms for piecewise-constant and similar ill-defined functions

I have a function which takes as inputs n-dimensional (say n=10) vectors whose components are real numbers varying from 0 to a large positive number A say 50,000, ends included. For any such vector the function outputs an integer from 1 to say B=100. I have this function and want to find its global minima.
Broadly speaking, there are algorithmic, iterative and heuristics-based approaches to tackle such an optimization problem. Which are the best techniques to solve this problem? I am looking for suggestions for algorithms or active research papers that I can implement from scratch to solve such problems. I have already given up hope on the existing optimization functions that ship with Matlab/Python. I am hoping to read about the experience of others working with approximation/heuristic algorithms to optimize such ill-defined functions.
I ran fmincon, fminsearch, fminunc in Matlab but they fail to optimize the function. The function is ill-defined according to their definitions. Matlab says this for fmincon:
Initial point is a local minimum that satisfies the constraints.
Optimization completed because at the initial point, the objective function is non-decreasing
in feasible directions to within the selected value of the optimality tolerance, and
constraints are satisfied to within the selected value of the constraint tolerance.
The problem arises because this function has piecewise-constant behavior. If a vector V is assigned to a number, say 65, changing its components very slightly may not cause any change in the output. Such ill-defined behavior is to be expected because of the pigeonhole principle: the domain of the function is unlimited, whereas the range is just a handful of numbers.
I also wish to clarify one issue that may arise. Suppose I do gradient descent from a starting point x0 and the next x that I get from a GD iteration has some components that lie outside the domain [0,50000]. What happens then? Actually, the domain is circular, so a vector of size 3 like [30;5432;50432] becomes [30;5432;432]. This is taken care of automatically, so there is no worry about iterations finding a vector outside the domain.
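To illustrate the two points above, here is a small sketch. `toy_objective` is a hypothetical stand-in (the real function isn't given), and `wrap` assumes the circular domain means reduction modulo 50000, which maps the endpoint 50000 itself to 0. The finite-difference gradient comes out as all zeros, which is consistent with fmincon reporting the initial point as a local minimum.

```python
A = 50000  # upper end of the domain, as in the question

def wrap(v, period=A):
    """Treat the domain as circular: reduce each component modulo the period."""
    return [x % period for x in v]

def toy_objective(v):
    """Hypothetical piecewise-constant function with integer values in 1..100."""
    return int(sum(v) // 1000) % 100 + 1

def finite_difference_gradient(f, v, h=1e-6):
    """Forward-difference estimate of the gradient of f at v."""
    base = f(v)
    grad = []
    for i in range(len(v)):
        stepped = list(v)
        stepped[i] += h
        grad.append((f(stepped) - base) / h)
    return grad

print(wrap([30, 5432, 50432]))  # [30, 5432, 432]

x0 = [30.0, 5432.0, 432.5]
# A tiny step almost never crosses a jump, so the estimate is zero everywhere.
print(finite_difference_gradient(toy_objective, x0))  # [0.0, 0.0, 0.0]
```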

How do I extend a support vector machine algorithm to a high dimensional data set?

I'm trying to implement an SVM algorithm, but I'm having a hard time understanding how d-dimensional data sets are actually handled. In my particular case, each 'point' has nearly 400 identifying features.
In the two dimensional space, it basically tries to find a line between the two groups that maximizes the margin from any point on either side. I can sort of imagine what such a 'line' would look like in a d-dimensional space, but I'm completely lost on how the classification would actually work.
There is a similar question here, but I'm not getting it. I sort of get how the separation would occur after you have the classifier, but I'm lost on how to actually get the classifier.
If you can imagine how the line of the 2D case would become a d-dimensional hyperplane for higher dimensions, then you are pretty much done. The actual classification occurs when you test a point over the hyperplane, which will give you a positive number if the point belongs to class 1 or negative if it belongs to class 2.
Notice that in the formula there is no restriction for the dimension of each point:
[Image courtesy of wikipedia]
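As a rough sketch of what "testing a point against the hyperplane" means in code: the sign of w.x + b decides the class, and nothing in the computation depends on the number of dimensions. The weights, bias, and test points below are made up, and the sign convention (w.x + b rather than w.x - b) is an assumption; in a real SVM, w and b come out of the training step.

```python
def decision_value(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Class 1 on the non-negative side of the hyperplane, class 2 otherwise."""
    return 1 if decision_value(w, b, x) >= 0 else 2

# Made-up 5-dimensional hyperplane; the same code works for 400 features.
w = (1.0, -2.0, 0.5, 0.0, 3.0)
b = -1.0

print(classify(w, b, (2.0, 0.0, 0.0, 1.0, 1.0)))  # 1  (score  4.0)
print(classify(w, b, (0.0, 1.0, 0.0, 5.0, 0.0)))  # 2  (score -3.0)
```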
And in case you are curious about what happens with the non-linear case when you use the kernel trick, I would like to share with you a video that illustrates very well the idea.
http://www.youtube.com/watch?v=3liCbRZPrZA

Find all points in sphere of radius r around arbitrary coordinate

I'm looking for an efficient algorithm that for a space with known height, width and length, given a fixed radius R, and a list of points N, with 3-dimensional coordinates in that space, will find all the points within a fixed radius R of an arbitrary point on the grid. This query will be done many times with different points, so an expensive pre-processing/sorting step, in exchange for quick queries may be worth it. This is a bit of a bottleneck step of an application I'm working on, so any time I can cut off of it is useful
Things I have tried so far:
-The naive algorithm, iterate over all points and calculate distance
-Divide the space into a grid with cubes of length R, and put the points into these. That way, for each point, I only have to ever query the immediate neighboring buckets. This has a significant speedup
-I've tried using the Manhattan distance as a heuristic. That is, within the buckets, before calculating a distance to any point, use the Manhattan distance to filter out those that can't possibly be within radius R (that is, those with a Manhattan distance greater than sqrt(3)*R). I thought this would offer a speedup, as it only needs addition instead of multiplication, but it actually slowed the program down by a little bit
EDIT: To compare the distances, I use the squared distance to eliminate having to use a sqrt function.
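For reference, the bucket approach described in the list above can be sketched like this (in Python for brevity, although the question is about C). The function names `cell_of`, `build_grid` and `query` are mine, the sample points are made up, and the test uses "<=" on the squared distance; adjust to "<" if points exactly at distance R should be excluded.

```python
from collections import defaultdict
from itertools import product

def cell_of(p, r):
    """Index of the cube of side r that contains point p."""
    return tuple(int(c // r) for c in p)

def build_grid(points, r):
    """One-off preprocessing: bucket every point by its cube."""
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p, r)].append(p)
    return grid

def query(grid, center, r):
    """All points within distance r of center, checking only the 27 nearby cubes."""
    cx, cy, cz = cell_of(center, r)
    r2 = r * r  # compare squared distances, so no sqrt is needed
    found = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        for p in grid.get((cx + dx, cy + dy, cz + dz), ()):
            d2 = sum((a - b) ** 2 for a, b in zip(p, center))
            if d2 <= r2:
                found.append(p)
    return found

points = [(0.5, 0.5, 0.5), (1.4, 0.5, 0.5), (3.0, 3.0, 3.0), (0.5, 1.6, 0.5)]
grid = build_grid(points, 1.0)
print(sorted(query(grid, (0.6, 0.5, 0.5), 1.0)))
```

Because the cube side equals R, any point within R of the query point must lie in the query's own cube or one of its 26 neighbours.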
Obviously, there will be some limit on how much I can speed this up, but I could use any suggestions on things to try now.
Not that it probably matters on the algorithmic level, but I'm working in C.
You may get a speed benefit from storing your points in a k-d tree with three dimensions. That will give you searches in O(log n) amortized time.
Don't compare on the radius, compare on the square of the radius. The reason is that if the distance between two points is less than R, then the square of the distance is less than R^2.
This way, when you're using the distance formula, you don't need to compute the square root, which is a very expensive operation.
I would recommend using either K-D tree or z-curve:
http://en.wikipedia.org/wiki/Z-order_%28curve%29
How about a Binary Indexed Tree? (See the Topcoder tutorials.) It can be extended to n dimensions, and is simpler to code.
Nicolas Brodu's NEIGHAND library does exactly what you want, improving on the bin-lattice algorithm.
More details can be found in his article: Query Sphere Indexing for Neighborhood Requests
[I might be misunderstanding the question. I'm finding the problem statement difficult to parse.]
In the old days, it was often good to design this type of algorithm with "early outs" that do tests to try to avoid a more expensive calculation. On modern processors, a branch misprediction is often very expensive, and those early-out tests can actually be more expensive than the full calculation. (The only way to know for sure is to measure.)
In this case, the calculation is pretty simple, so it may be best to avoid building a data structure or doing any clever early-out checks and instead try to optimize, vectorize, and parallelize to get the throughput you need.
For a point P(x, y, z) and a sphere S(x_s, y_s, z_s, radius), the membership test is:
(x - x_s)^2 + (y - y_s)^2 + (z - z_s)^2 < radius^2
where radius^2 can be pre-calculated once for all the points in the query (avoiding any square root calculations). These calculations are all independent, so you can compute them for several points in parallel. With something like SSE, you could probably do four at a time. And if you have many points to test, you could split the list and further parallelize the work across multiple cores.

Help me understand linear separability in a binary SVM

I'm cross-posting this from math.stackexchange.com because I'm not getting any feedback and it's a time-sensitive question for me.
My question pertains to linear separability with hyperplanes in a support vector machine.
According to Wikipedia:
...formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
The linear separation of classes by hyperplanes intuitively makes sense to me. And I think I understand linear separability for two-dimensional geometry. However, I'm implementing an SVM using a popular SVM library (libSVM) and when messing around with the numbers, I fail to understand how an SVM can create a curve between classes, or enclose central points in category 1 within a circular curve when surrounded by points in category 2 if a hyperplane in an n-dimensional space V is a "flat" subset of dimension n − 1, or for two-dimensional space - a 1D line.
Here is what I mean:
That's not a hyperplane. That's circular. How does this work? Or are there more dimensions inside the SVM than the two-dimensional input features?
This example application can be downloaded here.
Edit:
Thanks for your comprehensive answers. So the SVM can separate weird data well by using a kernel function. Would it help to linearize the data before sending it to the SVM? For example, one of my input features (a numeric value) has a turning point (eg. 0) where it neatly fits into category 1, but above and below zero it fits into category 2. Now, because I know this, would it help classification to send the absolute value of this feature for the SVM?
As mokus explained, support vector machines use a kernel function to implicitly map data into a feature space where they are linearly separable:
Different kernel functions are used for various kinds of data. Note that an extra dimension (feature) is added by the transformation in the picture, although this feature is never materialized in memory.
(Illustration from Chris Thornton, U. Sussex.)
Check out this YouTube video that illustrates an example of linearly inseparable points that become separable by a plane when mapped to a higher dimension.
I am not intimately familiar with SVMs, but from what I recall from my studies they are often used with a "kernel function" - essentially, a replacement for the standard inner product that effectively non-linearizes the space. It's loosely equivalent to applying a nonlinear transformation from your space into some "working space" where the linear classifier is applied, and then pulling the results back into your original space, where the linear subspaces the classifier works with are no longer linear.
The wikipedia article does mention this in the subsection "Non-linear classification", with a link to http://en.wikipedia.org/wiki/Kernel_trick which explains the technique more generally.
This is done by applying what is known as the kernel trick (http://en.wikipedia.org/wiki/Kernel_trick).
What basically is done is that if something is not linearly separable in the existing input space (2-D in your case), it is projected to a higher dimension where it would be separable. A kernel function (which can be non-linear) is applied to modify your feature space. All computations are then performed in this feature space (which can possibly be of infinite dimension too).
Each point in your input is transformed using this kernel function, and all further computations are performed as if this was your original input space. Thus, your points may be separable in a higher dimension (possibly infinite), and the linear hyperplane in the higher dimension might not be linear in the original dimensions.
For a simple example, consider the example of XOR. If you plot Input1 on X-Axis, and Input2 on Y-Axis, then the output classes will be:
Class 0: (0,0), (1,1)
Class 1: (0,1), (1,0)
As you can observe, it's not linearly separable in 2-D. But if I take these ordered pairs in 3-D (by just moving 1 point in 3-D), say:
Class 0: (0,0,1), (1,1,0)
Class 1: (0,1,0), (1,0,0)
Now you can easily observe that there is a plane in 3-D to separate these two classes linearly.
Thus if you project your inputs to a sufficiently large dimension (possibly infinite), then you'll be able to separate your classes linearly in that dimension.
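To check the XOR example numerically, here is a short sketch. The four 3-D points are the ones listed above; the plane 2*x + 2*y + 4*z - 3 = 0 is one I worked out by hand for illustration (many other planes would do as well).

```python
# The lifted XOR points from the example above.
class0 = [(0, 0, 1), (1, 1, 0)]
class1 = [(0, 1, 0), (1, 0, 0)]

# One separating plane, found by hand (assumption): 2x + 2y + 4z - 3 = 0.
w = (2, 2, 4)
b = -3

def side(p):
    """Signed score of point p relative to the plane w.p + b = 0."""
    return sum(wi * pi for wi, pi in zip(w, p)) + b

print([side(p) for p in class0])  # [1, 1]   -> positive side
print([side(p) for p in class1])  # [-1, -1] -> negative side
```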
One important point to notice here (and maybe I'll answer your other question too) is that you don't have to make a kernel function yourself (like I made one above). The good thing is that the kernel function automatically takes care of your input and figures out how to "linearize" it.
For the SVM example in the question, given in 2-D space, let x1, x2 be the two axes. You can use a transformation function F = x1^2 + x2^2 and transform this problem into a 1-D problem. If you look carefully, you can see that in the transformed space you can easily separate the points linearly (with a threshold on the F axis). Here the transformed space was [F] (1-dimensional). In most cases, you would be increasing the dimensionality to make the classes separable by a hyperplane.
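A minimal sketch of that 1-D transform, using made-up points (a central cluster surrounded by a ring, loosely matching the picture in the question). The threshold is simply placed halfway between the largest inner value and the smallest outer value of F.

```python
def F(p):
    """Transform a 2-D point to the single feature x1^2 + x2^2."""
    x1, x2 = p
    return x1 * x1 + x2 * x2

# Made-up data (assumption): central points vs. surrounding points.
inner = [(0.2, 0.1), (-0.4, 0.3), (0.0, -0.5)]
outer = [(1.5, 0.2), (-1.1, -1.2), (0.3, 1.8)]

inner_max = max(F(p) for p in inner)
outer_min = min(F(p) for p in outer)

# Any threshold strictly between the two values separates the classes on the F axis.
threshold = (inner_max + outer_min) / 2.0

print(all(F(p) < threshold for p in inner))  # True
print(all(F(p) > threshold for p in outer))  # True
```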
SVM clustering
HTH
My answer to a previous question might shed some light on what is happening in this case. The example I give is very contrived and not really what happens in an SVM, but it should give you some intuition.

Resources