Primal weights of non-linear SVM - algorithm

I am experimenting with different kinds of non-linear kernels and am trying to interpret the learned models, which led me to the following question: Is there a generic method for getting the primal weights of a non-linear Support Vector Machine, similar to how this is possible for linear SVMs (see related question)?
Say you have three features a, b, c and a model generated with an all-subsets/polynomial kernel. Is there a way to extract the primal weights of those subsets, e.g., of a * b and a^2?
I've tried extending the method for linear kernels, where you generate output for the following samples:
a, b, c
[0, 0, 0]
[1, 0, 0]
[0, 1, 0]
[0, 0, 1]
If I use the same approach for the all-subsets kernel, I can generate some more samples:
a, b, c
[1, 1, 0]
[1, 0, 1]
...
Next, to calculate the primal weight for a * b, I combine the predictions for these samples as follows: prediction([1, 1, 0]) - prediction([1, 0, 0]) - prediction([0, 1, 0]) + prediction([0, 0, 0]).
The problems I see with this are that it requires a prohibitive number of samples, it doesn't cover terms such as a^2, and it doesn't generalise to other non-linear kernels.

No. I don't claim to be the end-all-be-all expert on this, but I've done a lot of reading and research on SVM and I do not think what you are saying is possible. Sure, in the case of the 2nd degree polynomial kernel you can enumerate the feature space induced by the kernel, if the number of attributes is very small. For higher-order polynomial kernels and larger numbers of attributes this quickly becomes intractable.
The power of the non-linear SVM is that it is able to induce feature spaces without having to do computation in that space, and in fact without actually knowing what that feature space is. Some kernels can even induce an infinite-dimensional feature space.
If you look back at your question, you can see part of the issue - you are looking for the primal weights. However, the kernel is something that is introduced in the dual form, where the data shows up as a dot product. Mathematically reversing this process would involve breaking the kernel function apart - knowing the mapping function from input space to feature space. Kernel functions are powerful precisely because we do not need to know this mapping. Of course it can be done for linear kernels, because there is no mapping function used.
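To make the tractable degree-2 case concrete, here is a hedged sketch (the dataset, the parameters and the use of scikit-learn are my own illustration, not part of the answer). For the homogeneous degree-2 polynomial kernel K(x, y) = (x . y)^2 over three features (a, b, c), the induced feature space is just (a^2, b^2, c^2, sqrt(2)*ab, sqrt(2)*ac, sqrt(2)*bc), so the primal weights, including those for a*b and a^2 asked about in the question, can be read off from the dual coefficients:
import numpy as np
from sklearn.svm import SVC
def phi(X):
    # Explicit feature map of the homogeneous degree-2 polynomial kernel K(x, y) = (x . y)^2.
    a, b, c = X[:, 0], X[:, 1], X[:, 2]
    s = np.sqrt(2)
    return np.column_stack([a**2, b**2, c**2, s*a*b, s*a*c, s*b*c])
# Made-up data whose true rule involves an interaction and a squared term.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] * X[:, 1] + X[:, 2]**2 > 0.5).astype(int)
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0).fit(X, y)
# Primal weights in the explicit feature space: w = sum_i (alpha_i * y_i) * phi(sv_i)
w = (clf.dual_coef_ @ phi(clf.support_vectors_)).ravel()
print(dict(zip(["a^2", "b^2", "c^2", "sqrt2*ab", "sqrt2*ac", "sqrt2*bc"], w)))
# Sanity check: w . phi(x) + b reproduces the kernel decision function.
print(np.allclose(phi(X) @ w + clf.intercept_, clf.decision_function(X)))
The same trick stops being practical for higher degrees or more attributes, exactly because the explicit feature space grows combinatorially.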

Extracting the weights of the explicit features is generally not computationally feasible, but a decent next-best-thing is the pre-image: generating a sample z such that its features correspond to the weights you're after.
This can be described formally as finding z such that phi(z) = w, with the weights implicitly defined as a combination of training samples, as is usual with the kernel trick: w=sum_i(alpha_i * phi(x_i)). phi is the feature map.
In general an exact pre-image does not exist, and most methods find the pre-image that minimizes the squared-error.
A good description of the classical Gauss-Newton iteration for pre-images of Gaussian kernels, as well as another more general method based on KPCA, is in:
James T. Kwok, Ivor W. Tsang, "The Pre-Image Problem in Kernel Methods", ICML 2003
Direct link: http://machinelearning.wustl.edu/mlpapers/paper_files/icml2003_KwokT03a.pdf
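For the Gaussian/RBF kernel in particular, the well-known fixed-point iteration for pre-images (closely related to the iterations the paper discusses) is short enough to sketch. Everything below (the helper name, the toy expansion, the parameters) is illustrative rather than taken from the paper:
import numpy as np
def rbf_preimage(X, alpha, gamma, z0, n_iter=100):
    # Approximate pre-image z of w = sum_i alpha_i * phi(x_i) for the RBF kernel
    # k(x, z) = exp(-gamma * ||x - z||^2), using the fixed-point update
    #   z <- sum_i alpha_i * k(x_i, z) * x_i / sum_i alpha_i * k(x_i, z)
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iter):
        k = alpha * np.exp(-gamma * np.sum((X - z) ** 2, axis=1))
        denom = k.sum()
        if abs(denom) < 1e-12:   # iteration breaks down; return the current estimate
            break
        z = (k[:, None] * X).sum(axis=0) / denom
    return z
# Toy expansion: two points with equal weight; the pre-image of their feature-space
# mean lies between them in input space.
X = np.array([[0.0, 0.0], [1.0, 0.0]])
alpha = np.array([0.5, 0.5])
print(rbf_preimage(X, alpha, gamma=1.0, z0=[0.2, 0.1]))   # approximately [0.5, 0.0]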

Related

Algorithm to find desired direction with minimum amount of iterations

There are three components to this problem:
A three dimensional vector A.
A "smooth" function F.
A desired vector B (also three dimensional).
We want to find a vector A that when put through F will produce the vector B.
F(A) = B
F can be anything that transforms or distorts A in some manner. The point is that we want to iteratively call F(A) until B is produced.
The question is:
How can we do this, but with the least amount of calls to F before finding a vector that equals B (within a reasonable threshold)?
I am assuming that what you call "smooth" is tantamount to being differentiable.
Since the concept of smoothness only makes sense in the rational / real numbers, I will also assume that you are solving a floating point-based problem.
In this case, I would formulate the problem as a nonlinear programming problem, i.e. minimizing the squared norm of the difference between F(A) and B, given by
(F(A)_1 - B_1)² + (F(A)_2 - B_2)² + (F(A)_3 - B_3)²
It should be clear that this expression is zero if and only if F(A) = B and positive otherwise. Therefore you would want to minimize it.
As an example, you could use the solvers built into the scipy optimization suite (available for python):
from scipy.optimize import minimize
# Example function
f = lambda x: [x[0] + 1, x[2], 2 * x[1]]
# Example target vector (chosen so that the result below is reproduced)
B = [0, 0, 5]
# Optimization objective: squared distance between f(x) and B
fsq = lambda x: sum((v - b) ** 2 for v, b in zip(f(x), B))
# Initial guess
x0 = [0, 0, 0]
res = minimize(fsq, x0, tol=1e-6)
# res.x is the solution, in this case
# array([-1.00000000e+00, 2.49999999e+00, -5.84117172e-09])
A binary search (as pointed out above) only works if the function is 1-d, which is not the case here. You can try out different optimization methods by adding method="name" to the call to minimize; see the API. It is not always clear which method works best for your problem without knowing more about the nature of your function. As a rule of thumb, the more information you give to the solver, the better. If you can compute the derivative of F explicitly, passing it to the solver will help reduce the number of required evaluations. If F has a Hessian (i.e., if it is twice differentiable), providing the Hessian will help as well.
As an alternative, you can use the least_squares function on the residual F(A) - B directly (rather than on F itself), as sketched below. This could be faster, since the solver can exploit the fact that you are solving a least squares problem rather than a generic optimization problem.
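A minimal sketch of that alternative, reusing the toy f, B and x0 from the example above (those names are illustrative, not part of the original answer):
from scipy.optimize import least_squares
# least_squares expects the vector of residuals, so pass f(x) - B rather than f itself.
residual = lambda x: [v - b for v, b in zip(f(x), B)]
res = least_squares(residual, x0)
# res.x again converges to approximately [-1, 2.5, 0]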
From a more general standpoint, the problem of recovering the function argument that produces a given value is called an Inverse Problem. These problems have been extensively studied.
Provided that F(A) = B, where F and B are known and A remains unknown, you can start with a simple iterative search (essentially Newton's method) based on the linearization:
F(A) ≈ F(C) + F'(C)*(A - C) ≈ B
where F'(C) is the Jacobian of F (its matrix of partial derivatives) evaluated at the point C. I'm assuming for now that you can calculate this derivative analytically.
So, you can choose a point C that you estimate is not very far from the solution and iterate:
C = C0;
while (true)
{
    Ai = inverse(F'(C)) * (B - F(C)) + C;
    convergence = norm(Ai - C);
    C = Ai;
    if (convergence < someThreshold)
        break;
}
If the Jacobian of F() cannot be calculated analytically, you can estimate it with central differences. Let Ei, i = 1:3, be the orthonormal basis vectors; then
Fi'(C) = (F(C+Ei*d) - F(C-Ei*d))/(2*d);
F'(C) = [F1'(C) | F2'(C) | F3'(C)];
and d can be chosen as fixed or as a function of the convergence value.
These methods suffer from local minima, regions where the Jacobian is (nearly) singular, and so on. For the iteration to work, the starting point C0 must not be too far from the solution, in a region where F() behaves monotonically.
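Here is a hedged Python version of that iteration with the central-difference Jacobian (the helper name is mine; the toy F, target B and starting point are made up, chosen to match the scipy example earlier on this page):
import numpy as np
def newton_solve(F, B, C0, d=1e-6, tol=1e-10, max_iter=100):
    # Iterate C <- C + inverse(F'(C)) * (B - F(C)), estimating F'(C) by central differences.
    C = np.asarray(C0, dtype=float)
    B = np.asarray(B, dtype=float)
    for _ in range(max_iter):
        n = len(C)
        J = np.empty((n, n))
        for i in range(n):
            e = np.zeros(n)
            e[i] = d
            J[:, i] = (np.asarray(F(C + e)) - np.asarray(F(C - e))) / (2 * d)
        step = np.linalg.solve(J, B - np.asarray(F(C)))   # solve instead of forming the inverse explicitly
        C = C + step
        if np.linalg.norm(step) < tol:
            break
    return C
F = lambda x: [x[0] + 1, x[2], 2 * x[1]]
B = [0, 0, 5]
print(newton_solve(F, B, C0=[0.0, 0.0, 0.0]))   # approximately [-1, 2.5, 0]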
It seems like you could try a metaheuristic approach for this.
A genetic algorithm (GA) might be a good fit.
You can initialize a population with a number of candidate A vectors and let the GA evolve it, so that each generation contains better vectors, i.e. vectors Ai for which F(Ai) is closer to B.
Your fitness function can be a simple function that compares F(Ai) to B.
You can choose how to mutate your population at each generation.
A simple example of a GA can be found here: link
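A minimal sketch of that idea (the toy F and B reuse the earlier scipy example; the population size, selection scheme and mutation scale are arbitrary choices of mine):
import random
def evolve(F, B, pop_size=50, generations=200, sigma=0.5):
    # Fitness: negative squared distance between F(A) and B (higher is better).
    def fitness(a):
        return -sum((fa - b) ** 2 for fa, b in zip(F(a), B))
    # Random initial population of 3-D candidate vectors.
    pop = [[random.uniform(-10, 10) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]                            # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            child = [(a + b) / 2 for a, b in zip(p1, p2)]          # crossover
            child = [c + random.gauss(0, sigma) for c in child]    # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
F = lambda x: [x[0] + 1, x[2], 2 * x[1]]
B = [0, 0, 5]
print(evolve(F, B))   # should approach [-1, 2.5, 0]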

transform a matrix to the reduced row echelon form in Ruby

I'm working on an open source project written in Ruby, and I have hit an area where an algorithm requires some linear algebra. I'm looking for a gem to transform a matrix into reduced row echelon form.
Basically following this (very detailed) series of steps:
http://www.math.odu.edu/~bogacki/cgi-bin/lat.cgi?c=rref
to convert
require 'matrix'
Matrix[[12, 0, -1, 0], [26, 0, 0, -2], [0, 2, -2, -1]]
to
Matrix[[1,0,0,-1/13],[0,1,0,-37/26],[0,0,1,-12/13]]
Can this be accomplished with standard ruby libraries in few steps? Or does a linear algebra gem exist?
Does this help - http://rubyforge.org/projects/linalg/ ?
Basic description reads - Linalg is a fast, LAPACK-based library for real and complex matrices. Current functionality includes: singular value decomposition, eigenvectors and eigenvalues of a general matrix, least squares, LU, QR, Schur, Cholesky, stand-alone LAPACK bindings.
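For reference, the row reduction itself is a short Gauss-Jordan elimination. Here is a hedged sketch in Python with exact rational arithmetic, reproducing the example above; the same loops can be written against Ruby's Matrix and Rational classes if no gem fits:
from fractions import Fraction
def rref(rows):
    # Reduced row echelon form via Gauss-Jordan elimination with exact rationals.
    m = [[Fraction(x) for x in row] for row in rows]
    nrows, ncols = len(m), len(m[0])
    pivot_row = 0
    for col in range(ncols):
        # Find a row at or below pivot_row with a nonzero entry in this column.
        pr = next((r for r in range(pivot_row, nrows) if m[r][col] != 0), None)
        if pr is None:
            continue
        m[pivot_row], m[pr] = m[pr], m[pivot_row]
        # Scale the pivot row so the pivot becomes 1.
        pv = m[pivot_row][col]
        m[pivot_row] = [x / pv for x in m[pivot_row]]
        # Eliminate the pivot column from all other rows.
        for r in range(nrows):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
        if pivot_row == nrows:
            break
    return m
for row in rref([[12, 0, -1, 0], [26, 0, 0, -2], [0, 2, -2, -1]]):
    print([str(x) for x in row])
# ['1', '0', '0', '-1/13']
# ['0', '1', '0', '-37/26']
# ['0', '0', '1', '-12/13']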

Similarity algorithm (mathematics) of sampled signals

Let's say I have sampled some signals and constructed a vector of the samples for each. What is the most efficient way to calculate the (dis)similarity of those vectors? Note that the offset of the sampling must not matter; for instance, sample vectors of sin and cos signals should be considered similar, since viewed as sequences they are exactly the same.
There is a simple way of doing this by "rolling" the elements of the other vector, calculating the Euclidean distance at each roll position and finally choosing the best match (smallest distance). This solution works fine, since my only goal is to find the most similar sample vector for an input signal from a vector pool.
However, the solution above is also very inefficient when the dimension of the vectors grows. Compared to a "non-sequential" vector comparison, the sequential one needs N times more distance calculations for N-dimensional vectors.
Is there any higher/better mathematics/algorithms to compare two sequences with differing offsets?
Use case for this would be in sequence similarity visualization with SOM.
EDIT: How about comparing each vector's integral and entropy? Both of them are "sequence-safe" (= time-invariant?) and very fast to calculate, but I doubt they alone are enough to distinguish all possible signals from each other. Is there something else that could be used in addition to these?
EDIT2: Victor Zamanian's reply isn't directly the answer, but it gave me an idea that might be. The solution might be to sample the original signals by calculating their Fourier transform coefficients and inserting those into the sample vectors. The first element (X_0) is the mean or "level" of the signal, and the following ones (X_n) can be used directly to compare similarity with some other sample vector. The smaller n is, the more effect it should have on the similarity calculation, since including more FT coefficients only refines the representation of the signal. This brings up a bonus question:
Let's say we have FT-6 sampled vectors (values just fell out of the sky)
X = {4, 15, 10, 8, 11, 7}
Y = {4, 16, 9, 15, 62, 7}
The similarity value of these vectors could MAYBE be calculated like this: |16-15| + (|10 - 9| / 2) + (|8 - 15| / 3) + (|11-62| / 4) + (|7-7| / 5)
The divisors (the weighting factors 1, 2, 3, 4, 5) are the bonus question: are there known coefficients, or some other way, to tell how much each FT coefficient should affect the similarity relative to the other coefficients?
If I understand your question correctly, maybe you would be interested in some type of cross-correlation implementation? I'm not sure if it's the most efficient thing to do or fits the purpose, but I thought I would mention it since it seems relevant.
Edit: Maybe a Fast Fourier Transform (FFT) could be an option? Fourier transforms are great for distinguishing signals from each other, and I believe they are helpful for finding similar signals too. E.g. a sine and a cosine wave have identical magnitude spectra and differ only in phase, so comparing magnitudes ignores the offset. FFTs can be computed in O(N log N).
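A hedged NumPy sketch of this route (the toy signals are made up): circular cross-correlation computed via the FFT finds the best "roll" in O(N log N) instead of trying all N offsets, and the correlation peak converts directly into the smallest rolled Euclidean distance:
import numpy as np
def best_circular_match(x, y):
    # corr[k] = sum_n x[n] * y[(n - k) mod N], i.e. the correlation of x with y rolled by k,
    # computed for all offsets at once via the FFT.
    corr = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(y))).real
    k = int(np.argmax(corr))
    # ||x - roll(y, k)||^2 = ||x||^2 + ||y||^2 - 2 * corr[k]
    sq_dist = np.dot(x, x) + np.dot(y, y) - 2 * corr[k]
    return k, max(sq_dist, 0.0)
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
print(best_circular_match(np.sin(t), np.cos(t)))   # offset 16 (a quarter period), distance ~0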
Google "translation invariant signal classificiation" and you'll find things like these.

Looking for algorithm: Clustering by 'similarity'

I have a set of 'vectors' and I need to sort them based on their 'similarity'.
Like this: the vectors {1,0,0} {1,1,0} {0,1,0} {1,0,1} are pretty similar and should end up close to each other, but the vectors {1, 0, 0} {8, 0, 0} {0, 5, 0} are not.
The metric between A and B is max(abs(A[i]-B[i])), but what kind of algorithm can sort things based on such relative comparisons?
upd:
input: array of N vectors
output: array of N vectors, where vectors that are nearest by index (arr[i] and arr[i+1], for example) are 'similar', i.e. the metric between arr[i] and arr[i+1] is as low as possible for every i.
metric - maximum difference of vector components
upd2:
as it seems now, @jogojapan was right - I need to cluster the vectors and afterwards print them in some linear order, group by group
That's the distance induced by the max norm (aka sup norm or l-infinity norm). A distance alone is not enough to create a linear ordering, if by sorting you mean ordering in a sequence.
Sorting is inherently a one-dimensional problem. What you're describing here sounds more like a weighted graph but it's not clear what your goal is. You may also find some concepts from information theory such as Hamming Distance to be useful if you're trying to identify the vector which is "closest" to a known vector.
Well, the obvious approach would be the (IMHO badly named) "hierarchical clustering", which always merges those clusters with the smallest distance. You can plug in your metric there. Most implementations are in O(n^3) and thus not useful for large datasets. Plus, you get a huge dendrogram that is hard to read.
You might want to give OPTICS a try. Look it up on Wikipedia. It might satisfy your needs quite well, since it in fact sorts the points. It will walk from one cluster to another, and can in fact produce a hierarchical (as in "nested") clustering. A good implementation should run in O(n^2) without index structures and in O(n log n) with index acceleration.
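A hedged sketch of the hierarchical-clustering route with SciPy, using the max-norm (Chebyshev) metric from the question; the sample vectors and the choice of complete linkage are illustrative. The order of the dendrogram's leaves is one way to get a "similar vectors next to each other, group by group" sequence:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, leaves_list
vectors = np.array([[1, 0, 0], [8, 0, 0], [1, 1, 0], [0, 5, 0], [0, 1, 0], [1, 0, 1]])
# Pairwise distances under the metric from the question: max(abs(A[i] - B[i])).
dists = pdist(vectors, metric='chebyshev')
# Agglomerative clustering; leaves_list returns the dendrogram's leaf order,
# which places similar vectors next to each other, group by group.
Z = linkage(dists, method='complete')
order = leaves_list(Z)
print(vectors[order])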
Any sorting algorithm can give you the results you want.
The question is how you are going to compare your vectors. Do you just want to compare them by magnitude? Or something else?

Algorithm Question Maximize Average of Functions

I have a set of N non-decreasing functions each denoted by Fi(h), where h is an integer. The functions have numeric values.
I'm trying to figure out a way to maximize the average of all of the functions, given a total budget of H hours to split among them.
For example, say each function represents a grade on an assignment. If I spend h hours on assignment i, I will get g = Fi(h) as my grade. I'm given H hours to finish all of the assignments. I want to maximize my average grade for all assignments.
Can anyone point me in the right direction to figure this out? I just need a generic algorithm in pseudo code and then I can probably adapt quickly from that.
EDIT: I think dynamic programming could be used to figure this out but I'm not really 100% sure.
EDIT 2: I found an example in my algorithms book from when I was in university that is almost the exact same problem take a look here on Google Books.
I don't know about programming, but in mathematics functions of functions are called functionals, and the pertinent math is calculus of variations.
Have a look at linear programming, the section on integer programming
Genetic algorithms are sometimes used for this sort of thing, but the result you get won't necessarily be optimal, only close to it.
For a "real" solution (I always feel genetics is sort of cheating) if we can determine some properties of the functions (Is function X rising? Do any of them have asymptotes we can abuse? etc.), then you need to design some analyzing mechanism for each function, and take it from there. If we have no properties for any of them, they could be anything. My math isn't excellent, but those functions could be insane factorials^99 that is zero unless your h is 42 or something.
Without further info, or knowledge that your program could analyze and get some info. I'd go genetics. (It would make sense to apply some analyzing function on it, and if you find some properties you can use, use them, otherwise turn to the genetic algorithm)
If the functions in F are monotonically increasing over their domains, then parametric search is applicable (search for Megiddo).
Have a look at the bounded knapsack problem and the dynamic programming algorithm given for it.
I have one question: how many functions and how many hours do you have?
It seems to me that an exhaustive search would be quite suitable if neither is too high.
The dynamic programming application is quite easy. First consider:
F1 = [0, 1, 1, 5] # ie F1[0] == 0, F1[1] == 1
F2 = [0, 2, 2, 2]
Then if I have 2 hours, my best method is to do:
F1[1] + F2[1] == 3
If I have 3 hours though, I am better off doing:
F1[3] + F2[0] == 5
So the optimal split is anarchic with respect to the number of hours, which means that if a solution exists it has to come from building the answer up one function at a time.
We can thus introduce the methods one at a time:
R1 = [0, 1, 1, 5] # ie maximum achievable (for each amount) if I only have F1
R2 = [0, 2, 3, 5] # ie maximum achievable (for each amount) if I have F1 and F2
Introducing a new function requires O(N) evaluations of that function, where N is the total number of hours, plus O(N^2) table updates (and of course I would also have to store the exact allocation...).
Thus, if you have M functions, the algorithm is O(M*N) in terms of the number of function evaluations.
Some functions may not be trivial to evaluate, but this algorithm performs caching implicitly: i.e. we only evaluate a given function at a given point once!
I suppose we could do better by taking the non-decreasing property into account, but I daresay I am unsure about the specifics. Waiting for a cleverer fellow!
Since it's homework, I'll refrain from posting the code. I would just note that you can "store" the allocation if your R tables are composed of pairs (score, nb), where nb indicates the number of hours used by the latest function introduced.
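For readers outside the homework context, here is a hedged sketch of the table-building described above; the function and variable names are mine, and the profiles are the toy F1/F2 from earlier in the answer:
def max_total_grade(functions, total_hours):
    # functions[i][h] = value of F_{i+1}(h); maximizing the average is the same as
    # maximizing the total since the number of assignments is fixed.
    # Each profile value comes from a precomputed list, so the functions themselves are
    # evaluated O(M*N) times overall; the table filling is O(M*N^2).
    best = [0] * (total_hours + 1)        # best[h]: max total achievable with h hours so far
    choice = []                           # choice[i][h]: hours given to function i at budget h
    for F in functions:
        new_best = [float('-inf')] * (total_hours + 1)
        pick = [0] * (total_hours + 1)
        for h in range(total_hours + 1):
            for k in range(min(h, len(F) - 1) + 1):
                val = best[h - k] + F[k]
                if val > new_best[h]:
                    new_best[h], pick[h] = val, k
        best = new_best
        choice.append(pick)
    # Recover the allocation by walking back through the stored choices.
    hours_left, alloc = total_hours, []
    for pick in reversed(choice):
        alloc.append(pick[hours_left])
        hours_left -= pick[hours_left]
    return best[total_hours], list(reversed(alloc))
F1 = [0, 1, 1, 5]
F2 = [0, 2, 2, 2]
print(max_total_grade([F1, F2], 3))   # (5, [3, 0]) - matches the 3-hour example above
print(max_total_grade([F1, F2], 2))   # (3, [1, 1]) - matches the 2-hour example above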
