Single Perceptron - Non-linear Evaluating function - algorithm

In the case of a single perceptron, the literature states that it cannot be used for separating classes that are not linearly separable, such as the XOR function. This is understandable, since the VC-dimension of a line (in 2-D) is 3, so a single 2-D line cannot discriminate the four points of XOR.
However, my question is why should the evaluating function in the single perceptron be a linear-step function? Clearly if we have a non-linear evaluating function like a sigmoid, this perceptron can discriminate between the 1s and 0s of XOR. So, am I missing something here?

if we have a non-linear evaluating function like a sigmoid, this perceptron can discriminate between the 1s and 0s of XOR
That's not true at all. The criterion for discrimination is not the shape of the line (or hyperplane in higher dimensions), but rather whether the function allows linear separability.
There is no single function that produces a hyperplane capable of separating the points of the XOR function. The curve in the image separates the points, but it is not a function.
To separate the points of XOR, you'll have to use at least two lines (or any other shaped functions). This will require two separate perceptrons. Then, you could use a third perceptron to separate the intermediate results on the basis of sign.
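As a rough sketch of the two-lines-plus-a-third-perceptron idea, the weights below are hand-picked for illustration (they are not learned, and they are not taken from the original answer): one unit acts as OR, one as NAND, and a third combines the two intermediate results.

```python
def perceptron(weights, bias, inputs):
    """Step-activation perceptron: 1 if w.x + b > 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def xor(x1, x2):
    # First line: fires when at least one input is 1 (OR).
    h_or = perceptron((1.0, 1.0), -0.5, (x1, x2))
    # Second line: fires unless both inputs are 1 (NAND).
    h_nand = perceptron((-1.0, -1.0), 1.5, (x1, x2))
    # Third perceptron: fires only when both intermediate units fire (AND).
    return perceptron((1.0, 1.0), -1.5, (h_or, h_nand))


truth_table = [(a, b, xor(a, b)) for a in (0, 1) for b in (0, 1)]
```

Many other weight choices would work just as well; the point is only that each unit on its own is still a plain linear-threshold perceptron.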

I assume by sigmoid you don't actually mean a sigmoid, but something with a local maximum. Whereas the normal perceptron binary classifier is of the form:
f(x) = (1 if w.x+b>0 else 0)
you could have a function:
f(x) = (1 if |w.x+b|<0.5 else 0)
This certainly would work, but would be fairly artificial, in that you effectively are tailoring your model to your dataset, which is bad.
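To make that concrete, here is a small sketch of the band-shaped activation applied to XOR. The weights w = (1, 1) and b = -1 are an assumption chosen by hand to fit this one dataset, which is exactly the kind of tailoring warned about above.

```python
def band_perceptron(weights, bias, inputs, half_width=0.5):
    """Outputs 1 when |w.x + b| < half_width, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if abs(total) < half_width else 0


# Hand-picked parameters (hypothetical): w.x + b is 0 exactly when
# one input is 1 and the other is 0.
WEIGHTS = (1.0, 1.0)
BIAS = -1.0

outputs = [band_perceptron(WEIGHTS, BIAS, (a, b)) for a in (0, 1) for b in (0, 1)]
```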
It is very unlikely that the normal perceptron algorithm would converge for such a function, though I may be mistaken. http://en.wikipedia.org/wiki/Perceptron#Separability_and_convergence You might need to come up with a whole new way to fit the function, which sort of defeats the purpose.
Or you could just use a support vector machine, which is like a perceptron, but is able to handle more complicated cases by using the kernel trick.

Old question, but I want to leave my thoughts (anyone correct me if I'm wrong).
I think you've mixed up the concepts of a linear model and a loss or error function.
The Perceptron is by definition a linear model, so it defines a line/plane/hyperplane which you can use to separate your classes.
The standard Perceptron algorithm extracts the sign of your output, giving -1 or 1:
yhat = sign(w * X + w0)
This is fine and will eventually converge if your data is linearly separable.
To improve this you can use a sigmoid to smooth the output into the range (-1, 1):
yhat = -1 + 2*sigmoid(w * X + w0)
mean_squared_error = mean((Y - yhat)^2)
Then use a numerical optimizer like Gradient Descent to minimize the error over your entire dataset. Here w0, w1, w2, ..., wn are your variables.
Now, if your original data is not linearly separable, you can transform it in a way which makes it linearly separable and then apply any linear model. This works because the model is linear in the weights.
This is basically what models like SVM do under the hood to classify your non-linear data.
PS: I'm learning this stuff too, so experts, don't be mad at me if I said some crap.
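The transform-then-use-a-linear-model idea can be sketched on XOR itself. The feature map `phi` below (appending the product x1*x2 and a constant 1 for the bias) is an assumption picked for illustration; the training loop is the classic perceptron update rule on -1/+1 labels.

```python
def phi(x1, x2):
    """Hypothetical feature map: constant bias term, x1, x2 and the product x1*x2."""
    return (1.0, float(x1), float(x2), float(x1 * x2))


def sign(z):
    return 1 if z > 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_perceptron(samples, labels, max_epochs=1000):
    """Classic perceptron rule: add y*x to w whenever a sample is misclassified."""
    w = [0.0] * len(samples[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # every sample is on the correct side
            break
    return w


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, 1, 1, -1]  # XOR with -1/+1 labels
features = [phi(*p) for p in points]
weights = train_perceptron(features, labels)
predictions = [sign(dot(weights, x)) for x in features]
```

In the transformed space XOR is linearly separable (for example by x1 + x2 - 2*x1*x2 - 0.5 = 0), so the perceptron convergence theorem applies and the loop stops after a finite number of updates.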

Related

what is the numerical method to guarantee safer matrix inversion?

I am trying to develop an algorithm (in the framework of gradient descent) for an SEM (structural equation model) problem. There is a parameter matrix B (n*n) with all its diagonal elements fixed to zero, and a term inv(I-B) (the inverse of I - B) in my objective function. There is no other constraint, such as symmetry, on B.
My question is: how can we make sure (I-B) is not singular during the iterations?
In this problem, because the domain of the objective function is not the whole R^n space, it seems that the strict conditions for the convergence of gradient descent will not be satisfied. Standard textbooks assume the objective has the whole R^n space as its domain. It seems that gradient descent will not have guaranteed convergence.
In the update step of the iterative algorithm, my current implementation checks whether (I-B) is close to singular, and if it is, the step size of the gradient descent is shrunk. Is there any better numerical approach to dealing with this problem?
You can try putting a logarithmic barrier on det(I-B)>0 or det(I-B)<0, depending on which gives you a better result or on any extra information you have about your problem. The gradient of logdet is quite nice: https://math.stackexchange.com/questions/38701/how-to-calculate-the-gradient-of-log-det-matrix-inverse
You can also compute the Fenchel dual so you can potentially use a primal-dual approach.
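A minimal sketch of the step-shrinking idea from the question, combined with the sign constraint on det(I-B) from the answer, might look like the following. This is a backtracking heuristic, not a barrier method, and the names `safe_step`, `tol` and `max_halvings` are hypothetical. For a true log barrier one would instead add a term such as -mu*log|det(I-B)| to the objective, whose gradient with respect to B is mu*inv(I-B)^T.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def i_minus(b):
    """Return the matrix I - B."""
    n = len(b)
    return [[(1.0 if i == j else 0.0) - b[i][j] for j in range(n)] for i in range(n)]


def safe_step(b, grad, step, tol=1e-8, max_halvings=50):
    """Halve the step until det(I - B_new) keeps its sign and stays away from 0."""
    d_old = det(i_minus(b))
    n = len(b)
    for _ in range(max_halvings):
        # The diagonal of B stays fixed at zero, as in the question.
        b_new = [[0.0 if i == j else b[i][j] - step * grad[i][j] for j in range(n)]
                 for i in range(n)]
        d_new = det(i_minus(b_new))
        if d_new * d_old > 0 and abs(d_new) > tol:
            return b_new, step
        step *= 0.5
    return b, 0.0  # no acceptable step found; keep the current iterate


B = [[0.0, 0.5], [0.5, 0.0]]
G = [[0.0, -1.0], [-1.0, 0.0]]  # hypothetical gradient, for illustration only
B_next, used_step = safe_step(B, G, 1.0)
```

In this toy case a full step would flip the sign of det(I-B) and a half step would land exactly on a singular matrix, so the step is halved twice. A determinant threshold is a crude singularity test; a condition-number estimate would be a more robust check in practice.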

How to compute Discrete Fourier Transform?

I've been trying to find some places to help me better understand the DFT and how to compute it, but to no avail. So I need help understanding the DFT and its computation with complex numbers.
Basically, I'm just looking for examples on how to compute DFT with an explanation on how it was computed because in the end, I'm looking to create an algorithm to compute it.
I assume 1D DFT/IDFT ...
All DFTs use this formula:
X(k) = Σ x(n).e^(-i.2.π.k.n/N), summed over n = 0..N-1
where:
X(k) is transformed sample value (complex domain)
x(n) is input data sample value (real or complex domain)
N is number of samples/values in your dataset
This whole thing is usually multiplied by a normalization constant c. As you can see, for a single value you need N computations, so for all samples it is O(N^2), which is slow.
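A direct O(N^2) computation can be sketched as below. It assumes the common convention X(k) = sum over n of x(n).e^(-i.2.π.k.n/N) with no normalization constant (c = 1); other sign and scaling conventions exist.

```python
import cmath


def dft(x):
    """Naive DFT: N output values, each a sum over N input samples."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


spectrum = dft([1, 0, 0, 0])  # a unit impulse has a flat spectrum
```

Python's built-in `complex` type handles the complex arithmetic here, so the real-valued input needs no explicit conversion.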
In my Real<->Complex domain DFT/IDFT in C++ you can also find hints on how to compute a 2D transform with 1D transforms and how to compute an N-point DCT/IDCT from an N-point DFT/IDFT.
Fast algorithms
There are fast algorithms based on splitting this sum into its even-indexed and odd-indexed parts (which gives 2 sums of N/2 terms each). That is still O(N) per single value, but the two halves of the output are the same equations apart from a +/- constant tweak, so one half can be computed directly from the other. This leads to O(N/2) per single value. If you apply this recursively, you get O(log(N)) per single value, so the whole thing becomes O(N.log(N)), which is awesome but also adds this restriction:
All such DFFTs need an input dataset whose size is a power of two !!!
So it can be recursively split. Zero padding to the nearest bigger power of 2 is used for invalid dataset sizes (in audio tech sometimes even phase shift). Look here:
mine Complex->Complex domain DFT,DFFT in C++
some hints on constructing FFT like algorithms
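The even/odd splitting described above can be sketched as a recursive radix-2 FFT. This is a minimal illustration, not the linked C++ code: it uses the e^(-i.2.π.k.n/N) convention without normalization, and `pad_to_power_of_two` is a hypothetical helper for the zero padding.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(x[0])]  # recursion stops here
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle           # first half of the output
        out[k + n // 2] = even[k] - twiddle  # second half reuses the same terms
    return out


def pad_to_power_of_two(x):
    """Zero-pad to the nearest power of two that is >= len(x)."""
    size = 1
    while size < len(x):
        size *= 2
    return list(x) + [0] * (size - len(x))


spectrum = fft(pad_to_power_of_two([0, 1, 0, -1]))
```

The output is written to new lists rather than in place, which keeps the sketch simple at the cost of extra memory.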
Complex numbers
c = a + i*b
c is complex number
a is its real part (Re)
b is its imaginary part (Im)
i is the imaginary unit (i*i=-1)
so the computation is like this
addition:
c0+c1=(a0+i.b0)+(a1+i.b1)=(a0+a1)+i.(b0+b1)
multiplication:
c0*c1=(a0+i.b0)*(a1+i.b1)
=a0.a1+i.a0.b1+i.b0.a1+i.i.b0.b1
=(a0.a1-b0.b1)+i.(a0.b1+b0.a1)
polar form
a = r.cos(θ)
b = r.sin(θ)
r = sqrt(a.a + b.b)
θ = atan2(b,a)
a+i.b = r|θ
sqrt
sqrt(r|θ) = (+/-)sqrt(r)|(θ/2)
sqrt(r.(cos(θ)+i.sin(θ))) = (+/-)sqrt(r).(cos(θ/2)+i.sin(θ/2))
real -> complex conversion:
complex = real+i.0
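For illustration, the rules above can be written out on plain (a, b) pairs. This is only a sketch; in real code Python's built-in `complex` type and the `cmath` module already provide all of it.

```python
import math


def c_add(c0, c1):
    """(a0 + i.b0) + (a1 + i.b1)"""
    return (c0[0] + c1[0], c0[1] + c1[1])


def c_mul(c0, c1):
    """(a0.a1 - b0.b1) + i.(a0.b1 + b0.a1)"""
    a0, b0 = c0
    a1, b1 = c1
    return (a0 * a1 - b0 * b1, a0 * b1 + b0 * a1)


def c_sqrt(c):
    """Principal square root via polar form: sqrt(r)|(theta/2)."""
    a, b = c
    r = math.sqrt(a * a + b * b)
    theta = math.atan2(b, a)
    root_r = math.sqrt(r)
    return (root_r * math.cos(theta / 2), root_r * math.sin(theta / 2))


product = c_mul((1, 2), (3, 4))  # (1 + 2i) * (3 + 4i)
root = c_sqrt((0.0, 2.0))        # one square root of 2i; the other is its negation
```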
[notes]
do not forget that you need to convert data to different array (not in place)
normalization constant on FFT recursion is tricky (usually something like /=log2(N) depends also on the recursion stopping condition)
do not forget to stop the recursion if N=1 or 2 ...
beware FPU can overflow on big datasets (N is big)
here some insights to DFT/DFFT
here 2D FFT and wrapping example
usually Euler's formula is used to compute e^(i.x)=cos(x)+i.sin(x)
in How do I obtain the frequencies of each value in an FFT?
you can find how to obtain the Nyquist frequencies
[edit1] Also I strongly recommend to see this amazing video (I just found):
But what is the Fourier Transform A visual introduction
It describes the (D)FT in a geometric representation. I would change some minor stuff in it, but still it's amazingly simple to understand.

Interpolation of function with accurate values for given points

I have a series of points representing values of a function, an example is below:
The values for X and Y can be real (non-integers). The function is monotonic, non-decreasing.
I want to be able to interpolate / assess the value of the function for any X (e.g. 1.5), so that a continuous function line would look like the following:
This is a standard interpolation problem, so I used Lagrange interpolation so far. It's quite simple and gives good enough results.
The problem with interpolation is that it also interpolates the values that are given as input, so the end results for the input data will be different (e.g. x=1, x=2).
Is there an algorithm that can guarantee that all the input values will have the same value after the interpolation? Linear interpolation is one solution, but it's linear, and the distances between X's don't have to be even (the graph is ugly then).
Please forgive my English / math language, I am not a native speaker.
The Lagrange interpolating polynomial does in fact pass through all n points, http://mathworld.wolfram.com/LagrangeInterpolatingPolynomial.html. For the 1-D problem, though, cubic splines are a preferred interpolator.
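A quick way to check this is to evaluate the Lagrange polynomial at the input points themselves. The sketch below uses made-up, unevenly spaced sample data (the original example data is not available).

```python
def lagrange(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)  # 1 at x = xi, 0 at every other node
        total += yi * basis
    return total


# Hypothetical monotonic data with uneven spacing in x.
xs = [0.0, 1.0, 2.5, 4.0]
ys = [0.0, 1.0, 1.5, 3.0]
at_nodes = [lagrange(xs, ys, x) for x in xs]  # should reproduce ys
between = lagrange(xs, ys, 1.5)
```

If the implementation in use does not reproduce the inputs, it is worth checking for a bug or for a fitting (rather than interpolating) method. Note that the polynomial is not guaranteed to stay monotonic between the nodes.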
If you would rather fit a model (e.g. a linear, quadratic or cubic polynomial, or another function) to your data, then I think you could still put constraints on the coefficients to ensure the approximating function passes through some selected points. Begin by choosing the model, and then solve the least squares fitting problem.

Difference between a linear problem and a non-linear problem? Essence of Dot-Product and Kernel trick

The kernel trick maps a non-linear problem into a linear problem.
My questions are:
1. What is the main difference between a linear and a non-linear problem? What is the intuition behind the difference between these two classes of problem? And how does the kernel trick help to use linear classifiers on a non-linear problem?
2. Why is the dot product so important in the two cases?
Thanks.
When people say linear problem with respect to a classification problem, they usually mean linearly separable problem. Linearly separable means that there is some function that can separate the two classes that is a linear combination of the input variable. For example, if you have two input variables, x1 and x2, there are some numbers theta1 and theta2 such that the function theta1.x1 + theta2.x2 will be sufficient to predict the output. In two dimensions this corresponds to a straight line, in 3D it becomes a plane and in higher dimensional spaces it becomes a hyperplane.
You can get some kind of intuition about these concepts by thinking about points and lines in 2D/3D. Here's a very contrived pair of examples...
This is a plot of a linearly inseparable problem. There is no straight line that can separate the red and blue points.
However, if we give each point an extra coordinate (specifically 1 - sqrt(x*x + y*y)... I told you it was contrived), then the problem becomes linearly separable since the red and blue points can be separated by a 2-dimensional plane going through z=0.
Hopefully, these examples demonstrate part of the idea behind the kernel trick:
Mapping a problem into a space with a larger number of dimensions makes it more likely that the problem will become linearly separable.
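The contrived example can be sketched in a few lines. The sample points below are made up, and the layout (one class near the origin, the other surrounding it) is an assumption about the missing plots.

```python
import math


def lift(x, y):
    """Append the extra coordinate z = 1 - sqrt(x*x + y*y)."""
    return (x, y, 1.0 - math.sqrt(x * x + y * y))


def classify(point):
    # After lifting, the plane z = 0 separates the two classes.
    return "inner" if lift(*point)[2] > 0 else "outer"


# Hypothetical data: no straight line in 2D separates these two groups.
inner = [(0.0, 0.0), (0.3, -0.4), (-0.5, 0.5)]
outer = [(2.0, 0.0), (-1.5, 1.5), (0.0, -3.0)]
inner_labels = [classify(p) for p in inner]
outer_labels = [classify(p) for p in outer]
```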
The second idea behind the kernel trick (and the reason why it is so tricky) is that it is usually very awkward and computationally expensive to work in a very high-dimensional space. However, if an algorithm only uses the dot products between points (which you can think of as distances), then you only have to work with a matrix of scalars. You can implicitly perform the calculations in the higher-dimensional space without ever actually having to do the mapping or handle the higher-dimensional data.
Many classifiers, among them the linear Support Vector Machine (SVM), can only solve problems that are linearly separable, i.e. where the points belonging to class 1 can be separated from the points belonging to class 2 by a hyperplane.
In many cases, a problem that is not linearly separable can be solved by applying a transform phi() to the data points; this transform is said to transform the points to feature space. The hope is that, in feature space, the points will be linearly separable. (Note: This is not the kernel trick yet... stay tuned.)
It can be shown that, the higher the dimension of the feature space, the greater the number of problems that are linearly separable in that space. Therefore, one would ideally want the feature space to be as high-dimensional as possible.
Unfortunately, as the dimension of feature space increases, so does the amount of computation required. This is where the kernel trick comes in. Many machine learning algorithms (among them the SVM) can be formulated in such a way that the only operation they perform on the data points is a scalar product between two data points. (I will denote a scalar product between x1 and x2 by <x1, x2>.)
If we transform our points to feature space, the scalar product now looks like this:
<phi(x1), phi(x2)>
The key insight is that there exists a class of functions called kernels that can be used to optimize the computation of this scalar product. A kernel is a function K(x1, x2) that has the property that
K(x1, x2) = <phi(x1), phi(x2)>
for some function phi(). In other words: We can evaluate the scalar product in the low-dimensional data space (where x1 and x2 "live") without having to transform to the high-dimensional feature space (where phi(x1) and phi(x2) "live") -- but we still get the benefits of transforming to the high-dimensional feature space. This is called the kernel trick.
Many popular kernels, such as the Gaussian kernel, actually correspond to a transform phi() that transforms into an infinite-dimensional feature space. The kernel trick allows us to compute scalar products in this space without having to represent points in this space explicitly (which, obviously, is impossible on computers with finite amounts of memory).
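A small finite-dimensional case makes the identity K(x1, x2) = <phi(x1), phi(x2)> easy to verify. The sketch below uses the homogeneous polynomial kernel of degree 2 on 2-D points, an example chosen for illustration rather than one named in the answer.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map into 3-D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def kernel(x, y):
    """Polynomial kernel of degree 2, evaluated in the original 2-D space."""
    return dot(x, y) ** 2


a, b = (1.0, 2.0), (3.0, 4.0)
via_kernel = kernel(a, b)                # works in the 2-D data space
via_feature_map = dot(phi(a), phi(b))    # works in the 3-D feature space
```

Both routes give the same number, but the kernel never constructs phi(a) or phi(b). For the Gaussian kernel that explicit construction is not possible at all.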
The main difference (for practical purposes) is: A linear problem either does have a solution (and then it's easily found), or you get a definite answer that there is no solution at all. You do know this much, before you even know the problem at all. As long as it's linear, you'll get an answer; quickly.
The intuition behind this is the fact that if you have two straight lines in some space, it's pretty easy to see whether they intersect or not, and if they do, it's easy to know where.
If the problem is not linear -- well, it can be anything, and you know just about nothing.
The dot product of two vectors just means the following: The sum of the products of the corresponding elements. So if your problem is
c1 * x1 + c2 * x2 + c3 * x3 = 0
(where you usually know the coefficients c, and you're looking for the variables x), the left hand side is the dot product of the vectors (c1,c2,c3) and (x1,x2,x3).
The above equation is (pretty much) the very definition of a linear problem, so there's your connection between the dot product and linear problems.
Linear equations are homogeneous, and superposition applies. You can create solutions using combinations of other known solutions; this is one reason why Fourier transforms work so well. Non-linear equations are not homogeneous, and superposition does not apply. Non-linear equations usually have to be solved numerically using iterative, incremental techniques.
I'm not sure how to express the importance of the dot product, but it does take two vectors and returns a scalar. Certainly a solution to a scalar equation is less work than solving a vector or higher-order tensor equation, simply because there are fewer components to deal with.
My intuition in this matter is based more on physics, so I'm having a hard time translating to AI.
I think the following link is also useful:
http://www.simafore.com/blog/bid/113227/How-support-vector-machines-use-kernel-functions-to-classify-data

What's a good weighting function?

I'm trying to perform some calculations on a non-directed, cyclic, weighted graph, and I'm looking for a good function to calculate an aggregate weight.
Each edge has a distance value in the range [1,∞). The algorithm should give greater importance to lower distances (it should be monotonically decreasing), and it should assign the value 0 for the distance ∞.
My first instinct was simply 1/d, which meets both of those requirements. (Well, technically 1/∞ is undefined, but programmers tend to let that one slide more easily than do mathematicians.) The problem with 1/d is that the function cares a lot more about the difference between 1/1 and 1/2 than the difference between 1/34 and 1/35. I'd like to even that out a bit more. I could use √(1/d) or ∛(1/d) or even ∜(1/d), but I feel like I'm missing out on a whole class of possibilities. Any suggestions?
(I thought of ln(1/d), but that goes to -∞ as d goes to ∞, and I can't think of a good way to push that up to 0.)
Later:
I forgot a requirement: w(1) must be 1. (This doesn't invalidate the existing answers; a multiplicative constant is fine.)
perhaps:
exp(-d)
edit: something along the lines of
exp(k(1-d)), k real
will fit your extra requirement (I'm sure you knew that but what the hey).
How about 1/ln (d + k)?
Some of the above answers are versions of a Gaussian distribution which I agree is a good choice. The Gaussian or normal distribution can be found often in nature. It is a B-Spline basis function of order-infinity.
One drawback to using it as a blending function is its infinite support requires more calculations than a finite blending function. A blend is found as a summation of product series. In practice the summation may stop when the next term is less than a tolerance.
If possible form a static table to hold discrete Gaussian function values since calculating the values is computationally expensive. Interpolate table values if needed.
How about this?
w(d) = (1 + k)/(d + k) for some large k
d = 2 + k would be the place where w(d) = 1/2
It seems you are in effect looking for a linear decrease, something along the lines of infinity - d. Obviously this solution is garbage, but since you are probably not using an arbitrary-precision data type for the distance, you could use yourDatatype.MaxValue - d to get a linearly decreasing function for this.
In fact you might consider using (yourDatatype.MaxValue - d) + 1 if you are using doubles, because you could then assign the weight of 0 if your distance is "infinity" (since doubles actually have a value for that.)
Of course you still have to consider implementation details like w(d) = double.infinity or w(d) = integer.MaxValue, but these should be easy to spot if you know the actual data types you are using ;)
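To compare the closed-form candidates suggested above against the stated requirements (w(1) = 1, monotonically decreasing, w(∞) = 0), a quick sketch follows. The default values of k are arbitrary assumptions, and in the 1/ln(d + k) variant k = e - 1 is chosen here so that w(1) = 1.

```python
import math


def w_exp(d, k=0.1):
    """exp(k(1-d)); a larger k decays faster."""
    return math.exp(k * (1.0 - d))


def w_rational(d, k=10.0):
    """(1 + k)/(d + k); reaches 1/2 at d = 2 + k."""
    return (1.0 + k) / (d + k)


def w_log(d):
    """1/ln(d + k) with k = e - 1 so that w(1) = 1; decays very slowly."""
    return 1.0 / math.log(d + math.e - 1.0)


for name, w in (("exp", w_exp), ("rational", w_rational), ("log", w_log)):
    print(name, [round(w(d), 3) for d in (1, 2, 34, 35, math.inf)])
```

Printing a few values side by side shows how differently each function treats the gap between 1 and 2 compared with the gap between 34 and 35, which is the property the question cares about.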
