strategy for fitting angle data - outliers

I have a set of angles. The distribution could be roughly described as:
there are usually several values very close (0.0-1.0 degrees apart) to the correct solution
there are also noisy values that are very far from the correct result, sometimes even in the opposite direction
is there a common solution/strategy for such a problem?
For multidimensional data I would use RANSAC, but I have the impression that it is unusual to apply RANSAC to 1-dimensional data. Another problem is computing the mean of an angle. I have read other posts about how to calculate the mean of angles using vectors, but I wonder if there isn't a particular fitting approach that deals with both issues already.

You can use RANSAC even in this case, all the necessary conditions (minimal samples, error of a data point, consensus set) are met. Your minimal sample will be 1 point, a randomly picked angle (although you can try all of them, might be fast enough). Then, all the angles (data points) with error (you can use just absolute distance, modulo 360) less than some threshold (e.g. 1 degree), will be considered as inliers, i.e. within the consensus set.
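A minimal Python sketch of this 1-D RANSAC (the 1-degree threshold, the input in degrees, and the exhaustive loop over candidates are my assumptions), ending with the vector-based circular mean of the consensus set that the question mentions:

    import numpy as np

    def angular_diff(a, b):
        # absolute difference between angles in degrees, modulo 360
        d = np.abs(a - b) % 360.0
        return np.minimum(d, 360.0 - d)

    def ransac_angle(angles, threshold=1.0):
        angles = np.asarray(angles, dtype=float)
        best_inliers = angles[:1]
        for candidate in angles:                  # minimal sample = 1 angle; try them all
            inliers = angles[angular_diff(angles, candidate) < threshold]
            if inliers.size > best_inliers.size:
                best_inliers = inliers
        # circular mean of the consensus set (the vector trick from the question)
        rad = np.deg2rad(best_inliers)
        return np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())) % 360.0

    print(ransac_angle([359.6, 0.3, 0.7, 181.0, 90.0]))   # close to 0 degrees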
If you want to play with it a bit more, you can make the results more stable by adding some local optimisation, see e.g.:
Lebeda, Matas, Chum: Fixing the Locally Optimized RANSAC, BMVC 2012.
You could try other approaches, e.g. the median, or fitting a mixture of a Gaussian and a uniform distribution, but you would have to deal with the periodicity of the signal somehow, so I guess RANSAC should be your choice.

Related

Determining error between two surfaces given same discrete inputs?

I have, as the output of a machine learning algorithm, a surface in z with known increments along x and y. These points along x and y match exactly the surface I am comparing my algorithm's output against in order to get a metric of fit, or error. I have been struggling to find a good way to calculate this, and can't find any good resources on the available options. I have tried simple pointwise subtraction of the surfaces, taking the absolute value and summing, and I have tried squared and divided versions of this, but each of these runs into different problems. I was wondering if any of you knew of good resources on the different options and which of them work in which situations. Thanks!
If your problem is outliers, compute all absolute height differences and discard the N/2 largest (or another fraction, depending on the usual proportion of outliers). Then take the average of the remaining ones (or the RMS). This is called a trimmed average.
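A minimal NumPy sketch of this trimmed average, assuming the two surfaces come in as same-shaped arrays and that half of the differences are discarded:

    import numpy as np

    def trimmed_error(z_model, z_ref, keep_fraction=0.5):
        # sort absolute height differences and discard the largest ones
        diffs = np.sort(np.abs(z_model - z_ref).ravel())
        kept = diffs[: int(len(diffs) * keep_fraction)]
        return kept.mean()          # or np.sqrt((kept ** 2).mean()) for an RMS version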

Finding optimal solution to multivariable function with non-negligible solution time?

So I have this issue where I have to find the best distribution that, when passed through a function, matches a known surface. I have written a script that creates the distribution given some parameters and spits out a metric comparing the resulting surface to the known one, but this script takes a non-negligible time, so I can't just run through a very large set of parameters to find the optimal set. I looked into the simplex method, and it seems to be the right path, but it's not quite what I need, because I don't exactly have a set of linear equations, and I don't know the constraints on the parameters; I just have one method that gives a single output (and that's all). Can anyone point me in the right direction on how to solve this problem? Thanks!
To quickly go over my process / problem again, I have a set of parameters (at this point 2 but will be expanded to more later) that defines a distribution. This distribution is used to create a surface, which is compared to a known surface, and an error metric is produced. I want to find the optimal set of parameters, but cannot run through an arbitrarily large number of parameters due to the time constraint.
One situation consistent with what you have asked is a model in which you have a reasonably tractable probability distribution which generates an unknown value. This unknown value goes through a complex and not mathematically nice process and generates an observation. Your surface corresponds to the observed probability distribution on the observations. You would be happy finding the parameters that give a good least squares fit between the theoretical and real life surface distribution.
One approximation for the fitting process is that you compute a grid of values in the space output by the probability distribution. Each set of parameters gives you a probability for each point on this grid. The not nice process maps each grid point here to a nearest grid point in the space of the surface. The least squares fit is a quadratic in the probabilities calculated for the first grid, because the probabilities calculated for a grid point in the surface are the sums of the probabilities calculated for values in the first grid that map to something nearer to that point in the surface than any other point in the surface. This means that it has first (and even second) derivatives that you can calculate. If your probability distribution is nice enough you can use the chain rule to calculate derivatives for the least squares fit in the initial parameters. This means that you can use optimization methods to calculate the best fit parameters which require not just a means to calculate the function to be optimized but also its derivatives, and these are generally more efficient than optimization methods which require only function values, such as Nelder-Mead or Torczon Simplex. See e.g. http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math4/optim/package-summary.html.
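The linked package is Apache Commons Math (Java). Purely as an illustration of the point about derivatives, here is a hedged SciPy sketch in which metric() is a made-up stand-in for the slow surface-comparison script:

    import numpy as np
    from scipy.optimize import minimize

    target = np.array([1.5, -0.3])           # hypothetical "true" parameters

    def metric(params):
        # stand-in for the slow surface-comparison script: a smooth error metric
        return float(np.sum((params - target) ** 2))

    def metric_grad(params):
        # analytic gradient of the stand-in metric (via the chain rule in general)
        return 2.0 * (params - target)

    x0 = np.zeros(2)
    res_nm   = minimize(metric, x0, method="Nelder-Mead")            # function values only
    res_grad = minimize(metric, x0, jac=metric_grad, method="BFGS")  # uses derivatives
    print(res_nm.nfev, res_grad.nfev)   # the gradient-based run usually needs far fewer evaluations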
Another possible approach is via something called the EM Algorithm. Here EM stands for Expectation-Maximization. It can be used for finding maximum likelihood fits in cases where the problem would be easy if you could see some hidden state that you cannot actually see. In this case the output produced by the initial distribution might be such a hidden state. One starting point is http://www-prima.imag.fr/jlc/Courses/2002/ENSI2.RNRF/EM-tutorial.pdf.

Theory on how to find the equation of a curve given a variable number of data points

I have recently started working on a project. One of the problems I ran into was converting changing accelerations into velocity. Accelerations at different points in time are provided through sensors. If you get the equation of these data points, the derivative of a certain time (x) on that equation will be the velocity.
I know how to do this on the computer, but how would I get the equation to start with? I have searched around but I have not found any existing programs that can form an equation given a set of points. In the past, I have created a neural net algorithm to form an equation, but it takes an incredibly long time to run.
If someone can link me a program or explain the process of doing this, that would be fantastic.
Sorry if this is in the wrong forum. I would post into math, but a programming background will be needed to know the realm of possibility of what a computer can do quickly.
This started out as a comment but ended up being too big.
Just to make sure you're familiar with the terminology...
Differentiation takes a function f(t) and spits out a new function f'(t) that tells you how f(t) changes with time (i.e. f'(t) gives the slope of f(t) at time t). This takes you from displacement to velocity or from velocity to acceleration.
Integration takes a function f(t) and spits out a new function F(t) which measures the area under f(t) from the beginning of time up until a given point t. What's not obvious at first is that integration is actually the reverse of differentiation, a fact called the Fundamental Theorem of Calculus. So integration takes you from acceleration to velocity or from velocity to displacement.
You don't need to understand the rules of calculus to do numerical integration. The simplest (and most naive) method for integrating a function numerically is to approximate the area by dividing it up into small slices between time points and summing the areas of rectangles. This approximating sum is called a Riemann sum.
With wide slices, this tends to overshoot and undershoot certain parts of the function. A more accurate but still very simple method is the trapezoid rule, which also approximates the function with a series of slices, except the tops of the slices are straight lines between the function values rather than constant values.
Still more complicated, but a better approximation yet, is Simpson's rule, which approximates the function with parabolas between time points.
You can think of each of these methods as getting a better approximation of the integral because each uses more information about the function. The first method uses just one data point per slice (a constant flat line), the second uses two data points per slice (a straight line), and the third uses three data points per slice (a parabola).
You could read up on the math behind these methods here or in the first page of this pdf.
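As a small illustration (my addition, not part of the answer), here is how the three rules compare in NumPy on a sampled acceleration a(t) = 2t, whose exact integral over [0, 1] is 1:

    import numpy as np

    t = np.linspace(0.0, 1.0, 11)        # 11 samples -> 10 slices (even count, as Simpson's rule needs)
    a = 2.0 * t                          # sampled acceleration
    h = t[1] - t[0]

    riemann   = h * np.sum(a[:-1])                                # left rectangles
    trapezoid = h * (np.sum(a) - 0.5 * (a[0] + a[-1]))            # trapezoid rule
    simpson   = (h / 3.0) * (a[0] + a[-1]
                             + 4.0 * np.sum(a[1:-1:2])
                             + 2.0 * np.sum(a[2:-1:2]))           # Simpson's rule
    print(riemann, trapezoid, simpson)   # 0.9, then 1.0 for both of the better rules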
I agree with the comments that numerical integration is probably what you want. In case you still want a function going through your data, let me further argue against doing that.
It's usually a bad idea to find a curve that goes exactly through some given points. In almost any applied math context you have to accept that there is a little noise in the inputs, and a curve going exactly through the points may be very sensitive to noise. This can produce garbage outputs. Finding a curve going exactly through a set of points is asking for overfitting to get a function that memorizes rather than understands the data, and does not generalize.
For example, take the points (0,0), (1,1), (2,4), (3,9), (4,16), (5,25), (6,36). These are seven points on y=x^2, which is fine. The value of x^2 at x=-1 is 1. Now what happens if you replace (3,9) with (2.9,9.1)? There is a sixth order polynomial passing through all 7 points,
4.66329x - 8.87063x^2 + 7.2281x^3 - 2.35108x^4 + 0.349747x^5 - 0.0194304x^6.
The value of this at x=-1 is -23.4823, very far from 1. While the curve looks ok between 0 and 2, in other examples you can see large oscillations between the data points.
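A short NumPy sketch that reproduces this perturbation experiment, if you want to check the numbers yourself:

    import numpy as np

    x = np.array([0, 1, 2, 2.9, 4, 5, 6], dtype=float)   # (3, 9) replaced by (2.9, 9.1)
    y = np.array([0, 1, 4, 9.1, 16, 25, 36], dtype=float)
    coeffs = np.polyfit(x, y, 6)          # degree-6 polynomial through all 7 points
    print(np.polyval(coeffs, -1.0))       # far from the clean value x^2 = 1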
Once you accept that you want an approximation, not a curve going exactly through the points, you have what is known as a regression problem. There are many types of regression. Typically, you choose a set of functions and a way to measure how well a function approximates the data. If you use a simple set of functions like lines (linear regression), you just find the best fit. If you use a more complicated family of functions, you should use regularization to penalize overly complicated functions such as high degree polynomials with large coefficients that memorize the data. If you either use a simple family or regularization, the function tends not to change much when you add or withhold a few data points, which indicates that it is a meaningful trend in the data.
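As a hedged sketch of what regularization looks like here, this fits the same seven points with a polynomial basis and an L2 (ridge) penalty; the regularization strength lam is made up:

    import numpy as np

    x = np.array([0, 1, 2, 2.9, 4, 5, 6], dtype=float)
    y = np.array([0, 1, 4, 9.1, 16, 25, 36], dtype=float)
    X = np.vander(x, N=7, increasing=True)         # polynomial basis 1, x, ..., x^6
    lam = 1.0                                       # made-up regularization strength
    # closed-form ridge solution: (X^T X + lam * I)^-1 X^T y
    coeffs = np.linalg.solve(X.T @ X + lam * np.eye(7), X.T @ y)
    print(coeffs)   # increasing lam shrinks the coefficients toward zero and smooths the fit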
Unfortunately, integrating accelerometer data to get velocity is a numerically unstable problem. For most applications, your error will diverge far too soon to get results of any practical value.
Recall that:
    v(t) = v(0) + integral from 0 to t of a(τ) dτ
So the measurement errors get accumulated by the integration. However well you fit a function to your accelerometer data, you will still essentially be integrating a piecewise interpolation of the underlying acceleration function:
    v(t_n) ≈ v(t_0) + sum over i of (a(t_i) + e_i) * Δt
where the error terms e_i from each integration step add up!
Typically you will see wildly inaccurate results after just a few seconds.
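A small simulation of why this happens (the sample rate, noise level, and sensor bias below are assumed values): integrating an acceleration that is really zero still yields a velocity that drifts steadily away from zero.

    import numpy as np

    fs = 100.0                                    # assumed sample rate, Hz
    t = np.arange(0.0, 60.0, 1.0 / fs)            # one minute of data
    bias = 0.01                                   # assumed constant sensor bias, m/s^2
    noise = np.random.normal(0.0, 0.05, t.size)   # assumed measurement noise, m/s^2
    a_measured = bias + noise                     # the true acceleration is zero
    v = np.cumsum(a_measured) / fs                # rectangle-rule integration
    print(v[-1])    # velocity error after 60 s, roughly bias * 60 = 0.6 m/s instead of 0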

why overfitting gives a bad hypothesis function

In linear or logistic regression, if we find a hypothesis function which fits the training set perfectly, it seems like that should be a good thing, because in that case we have used 100% of the given information to predict new information.
Yet this is called overfitting and is said to be a bad thing.
By making the hypothesis function simpler, we may actually be increasing the noise instead of decreasing it.
Why is it so?
Overfitting occurs when you try "too hard" to make the examples in the training set fit the classification rule.
It is considered a bad thing for two main reasons:
The data might have noise. Trying too hard to classify 100% of the examples correctly makes the noise count as signal and gives you a bad rule, while ignoring this noise would usually be much better.
Remember that the classified training set is just a sample of the real data. A solution that fits it perfectly is usually more complex than what you would have got if you had tolerated a few wrongly classified samples. According to Occam's Razor, you should prefer the simpler solution, so ignoring some of the samples will be better.
Example: when one point in an otherwise cleanly separable data set falls on the wrong side, Occam's razor says to tolerate the misclassified sample, assume it is noise or insignificant, and adopt the simple decision boundary rather than a contorted one that captures it.
Because you actually didn't "learn" anything from your training set, you've just fitted to your data.
Imagine you have a one-dimensional regression:
x_1 -> y_1
...
x_n -> y_n
The function defined as
f(x) = y_i  if x = x_i for some i,
       0    otherwise
will give you a perfect fit, but it's actually useless.
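A tiny Python sketch of this memorizing "model", with made-up training pairs:

    # hypothetical training pairs (x_i, y_i)
    train = {0.0: 0.1, 0.5: 0.6, 1.0: 0.9}

    def f(x):
        # perfect on the training set, 0 everywhere else
        return train.get(x, 0.0)

    print(f(0.5))    # matches the training value exactly
    print(f(0.25))   # useless for anything unseen: always 0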
Hope, this helped a bit:)
Assuming that your regression accounts for all sources of deviation in your data, then you might argue that your regression perfectly fits the data. However, if you know all (and I mean all) of the influences in your system, then you probably don't need a regression. You likely have an analytic solution that perfectly predicts new information.
In actuality, the information you possess will fall short of this perfect level. Noise (measurement error, partial observability, etc) will cause deviation in your data. In response, a regression (or other fitting mechanism) should seek the general trend of the data while minimizing the influence of noise.
Actually, the statement is not quite correct as written. It is perfectly fine to match 100% of your data if your hypothesis function is linear. Every continuous nonlinear function may be approximated locally by a linear function, which gives important information about its local behavior.
It is also fine to match 100 points of data to a quadratic curve if that data matches 100%. You can have high confidence that you are not overfitting your data, since the data consistently shows quadratic behavior.
However, one can always get a 100% fit by using a polynomial of high enough degree. Even without the noise that others have pointed out, though, you shouldn't assume your data has some high-degree polynomial behavior without some kind of theoretical or experimental confirmation of that hypothesis. Two good indicators of genuine polynomial behavior are:
You have some theoretical reason for expecting the data to grow as x^n in one of the directional limits.
You have data that has been supporting a fixed degree polynomial fit as more and more data has been collected.
Notice, though, that even though exponential and reciprocal relationships may have data that fits a polynomial of high enough degree, they don't tend to satisfy either of the two conditions above.
The point is that your data fit needs to be useful to prediction. You always know that a linear fit will give information locally, but that information becomes more useful the more points are fit. Even if there are only two points and noise, a linear fit still gives the best theoretical look at the data collected so far, and establishes the first expectations of the data. Beyond that, though, using a quadratic fit for three points or a cubic fit for four is not validly giving more information, as it assumes both local and asymptotic behavior information with the addition of one point. You need justification for your hypothesis function. That justification can come from more points or from theory.
(A third reason that sometimes comes up is
You have theoretical and experimental reason to believe that error and noise do not contribute more than some bounds, and you can take a polynomial hypothesis to look at local derivatives and the behavior needed to match the data.
This is typically used in understanding data to build theoretical models without having a good starting point for theory. You should still strive to use the smallest polynomial degree possible, and look to substitute out patterns in the coefficients with what they may indicate (reciprocal, exponential, gaussian, etc.) in infinite series.)
Try imagining it this way. You have a function from which you pick n different values to represent a sample / training set:
y(n) = x(n), n is element of [0, 1]
But, since you want to build a robust model, you want to add a little noise to your training set, so you actually add a little noise when generating the data:
data(n) = y(n) + noise(n) = x(n) + u(n)
where u(n) denotes zero-mean random noise with standard deviation 1 (e.g. Gaussian N(0, 1)): quite simply, a noise signal which is most likely to take a value near 0, and less likely to take values the farther they are from 0.
And then you draw, let's say, 10 points to be your training set. If there were no noise, they would all lie on the line y = x. Since there is noise, the lowest-degree polynomial that can pass exactly through all of them is generally of 9th order, a function like: y = a_9 * x^9 + a_8 * x^8 + ... + a_1 * x + a_0.
If you instead estimated a trend from the training set, you would probably get a much simpler function than the 9th-order polynomial, and it would be closer to the real function.
Consider further that your real function can have values outside the [0, 1] interval but for some reason the samples for the training set could only be collected from this interval. Now, a simple estimation would probably act significantly better outside the interval of the training set, while if we were to fit the training set perfectly, we would get an overfitted function that meandered with lots of ups and downs all over :)
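A hedged NumPy sketch of this thought experiment (the random seed, noise level, and test point are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 1.0, 10))        # 10 training inputs in [0, 1]
    y = x + rng.normal(0.0, 0.05, 10)             # y = x plus assumed noise

    p_overfit = np.polyfit(x, y, 9)               # passes (almost) exactly through every point
    p_simple  = np.polyfit(x, y, 1)               # the "simple estimation"

    x_test = 1.5                                   # outside the training interval
    print(np.polyval(p_overfit, x_test))           # typically wildly off
    print(np.polyval(p_simple, x_test))            # close to the true value 1.5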
Overfitting is termed bad because of the way it trades bias against variance. The overfit solution fits the training data 100%, but with any small addition of data points the model changes drastically; this sensitivity is the variance of the model. Hence the bias-variance tradeoff, where we try to balance both factors so that the model does not change drastically on small data changes but still predicts the output reasonably well.

How to 'smooth' data and calculate line gradient?

I'm reading data from a device which measures distance. My sample rate is high so that I can measure large changes in distance (i.e. velocity) but this means that, when the velocity is low, the device delivers a number of measurements which are identical (due to the granularity of the device). This results in a 'stepped' curve.
What I need to do is to smooth the curve in order to calculate the velocity. Following that I then need to calculate the acceleration.
How to best go about this?
(Sample rate up to 1000Hz, calculation rate of 10Hz would be ok. Using C# in VS2005)
The wikipedia entry from moogs is a good starting point for smoothing the data. But it does not help you in making a decision.
It all depends on your data, and the needed processing speed.
Moving Average
Will flatten the top values. If you are interested in the minimum and maximum values, don't use this. Also, I think using the moving average will influence your measurement of the acceleration, since it flattens your data (a bit), so the acceleration will appear smaller. It all comes down to the needed accuracy.
Savitzky–Golay
Fast algorithm. As fast as the moving average. That will preserve the heights of peaks. Somewhat harder to implement. And you need the correct coefficients. I would pick this one.
Kalman filters
If you know the distribution, this can give you good results (it is used in GPS navigation systems). Maybe somewhat harder to implement. I mention this because I have used them in the past. But they are probably not a good choice for a starter in this kind of stuff.
The above will reduce noise on your signal.
The next thing you have to do is detect the start and end point of the "acceleration". You could do this by taking the derivative of the original signal. The point(s) where the derivative crosses zero are probably the peaks in your signal, and might indicate the start and end of the acceleration.
You can then take the second derivative to get the minimum and maximum acceleration itself.
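The question is in C#, but as a sketch of this pipeline here is the same idea with SciPy's savgol_filter, which can do the smoothing and both derivatives in one step; the window length, polynomial order, and input file name are assumptions:

    import numpy as np
    from scipy.signal import savgol_filter

    fs = 1000.0                                   # sample rate from the question, Hz
    distance = np.loadtxt("distance.txt")         # hypothetical file of raw distance samples

    smoothed = savgol_filter(distance, window_length=101, polyorder=3)
    velocity = savgol_filter(distance, window_length=101, polyorder=3,
                             deriv=1, delta=1.0 / fs)       # first derivative of distance
    accel    = savgol_filter(distance, window_length=101, polyorder=3,
                             deriv=2, delta=1.0 / fs)       # second derivative of distance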
You need a smoothing filter, the simplest would be a "moving average": just calculate the average of the last n points.
The question here is, how to determine n, can you tell us more about your application?
(There are other, more complicated filters. They vary on how they preserve the input data. A good list is in Wikipedia)
Edit!: For 10Hz, average the last 100 values.
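A minimal sketch of that moving average (block size n = 100 for a 1000 Hz input reported at 10 Hz):

    import numpy as np

    def moving_average(samples, n=100):
        # average each window of the last n samples
        kernel = np.ones(n) / n
        return np.convolve(samples, kernel, mode="valid")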
Moving averages are generally terrible - but work well for white noise. Both moving averages and Savitzky-Golay filters boil down to a correlation - and are therefore very fast and can be implemented in real time. If you need higher-order information like first and second derivatives, SG is a good choice. The magic of SG lies in the constant correlation coefficients needed for the filter - once you have decided the length and degree of the polynomial to fit locally, the coefficients only need to be found once. You can compute them using R (sgolay) or Matlab.
You can also estimate a noisy signal's first derivative via the Savitzky-Golay best-fit polynomials - these are sometimes called Savitzky-Golay derivatives - and typically give a good estimate of the first derivative.
Kalman filtering can be very effective, but it's heavier computationally - it's hard to beat a short convolution for speed!
Paul
CenterSpace Software
In addition to the above articles, have a look at Catmull-Rom Splines.
You could use a moving average to smooth out the data.
In addition to GvS's excellent answer above, you could also consider smoothing / reducing the stepping effect of your averaged results using some general curve fitting such as cubic or quadratic splines.
