Smoothing values over time: moving average or something better? - algorithm

I'm coding something at the moment where I'm taking a bunch of values over time from a hardware compass. This compass is very accurate and updates very often, with the result that if it jiggles slightly, I end up with the odd value that's wildly inconsistent with its neighbours. I want to smooth those values out.
Having done some reading around, it would appear that what I want is a high-pass filter, a low-pass filter or a moving average. Moving average I can get down with: just keep a history of the last 5 values or so, and use the average of those values downstream in my code where I once used just the most recent value.
That should, I think, smooth out those jiggles nicely, but it strikes me that it's probably quite inefficient, and this is probably one of those Known Problems to Proper Programmers to which there's a really neat Clever Math solution.
I am, however, one of those awful self-taught programmers without a shred of formal education in anything even vaguely related to CompSci or Math. Reading around a bit suggests that this may be a high or low pass filter, but I can't find anything that explains in terms comprehensible to a hack like me what the effect of these algorithms would be on an array of values, let alone how the math works. The answer given here, for instance, technically does answer my question, but only in terms comprehensible to those who would probably already know how to solve the problem.
It would be a very lovely and clever person indeed who could explain the sort of problem this is, and how the solutions work, in terms understandable to an Arts graduate.

If you are trying to remove the occasional odd value, a low-pass filter is the best of the three options that you have identified. Low-pass filters allow low-speed changes such as the ones caused by rotating a compass by hand, while rejecting high-speed changes such as the ones caused by bumps on the road, for example.
A moving average will probably not be sufficient, since the effects of a single "blip" in your data will affect several subsequent values, depending on the size of your moving average window.
If the odd values are easily detected, you may even be better off with a glitch-removal algorithm that completely ignores them:
if (abs(thisValue - averageOfLast10Values) > someThreshold)
{
    // Glitch detected: fall back to the recent average instead of the outlier.
    thisValue = averageOfLast10Values;
}
Here is a quick graph to illustrate:
The first graph is the input signal, with one unpleasant glitch. The second graph shows the effect of a 10-sample moving average. The final graph is a combination of the 10-sample average and the simple glitch detection algorithm shown above. When the glitch is detected, the 10-sample average is used instead of the actual value.
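A rough Python sketch of that combination (the window size and threshold here are arbitrary illustration values, not from the answer):
from collections import deque

def smooth(samples, window=10, threshold=30.0):
    # 10-sample moving average with simple glitch rejection.
    history = deque(maxlen=window)
    output = []
    for value in samples:
        if len(history) == window:
            avg = sum(history) / window
            if abs(value - avg) > threshold:
                value = avg          # glitch detected: substitute the average
        history.append(value)
        output.append(sum(history) / len(history))
    return output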

If your moving average has to be long in order to achieve the required smoothing, and you don't really need any particular shape of kernel, then you're better off if you use an exponentially decaying moving average:
a(i+1) = tiny*data(i+1) + (1.0-tiny)*a(i)
where you choose tiny to be an appropriate constant (e.g. if you choose tiny = 1/N, it will have roughly the same amount of averaging as a window of size N, but distributed differently over older points).
Anyway, since the next value of the moving average depends only on the previous one and your data, you don't have to keep a queue or anything. And you can think of this as doing something like, "Well, I've got a new point, but I don't really trust it, so I'm going to keep 80% of my old estimate of the measurement, and only trust this new data point 20%". That's pretty much the same as saying, "Well, I only trust this new point 20%, and I'll use 4 other points that I trust the same amount", except that instead of explicitly taking the 4 other points, you're assuming that the averaging you did last time was sensible so you can use your previous work.
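A minimal Python sketch of this update rule (tiny = 0.2 here, matching the 20% example; the class name is just for illustration):
class ExponentialAverage:
    # a(i+1) = tiny * data(i+1) + (1 - tiny) * a(i)
    def __init__(self, tiny=0.2):
        self.tiny = tiny
        self.value = None
    def update(self, sample):
        if self.value is None:
            self.value = sample      # seed with the first data point
        else:
            self.value = self.tiny * sample + (1.0 - self.tiny) * self.value
        return self.value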

"Moving average I can get down with ... but it strikes me that it's probably quite inefficient."
There's really no reason a moving average should be inefficient. You keep the number of data points you want in some buffer (like a circular queue). On each new data point, you pop the oldest value and subtract it from a sum, and push the newest and add it to the sum. So every new data point really only entails a pop/push, an addition and a subtraction. Your moving average is always this shifting sum divided by the number of values in your buffer.
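For example, a minimal Python version of that running-sum moving average (the buffer size is arbitrary):
from collections import deque

class MovingAverage:
    # O(1) per sample: keep a running sum over a fixed-size buffer.
    def __init__(self, size=5):
        self.buffer = deque(maxlen=size)
        self.total = 0.0
    def update(self, sample):
        if len(self.buffer) == self.buffer.maxlen:
            self.total -= self.buffer[0]     # value about to be evicted
        self.buffer.append(sample)
        self.total += sample
        return self.total / len(self.buffer)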
It gets a little trickier if you're receiving data concurrently from multiple threads, but since your data is coming from a hardware device that seems highly doubtful to me.
Oh and also: awful self-taught programmers unite! ;)

An exponentially decaying moving average can be calculated "by hand", keeping only the running trend value, if you pick convenient constants. See http://www.fourmilab.ch/hackdiet/e4/ for an idea of how to do this quickly with pen and paper if you are looking for an "exponentially smoothed moving average with 10% smoothing". But since you have a computer, you probably want to be doing binary shifting as opposed to decimal shifting ;)
This way, all you need is a variable for your current value and one for the average. The next average can then be calculated from that.
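A hedged sketch of that binary-shift trick with integer samples (k = 3, i.e. 1/8 smoothing, is an arbitrary choice):
def ema_shift(avg, sample, k=3):
    # avg += (sample - avg) / 2**k, using only a shift, a subtraction and an addition.
    return avg + ((sample - avg) >> k)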

There's a technique called a range gate that works well with low-occurrence spurious samples. Assuming the use of one of the filter techniques mentioned above (moving average, exponential), once you have "sufficient" history (one time constant) you can test each new, incoming data sample for reasonableness before it is added to the computation.
Some knowledge of the maximum reasonable rate of change of the signal is required. The raw sample is compared to the most recent smoothed value, and if the absolute value of that difference is greater than the allowed range, that sample is thrown out (or replaced with some heuristic, e.g. a prediction based on slope/differential, or the "trend" prediction value from double exponential smoothing).
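A minimal sketch of that range-gate test in Python, assuming max_delta has already been derived from the maximum reasonable rate of change and the sample interval:
def range_gate(sample, smoothed, max_delta, history_ready):
    # Reject samples that differ from the current smoothed value by more than max_delta.
    if history_ready and abs(sample - smoothed) > max_delta:
        return smoothed              # or substitute a slope-based prediction
    return sample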

Related

Multiple parameter optimization with a stochastic element

I am looking for a method to find the best parameters for a simulation. It's about break shots in billiards/pool. A shot is defined by 7 parameters; I can simulate the shot and rate the outcome, and I would like to compute the best parameters.
I have found the following link here:
Multiple parameter optimization with lots of local minima
suggesting 4 kinds of algorithms. In the pool simulator I am using, the shots are altered by a small random value each time they are simulated. If I simulate the same shot twice, the outcome will be different. So I am looking for an algorithm like the ones in the link above, only with the addition of a stochastic element, optimizing the 7 parameters so that they will on average yield the best outcome, i.e. a break shot that will most likely be a success. My initial idea was to simulate the shot 100 or 1,000 times and just take the average as the rating for the algorithms above, but I still feel like there is a better way. Does anyone have an idea?
The 7 parameters are continuous but within different ranges (one from 0 to 10, another from 0.0 to 0.028575 and so on).
Thank you
At least for some of the algorithms, simulating the same shot repeatedly might not be necessary. As long as your alternatives have some form of momentum, like in the swarm simulation approach, you can let that be affected by the outcome of each individual simulation. In that case, a single unlucky simulation would slow the movement in parameter space only slightly, whereas a serious loss of quality should be enough to stop and reverse the movement. Those algorithms which don't use momentum might be tweaked to have momentum. If not, then repeated simulation seems the best approach. Unless you can get your hands on the internals of the simulator, and rate the shot as a whole without having to simulate it over and over again.
You can use the algorithms you mentioned in your non-deterministic scenario with independent stochastic runs. Your idea of repeated simulations is good; you can read more about how many repeats you might need (unfortunately, there is no trivial answer). If you are not much into maths, and the runs are fast, do 1,000 repeats, then 10,000 repeats, and see whether the results differ much. If they do, collect more samples; if not, you are probably on the safe side (by the central limit theorem, the results converge).
Further, do not just consider the average! Make sure to look at the standard deviation of each algorithm's results; you might want to use box plots to compare their quartiles. If you rely on the average only, you could pick an algorithm that produces highly variable results, sometimes excellent and sometimes terrible.
I don't know what language you are using, but if you use Java, I am maintaining a tool that could simplify your "Monte Carlo" style experiments.
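As a rough Python sketch of the repeated-simulation idea above (simulate_shot and the repeat count are placeholders, not something from the question):
import statistics

def rate_parameters(simulate_shot, params, repeats=1000):
    # Run the noisy simulator repeatedly; report both the mean rating and its spread.
    ratings = [simulate_shot(params) for _ in range(repeats)]
    return statistics.mean(ratings), statistics.stdev(ratings)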

robust online algorithm for semi-variance

I'm looking for the equivalent of Welford's algorithm for the online computation of semi-variance (downside partial variance). Does anyone know of a good reference? Does such an algorithm even exist?
Edit: The case where the semi-variance is taken relative to a fixed target is trivial. The problem is calculating the semi-variance in relation to the mean.
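For reference, the trivial fixed-target case can be sketched in Python roughly like this (here dividing by the total count n; other conventions divide only by the number of downside observations):
class FixedTargetSemiVariance:
    # Online downside semi-variance relative to a fixed target.
    def __init__(self, target):
        self.target = target
        self.n = 0
        self.sum_sq = 0.0
    def update(self, x):
        self.n += 1
        d = min(x - self.target, 0.0)    # only downside deviations contribute
        self.sum_sq += d * d
        return self.sum_sq / self.n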
I believe the answer is one does not exist and I'm going to try to outline a proof of why this is so.
Consider a 'useful' online algorithm to be defined by two criteria:
It must have fixed memory requirements during processing.
Each update should take a fixed amount of time.
This is stricter than the literal definition of a sequential/incremental/online algorithm, which really just requires that data can be passed in one piece at a time. However, consider that if either 1) or 2) were not true, then after processing a large enough number of elements, the memory or time required to run the algorithm would eventually become infeasible. Usually, one of the reasons online algorithms are used is that they can run continuously without fear of the performance slowly getting worse. Also, note that there are online algorithms for calculating the mean and variance that satisfy both 1 and 2, and I think that's what we are aiming to achieve.
Now to the problem posed. During processing, the mean will change with every bit of new data. That in turn means the set of observations that fall below the mean will change. When this happens, we need to adjust our running semi-variance according to the set "delta", defined as the elements that are in one of the two sets (elements below the old mean, elements below the new mean) but not in both. We will have to calculate this delta in the process of adjusting the old semi-variance to the new semi-variance in the presence of new data.
Now let's consider the complexity of calculating this set delta. We will need to find all elements that fall between the old mean and the new mean. We will always keep track of the old mean, while the new mean can be calculated incrementally in fixed time so they pose no problem. However to calculate the delta itself, there is no way to do it other than requiring us to keep track of all the previous elements in our set. This immediately breaks the memory condition of an online algorithm. Secondly, even if we keep the previous elements in our set sorted, the best speed we can achieve to find those that are between the old mean and new mean is O(log(number of elements)), which is worse than fixed. So eventually, with enough elements, the online algorithm will not only require more memory than we have, but it will also require more time.
http://www3.sympatico.ca/jean-v.cote/computation_of_semi-variance.pdf
P.S.: This is not an incremental computation. I have another idea. I will keep you posted.

Multiple parameter optimization with lots of local minima

I'm looking for algorithms to find a "best" set of parameter values. The function in question has a lot of local minima and changes very quickly. To make matters even worse, testing a set of parameters is very slow - on the order of 1 minute - and I can't compute the gradient directly.
Are there any well-known algorithms for this kind of optimization?
I've had moderate success with just trying random values. I'm wondering if I can improve the performance by making the random parameter chooser have a lower chance of picking parameters close to ones that had produced bad results in the past. Is there a name for this approach so that I can search for specific advice?
More info:
Parameters are continuous
There are on the order of 5-10 parameters. Certainly not more than 10.
How many parameters are there -- e.g., how many dimensions in the search space? Are they continuous or discrete -- e.g., real numbers, integers, or just a few possible values?
Approaches that I've seen used for these kind of problems have a similar overall structure - take a large number of sample points, and adjust them all towards regions that have "good" answers somehow. Since you have a lot of points, their relative differences serve as a makeshift gradient.
Simulated Annealing: The classic approach. Take a bunch of points, probabilistically move some to a neighbouring point chosen at random, depending on how much better it is.
Particle Swarm Optimization: Take a "swarm" of particles with velocities in the search space, probabilistically move each particle randomly; if a move is an improvement, let the whole swarm know.
Genetic Algorithms: This is a little different. Rather than using the neighbourhood information like the above, you take the best results each time and "cross-breed" them, hoping to get the best characteristics of each.
The wikipedia links have pseudocode for the first two; GA methods have so much variety that it's hard to list just one algorithm, but you can follow links from there. Note that there are implementations for all of the above out there that you can use or take as a starting point.
Note that all of these -- and really any approach to this large-dimensional search problem -- are heuristics, which means they have parameters that have to be tuned to your particular problem. That can be tedious.
By the way, the fact that the function evaluation is so expensive can be made to work for you a bit; since all the above methods involve lots of independent function evaluations, that piece of the algorithm can be trivially parallelized with OpenMP or something similar to make use of as many cores as you have on your machine.
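As a concrete illustration of the first approach, here is a bare-bones simulated annealing sketch in Python for continuous parameters (all constants are arbitrary and would need tuning, as noted above):
import math, random

def anneal(f, x0, step=0.1, t0=1.0, cooling=0.95, iters=1000):
    # Minimize f by random perturbations, accepting some uphill moves while "hot".
    x, fx, t = list(x0), f(x0), t0
    best_x, best_f = list(x), fx
    for _ in range(iters):
        cand = [xi + random.gauss(0.0, step) for xi in x]
        fc = f(cand)
        if fc < fx or random.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = list(x), fx
        t *= cooling                     # cool down gradually
    return best_x, best_f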
Your situation seems to be similar to that of the poster of Software to Tune/Calibrate Properties for Heuristic Algorithms, and I would give you the same advice I gave there: consider a Metropolis-Hastings like approach with multiple walkers and a simulated annealing of the step sizes.
The difficulty in using Monte Carlo methods in your case is the expensive evaluation of each candidate. How expensive, compared to the time you have at hand? If you need a good answer in a few minutes, this isn't going to be fast enough. If you can leave it running overnight, it'll work reasonably well.
Given a complicated search space, I'd recommend a random initial distribution. Your final answer may simply be the best individual result recorded during the whole run, or the mean position of the walker with the best result.
Don't be put off that I was discussing maximizing there and you want to minimize: the figure of merit can be negated or inverted.
I've tried Simulated Annealing and Particle Swarm Optimization. (As a reminder, I couldn't use gradient descent because the gradient cannot be computed).
I've also tried an algorithm that does the following:
Pick a random point and a random direction
Evaluate the function
Keep moving along the random direction for as long as the result keeps improving, speeding up on every successful iteration.
When the result stops improving, step back and instead attempt to move into an orthogonal direction by the same distance.
This "orthogonal direction" was generated by creating a random orthogonal matrix (adapted this code) with the necessary number of dimensions.
If moving in the orthogonal direction improved the result, the algorithm just continued with that direction. If none of the directions improved the result, the jump distance was halved and a new set of orthogonal directions would be attempted. Eventually the algorithm concluded it must be in a local minimum, remembered it and restarted the whole lot at a new random point.
This approach performed considerably better than Simulated Annealing and Particle Swarm: it required fewer evaluations of the (very slow) function to achieve a result of the same quality.
Of course my implementations of S.A. and P.S.O. could well be flawed - these are tricky algorithms with a lot of room for tweaking parameters. But I just thought I'd mention what ended up working best for me.
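A rough sketch (not the poster's actual code) of that orthogonal-direction idea, using numpy's QR factorization to generate a random orthogonal set of directions:
import numpy as np

def random_directions(dim):
    # Columns of Q form a random orthogonal basis; include both signs of each.
    q, _ = np.linalg.qr(np.random.randn(dim, dim))
    return [q[:, i] for i in range(dim)] + [-q[:, i] for i in range(dim)]

def direction_search(f, x0, step=1.0, min_step=1e-3, accel=2.0):
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    while step > min_step:
        improved = False
        for d in random_directions(len(x)):
            s = step
            while True:
                cand = x + s * d
                fc = f(cand)
                if fc < fx:
                    x, fx = cand, fc
                    s *= accel           # accelerate while the result keeps improving
                    improved = True
                else:
                    break
            if improved:
                break                    # continue from the improved point
        if not improved:
            step /= 2.0                  # nothing helped: halve the jump distance
    return x, fx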
I can't really help you with finding an algorithm for your specific problem.
However in regards to the random choosing of parameters I think what you are looking for are genetic algorithms. Genetic algorithms are generally based on choosing some random input, selecting those, which are the best fit (so far) for the problem, and randomly mutating/combining them to generate a next generation for which again the best are selected.
If the function is more or less continuous (that is, small mutations of good inputs generally won't produce bad inputs, where "small" is somewhat problem-dependent), this would work reasonably well for your problem.
There is no generalized way to answer your question. There are lots of books/papers on the subject, but you'll have to choose your path according to your needs, which are not clearly stated here.
Some things to know, however - 1min/test is way too much for any algorithm to handle. I guess that in your case, you must really do one of the following:
get 100 computers to cut your parameter testing time to some reasonable time
really try to work out your parameters by hand and mind. There must be some redundancy and at least some sanity check so you can test your case in <1min
for possible result sets, try to figure out some 'operations' that modify them slightly instead of just randomizing. For example, in TSP a basic operator is lambda, which swaps two nodes and thus creates a new route. Yours could be shifting one of the numbers up or down by some amount.
then, find yourself some nice algorithm; your starting point can be somewhere here. The book is an invaluable resource for anyone starting out with problem-solving.

algorithm to calculate ETA of a file downloading session [duplicate]

We've all poked fun at the 'X minutes remaining' dialog which seems to be too simplistic, but how can we improve it?
Effectively, the input is the set of download speeds up to the current time, and we need to use this to estimate the completion time, perhaps with an indication of certainty, like '20-25 mins remaining' using some Y% confidence interval.
Code that did this could be put in a little library and used in projects all over, so is it really that difficult? How would you do it? What weighting would you give to previous download speeds?
Or is there some open source code already out there?
Edit: Summarising:
Improve estimated completion time via better algo/filter etc.
Provide interval instead of single time ('1h45-2h30 mins'), or just limit the precision ('about 2 hours').
Indicate when progress has stalled - although if progress consistently stalls and then continues, we should be able to deal with that. Perhaps 'about 2 hours, currently stalled'
More generally, I think you are looking for a way to give an instant measure of the transfer speed, which is generally obtained by an average over a small period.
The problem is generally that in order to be reactive, the period is usually extremely small, which leads to the yoyo effect.
I would propose a very simple scheme, let's model it.
Think of a curve speed (y) over time (x).
The instant speed is no more than reading y for the current x (x0).
The average speed is no more than Integral(f(x), x in [x0-T, x0]) / T.
The scheme I propose is to apply a filter that gives more weight to the most recent moments while still taking the past into account.
It can easily be implemented as g(x, x0, T) = 2 * (x - x0 + T) / T, which is a simple triangle whose integral over the window is T.
Now you can compute Integral(f(x) * g(x, x0, T), x in [x0-T, x0]) / T, which works because both functions are always positive.
Of course you could use a different g, as long as it's always positive on the given interval and its integral over the interval is T (so that its own average is exactly 1).
The advantage of this method is that because you give more weight to immediate events, you can remain pretty reactive even if you consider larger time intervals (so that the average is more precise, and less susceptible to hiccups).
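A discrete Python sketch of that triangular weighting, applied to the speed samples in the current window (oldest first):
def weighted_speed(samples):
    # Linearly growing weights: the newest sample counts the most, like g above.
    if not samples:
        return 0.0
    weights = range(1, len(samples) + 1)
    return sum(w * s for w, s in zip(weights, samples)) / sum(weights)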
Also, what I have rarely seen but think would provide more precise estimates would be to correlate the time used for computing the average to the estimated remaining time:
if I download a 5 KB file, it's going to be loaded in an instant; no need to estimate
if I download a 15 MB file, it's going to take roughly 2 minutes, so I would like estimates, say... every 5 seconds?
if I download a 1.5 GB file, it's going to take... well, around 200 minutes (at the same speed)... which is to say 3h20m... perhaps an estimate every minute would be sufficient?
So, the longer the download is going to take, the less reactive I need to be, and the more I can average out. In general, I would say that a window could cover 2% of the total time (perhaps except for the few first estimates, because people appreciate immediate feedback). Also, indicating progress by whole % at a time is sufficient. If the task is long, I was prepared to wait anyway.
I wonder, would a state estimation technique produce good results here? Something like a Kalman Filter?
Basically you predict the future by looking at your current model, and change the model at each time step to reflect the changes to the real world. I think this kind of technique is used for estimating the time left on your laptop battery, which can also vary according to use, age of the battery, etc.
see http://en.wikipedia.org/wiki/Kalman_filter for a more in depth description of the algorithm.
The filter also gives a variance measure, which could be used to indicate your confidence in the estimate (although, as was mentioned in other answers, it might not be the best idea to show this to the end user).
Does anyone know if this is actually used somewhere for download (or file copy) estimation?
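A minimal, hedged sketch of such a filter with the state reduced to a single "current speed" value (the noise constants are placeholders that would need tuning):
class SpeedKalman:
    # Scalar Kalman filter treating the true download speed as a random walk.
    def __init__(self, initial_speed, q=1.0, r=4.0):
        self.x = initial_speed   # estimated speed
        self.p = 1.0             # variance of the estimate
        self.q = q               # process noise: how quickly the true speed drifts
        self.r = r               # measurement noise: how jumpy the samples are
    def update(self, measured_speed):
        self.p += self.q                          # predict
        k = self.p / (self.p + self.r)            # Kalman gain
        self.x += k * (measured_speed - self.x)   # correct
        self.p *= (1.0 - k)
        return self.x, self.p                     # estimate and its variance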
Don't confuse your users by providing more information than they need. I'm thinking of the confidence interval. Skip it.
Internet download times are highly variable. The microwave interferes with WiFi. Usage varies by time of day, day of week, holidays, and releases of new exciting games. The server may be heavily loaded right now. If you carry your laptop to cafe, the results will be different than at home. So, you probably can't rely on historical data to predict the future of download speeds.
If you can't accurately estimate the time remaining, then don't lie to your user by offering such an estimate.
If you know how much data must be downloaded, you can provide % completed progress.
If you don't know at all, provide a "heartbeat" - a piece of moving UI that shows the user that things are working, even though you don't know how long remains.
Improving the estimated time itself: Intuitively, I would guess that the speed of the net connection is a series of random values around some temporary mean speed - things tick along at one speed, then suddenly slow or speed up.
One option, then, could be to weight the previous set of speeds by some exponential, so that the most recent values get the strongest weighting. That way, as the previous mean speed moves further into the past, its effect on the current mean reduces.
However, if the speed randomly fluctuates, it might be worth flattening the top of the exponential (e.g. by using a Gaussian filter), to avoid too much fluctuation.
So in sum, I'm thinking of measuring the standard deviation (perhaps limited to the last N minutes) and using that to generate a Gaussian filter which is applied to the inputs, and then limiting the quoted precision using the standard deviation.
How, though, would you limit the standard deviation calculation to the last N minutes? How do you know how long to use?
Alternatively, there are pattern recognition possibilities to detect if we've hit a stable speed.
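One possible sketch of the "limit the quoted precision by the spread" idea above (the thresholds and rounding granularities are invented for illustration, and a non-zero mean speed is assumed):
import statistics

def quoted_eta_seconds(recent_speeds, bytes_left):
    mean = statistics.mean(recent_speeds)
    spread = statistics.stdev(recent_speeds) if len(recent_speeds) > 1 else 0.0
    eta = bytes_left / mean
    # The noisier the recent speeds, the coarser the rounding of the displayed ETA.
    granularity = 10 if spread < 0.1 * mean else 60 if spread < 0.5 * mean else 300
    return round(eta / granularity) * granularity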
I've considered this off and on, myself. I think the answer starts with being conservative when computing the current (and thus, future) transfer rate, and includes averaging over longer periods to get more stable estimates. Perhaps low-pass filtering the time that is displayed, so that one doesn't get jumps between 2 minutes and 2 days.
I don't think a confidence interval is going to be helpful. Most people wouldn't be able to interpret it, and it would just be displaying more stuff that is a guess.

