How to efficiently compute the first, second, and third derivatives of live-updating data? - algorithm

I have a running/decaying sum that updates over time with live data. I would like to efficiently compute the first, second, and third derivatives.
The simplest way I can think of doing this is to calculate deltas over some time difference in the running/decaying sum. e.g.
t_0 sum_0
t_1 sum_1
first_derivative = (sum_1 - sum_0) / (t_1 - t_0)
I can continue this process further with the second and third derivatives, which I think should work, but I'm not sure if this is the best way.
This running/decaying sum is not a defined function and relies on live updating data, so I can't just do a normal derivative.

I don't know what your real use case is, but it sounds like you're going about this the wrong way. For most cases I can imagine, what you really want to do is:
First, determine the continuous signal that your time series represents; and then
exactly calculate the derivatives of that signal at any point.
Since you have already decided that your time series represents exponential decay with discontinuous jumps, you have decided that all your derivatives are simply proportional to the current value and provide no extra information.
This probably isn't what you really want.
You would probably be better off applying a more sophisticated low-pass filter to your samples. In situations like yours, where you receive intermittent updates, it can be convenient to design the impulse response as a weighted sum of exponential decays with different (and possibly complex) time scales.
If you use 4 or 5 exponentials, then you can ensure that the value and first 3 derivatives of the impulse response are all smooth, so none of the derivatives you have to report are discontinuous.
The impulse response of any all-pole IIR filter can be written as a sum of exponentials in this way, via "partial fraction decomposition", but I guess there is a lot of learning between you and there right now. Those terms are all Googlable.
An example impulse response that is smoother than a plain exponential decay (its value and first three derivatives are all zero at t = 0) is:
5 ( e^-t - 4 e^-2t + 6 e^-3t - 4 e^-4t + e^-5t )
You can scale the decay times however you like. Plotted (for example in Wolfram Alpha), it rises smoothly from zero and then decays back toward zero.
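As an illustration of the idea (this is my own sketch, not code from the answer: the class name, the tau parameter, and the impulse-style update are all assumptions), you can maintain one running exponential sum per decay rate and read off the smoothed value and its derivatives from the weighted combination:

import math

class SmoothDecaySum:
    # impulse response: 5 * (e^-t - 4 e^-2t + 6 e^-3t - 4 e^-4t + e^-5t), time scaled by tau
    COEFFS = [1.0, -4.0, 6.0, -4.0, 1.0]   # weights for decay rates 1..5

    def __init__(self, tau=1.0):
        self.tau = tau              # overall time scale of the decays
        self.state = [0.0] * 5      # one running exponential sum per decay rate
        self.last_t = None

    def add(self, t, value):
        # decay every exponential sum to the new timestamp, then fold in the new sample
        if self.last_t is not None:
            dt = t - self.last_t
            for k in range(5):
                self.state[k] *= math.exp(-(k + 1) * dt / self.tau)
        self.last_t = t
        for k in range(5):
            self.state[k] += value

    def derivative(self, order=0):
        # the m-th derivative of c_k * e^{-k t / tau} picks up a factor (-k / tau)^m
        return 5.0 * sum(
            c * ((-(k + 1) / self.tau) ** order) * s
            for k, (c, s) in enumerate(zip(self.COEFFS, self.state))
        )

Because the impulse response and its first three derivatives start at zero, a newly arriving sample never makes derivative(0) through derivative(3) jump; they ramp up smoothly.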

To be clear, you are looking to smooth out data AND to estimate rate of change. But rate of change inherently amplifies noise. Any solution is going to have to make some tradeoffs.
Here is a simple hack based on your existing technique.
First, let's look at a general version of a basic decaying sum. Let's keep the following variables:
average_value
average_time
average_weight
And you have a decay rate decay.
To update with a new observation (value, time) you simply:
# decay the accumulated weight according to how much time has passed
average_weight *= (1 - decay)**(time - average_time)
# fold the new observation into the weighted averages
average_value = (average_value * average_weight + value) / (1 + average_weight)
average_time = (average_time * average_weight + time) / (1 + average_weight)
# the new observation itself carries weight 1
average_weight += 1
This moving average therefore represents roughly what your value was at average_time, i.e. some time in the past. The slower the decay, the farther back that is and the more smoothed out the average is. Given that we want rate of change, the when is going to matter.
Now let's look at a first derivative. You have correctly written down a formula for estimating one. But at what time is that an estimate of the derivative? The answer turns out to be the midpoint (t_0 + t_1) / 2. Treat it as the derivative at any other time and you introduce a systematic error (proportional, to leading order, to the second derivative); even at the midpoint the remaining error involves the third derivative.
So you can play around with it: you can estimate a derivative from any source of values and timestamps. You can do it from the raw values, from a weighted average of them, or combine both. You can even keep a running weighted average of the first-derivative estimates themselves! But whatever you do, you need to keep track of WHEN each derivative is FOR. (This is why I went through how far back a weighted average lags: you need to think clearly about timestamping every piece of data you have, averaged or not.)
And now for your second derivative. You have all the same choices for the second derivative that you had for the first, except that your raw measurements don't give you a first derivative directly; you have to estimate one and then difference those timestamped estimates (see the sketch after the list below).
The third derivative follows the same pattern of choices.
However you do it, keep in mind the following.
Each derivative will be delayed.
The more up to date you keep them, the more noise will be a problem.
Make sure to think clearly about both what the measurement is, and when it is as of.
It may require experimentation to find what works best for your application.
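Here is a rough sketch of that cascading bookkeeping (entirely my own illustration: the class name and the way results are returned are assumptions). Each stage differences the timestamped stream it receives and stamps the result at the midpoint, so every estimate knows when it is for:

class DerivativeChain:
    def __init__(self, depth=3):
        self.depth = depth
        self.last = [None] * depth   # last (time, value) pair seen by each stage

    def update(self, t, value):
        # returns a list of (time, value) pairs: the raw sample, then the 1st, 2nd, 3rd
        # derivative estimates, each stamped at the midpoint of the span it came from;
        # higher derivatives only appear once enough samples have arrived
        out = [(t, value)]
        current = (t, value)
        for stage in range(self.depth):
            prev = self.last[stage]
            self.last[stage] = current
            if prev is None or current[0] == prev[0]:
                break
            mid_t = (current[0] + prev[0]) / 2.0
            slope = (current[1] - prev[1]) / (current[0] - prev[0])
            current = (mid_t, slope)
            out.append(current)
        return out

Each stage could just as well be fed from (or followed by) a decaying weighted average like the one above; the important part is that the timestamp travels with the value.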

Related

How to measure "homogeneity" of time series?

I have two time series (see the picture). I need to measure the level of "homogeneity" of the series: the first one looks very fragmented, so it should get a low value close to zero, and the second one should get a high value.
Any ideas of an algorithm I could use?
I'm not sure what is meant by homogeneity, but there is a well-established notion of stationarity of a time series. Basically, a time series is stationary if its rolling mean and standard deviation are constant across time. Both of your time series seem to have roughly constant mean, but the top one has a standard deviation that changes wildly across time; sometimes it's almost zero, and at other times it's very large.
Perhaps you could take the standard deviation of the rolling standard deviation, which will be far higher for the top series than for the bottom. If you can load them into pandas as top and bottom, it might look like
import numpy as np
# top and bottom are pandas Series; window_size is the rolling-window length in samples
top_nonstationarity = np.std(top.rolling(window_size).std())
bottom_nonstationarity = np.std(bottom.rolling(window_size).std())
It might help to know more about the underlying difference between the series, or what you care about, but here goes...
I would subtract constants, if required, to give both series zero mean, and then square them to get something resembling power. Filter this enough to smooth away what appears to be noise in the lower series. Then compute and compare the variances of the two filtered power signals: for the lower series I would expect a fairly constant line with a few dips, and for the upper series something that spends about half of its time near zero and half of its time away from it.
Possible filters include a simple moving average, whatever your time series toolkit provides, and those described at https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter
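A rough sketch of that recipe in pandas (the function name and the window length are my own choices, not the answerer's):

def power_variance(series, window=50):
    # series: a pandas Series; center it, square to get a power-like signal,
    # smooth that, then measure how much the smoothed power varies over time
    power = (series - series.mean()) ** 2
    smoothed = power.rolling(window, min_periods=1).mean()   # simple moving-average filter
    return float(smoothed.var())

A fragmented series should give a noticeably larger power_variance than a homogeneous one, so comparing the two numbers (or their ratio) gives the kind of score the question asks for.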

Rapid change detection algorithm

I'm logging temperature values in a room, saving them to the database. I'd like to be alerted when temperature rises suddenly. I can't set fixed values, because 18°C is acceptable in winter and 25°C is acceptable in summer. But if it jumps from 20°C to 25°C during, let's say, 30 minutes and stays like this for 5 minutes (to eliminate false readouts), I'd like to be informed.
My current idea is to take the readouts from the last 30 minutes (A) and the readouts from the last 5 minutes (B), calculate the median of A and of B, and check whether the difference between them exceeds my desired threshold.
Is this a correct way to solve this, or is there a better algorithm? I searched for a specific one, but most of them seem overcomplicated.
Thanks!
Detecting changes in a time series is a well-researched subject, and hundreds if not thousands of papers have been written on it. As you've seen, many methods are quite advanced, but they have proved to be quite useful for many use cases. Whatever method you choose, you should evaluate it against real or simulated data, and optimize its parameters for your use case.
Since you asked for something simple, let me suggest a very simple method that in many cases proves to be good enough, and is quite similar to the one you considered.
Basically, you have two concerns:
Detecting a monotonic change in a sampled noisy signal
Ignoring false readouts
First, note that medians are not commonly used for detecting trends. For the series (1, 2, 3, 30, 35, 3, 2, 1), the medians of 5 consecutive terms are (3, 3, 3, 3). It is much more common to use averages.
One common trick is to throw away the extreme values before averaging (e.g. for each 7 values average only the middle 5). If many false readouts are expected, try to take measurements at a faster rate and throw away more extreme values (e.g. for each 13 values average the middle 9).
Also, you should throw away unfeasible values and replace them with the last measured value (unfeasible means out of range, or non-physical change rate).
Your idea of comparing a short-period measure with a long-period measure is a good idea, and indeed it is commonly used (e.g. in econometrics).
Quoting from "Financial Econometric Models - Some Contributions to the Field" (Nicolau, 2007):
Buy and sell signals are generated by two moving averages of the price level: a long-period average and a short-period average. A typical moving average trading rule prescribes a buy (sell) when the short-period moving average crosses the long-period moving average from below (above) (i.e. when the original time series is rising (falling) relatively fast).
When you say "rises suddenly," mathematically you are talking about the magnitude of the derivative of the temperature signal.
There is a nice algorithm to simultaneously smooth a signal and calculate its derivative called the Savitzky–Golay filter. It's explained with examples on Wikipedia, or you can use Matlab to help you generate the convolution coefficients required. Once you have the coefficients the calculation is very simple.
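If you work in Python, SciPy already ships this filter, so the whole thing is only a few lines. A small sketch (the readings, window length, polynomial order, and alert threshold below are assumptions you would tune for your own data):

import numpy as np
from scipy.signal import savgol_filter

temps = np.array([20.1, 20.0, 20.2, 20.4, 21.0, 21.8, 22.9, 24.1, 24.8, 25.0])  # toy readings
dt = 3.0  # minutes between samples

smoothed = savgol_filter(temps, window_length=7, polyorder=2)
rate = savgol_filter(temps, window_length=7, polyorder=2, deriv=1, delta=dt)  # degrees C per minute

if rate[-1] > 0.15:   # tune this threshold for your room and sampling rate
    print("temperature is rising quickly")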

Calculate enemy's hitpoints (extrapolate), based on change rate?

In our game, we have a boss (NPC) whose life is checked at a fixed time interval, say 1 minute.
I need to find a way to extrapolate the known points (life, time) and approximately predict the life after one more minute (after 1 minute the life will be checked again and added to the data set).
Also, the extrapolation needs to weight recent change most heavily (for instance, if we have 10 points and the last two have changed rapidly, it should predict an even more rapid change at the next point).
I found multiple examples (this one and this one), but it seems like I'm not able to translate all of this into AS3 code. Basically, what I was looking for was 2D extrapolation.
P.S. Any calculated value should not go above any previous value, because the boss's hit points cannot increase and also cannot stay the same; they can only decrease. I guess that means plain extrapolation won't do, so I'm looking for another algorithm that will.
Consider a calculus-inspired approach. If we have a list d[i] of the damage at each past time i and the current time is n, then we can estimate d[n+1] using the previous values in the list. d[n] - d[n-1] provides an estimate of the change from d[n] to d[n+1] based on recent values, (d[n] - d[n-1]) - (d[n-1] - d[n-2]) provides an estimate of the change of that change, and so forth. The idea is to use differencing to estimate change: if you have a time-series list d[i] = [a, b, c, ...] and another list d2[i] = d[i] - d[i-1], then d2[] is the change in d[] for all times i > 1. Since d2[] is also a time series, you can use it to create a d3[], chaining the terms to provide an estimate:
d[n+1] ~ d[n] + (d[n] - d[n-1]) + ((d[n] - d[n-1]) - (d[n-1] - d[n-2])) + ...
       = last value d[n] + estimated change d2[n] + estimated change of the change d3[n] + ...
Granted, this makes a lot of assumptions about the incoming data. The most important problems I can think of:
It assumes that the most recent change(s) are completely representative of future values - in cases where the change terms are non-linear, this causes the estimate to lag behind the actual data
The "lag" becomes stronger as more terms are added - a better estimate (more terms) must be balanced against better agility (fewer terms)
Outliers in incoming data figure directly into the equation, and thus directly affect the resulting estimate
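Here is a rough sketch of the chained-difference estimate combined with the question's "hit points can only decrease" constraint (the function name, the depth parameter, and the integer clamp are my own assumptions):

def predict_next(history, depth=2):
    # history: past hit-point readings, oldest first
    estimate = history[-1]
    diffs = list(history)
    for _ in range(depth):
        if len(diffs) < 2:
            break
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]   # difference the series once more
        estimate += diffs[-1]                               # add d2[n], then d3[n], ...
    # hit points can only go down, so never predict at or above the last reading
    return min(estimate, history[-1] - 1)

With history = [100, 90, 70] and depth=2 this gives 70 + (-20) + (-10) = 40, i.e. the rapid recent drop is extrapolated forward, as the question asks.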

What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), compute the current average speed currAvg from that window, and just do something like:
ETC = currTime + (totalSize - sizeDone) / currAvg
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of prediction is important, the way to go about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
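As an illustration of fitting those weights (using numpy rather than R, and with made-up toy numbers purely to show the shape of the calculation):

import numpy as np

# columns: ETC from method 1 and method 2 on the validation set; y: what actually happened
X = np.array([[100., 120.],
              [200., 180.],
              [150., 160.],
              [300., 260.]])
y = np.array([110., 190., 155., 280.])

weights, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit of the linear combination
blended_etc = X @ weights                         # the combined predictor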
An excellent resource for studying statistical learning methods is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
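A minimal sketch of that idea, tracking the copy speed with a one-dimensional Kalman filter whose state is a slowly drifting speed (the process noise q and measurement noise r below are tuning assumptions, not values from the answer):

def make_speed_tracker(q=1e-3, r=1.0):
    est, var = None, 1.0   # current speed estimate and its variance

    def update(measured_speed):
        nonlocal est, var
        if est is None:
            est = measured_speed          # initialise from the first measurement
            return est
        var += q                          # predict: the true speed drifts a little each step
        gain = var / (var + r)            # Kalman gain: how much to trust the new measurement
        est += gain * (measured_speed - est)
        var *= (1 - gain)
        return est

    return update

track = make_speed_tracker()
for sample in [10.0, 12.0, 11.0, 3.0, 10.5]:   # e.g. MB/s measured once per second
    current_speed = track(sample)
# then: ETC = currTime + (totalSize - sizeDone) / current_speed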
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and then it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Apart from the statistical approaches, one simple way to get a good estimate of the current speed, while smoothing out noise and spikes, is to take a weighted approach.
You already experimented with a sliding window; the idea here is to take a fairly large sliding window, but instead of a plain average, give more weight to the more recent measures, since they are more indicative of the evolution (a bit like a derivative).
Example: Suppose you have the amounts copied in the 10 previous windows (most recent x0, least recent x9). Then you could compute the speed as:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + 1 * x9) / (55 * window-time)
where 55 = 10 + 9 + ... + 1 is the sum of the weights.
Once you have a good assessment of the likely speed, you are close to getting a good estimated time.
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have shown that users react very badly to slow-downs and very positively to speed-ups. Therefore, a good progress bar / estimated time should at first be conservative in the estimates it presents (reserving time for a potential slow-down).
A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, a cubic factor produces a nice speed-up toward the completion time. Other choices could use an exponential form such as (e^x - 1)/(e - 1), etc...
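A tiny sketch of that presentation trick (the cubic factor here is just the example from above; any factor satisfying the constraints works):

def presented_completion(real, factor=lambda x: x ** 3):
    # factor maps [0, 1] to [0, 1], with factor(x) <= x and factor(1) = 1
    return real * factor(real)

presented_completion(0.4)   # about 0.026: early progress is deliberately under-reported
presented_completion(1.0)   # 1.0: the bar still reaches 100% exactly at completion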

Aging a dataset

For reasons I'd rather not go into, I need to filter a set of values to reduce jitter. To that end, I need to be able to average a list of numbers, with the most recent having the greatest effect, and the least recent having the smallest effect. I'm using a sample size of 10, but that could easily change at some point.
Are there any reasonably simple aging algorithms that I can apply here?
Have a look at exponential smoothing. It is fairly simple, and might be sufficient for your needs. Basically, recent observations are given relatively more weight than the older ones.
Also (depending on the application) you may want to look at various reinforcement learning techniques, for example Q-learning or TD-learning, or generally speaking any method involving a discount factor.
I ran into something similar in an embedded control application.
The simplest option that I came across was a 3/4 filter. This gets applied continuously over the entire data set:
current_value = (3*current_value + new_value)/4
I eventually decided to go with a 16-tap FIR filter instead:
Overview
FIR FAQ
Wikipedia article
Many weighted averaging algorithms could be used.
For example, for items I(n) for n = 1 to N in sequence (newest to oldest):
SUM(I(n) * (N + 1 - n)) / SUM(n)
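A small sketch of that formula (my own wording of it; items[0] is the newest sample, as above):

def weighted_average(items):
    # weights run from N (newest) down to 1 (oldest); SUM(n) = N * (N + 1) / 2 normalises them
    n = len(items)
    weights = range(n, 0, -1)
    return sum(w * x for w, x in zip(weights, items)) / sum(weights)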
It's not exactly clear from the question whether you're dealing with fixed-length data or if data is continuously coming in. A nice physical model for the latter would be a low-pass filter, using a capacitor and a resistor (R and C). Assuming your data is equidistantly spaced in time (is it?), this leads to an update prescription
U_aged[n+1] = U_aged[n] + (deltat / Tau) * (U_raw[n+1] - U_aged[n])
where Tau is the time constant of the filter. In the limit of zero deltat, this gives an exponential decay (old values will be reduced to 1/e of their value after time Tau). In an implementation, you only need to keep a running weighted sum U_aged. deltat would be 1 and Tau would specify the 'aging constant', the number of steps it takes to reduce a sample's contribution to 1/e.
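A minimal sketch of that update rule (the factory function and parameter names are my own):

def make_aging_filter(tau, deltat=1.0):
    aged = None   # the running weighted sum U_aged

    def update(raw):
        nonlocal aged
        if aged is None:
            aged = raw                             # seed with the first sample
        else:
            aged += (deltat / tau) * (raw - aged)  # move a fraction deltat/tau toward the new sample
        return aged

    return update

smooth = make_aging_filter(tau=10.0)   # older samples fade to 1/e after about 10 steps
for value in [5.0, 5.2, 9.8, 5.1, 5.0]:
    filtered = smooth(value)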
