Aging a dataset - algorithm

For reasons I'd rather not go into, I need to filter a set of values to reduce jitter. To that end, I need to be able to average a list of numbers, with the most recent having the greatest effect, and the least recent having the smallest effect. I'm using a sample size of 10, but that could easily change at some point.
Are there any reasonably simple aging algorithms that I can apply here?

Have a look at exponential smoothing. It's fairly simple, and might be sufficient for your needs. Basically, recent observations are given relatively more weight than older ones.
Also (depending on the application) you may want to look at various reinforcement learning techniques, for example Q-learning or TD-learning, or generally speaking any method involving a discount factor.
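A minimal sketch of what that looks like in Python (the smoothing factor alpha is just an illustrative choice; the closer it is to 1, the more the latest sample dominates):

# Exponential smoothing: each new sample pulls the estimate toward itself,
# so recent observations carry geometrically more weight than older ones.
def exponential_smoothing(samples, alpha=0.3):
    smoothed = samples[0]
    out = [smoothed]
    for x in samples[1:]:
        smoothed = alpha * x + (1 - alpha) * smoothed
        out.append(smoothed)
    return out

print(exponential_smoothing([10, 12, 11, 30, 12, 11]))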

I ran into something similar in an embedded control application.
The simplest option that I came across was a 3/4 filter. This gets applied continuously over the entire data set:
current_value = (3*current_value + new_value)/4
I eventually decided to go with a 16-tap FIR filter instead; for background, see an FIR overview, an FIR FAQ, and the Wikipedia article on FIR filters.

Many weighted averaging algorithms could be used.
For example, for items I(n) for n = 1 to N in sequence (newest to oldest):
SUM(I(n) * (N + 1 - n)) / SUM(n)
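A small Python sketch of that formula, assuming the items are ordered newest first (the sample values are made up):

# Weighted average with linearly decreasing weights: items[0] (newest) gets
# weight N, the oldest item gets weight 1; the divisor is SUM(1..N).
def linear_weighted_average(items):
    n = len(items)
    weights = range(n, 0, -1)              # N, N-1, ..., 1
    weighted_sum = sum(w * x for w, x in zip(weights, items))
    return weighted_sum / (n * (n + 1) / 2)

print(linear_weighted_average([11, 12, 10, 13, 12, 11, 10, 12, 13, 11]))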

It's not exactly clear from the question whether you're dealing with fixed-length
data or if data is continuously coming in. A nice physical model for the latter
would be a low pass filter, using a capacitor and a resistor (R and C). Assuming
your data is equidistantly spaced in time (is it?), this leads to an update prescription
U_aged[n+1] = U_aged[n] + deltat/Tau * (U_raw[n+1] - U_aged[n])
where Tau is the time constant of the filter. In the limit of zero deltat, this
gives an exponential decay (old values will be reduced to 1/e of their value after
time Tau). In an implementation, you only need to keep a running weighted sum U_aged.
deltat would be 1 and Tau would specify the 'aging constant', the number of steps
it takes to reduce a sample's contribution to 1/e.
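As a rough sketch of that update rule (tau and the sample values here are arbitrary):

# Discrete RC low-pass filter: each step moves the aged value a fraction
# deltat/tau toward the newest raw sample; tau is the aging constant, i.e.
# roughly the number of steps for a sample's influence to drop to 1/e.
def rc_lowpass(raw_samples, tau=10.0, deltat=1.0):
    aged = raw_samples[0]
    out = [aged]
    for u in raw_samples[1:]:
        aged += (deltat / tau) * (u - aged)
        out.append(aged)
    return out

print(rc_lowpass([10, 10, 10, 20, 20, 20]))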

Related

How to efficiently compute the first, second, and third derivatives of live updating data?

I have a running/decaying sum that updates over time with live data. I would like to efficiently compute the first, second, and third derivatives.
The simplest way I can think of doing this is to calculate deltas over some time difference in the running/decaying sum. e.g.
t_0 sum_0
t_1 sum_1
first_derivative = (sum_1 - sum_0) / (t_1 - t_0)
I can continue this process further with the second and third derivatives, which I think should work, but I'm not sure if this is the best way.
This running/decaying sum is not a defined function and relies on live updating data, so I can't just do a normal derivative.
I don't know what your real use case is, but it sounds like you're going about this the wrong way. For most cases I can imagine, what you really want to do is:
First determine the continuous signal that your time series represents; and then
You can exactly calculate the derivatives of this signal at any point.
Since you have already decided that your time series represents exponential decay with discontinuous jumps, you have decided that all your derivatives are simply proportional to the current value and provide no extra information.
This probably isn't what you really want.
You would probably be better off applying a more sophisticated low-pass filter to your samples. In situations like yours, where you receive intermittent updates, it can be convenient to design the impulse response as a weighted sum of exponential decays with different (and possibly complex) time scales.
If you use 4 or 5 exponentials, then you can ensure that the value and first 3 derivatives of the impulse response are all smooth, so none of the derivatives you have to report are discontinuous.
The impulse response of any all-pole IIR filter can be written as a sum of exponentials in this way, through "partial fraction decomposition", but I guess there is a lot of learning between you and there right now. Those terms are all Googlable.
An example impulse response that is smoother than a plain exponential decay, because its value and first 3 derivatives are 0 at t = 0, is this one:
5( e^-t - 4e^-2t + 6e^-3t - 4e^-4t + e^-5t )
You can scale the decay times however you like. Plotted (e.g. in Wolfram Alpha), it rises smoothly from zero and then decays back toward zero.
To be clear, you are looking to smooth out data AND to estimate rate of change. But rate of change inherently amplifies noise. Any solution is going to have to make some tradeoffs.
Here is a simple hack based on your existing technique.
First, let's look at a general version of a basic decaying sum. Let's keep the following variables:
average_value
average_time
average_weight
And you have a decay rate decay.
To update with a new observation (value, time) you simply:
average_weight *= (1 - decay)**(time - average_time)    # decay the accumulated weight by the elapsed time
average_value = (average_value * average_weight + value) / (1 + average_weight)    # fold the new value in with weight 1
average_time = (average_time * average_weight + time) / (1 + average_weight)       # track the effective time the average represents
average_weight += 1
This moving average therefore represents where your value was at average_time, i.e. some time in the past. The slower the decay, the farther back it reaches and the more smoothed out it is. Given that we want a rate of change, that "when" is going to matter.
Now let's look at a first derivative. You have correctly written down a formula for estimating a first derivative. But at what time is that estimated derivative valid? The answer turns out to be at time (t_0 + t_1) / 2; at any other time you pick, it will be systematically off by an amount that depends on the third derivative.
So you can play around with it, but you can estimate a derivative based on any source of values and timestamps. You can do it from your first derivative, or do it from a weighted average. You can even combine them. You can also keep a running weighted average of the first derivative! But whatever you do, you need to keep track of WHEN each derivative is FOR. (This is why I went through how far back a weighted average reaches: you need to think clearly about timestamping every piece of data you have, averaged or not.)
And now we have your second derivative. You have all the same choices for the second derivative that you do for the first, except that your measurements don't give you a first derivative directly; you have to estimate it first.
The third derivative follows the same pattern of choices.
However you do it, keep in mind the following.
Each derivative will be delayed.
The more up to date you keep them, the more noise will be a problem.
Make sure to think clearly about both what the measurement is, and when it is as of.
It may require experimentation to find what works best for your application.
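To make the bookkeeping concrete, here is a rough Python sketch of the decaying average above, extended to hand back an effective timestamp so a first derivative can be stamped at the midpoint of two snapshots (class and function names are just illustrative):

class DecayingAverage:
    # Decaying weighted average that also tracks the effective time it
    # represents, so quantities derived from it can be timestamped.
    def __init__(self, decay=0.1):
        self.decay = decay
        self.value = 0.0
        self.time = 0.0
        self.weight = 0.0

    def update(self, value, time):
        self.weight *= (1 - self.decay) ** (time - self.time)
        self.value = (self.value * self.weight + value) / (1 + self.weight)
        self.time = (self.time * self.weight + time) / (1 + self.weight)
        self.weight += 1
        return self.value, self.time      # a (value, effective_time) snapshot

def first_derivative(prev_snapshot, curr_snapshot):
    # The difference of two snapshots estimates the derivative at the
    # midpoint of their effective times.
    (v0, t0), (v1, t1) = prev_snapshot, curr_snapshot
    return (v1 - v0) / (t1 - t0), (t0 + t1) / 2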

Calculate enemy's hitpoints (extrapolate), based on change rate?

In our game, we have a boss (NPC) whose life is checked on a time interval, say every 1 minute.
I need to find a way to extrapolate the known points (life, time) and approximately predict the life after one minute (after 1 minute the life will be checked again, and the new point will be added to the data set).
Also, the extrapolation needs to weight recent change most heavily (for instance, if we have 10 points and the last two have changed rapidly, it should be able to predict an even more rapid change at the next point).
I found multiple examples, like this one and this one, but it seems I'm not able to translate all of this into AS3 code. Basically what I was looking for was 2D extrapolation.
P.S. The point is that any calculated value should not go above any previous value, because the boss' hit points cannot increase, and also cannot stay the same; they can only decrease. I guess that means plain extrapolation won't do, so I'm looking for another algorithm that will.
Consider a calculus-inspired approach. If we have a list d[i] of the damage at a past time i, and the current time is n, then we can estimate d[n+1] using the previous values in the list. d[n] - d[n-1] provides an estimate of the change from d[n] to d[n+1] based on recent values, (d[n] - d[n-1]) - (d[n-1] - d[n-2]) provides an estimate of the change of that change, and so forth. The idea is to use differencing to estimate change. If you have a time-series list d[i] = [a, b, c, ...], and another list d2[i] = d[i] - d[i-1], then d2[] is the change in d[] for all times i > 1. Since d2[] is also a time series, you can use it to create a d3[], chaining the terms to provide an estimate:
d[n+1] ~  d[n]         +  (d[n] - d[n-1])  +  ((d[n] - d[n-1]) - (d[n-1] - d[n-2]))  +  ...
          ^ last value    ^ est. change       ^ est. change of change
            d[n]            d2[n]               d3[n]
Granted, this makes a lot of assumptions about the incoming data. The two most important problems I can think of:
It assumes that the most recent change(s) are completely representative of future values; where the change is non-linear, this causes the estimate to lag behind the actual data, and the lag becomes stronger as more terms are added, so a better estimate (more terms) must be balanced against better agility (fewer terms).
Outliers in the incoming data figure directly into the equation, and thus directly affect the resulting estimate.
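A small sketch of the differencing idea in Python (not AS3), with a clamp so the prediction never rises above the last observed value, per the P.S. in the question:

# Predict the next sample from the first and second differences of the history.
def predict_next(history):
    d = history
    est = d[-1]
    if len(d) >= 2:
        est += d[-1] - d[-2]                      # estimated change
    if len(d) >= 3:
        est += (d[-1] - d[-2]) - (d[-2] - d[-3])  # estimated change of the change
    return min(est, d[-1])                        # hit points never go back up

print(predict_next([1000, 950, 880, 790]))        # predicts 680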

What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + currAvg * (totalSize - sizeDone)
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of prediction is important, the way to go about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
An excellent resource for studying statistical learning methods is The Elements of
Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing
noise (random variations) and other inaccuracies, and produce values
that tend to be closer to the true values of the measurements and
their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
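To give a flavour of what that looks like, here is a minimal one-dimensional Kalman filter sketch that tracks copy speed from noisy per-interval measurements; the state is modelled as a random walk, and q and r are made-up tuning constants (process and measurement noise variances), not anything prescribed:

def kalman_speed(measurements, q=1.0, r=25.0):
    est = measurements[0]            # initial speed estimate
    p = r                            # initial estimate variance
    filtered = []
    for z in measurements[1:]:
        p += q                       # predict: the true speed may have drifted
        k = p / (p + r)              # Kalman gain
        est += k * (z - est)         # update: blend prediction and measurement
        p *= (1 - k)
        filtered.append(est)
    return filtered

# Noisy speed samples in MB/s; the brief spike to 30 gets heavily discounted.
print(kalman_speed([10.2, 9.8, 11.0, 30.0, 10.5, 10.1]))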
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and that makes it hard to distinguish between meaningful ETCs and meaningless ones, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Other than the statistical approaches, one simple way to get a good estimate of the current speed, while smoothing out noise and spikes, is to take a weighted approach.
You have already experimented with a sliding window; the idea here is to take a fairly large sliding window, but instead of a plain average, give more weight to the more recent measures, since they are more indicative of the evolution (a bit like a derivative).
Example: suppose you have 10 previous windows (most recent x0, least recent x9); then you could compute the speed as:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + 1 * x9) / (55 * window-time)
(55 is simply the sum of the weights 10 + 9 + ... + 1.)
Once you have a good assessment of the likely speed, you are close to a good estimated time.
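A sketch of that weighting, assuming each x[i] is the amount transferred during one window of length window_time, most recent first (all numbers here are made up):

def weighted_speed(x, window_time):
    # Linearly decreasing weights: the newest window counts n times as much
    # as the oldest; the weights sum to n*(n+1)/2 (55 for 10 windows).
    n = len(x)
    weights = range(n, 0, -1)
    amount_per_window = sum(w * xi for w, xi in zip(weights, x)) / (n * (n + 1) / 2)
    return amount_per_window / window_time

def estimated_seconds_remaining(total_size, size_done, speed):
    return (total_size - size_done) / speed

speed = weighted_speed([12, 11, 13, 10, 9, 12, 11, 10, 12, 11], window_time=5)
print(estimated_seconds_remaining(10000, 4000, speed))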
2. On presentation
The main thing to remember here is that you want a nice user experience, not scientific accuracy.
Studies have shown that users react very badly to slow-downs and very positively to speed-ups. Therefore, a good progress bar / estimated time should be conservative in the estimates it presents (reserving time for a potential slow-down), at least at first.
A simple way to get that is to have a factor, a function of the real completion, that you use to tweak the presented completion (and thus the estimated remaining time). For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, a cubic factor(x) = x^3 produces a nice speed-up toward the completion time. Other functions could take an exponential form, e.g. factor(x) = (e^x - 1) / (e - 1), etc...
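A tiny sketch of that presentation trick, taking the cubic factor literally (purely illustrative; how aggressively to under-report early progress is a matter of taste):

def factor(x):
    # Cubic: factor(x) <= x on [0..1] and factor(1) = 1.
    return x ** 3

def presented_completion(real_completion):
    # Under-report early progress and converge to the truth at the end, so
    # the user sees a speed-up near completion rather than a stall.
    return real_completion * factor(real_completion)

print(presented_completion(0.4))   # shows roughly 0.026 instead of 0.4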

Algorithm(s) for spotting anomalies ("spikes") in traffic data

I find myself needing to process network traffic captured with tcpdump. Reading the traffic is not hard, but what gets a bit tricky is spotting where there are "spikes" in the traffic. I'm mostly concerned with TCP SYN packets and what I want to do is find days where there's a sudden rise in the traffic for a given destination port. There's quite a bit of data to process (roughly one year).
What I've tried so far is to use an exponential moving average, this was good enough to let me get some interesting measures out, but comparing what I've seen with external data sources seems to be a bit too aggressive in flagging things as abnormal.
I've considered using a combination of the exponential moving average plus historical data (possibly from 7 days in the past, thinking that there ought to be a weekly cycle to what I am seeing), as some papers I've read seem to have managed to model resource usage that way with good success.
So, does anyone know of a good method, or somewhere to go and read up on this sort of thing?
The moving average I've been using looks roughly like:
avg = avg+0.96*(new-avg)
With avg being the EMA and new being the new measure. I have been experimenting with what thresholds to use, but found that a combination of "must be a given factor higher than the average prior to weighing the new value in" and "must be at least 3 higher" gives the least bad results.
This is widely studied in the intrusion detection literature. There is a seminal paper on the issue which shows, among other things, how to analyze tcpdump data to gain relevant insights.
This is the paper: http://www.usenix.org/publications/library/proceedings/sec98/full_papers/full_papers/lee/lee_html/lee.html. Here they use the RIPPER rule induction system; I guess you could replace that old one with something newer such as http://www.newty.de/pnc2/ or http://www.data-miner.com/rik.html
I would apply two low-pass filters to the data, one with a long time constant, T1, and one with a short time constant, T2. You would then look at the magnitude difference in output from these two filters and when it exceeds a certain threshold, K, then that would be a spike. The hardest part is tuning T1, T2 and K so that you don't get too many false positives and you don't miss any small spikes.
The following is a single pole IIR low-pass filter:
new = k * old + (1 - k) * new
The value of k determines the time constant and is usually close to 1.0 (but < 1.0 of course).
I am suggesting that you apply two such filters in parallel, with different time constants, e.g. start with say k = 0.9 for one (short time constant) and k = 0.99 for the other (long time constant) and then look at the magnitude difference in their outputs. The magnitude difference will be small most of the time, but will become large when there is a spike.
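A rough sketch of that arrangement (k_short, k_long and the threshold are just starting points that would need tuning against your data):

# Run two single-pole IIR low-pass filters in parallel and flag a spike
# whenever their outputs diverge by more than a threshold.
def find_spikes(samples, k_short=0.9, k_long=0.99, threshold=50.0):
    fast = slow = samples[0]
    spikes = []
    for i, x in enumerate(samples[1:], start=1):
        fast = k_short * fast + (1 - k_short) * x     # short time constant
        slow = k_long * slow + (1 - k_long) * x       # long time constant
        if abs(fast - slow) > threshold:
            spikes.append(i)
    return spikes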

Algorithm to score similarness of sets of numbers

What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and trying to put the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable across different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing temperature cannot be compared to numbers generated from other data such as wind speed or precipitation, because the scale of the data is different. Some of the non-weather data being compared is very large, so the mean square error algorithm generates numbers in the hundreds of thousands, compared to the tens or hundreds generated for temperature.
I think the mean square error metric might work for applications such as weather compares. It's easy to calculate and gives numbers that do make sense.
Since you want to compare measurements over time, you can just leave missing values out of the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
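A minimal sketch of that metric, with None standing in for missing values (names are illustrative):

# Mean square error between the target set and a candidate set; missing
# values are simply left out of the calculation.
def mean_square_error(target, candidate):
    pairs = [(t, c) for t, c in zip(target, candidate)
             if t is not None and c is not None]
    return sum((t - c) ** 2 for t, c in pairs) / len(pairs)

print(mean_square_error([20, 21, None, 23], [19, 22, 25, 23]))   # -> 0.666...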
Use the pearson correlation coefficient. I figured out how to calculate it in an SQL query which can be found here: http://vanheusden.com/misc/pearson.php
In finance they use Beta to measure the correlation of 2 series of numbers. E.g., Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
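A quick sketch of that calculation for two equally long series (using population covariance and variance):

# Beta of series a against series b: Covariance(a, b) / Variance(b).
def beta(a, b):
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    var = sum((y - mean_b) ** 2 for y in b) / len(b)
    return cov / var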
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
    distance = sqrt(
        pow(currentpoint.temp   - historicalpoint.temp,   2) +
        pow(currentpoint.wind   - historicalpoint.wind,   2) +
        pow(currentpoint.precip - historicalpoint.precip, 2))
    if distance is smaller than the largest distance in our match collection
        add historicalpoint to our match collection
        remove the match with the largest distance from our match collection
next
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
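For completeness, a runnable Python version of the brute-force search above (the sample points and k are made up; each point is (temp, wind, precip) after scaling to a common range):

import math

def closest_matches(current, historical, k=5):
    # Keep the k historical points with the smallest Euclidean distance
    # to the current, pre-normalized point.
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return sorted(historical, key=lambda h: distance(current, h))[:k]

print(closest_matches((70, 12, 0), [(68, 10, 0), (40, 30, 5), (71, 15, 1)], k=2))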
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all-- it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situations where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
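A tiny sketch of that idea (the tolerance value is arbitrary):

# Count the slots where two ordered collections agree to within a tolerance.
def matching_slots(a, b, tolerance=1.0):
    return sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)

print(matching_slots([20, 21, 23, 25], [20.5, 22.4, 23, 30]))   # -> 2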
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
    double c = correlation(historical_set, forecast_set);
    double avg_history = average(historical_set);
    double avg_forecast = average(forecast_set);
    double penalty = abs(avg_history - avg_forecast) / avg_forecast;
    return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference the averages of the two sets of numbers.
A couple of times you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degrees F with 2000 km/h winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, let's say it's a windy, hot day: that might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3 km/h away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
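A rough sketch of the quantile idea (helper names are hypothetical; histories holds one list of historical values per feature):

from bisect import bisect_left

def quantile(value, history):
    # Empirical quantile of a value within its own historical record; this
    # puts temperature, wind, precipitation, etc. on a common 0..1 scale.
    ranked = sorted(history)
    return bisect_left(ranked, value) / len(ranked)

def quantile_distance(today, historical_day, histories):
    # Sum of absolute quantile differences across all features.
    return sum(abs(quantile(t, h) - quantile(d, h))
               for t, d, h in zip(today, historical_day, histories))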
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second, and then test that the slope is equal to 1 and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions, of the y-values against their indices: http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
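If Python with SciPy happens to be an option, most of these tests are one-liners; a sketch with two made-up unordered samples a and b:

from scipy import stats

a = [20, 21, 23, 25, 22, 21]
b = [19, 24, 22, 26, 23, 20]

print(stats.ks_2samp(a, b))     # Kolmogorov-Smirnov: compare distributions
print(stats.ttest_ind(a, b))    # Student's t-test: compare means
print(stats.wilcoxon(a, b))     # Wilcoxon signed-rank (paired samples)
print(stats.levene(a, b))       # Levene: compare variances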
Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use the dot product to compute the similarity of 2 given vectors (i.e. sets of numbers).
You might need to normalize your vectors.
More : Cosine similarity
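A minimal sketch of the dot-product / cosine approach (the sample vectors are made up):

import math

def cosine_similarity(a, b):
    # 1.0 means the two "weather vectors" point in exactly the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([20, 21, 23], [19, 22, 23]))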
