Simple trend analysis algorithm - algorithm

OK, so you have some historic data in the form of [say] an array of integers. This, for example, could represent free-space on a server HDD over a two-year period, with each array element representing a daily sample.
The data (free-space in this example) has a downward trend, but also has periodic positive spikes where files have been removed/compressed, Etc.
How would you go about identifying the overall trend for the two-year period, i.e.: iron out the peaks and troughs in the data?
Now, I did A-level statistics and then a stats module in my degree, but I've slept over 7,000 times since then, and well, it's leaked out of my brain.
I'm not after a bit of code as such, more of a description of how you'd approach this problem...
Thanks in advance!

You'll get many different answers, and the one you choose really depends on more specific requirements you may have. Examples:
Low-pass filter, or any other spectral analysis technique, and use the low frequencies to determine trend.
Linear regression (time/value) to find "r" (the correlation between time and the value).
Moving average of last "n" samples. If "n" is large enough this is my favorite as many times this is sufficient, and is very easy to code. It's a sort of approximation to #1 above.
I'm sure they'll be others.

If I was doing this to produce a line through points for me to look at, I would probably use a some variant of Loess, described at http://en.wikipedia.org/wiki/Local_regression, http://stat.ethz.ch/R-manual and /R-patched/library/stats/html/loess.html. Basically, you find the smoothed value at any particular point by doing a weighted regression on the data points near that point, with the nearest points given the most weight.

Related

Edge detection : Any performance evaluation technique?

I am working on edge detection in images and would like to evaluate the performance of algorithm, if any any one could give me a reference or method on how to proceed it will be really helpful. :)
I do not have ground truth and data set includes color as well as gray images.
Thank you.
Create a synthetic data set with known edges, for example by 3D rendering, by compositing 2D images with precise masks (as may be obtained in royalty free photosets), or by introducing edges directly (thin/faint lines). Remember to add some confounding non-edges that look like edges, of a type appropriate for what you're tuning for.
Use your (non-synthetic) data set. Run the reference algorithms that you want to compare against. Also produce combinations of the reference algorithms, for example by voting (majority, at least K out of N, etc). Calculate stats on your algo vs reference algo performance, in terms of (a) number of points your algo classifies as edge which each reference algo, or the combination, does not classify as edge (false positive), or (b) number of points which the reference algo classifies as edge that your algo does not (false negative). You can also calculate a rank correlation-type number for algos by looking at each point and looking at which algos do (or don't) classify that as an edge.
Create ground truth manually. Use reference edge-finding algos as a starting point, then fix up by hand. Probably valuable to do for a small number of images in any case.
Good luck!
For comparisons, quantitative measures like what #Alex I explained is best. To do so, you need to define what is "correct" with a ground truth set and a way to consistently determine if a given image is correct or on a more granular level, how correct (some number like a percentage) it is. #Alex I gave a way to do that.
Another option that is often used in graphics research where there is no ground truth is user studies. Usually less desirable as they are time consuming and often more costly. However, if it is a qualitative improvement that you are after or if a quantitative measurement is just too hard to do, a user study is an appropriate solution.
When I mean user study I mean to poll people on how well a result is given the input image. You could give them a scale to rate things on and randomly give them samples from both your results and the results of another algorithm
And of course, if you still want more ideas, be sure to check out edge detection papers to see how they measured their results (I'd actually look here first as they've already gone through this same process and determined what was best for them: google scholar).

Smoothing values over time: moving average or something better?

I'm coding something at the moment where I'm taking a bunch of values over time from a hardware compass. This compass is very accurate and updates very often, with the result that if it jiggles slightly, I end up with the odd value that's wildly inconsistent with its neighbours. I want to smooth those values out.
Having done some reading around, it would appear that what I want is a high-pass filter, a low-pass filter or a moving average. Moving average I can get down with, just keep a history of the last 5 values or whatever, and use the average of those values downstream in my code where I was once just using the most recent value.
That should, I think, smooth out those jiggles nicely, but it strikes me that it's probably quite inefficient, and this is probably one of those Known Problems to Proper Programmers to which there's a really neat Clever Math solution.
I am, however, one of those awful self-taught programmers without a shred of formal education in anything even vaguely related to CompSci or Math. Reading around a bit suggests that this may be a high or low pass filter, but I can't find anything that explains in terms comprehensible to a hack like me what the effect of these algorithms would be on an array of values, let alone how the math works. The answer given here, for instance, technically does answer my question, but only in terms comprehensible to those who would probably already know how to solve the problem.
It would be a very lovely and clever person indeed who could explain the sort of problem this is, and how the solutions work, in terms understandable to an Arts graduate.
If you are trying to remove the occasional odd value, a low-pass filter is the best of the three options that you have identified. Low-pass filters allow low-speed changes such as the ones caused by rotating a compass by hand, while rejecting high-speed changes such as the ones caused by bumps on the road, for example.
A moving average will probably not be sufficient, since the effects of a single "blip" in your data will affect several subsequent values, depending on the size of your moving average window.
If the odd values are easily detected, you may even be better off with a glitch-removal algorithm that completely ignores them:
if (abs(thisValue - averageOfLast10Values) > someThreshold)
{
thisValue = averageOfLast10Values;
}
Here is a guick graph to illustrate:
The first graph is the input signal, with one unpleasant glitch. The second graph shows the effect of a 10-sample moving average. The final graph is a combination of the 10-sample average and the simple glitch detection algorithm shown above. When the glitch is detected, the 10-sample average is used instead of the actual value.
If your moving average has to be long in order to achieve the required smoothing, and you don't really need any particular shape of kernel, then you're better off if you use an exponentially decaying moving average:
a(i+1) = tiny*data(i+1) + (1.0-tiny)*a(i)
where you choose tiny to be an appropriate constant (e.g. if you choose tiny = 1- 1/N, it will have the same amount of averaging as a window of size N, but distributed differently over older points).
Anyway, since the next value of the moving average depends only on the previous one and your data, you don't have to keep a queue or anything. And you can think of this as doing something like, "Well, I've got a new point, but I don't really trust it, so I'm going to keep 80% of my old estimate of the measurement, and only trust this new data point 20%". That's pretty much the same as saying, "Well, I only trust this new point 20%, and I'll use 4 other points that I trust the same amount", except that instead of explicitly taking the 4 other points, you're assuming that the averaging you did last time was sensible so you can use your previous work.
Moving average I can get down with ...
but it strikes me that it's probably
quite inefficient.
There's really no reason a moving average should be inefficient. You keep the number of data points you want in some buffer (like a circular queue). On each new data point, you pop the oldest value and subtract it from a sum, and push the newest and add it to the sum. So every new data point really only entails a pop/push, an addition and a subtraction. Your moving average is always this shifting sum divided by the number of values in your buffer.
It gets a little trickier if you're receiving data concurrently from multiple threads, but since your data is coming from a hardware device that seems highly doubtful to me.
Oh and also: awful self-taught programmers unite! ;)
An exponentially decaying moving average can be calculated "by hand" with only the trend if you use the proper values. See http://www.fourmilab.ch/hackdiet/e4/ for an idea on how to do this quickly with a pen and paper if you are looking for “exponentially smoothed moving average with 10% smoothing”. But since you have a computer, you probably want to be doing binary shifting as opposed to decimal shifting ;)
This way, all you need is a variable for your current value and one for the average. The next average can then be calculated from that.
there's a technique called a range gate that works well with low-occurrence spurious samples. assuming the use of one of the filter techniques mentioned above (moving average, exponential), once you have "sufficient" history (one Time Constant) you can test the new, incoming data sample for reasonableness, before it is added to the computation.
some knowledge of the maximum reasonable rate-of-change of the signal is required. the raw sample is compared to the most recent smoothed value, and if the absolute value of that difference is greater than the allowed range, that sample is thrown out (or replaced with some heuristic, eg. a prediction based on slope; differential or the "trend" prediction value from double exponential smoothing)

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for you data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
You could try line fitting prediction using linear regression and see how it goes, it would be fairly easy to implement in your language of choice.
After you fitted a line to your data, you could calculate the mean standard deviation along the line.
If the novel point is on the trend line +- the standard deviation, it should not be regarded as an abnormality.
PCA is an other technique that comes to mind, when dealing with this type of data.
You could also look in to unsuperviced learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow the typical abnormalities you may encounter, it helps keeping things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing abruptly direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or fall in some range. You may want to keep this flexible, and allow your code design to be extensible (Strategy pattern may be worth looking at if you do OOP)
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you ajust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Modeling distribution of performance measurements

How would you mathematically model the distribution of repeated real life performance measurements - "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario?
My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers - often with a multiple of the average time. (The behavior is understandable considering the factors contributing to first execution penalty).
My goal is to model aggregate values that reasonably reflect this, and can be calculated from aggregate values (like for the Gaussian, calculate mu and sigma from N, sum of values and sum of squares). In other terms, number of repetitions is unlimited, but memory and calculation requirements should be minimized.
A normal Gaussian distribution can't model the long tail appropriately and will have the average biased strongly even by a very small percentage of outliers.
I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distributions models, and I think I could work out something, but my statistics is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)
Other aspects / ideas: Sometimes you get "two humps" distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately.
Extrapolating this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly) - haven't found anything, but with some assumptions about the distribution it should be possible in principle.
Why (since it was asked) -
For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc.
This can be solved trivially by accumulating all data. However, to accumulate this data in production, the data produced needs to be limited. For analysis of isolated pieces of code, a gaussian deviation plus first run penalty is ok. That doesn't work anymore for the distributions found above.
[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.
Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
If you want, you can consider that to have multiple humps, i.e. to be the sum of two normals having different mean. Those are a bit tricky to estimate the parameters of, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
Log-normal distributions are very convenient and well-behaved. For example, you don't deal with its average, you deal with it's geometric mean, which is the same as its median.
BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.
ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ plot (quantile-quantile). To do that, generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal distribution point, and Y is the log-sample point. The results should be a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
The problem you describe is called "Distribution Fitting" and has nothing to do with performance measurements, i.e. this is generic problem of fitting suitable distribution to any gathered/measured data sample.
The standard process is something like that:
Guess the best distribution.
Run hypothesis tests to check how well it describes gathered data.
Repeat 1-3 if not well enough.
You can find interesting article describing how this can be done with open-source R software system here. I think especially useful to you may be function fitdistr.
In addition to already given answers consider Empirical Distributions. I have successful experience in using empirical distributions for performance analysis of several distributed systems. The idea is very straightforward. You need to build histogram of performance measurements. Measurements should be discretized with given accuracy. When you have histogram you could do several useful things:
calculate the probability of any given value (you are bound by accuracy only);
build PDF and CDF functions for the performance measurements;
generate sequence of response times according to a distribution. This one is very useful for performance modeling.
Try whit gamma distribution http://en.wikipedia.org/wiki/Gamma_distribution
From wikipedia
The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.
The standard for randomized Arrival times for performance modelling is either Exponential distribution or Poisson distribution (which is just the distribution of multiple Exponential distributions added together).
Not exactly answering your question, but relevant still: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, The effect of heavy-tailed job size distributions on computer systems design (1999). She found that the size of jobs submitted to her distributed task assignment system took a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task assignment system, most importantly that the jobs should be well load balanced, had awful consequences for submitters of jobs. She's done good follor-up work on this issue.
The broader point is, you need to ask such questions as:
What happens if reasonable-seeming assumptions about the distribution of performance, such as that they take a normal distribution, break down?
Are the data sets I'm looking at really representative of the problem I'm trying to solve?

Is a kd-tree suitable for 4D space-time data (x,y,z,time)?

I want to use a data structure for sorting space-time data (x,y,z,time).
Currently a processing algorithm searches a set of 4D (x,y,z,time) points, given a spherical (3d) spacial radius and a linear (1d) time radius, marking for each point, which other points are within those radii. The reason is that after processing, I can ask any 4D point for all of its neighbours in O(1) time.
However in some common configurations of space and time radii, the first run of the algorithm takes about 12 hours. Believe it or not, that's actually fast compared to what exists in our industry. Nevertheless, I want to help speed up the initial runs and so I want to know: Is a kd-tree suitable for 4D space-time data?
Note that I am not looking for implementations of nearest-neighbour search or k-nearest-neighbours search.
More Info:
An example dataset has 450,000 4D points.
Some datasets are time-dense so ordering by time certainly saves processing, but still leads to many distance checks.
Time is represented by Excel-style dates, with typical ranges between 30,000-39,000 (approximate). The space ranges are sometimes higher values, sometimes lower values, but the range between each space co-ordinate is similar to time (e.g. maxX-minX ~ maxT-minT).
Even more info:
I thought I'd add some more slightly irrelevant data in case anybody has dealt with a similar dataset.
Basically I'm working with data that represents space-time events that are recorded and corroborated by multiple sensors. Error is involved, so only events that meet an error threshold are included.
The time span of these datasets ranges between 5-20 years of data.
For the really old data (>8 years old), the events were often very spacially dense for two reasons: 1) there were relatively few sensors available back then, and 2) the sensors were placed close together so that nearby events could be properly corroborated with low error. Further events could be recorded, but they had too high an error
For the newer data (<8 years old), the events are often very time dense, for the inverse reasons: 1) there are usually many sensors available, and 2) the sensors are placed at regular intervals over a larger distance.
As a result, the datasets cannot typically be said to be only time-dense or only spacially dense (except in the case of datasets that contain only new data).
Conclusion
I clearly should be asking more questions on this site.
I will be testing out several solutions over the next while which will include the 4d kd-tree, a 3d kd-tree followed by time distance check (suggested by Drew Hall), and the current algorithm I have. Also, I have been suggested another data structure called TSP (time space partitioning) tree, which uses an octree for space and a bsp on each node for time, so I may test that as well.
Assuming I remember, I'll be sure to post some profiling benchmarks on different time/space radii configurations.
Thanks all
To expand a little bit on my comments to an answer above:
According to the literature, kd-trees require data with Euclidean coordinates. They are probably not strictly necessary, but they're certainly sufficient: guaranteeing that all coordinates are Euclidean ensures that the normal rules of space apply, and makes it possible to easily partition points by their location and build up the tree structure.
Time is a little bit strange. Under the rules of special relativity, you use a Minkowski metric, not a standard Euclidean metric, when you're working with time coordinates. This causes all kinds of problems (most severe among them destroying the meaning of "simultaneity"), and generally makes people afraid of time coordinates. That fear is not well-founded, though, because unless you know you're working on physics, your time coordinate almost certainly actually will be Euclidean in practice.
What does it mean for a coordinate to be Euclidean? It should be independent of all the other coordinates. Saying time is a Euclidean coordinate means that you can answer the question "Are these two points close together in time?" by looking only at their time coordinates, and ignoring any extra information. It's easy to see why not having that property might break a scheme that partitions points by the values of their coordinates; if two points can have radically different time coordinates but still be considered "close in time", then a tree which sorts them by time coordinate is not going to work very well.
An example of a Euclidean time coordinate would be any time specified in a single, consistent time zone (like UTC times). If you have two clocks, one in New York and one in Tokyo, you know that if you have two measurements labelled "12:00 UTC" then they were taken at the same time. But if the measurements are taken in local time, so one says "12:00 New York time" and one is "12:00 Tokyo time", you have to use extra information about the locations and time zones of the cities to figure out how much time elapsed between the two measurements.
So as long as your time coordinate is consistently measured and sane, it will be Euclidean, and that means it will work just fine in a kd-tree or similar data structure.
If you stored an index to your points sorted in the time dimension, couldn't you first perform an initial pruning in the 1-d time dimension, thus reducing the number of distance calculations? (Or is that an oversimplfication?)
You haven't really given enough information to answer this.
But sure, in general kd-trees are perfectly suitable for 4 (or 5 or 6 or...) dimensional data --- if the spatial (or in your case space/time-ial) distribution lends itself to kd-tree decomposition. In other words, it depends (sound familiar?).
kd-trees are just one method of spatial decomposition that lend themselves to certain localized searches. As you go to higher dimensions, the curse of dimensionality problem rears it's head, of course, but 4d isn't too bad (you probably want at least a several hundred points though).
In order to know if this will work for you, you have to analyse some other criteria. Is approximate NN search good enough (this can help a lot). Is tree balancing likely to be expensive? etc.
If your data is relatively time-dense (and relatively space-sparse), it might work best to use a 3d kd-tree on the spatial dimensions, then simply reject the points that are outside the time window of interest. That would get around your mixed space/time metric problem, at the expense of a slightly more complex point struct.

Resources