Detect points that are very scattered from the rest of the data - algorithm

I have a set of results (numbers), and I would like to know if a given result is very good/bad compared to the previous results (only previous).
Each result is a positive real number (∈ ℝ+). For example, if you have the sequence 10, 11, 10, 9.5, 16, then 16 is clearly a very good result compared to the previous ones. I would like to find an algorithm to detect this situation (a very good/bad result compared to previous results).
A more general way to state this problem is: how do you determine whether a point in a given set of data is scattered far from the rest of the data?
Now, that might look like a peak detection problem, but since the previous values are not constant there are many tiny peaks, and I only want the big ones.
My first idea was to compute the mean and the standard deviation, but that is quite limited. If the previous results contain one very high or very low value, it changes the mean and standard deviation dramatically, so later results have to be even more extreme to exceed the threshold, and therefore many points will not be (properly) detected.
I'm quite sure this must be a well-known problem.
Can anyone help me with this?

This kind of problem is called Anomaly Detection.
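One simple way to get around the problem described in the question (a single extreme earlier value inflating the mean and standard deviation) is to score each new result against the median and the median absolute deviation (MAD) of the previous results. This is only a minimal sketch of one common anomaly-detection technique, not the only one; the function name and the threshold of 3.5 are assumptions you would need to tune.

```python
from statistics import median

def is_outlier(previous, value, threshold=3.5):
    """Return True if `value` is far from `previous` by a median/MAD score."""
    med = median(previous)
    # Median absolute deviation: a spread estimate that one extreme
    # earlier value barely changes, unlike the standard deviation.
    mad = median(abs(x - med) for x in previous)
    if mad == 0:
        # All previous values are (nearly) identical: any change stands out.
        return value != med
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # for normally distributed data.
    score = 0.6745 * (value - med) / mad
    return abs(score) > threshold
```

With the sequence from the question, is_outlier([10, 11, 10, 9.5], 16) flags the 16, and a later 15 is still flagged even once 16 is part of the history.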

Related

Why is a finite sum calculated so long?

I'm trying to compute the following sum:
It is calculated instantly. So I raise the number of points to 24^3 and it still works fast:
But when the number of points is 25^3 it's almost impossible to await the result! Moreover, there is a warning:
Why is it so time-consuming to calculate a finite sum? How can I get a precise answer?
Try
max=24;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{0.143978,14330.9}
and
max=25;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{0.156976,14636.6}
and even
max=50;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{1.36679,16932.5}
Changing your code in this way avoids doing hundreds or thousands of If tests that will almost always result in True. And it potentially uses symbolic algorithms to find those results instead of needing to add up each one of the individual values.
Compare those results and times if you replace Sum with NSum and if you replace /500 with *.002
To try to guess why the times you see suddenly change as you increment the bound, other people have noticed in the past that it appears there are some hard coded bounds inside some of the numerical algorithms and when a range is small enough Mathematica will use one algorithm, but when the range is just large enough to exceed that bound then it will switch to another and potentially slower algorithm. It is difficult or impossible to know exactly why you see this change without being able to inspect the decisions being made inside the algorithms and nobody outside Wolfram gets to see that information.
To get a more precise numerical value you can change N[...] to N[...,64] or N[...,256] or eliminate the N entirely and get a large complicated exact numeric result.
Be cautious with this, check the results carefully to make certain that I have not made any mistakes. And some of this is just guesswork on my part.

Statistics/Algorithm: How do I compare a weekly graph with its own history to see when in the past it was almost the same?

I’ve got a statistical/mathematical problem I’m stumped on and I was really hoping to get some help. I’m working on a research project where I need to compare a weekly graph with its own history to see when in the past it was almost the same. Think of this as “finding the closest match”. The information is displayed as a line graph, but it’s readily available as raw data:
Date        Result
08/10/18    52.5
08/07/18    60.2
08/06/18    58.5
08/05/18    55.4
08/04/18    55.2
and so on...
What I really want is for the output to be a form of correlation between the current data points and every other set of 5 consecutive data points in the history. So, something like:
Date range           Correlation
07/10/18-07/15/18    0.98
We’ll be getting code written in Python for the software to do this automatically (so that as new data is added, it automatically runs and finds the closest set of numbers to match the current one).
Here’s where the difficulty sets in: Since numbers are on a general upward trend over time, we don’t want it to compare the absolute value (since the numbers might never really match). One suggestion has been to compare the delta (rate of change as a percentage over the previous day), or using a log scale.
I’m wondering: how do I go about this? What kind of calculation can I use to get the desired results? I’ve looked at the different kinds of correlation equations, but they don’t account for the “shape” of the data, and they generally just average it out. The shape of the line chart is the important thing.
Thanks very much in advance!
I would simply divide the data of each week by their average (i.e., normalize them to an average of 1), then sum the squares of the differences of each day of each pair of weeks. This sum is what you want to minimize.
If you don't care about how much a graph oscillates relative to its mean, you can normalize also the variance. For each week, calculate mean and variance, then subtract the mean and divide by the root of the variance. Each week will have mean 0 and variance 1. Then minimize the sum of squares of differences like before.
If the normalization of data is all you can change in your workflow, just leave out the sum of squares of differences minimization part.
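A minimal Python sketch of the approach in this answer, assuming NumPy is available and that the series is ordered oldest first (the raw data in the question is listed newest first, so it would need reversing). The function name closest_windows and its parameters are made up for illustration.

```python
import numpy as np

def closest_windows(series, window=5, zscore=False, top=3):
    """Rank earlier windows by similarity in shape to the most recent one."""
    x = np.asarray(series, dtype=float)

    def normalize(w):
        if zscore:
            # Mean 0, variance 1: ignores both level and amplitude.
            s = w.std()
            return (w - w.mean()) / s if s > 0 else np.zeros_like(w)
        # Mean 1: ignores the level but keeps relative swings.
        return w / w.mean()

    current = normalize(x[-window:])
    scores = []
    # Only consider windows that end before the current window starts.
    for start in range(0, len(x) - 2 * window + 1):
        candidate = normalize(x[start:start + window])
        ssd = float(np.sum((candidate - current) ** 2))
        scores.append((ssd, start))
    # Smallest sum of squared differences first.
    return sorted(scores)[:top]
```

Each result is a (sum of squared differences, start index) pair, so the first entry is the closest match. Dividing by the mean assumes the values are positive, which fits data on a general upward trend.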

In matlab, speed up cross correlation

I have a long time series with some repeating and similar looking signals in it (not entirely periodical). The length of the time series is about 60000 samples. To identify the signals, I take out one of them, having a length of around 1000 samples and move it along my timeseries data sample by sample, and compute cross-correlation coefficient (in Matlab: corrcoef). If this value is above some threshold, then there is a match.
But this is excruciatingly slow (using 'for loop' to move the window).
Is there a way to speed this up, or maybe there is already some mechanism in Matlab for this ?
Many thanks
Edited: added information, regarding using 'xcorr' instead:
If I use 'xcorr', or at least the way I have used it, I get the wrong picture. Looking at the data (first plot), there are two types of repeating signals. One marked by red rectangles, whereas the other and having much larger amplitudes (this is coherent noise) is marked by a black rectangle. I am interested in the first type. Second plot shows the signal I am looking for, blown up.
If I use 'xcorr', I get the third plot. As you see, 'xcorr' gives me the wrong signal (there is in fact high cross correlation between my signal and coherent noise).
But using 'corrcoef' and moving the window, I get the last plot, which is the correct one.
There may be a problem of normalization when using 'xcorr', but I don't know.
I can think of two ways to speed things up.
1) Make your template 1024 elements long (a power of 2). The correlation can then be done using the FFT, which is significantly faster than evaluating the DFT directly or doing an element-by-element multiplication for every position.
2) Ask yourself what it is about your template shape that you really care about. Do you really need the very high frequencies, or are you really after lower frequencies? If you could re-sample your template and signal so it no longer contains any frequencies you don't care about, it will make the processing very significantly faster. Steps to take would include
determine the highest frequency you care about
filter your data so higher frequencies are blocked
resample the resulting data at a lower sampling frequency
Now combine that with a template whose size is a power of 2
You might find this link interesting reading.
Let us know if any of the above helps!
Your problem seems like a textbook example of cross-correlation. Therefore, there's no good reason using any solution other than xcorr. A few technical comments:
xcorr assumes that the mean has been removed from the two cross-correlated signals. Furthermore, by default it does not scale by the signals' standard deviations. Both of these issues can be addressed by z-scoring your two signals: c=xcorr(zscore(longSig,1),zscore(shortSig,1)); c=c/n; where n is the length of the shorter signal. This should produce results close to your sliding-window method as long as the mean and variance of the long signal do not change much from window to window; corrcoef normalizes each window separately, whereas zscore normalizes the long signal as a whole.
xcorr's output is ordered according to lags, which can be obtained as a second output argument ([c,lags]=xcorr(...)). Always plot xcorr results with plot(lags,c). I recommend trying a synthetic signal to verify that you understand how to interpret this chart.
xcorr's implementation already uses the Discrete Fourier Transform, so unless you have unusual conditions it will be a waste of time to code a frequency-domain cross-correlation again.
Finally, a comment about terminology: correlating corresponding time points between two signals is plain correlation. That's what corrcoef does (its name stands for correlation coefficient, no 'cross-correlation' there). Cross-correlation is the result of shifting one of the signals and calculating the correlation coefficient for each lag.
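To make the difference between the two normalizations concrete, here is a sketch in Python with NumPy (not MATLAB) that reproduces the sliding corrcoef result without an explicit loop. The function name sliding_corrcoef is made up, and np.correlate does a direct (non-FFT) correlation, so treat it as an illustration of the per-window normalization and not as a tuned implementation.

```python
import numpy as np

def sliding_corrcoef(signal, template):
    """Pearson correlation of `template` with every same-length window of `signal`."""
    x = np.asarray(signal, dtype=float)
    t = np.asarray(template, dtype=float)
    n = len(t)
    # z-score the template once (population standard deviation).
    t = (t - t.mean()) / t.std()

    # Sum of x[i+k] * t[k] for every offset i, in one call instead of a loop.
    dot = np.correlate(x, t, mode="valid")

    # Per-window sums via cumulative sums, giving each window its own mean and std.
    c1 = np.concatenate(([0.0], np.cumsum(x)))
    c2 = np.concatenate(([0.0], np.cumsum(x * x)))
    mean = (c1[n:] - c1[:-n]) / n
    var = (c2[n:] - c2[:-n]) / n - mean ** 2
    std = np.sqrt(np.maximum(var, 0.0))

    # Because the z-scored template sums to zero, the window mean
    # drops out of the numerator; only the window std is needed.
    with np.errstate(divide="ignore", invalid="ignore"):
        r = dot / (n * std)
    # Constant windows have no defined correlation; report 0 for them.
    return np.where(std > 0, r, 0.0)
```

Because each window is divided by its own standard deviation, a large-amplitude stretch of the long signal cannot dominate the result the way it can when the whole signal is scaled once.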

Algorithm for deviations

Given a week of integer data (40, 30, 25, 55, 5, 40, etc.), I have to raise an alert when a value deviates from the norm (the '5' in this case). A nice extra would be to learn whether 5 is actually a normal value for that day of the week.
Do you know an implementation in ruby that is meant for this issue? In case this is a classic problem, what's the name of the problem/algorithm?
It's a very easy thing to calculate, but you will need to tune one parameter. You want to know if any given value is X standard deviations from the mean. To figure this out, calculate the standard deviation (see Wikipedia), then compare each value's deviation from the mean, abs(mean - value), to a multiple of it. If a value is, say, more than two standard deviations from the mean, flag it.
Edit:
To track deviations by weekday, keep an array of integers, one for each day. Every time you encounter a deviation, increment that day's counter by one. You could also use doubles and instead maintain a percentage of deviations for that day (num_friday_deviations/num_fridays) for example.
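A rough sketch of both steps, in Python for illustration (the question asked about Ruby, but the logic carries over directly). The function names are made up, k is the parameter you have to tune, and each week is assumed to be a list of seven values indexed by weekday.

```python
from collections import Counter
from statistics import mean, pstdev

def flag_deviations(values, k=2.0):
    """Return indices of values more than k standard deviations from the mean."""
    m = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return []  # all values identical, nothing deviates
    return [i for i, v in enumerate(values) if abs(v - m) > k * sd]

def weekday_rates(weeks, k=2.0):
    """Fraction of weeks in which each weekday (0..6) was flagged."""
    counts = Counter()
    for week in weeks:
        counts.update(flag_deviations(week, k))
    return {day: counts[day] / len(weeks) for day in range(7)}
```

Note that for the sample in the question (40, 30, 25, 55, 5, 40) the value 5 is only about 1.8 standard deviations from the mean, so k=2 does not flag it while k=1.5 does. That is the tuning mentioned above.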
This is often referred to as "anomaly detection" and there is a lot of work out there if you google for it. The paper Mining Deviants in Time Series Data Streams may help you with your specific needs.
From the abstract:
We present first-known algorithms for identifying deviants on massive data streams. Our algorithms monitor streams using very small space (polylogarithmic in data size) and are able to quickly find deviants at any instant, as the data stream evolves over time.
http://en.wikipedia.org/wiki/Control_chart describes classical ways of doing this sort of thing. As Jonathan Feinberg commented, there are different approaches.
The name of the algorithm could be as simple as "calculate standard deviation."
http://en.wikipedia.org/wiki/Standard_deviation
However, any analysis you do should be specific to the data set. You should inspect historical data to get at the right algorithm. Standard deviation won't be a good measure at all unless your data is normally distributed. Your data might even be such that you just want to look for numbers above a certain max value... it really depends.
So, my advice to you is:
1) Google for statistics overview and read up on basic statistics.
2) Inspect any historical data you have.
3) Come up with some reasonable measure of an odd number.
4) Test your measure against your historical data and see if it highlights the numbers you think it should.
5) Repeat steps 2-4 as necessary to refine your algorithm.

Algorithm to score similarness of sets of numbers

What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and putting the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable to results using different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing temperature cannot be compared to numbers generated with other data such as wind speed or precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the mean square error generates numbers in the hundreds of thousands, compared to the tens or hundreds generated when using temperature.
I think the mean square error metric might work for applications such as weather compares. It's easy to calculate and gives numbers that do make sense.
Since you want to compare measurements over time, you can just leave out missing values from the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
Use the Pearson correlation coefficient. I figured out how to calculate it in an SQL query, which can be found here: http://vanheusden.com/misc/pearson.php
In finance they use Beta to measure how two series of numbers move together. E.g., Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the two series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
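For illustration, the Beta formula above could be computed like this in Python with NumPy. This is a sketch under the assumption that both inputs are price series of equal length, ordered oldest first; the function and variable names are made up.

```python
import numpy as np

def beta(asset_prices, index_prices):
    """Covariance(asset returns, index returns) / Variance(index returns)."""
    a = np.asarray(asset_prices, dtype=float)
    b = np.asarray(index_prices, dtype=float)
    # Day-over-day percentage changes make the two series scale-free.
    ra = np.diff(a) / a[:-1]
    rb = np.diff(b) / b[:-1]
    # Population covariance and variance; the 1/n factors cancel in the ratio.
    cov = np.mean((ra - ra.mean()) * (rb - rb.mean()))
    return cov / rb.var()
```

A Beta of 2 means the first series tends to move twice as far, in percentage terms, as the second one on the same day.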
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
    distance = sqrt(
        pow(currentpoint.temp - historicalpoint.temp, 2) +
        pow(currentpoint.wind - historicalpoint.wind, 2) +
        pow(currentpoint.precip - historicalpoint.precip, 2))
    if distance is smaller than the largest distance in our match collection
        add historicalpoint to our match collection
        remove the match with the largest distance from our match collection
next
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
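Here is a small runnable version of the brute-force search in Python, as a sketch only. The feature ranges are the assumed ones from the example, each observation is assumed to be a dict with temp, wind and precip keys, and each feature is scaled to 0..1 (which ranks matches the same way as stretching every feature to a 150-unit range).

```python
import heapq
import math

# Assumed feature ranges, taken from the example above: (min, max).
RANGES = {"temp": (-50.0, 100.0), "wind": (0.0, 120.0), "precip": (0.0, 100.0)}

def scale(point):
    """Map each feature to 0..1 so no feature dominates the distance."""
    return [(point[f] - lo) / (hi - lo) for f, (lo, hi) in RANGES.items()]

def closest_matches(current, history, k=3):
    """Return the k historical points with the smallest Euclidean distance."""
    target = scale(current)
    # heapq.nsmallest keeps only the k best candidates while scanning.
    return heapq.nsmallest(k, history, key=lambda p: math.dist(target, scale(p)))
```

Adding a feature only means adding an entry to RANGES; the distance calculation stays the same.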
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all-- it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situations where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
    double c = correlation(historical_set, forecast_set);
    double avg_history = average(historical_set);
    double avg_forecast = average(forecast_set);
    // relative difference between the two averages
    double penalty = abs(avg_history - avg_forecast) / avg_forecast;
    return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference between the averages of the two sets of numbers.
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degrees F, with 2000 km/h winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, let's say that it's a windy, hot day: that might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3 km/h away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
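A minimal sketch of the quantile idea in Python, assuming the historical record is available as a dict mapping each measure to a list of past values. The names quantile and quantile_distance are made up, and squared difference is just one of the two options mentioned above.

```python
from bisect import bisect_right

def quantile(history, value):
    """Empirical quantile: fraction of historical values <= value."""
    ordered = sorted(history)
    return bisect_right(ordered, value) / len(ordered)

def quantile_distance(day_a, day_b, history):
    """Sum of squared quantile differences over all measures in `history`."""
    return sum(
        (quantile(history[m], day_a[m]) - quantile(history[m], day_b[m])) ** 2
        for m in history
    )
```

Since every measure ends up on the same 0..1 scale, distances for temperature, wind or any other data set are directly comparable. For a long record you would sort each history once up front and not on every call.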
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (for example, temperatures measured on the same days but at different locations), you can regress the first data set against the second,
and then test that the slope is equal to 1, and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions of the y-values against their indices. http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use the dot product to compute the similarity of two given vectors (i.e. sets of numbers).
You might need to normalize your vectors.
More : Cosine similarity
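For example, cosine similarity is the dot product of the two vectors after each is normalized to length 1. A short Python sketch, assuming both sets have the same length and the same ordering:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The result is 1 when the two vectors point in the same direction (one is a positive multiple of the other) and 0 when they are orthogonal, so it ignores overall scale.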

Resources