Graph plotting: only keeping most relevant data - algorithm

In order to save bandwith and so as to not to have generate pictures/graphs ourselves I plan on using Google's charting API:
http://code.google.com/apis/chart/
which works by simply issuing a (potentially long) GET (or a POST) and then Google generate and serve the graph themselves.
As of now I've got graphs made of about two thousands entries and I'd like to trim this down to some arbitrary number of entries (e.g. by keeping only 50% of the original entries, or 10% of the original entries).
How can I decide which entries I should keep so as to have my new graph the closest to the original graph?
Is this some kind of curve-fitting problem?
Note that I know that I can do POST to Google's chart API with up to 16K of data and this may be enough for my needs, but I'm still curious

The flot-downsample plugin for the Flot JavaScript graphing library could do what you are looking for, up to a point.
The purpose is to try retain the visual characteristics of the original line using considerably fewer data points.
The research behind this algorithm is documented in the author's thesis.
Note that it doesn't work for any kind of series, and won't give meaningful results when you want a downsampling factor beyond 10, in my experience.
The problem is that it cuts the series in windows of equal sizes then keep one point per window. Since you may have denser data in some windows than others the result is not necessarily optimal. But it's efficient (runs in linear time).

What you are looking to do is known as downsampling or decimation. Essentially you filter the data and then drop N - 1 out of every N samples (decimation or down-sampling by factor of N). A crude filter is just taking a local moving average. E.g. if you want to decimate by a factor of N = 10 then replace every 10 points by the average of those 10 points.
Note that with the above scheme you may lose some high frequency data from your plot (since you are effectively low pass filtering the data) - if it's important to see short term variability then an alternative approach is to plot every N points as a single vertical bar which represents the range (i.e. min..max) of those N points.

Graph (time series data) summarization is a very hard problem. It's like deciding, in a text, what is the "relevant" part to keep in an automatic summarization of it. I suggest you use one of the most respected libraries for finding "patterns of interest" in time series data by Eamonn Keogh

Related

Performance comparsion: Algorithm S and Algorithm Z

Recently I ran into two sampling algorithms: Algorithm S and Algorithm Z.
Suppose we want to sample n items from a data set. Let N be the size of the data set.
When N is known, we can use Algorithm S
When N is unknown, we can use Algorithm Z (optimized atop Algorithm R)
Performance of the two algorithms:
Algorithm S
Time complexity: average number of scanned items is n(N+1)/n+1 (I compute the result, Knuth's book left this as exercises), we can say it O(N)
Space complexity: O(1) or O(n)(if returning an array)
Algorithm Z (I search the web, find the paper https://www.cs.umd.edu/~samir/498/vitter.pdf)
Time complexity: O(n(1+log(N/n))
Space complexity: in TAOCP vol2 3.4.2, it mentions Algorithm R's space complexity is O(n(1+log(N/n))), so I suppose Algorithm Z might be the same
My question
The model for Algorithm Z is: keep calling next method on the data set until we reach the end. So for the problem that N is known, we can still use Algorithm Z.
Based on the above performance comparison, Algorithm Z has better time complexity than Algorithm S, and worse space complexity.
If space is not a problem, should we use Algorithm Z even when N is known?
Is my understanding correct? Thanks!
Is the Postgres code mentioned in your comment actually used in production? In my opinion, it really should be reviewed by someone who has at least some understanding of the problem domain. The problem with random sampling algorithms, and random algorithms in general, is that it is very hard to diagnose biased sampling bugs. Most samples "look random" if you don't look too hard, and biased sampling is only obvious when you do a biased sample of a biased dataset. Or when your biased sample results in a prediction which is catastrophically divergent from reality, which will eventually happen but maybe not when you're doing the code review.
Anyway, by way of trying to answer the questions, both the one actually in the text of this post and the ones added or implied in the comment stream:
Properly implemented, Vitter's algorithm Z is much faster than Knuth's algorithm S. If you have a use case in which reservoir sampling is indicated, then you should probably use Vitter, subject to the code testing advice above: Vitter's algorithm is more complicated and it might not be obvious how to validate the implementation.
I noticed in the Postgres code that it just uses the threshold value of 22 to decide whether to use the more complicated code, based on testing done almost 40 years ago on hardware which you'd be hard pressed to find today. It's possible that 22 is not a bad threshold, but it's just a number pulled out of thin air. At least some attempt should be made to verify or, more likely, correct it.
Forty years ago, when those algorithms were developed, large datasets were typically stored on magnetic tape. Magnetic tape is still used today, but applications have changed; I think that you're not likely to find a Postgres installation in which a live database is stored on tape. This matters because the way you get data off a tape drive is radically different from the way you get data from a file server. Or a sharded distributed collection of file servers, which also has its particular needs.
Data on a reel of tape can only be accessed linearly, although it is possible to skip tape somewhat faster than you can read it. On a file server, data is random access; there may be a slight penalty for jumping around in a file, but there might not. (On the sharded distributed model, it might well be faster then linear reads.) But trying to read out of order on a tape drive might turn an input operation which takes an hour into an operation which takes a week. So it's very important to access the sample in order. Moreover, you really don't want to have to read the tape twice, which would take twice as long.
One of the other assumptions that was made in those algorithms is that you might not have enough memory to store the entire sample; in 1985, main memory was horribly expensive and databases were already quite large. So a common way to collect a large sample from a huge database was to copy the sampled blocks onto secondary memory, such as another tape drive. But there's a bit of a catch with reservoir sampling: as the sampling algorithm proceeds, some items which were initially inserted in the sample are later replaced with other items. But you can't replace data written on tape, so you need to just keep on appending the newly selected samples. What you do hold in random access memory is a list of locations of the sample; once you've finished selecting the sample, you can sort this list of locations and then use it to read out the final selection in storage order, skipping over the rejected items. That means that the temporary sample storage ends up holding both the final sample, and some number of later rejected items. The O(n(1+log(N/n))) space complexity in Algorithm R refers to precisely this storage, and it's actually a reasonably small multiplier, considering.
All that is irrelevant if you can just allocate enough random access storage somewhere to hold the entire sample. Or, even better, if you can directly read a data from the database. There could well still be good reasons to read the sample into local storage, but nothing stops you from updating a block of local storage with a different block.
On the other hand, in many common cases, you don't need to read the data in order to sample it. You can just take a list of items numbers, select a sample from that list of the desired size, and then set about acquiring the sample from the list of selected item numbers. And that presents a rather different problem: how to choose an unbiased sample of size k from a set of K item indexes.
There's a fast and simple solution to that (also described by Knuth, unsurprisingly): make an array of all the item numbers (say, the integers from 0 to K, and then shuffle the array using the standard Knuth/Fisher-Yates shuffle, with a slight modification: you run the algorithm from front to back (instead of back to front, as it is often presented), and stop after k iterations. At that point the first k elements in the partially shuffled array are an unbiased sample. (In fact, you don't need the entire vector of K indices, as long as k is much smaller than K. You're only going to touch O(k) of the values, and you can keep the ones you touched in a hash table of size O(k).)
And there's an even simpler algorithm, again for the case where the sample is small relative to the dataset: just keep one bit for each item in the dataset, which indicates that the item has been selected. Now select k items at random, marking the bit vector as you go; if the relevant bit is already marked, then that item is already in the sample; you just ignore that selection and continue with the next random choice. The expected number of ignored sample is very small unless the sample size is a significant fraction of the dataset size.
There's one other criterion which weighed on the minds of Vitter and Knuth: you'll normally want to do something with the selected sample. And given the amount of time it takes to read through a tape, you want to be able to start processing each item immediately as it is accepted. That precludes algorithms which include, for example, "sort the selected indices and then read the indicated items. (See above.) For immediate processing to be possible, you must not depend on being able to "deselect" already selected items.
Fortunately, both the quick algorithms mentioned at the end of point 2 do satisfy this requirement. In both cases, an item once selected will never be later rejected.
There is at least one use case for reservoir sampling which is still very much relevant: sampling a datastream which is too voluminous or too high-bandwidth to store. That might be some kind of massive social media feed, or it might be telemetry data from a large sensor array, or whatever. In that case, you might want to reduce the size of the datastream by extracting only a small sample, and reservoir sampling is a good candidate. However, that has nothing to do with the Postgres example.
In summary:
Yes, you can (and probably should) use Vitter's Algorithm Z in preference to Knuth's Algorithm S, even if you know how big the data set it.
But there are certainly better algorithms, some of which are outlined above.

Algorithm for finding similar images using an index

There are some surprisingly good image compare tools which find similar image even if it's not exactly the same (eg. change in size, wallpaper, brightness/contrast). I have some example applications here:
Unique Filer 1.4 (shareware): https://web.archive.org/web/20010309014927/http://uniquefiler.com/
Fast Duplicate File Finder (Freeware): http://www.mindgems.com/products/Fast-Duplicate-File-Finder/Fast-Duplicate-File-Finder-About.htm
Visual similarity duplicate image finder (payware): http://www.mindgems.com/products/VS-Duplicate-Image-Finder/VSDIF-About.htm
Duplicate Checker (payware): http://www.duplicatechecker.com/
I only tried the first one, but all of them are developed for Windows and are not open source. Unique Filer was released in 2000 and the homepage seems to have disappeared. It was surprisingly fast (even on computers from that year) because it used an index and comparing some 10000 images using the index needed only some few seconds (and updating the index was a scalable process).
Since this algorithm in a very effective form already exists for at least 15 years, I assume it is well-documented and possibly already implemented as an open source library. Does anyone knows more about which algorithm or image detection theory was used to implement this applications? Maybe there is even a open source implementation of it available?
I already checked the question Algorithm for finding similar images but all of it's answers solve the problem by comparing one image to another. For 1000+ images this will result in 1000^2 comparing operations which is just not what I'm looking for.
The problem you are describing is more generally called Nearest Neighbor Search. Since you are asking for high efficiency on large datasets, Approximated Nearest Neighbor Search is what you are after.
An efficient technique for this is Locality-Sensitive Hashing (LSH), for which these slides give a great overview. Its basic idea is the use of hashing functions which project all data to a low-dimensional space, with the constraint that the hash of similar data collides with a high probability and dissimilar data collides with low probability. These probabilities are parameters to the algorithm, with which the trade-off between accuracy and efficiency can be changed.
LSHKIT is an open-source implementation of LSH.
Meanwhile, I analyzed the algorithm of UniqueFiler:
size reduction
First, it reduces all images to 10x10 pixel grayscale images (likely without using interpolation)
rotation
Probably based on the brightness of the 4 quadrants, some rotation is done (this step is dangerous because it sometimes 'overlooks' similarities if images are too symmetric)
range reduction
The image brightness range is fully extended (brightest -> white, darkest -> black) and then reduced to 2 bit (4 values) per pixel
database
The values get stored as arrays of 100 bytes per image (plus file metadata)
comparison
... is done one-by-one (two nested loops over the whole database plus a third for the 100 bytes). Today, we would probably index the sorted sums of all 4 quadrants for a fast pre-selection of similar candidates.
matcher
The comparison is done byte-by-byte by difference between each two bytes, weighted but less than the square. The sum of these 100 results is the final difference between two images.
I have more detailed information a home. If I find the time, I will add them to this answer. I found this after I discovered that the database format is actually a gzipped file without header, containing fixed-sized records per image

In matlab, speed up cross correlation

I have a long time series with some repeating and similar looking signals in it (not entirely periodical). The length of the time series is about 60000 samples. To identify the signals, I take out one of them, having a length of around 1000 samples and move it along my timeseries data sample by sample, and compute cross-correlation coefficient (in Matlab: corrcoef). If this value is above some threshold, then there is a match.
But this is excruciatingly slow (using 'for loop' to move the window).
Is there a way to speed this up, or maybe there is already some mechanism in Matlab for this ?
Many thanks
Edited: added information, regarding using 'xcorr' instead:
If I use 'xcorr', or at least the way I have used it, I get the wrong picture. Looking at the data (first plot), there are two types of repeating signals. One marked by red rectangles, whereas the other and having much larger amplitudes (this is coherent noise) is marked by a black rectangle. I am interested in the first type. Second plot shows the signal I am looking for, blown up.
If I use 'xcorr', I get the third plot. As you see, 'xcorr' gives me the wrong signal (there is in fact high cross correlation between my signal and coherent noise).
But using "'corrcoef' and moving the window, I get the last plot which is the correct one.
There maybe a problem of normalization when using 'xcorr', but I don't know.
I can think of two ways to speed things up.
1) make your template 1024 elements long. Suddenly, correlation can be done using FFT, which is significantly faster than DFT or element-by-element multiplication for every position.
2) Ask yourself what it is about your template shape that you really care about. Do you really need the very high frequencies, or are you really after lower frequencies? If you could re-sample your template and signal so it no longer contains any frequencies you don't care about, it will make the processing very significantly faster. Steps to take would include
determine the highest frequency you care about
filter your data so higher frequencies are blocked
resample the resulting data at a lower sampling frequency
Now combine that with a template whose size is a power of 2
You might find this link interesting reading.
Let us know if any of the above helps!
Your problem seems like a textbook example of cross-correlation. Therefore, there's no good reason using any solution other than xcorr. A few technical comments:
xcorr assumes that the mean was removed from the two cross-correlated signals. Furthermore, by default it does not scale the signals' standard deviations. Both of these issues can be solved by z-scoring your two signals: c=xcorr(zscore(longSig,1),zscore(shortSig,1)); c=c/n; where n is the length of the shorter signal should produce results equivalent with your sliding window method.
xcorr's output is ordered according to lags, which can obtained as in a second output argument ([c,lags]=xcorr(..). Always plot xcorr results by plot(lags,c). I recommend trying a synthetic signal to verify that you understand how to interpret this chart.
xcorr's implementation already uses Discere Fourier Transform, so unless you have unusual conditions it will be a waste of time to code a frequency-domain cross-correlation again.
Finally, a comment about terminology: Correlating corresponding time points between two signals is plain correlation. That's what corrcoef does (it name stands for correlation coefficient, no 'cross-correlation' there). Cross-correlation is the result of shifting one of the signals and calculating the correlation coefficient for each lag.

Metric for SURF

I'm searching for a usable metric for SURF. Like how good one image matches another on a scale let's say 0 to 1, where 0 means no similarities and 1 means the same image.
SURF provides the following data:
interest points (and their descriptors) in query image (set Q)
interest points (and their descriptors) in target image (set T)
using nearest neighbor algorithm pairs can be created from the two sets from above
I was trying something so far but nothing seemed to work too well:
metric using the size of the different sets: d = N / min(size(Q), size(T)) where N is the number of matched interest points. This gives for pretty similar images pretty low rating, e.g. 0.32 even when 70 interest points were matched from about 600 in Q and 200 in T. I think 70 is a really good result. I was thinking about using some logarithmic scaling so only really low numbers would get low results, but can't seem to find the right equation. With d = log(9*d0+1) I get a result of 0.59 which is pretty good but still, it kind of destroys the power of SURF.
metric using the distances within pairs: I did something like find the K best match and add their distances. The smallest the distance the similar the two images are. The problem with this is that I don't know what are the maximum and minimum values for an interest point descriptor element, from which the distant is calculated, thus I can only relatively find the result (from many inputs which is the best). As I said I would like to put the metric to exactly between 0 and 1. I need this to compare SURF to other image metrics.
The biggest problem with these two are that exclude the other. One does not take in account the number of matches the other the distance between matches. I'm lost.
EDIT: For the first one, an equation of log(x*10^k)/k where k is 3 or 4 gives a nice result most of the time, the min is not good, it can make the d bigger then 1 in some rare cases, without it small result are back.
You can easily create a metric that is the weighted sum of both metrics. Use machine learning techniques to learn the appropriate weights.
What you're describing is related closely to the field of Content-Based Image Retrieval which is a very rich and diverse field. Googling that will get you lots of hits. While SURF is an excellent general purpose low-mid level feature detector, it is far from sufficient. SURF and SIFT (what SURF was derived from), is great at duplicate or near-duplicate detection but is not that great at capturing perceptual similarity.
The best performing CBIR systems usually utilize an ensemble of features optimally combined via some training set. Some interesting detectors to try include GIST (fast and cheap detector best used for detecting man-made vs. natural environments) and Object Bank (a histogram-based detector itself made of 100's of object detector outputs).

Algorithm to score similarness of sets of numbers

What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to diferentiate between good matches and bad matches. We have a lot of historical data, so I would like to try to narrow down the amount of days the users need to look through by automatically throwing out sets that aren't close and trying to put the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable to results using different data sets. For example using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing the temperature can not be compared to numbers generated with other data such as Wind Speed or Precipitation because the scale of the data is different. Some of the non-weather data being is very large, so the mean square error algorithm generates numbers in the hundreds of thousands compared to the tens or hundreds that is generated by using temperature.
I think the mean square error metric might work for applications such as weather compares. It's easy to calculate and gives numbers that do make sense.
Since your want to compare measurements over time you can just leave out missing values from the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
Use the pearson correlation coefficient. I figured out how to calculate it in an SQL query which can be found here: http://vanheusden.com/misc/pearson.php
In finance they use Beta to measure the correlation of 2 series of numbers. EG, Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
distance = sqrt(
pow(currentpoint.temp - historicalpoint.temp, 2) +
pow(currentpoint.wind - historicalpoint.wind, 2) +
pow(currentpoint.precip - historicalpoint.precip, 2))
if distance is smaller than the largest distance in our match collection
add historicalpoint to our match collection
remove the match with the largest distance from our match collection
next
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all-- it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situation where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
double c = correlation(historical_set, forecast_set);
double avg_history = average(historical_set);
double avg_forecast = average(forecast_set);
double penalty = abs(avg_history - avg_forecast) / avg_forecast
return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference the averages of the two sets of numbers.
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degree F, with 2000km/hr winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As example, let's say that a day it's a windy, hot day: that might have a temp quantile of .75, and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3kmh away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days (but different locations), for example, you can regress the first data set against the second,
and then test that the slope is equal to 1, and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions, of the y=values against their indices. http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
Maybe you can see your set of numbers as a vector (each number of the set being a componant of the vector).
Then you can simply use dot product to compute the similarity of 2 given vectors (i.e. set of numbers).
You might need to normalize your vectors.
More : Cosine similarity

Resources