Finding regions in scattered data - algorithm

I have a number of scattered data sets in Nx3 matrices, a simple example plotted with scatter3 is shown below (pastebin of the raw values):
Each of my data sets has an arbitrary number of regions/blobs; the example above, for instance, has 4.
Does anyone know of a simple method to programmatically find the number of regions in this form of data?
My initial idea was to use a delaunayTriangulation, convexHull approach, but without any data treatment this will still only find the outer volume of the entire plot rather than each region.
The next idea I have would involve grabbing nearest neighbour statistics of each point, asking if it's within a grid size distance of another point, then lumping those that are in to separate blobs/clusters.
Is there a higher level Matlab function I'm not aware of that could assist me here, or does anyone have a better suggestion of how to pull the region count out of data like this?

It sounds like you need a clustering algorithm. Fortunately for you, MATLAB provides a number of these out of the box. There are plenty of algorithms to choose from, and it sounds like you need something where the number of clusters is unknown beforehand, correct?
If this is the case, and your data is as "nice" as your example I would suggest kmeans combined with a technique to properly choose "k", as suggested here.
There are other options, of course. I recommend you learn more about the clustering options in MATLAB; here's a nice reference for more reading.

Determining the number of different clusters in a dataset is a tricky problem, and probably harder than it might seem at first sight. In fact, algorithms like k-means depend heavily on it. Wikipedia has a nice article on it, but no clear and easy method.
The Elbow method mentioned there seems comparatively easy to do, although it might be computationally costly. In essence, you just try different numbers of clusters and choose the number at which the explained variance stops growing much and plateaus out.
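For illustration, here is a minimal sketch of the elbow idea in Python/scikit-learn rather than MATLAB (the file name, the range of k values, and the data export are assumptions on my part):

import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt("points.txt")        # hypothetical export of the Nx3 matrix

inertias = []
ks = range(1, 11)                        # candidate numbers of clusters
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(km.inertia_)         # within-cluster sum of squares for this k

# The "elbow" is where the drop in inertia flattens out; inspect the successive
# drops and pick the k after which they stop shrinking appreciably.
for k, drop in zip(list(ks)[1:], -np.diff(inertias)):
    print(k, drop)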
Also, the notion of a cluster needs to be clearly defined: what if zooming into any of the blobs reveals a similar structure to the corner structure in the picture?

I would suggest implementing a "light" version of the Gaussian Mixture Model.
Have each point "vote" for a cube. In the above example, all points centered around (-1.5, -1.5, 0) would each add +1 to the cube [-2,-1] x [-2,-1] x [-0.2,0.2]. Finally, you can analyze the peaks in the voting matrix.
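As a rough sketch of this voting idea (in Python/NumPy rather than MATLAB; the file name, the grid size, and the use of connected-component labelling to "analyze the peaks" are my assumptions):

import numpy as np
from scipy import ndimage

points = np.loadtxt("points.txt")      # hypothetical export of the Nx3 matrix
dx = 0.2                               # assumed grid/cube size

mins = points.min(axis=0)
idx = np.floor((points - mins) / dx).astype(int)
votes = np.zeros(idx.max(axis=0) + 1, dtype=int)
np.add.at(votes, tuple(idx.T), 1)      # each point adds +1 to its cube

# Label connected groups of non-empty cubes; the label count is the region count.
labeled, n_regions = ndimage.label(votes > 0)
print(n_regions)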

In the interest of completeness, there is a vastly simpler answer to this problem (which I have built) than hierarchical clustering; it gives much better results and can differentiate between 1 cluster and 2 (an issue I couldn't manage to fix with MarkV's suggestions). This assumes your data is on a regular grid of known size, and that you have an unknown number of clusters that are separated by at least 2*(grid size):
% Idea is as follows:
% * We have a known grid size, dx.
% * A random point [may as well be minima(1,:)] will be in a cluster of
% values if any others in the list lie dx away (with one dimension
% varied), sqrt(2)*dx (two dimensions varied) or sqrt(3)*dx (three
% dimensions varied).
% * Chain these objects together until all are found, any with distances
% beyond sqrt(3)*dx of the cluster are ignored for now.
% * Set this cluster aside, repeat until no minima data is left.
function [blobs, clusterIdx] = findClusters(minima,dx)
    %problem setup
    dx2 = sqrt(2)*dx;
    dx3 = sqrt(3)*dx;
    eqf = @(list,dx,dx2,dx3)(abs(list-dx) < 0.001 | abs(list-dx2) < 0.001 | abs(list-dx3) < 0.001);
    notDoneClust = true;
    notDoneMinima = true;
    clusterIdx = zeros(size(minima,1),1);
    point = minima(1,:);
    list = minima(2:end,:);
    blobs = 0;
    while notDoneMinima
        cluster = nan(1,3);
        while notDoneClust
            [~, dist] = knnsearch(point,list);            %distances from the current point(s) to every remaining value
            nnidx = eqf(dist,dx,dx2,dx3);                 %indexes of values that are nearest neighbours of point
            cluster = cat(1,cluster,point,list(nnidx,:)); %add those points to the current cluster
            point = list(nnidx,:);                        %points to check are now all values that are nn to the initial point
            list = list(~nnidx,:);                        %list is now all other values that are not in the cluster
            notDoneClust = ~isempty(point);               %if there are no more points to check, the whole cluster has been found
        end
        blobs = blobs+1;
        clusterIdx(ismemberf(minima,cluster(2:end,:),'rows')) = blobs;
        %reset points and list for a new cluster
        if ~isempty(list)
            if length(list)>1
                point = list(1,:);
                list = list(2:end,:);
                notDoneClust = true;
            else
                %point is a cluster of its own. Don't reset loops, add point in
                %as a cluster and exit (NOTE: I have yet to test this portion).
                blobs = blobs+1;
                clusterIdx(ismemberf(minima,point,'rows')) = blobs;
                notDoneMinima = false;
            end
        else
            notDoneMinima = false;
        end
    end
end
I fully understand this method is useless for clustering data in the general sense, as any outlying data will be marked as a separate cluster. This (if it happens) is what I need anyway, so this may just be an edge case scenario.
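For comparison, the same "blobs separated by more than ~sqrt(3)*dx" criterion can also be expressed as single-linkage hierarchical clustering with a distance cutoff. A minimal sketch in Python/SciPy (the file name, dx value, and the small tolerance factor are assumptions, and this is an alternative formulation rather than the code above):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

minima = np.loadtxt("minima.txt")      # hypothetical export of the Nx3 minima
dx = 0.2                               # assumed grid size

Z = linkage(minima, method="single")                           # single-linkage merge tree
labels = fcluster(Z, t=np.sqrt(3) * dx * 1.001, criterion="distance")
print(labels.max())                                            # number of blobs found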

Related

Questions about feature selection and data engineering when using H2O autoencoder for anomaly detection

I am using the H2O autoencoder in R for anomaly detection. I don't have a training dataset, so I am using the data.hex to train the model, and then the same data.hex to calculate the reconstruction errors. The rows in data.hex with the largest reconstruction errors are considered anomalous. The mean squared error (MSE) of the model, which is calculated by the model itself, would be the sum of the squared reconstruction errors divided by the number of rows (i.e. examples). Below is some pseudo-code of the model.
# Deeplearning Model
model.dl <- h2o.deeplearning(x = x, training_frame = data.hex, autoencoder = TRUE, activation = "Tanh", hidden = c(25,25,25), variable_importances = TRUE)
# Anomaly Detection Algorithm
errors <- h2o.anomaly(model.dl, data.hex, per_feature = FALSE)
Currently there are about 10 features (factors) in my data.hex, and they are all categorical features. I have two questions below:
(1) Do I need to perform feature selection to select a subset of the 10 features before the data go into the deep learning model (with autoencoder=TRUE), in case some features are significantly associated with each other? Or do I not need to, since the data will go into an autoencoder which compresses the data and already selects only the most important information, so feature selection would be redundant?
(2) The purpose of using the H2O autoencoder here is to identify the senders in data.hex whose action is anomalous. Here are two examples of data.hex. Example B is a transformed version of Example A, by concatenating all the actions for each sender-receiver pair in Example A.
After running the model on data.hex in Example A and in Example B separately, what I got is
(a) MSE from Example A (~0.005) is 20+ times larger than MSE from Example B;
(b) When I put the reconstruction errors in ascending order and plot them (so errors increase from left to right in the plot), the reconstruction error curve from Example A is steeper (e.g. skyrocketing) on the right end, while the reconstruction error curve from Example B increases more gradually.
My question is, which example of data.hex works better for my purpose to identify anomalies?
Thanks for your insights!
Question 1
You shouldn't need to decrease the number of input features into the model. I can't say I know what would happen during training, but collinear/associated features could be eliminated in the hidden layers as you said. You could consider adjusting your hidden nodes and see how it behaves. hidden = c(25,25,25) -> hidden = c(25,10,25) or hidden = c(15,15) or even hidden = c(7, 5, 7) for your few features.
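If it helps, here is a rough sketch of trying a narrower bottleneck, written against H2O's Python API rather than R (the file path, feature list, layer sizes and epoch count are placeholders; translate back to R as needed):

import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

h2o.init()
data_hex = h2o.import_file("data.csv")        # hypothetical path to your data
x = data_hex.columns                          # the ~10 categorical features

model = H2OAutoEncoderEstimator(activation="Tanh",
                                hidden=[25, 10, 25],   # narrower middle layer to try
                                epochs=50)
model.train(x=x, training_frame=data_hex)

errors = model.anomaly(data_hex, per_feature=False)    # per-row reconstruction error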
Question 2
What is the purpose of your model? Are you trying to determine which "Sender/Receiver combinations" are anomalies or are you trying to determine which "Sender/Receiver + specific Action combo" are anomalies? If it's the former ("Sender/Receiver combinations") I would guess Example B is better.
If you want to know "Sender/Receiver combinations" and use Example A, then how would you aggregate all the actions for one Sender-Receiver combo? Will you average their error?
But it sounds like Example A has more of a response for anomalies in the ascending-order list (where only a few rows have high error). I would sample different rows and see if the errors make sense (as a domain expert), i.e. check whether higher-error rows tend to look like anomalies.

Distribution of the Training Data vs Distribution of the Test/Prediction

Does the distribution represented by the training data need to reflect the distribution of the test data and the data that you predict on? Can I measure the quality of the training data by looking at the distribution of each feature and comparing that distribution to the data I am predicting or testing with? Ideally the training data should be sufficiently representative of the real-world distribution.
Short answer: similar ranges would be a good idea.
Long answer: sometimes it won't be an issue (rarely) but let's examine when.
In an ideal situation, your model will capture the true phenomenon perfectly. Imagine the simplest case: the linear model y = x. If the training data are noiseless (or have tolerable noise), your linear regression will naturally land on a model approximately equal to y = x. The model will generalize nearly perfectly even outside of the training range. If your training data were {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10}, the test point 500 will nicely map onto the function, returning 500.
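A tiny sketch of that ideal case (scikit-learn, with the numbers from the example):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(1, 11).reshape(-1, 1)    # training inputs 1..10
y = np.arange(1, 11).astype(float)     # noiseless targets, y = x

model = LinearRegression().fit(X, y)
print(model.predict([[500]]))           # ~500, far outside the training range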
In most modeling scenarios, this will almost certainly not be the case. If the training data are ample and the model is appropriately complex (and no more), you're golden.
The trouble is that few functions (and corresponding natural phenomena) -- especially when we consider nonlinear functions -- extend to data outside of the training range so cleanly. Imagine sampling office temperature against employee comfort. If you only look at temperatures from 40 deg to 60 deg, a linear function will behave brilliantly on the training data. Oddly enough, if you test on 60 to 80, the mapping will break down. Here, the issue is confidence in your claim that the data are sufficiently representative.
Now let's consider noise. Imagine that you know EXACTLY what the real-world function is: a sine wave. Better still, you are told its amplitude and phase. What you don't know is its frequency. You have a really solid sampling between 1 and 100, and the function you fit maps against the training data really well. Now if there is just enough noise, you might estimate the frequency incorrectly by a hair. When you test near the training range, the results aren't so bad. Outside of the training range, things start to get wonky. As you move further and further from the training range, the real function and the fitted function diverge and converge based on their relative frequencies. Sometimes the residuals are seemingly fine; sometimes they are dreadful.
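A small sketch of that frequency example (NumPy/SciPy; the true frequency, noise level, and evaluation ranges are made up for illustration):

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
true_freq = 0.31
x_train = np.linspace(1, 100, 400)
y_train = np.sin(true_freq * x_train) + rng.normal(0, 0.2, x_train.size)

def f(x, w):
    # amplitude and phase are "known" (1 and 0); only the frequency w is fitted
    return np.sin(w * x)

(w_hat,), _ = curve_fit(f, x_train, y_train, p0=[0.3])

x_near = np.linspace(90, 100, 50)       # just inside the training range
x_far = np.linspace(900, 1000, 50)      # far outside it
err_near = np.abs(np.sin(true_freq * x_near) - f(x_near, w_hat)).max()
err_far = np.abs(np.sin(true_freq * x_far) - f(x_far, w_hat)).max()
print(w_hat, err_near, err_far)   # the tiny frequency error typically shows up as a much larger error far away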
There is an issue with your idea of examining the variable distributions: interaction between variables. Even if each variable is appropriately balanced in train and test, it is possible that the relationships between variables will differ (joint distributions). For a purely contrived example, consider you were predicting an individual's likelihood of being pregnant at any given time. In your training set, you had women aged 20 to 30 and men aged 30 to 40. In testing, you had the same percentage of men and women, but the age ranges were flipped. Independently, the variables look very nicely matched! But in your training set, you could very easily conclude, "only people under 30 get pregnant." Oddly enough, your testing set would demonstrate the exact opposite! The trouble is that your predictions are being made from a multivariate space, but the distributions you are thinking about are univariate. Considering the joint distributions of continuous variables against one another (and considering categorical variables appropriately) is, however, a good idea. Ideally, your fit model should have access to a similar range to your testing data.
Fundamentally, the question is about extrapolation from a limited training space. If the model fit in the training space generalizes, you can generalize; ultimately, it is usually safest to have a really well distributed training set to maximize the likelihood that you have captured the complexity of the underlying function.
Really interesting question! I hope the answer was somewhat insightful; I'll continue to build on it as resources come to mind! Let me know if any questions remain!
EDIT: a point made in the comments that I think should be read by future readers.
Ideally, training data should NEVER influence testing data in ANY way. That includes examining the distributions, joint distributions, etc. With sufficient data, distributions in the training data should converge on distributions in the testing data (think of the mean and the law of large numbers). Manipulation to match distributions (like z-scoring before the train/test split) fundamentally skews performance metrics in your favor. An appropriate technique for splitting train and test data would be something like stratified k-fold cross validation.
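For concreteness, a minimal sketch of such a split with scikit-learn (the feature matrix and labels here are stand-ins):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(100, 4)              # stand-in features
y = np.random.randint(0, 2, size=100)   # stand-in class labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # class proportions of y are preserved in every train/test split
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]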
Sorry for the delayed response. After going through a few months of iterating, I implemented and pushed the following solution to production and it is working quite well.
The issue here boils down to how one can reduce the training/test score variance when performing cross validation. This is important because if your variance is high, the confidence in picking the best model goes down. The more representative the test data is of the train data, the less variance you get in your test scores across the cross-validation set. Stratified cross validation tackles this issue, especially when there is significant class imbalance, by ensuring that the label class proportions are preserved across all test/train sets. However, this doesn't address the issue with the feature distribution.
In my case, I had a few features that were very strong predictors but also very skewed in their distribution. This caused significant variance in my test scores, which made it harder to pick a model with any confidence. Essentially, the solution is to ensure that the joint distribution of the label with the feature set is maintained across test/train sets. There are many ways of doing this, but a very simple approach is to take each column bucket range (if continuous) or label (if categorical) one by one and sample from these buckets when generating the test and train sets. Note that the buckets quickly get very sparse, especially when you have a lot of categorical variables. Also, the column order in which you bucket affects the sampling output greatly. Below is a solution where I bucket the label first (same as stratified CV) and then sample 1 other feature (the most important feature, called score_percentage, which is known upfront).
def train_test_folds(self, label_column="label"):
    # train_test is an array of tuples where each tuple is a test numpy array
    # and train numpy array pair. The final iterator would return these
    # individual elements separately.
    n_folds = self.n_folds
    label_classes = np.unique(self.label)
    train_test = []
    fmpd_copy = self.fm.copy()
    fmpd_copy[label_column] = self.label
    fmpd_copy = fmpd_copy.reset_index(drop=True).reset_index()
    fmpd_copy = fmpd_copy.sort_values("score_percentage")
    for lbl in label_classes:
        fmpd_label = fmpd_copy[fmpd_copy[label_column] == lbl]
        # Calculate the fold # using the label specific dataset
        if (fmpd_label.shape[0] < n_folds):
            raise ValueError("n_folds=%d cannot be greater than the"
                             " number of rows in each class."
                             % (fmpd_label.shape[0]))
        # let's get some variance -- shuffle within each bucket
        # let's go through the data set, shuffling items in buckets of size n_folds
        s = 0
        shuffle_array = fmpd_label["index"].values
        maxS = len(shuffle_array)
        while s < maxS:
            max = min(maxS, s + n_folds) - 1
            for i in range(s, max):
                j = random.randint(i, max)
                if i < j:
                    tempI = shuffle_array[i]
                    shuffle_array[i] = shuffle_array[j]
                    shuffle_array[j] = tempI
            s = s + n_folds
            # print("shuffle s =", s, " max =", max, " maxS =", maxS)
        fmpd_label["index"] = shuffle_array
        fmpd_label = fmpd_label.reset_index(drop=True).reset_index()
        fmpd_label["test_set_number"] = fmpd_label.iloc[:, 0].apply(
            lambda x: x % n_folds)
        print("label ", lbl)
        for n in range(0, n_folds):
            test_set = fmpd_label[fmpd_label["test_set_number"]
                                  == n]["index"].values
            train_set = fmpd_label[fmpd_label["test_set_number"]
                                   != n]["index"].values
            print("for label ", lbl, " test size is ",
                  test_set.shape, " train size is ", train_set.shape)
            print("len of total size", len(train_test))
            if (len(train_test) != n_folds):
                # Split doesn't exist. Add it in.
                train_test.append([train_set, test_set])
            else:
                temp_arr = train_test[n]
                temp_arr[0] = np.append(temp_arr[0], train_set)
                temp_arr[1] = np.append(temp_arr[1], test_set)
                train_test[n] = [temp_arr[0], temp_arr[1]]
    return train_test
Over time, I realized that this whole issue falls under the umbrella of covariate shift, which is a well-studied area within machine learning. See the link below or just search Google for covariate shift. The concept is how to detect and ensure that your prediction data is of a similar distribution to your training data. This is in the feature space, but in theory you could have label drift as well.
https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/
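One common check for covariate shift (not necessarily the exact procedure from the article above) is to train a classifier to distinguish training rows from prediction rows; a cross-validated AUC near 0.5 suggests the two sets are similarly distributed. A rough sketch with stand-in DataFrames:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train_df = pd.DataFrame(np.random.rand(500, 5))     # stand-in training features
predict_df = pd.DataFrame(np.random.rand(500, 5))   # stand-in prediction features

combined = pd.concat([train_df, predict_df], ignore_index=True)
origin = np.r_[np.zeros(len(train_df)), np.ones(len(predict_df))]   # 0 = train, 1 = predict

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, combined, origin, cv=5, scoring="roc_auc").mean()
print(auc)   # ~0.5 -> similar distributions; near 1.0 -> covariate shift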

Matching data based on parameters and constraints

I've been looking into the k nearest neighbors algorithm as I might be developing an application that matches fighters (boxers) in the near future.
The reason for my question, is to figure out which would be the best approach/algorithm to use when matching fighters based on multiple parameters and constraints depending on the rule-set.
The relevant properties of each fighter are the following:
Age (fighters will be assigned to an age group: 15, 17, 19, elite)
Weight
Number of fights
Now there are some rulesets for what can be allowed when matching fighters:
A maximum of 2 years between the fighters (unless it's elite)
A maximum of 3 kilos difference in weight
Now obviously the perfect match would be one where all the attendees get matched with another boxer that fits within the ruleset.
And the main priority is to match as many fighters with each other as possible.
Is K-nn the way to go or is there a better approach?
If so which?
This is too long for a comment.
For best results with K-nn, I would suggest principal components. These allow you to use many more dimensions and do a pretty good job of spreading the data through the space, to get a good neighborhood.
As for incorporating existing rules, you have two choices. Probably the best way is to build them into your distance function. Alternatively, you can take a large neighborhood and build them into the combination function.
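As a sketch of the "build the rules into your distance function" idea (Python/scikit-learn; the fighter data, the column layout, and the weighting of the compatible case are all assumptions):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# columns: [age, weight_kg, n_fights]; stand-in data for illustration
fighters = np.array([
    [17, 60.0, 5],
    [18, 61.5, 7],
    [17, 66.0, 4],
    [19, 62.0, 12],
], dtype=float)

def match_distance(a, b):
    # encode the ruleset in the metric: incompatible pairs get an infinite
    # distance so they can never appear as nearest neighbours
    if abs(a[0] - b[0]) > 2 or abs(a[1] - b[1]) > 3:
        return np.inf
    # otherwise weight the remaining differences however suits the domain
    return abs(a[1] - b[1]) + 0.5 * abs(a[2] - b[2])

nn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric=match_distance)
nn.fit(fighters)
dist, idx = nn.kneighbors(fighters)
print(idx[:, 1])   # each fighter's closest opponent (an inf distance means no legal match)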
I would go with k-Nearest Neighbor search. Since your dataset is in a low dimensional space (i.e. 3), I would use CGAL, in order to perform the task.
Now, the only thing you have to do, is to create a distance function like this:
#include <cmath>
#include <limits>

float boxers_dist(Boxer a, Boxer b) {
    // e is the maximum allowed weight difference (3 kg in the ruleset above)
    if (std::abs(a.year - b.year) > 2 || std::abs(a.weight - b.weight) > e)
        return std::numeric_limits<float>::infinity(); // incompatible pair, never matched
    // think how you should use the 3 dimensions you have, to compute distance
}
And you are done...now go fight!

An understandable clusterization

I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal.
There is some natural structure in this dataset. Commonly, experts cluster datasets such as mine using their 'expert knowledge', but I want to automate this clustering process.
Most algorithms for clustering use a distance (Euclidean, Mahalanobis and so on) between objects to group them into clusters. But it is hard to find a reasonable metric for mixed data types, i.e. we can't find a distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. This is reasonable for categorical variables, and it works fine with numeric variables assuming they are normally distributed.
So it became clear to me that algorithms like K-means will not produce good results.
At this time I am trying to work with the COBWEB algorithm, which fully matches my idea of using conditional probabilities. But I faced another obstacle: the results of the clustering are really hard to interpret, if not impossible. As a result I wanted to get something like a set of rules that describes each cluster (e.g. if feature1 = 'a' and feature2 in [30, 60], it is cluster1), like decision trees for classification.
So, my question is:
Is there any existing clustering algorithm that works with mixed data types and produces an understandable (and reasonable for humans) description of clusters?
Additional info:
As I understand it, my task is in the field of conceptual clustering. I can't define a similarity function as was suggested (that is an ultimate goal of the whole project), because of the field of study: it is very complicated and merciless in terms of formalization. As far as I understand, the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it so that I can get an understandable description of clusters.
Decision Tree
As was suggested, I tried to train a decision tree on the clustering output, thus getting a description of the clusters as a set of rules. But unfortunately interpreting these rules is almost as hard as interpreting the raw clustering output. First, only the first few levels of rules from the root node make any sense: the closer to the leaves, the less sense they make. Secondly, these rules don't match any expert knowledge.
So I came to the conclusion that clustering is a black box, and it is not worth trying to interpret its results.
Also
I had an interesting idea to modify a 'decision tree for regression' algorithm in a certain way: instead of calculating the intra-group variance, calculate a category utility function and use it as the split criterion. As a result we should get a decision tree with leaf clusters and cluster descriptions out of the box. But I haven't tried to do so, and I am not sure about accuracy and everything else.
For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g. satisfy triangle inequality).
K-means is particularly bad, because it also needs to compute means. So it's better to stay away from it if you cannot compute means, or are using a different distance function than Euclidean.
However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions, say you use the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorical similarity function.
Once you have a decent similarity function, a whole bunch of algorithms become available to you, e.g. DBSCAN (Wikipedia) or OPTICS (Wikipedia). ELKI may be of interest to you; they have a tutorial on writing custom distance functions.
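As a rough illustration in Python (rather than ELKI), with made-up data, scaling, and weights: compose a distance from a scaled Euclidean part and a simple categorical mismatch part (a weighted sum here for simplicity; the harmonic mean mentioned above would work similarly), then hand the precomputed matrix to DBSCAN:

import numpy as np
from sklearn.cluster import DBSCAN

# stand-in mixed data: two numeric columns and one categorical column
numeric = np.array([[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [5.2, 6.1]])
category = np.array(["glass", "glass", "steel", "steel"])

n = len(numeric)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        d_num = np.linalg.norm(numeric[i] - numeric[j]) / 10.0   # scaling factor is a guess
        d_cat = 0.0 if category[i] == category[j] else 1.0       # simple mismatch distance
        dist[i, j] = 0.5 * d_num + 0.5 * d_cat                   # weighted combination

labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)   # cluster label per point, -1 would mean noise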
Interpretation is a separate thing. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. But of course you could next train a decision tree on the clustering output and try to interpret the decision tree learned from the clustering, because the one really nice feature of decision trees is that they are somewhat human-understandable. But just like a Support Vector Machine will not give you an explanation, most (if not all) clustering algorithms will not do that either, sorry, unless you do this kind of post-processing. Plus, it will actually work with any clustering algorithm, which is a nice property if you want to compare multiple algorithms.
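A minimal sketch of that post-processing step with scikit-learn (the data, the clustering algorithm, and the tree depth here are arbitrary placeholders):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.random.rand(200, 3)                                     # stand-in feature matrix
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# train a shallow tree to mimic the clustering, then read off its rules
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=["feature1", "feature2", "feature3"]))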
There was a related publication last year. It is a bit obscure and experimental (on a workshop at ECML-PKDD), and requires the data set to have a quite extensive ground truth in form of rankings. In the example, they used color similarity rankings and some labels. The key idea is to analyze the cluster and find the best explanation using the given ground truth(s). They were trying to use it to e.g. say "this cluster found is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well, you need to investigate it closer - maybe the algorithm discovered something new here". But it was very experimental (Workshops are for work-in-progress type of research). You might be able to use this, by just using your features as ground truth. It should then detect if a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not forcibly create such an explanation!
H.-P. Kriegel, E. Schubert, A. Zimek: Evaluation of Multiple Clustering Solutions. In: 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011
A common approach to solving this type of clustering problem is to define a statistical model that captures the relevant characteristics of your data. Cluster assignments can be derived by using a mixture model (as in the Gaussian Mixture Model) and then finding the mixture component with the highest probability for a particular data point.
In your case, each example is a vector that has both real and categorical components. A simple approach is to model each component of the vector separately.
I generated a small example dataset where each example is a vector of two dimensions. The first dimension is a normally distributed variable and the second is a choice of five categories (see graph):
There are a number of frameworks available for running Monte Carlo inference on statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGS and is more efficient for many problems:
data {
  int<lower=0> N;    // number of data points
  int<lower=0> C;    // number of categories
  real x[N];         // normally distributed component data
  int y[N];          // categorical component data
}
parameters {
  real<lower=0,upper=1> theta;   // mixture probability
  real mu[2];                    // means for the normal component
  simplex[C] phi[2];             // categorical distributions for the categorical component
}
transformed parameters {
  real log_theta;
  real log_one_minus_theta;
  vector[C] log_phi[2];
  vector[C] alpha;

  log_theta <- log(theta);
  log_one_minus_theta <- log(1.0 - theta);
  for (c in 1:C)
    alpha[c] <- .5;
  for (k in 1:2)
    for (c in 1:C)
      log_phi[k,c] <- log(phi[k,c]);
}
model {
  theta ~ uniform(0,1);   // equivalently, ~ beta(1,1);
  for (k in 1:2) {
    mu[k] ~ normal(0,10);
    phi[k] ~ dirichlet(alpha);
  }
  for (n in 1:N) {
    lp__ <- lp__ + log_sum_exp(log_theta + normal_log(x[n],mu[1],1) + log_phi[1,y[n]],
                               log_one_minus_theta + normal_log(x[n],mu[2],1) + log_phi[2,y[n]]);
  }
}
I compiled and ran the Stan model and used the parameters from the final sample to compute the probability of each datapoint under each mixture component. I then assigned each datapoint to the mixture component (cluster) with higher probability to recover the cluster assignments below:
Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
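That assignment step looks roughly like the following in NumPy/SciPy (the parameter values below are made up; in practice they come from the fitted Stan sample, and y is the 1-based category index as in the model):

import numpy as np
from scipy.stats import norm

theta = 0.4                                    # fitted mixture probability
mu = np.array([-1.0, 2.0])                     # fitted normal means
phi = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],     # fitted categorical probs, component 1
                [0.1, 0.1, 0.2, 0.3, 0.3]])    # fitted categorical probs, component 2

def assign(x, y):
    # log-probability of (x, y) under each component, mirroring the Stan model
    lp1 = np.log(theta) + norm.logpdf(x, mu[0], 1) + np.log(phi[0, y - 1])
    lp2 = np.log(1 - theta) + norm.logpdf(x, mu[1], 1) + np.log(phi[1, y - 1])
    return 1 if lp1 >= lp2 else 2

print(assign(x=-0.8, y=1))   # cluster assignment for one example data point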
For heterogeneous, non-Euclidean data vectors as you describe, hierarchical clustering algorithms often work best. The conditional probability condition you describe can be incorporated as an ordering of attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.

Data filtering or better LINQ query?

I am using the new WPF toolkit's Chart to plot large data sets. I also have a crosshair tracker that follows the mouse when it's over the chart area to tell exactly what is the value of the nearest data point (see Yahoo! Finance charts).
I use the following code to find the closest data point that is lower (or equal) to where the mouse is currently hovering (the nasty detail about the chart is that it actually interpolates the data to tell you the EXACT value where you hover your mouse, even though the mouse is located between the data points):
TimeDataPoint point = mainSeries.Find(
    new Predicate<TimeDataPoint>(
        delegate(TimeDataPoint p) {
            return xValue > p.Date && !mainSeries.Exists(new Predicate<TimeDataPoint>(
                delegate(TimeDataPoint middlePoint) {
                    return middlePoint.Date > p.Date && xValue > middlePoint.Date;
                }));
        }));
[Here, mainSeries is simply a List<TimeDataPoint>]
This works very well for relatively small data sets, but once I go up to 12000+ points (and this number will increase rapidly), the code above slows down to a standstill (it runs through the data on the order of (12000+)^2 times).
I am not very good at constructing queries so I am wondering if it is possible to use a better LINQ query to do this.
EDIT: Another idea, inspired by @Randolpho's comment, is this: I will search for all points that are lower than the given value (at most n of them, here 12,000+) and then select the Max<> of those (also at most O(n)). This should produce the same result but with on the order of n operations, and thus should be at least a little bit faster...
My other alternative is to actually filter the data set and maintain an upper bound on the number of points depending on the level of details the user wants to see. I would rather not go down that road if there's a possibility of having a more efficient query.
Pre-compute the closest data points based on the known resolution of display/chart. Then, when you hover over a point, it's a simple lookup of the x/y coordinates against the known pre-computed value.
For performance reasons, do your pre-computation in a separate thread and do not allow the display of those values until the computation is completed. Re-compute every time the size of the chart is changed.
Bottom line: There is no LINQ query that will help you execute every time you do a mouse-over for large data sets. It just can't be done. You're looking at order N^2 no matter what. So pre-compute it and cache it, so you only do your computations once.
This is an intriguing idea but wouldn't I still need to do a look-up of x/y among 12000+ pairs? Could you elaborate on how I should store the pre-computed x/y pairs for a fast look-up? For example, I have data points at (200,300) and (250, 300) and user's mouse is at (225, 300). – Alexandra
Well, I guess that would depend on the graph. Based on your code and your mention of Yahoo Finance Charts, I'm assuming your data only varies by horizontal position, i.e. for a given X value, you are computing the display data.
In that case, you can use a simple Dictionary<int, TimeDataPoint> as your cache. The Key is the transformed X coordinate (i.e. in the coordinate space of your display graph), and the Value is the pre-computed TimeDataPoint. The dictionary would have a record for every X coordinate in your display graph, so a 400-pixel-wide graph has 400 pre-computed data points.
If your data varies against both axes, you could instead use Dictionary<System.Windows.Point, TimeDataPoint>, in pretty much the same way, but this would increase the number of items in your Dictionary by an order of magnitude. A 400 by 300 graph would have 120000 entries in the dictionary, so the tradeoff is a higher memory footprint.
Pre-calculating your data is the tricky part; it'd have to be done differently from the way you're currently doing it. I'm going to assume here that xValue in your example is an interpolation of a Date based on the X value, since it's compared to p.Date.
This might work:
private Dictionary<int, TimeDataPoint> BuildCache(List<TimeDataPoint> mainSeries)
{
    int xPrevious = 0;
    int xCurrent = 0;
    Dictionary<int, TimeDataPoint> cache = new Dictionary<int, TimeDataPoint>();
    foreach(var p in mainSeries)
    {
        xCurrent = XFromDate(p.Date);
        for(int val = xPrevious; val < xCurrent; val++)
        {
            cache.Add(val, p);
        }
        xPrevious = xCurrent;
    }
    return cache;
}
XFromDate would extract the X coordinate for a particular date. I'll leave doing that up to you. :)
