Best way to test a clustering algorithm

What is the best way to test a clustering algorithm? I am using an agglomerative clustering algorithm with a stop criterion. How do I test if the clusters are formed correctly or not?

A good rule of thumb for evaluating how much a graph can be clustered (on a coarse-grained level) is the "eigenvalue gap". Given the weighted adjacency matrix A of a graph, calculate its eigenvalues and sort them (this is the eigenvalue spectrum). If the plotted spectrum shows a large jump at some point, the position of that jump indicates a natural number of blocks into which to partition the graph.
Below is an example (in Python with NumPy) showing that an almost block-diagonal matrix has a large gap in its eigenvalue spectrum at the number of blocks (parameterized by c in the code). Note that permuting the matrix (equivalent to relabeling your graph nodes) still gives the same spectral gap:
import numpy as np
import matplotlib.pyplot as plt

# Make a block diagonal matrix with c blocks of size N x N
N = 30
c = 5
A = np.zeros((N*c, N*c))
for m in range(c):
    A[m*N:(m+1)*N, m*N:(m+1)*N] = np.random.random((N, N))
# Add some noise
A += np.random.random(A.shape) * 0.1
# Make symmetric
A += A.T - np.diag(A.diagonal())
# Show the original matrix
plt.subplot(131)
plt.imshow(A.copy(), interpolation='nearest')
# Permute the matrix for effect
idx = np.random.permutation(N*c)
A = A[idx, :][:, idx]
# Compute eigenvalues (eigvalsh is for symmetric matrices)
L = np.linalg.eigvalsh(A)
# Show the permuted matrix and the sorted spectrum
plt.subplot(132)
plt.imshow(A, interpolation='nearest')
plt.subplot(133)
plt.plot(sorted(L, reverse=True))
# Dashed line marks the expected gap after the c-th eigenvalue
plt.plot([c-.5, c-.5], [0, max(L)], 'r--')
plt.ylim(0, max(L))
plt.xlim(0, 20)
plt.show()

It depends on what you want to test against.
When testing your own implementation of a known algorithm, you might want to compare the results with that of a known good implementation.
Hierarchical clustering is hard to test with respect to quality, as it is hierarchical. The common measures such as Rand index etc. are only valid for strict partitionings. You can get a strict partitioning from a hierarchical clustering, but then you need to fix the height to cut at.
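Once a cut height has been fixed and the hierarchy has been turned into flat labels, the Rand index mentioned above can be computed by simple pair counting. The following is a minimal sketch in plain Python; the function name rand_index and the example label lists are illustrative assumptions, not taken from any library.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two flat partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # a pair agrees if both partitions join it or both split it
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

reference = [0, 0, 0, 1, 1, 1]               # hypothetical known grouping
print(rand_index(reference, [5, 5, 5, 9, 9, 9]))  # same grouping, renamed ids
print(rand_index(reference, [0, 0, 1, 1, 1, 1]))  # one point moved
```

Because only pairs are compared, the score does not depend on how the cluster ids are named, which is what makes it usable for comparing against a known good implementation.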

Ideally you have some kind of pre-labeled data (as in supervised learning) and test the results of your clustering algorithm on that. Count the number of points assigned to the correct group and divide by the total number of points to get an accuracy score; since cluster ids are arbitrary, first match each cluster to the reference label it overlaps most.
If you are doing unsupervised learning with no reference labels, then there is no ground truth to score against, and there is really no way to evaluate your algorithm in this sense.

It is sometimes useful to construct input data where there is a known, and perhaps obvious, answer by construction. For a clustering algorithm, you might construct data with N clusters such that the maximum distance between any two points in the same cluster is smaller than the minimum distance between any two points in different clusters. Another option would be to generate a number of different data sets plottable as 2-D scatter diagrams with clusters obvious to the eye, then compare the result from your algorithm with this structure, perhaps moving the clusters together to see when the algorithm fails to separate them.
You might be able to do better given knowledge of your particular clustering algorithm, but the above might at least have some chance of flushing obvious bugs from cover.
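A minimal sketch of such a constructed test follows. Everything here is an assumption made for illustration: make_separated_clusters builds the data, threshold_clusters is only a stand-in for the algorithm under test (it links any two points closer than a threshold), and same_partition compares groupings while ignoring how cluster ids are named.

```python
import math
import random

def make_separated_clusters(n_clusters, per_cluster, radius=1.0, spacing=10.0, seed=0):
    """2-D points in small squares whose centers lie `spacing` apart on a line.

    With radius=1 and spacing=10 every intra-cluster distance is below 3
    and every inter-cluster distance is at least 8.
    """
    rng = random.Random(seed)
    points, truth = [], []
    for c in range(n_clusters):
        for _ in range(per_cluster):
            points.append((c * spacing + rng.uniform(-radius, radius),
                           rng.uniform(-radius, radius)))
            truth.append(c)
    return points, truth

def threshold_clusters(points, threshold):
    """Stand-in for the algorithm under test: link points closer than threshold."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

def same_partition(labels_a, labels_b):
    """True if both labelings group the points identically."""
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True

points, truth = make_separated_clusters(4, 20)
print(same_partition(threshold_clusters(points, 5.0), truth))
```

Shrinking spacing step by step while keeping the threshold fixed is one way to find the point at which the algorithm stops separating the clusters.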

Related

maxmin clustering algorithm

I read a paper that mentions the max-min clustering algorithm, but I don't quite understand what this algorithm does. Googling "max min clustering algorithm" doesn't yield any helpful results. Does anybody know what this algorithm means? This is an excerpt from the paper:
Max-min clustering proceeds by choosing an observation at random as the first centroid c1, and by setting the set C of centroids to {c1}. During the ith iteration, ci is chosen such that it maximizes the minimum Euclidean distance between ci and observations in C. Max-min clustering is preferable to a density-based clustering algorithm (e.g. k-means) which would tend to select many examples from the dense group of non-seizure data points.
I don't quite understand the bolded part.
link to paper is here
We choose each new centroid to be as far as possible from the existing centroids. Here's some Python code.
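import math

def distance(a, b):
    # Euclidean distance between two observations given as tuples
    return math.dist(a, b)

def maxminclustering(observations, k):
    observations = set(observations)
    if k < 1 or not observations:
        return set()
    # start from an arbitrary observation
    centroids = set([observations.pop()])
    for i in range(min(k - 1, len(observations))):
        # pick the observation whose nearest centroid is farthest away
        newcentroid = max(observations,
                          key=lambda observation:
                              min(distance(observation, centroid)
                                  for centroid in centroids))
        observations.remove(newcentroid)
        centroids.add(newcentroid)
    return centroids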
This sounds a lot like the farthest-points heuristic for seeding k-means, but then not performing any k-means iterations at all.
This is a surprisingly simple, but quite effective strategy. Basically it will find a number of data points that are well spread out, which can make k-means converge fast. Usually, one would discard the first (random) data point.
It only works well for low values of k though (it avoids placing centroids in the center of the data set!), and it is not very favorable to multiple runs - it tends to choose the same initial centroids again.
K-means++ can be seen as a more randomized version of this. Instead of always choosing the farthest object, it chooses far objects with increased likelihood, but may at random also choose a near neighbor. This way, you get more diverse results when running it multiple times.
You can try it out in ELKI, it is named FarthestPointsInitialMeans. If you choose the algorithm SingleAssignmentKMeans, then it will not perform k-means iterations, but only do the initial assignment. That will probably give you this "MaxMin clustering" algorithm.
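To make the contrast with the farthest-point rule concrete, here is a minimal sketch of k-means++ style seeding in plain Python. The function name kmeanspp_seeds is an illustrative assumption and this is not the ELKI implementation; it only shows the idea of drawing each new seed with probability proportional to its squared distance from the nearest seed chosen so far.

```python
import math
import random

def kmeanspp_seeds(points, k, rng=None):
    """Pick up to k seeds; far points are likely, near points still possible."""
    rng = rng or random.Random()
    points = list(points)
    if k < 1 or not points:
        return []
    seeds = [rng.choice(points)]          # first seed uniformly at random
    while len(seeds) < min(k, len(points)):
        # squared distance from every point to its nearest seed so far
        d2 = [min(math.dist(p, s) ** 2 for s in seeds) for p in points]
        if sum(d2) == 0:                  # every point coincides with a seed
            break
        # already chosen seeds have weight 0 and cannot be drawn again
        seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds

data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(kmeanspp_seeds(data, 3, random.Random(1)))
```

Replacing the weighted draw with max(points, key=...) over the same d2 values would give back the deterministic farthest-point choice described above.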

Should k-means input contain unique values or all values (repeated as well)?

I am clustering my one-dimensional data with a k-means implementation. Although there are methods like Jenks natural breaks and Fisher's natural breaks for one-dimensional data, I still chose to go with k-means.
My question is what difference does it make if I only cluster unique values in the list of data points I have OR if I use all data points (repetition).
What is advisable?
This can certainly make a difference: the mean of [-1 -1 1] is -.33, while the mean of [-1 1] is 0. What you should do depends on the data and what you want to do with the result of clustering. As a default, though, I'd say keep them: removing points changes the local densities that k-means is designed to pick as cluster centers, and also why would you remove duplicates, but not near-duplicates?
k-means is an optimization method which minimizes the distortion of an assignment of your data points into clusters. The distortion is the sum of the within-cluster sums of squares. Or, if L is a set of labels and P the set of points, if "has" indicates that a point has a particular label, and d is the distance between points, then
distortion = sum [ d(p1, p2)^2 | p1 <- P
, p2 <- P
, l <- L
, p1 has l and p2 has l
]
We can study the result of a successful k-means optimization by talking about this distortion. For instance, given any two points on top of one another we have the distance between them d(p1, p2) = 0 and so if they're in the same cluster then they are increasing the distortion by nothing at all. So, somewhat obviously, a good clustering will always have all point duplicates in the same cluster.
Now consider a set of 3 points like this
   A          ?          B
---p----------q----------r---
In other words, three equidistant points, the two on the outside with different labels and the one on the inside with an unknown label. The distances (measured in dashes) are d(p,q) = 10 = d(q,r), so if we label q as A we increase our distortion by 100, and the same if we label it B.
If we change this situation slightly by replicating the point p (call the copy s), then we have not increased the distortion at all (since d(p,s) = 0). But if we now label q as A, we increase the distortion by d(p,q)^2 + d(s,q)^2 = 100 + 100 = 200, while if we label q as B the distortion increases only by d(q,r)^2 = 100.
   A          ?          B
---p----------q----------r---
   s
So this replication has repulsed q away from label A.
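The numbers in this example can be checked with a few lines of Python. This is a sketch under stated assumptions: the points are placed at p = s = 0, q = 10, r = 20 on a line, and pairwise_distortion counts each unordered same-label pair once, which is what reproduces the 100 and 200 quoted above. For comparison, centroid_distortion computes the sum of squared distances to the cluster mean, the form in which the k-means objective is usually stated; it gives different numbers but the same preference.

```python
from itertools import combinations

def pairwise_distortion(points, labels):
    """Sum of squared distances over unordered pairs sharing a label."""
    total = 0.0
    for (x1, l1), (x2, l2) in combinations(zip(points, labels), 2):
        if l1 == l2:
            total += (x1 - x2) ** 2
    return total

def centroid_distortion(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for label in set(labels):
        members = [x for x, l in zip(points, labels) if l == label]
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return total

# p, q, r without the replicated point: both choices for q cost the same
print(pairwise_distortion([0, 10, 20], ["A", "A", "B"]))
print(pairwise_distortion([0, 10, 20], ["A", "B", "B"]))

# p, s, q, r with p replicated as s: labeling q as A is now more expensive
print(pairwise_distortion([0, 0, 10, 20], ["A", "A", "A", "B"]))
print(pairwise_distortion([0, 0, 10, 20], ["A", "A", "B", "B"]))
print(centroid_distortion([0, 0, 10, 20], ["A", "A", "A", "B"]))
print(centroid_distortion([0, 0, 10, 20], ["A", "A", "B", "B"]))
```

Under either measure the replicated point makes label B the cheaper choice for q.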
Now if you play around with k-means for a bit, you might be surprised by the analysis above. It'll turn out to be the case that adding a whole lot of replication of a single point won't really produce the linearly scaling impact it seems like it ought to.
This is because actual optimization of that metric is known to be NP-hard in almost any circumstance. If you truly want to optimize it and have n points with K labels then your best bet is to check all K^n labelings. Thus, most k-means algorithms are approximate and thus you suffer some search error between the true optimum and the result of your algorithm.
For k-means, this will happen especially when there are lots of replicated points, as these "replicated pools" still grab points according to their distance from the centroid, not according to their effect on the global minimum.
Finally, when talking about replication in machine learning algorithms it's worth noting that most machine learning algorithms are based on assumptions about data which actively preclude the idea of replicated data points. This is known broadly as "general position" and many proofs begin by assuming your data is in "general position".
The idea is that if your points are truly distributed in R^n then there's 0 probability that two points will be identical under any of the probability distributions which are "nice" enough to build algorithms atop.
What this generally means is that if you have data with a lot of replicated points, you should consider the impact of a small "smoothing" step prior to analysis. If perturbing all of your points by a small normally distributed jump does not affect the meaning of the data... then you're probably quite OK running normal ML algorithms that anticipate the data living in R^n. If not, then you should consider algorithms which better respect the structure of your data—perhaps it's better to see your data as a tree and run an algorithm for ML atop structured data.

Unsupervised clustering with unknown number of clusters

I have a large set of vectors in 3 dimensions. I need to cluster these based on Euclidean distance such that all the vectors in any particular cluster have a Euclidean distance between each other less than a threshold "T".
I do not know how many clusters exist. At the end, there may be individual vectors existing that are not part of any cluster because its euclidean distance is not less than "T" with any of the vectors in the space.
What existing algorithms / approach should be used here?
You can use hierarchical clustering. It is a rather basic approach, so there are lots of implementations available. It is for example included in Python's scipy.
See for example the following script:
import matplotlib.pyplot as plt
import numpy
import scipy.cluster.hierarchy as hcluster
# generate 3 clusters of each around 100 points and one orphan point
N=100
data = numpy.random.randn(3*N,2)
data[:N] += 5
data[-N:] += 10
data[-1:] -= 20
# clustering
thresh = 1.5
clusters = hcluster.fclusterdata(data, thresh, criterion="distance")
# plotting
plt.scatter(*numpy.transpose(data), c=clusters)
plt.axis("equal")
title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
plt.title(title)
plt.show()
This produces a scatter plot in which each point is colored by the cluster it was assigned to.
The threshold given as a parameter is a distance value on which basis the decision is made whether points/clusters will be merged into another cluster. The distance metric being used can also be specified.
Note that there are various methods for computing the intra-/inter-cluster similarity, e.g. distance between the closest points, distance between the furthest points, distance to the cluster centers and so on. Some of these methods are also supported by SciPy's hierarchical clustering module (single/complete/average... linkage). According to your post I think you would want to use complete linkage.
Note that this approach also allows small (single point) clusters if they don't meet the similarity criterion of the other clusters, i.e. the distance threshold.
There are other algorithms that will perform better, which will become relevant in situations with lots of data points. As other answers/comments suggest you might also want to have a look at the DBSCAN algorithm:
https://en.wikipedia.org/wiki/DBSCAN
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
For a nice overview on these and other clustering algorithms, also have a look at this demo page (of Python's scikit-learn library):
http://scikit-learn.org/stable/modules/clustering.html
As the comparison figure on that page shows, each algorithm makes some assumptions about the number and shape of the clusters that need to be taken into account, be it implicit assumptions imposed by the algorithm or explicit assumptions specified by parameterization.
The answer by moooeeeep recommended using hierarchical clustering. I wanted to elaborate on how to choose the threshold of the clustering.
One way is to compute clusterings based on different thresholds t1, t2, t3,... and then compute a metric for the "quality" of each clustering. The premise is that the clustering with the optimal number of clusters will have the maximum value of the quality metric.
An example of a good quality metric I've used in the past is Calinski-Harabasz. Briefly: you compute the between-cluster dispersion (how far the cluster centers are from the overall center) and divide it by the within-cluster dispersion (how far the points are from their own cluster center), each normalized by its degrees of freedom. The optimal clustering assignment will have clusters that are separated from each other the most, and clusters that are "tightest".
By the way, you don't have to use hierarchical clustering. You can also use something like k-means, precompute it for each k, and then pick the k that has the highest Calinski-Harabasz score.
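A minimal sketch of the Calinski-Harabasz score in plain Python follows. It uses the standard definition, (B / (k - 1)) / (W / (n - k)), where B is the between-cluster dispersion, W the within-cluster dispersion, k the number of clusters and n the number of points. The function name and the tiny example data are illustrative assumptions.

```python
def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    n = len(points)
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of points")
    dim = len(points[0])

    def mean(ps):
        return [sum(p[d] for p in ps) / len(ps) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = within = 0.0
    for members in clusters.values():
        center = mean(members)
        between += len(members) * sqdist(center, overall)
        within += sum(sqdist(p, center) for p in members)
    if within == 0:
        return float("inf")  # every point sits exactly on its cluster center
    return (between / (k - 1)) / (within / (n - k))

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(calinski_harabasz(points, [0, 0, 1, 1]))  # natural grouping
print(calinski_harabasz(points, [0, 1, 0, 1]))  # poor grouping
```

To pick a threshold or a k, one would compute the labels for each candidate value and keep the candidate with the highest score.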
Let me know if you need more references, and I'll scour my hard disk for some papers.
Check out the DBSCAN algorithm. It clusters based on the local density of vectors, i.e. neighboring vectors must not be more than some ε distance apart, and it can determine the number of clusters automatically. It also considers outliers, i.e. points with an insufficient number of ε-neighbors, to not be part of any cluster. The Wikipedia page links to a few implementations.
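For intuition, the core of DBSCAN fits in a few dozen lines. The following is a minimal, quadratic-time sketch in plain Python, not one of the implementations linked from Wikipedia; the names dbscan, eps and min_pts are assumptions chosen for illustration, and a point counts itself among its neighbors.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point; NOISE (-1) marks points in no cluster."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # not dense enough to start a cluster
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:        # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:     # already handled
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

The number of clusters is never passed in; it falls out of eps and min_pts, and the isolated point at (50, 50) is reported as noise rather than forced into a cluster.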
Use OPTICS, which works well with large data sets.
OPTICS (Ordering Points To Identify the Clustering Structure) is closely related to DBSCAN: it finds core samples of high density and expands clusters from them. Unlike DBSCAN, it keeps the cluster hierarchy for a variable neighborhood radius. It is better suited for usage on large datasets than the current sklearn implementation of DBSCAN.
from sklearn.cluster import OPTICS
# X is your (n_samples, n_features) data array
db = OPTICS(eps=3, min_samples=30, cluster_method="dbscan").fit(X)
Fine-tune eps and min_samples as per your requirements. Note that OPTICS only uses eps when cluster_method="dbscan"; with the default cluster_method="xi" it is ignored.
I want to add to moooeeeep's answer on using hierarchical clustering.
That solution works for me, though picking the threshold value feels quite "random".
By referring to other sources and testing it myself, I found a better method: the threshold can easily be picked from a dendrogram:
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
ori_array = ["Your_list_here"]  # placeholder: replace with your (n_samples, n_features) data
# Ward linkage computed from the pairwise Euclidean distances
ward_array = hierarchy.ward(pdist(ori_array))
dendrogram = hierarchy.dendrogram(ward_array)
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
You will see a dendrogram plot.
Then draw a horizontal line, let's say at distance = 1; the number of vertical lines it crosses is your desired number of clusters. So here I choose threshold = 1 for 4 clusters.
threshold = 1
clusters_list = hierarchy.fcluster(ward_array, threshold, criterion="distance")
print("Clustering list: {}".format(clusters_list))
Now each value in clusters_list will be the assigned cluster id of the corresponding point in ori_array.
There may be no solution: this is the case when the distance between any two distinct input data points is always greater than T. If you want to compute the number of clusters only from the input data, you may look at MCG, a hierarchical clustering method with an automatic stop criterion: see the free seminar paper at https://hal.archives-ouvertes.fr/hal-02124947/document (contains bibliographic references).
I needed a way to "fuzzy sort" lines from OCR output, when the output is sometimes out of order, but within blocks, the lines are usually in order. In this case, the items to sort are dictionaries, which describe words at a location 'x','y' and with size 'w','h'. The general clustering algorithms seemed like overkill and I needed to maintain the order of the items during the sort. Here, I can set the tolerance tol to be about 1/4 the line spacing, and this is called with the field being 'y'.
import statistics

def fuzzy_lod_sort(lod, field, tol):
    # fuzzy sort lod (a list of dicts) into bins within +/- tol,
    # maintaining the original order inside each bin.
    # first determine the bins.
    val_list = [d[field] for d in lod]
    vals_sorted = sorted(val_list)
    bins_lol = []
    i = 0
    for j, v in enumerate(vals_sorted):
        if not j:
            bins_lol.append([v])
            continue
        cur_bin_avg = statistics.mean(bins_lol[i])
        if abs(cur_bin_avg - v) <= tol:
            bins_lol[i].append(v)
            continue
        i += 1
        bins_lol.append([v])
    # now sort into the bins, maintaining the original order.
    # the bins will be the center of the range of 'y'.
    bins = [statistics.mean(binlist) for binlist in bins_lol]
    # initialize the list of bins
    lolod = [[] for _ in bins]
    for d in lod:
        bin_idx = closest_bin_idx(bins, d[field])
        lolod[bin_idx].append(d)
    # now join the bins.
    result_lod = []
    for bin_lod in lolod:
        result_lod.extend(bin_lod)
    return result_lod

def closest_bin(bins, val):
    return min(bins, key=lambda b: abs(b - val))

def closest_bin_idx(bins, val):
    return bins.index(closest_bin(bins, val))
The trouble is that the 'y' coordinates of the OCR output are based on the outline around each word, and a later word in the same line might have a 'y' coordinate that is lower than that of an earlier word. So a full sort by 'y' does not work. This is much like a clustering algorithm, but the intention is a bit different. I am not interested in the statistics of the data points; I am interested in exactly which cluster each one is placed in, and it is also important to maintain the original order.
Maybe there is some way to fuzzy sort using the sorting built-ins, and it might be an alternative to the clustering options for 1-D problems.

Random projection algorithm pseudo code

I am trying to apply the Random Projections method to a very sparse dataset. I found papers and tutorials about the Johnson-Lindenstrauss method, but every one of them is full of equations which don't give me a meaningful explanation. For example, this document on Johnson-Lindenstrauss
Unfortunately, from this document, I can get no idea about the implementation steps of the algorithm. It's a long shot, but is there anyone who can tell me the plain English version or very simple pseudocode of the algorithm? Or where can I start to dig into these equations? Any suggestions?
For example, what I understand from the algorithm by reading this paper concerning Johnson-Lindenstrauss is that:
Assume we have a AxB matrix where A is number of samples and B is the number of dimensions, e.g. 100x5000. And I want to reduce the dimension of it to 500, which will produce a 100x500 matrix.
As far as I understand: first, I need to construct a 100x500 matrix and fill the entries randomly with +1 and -1 (with a 50% probability).
Edit:
Okay, I think I started to get it. So we have a matrix A which is mxn. We want to reduce it to E which is mxk.
What we need to do is construct a matrix R of dimension nxk and fill it with 0, -1 or +1, with probabilities 2/3, 1/6 and 1/6 respectively.
After constructing R, we simply do the matrix multiplication AxR to find our reduced matrix E. But we don't need to do a full matrix multiplication, because if an element of R is 0, we don't need to do the calculation; simply skip it. If the element is 1, we just add the corresponding value, and if it's -1, we subtract it. So we use summation rather than multiplication to find E, and that is what makes this method very fast.
It turned out a very neat algorithm, although I feel too stupid to get the idea.
You have the idea right. However, as I understand random projection, the projection vectors (the columns of your nxk matrix R) should have unit length. I believe that's approximately what the normalizing by 1/sqrt(k) is for, to normalize away the fact that they're not unit vectors.
It isn't a projection, but it's nearly a projection; R's columns aren't orthonormal, but within a much higher-dimensional space, they quite nearly are. In fact the dot product of any two of those vectors you choose will be pretty close to 0. This is why it is generally a good approximation of actually finding a proper basis for projection.
The mapping from high-dimensional data A to low-dimensional data E is given in the statement of theorem 1.1 in the latter paper - it is simply a scalar multiplication followed by a matrix multiplication. The data vectors are the rows of the matrices A and E. As the author points out in section 7.1, you don't need to use a full matrix multiplication algorithm.
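The scalar multiplication followed by a matrix multiplication can be sketched in plain Python as below. The assumptions are stated loudly: entries of R are drawn as +1, 0, -1 with probabilities 1/6, 2/3, 1/6 as in the question, the scalar is taken to be sqrt(3/k) (the sqrt(3) entry scaling combined with the 1/sqrt(k) normalization), and the function names are hypothetical. The inner loop only adds or subtracts, and skips zeros in both A and R.

```python
import math
import random

def random_sign_matrix(n, k, rng):
    """n x k matrix with entries +1 (p=1/6), 0 (p=2/3), -1 (p=1/6)."""
    return [[rng.choices((1, 0, -1), weights=(1, 4, 1))[0] for _ in range(k)]
            for _ in range(n)]

def project(A, R):
    """E = sqrt(3/k) * A R, computed with additions and subtractions only."""
    k = len(R[0])
    scale = math.sqrt(3.0 / k)            # assumed normalization, see lead-in
    E = []
    for row in A:
        out = [0.0] * k
        for i, value in enumerate(row):
            if value == 0:                # sparse input: nothing to accumulate
                continue
            for j, r in enumerate(R[i]):
                if r == 1:
                    out[j] += value
                elif r == -1:
                    out[j] -= value
        E.append([scale * x for x in out])
    return E

rng = random.Random(0)
A = [[1.0, 0, 2.0, 0, 0, -1.0],           # m = 2 samples, n = 6 dimensions
     [0, 0, 3.0, 0, 0, 0]]
R = random_sign_matrix(6, 3, rng)         # reduce to k = 3 dimensions
print(project(A, R))
```

For a genuinely sparse A one would store only the non-zero entries of each row, so that the outer loop over i touches just those.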
If your dataset is sparse, then sparse random projections will not work well.
You have a few options here:
Option A:
Step 1. Apply a structured dense random projection (the so-called fast Hadamard transform is typically used). This is a special projection which is very fast to compute but otherwise has the properties of a normal dense random projection.
Step 2. apply sparse projection on the "densified data" (sparse random projections are useful for dense data only)
Option B:
Apply SVD on the sparse data. If the data is sparse but has some structure SVD is better. Random projection preserves the distances between all points. SVD preserves better the distances between dense regions - in practice this is more meaningful. Also people use random projections to compute the SVD on huge datasets. Random Projections gives you efficiency, but not necessarily the best quality of embedding in a low dimension.
If your data has no structure, then use random projections.
Option C:
For data points for which SVD has little error, use SVD; for the rest of the points use Random Projection
Option D:
Use a random projection based on the data points themselves.
It is easy to understand what is going on here. It looks something like this:
scores = an n by k matrix (n = number of data points, k = new dimension)
for j from 0 to k-1 do  # generate k random projection vectors
    randomized_combination = feature vector of zeros (number of zeros = number of features)
    sample_point_ids = select a sample of point ids
    for each point_id in sample_point_ids do:
        random_sign = +1/-1 with prob. 1/2
        randomized_combination += random_sign * feature_vector[point_id]  # this is a vector operation
    normalize the randomized combination
    # note that the normal random projection is:
    # randomized_combination = [+/-1, +/-1, ...] (one +/-1 per feature; if you want sparse, randomly set a fraction to 0; also good to normalize by length)
    # to project the data points on this random feature just do
    for each point_id in dataset:
        scores[point_id, j] = dot_product(feature_vector[point_id], randomized_combination)
If you are still looking to solve this problem, write a message here, I can give you more pseudocode.
The way to think about it is that a random projection is just a random pattern, and the dot product (i.e. projecting the data point) between the data point and the pattern gives you the overlap between them. So if two data points overlap with many random patterns, those points are similar. Therefore, random projections preserve similarity while using less space, but they also add random fluctuations in the pairwise similarities. What JLT tells you is that to make the fluctuations 0.1 (eps) you need about 100*log(n) dimensions.
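The 100 in that estimate is 1/eps^2 for eps = 0.1. A version with explicit constants can be sketched as follows; the specific bound used here, k >= 4 ln(n) / (eps^2/2 - eps^3/3), is one commonly quoted form of the Johnson-Lindenstrauss lemma and is an assumption on my part, since the answer does not say which constant it has in mind. The function name jl_min_dim is hypothetical.

```python
import math

def jl_min_dim(n_points, eps):
    """Target dimension from k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    if not 0 < eps < 1:
        raise ValueError("eps must be in (0, 1)")
    if n_points < 1:
        raise ValueError("n_points must be positive")
    return math.ceil(4 * math.log(n_points) / (eps ** 2 / 2 - eps ** 3 / 3))

for n in (1_000, 1_000_000):
    print(n, jl_min_dim(n, 0.1))
```

The bound depends on the number of points but not on the original dimension, and it grows like 1/eps^2 as the tolerated distortion shrinks.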
Good Luck!
RandPro: an R package to perform random projection using the Johnson-Lindenstrauss lemma.

Nearest neighbors in high-dimensional data?

I have asked a question a few days back on how to find the nearest neighbors for a given vector. My vector is now 21 dimensions and before I proceed further, because I am not from the domain of Machine Learning nor Math, I am beginning to ask myself some fundamental questions:
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure this value out?
Previously, it was suggested that I use kd-trees, but the Wikipedia page clearly says that for high dimensions, a kd-tree is almost equivalent to a brute-force search. In that case, what is the best way to find nearest neighbors in a million-point dataset efficiently?
Can someone please clarify some (or all) of the above questions?
I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.
You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].
Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.
LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.
Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.
Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].
LSH has been applied seemingly everywhere. You may want to give it a try.
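As a rough illustration of the "hash, then look in the same bin" idea for Euclidean distance, here is a minimal single-table sketch in plain Python. It uses hash functions of the form floor((a·v + b) / w) with Gaussian a, in the spirit of the p-stable scheme of [1]; the class name, parameter names and default values are assumptions, and a practical index would use several tables to reduce misses.

```python
import math
import random
from collections import defaultdict

class EuclideanLSH:
    """Single hash table keyed by a tuple of quantized random projections."""

    def __init__(self, dim, n_hashes=4, bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        # one Gaussian direction and one random offset per hash function
        self.a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_hashes)]
        self.b = [rng.uniform(0, bucket_width) for _ in range(n_hashes)]
        self.table = defaultdict(list)

    def _key(self, v):
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in zip(self.a, self.b))

    def add(self, v):
        self.table[self._key(v)].append(v)

    def query(self, v):
        """Candidates from v's bucket, nearest first (may be empty)."""
        return sorted(self.table.get(self._key(v), []),
                      key=lambda u: math.dist(u, v))

index = EuclideanLSH(dim=3)
for point in [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (100.0, 100.0, 100.0)]:
    index.add(point)
print(index.query((5.0, 5.0, 5.0)))
```

A larger bucket_width or fewer hash functions makes buckets coarser (more candidates, fewer misses); the opposite makes queries faster but more approximate.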
[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.
[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.
[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.
[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors", 2008.
I. The Distance Metric
First, the number of features (columns) in a data set is not a factor in selecting a distance metric for use in kNN. There are quite a few published studies directed to precisely this question, and the usual bases for comparison are:
the underlying statistical distribution of your data;
the relationship among the features that comprise your data (are they independent--i.e., what does the covariance matrix look like); and
the coordinate space from which your data was obtained.
If you have no prior knowledge of the distribution(s) from which your data was sampled, at least one (well documented and thorough) study concludes that Euclidean distance is the best choice.
The Euclidean metric is used in mega-scale Web recommendation engines as well as in current academic research. Distances calculated by Euclidean have intuitive meaning and the computation scales, i.e., Euclidean distance is calculated the same way whether the two points are in two-dimensional or in twenty-two-dimensional space.
It has only failed for me a few times, and in each of those cases Euclidean distance failed because the underlying (Cartesian) coordinate system was a poor choice. You'll usually recognize this because, for instance, path lengths (distances) are no longer additive. For example, when the metric space is a chessboard, Manhattan distance is better than Euclidean; likewise, when the metric space is Earth and your distances are trans-continental flights, a distance metric suitable for a polar coordinate system is a good idea (e.g., London to Vienna is 2.5 hours, Vienna to St. Petersburg is another 3 hours, more or less in the same direction, yet London to St. Petersburg isn't 5.5 hours but a little over 3 hours).
But apart from those cases in which your data belongs in a non-Cartesian coordinate system, the choice of distance metric is usually not material. (See this blog post from a CS student, comparing several distance metrics by examining their effect on a kNN classifier: chi-square gives the best results, but the differences are not large. A more comprehensive study is in the academic paper Comparative Study of Distance Functions for Nearest Neighbors; Mahalanobis (essentially Euclidean normalized to account for dimension covariance) was the best in this study.)
One important proviso: for distance metric calculations to be meaningful, you must re-scale your data; rarely is it possible to build a kNN model that generates accurate predictions without doing this. For instance, if you are building a kNN model to predict athletic performance, and your predictor variables are height (cm), weight (kg), bodyfat (%), and resting pulse (beats per minute), then a typical data point might look something like this: [ 180.4, 66.1, 11.3, 71 ]. Clearly the distance calculation will be dominated by height, while the contribution of bodyfat % will be almost negligible. Put another way, if the data were reported differently, so that bodyweight was in grams rather than kilograms, then the original value of 66.1 would be 66,100, which would have a large effect on your results, which is exactly what you don't want. Probably the most common scaling technique is subtracting the mean and dividing by the standard deviation (mean and sd are calculated separately for each column, or feature, in the data set; X refers to an individual entry/cell within a data row):
X_new = (X_old - mu) / sigma
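Applied column by column, the formula looks like this in plain Python. This is a minimal sketch; the function name standardize is an assumption, the population standard deviation is used, and apart from the first row (taken from the example above) the data rows are made up for illustration.

```python
import statistics

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    # a constant column has sd 0; map it to 0.0 instead of dividing by zero
    return [[(x - m) / s if s else 0.0 for x, m, s in zip(row, means, sds)]
            for row in rows]

# height (cm), weight (kg), bodyfat (%), resting pulse (bpm)
data = [
    [180.4, 66.1, 11.3, 71],
    [170.0, 80.0, 20.0, 60],   # hypothetical
    [190.0, 90.0, 15.0, 65],   # hypothetical
]
for row in standardize(data):
    print([round(x, 2) for x in row])
```

After this step a change of units, such as kilograms to grams, no longer changes the distances, because each column is expressed in multiples of its own standard deviation.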
II. The Data Structure
If you are concerned about the performance of the kd-tree structure, a Voronoi tessellation is a conceptually simple container that will drastically improve performance and that scales better than kd-trees.
This is not the most common way to persist kNN training data, though the application of VT for this purpose, as well as the consequent performance advantages, are well-documented (see e.g. this Microsoft Research report). The practical significance of this is that, provided you are using a 'mainstream' language (e.g., in the TIOBE Index) then you ought to find a library to perform VT. I know in Python and R, there are multiple options for each language (e.g., the voronoi package for R available on CRAN)
Using a VT for kNN works like this::
From your data, randomly select w points--these are your Voronoi centers. A Voronoi cell encapsulates all neighboring points that are nearest to each center. Imagine if you assign a different color to each of Voronoi centers, so that each point assigned to a given center is painted that color. As long as you have a sufficient density, doing this will nicely show the boundaries of each Voronoi center (as the boundary that separates two colors.
How to select the Voronoi Centers? I use two orthogonal guidelines. After random selecting the w points, calculate the VT for your training data. Next check the number of data points assigned to each Voronoi center--these values should be about the same (given uniform point density across your data space). In two dimensions, this would cause a VT with tiles of the same size.That's the first rule, here's the second. Select w by iteration--run your kNN algorithm with w as a variable parameter, and measure performance (time required to return a prediction by querying the VT).
So imagine you have one million data points. If the points were persisted in an ordinary 2D data structure, you would perform hundreds of thousands of distance calculations for each new data point whose response variable you wish to predict. Of course, those calculations are performed on a single data set. With a VT, the nearest-neighbor search is performed in two steps, one after the other, against two different populations of data: first against the Voronoi centers, then, once the nearest center is found, against the points inside the cell corresponding to that center, which are searched to find the actual nearest neighbor (by successive distance calculations). Combined, these two look-ups are much faster than a single brute-force look-up. That's easy to see: for 1M data points, suppose you select 250 Voronoi centers to tessellate your data space. On average, each Voronoi cell will have 4,000 data points. So instead of performing on average 500,000 distance calculations (brute force), you perform far fewer, on average just 125 + 2,000.
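The two-step look-up can be sketched roughly as follows. This is an illustration under assumptions, not a library API: the names `build_cells` and `query_cells` are made up, the centers are picked uniformly at random as described above, and the search only looks inside the single nearest cell, so the result is approximate when the true nearest neighbor sits just across a cell boundary.

```python
import numpy as np

def build_cells(points, w, rng):
    """Pick w random centers and assign every point to its nearest center."""
    center_idx = rng.choice(len(points), size=w, replace=False)
    centers = points[center_idx]
    # Squared distances from every point to every center, shape (n_points, w)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    owner = d2.argmin(axis=1)
    # cells[c] holds the indices of the points that belong to center c
    cells = [np.flatnonzero(owner == c) for c in range(w)]
    return centers, cells

def query_cells(points, centers, cells, q):
    """Step 1: find the nearest center. Step 2: scan only that center's cell."""
    c = ((centers - q) ** 2).sum(axis=1).argmin()
    members = cells[c]
    best = ((points[members] - q) ** 2).sum(axis=1).argmin()
    return int(members[best])  # index into `points`

rng = np.random.default_rng(0)
points = rng.random((2000, 2))
centers, cells = build_cells(points, w=20, rng=rng)
nearest = query_cells(points, centers, cells, np.array([0.5, 0.5]))
```

Checking the lengths of the entries in `cells` is one way to apply the first guideline above (roughly equal occupancy per center).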
III. Calculating the Result (the predicted response variable)
There are two steps to calculating the predicted value from a set of kNN training data. The first is identifying n, or the number of nearest neighbors to use for this calculation. The second is how to weight their contribution to the predicted value.
W/r/t the first component, you can determine the best value of n by solving an optimization problem (very similar to least squares optimization). That's the theory; in practice, most people just use n=3. In any event, it's simple to run your kNN algorithm over a set of test instances (to calculate predicted values) for n=1, n=2, n=3, etc. and plot the error as a function of n. If you just want a plausible value for n to get started, again, just use n = 3.
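To make the "plot the error as a function of n" step concrete, here is a small sketch. It assumes an unweighted kNN regressor and synthetic data; the helper names `knn_predict` and `error_by_n` are hypothetical, and mean squared error is just one reasonable choice of error measure.

```python
import numpy as np

def knn_predict(train_X, train_y, q, n):
    """Unweighted mean of the responses of the n nearest training points."""
    d2 = ((train_X - q) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:n]
    return train_y[nearest].mean()

def error_by_n(train_X, train_y, test_X, test_y, n_values):
    """Mean squared prediction error on the test set for each candidate n."""
    errors = {}
    for n in n_values:
        preds = np.array([knn_predict(train_X, train_y, q, n) for q in test_X])
        errors[n] = float(((preds - test_y) ** 2).mean())
    return errors

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X = rng.random((300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] + rng.normal(0, 0.1, 300)

errors = error_by_n(X[:200], y[:200], X[200:], y[200:], range(1, 11))
best_n = min(errors, key=errors.get)  # the n with the lowest test error
```

Plotting `errors` against n (for example with matplotlib) gives the curve described above.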
The second component is how to weight the contribution of each of the neighbors (assuming n > 1).
The simplest weighting technique is just multiplying each neighbor by a weighting coefficient, 1/(dist * K): the inverse of the distance from that neighbor to the test instance, often multiplied by some empirically derived constant, K. I am not a fan of this technique because it often over-weights the closest neighbors (and concomitantly under-weights the more distant ones); the significance of this is that a given prediction can be almost entirely dependent on a single neighbor, which in turn increases the algorithm's sensitivity to noise.
A much better weighting function, which substantially avoids this limitation, is the Gaussian function, which in Python looks like this:
import math
def weight_gauss(dist, sig=2.0):
    # Gaussian kernel: weight is 1 at dist=0 and decays smoothly with distance
    return math.exp(-dist**2 / (2 * sig**2))
To calculate a predicted value using your kNN code, you would identify the n nearest neighbors to the data point whose response variable you wish to predict (the 'test instance'), then call the weight_gauss function once for each of the n neighbors, passing in the distance between that neighbor and the test point. This function will return the weight for each neighbor, which is then used as that neighbor's coefficient in the weighted average calculation.
What you are facing is known as the curse of dimensionality. It is sometimes useful to run an algorithm like PCA or ICA to make sure that you really need all 21 dimensions and possibly find a linear transformation which would allow you to use fewer than 21 with approximately the same result quality.
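The weighted average described above might look like the following sketch. The function `weighted_prediction` and the sample neighbors are assumptions made for illustration; `weight_gauss` is repeated so the block runs on its own.

```python
import math

def weight_gauss(dist, sig=2.0):
    return math.exp(-dist**2 / (2 * sig**2))

def weighted_prediction(neighbors, sig=2.0):
    """neighbors: list of (distance, response) pairs for the n nearest neighbors."""
    weights = [weight_gauss(d, sig) for d, _ in neighbors]
    total = sum(weights)
    # Weighted average: each response counts in proportion to its Gaussian weight
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / total

# Hypothetical neighbors: (distance to the test instance, response value)
neighbors = [(0.5, 10.0), (1.0, 12.0), (3.0, 20.0)]
pred = weighted_prediction(neighbors)
```

If all distances are very large relative to `sig`, the weights can underflow to zero, so `sig` should be chosen on the scale of typical neighbor distances.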
Update:
I encountered them in a book called Biomedical Signal Processing by Rangayyan (I hope I remember it correctly). ICA is not a trivial technique, but it was developed by researchers in Finland and I think Matlab code for it is publicly available for download. PCA is a more widely used technique and I believe you should be able to find an implementation in R or other software. PCA is usually computed from an eigendecomposition of the data's covariance matrix (or, equivalently, a singular value decomposition of the centered data). I did it too long ago to remember the details. = )
The idea is that you break up your signals into independent eigenvectors (discrete eigenfunctions, really) and their eigenvalues, 21 in your case. Each eigenvalue shows the amount of contribution each eigenfunction provides to each of your measurements. If an eigenvalue is tiny, you can very closely represent the signals without using its corresponding eigenfunction at all, and that's how you get rid of a dimension.
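Here is a rough sketch of that idea using NumPy's eigendecomposition of the covariance matrix. The function name `pca` and the synthetic 21-dimensional data (built from only 3 underlying factors) are assumptions for illustration.

```python
import numpy as np

def pca(X, keep):
    """Project X onto the `keep` eigenvectors with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                 # center each column
    cov = np.cov(Xc, rowvar=False)          # covariance between features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Xc @ eigvecs[:, :keep], eigvals

# Synthetic data: 21 measured dimensions driven by 3 hidden factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 21))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 21))

reduced, eigvals = pca(X, keep=3)
explained = eigvals[:3].sum() / eigvals.sum()  # share of variance kept
```

Inspecting `eigvals` shows which dimensions contribute almost nothing and can be dropped.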
Top answers are good but old, so I'd like to add a 2016 answer.
As said, in a high-dimensional space the curse of dimensionality lurks around the corner, making traditional approaches, such as the popular k-d tree, as slow as a brute-force search. As a result, we turn our interest to Approximate Nearest Neighbor Search (ANNS), which trades some accuracy for speed. You get a good approximation of the exact NN with good probability.
Hot topics that might be worth a look:
Modern approaches of LSH, such as Razenshteyn's.
RKD forest: Forest(s) of Randomized k-d trees (RKD), as described in FLANN,
or in a more recent approach I was part of, kd-GeRaF.
LOPQ, which stands for Locally Optimized Product Quantization, as described here. It is very similar to the newer approach by Babenko and Lempitsky.
You can also check my relevant answers:
Two sets of high dimensional points: Find the nearest neighbour in the other set
Comparison of the runtime of Nearest Neighbor queries on different data structures
PCL kd-tree implementation extremely slow
To answer your questions one by one:
No, Euclidean distance is a bad metric in high-dimensional space. In high dimensions, all pairwise distances between data points become large and similar to one another, so the relative difference between the distance from a given data point to its nearest neighbour and to its farthest neighbour shrinks.
There is a lot of research on high-dimensional data, but most of it requires a good deal of mathematical sophistication.
KD tree is bad for high dimensional data ... avoid it by all means
Here is a nice paper to get you started in the right direction: "When Is 'Nearest Neighbor' Meaningful?" by Beyer et al.
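The shrinking gap between nearest and farthest neighbour can be observed with a small experiment. This is only a sketch on uniformly random data, which is an assumption; real data with structure may behave better. The function name `relative_contrast` is made up here.

```python
import numpy as np

def relative_contrast(dim, n_points=1000, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from one random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    q = rng.random(dim)
    d = np.sqrt(((points - q) ** 2).sum(axis=1))
    return float((d.max() - d.min()) / d.min())

# The contrast drops sharply as the dimension grows
contrasts = {dim: relative_contrast(dim) for dim in (2, 20, 200, 2000)}
```

When the contrast is close to zero, "nearest" carries little information, because almost every point is about equally far away.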
I work with text data of dimensions 20K and above. If you want some text related advice, I might be able to help you out.
Cosine similarity is a common way to compare high-dimension vectors. Note that since it's a similarity not a distance, you'd want to maximize it not minimize it. You can also use a domain-specific way to compare the data, for example if your data was DNA sequences, you could use a sequence similarity that takes into account probabilities of mutations, etc.
The number of nearest neighbors to use varies depending on the type of data, how much noise there is, etc. There are no general rules, you just have to find what works best for your specific data and problem by trying all values within a range. People have an intuitive understanding that the more data there is, the fewer neighbors you need. In a hypothetical situation where you have all possible data, you only need to look for the single nearest neighbor to classify.
The k Nearest Neighbor method is known to be computationally expensive. It's one of the main reasons people turn to other algorithms like support vector machines.
kd-trees indeed won't work very well on high-dimensional data, because the pruning step no longer helps much: the distance to the closest edge (a one-dimensional deviation) will almost always be smaller than the full-dimensional distance to the known nearest neighbors.
But furthermore, kd-trees only work well with Lp norms for all I know, and there is the distance concentration effect that makes distance based algorithms degrade with increasing dimensionality.
For further information, you may want to read up on the curse of dimensionality, and the various variants of it (there is more than one side to it!)
I'm not convinced there is much use in just blindly approximating Euclidean nearest neighbors, e.g. using LSH or random projections. It may be necessary to use a much more fine-tuned distance function in the first place!
A lot depends on why you want to know the nearest neighbors. You might look into the mean shift algorithm http://en.wikipedia.org/wiki/Mean-shift if what you really want is to find the modes of your data set.
I think cosine similarity on tf-idf of boolean features would work well for most problems. That's because it's a time-proven heuristic used in many search engines, such as Lucene. In my experience, Euclidean distance gives bad results for any text-like data. Different weights and values of k can be selected using training data and brute-force parameter search.
iDistance is probably the best for exact kNN retrieval in high-dimensional data. You can view it as an approximate Voronoi tessellation.
I've experienced the same problem and can say the following.
Euclidean distance is a good distance metric; however, it is computationally more expensive than the Manhattan distance and sometimes yields slightly poorer results, so I'd choose the latter.
The value of k can be found empirically. You can try different values and check the resulting ROC curves or some other precision/recall measure in order to find an acceptable value.
Both Euclidean and Manhattan distances respect the Triangle inequality, thus you can use them in metric trees. Indeed, KD-trees have their performance severely degraded when the data have more than 10 dimensions (I've experienced that problem myself). I found VP-trees to be a better option.
KD Trees work fine for 21 dimensions, if you quit early,
after looking at say 5 % of all the points.
FLANN does this (and other speedups)
to match 128-dim SIFT vectors. (Unfortunately FLANN does only the Euclidean metric,
and the fast and solid
scipy.spatial.cKDTree
does only Lp metrics;
these may or may not be adequate for your data.)
There is of course a speed-accuracy tradeoff here.
(If you could describe your Ndata, Nquery, data distribution,
that might help people to try similar data.)
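For a feel of the speed-accuracy tradeoff with scipy.spatial.cKDTree, here is a hedged sketch using its `eps` argument, which allows the k-th returned neighbor to be up to (1 + eps) times farther than the true k-th neighbor. Note that this is not the same mechanism as the "cutoff" in the timings quoted in this answer, which comes from the author's own script; the data below is uniform random and purely illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.random((20000, 21))     # 21-dimensional points, as in the question
queries = rng.random((100, 21))

tree = cKDTree(data, leafsize=10)

# Exact search for the 2 nearest neighbors under the Euclidean (p=2) metric
exact_d, exact_i = tree.query(queries, k=2, p=2)

# Approximate search: results may be up to (1 + eps) times farther than exact
approx_d, approx_i = tree.query(queries, k=2, p=2, eps=1.0)

# Fraction of queries where the approximate nearest neighbor is the exact one
agreement = float((exact_i[:, 0] == approx_i[:, 0]).mean())
```

Timing both calls on your own data, and comparing `agreement`, shows whether the accuracy given up is worth the time saved.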
Added 26 April, run times for cKDTree with cutoff on my old mac ppc, to give a very rough idea of feasibility:
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=1000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.1 % of the 1000000 points, 0.31 % of 188315 boxes; better 0.0042 0.014 0.1 %
3.5 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.253
kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=5000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.48 % of the 1000000 points, 1.1 % of 188315 boxes; better 0.0071 0.026 0.5 %
15 sec to query 1000 points
distances to 2 nearest: av 0.131 max 0.245
You could try a z-order curve. It's easy for 3 dimensions.
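A z-order (Morton) code for 3 dimensions can be sketched by interleaving the bits of the three coordinates. This assumes the coordinates have already been quantized to non-negative integers of at most `bits` bits; the function name `morton3` is made up for illustration.

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one integer key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)       # bit i of x goes to position 3i
        code |= ((y >> i) & 1) << (3 * i + 1)   # bit i of y goes to position 3i+1
        code |= ((z >> i) & 1) << (3 * i + 2)   # bit i of z goes to position 3i+2
    return code

# Sorting by the key tends to place spatially close points near each other
points = [(5, 9, 1), (5, 8, 1), (900, 3, 77), (4, 9, 0)]
ordered = sorted(points, key=lambda p: morton3(*p))
```

Points that are adjacent in the sorted order are usually, but not always, close in space, so a search along the curve gives candidate neighbors that still need to be checked by actual distance.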
I had a similar question a while back. For fast Approximate Nearest Neighbor Search you can use the annoy library from spotify: https://github.com/spotify/annoy
This is some example code for the Python API, which is optimized in C++.
from annoy import AnnoyIndex
import random
f = 40
t = AnnoyIndex(f, 'angular') # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)
t.build(10) # 10 trees
t.save('test.ann')
# ...
u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors
They provide different distance measurements. Which one you should apply depends highly on your individual problem. Also consider prescaling (that is, weighting) certain dimensions by importance first. Those dimension or feature importance weights might be calculated from something like entropy loss or, if you have a supervised learning problem, Gini impurity gain or mean average loss, where you check how much worse your machine learning model performs if you scramble that dimension's values.
Often the direction of the vector is more important than its magnitude. For example, in the semantic analysis of text documents we want document vectors to be close when their semantics are similar, not when their lengths are. Thus we can either normalize those vectors to unit length or use angular distance (i.e. cosine similarity) as the distance measurement.
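The two options are closely related: for unit-length vectors, the squared Euclidean distance equals 2 - 2 * cos, so ranking by Euclidean distance after normalization gives the same order as ranking by cosine similarity. A small sketch with random stand-in vectors (the data and the helper names `cosine_similarity` and `unit` are assumptions for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def unit(v):
    """Scale v to unit length so only its direction remains."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
docs = rng.random((50, 100))   # stand-in for 50 document vectors
query = rng.random(100)

# Most similar document by cosine similarity (maximize)
by_cosine = max(range(len(docs)), key=lambda i: cosine_similarity(docs[i], query))

# Nearest document by Euclidean distance on unit-length vectors (minimize)
by_euclid = min(range(len(docs)), key=lambda i: np.linalg.norm(unit(docs[i]) - unit(query)))
```

This matters in practice because many index structures only support Euclidean distance; normalizing first lets them answer cosine-style queries.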
Hope this is helpful.
Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
I would suggest soft subspace clustering, a pretty common approach nowadays, in which feature weights are calculated to find the most relevant dimensions. You can then use these weights with Euclidean distance, for example. See the curse of dimensionality for common problems; this article may also shed some light:
A k-means type clustering algorithm for subspace clustering of mixed numeric and
categorical datasets

Resources