DBSCAN algorithm outliers

In the DBSCAN algorithm, outliers are often discarded as noise, but in some applications this noisy data can be more interesting than the more regularly occurring points. Why?

The points marked as outliers aren't discarded as such; they are just points not in any cluster. You can still inspect the set of non-clustered points and try to interpret them.
DBSCAN is designed to find clusters without any knowledge of how many clusters there are or what shape they are. It does this by iteratively expanding clusters from starting points in sufficiently dense regions. Outliers are just the points that lie in sparsely populated regions (as defined by the eps and minPoints parameters).
In practice, it takes some care to choose parameters that won't include those outliers. If they are included in clusters they often act as a bridge between clusters and cause them to merge together into an analytically useless blob.
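For instance, a minimal Python sketch assuming scikit-learn (the eps and min_samples values are illustrative, mirroring the eps and minPoints above; noise points get the label -1):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).randn(200, 2)   # toy data (stand-in for yours)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps / minPoints as discussed above
noise = X[db.labels_ == -1]                  # points not in any cluster
print(len(noise), "points were left outside every cluster")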

Cluster points are similar: they share the same properties, tell the same story, and may be redundant.
Noise points (DBSCAN is not good at detecting actual outliers!) are all those data points that don't cluster.
You may even consider these data points to be normal data, because they do not cluster.
For detecting actual outliers (errors, or particularly interesting objects), use specialized outlier detection algorithms.
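A minimal sketch of that advice in Python, assuming scikit-learn (LocalOutlierFactor is one such specialized algorithm; n_neighbors and the toy data are illustrative):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(1).randn(300, 5)      # toy data (stand-in for yours)
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                     # -1 = outlier, 1 = inlier
print(len(X[labels == -1]), "candidate outliers flagged")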

Related

Does it make sense to randomly project high-dimensional data into lower dimensions for outlier detection?

I have some high-dimensional data from which I want to detect outliers. I know that if I'm working with low-dimensional data, I can cluster and then check whether a data point belongs to a cluster, or calculate the average distance from it to its k nearest neighbors, etc. But I can't do these things on high-dimensional data because of the curse of dimensionality.
So I think maybe I can randomly project the high-dimensional data to lower-dimensional spaces, and check whether the projections of a data point are outliers in most of the transformed datasets. My assumption is that an outlier in the higher dimension should also appear to be an outlier in most projections to lower dimensions.
For example, randomly generate some projections P_1, ..., P_m from the high-dimensional space R^D (where we suffer the curse of dimensionality) to a low-dimensional space R^d (where we can cluster by traditional methods); all of them are d x D matrices with random elements. Suppose we want to detect outliers in a dataset X. If, for many i, the projection P_i x is an outlier in P_i X, then x is an outlier.
Does it make sense?
The typical way to perform anomaly detection would be to perform dimensionality reduction using principal component analysis. The idea is similar to what you describe, but it uses linear algebra to make a smart choice of exactly how to perform the projection. Doing so guarantees that a minimal amount of information is lost in the projection.
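For instance, here is a minimal Python sketch assuming scikit-learn (the component count and the "top 10" cutoff are illustrative): points that PCA reconstructs poorly are candidate outliers.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(2).randn(500, 50)        # toy high-dimensional data
pca = PCA(n_components=5).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))    # project down, then back up
errors = np.linalg.norm(X - X_rec, axis=1)         # per-point reconstruction error
worst = np.argsort(errors)[-10:]                   # 10 least well-explained points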

Can k-means clustering do classification?

I want to know whether the k-means clustering algorithm can be used for classification.
Suppose I have done a simple k-means clustering.
Assume I have a lot of data. I run k-means and get 2 clusters, A and B, where the centroids are computed using Euclidean distance.
Cluster A is on the left side.
Cluster B is on the right side.
So, if I have one new data point, what should I do?
1. Run the k-means clustering algorithm again and see which cluster the new data point belongs to?
2. Record the final centroids and use the Euclidean distance to them to decide which cluster the new data point belongs to?
3. Some other method?
The simplest method is of course 2., assigning each object to the closest centroid (technically, use sum-of-squares, not Euclidean distance; this is more correct for k-means, and it saves you a sqrt computation).
Method 1. is fragile, as k-means may give you a completely different solution; in particular if it didn't fit your data well in the first place (e.g. too high dimensional, clusters of too different size, too many clusters, ...)
However, the following method may be even more reasonable:
3. Train an actual classifier.
Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions could be reasonable classes (you really should validate this at some point though), and then continue as you would if the data had been user-labeled.
I.e., run k-means, then train an SVM on the resulting clusters. Then use the SVM for classification.
k-NN classification, or even assigning each object to the nearest cluster center (option 2), can be seen as a very simple classifier. The latter is a 1NN classifier, "trained" on the cluster centroids only.
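A minimal sketch of options 2 and 3, assuming scikit-learn (KMeans.predict performs the nearest-centroid assignment, and the SVC step treats the cluster labels as classes; the data is a toy stand-in):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X = np.random.RandomState(3).randn(200, 2)   # toy training data
km = KMeans(n_clusters=2, n_init=10).fit(X)

x_new = np.array([[0.5, -1.0]])              # a new observation
print(km.predict(x_new))                     # option 2: nearest centroid

svm = SVC().fit(X, km.labels_)               # option 3: clusters as classes
print(svm.predict(x_new))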
Yes, we can do classification.
I wouldn't say the algorithm itself (like #1) is particularly well-suited to classifying points, as incorporating data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).
To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.
There are data structures that allow you to determine the closest centroid more efficiently (like a k-d tree), but the above is the basic idea.
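For example, a minimal SciPy sketch of that speed-up (a k-d tree over the centroids; mainly worthwhile when the number of clusters is large):

import numpy as np
from scipy.spatial import cKDTree

centroids = np.random.RandomState(4).randn(100, 3)   # pretend we have 100 centroids
tree = cKDTree(centroids)                            # build once
dist, idx = tree.query(np.zeros(3))                  # nearest centroid for a new point
print("assign to cluster", idx, "at distance", dist)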
If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.
Here's another method:
I saw it on "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters. Now, follow these steps:
Apply K-means clustering to the training data in each class separately, using K clusters per class.
Assign a class label to each of the C*K clusters.
Classify observation x to the class of the closest cluster.
It seems like a nice approach for classification that reduces data observations by using clusters.
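A minimal sketch of those three steps, assuming scikit-learn and NumPy (the function and variable names are mine; labeled training data X, y is assumed):

import numpy as np
from sklearn.cluster import KMeans

def fit_class_centroids(X, y, K):
    # Steps 1-2: run K-means inside each class, remembering the owning class.
    y = np.asarray(y)
    centroids, owners = [], []
    for c in np.unique(y):
        km = KMeans(n_clusters=K, n_init=10).fit(X[y == c])
        centroids.append(km.cluster_centers_)
        owners.extend([c] * K)
    return np.vstack(centroids), np.array(owners)

def classify(x, centroids, owners):
    # Step 3: the class owning the closest of the C*K centroids.
    return owners[np.argmin(np.linalg.norm(centroids - x, axis=1))]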
If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need in monitoring of plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of K-means. RTEFC in particular is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm
Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers, e.g., readings caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate if one purpose of the clustering is to identify faults in the sensors, although that is most useful when you can do some labeling of clusters.
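As a rough illustration of the radius rule described above (this is only a sketch of the idea, not the actual RTEFC/RTMAC algorithms from the link; the filter constant alpha is an assumption):

import numpy as np

def assign_or_spawn(x, centroids, r, alpha=0.1):
    # If the new point is within radius r of an existing centroid, fold it in
    # with an exponential filter step; otherwise it starts a new cluster.
    x = np.asarray(x, dtype=float)
    if centroids:
        d = [np.linalg.norm(x - c) for c in centroids]
        i = int(np.argmin(d))
        if d[i] <= r:
            centroids[i] = (1 - alpha) * centroids[i] + alpha * x
            return i
    centroids.append(x)
    return len(centroids) - 1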
You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is grouped according to the labels, and there is no point in clustering it unless you want to find out how well your features discriminate between the classes.
If you run the k-means algorithm to find the centroid of each class and then use the distances from the centroids to classify a new data point, you are in fact implementing a form of the linear discriminant analysis algorithm, assuming the same multiple-of-identity covariance matrix for all classes.
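To see why, expand the squared distance to each class centroid \mu_c (a sketch assuming equal class priors and a shared covariance \Sigma = \sigma^2 I):

\arg\min_c \|x - \mu_c\|^2 = \arg\max_c \left( x^\top \mu_c - \tfrac{1}{2}\|\mu_c\|^2 \right),

which matches the LDA discriminant \delta_c(x) = x^\top \Sigma^{-1} \mu_c - \tfrac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log \pi_c after substituting \Sigma = \sigma^2 I and dropping the equal \log \pi_c terms.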
After the k-means clustering algorithm converges, it can be used for classification with only a few labeled exemplars/training data.
This is a very common approach when the number of labeled training instances is very limited due to the high cost of labeling.

Hierarchical clustering heuristics

I want to explore relations between data items in a large array. Every data item is represented by a multidimensional vector. First of all, I've decided to use clustering. I'm interested in finding hierarchical relations between clusters (groups of data vectors). I'm able to calculate the distance between my vectors, so as the first step I find the minimal spanning tree. After that I need to group the data vectors according to the links in my spanning tree. But at this step I'm stuck: how do I combine different vectors into hierarchical clusters? I'm using a heuristic: if two vectors are linked and the distance between them is very small, they are in the same cluster; if two vectors are linked but the distance between them is larger than a threshold, they are in different clusters with a common root cluster.
But maybe there is better solution?
Thanks
P.S.
Thanks to all!
In fact I've tried to use k-means and some variation of CLOPE, but didn't get good results.
So now I know that the clusters of my dataset actually have a complex structure (much more complex than n-spheres).
That's why I want to use hierarchical clustering. I also guess that the clusters look like n-dimensional concatenations (like a 3D or 2D chain), so I use a single-link strategy.
But I'm stuck on how to combine different clusters with each other (in which situations do I have to make a common root cluster, and in which situations do I have to combine all sub-clusters into one cluster?).
I'm using this simple strategy:
If clusters (or vectors) are very close to each other, I combine their contents into one cluster (regulated by a threshold).
If clusters (or vectors) are very far from each other, I create a root cluster and put them into it.
But using this strategy I get very large cluster trees. I'm trying to find a satisfactory threshold, but maybe there is a better strategy for generating the cluster tree?
Here is a simple picture describing my question:
A lot of work has been done in this area. The usual advice is to start with K-means clustering unless you have a really good reason to do otherwise - but K-means does not do hierarchical clustering (normally, anyway), so you may have a good reason to do otherwise (although it's entirely possible to do hierarchical K-means by doing a first pass to create clusters, then doing another pass using the centroid of each of those clusters as a point, and continuing until you have as few high-level clusters as desired).
There are quite a few other clustering models though, and quite a few papers covering relative strengths and weaknesses, such as the following:
Pairwise Clustering and Graphical Models
Beyond pairwise clustering
Parallel pairwise clustering
A fast greedy pairwise distance clustering algorithm and its use in discovering thematic structures in large data sets
Pairwise Clustering Algorithm
Hierarchical Agglomerative Clustering
A little Googling will turn up lots more. Glancing back through my research directory from when I was working on clustering, I have dozens of papers, and my recollection is that there were a lot more that I looked at but didn't keep around, and many more still that I never got a chance to really even look at.
There is a whole zoo of clustering algorithms. Among them, minimum spanning tree a.k.a. single linkage clustering has some nice theoretical properties, as noted e.g. at http://www.cs.uwaterloo.ca/~mackerma/Taxonomy.pdf. In particular, if you take a minimum spanning tree and remove all links longer than some threshold length, then the resulting grouping of points into clusters should have minimum total length of remaining links for any grouping of that size, for the same reason that Kruskal's algorithm produces a minimum spanning tree.
However, there is no guarantee that minimum spanning tree will be the best for your particular purpose, so I think you should either write down what you actually need from your clustering algorithm and then choose a method based on that, or try a variety of different clustering algorithms on your data and see which is best in practice.
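For what it's worth, here is a minimal SciPy sketch of that threshold cut (the 1.0 threshold and the toy data are illustrative): build the MST from the pairwise distance matrix, drop links longer than the threshold, and read the clusters off as connected components.

import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

X = np.random.RandomState(5).randn(50, 4)     # toy vectors
D = squareform(pdist(X))                      # pairwise distance matrix
mst = minimum_spanning_tree(D).toarray()
mst[mst > 1.0] = 0                            # drop links longer than the threshold
n_clusters, labels = connected_components(mst != 0, directed=False)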

What is the difference between a KD-tree and a R-tree?

I looked at the definition of KD-tree and R-tree. It seems to me that they are almost the same.
What's the difference between a KD-tree and an R-tree?
They are actually quite different. They serve a similar purpose (region queries on spatial data), and they are both trees (and both belong to the family of bounding volume hierarchy indexes), but that is about all they have in common.
R-Trees are balanced, k-d-trees are not (unless bulk-loaded). This is why R-trees are preferred for changing data, as k-d-trees may need to be rebuilt to re-optimize.
R-Trees are disk-oriented. They actually organize the data in areas that directly map to the on-disk representation. This makes them more useful in real databases and for out-of-memory operation. k-d-trees are memory-oriented and are non-trivial to put into disk pages.
k-d-trees are elegant when bulk-loaded (kudos to SingleNegationElimination for pointing this out), while R-trees are better for changing data (although they do benefit from bulk loading, when used with static data).
R-Trees do not cover the whole data space. Empty areas may be uncovered. k-d-trees always cover the whole space.
k-d-trees binary split the data space, R-trees partition the data into rectangles. The binary splits are obviously disjoint; while the rectangles of an R-tree may overlap (which actually is sometimes good, although one tries to minimize overlap)
k-d-trees are a lot easier to implement in memory, which actually is their key benefit
R-trees can store rectangles and polygons; k-d-trees only store point vectors (as overlap is needed for polygons)
R-trees come with various optimization strategies, different splits, bulk-loaders, insertion and reinsertion strategies etc.
k-d-trees use the one-dimensional distance to the separating hyperplane as bound; R-trees use the d-dimensional minimum distance to the bounding hyperrectangle for bounding (they can also use the maximum distance for some counting queries, to filter true positives).
k-d-trees support squared Euclidean distance and Minkowski norms, while R-trees have been shown to also support geodetic distance (for finding near points on geodata).
R-trees and kd-trees are based on similar ideas (space partitioning based on axis-aligned regions), but the key differences are:
Nodes in kd-trees represent separating planes, whereas nodes in R-trees represent bounding boxes.
kd-trees partition the whole of space into regions whereas R-trees only partition the subset of space containing the points of interest.
kd-trees represent a disjoint partition (points belong to only one region) whereas the regions in an R-tree may overlap.
(There are lots of similar kinds of tree structures for partitioning space: quadtrees, BSP-trees, R*-trees, etc. etc.)
A major difference between the two not mentioned in this answer is that KD-trees are only efficient in bulk-loading situations. Once built, modifying or rebalancing a KD-tree is non-trivial. R-trees do not suffer from this.
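To make the contrast concrete, here is a minimal Python sketch of a nearest-neighbour query with each structure, assuming SciPy for the k-d tree and the third-party rtree package (a libspatialindex wrapper) for the R-tree:

import numpy as np
from scipy.spatial import cKDTree
from rtree import index                       # third-party, wraps libspatialindex

pts = np.random.RandomState(6).rand(1000, 2)

kd = cKDTree(pts)                             # k-d tree: stores the points themselves
dist, i = kd.query([0.5, 0.5])                # nearest neighbour

rt = index.Index()                            # R-tree: stores bounding boxes
for j, (x, y) in enumerate(pts):
    rt.insert(j, (x, y, x, y))                # degenerate point-sized boxes here
nearest = list(rt.nearest((0.5, 0.5, 0.5, 0.5), 1))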

How to cluster objects (without coordinates)

I have a list of opaque objects. I am only able to calculate the distance between them (that's not strictly true, I'm just setting the conditions for the problem):
class Thing {
public double DistanceTo(Thing other);
}
I would like to cluster these objects. I would like to control the number of clusters and I would like for "close" objects to be in the same cluster:
List<Cluster> cluster(int numClusters, List<Thing> things);
Can anyone suggest (and link to ;-)) some clustering algorithms (the simpler, the better!) or libraries that can help me?
Clarification Most clustering algorithms require that the objects be laid out in some N-dimensional space. This space is used to find "centroids" of clusters. In my case, I do not know what N is, nor do I know how to extract a coordinate system from the objects. All I know is how far apart 2 objects are. I would like to find a good clustering algorithm that uses only that information.
Imagine that you are clustering based upon the "smell" of an object. You don't know how to lay "smells" out on a 2D plane, but you do know whether two smells are similar or not.
I think you are looking for K-Medoids. It's like K-means in that you specify the number of clusters, K, in advance, but it doesn't require that you have a notion of "averaging" the objects you're clustering like K-means does.
Instead, every cluster has a representative medoid, which is the member of the cluster closest to the middle. You could think of it as a version of K-means that finds "medians" instead of "means". All you need is a distance metric to cluster things, and I've used this in some of my own work for exactly the same reasons you cite.
Naive K-medoids is not the fastest algorithm, but there are fast variants that are probably good enough for your purposes. Here are descriptions of the algorithms and links to the documentation for their implementations in R:
PAM is the basic O(n^2) implementation of K-medoids.
CLARA is a much faster, sampled version of PAM. It works by clustering a randomly sampled subset of the objects with PAM and then grouping the entire set of objects based on the subset. You should still be able to get very good clusterings fast with this.
If you need more information, here's a paper that gives an overview of these and other K-medoids methods.
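If you'd rather see the mechanics than call a library, here is a naive PAM-style sketch in Python that needs nothing but your DistanceTo function (illustrative only: roughly O(n^2) per iteration and randomly seeded, unlike the tuned implementations above):

import random

def k_medoids(things, dist, k, iters=100):
    # things must be usable as dict keys; dist is your DistanceTo.
    medoids = random.sample(things, k)
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for t in things:                          # assign to the closest medoid
            clusters[min(medoids, key=lambda m: dist(t, m))].append(t)
        # each new medoid is the member minimizing total distance to its cluster
        new = [min(c, key=lambda cand: sum(dist(cand, o) for o in c))
               for c in clusters.values() if c]
        if new == medoids:                        # converged
            return clusters
        medoids = new
    return clusters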
Here's an outline for a clustering algorithm that doesn't have the K-means requirement of finding a centroid.
Determine the distance between all objects. Record the n most separated objects. [finds the roots of our clusters, time O(n^2)]
Assign each of these n points to a new, distinct cluster.
For every other object: [assign objects to clusters, time O(n^2)]
  For each cluster:
    Calculate the average distance from the object to the cluster by averaging its distance to each object already in the cluster.
  Assign the object to the closest cluster.
This algorithm will certainly cluster the objects. But its runtime is O(n^2). Plus it is guided by those first n points chosen.
Can anyone improve upon this (better runtime perf, less dependent upon initial choices)? I would love to see your ideas.
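For reference, here is a direct Python sketch of the outline above, using only pairwise distances (one simple reading of "the n most separated objects" is farthest-point seeding, which is what this does):

def seeded_clusters(objs, dist, n):
    # Farthest-point seeding: start from the object farthest from everything,
    # then repeatedly take the object farthest from the roots chosen so far.
    roots = [max(objs, key=lambda o: sum(dist(o, p) for p in objs))]
    while len(roots) < n:
        roots.append(max((o for o in objs if o not in roots),
                         key=lambda o: min(dist(o, r) for r in roots)))
    clusters = {r: [r] for r in roots}
    for o in objs:
        if o in roots:
            continue
        best = min(clusters, key=lambda r: sum(dist(o, m) for m in clusters[r])
                                           / len(clusters[r]))
        clusters[best].append(o)                 # closest cluster by average distance
    return list(clusters.values())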
Here's a quick algorithm.
import random

def cluster_by_radius(points, dist, x):
    left = set(range(len(points)))                     # points_left
    clusters = []
    while left:
        seed = random.choice(tuple(left))              # random unclustered point
        near = {i for i in left if dist(points[seed], points[i]) <= x}
        left -= near                                   # new cluster: seed + neighbours
        clusters.append(near)
    return clusters
Alternatively, read the Wikipedia page. K-means clustering is a good choice:
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
The algorithm steps are:
* Choose the number of clusters, k.
* Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
* Assign each point to the nearest cluster center.
* Recompute the new cluster centers.
* Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. Another disadvantage is the requirement for the concept of a mean to be definable which is not always the case. For such datasets the k-medoids variant is appropriate.
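A minimal NumPy sketch of those steps, assuming numeric vectors (Lloyd's algorithm; real libraries add smarter seeding such as k-means++, and this sketch doesn't handle empty clusters):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # k random points as centers
    for _ in range(iters):
        # assign each point to the nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):                   # assignments stopped changing
            break
        centers = new
    return centers, labels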
How about this approach:
1. Assign all objects to one cluster.
2. Find the two objects, a and b, that are within the same cluster, k, and that are the maximum distance apart. To clarify, there should be one a and b for the whole set, not one a and b for each cluster.
3. Split cluster k into two clusters, k1 and k2, one with object a and one with object b.
4. For all other objects in cluster k, add them to either k1 or k2 by determining the minimum average distance to all other objects in that cluster.
5. Repeat steps 2-4 until N clusters are formed.
I think this algorithm should give you a fairly good clustering, although the efficiency might be pretty bad. To improve the efficiency you could alter step 4 so that you find the minimum distance to only the original object that started the cluster, rather than the average distance to all objects already in the cluster.
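A Python sketch of that splitting loop, using only pairwise distances (illustrative; the exhaustive pair search is what makes it expensive, as noted above):

import itertools

def divisive(objs, dist, N):
    clusters = [list(objs)]                      # step 1: everything in one cluster
    while len(clusters) < N:
        # step 2: the widest pair (a, b) over all current clusters
        k, (a, b) = max(((c, p) for c in clusters if len(c) > 1
                         for p in itertools.combinations(c, 2)),
                        key=lambda cp: dist(*cp[1]))
        clusters.remove(k)                       # step 3: split k into k1 and k2
        k1, k2 = [a], [b]
        for o in k:                              # step 4: assign the rest
            if o is a or o is b:
                continue
            avg1 = sum(dist(o, m) for m in k1) / len(k1)
            avg2 = sum(dist(o, m) for m in k2) / len(k2)
            (k1 if avg1 <= avg2 else k2).append(o)
        clusters += [k1, k2]                     # step 5: repeat until N clusters
    return clusters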
Phylogenetic DNA sequence analysis regularly uses hierarchical clustering on text strings, with [alignment] distance matrices. Here's a nice R tutorial for clustering:
http://www.statmethods.net/advstats/cluster.html
(Shortcut: Go straight to the "Hierarchical Agglomerative" section...)
Here are some other [language] libraries:
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
http://code.google.com/p/scipy-cluster/
This approach could help determine how many [k] "natural" clusters there are and which objects to use as roots for the k-means approaches above.
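For completeness, here is a minimal SciPy sketch of that hierarchical-agglomerative route when all you have is a distance function (single linkage here to match the earlier discussion; the toy objects and distance are stand-ins for your Thing.DistanceTo):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

things = list(range(30))                            # stand-ins for opaque objects
dist = lambda a, b: abs(a - b)                      # stand-in for Thing.DistanceTo
D = np.array([[dist(a, b) for b in things] for a in things])
Z = linkage(squareform(D), method='single')         # condensed distances in, tree out
labels = fcluster(Z, t=4, criterion='maxclust')     # cut the tree into 4 clusters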
