DBSCAN Clustering with Numerical and Categorical Variables - rstudio

I just wanted to ask how to deal with Categorical variables in Clustering using DBSCAN? And what methods to be used to deal with them?

DBSCAN is based on Euclidian distances (epsilon neighborhoods). You need to transform your data so Euclidean distance makes sense. One way to do this would be to use 0-1 dummy variables, but it depends on the application.

Related

Can k-means clustering do classification?

I want to know whether the k-means clustering algorithm can do classification?
If I have done a simple k-means clustering .
Assume I have many data , I use k-means clusterings, then get 2 clusters A, B. and the centroid calculating method is Euclidean distance.
Cluster A at left side.
Cluster B at right side.
So, if I have one new data. What should I do?
Run k-means clustering algorithm again, and can get which cluster does the new data belong to?
Record the last centroid and use Euclidean distance to calculating to decide the new data belong to?
other method?
The simplest method of course is 2., assign each object to the closest centroid (technically, use sum-of-squares, not Euclidean distance; this is more correct for k-means, and saves you a sqrt computation).
Method 1. is fragile, as k-means may give you a completely different solution; in particular if it didn't fit your data well in the first place (e.g. too high dimensional, clusters of too different size, too many clusters, ...)
However, the following method may be even more reasonable:
3. Train an actual classifier.
Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions could be reasonable classes (you really should validate this at some point though), and then continue as you would if the data would have been user-labeled.
I.e. run k-means, train a SVM on the resulting clusters. Then use SVM for classification.
k-NN classification, or even assigning each object to the nearest cluster center (option 1) can be seen as very simple classifiers. The latter is a 1NN classifier, "trained" on the cluster centroids only.
Yes, we can do classification.
I wouldn't say the algorithm itself (like #1) is particularly well-suited to classifying points, as incorporating data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).
To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.
There are data structures that allows you to more efficiently determine the closest centroid (like a kd-tree), but the above is the basic idea.
If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.
Here another method:
I saw it on "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters. Now, follow these steps:
Apply K-means clustering to the training data in each class seperately, using K clusters per class.
Assign a class label to each of the C*K clusters.
Classify observation x to the class of the closest cluster.
It seems like a nice approach for classification that reduces data observations by using clusters.
If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need in monitoring of plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of K-means. RTEFC in particular, which is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm
Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters, and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers - e.g., a caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate if one purpose of the clustering is identify faults in the sensors, although that the most useful when you can do some labeling of clusters.
You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is clustered according to the labels and there is no point in clustering the data unless if you want to find out how well your features can discriminate the classes.
If you run the k-means algorithm to find the centroid of each class and then use the distances from the centroids to classify a new data point, you in fact implement a form of the linear discriminant analysis algorithm assuming the same multiple-of-identity covariance matrix for all classes.
After k-means Clustering algorithm converges, it can be used for classification, with few labeled exemplars/training data.
It is a very common approach when the number of training instances(data) with labels are very limited due to high cost of labeling.

A clustering algorithm that accepts an arbitrary distance function

I have about 200 points in Cartesian plane (2D). I want to cluster these points to k clusters with respect to arbitrary distance function (not matrix) and get the so-called centroid or representatives of these clusters. I know kmeans does this with respect to some special distance functions such as Euclidean, Manhattan, Cosine, etc. But, kmeans cannot handle arbitrary distance function because for example in centroid-updating phase of kmeans with respect to Euclidean distance function, mean of the points in each cluster is the LSE and minimizes the sum of distances of the nodes in the cluster to its centroid (mean); however, mean of the points may not minimize the ditances when the distance function is something arbitrary. Could you please help me about it and tell me if you know about any clustering algorithms that can work for me?
If you replace "mean" with "most central point in cluster", then you get the k-medoids algorithm. Wikipedia claims that a metric is required, but I believe that to be incorrect, since I can't see where the majorization-minimization proof needs the triangle inequality or even symmetry.
There are various clustering algorithms that can work with arbitrary distance functions, in particular:
hierarchical clustering
k-medoids (PAM)
DBSCAN
OPTICS
many many more - get some good clustering book and/or software
But the only one which enforces k clusters and uses a "cluster representative" model is k-medoids. You may be putting too many constraints on the cluster model to get a wider choice.
Since you want something that represents a centroid but is not one of the data points, a technique I once used was to perform something like Kmedoids on N random samples, then I took all the members of each cluster and used them as samples to build a classifier which returned a class label... in the end each class label returned from the classifier ends up being an abstract notion of a set of cluster/centroids. I did this for a very specific and nuanced reason, I know the flaws.
If you don't want to have to specify K, and your vectors are not enormous and super sparse, then I would take a look at the cobweb clustering in JavaML, JavaML also has a decent KMedoids.

Measuring distance between vectors

I have a set of 300.000 or so vectors which I would like to compare in some way, and given one vector I want to be able to find the closest vector I have thought of three methods.
Simple Euclidian distance
Cosine similarity
Use a kernel (for instance Gaussian) to calculate the Gram matrix.
Treat the vector as a discrete probability distribution (which makes
sense to do) and calculate some divergence measure.
I do not really understand when it is useful to do one rather than the other. My data has a lot of zero-elements. With that in mind, is there some general rule of thumbs as to which of the three methods is the best?
Sorry for the weak question, but I had to start somewhere...
Thank you!
Your question is not quite clear, are you looking for a distance metric between vectors, or an algorithm to efficiently find the nearest neighbour?
If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as the kd-tree. (since you are just looking at points in d-dimensional space). See http://en.wikipedia.org/wiki/Nearest_neighbor_search, for other methods.
Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.
If your vectors are very sparse in nature and if they are binary, you can use Hamming or Hellinger distance. When your vector dimensions are large, avoid using Euclidean (refer http://en.wikipedia.org/wiki/Curse_of_dimensionality)
Please refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper limits it to pair of probability distributions.

Best algorithm to interpolate on a grid

I have a set points whose coordinates are given by the arrays x, y and z and the value of the density field in each point is stored in the array d.
I would like to reconstruct the density field on a uniform grid. What's the best algorithm to do that?
I know that in python, the scipy module come in handy with the griddata function but I would like to write my own code, I just need a hint.
If you have some sort of scalar field and the points are the origins of the field, you can implement a brute force approach by walking all lattice points and calculating the field intensity given the sources. There are both recursive methods that allow "blanking" wide volumes where the field is more or less constant, and techniques to save some CPU time by calculating the variations from one point to the next.
If the points you have are samplings of a value, then you will have to decompose your space in volumes and interpolate the values. You can employ a simple Voronoi decomposition - this is usually done in 2D for precipitation measurements - or a Delaunay tetrahedralization (you can look into TetGen's documentation). The first approach assumes that the function is constant throughout each Voronoi volume; the last allows rendering a trilinear interpolation.
If you need to smooth a 3D grid, the trilinear interpolation looks like the best approach.
There are also other methods used for fast visualization, that involve maintaining a list of 3D points in order of distance from any one given point in your regular grid. When moving through the grid, you recalculate distances using quadratic increments. Then, you perform a simple interpolation based on a subset of points of chosen cardinality (i.e., if you consider the four nearest points at distances d1..d4, you would calculate the value in P by proportionally weighing the values v1..v4). This approach is fast and easy to implement by yourself, but be warned that it underperforms wherever the minimum distance between points is less than the lattice step (you can compensate by considering more points where this happens; and the effect is less evident if the sampled function is smooth at the same scale).
If you want to implement a mathematical method yourself, you need to learn the theory, of course. In this case, it's 3D scattered data interpolation.
Wikipedia, MATLAB help and scipy help say there are at least half a dozen different methods. WP has a fairly good description of them and there's a comparison article but I strongly suggest you find something in your native language on such a terminology-intensive subject.
One approach is to form the Delaunay triangulation of the scattered points [x,y,z], (actually a tetrahedralisation in your 3d case!) and perform interpolation within each element using a linear representation of the density field, defined at the tetrahedron vertices.
To evaluate the density at each structured grid point you would (i) determine which tetrahedron the point lay within and (ii) evaluate the linear interpolant.
Forming the Delaunay triangulation is non-trivial, put there are a few good libraries that can be used for this, depending on your language of choice. One good option is CGAL.
Hope this helps.

A density based clustering library that takes distance matrix as input

Need help with finding an open/free density based clustering library that takes a distance matrix as input and returns clusters with each element within it maximum "x" distance away from each of the other elements in the clusters (basically returning clusters with specified density).
I checked out the DBSCAN algorithm, it seems to suit my needs. Any clean implementations of DBSCAN that you might no off, which can take off with a pre-computed distance matrix and output clusters with the desired density?
Your inputs will be really useful.
ELKI (at http://elki.dbs.ifi.lmu.de/ ) can load external distance matrixes, either in a binary or an Ascii format and then run distance-based clustering algorithms on it.
Certain algorithms such as k-means cannot work however, as these rely on the distance to the /mean/, which is obviously not precomputed. But e.g. DBSCAN and OPTICS work fine with precomputed distances.
I haven't tried it out yet, but I'm looking for something similar and came across this python implementation of DBSCAN:
http://scikit-learn.org/dev/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py
Matlab file exchange has an implementation which is straightforward to adapt to precomputed matrices. Just remove the call to pdist1 outside the function in your code.

Resources