So I'm looking to apply a clustering algorithm to the earthquake data provided by the USGS.
http://earthquake.usgs.gov/earthquakes/feed/
My main goal is to determine, based on the earthquake feed, the top 10 most dangerous places to be (ranked either by the number of earthquakes or by the magnitudes a place experiences).
Are there any suggestions on how to do it? I'm considering k-means, then summing the weighted earthquake magnitudes within each cluster to identify the most dangerous clusters.
For reference, I'm writing this in Ruby.
Thanks
K-means can't handle outliers in the data set very well.
Furthermore, it is designed around variance, but variance in latitude and longitude is not really meaningful. In fact, k-means cannot handle the ±180° longitude wrap-around. Instead, you will want to use the great-circle distance.
So try to use a density based clustering algorithm that allows you to use distances such as the great-circle distance!
Read up on Wikipedia and a good book on cluster analysis.
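As an illustration only (in Python with scikit-learn, even though the question mentions Ruby), here is a minimal sketch of density-based clustering of quake coordinates using the great-circle (haversine) distance; the coordinates and eps value are made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up (latitude, longitude) pairs in degrees, standing in for
# coordinates parsed from the USGS earthquake feed.
coords_deg = np.array([
    [38.32, -122.31], [38.33, -122.30], [38.35, -122.28],   # a dense pocket
    [61.49, -149.95], [61.50, -149.90],                      # another pocket
    [-33.45, -70.66],                                         # isolated event -> noise
])

# Haversine distances in scikit-learn live on the unit sphere, so convert
# the search radius from km by dividing by the Earth's radius (~6371 km).
eps_km = 100.0
db = DBSCAN(eps=eps_km / 6371.0, min_samples=2, metric="haversine")
labels = db.fit_predict(np.radians(coords_deg))
print(labels)  # label -1 marks noise / outlier quakes
```

Each resulting cluster could then be scored by summing (or weighting) the magnitudes of its quakes to rank the "most dangerous" regions.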
Related
I am making use of the ELKI library to perform some distance measure between features.
Among other features, I am planning to implement Tamura features. From the research that I have done, these yield a vector that represents three 'unrelated' features (1st element: coarseness, 2nd element: contrast, 3rd-18th elements: directionality). Should the distance between two Tamura feature vectors be measured as a whole, or is it better to measure the distance between these three features independently (possibly with different distance functions)?
Besides, I read that the Chi-square and quadratic-form distances are good for measuring distance between histograms, since they use information across bins to produce more perceptually desirable results. However, I am still not sure whether such measures are adequate for the directionality-histogram part of the Tamura feature. Can someone suggest a good distance function for this situation?
Thanks!
I want to know whether the k-means clustering algorithm can be used for classification.
Suppose I have done a simple k-means clustering:
Assume I have a lot of data. I run k-means clustering and get 2 clusters, A and B, where the centroids are computed using Euclidean distance.
Cluster A at left side.
Cluster B at right side.
So, if I get one new data point, what should I do?
1. Run the k-means clustering algorithm again, and see which cluster the new data point belongs to?
2. Record the last centroids and use the Euclidean distance to them to decide which cluster the new data point belongs to?
3. Some other method?
The simplest method of course is 2., assign each object to the closest centroid (technically, use sum-of-squares, not Euclidean distance; this is more correct for k-means, and saves you a sqrt computation).
Method 1. is fragile, as k-means may give you a completely different solution; in particular if it didn't fit your data well in the first place (e.g. too high dimensional, clusters of too different size, too many clusters, ...)
However, the following method may be even more reasonable:
3. Train an actual classifier.
Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions are reasonable classes (you really should validate this at some point though), and then continue as you would if the data had been user-labeled.
I.e. run k-means, then train an SVM on the resulting clusters. Then use the SVM for classification.
k-NN classification, or even assigning each object to the nearest cluster center (option 1) can be seen as very simple classifiers. The latter is a 1NN classifier, "trained" on the cluster centroids only.
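A rough sketch of that k-means-then-SVM idea, using scikit-learn as a stand-in (the feature matrix X here is just a random placeholder for your real data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X = np.random.rand(200, 2)            # placeholder for your real feature matrix

# 1. Unsupervised step: partition the data with k-means.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
pseudo_labels = kmeans.labels_        # treat cluster ids as class labels

# 2. Supervised step: train a classifier on the pseudo-labels.
clf = SVC(kernel="rbf").fit(X, pseudo_labels)

# 3. Classify new points with the trained SVM, not by re-running k-means.
X_new = np.random.rand(5, 2)
print(clf.predict(X_new))
```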
Yes, we can do classification.
I wouldn't say the algorithm itself (like #1) is particularly well-suited to classifying points, as incorporating data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).
To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.
There are data structures that allow you to determine the closest centroid more efficiently (like a kd-tree), but the above is the basic idea.
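A minimal sketch of that nearest-centroid assignment (plain NumPy, plus a KD-tree variant via SciPy; the centroids are assumed to come from a previous k-means run):

```python
import numpy as np
from scipy.spatial import cKDTree

centroids = np.array([[0.0, 0.0], [5.0, 5.0]])   # e.g. from a finished k-means run
new_points = np.array([[0.5, -0.2], [4.0, 6.0]])

# Brute force: squared Euclidean distance to every centroid, pick the smallest.
d2 = ((new_points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels_brute = d2.argmin(axis=1)

# Same result with a KD-tree, useful when there are many centroids.
tree = cKDTree(centroids)
_, labels_tree = tree.query(new_points, k=1)

print(labels_brute, labels_tree)   # both: [0 1]
```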
If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.
Here is another method:
I saw it on "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters. Now, follow these steps:
Apply K-means clustering to the training data in each class separately, using K clusters per class.
Assign a class label to each of the C*K clusters.
Classify observation x to the class of the closest cluster.
It seems like a nice approach for classification that reduces data observations by using clusters.
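A sketch of those three steps, assuming labeled training data (X, y) and scikit-learn's KMeans (the example data is invented):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_class_prototypes(X, y, K=3):
    """Run k-means separately inside each class; return (centroids, their class labels)."""
    centroids, labels = [], []
    for c in np.unique(y):
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X[y == c])
        centroids.append(km.cluster_centers_)
        labels.extend([c] * K)
    return np.vstack(centroids), np.array(labels)

def predict(X_new, centroids, centroid_labels):
    """Classify each point by the class of its closest cluster centroid."""
    d2 = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroid_labels[d2.argmin(axis=1)]

# Tiny made-up example: two classes in 2-D.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
y = np.array([0] * 50 + [1] * 50)
cents, cent_labels = fit_class_prototypes(X, y, K=3)
print(predict(np.array([[0.1, 0.2], [5.2, 4.8]]), cents, cent_labels))  # -> [0 1]
```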
If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need when monitoring plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of K-means; RTEFC in particular is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm
Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers - e.g., those caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate when one purpose of the clustering is to identify faults in the sensors, although that is most useful when you can do some labeling of the clusters.
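A toy sketch of that radius rule (not the actual RTEFC algorithm, just the idea described above: a new point either nudges the nearest centroid via an exponential filter or starts a new cluster; radius and alpha are made-up values):

```python
import numpy as np

def update_clusters(point, centroids, radius=1.0, alpha=0.1):
    """Assign a point to the nearest centroid if within `radius`, nudging that
    centroid toward the point; otherwise start a new cluster at the point."""
    point = np.asarray(point, dtype=float)
    if centroids:
        dists = [np.linalg.norm(point - c) for c in centroids]
        i = int(np.argmin(dists))
        if dists[i] <= radius:
            centroids[i] = (1 - alpha) * centroids[i] + alpha * point  # exponential filter
            return centroids, i
    centroids.append(point)              # new operating condition observed
    return centroids, len(centroids) - 1

centroids = []
for p in [[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9], [0.1, -0.1]]:
    centroids, label = update_clusters(p, centroids, radius=1.0, alpha=0.2)
print([c.round(2).tolist() for c in centroids])
```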
You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is grouped according to the labels, and there is no point in clustering the data unless you want to find out how well your features can discriminate between the classes.
If you run the k-means algorithm to find the centroid of each class and then use the distances from the centroids to classify a new data point, you in fact implement a form of the linear discriminant analysis algorithm assuming the same multiple-of-identity covariance matrix for all classes.
After the k-means clustering algorithm converges, it can be used for classification, with only a few labeled exemplars as training data.
It is a very common approach when the number of labeled training instances is very limited due to the high cost of labeling.
I have about 200 points in the Cartesian plane (2D). I want to cluster these points into k clusters with respect to an arbitrary distance function (not a matrix) and get the so-called centroids or representatives of these clusters. I know k-means does this with respect to some special distance functions such as Euclidean, Manhattan, Cosine, etc. But k-means cannot handle an arbitrary distance function, because in the centroid-updating phase of k-means with the Euclidean distance, the mean of the points in each cluster is the least-squares estimate and minimizes the sum of squared distances from the points in the cluster to its centroid (the mean); however, the mean of the points may not minimize the distances when the distance function is arbitrary. Could you please help me with this and tell me if you know of any clustering algorithms that would work for me?
If you replace "mean" with "most central point in cluster", then you get the k-medoids algorithm. Wikipedia claims that a metric is required, but I believe that to be incorrect, since I can't see where the majorization-minimization proof needs the triangle inequality or even symmetry.
There are various clustering algorithms that can work with arbitrary distance functions, in particular:
hierarchical clustering
k-medoids (PAM)
DBSCAN
OPTICS
many many more - get some good clustering book and/or software
But the only one which enforces k clusters and uses a "cluster representative" model is k-medoids. You may be putting too many constraints on the cluster model to get a wider choice.
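As an illustration, here is a minimal (and naive, O(n²)-per-iteration) k-medoids sketch that accepts any distance function; it is a toy alternating scheme, not a full PAM implementation:

```python
import numpy as np

def k_medoids(points, k, dist, n_iter=100, seed=0):
    """Naive alternating k-medoids: assign each point to its nearest medoid,
    then pick the most central member of each cluster as the new medoid."""
    rng = np.random.default_rng(seed)
    n = len(points)
    D = np.array([[dist(points[i], points[j]) for j in range(n)] for i in range(n)])
    medoids = [int(m) for m in rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        assign = D[:, medoids].argmin(axis=1)            # nearest medoid per point
        new_medoids = []
        for c in range(k):
            members = np.where(assign == c)[0]
            if len(members) == 0:                        # keep an empty cluster's medoid
                new_medoids.append(medoids[c])
                continue
            # member with the smallest total distance to the rest of its cluster
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(int(members[costs.argmin()]))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, D[:, medoids].argmin(axis=1)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])   # any distance function works
medoids, assignment = k_medoids(pts, k=2, dist=manhattan)
print([pts[m] for m in medoids], assignment)
```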
Since you want something that represents a centroid but is not one of the data points, a technique I once used was to perform something like k-medoids on N random samples, then take all the members of each cluster and use them as samples to build a classifier which returned a class label. In the end, each class label returned from the classifier ends up being an abstract notion of a set of clusters/centroids. I did this for a very specific and nuanced reason, and I know its flaws.
If you don't want to have to specify K, and your vectors are not enormous and super sparse, then I would take a look at the cobweb clustering in JavaML; JavaML also has a decent KMedoids.
I need help finding an open/free density-based clustering library that takes a distance matrix as input and returns clusters in which each element is at most a distance "x" away from each of the other elements in the cluster (basically returning clusters of a specified density).
I checked out the DBSCAN algorithm, and it seems to suit my needs. Are there any clean implementations of DBSCAN that you might know of, which can start from a pre-computed distance matrix and output clusters of the desired density?
Your inputs will be really useful.
ELKI (at http://elki.dbs.ifi.lmu.de/ ) can load external distance matrices, in either a binary or an ASCII format, and then run distance-based clustering algorithms on them.
Certain algorithms such as k-means cannot work, however, as they rely on the distance to the mean, which obviously cannot be precomputed. But e.g. DBSCAN and OPTICS work fine with precomputed distances.
I haven't tried it out yet, but I'm looking for something similar and came across this python implementation of DBSCAN:
http://scikit-learn.org/dev/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py
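For reference, scikit-learn's DBSCAN also accepts a precomputed distance matrix directly. A small made-up example (the points are only used here to build the matrix; in the question's setting the matrix itself would be the input):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial.distance import pdist, squareform

pts = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3], [5, 5], [5.1, 5.2], [9, 9]])
D = squareform(pdist(pts))                 # symmetric n x n distance matrix

# eps is the maximum distance "x" between a point and its neighbours.
db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
print(db.fit_predict(D))                   # -1 marks points belonging to no cluster
```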
Matlab file exchange has an implementation which is straightforward to adapt to precomputed matrices. Just remove the call to pdist1 outside the function in your code.
I currently have a reddit-clone type website. I'm trying to recommend posts based on the posts that my users have previously liked.
It seems like K nearest neighbor or k means are the best way to do this.
I can't seem to understand how to actually implement this. I've seen some mathematical formulas (such as the one on the k means wikipedia page), but they don't really make sense to me.
Could someone maybe recommend some pseudo code, or places to look so I can get a better feel on how to do this?
K-Nearest Neighbor (aka KNN) is a classification algorithm.
Basically, you take a training group of N items and classify them. How you classify them is completely dependent on your data, and what you think the important classification characteristics of that data are. In your example, this may be category of posts, who posted the item, who upvoted the item, etc.
Once this 'training' data has been classified, you can then evaluate an 'unknown' data point. You determine the 'class' of the unknown by locating its nearest neighbors in the classification system. If you determine the classification from the 3 nearest neighbors, it is called a 3-nearest-neighbor algorithm.
How you determine the 'nearest neighbor' depends heavily on how you classify your data. It is very common to plot the data into N-dimensional space where N represents the number of different classification characteristics you are examining.
A trivial example:
Let's say you have the longitude/latitude coordinates of a location that can be on any landmass anywhere in the world. Let us also assume that you do not have a map, but you do have a very large data set that gives you the longitude/latitude of many different cities in the world, and you also know which country those cities are in.
If I asked you which country a random longitude/latitude point is in, would you be able to figure it out? What would you do to figure it out?
Longitude/latitude data falls naturally into an X,Y graph. So, if you plotted out all the cities onto this graph, and then the unknown point, how would you figure out the country of the unknown? You might start drawing circles around that point, growing increasingly larger until the circle encompasses the 10 nearest cities on the plot. Now, you can look at the countries of those 10 cities. If all 10 are in the USA, then you can say with a fair degree of certainty that your unknown point is also in the USA. But if only 6 cities are in the USA, and the other 4 are in Canada, can you say where your unknown point is? You may still guess USA, but with less certainty.
The toughest part of KNN is figuring out how to classify your data in a way that you can determine 'neighbors' of similar quality, and the distance to those neighbors.
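A compact sketch of that city/country example with scikit-learn (the coordinates and labels are invented):

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented training data: (latitude, longitude) of "cities" and their country.
cities = [[40.7, -74.0], [34.1, -118.2], [41.9, -87.6],     # USA
          [43.7, -79.4], [45.5, -73.6], [49.3, -123.1]]     # Canada
countries = ["USA", "USA", "USA", "Canada", "Canada", "Canada"]

knn = KNeighborsClassifier(n_neighbors=3)    # a "3-nearest-neighbor" classifier
knn.fit(cities, countries)

print(knn.predict([[42.3, -83.0]]))          # majority vote of the 3 closest cities
print(knn.predict_proba([[42.3, -83.0]]))    # how confident that vote is
```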
What you described sounds like a recommender-system engine, not a clustering algorithm like k-means, which in essence is an unsupervised approach. I don't have a clear idea of what Reddit actually uses, but I found some interesting posts by googling around "recommender + reddit", e.g. Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! Anyway, the k-NN algorithm (described in the top ten data mining algorithms, with pseudo-code on Wikipedia) might be used, or other techniques like collaborative filtering (used by Amazon, for example), described in this good tutorial.
k-means clustering in its simplest form averages values, grouping each value around one central average value. Suppose you have the following values:
1,2,3,4,6,7,8,9,10,11,12,21,22,33,40
Now suppose I do k-means clustering; remember that k-means has a biasing (means/averaging) mechanism that places each value either close to a center or far away from it. We might get the following:
cluster-1
1,2,3,4,6,7,8,9
cluster-2
10,11,12
cluster-3
21,22
cluster-4
33
cluster-5
40
Remember, I just made up these cluster centers (clusters 1-5).
So the next time you do clustering, the new numbers would end up around one of these central means (also known as k centers). The data above is one-dimensional.
When you perform k-means clustering on large, multidimensional data sets (multidimensional data is an array of values, and you may have millions of arrays of the same dimension), you will need something bigger and more scalable: you first average one array down to a single value, repeat the same for the other arrays, and then perform the k-means clustering.
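For completeness, here is how the one-dimensional example above could be run with scikit-learn's KMeans (k=5 simply mirrors the five clusters listed; the actual split k-means finds may differ from the made-up one):

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 21, 22, 33, 40]).reshape(-1, 1)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(values)
for c in range(5):
    members = values[km.labels_ == c].ravel()
    print(f"cluster {c}: center={km.cluster_centers_[c][0]:.1f}, members={members.tolist()}")
```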
Read one of my questions Here
Hope this helps.
To do k-nearest neighbors you mostly need a notion of distance and a way of finding the k nearest neighbours to a point that you can afford (you probably don't want to search through all your data points one by one). There is a library for approximate nearest neighbour at http://www.cs.umd.edu/~mount/ANN/. It's a very simple classification algorithm - to classify a new point p, find its k nearest neighbours and classify p according to the most popular classes amongst those k neighbours.
I guess in your case you could provide somebody with a list of similar posts as soon as you decide what nearest means, and then monitor click-through from this and try to learn from that to predict which of those alternatives would be most popular.
If you are interested in finding a particularly good learning algorithm for your purposes, have a look at http://www.cs.waikato.ac.nz/ml/weka/ - it allows you to try out a large number of different algorithms, and also to write your own as plug-ins.
Here is a very simple example of KNN for the MNIST dataset.
Once you are able to calculate the distance between your documents, the same algorithm would work:
http://shyamalapriya.github.io/digit-recognition-using-k-nearest-neighbors/