I'm new to MATLAB and I want to know how to perform the k-means algorithm in MATLAB, and also how to define the cluster centers when performing k-means.
For example, let's say I'm creating an array as given below.
m = [11.00; 10.60; 11.00; 12.00; 11.75; 11.55; 11.20; 12.10; 11.60; 11.40;...
11.10; 11.60; 11.00; 10.15; 9.55; 9.35; 8.55; 7.65; 7.80; 8.45; 8.05]
I want to cluster the above values into 5 clusters where k = 5.
And I would like to take the cluster centers as 2, 5, 10, 20, 40.
So my question is how can I define the cluster centers and perform k-means algorithm in MATLAB?
Is there a specific parameter to set the cluster centers in MATLAB kmeans() function?
Please help me to solve the above problems.
The goal of k-means clustering is to find the k cluster centers to minimize the overall distance of all points from their respective cluster centers.
With this goal, you'd write
[clusterIndex, clusterCenters] = kmeans(m,5,'start',[2;5;10;20;40])
This would adjust the cluster centers from their start position until an optimal position and assignment had been found.
If you instead want to associate the points in m with fixed cluster centers, you would not use kmeans, but calculate clusterIndex directly using min:
distanceToCenter = bsxfun(@minus,m,[2 5 10 20 40]);
[~, clusterIndex] = min(abs(distanceToCenter),[],2);
I.e., you calculate the difference between every point and every center and find, for each point, the center with the minimum (absolute) distance.
To plot results, you can align the points on a line and color them according to the center:
nCenters = length(clusterCenters);
cmap = hsv(nCenters);
figure
hold on
for iCenter = 1:nCenters
pts = m(iCenter==clusterIndex);                       % points assigned to this cluster
plot(pts, ones(size(pts)), '.', 'color', cmap(iCenter,:));
plot(clusterCenters(iCenter),1,'*','color',cmap(iCenter,:));
end
You can see a classic implementation in my post here - K-Means Algorithm with Arbitrary Distance Function Matlab (Chebyshev Distance).
Enjoy...
I have a problem as described in the title and I am using MATLAB 2020.
I have 2 sets of 2D points, and I want to find the two points (one from each set)
that have the minimal distance among all pairs, i.e. min(distance(pi, pj)).
I did some research (Google) and found this article:
"Optimal Algorithms for Computing the Minimum Distance Between Two Finite Planar Sets"
in this web page:
What is the fastest algorithm to calculate the minimum distance between two sets of points?
I tried to implement the algorithm using MATLAB, together with code for the Gabriel graph (which I found on Google)
here:
http://matgeom.sourceforge.net/doc/api/matGeom/graphs/gabrielGraph.html
The problem is that when I run the code, which is supposed to compare the algorithm against a "brute force" algorithm (two loops), the brute force is always faster, no matter how many points I use, and it is way faster, which contradicts my logic and the article mentioned above.
When I checked the execution time of the code lines, I found that the line
dist = dist + (repmat(p1(:,i), [1 n2])-repmat(p2(:,i)', [n1 1])).^2;
in
minDistancePoints(p1, varargin)
is the "problem".
Any advice?
thank you
P.S.
Let
set1 = rand(100,2)
set2 = rand(100,2)
I want to find the point1 in set1 and the point2 in set2 that have the minimum distance among all pairs.
Using implicit expansion, we can compute all the possible combinations at once and then find the point in p1 that minimizes the sum of the distances:
p1 = [0 -1;
2 3;
8 8]
p2 = [1 1;
2 3;
3 5;
3 3]
[~,closest_p1] = min(sum(sum((permute(p1,[3,2,1])-p2).^2,2).^0.5))
I add a dimension to p1 with permute(p1,[3,2,1]), so now we can compute all the combinations along this new third dimension.
closest_p1 gives the index of the point in p1 that minimizes the sum of the Euclidean distances to all the points in p2. In this example closest_p1 = 2.
Notice also that the algorithm you use seems to compute all the possible combinations as well.
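If what you literally want is the single closest pair between the two sets (one point from each), a rough sketch using pdist2 from the Statistics and Machine Learning Toolbox would look like the following; the variable names are just for illustration:
set1 = rand(100, 2);
set2 = rand(100, 2);
D = pdist2(set1, set2);            % D(i,j) = distance between set1(i,:) and set2(j,:)
[minDist, idx] = min(D(:));        % smallest entry over all pairs
[i1, i2] = ind2sub(size(D), idx);  % i1 indexes set1, i2 indexes set2
point1 = set1(i1, :);
point2 = set2(i2, :);
This computes every pairwise distance at once, so it is vectorized but still O(n1*n2) in memory and time.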
Given the (lat, lon) coordinates of a group of n locations on the surface of the earth, find a (lat, lon) point c and a value of r > 0 such that we maximize the density, d, of locations per square mile, say, in the surface area described and contained by the circle defined by c and r.
At first I thought maybe you could solve this using linear programming. However, the density depends on the area, which depends on r squared (a quadratic term), so I don't think the problem is amenable to linear programming.
Is there a known method for solving this kind of thing? Suppose you simplify the problem to (x, y) coordinates on the Cartesian plane. Does that make it easier?
You've got two variables c and r that you're trying to find so as to maximize the density, which is a function of c and r (and of the locations, which are constant). So maybe a hill-climbing, gradient descent, or simulated annealing approach might work? You can make a pretty good guess for your first value: just use the centroid of the locations. I think the local maximum you reach from there would be a global maximum.
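As a rough illustration only (not a tested solution), a hill-climbing loop over the simplified Cartesian version could look like the sketch below; the example points, step sizes and the minimum radius rMin are all made-up values, with rMin only there to keep the circle from collapsing to zero area:
pts  = rand(200, 2) * 10;                        % example (x, y) locations
c    = mean(pts, 1);                             % initial guess: the centroid
r    = max(sqrt(sum((pts - c).^2, 2)));          % start with a circle covering everything
rMin = 0.1;
density = @(c, r) sum(sum((pts - c).^2, 2) <= r^2) / (pi * r^2);
step = r / 4;
best = density(c, r);
while step > 1e-4
    improved = false;
    % candidate moves: shift the centre or shrink/grow the radius
    for delta = [step 0 0; -step 0 0; 0 step 0; 0 -step 0; 0 0 step; 0 0 -step]'
        cNew = c + delta(1:2)';
        rNew = max(r + delta(3), rMin);
        dNew = density(cNew, rNew);
        if dNew > best
            best = dNew; c = cNew; r = rNew; improved = true;
        end
    end
    if ~improved
        step = step / 2;                         % no neighbour improved: refine the step
    end
end
fprintf('centre (%.2f, %.2f), radius %.2f, density %.2f points per unit area\n', c, r, best);
Simulated annealing would follow the same skeleton but occasionally accept a worse move to escape local maxima.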
Steps:
Cluster your points using a density based clustering algorithm1;
Calculate the density of each cluster;
Recursively (or iteratively) sub-cluster the points in the most dense cluster;
The algorithm has to ignore the outliers and put them in a cluster of their own. This way, all the outliers with high density will be kept and outliers with low density will be weeded out.
Keep track of the cluster with the highest density observed so far. Return when you finally reach a cluster made of a single point (a rough sketch follows the notes below).
This algorithm will only work when you have clusters like the ones shown below, as the recursive exploration will result in similarly shaped clusters:
The algorithm will fail with awkwardly shaped clusters like this one: even though the triangles are the most densely placed, when you calculate the density over the donut shape they will report a far lower density with respect to the circle centered at [0, 0]:
1. One density based clustering algorithm that will work for you is DBSCAN.
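As a very rough sketch of these steps (not a drop-in solution), using MATLAB's dbscan from the Statistics and Machine Learning Toolbox and a crude bounding-box density; epsilon, minpts and the halving of epsilon at each recursion are assumptions you would have to tune for your data:
% save as densestRegion.m
function [bestPts, bestDensity] = densestRegion(pts, epsilon, minpts)
    bestPts = pts;
    bestDensity = boxDensity(pts);
    labels = dbscan(pts, epsilon, minpts);        % label -1 marks outliers, which we ignore
    for k = unique(labels(labels > 0))'
        sub = pts(labels == k, :);
        if size(sub, 1) < size(pts, 1)            % only recurse if the cluster actually shrank
            [cand, d] = densestRegion(sub, epsilon / 2, minpts);
            if d > bestDensity
                bestPts = cand; bestDensity = d;
            end
        end
    end
end

function d = boxDensity(pts)
    % points per unit area of the bounding box (a crude density proxy)
    area = prod(max(pts, [], 1) - min(pts, [], 1) + eps);
    d = size(pts, 1) / area;
end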
I'm trying to code the livewire algorithm, but I'm a little stuck because the algorithm explained in the article "Intelligent Scissors for Image Composition" is a little messy and I don't completely understand how to apply certain things, for example how to calculate the local cost map and other details.
So please, can anyone give me a hand and explain it step by step in simple words?
I would appreciate any help.
Thanks.
You should read Mortensen, Eric N., and William A. Barrett. "Interactive segmentation with intelligent scissors." Graphical models and image processing 60.5 (1998): 349-384. which contains more details about the algorithm than the shorter paper "Intelligent Scissors for Image Composition."
Here is a high-level overview:
The Intelligent Scissors algorithm uses a variant of Dijkstra's graph search algorithm to find a minimum cost path from a seed pixel to a destination pixel (the position of the mouse cursor during interactive segmentation).
1) Local costs
Each edge from a pixel p to a pixel q has a local cost, which is a weighted sum of the following component costs (adjusted by the distance between p and q to account for diagonal pixels):
Laplacian zero-crossing f_Z(q)
Gradient magnitude f_G(q)
Gradient direction f_D(p,q)
Edge pixel value f_P(q)
Inside pixel value f_I(q)
Outside pixel value f_O(q)
Some of these local costs are static and can be computed offline. f_Z and f_G are computed at different scales (meaning with different-size kernels) to better represent the edge at a pixel q. f_P, f_I, and f_O are computed dynamically (and f_G also has a dynamic component) for on-the-fly training.
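For the static part only, a rough sketch with standard Image Processing Toolbox calls might look like this; the weights wZ and wG are just example numbers, not the ones from the paper:
I = im2double(imread('cameraman.tif'));          % any grayscale image
% Gradient magnitude cost f_G: strong edges should be cheap to follow,
% so the cost is 1 minus the normalised gradient magnitude.
[Gmag, ~] = imgradient(I);
fG = 1 - Gmag ./ max(Gmag(:));
% Laplacian zero-crossing cost f_Z: 0 on zero-crossings of the Laplacian
% of Gaussian (which sit on edges), 1 elsewhere.
fZ = 1 - double(edge(I, 'log'));
% Combined static cost per pixel, with made-up example weights.
wZ = 0.43; wG = 0.43;
staticCost = wZ * fZ + wG * fG;
imshow(staticCost, []);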
2) On-the-fly training
To prevent snapping to a different edge with a lower cost than the current one being followed, the algorithm uses on-the-fly training to assign a lower cost to neighboring pixels that "look like" past pixels along the current edge.
This is done by building a histogram of image value features along the last 64 or 128 edge pixels. The image value features are computed by scaling and rounding f'_G (where f_G = 1 - f'_G), f_P, f_I, and f_O so as to have integer values in [0, 255] or [0, 1023], which can be used to index the histograms.
The histograms are inverted and scaled to compute dynamic cost maps m_G, m_P, m_I, and m_O. The idea is that a low cost neighbor q should fit in the histogram of the 64 or 128 pixels previously seen.
The paper gives pseudo code showing how to compute these dynamic costs given a list of previously chosen pixels on the path.
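Here is a sketch of the histogram idea only (not the paper's exact formulation), reusing Gmag and fZ from the sketch above; the 64 "path pixels" are faked from the zero-crossing map just to make the snippet run:
nBins      = 256;                                        % gradient values scaled to [0, 255]
gradScaled = round(255 * Gmag ./ max(Gmag(:)));
pathIdx    = find(fZ == 0, 64);                          % stand-in for the last 64 path pixels
h          = histcounts(gradScaled(pathIdx), 0:nBins);   % feature histogram along the path
mG_lut     = 1 - h ./ max(max(h), 1);                    % frequent training values -> low cost
mG         = mG_lut(gradScaled + 1);                     % dynamic cost map for every pixel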
3) Graph search
The static and dynamic costs are combined together into a single cost to move from pixel p to one of its 8 neighbors q. Finding the lowest cost path from a seed pixel to a destination pixel is done by essentially using Dijkstra's algorithm with a min-priority queue. The paper gives pseudo code.
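To give an idea of the search step without the interactive parts, here is a sketch that builds an 8-connected pixel graph from the static cost above and lets MATLAB's digraph/shortestpath do the Dijkstra work; the seed and free-point coordinates are arbitrary, and the paper's interactive version uses its own priority-queue Dijkstra instead:
cost = staticCost;                     % from the earlier sketch
[nr, nc] = size(cost);
[src, dst, wts] = deal([]);            % edge lists, grown in a loop for brevity
for dy = -1:1
    for dx = -1:1
        if dy == 0 && dx == 0, continue; end
        [X, Y] = meshgrid(1:nc, 1:nr);
        valid = X+dx >= 1 & X+dx <= nc & Y+dy >= 1 & Y+dy <= nr;
        p = sub2ind([nr nc], Y(valid), X(valid));        % source pixels
        q = sub2ind([nr nc], Y(valid)+dy, X(valid)+dx);  % their 8-neighbours
        wts = [wts; hypot(dx, dy) * cost(q)];            % diagonal moves scaled by sqrt(2)
        src = [src; p];
        dst = [dst; q];
    end
end
G = digraph(src, dst, wts);
seed = sub2ind([nr nc], 10, 10);                         % arbitrary seed pixel
free = sub2ind([nr nc], 200, 200);                       % arbitrary "free point" pixel
pathPix  = shortestpath(G, seed, free);                  % minimum-cost path (Dijkstra inside)
[py, px] = ind2sub([nr nc], pathPix);
imshow(cost, []); hold on; plot(px, py, 'r', 'LineWidth', 1.5);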
I am attaching here an image for which I need to calculate the number of blobs and compute the area of each blob separately. I am using MATLAB for this.
Black regions have index value '0' and the white background has index value '1'.
Thank you in advance. It would be great if someone could help me with this.
The main problem is to find the number of blobs. For that, I'd rather use k-means clustering. It would take too long to explain what k-means clustering does, how it works and so on, so I'll jump straight to the point: the k-means algorithm groups n points into k groups (clusters). The result is a partitioned space: a given point cannot be in two clusters at the same time, and a cluster is identified by its centroid (the mean point).
So let's import the image and find all x and y coordinates for black points: these are indeed the points we want to cluster.
I=imread('image.jpg');
BW = im2bw(I, graythresh(I));
[x,y]=find(BW==0);
Now it's time to trigger the k-means algorithm in order to group these points. Since we don't know k, that is the number of blobs, we can take some sort of brute-force approach. We select some candidate values for k and apply k-means clustering to all of them. Later on, we select the best k by means of the Elbow Method: we plot the so-called Within-Cluster Sum of Squares (that is, the sum of all the distances between points and their respective centroid) and select the k value such that adding another cluster doesn't give much better modeling of the data.
for k=1:10
[idx{k},C{k},sumd{k}] = kmeans([x y],k,'Replicates',20);
end
sumd=cellfun(@(x) sum(x(:)),sumd);
The code above performs k-means for k in the range [1, 10]. Since in standard k-means the first centroids are randomly selected amongst the points in our dataset (i.e. the black points), we repeat k-means 20 times for each value of k and the algorithm automatically returns the best result amongst the 20 repetitions. These results are idx, a vector of n elements (where n is the number of black points) that contains in its j-th position the centroid ID for the j-th black point; C, the centroid coordinates; and sumd, the sum of squares.
We then plot the sum of squares vs the k candidates:
figure(6);
plot(1:10,sumd,'*-');
and we obtain something like:
According to the Elbow Method explained above, 6 is the optimal number of clusters: indeed after 6 the plot tends to be rather horizontal.
So from the arrays above, we select the 6th element:
best_k=6;
best_idx=idx{best_k};
best_C=C{best_k};
and the returned clusters are
gscatter(x,y,best_idx); hold on;
plot(best_C(:,1),best_C(:,2),'xk','LineWidth',2);
Note: the image is rotated because plot() handles matrices (coordinates) differently from imshow(). Also, the black crosses mark the centroids of each cluster.
And finally by counting the number of points per cluster, we gather the area of the cluster itself (i.e. the blob):
for m=1:best_k
Area(m)=sum(best_idx==m);
end
Area =
1619 46 141 104 584 765
Obviously the i-th item in Area is the area of the i-th cluster, as reported by the legend.
Further readings
In this Wikipedia link you can find some more details regarding the determination of the number of clusters (the "best k") in the k-means algorithm. Amongst these methods you can find the Elbow Method as well. As @rayryeng correctly pointed out in the comments below, the Elbow plot is just a heuristic: in some datasets you cannot clearly spot a "knee" in the curve... we've been lucky, though!
Last but not least, if you want to know more about the k-means algorithm, please have a look at @rayryeng's answer linked below in the comments: it's a brilliantly detailed answer that not only describes the algorithm itself, but also covers the repetitions I've set in the code, the randomly selected initial centroids, and all the other aspects I've skipped in order to avoid an endless answer.
Can anyone suggest some clustering algorithm which can work with distance matrix as an input? Or the algorithm which can assess the "goodness" of the clustering also based on the distance matrix?
At this moment I'm using a modification of Kruskal's algorithm (http://en.wikipedia.org/wiki/Kruskal%27s_algorithm) to split data into two clusters. It has a problem though. When the data has no distinct clusters the algorithm will still create two clusters with one cluster containing one element and the other containing all the rest. In this case I would rather have one cluster containing all the elements and another one which is empty.
Are there any algorithms which are capable of doing this type of clustering?
Are there any algorithms which can estimate how well the clustering was done or even better how many clusters are there in the data?
The algorithms should work only with distance(similarity) matrices as an input.
Or the algorithm which can assess the "goodness" of the clustering also based on the distance matrix?
KNN should be useful in assessing the “goodness” of a clustering assignment. Here's how:
Given a distance matrix with each point labeled according to the cluster it belongs to (its “cluster label”):
Test the cluster label of each point against the cluster labels implied from k-nearest neighbors classification
If the k-nearest neighbors imply an alternative cluster, that classified point lowers the overall “goodness” rating of the cluster
Sum up the "goodness rating" contributions from each one of your points to get a total "goodness rating" for the whole clustering
Unlike k-means cluster analysis, your algorithm will return information about poorly categorized points. You can use that information to reassign certain points to a new cluster thereby improving the overall "goodness" of your clustering.
Since the algorithm knows nothing about the placement of the centroids of the clusters, and hence nothing about the global cluster density, the only way to ensure clusters that are both locally and globally dense would be to run the algorithm for a range of k values and find an arrangement that maximizes the goodness over that range.
For a significant number of points, you'll probably need to optimize this algorithm, possibly with a hash table to keep track of the nearest points relative to each point. Otherwise this algorithm will take quite a while to compute.
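A minimal sketch of this check in MATLAB, using only the distance matrix and a crude majority-vote version of the idea (k is left as a free parameter you would sweep, as suggested above):
% save as knnGoodness.m
function g = knnGoodness(D, labels, k)
    % D: n-by-n distance matrix, labels: n-by-1 cluster labels
    n = numel(labels);
    agree = false(n, 1);
    for i = 1:n
        d = D(i, :);
        d(i) = inf;                                  % a point is not its own neighbour
        [~, order] = sort(d);
        neighbourLabels = labels(order(1:k));        % labels of the k nearest neighbours
        agree(i) = mode(neighbourLabels) == labels(i);
    end
    g = mean(agree);                                 % fraction of points whose neighbours agree
end
For example, g = knnGoodness(squareform(pdist(pts)), labels, 5); the points where the majority vote disagrees are the poorly categorized ones you could consider reassigning.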
Some approaches that can be used to estimate the number of clusters are:
Minimum Description Length
Bayesian Information Criterion
The gap statistic
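If all you have is the distance matrix, one hedged option in MATLAB is to embed it with classical MDS (cmdscale) and then let evalclusters apply the gap statistic; the 1:10 candidate range below is an assumption:
D     = squareform(pdist(rand(100, 2)));        % example distance matrix (replace with yours)
pts   = cmdscale(D);                            % coordinates consistent with D
eva   = evalclusters(pts, 'kmeans', 'gap', 'KList', 1:10);
bestK = eva.OptimalK;                           % estimated number of clusters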
scipy.cluster.hierarchy runs 3 steps, just like Matlab(TM) clusterdata:
import scipy.spatial.distance
import scipy.cluster.hierarchy as hier

Y = scipy.spatial.distance.pdist( pts )   # you have this already
Z = hier.linkage( Y, method )             # N-1 merge steps
T = hier.fcluster( Z, ncluster, criterion=criterion )
Here linkage might be a modified Kruskal, dunno.
This SO answer
(ahem) uses the above.
As a measure of clustering, radius = RMS distance to cluster centre is fast and reasonable for 2d/3d points.
Tell us about your Npt, ndim, ncluster, hier/flat ?
Clustering is a largish area, one size does not fit all.