clustering a set of rectangles - algorithm

I have a set of rectangles which I need to cluster together, based on the Euclidean distance between them. The situation is explained in the attached image.
One possible approach is to take the center of each rectangle and cluster the center points using k-means (the distance function would be Euclidean distance in the XY plane). However, I would like to know if there is any other approach to this problem, one which does not approximate a rectangle by its center point but takes the actual shape of the rectangle into consideration.

Have a look at algorithms such as DBSCAN and OPTICS that can be used with arbitrary data types as long as you can define a distance between them (such as the minimum rectangle-to-rectangle distance).
K-means is probably not so good, as it is designed for point data with squared Euclidean distance (i.e. it minimizes the within-cluster sum of squares / variance).

One way to formulate this problem is to consider each pair of rectangles (i, j), compute a distance d(i, j) between them, and form a distance matrix from those values. This distance measure d could be the distance between rectangle centers, or something fancier like the distance between the closest points of the two rectangles.
Then apply a clustering algorithm that takes a distance matrix as input, where your distance matrix D is the matrix whose element (i, j) is d(i, j).
Related: Clustering with a distance matrix
Anony-Mousse's answer has some nice suggestions for algorithms you could use to cluster given the distance matrix.
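As a rough sketch of that formulation (assuming scikit-learn, axis-aligned rectangles given as (x_min, y_min, x_max, y_max) tuples, and an eps threshold that is purely illustrative and would need tuning), you could build the distance matrix from the minimum rectangle-to-rectangle gap and hand it to DBSCAN with a precomputed metric:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def rect_gap_distance(a, b):
        # a, b are (x_min, y_min, x_max, y_max); minimum Euclidean distance
        # between the two rectangles (0 if they overlap or touch).
        dx = max(a[0] - b[2], b[0] - a[2], 0.0)
        dy = max(a[1] - b[3], b[1] - a[3], 0.0)
        return np.hypot(dx, dy)

    def cluster_rectangles(rects, eps=10.0):
        n = len(rects)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = rect_gap_distance(rects[i], rects[j])
        # metric='precomputed' tells DBSCAN that D already contains distances;
        # min_samples=1 means no rectangle is discarded as noise.
        return DBSCAN(eps=eps, min_samples=1, metric='precomputed').fit_predict(D)

Rectangles with the same label end up in the same cluster; two rectangles closer than eps (directly or through a chain of neighbours) are grouped together.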

We used Spectral Clustering with left_x, right_x, top_y, bottom_y coordinates as features with pretty good results.
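For reference, a minimal sketch of that setup with scikit-learn (the coordinates and the n_clusters value below are invented for illustration; spectral clustering needs the number of clusters up front):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    # One row per rectangle: [left_x, right_x, top_y, bottom_y]
    features = np.array([
        [0, 10, 10, 0],
        [2, 12, 11, 1],
        [90, 105, 60, 45],
        [95, 110, 58, 44],
    ])

    labels = SpectralClustering(n_clusters=2, random_state=0).fit_predict(features)
    print(labels)  # e.g. [0 0 1 1]: the two left rectangles vs. the two right ones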

Related

Distance from ellipses to static set of polygons

I have a static set of simple polygons (they may be nonconvex but are not self-intersecting) and a large number of query ellipses. Assume that this is all being done in 2D. I need to find the distance between each ellipse and the closest polygon to that ellipse. Distance is defined as the shortest distance between any two points on the ellipse and polygon respectively. If the ellipse intersects a polygon, then we can say the distance is 0 or assign some negative value.
A brute force approach would simply compute the distance between each ellipse and each polygon and return the lowest distance, in O(mn) time where m is the number of polygons and n is the average number of vertices per polygon. I would like to reduce the m term here because I think I can cull the number of polygons being considered through some spatial analysis.
I've considered a few approaches, including Voronoi diagrams, R-trees and kd-trees. However, most of these seem to involve points, and I'm not sure how to extend them to polygons. I think the most promising approach involves computing bounding boxes for each polygon and the ellipse and using an R-tree to find some set of nearby polygons. However, I'm not quite sure about the best way to find this close set of polygons. Or perhaps there's a better way that I'm overlooking.
Using bounding boxes or disks has the benefit of reducing the work of computing an ellipse/polygon distance estimate to O(1). It also lets you obtain a lower and an upper bound on the true distance.
Assume you use disks and also enclose the ellipse in a disk. You will need to perform a modified nearest neighbors search, that enumerates the disks such that the lower bound of their distance to the query disk is smaller than the best upper bound found so far.
This can be accelerated by means of a k-D tree (D=2) built on the disk centers. You can enhance every node in the k-D tree with the radius of the largest and smallest disks in the subtree it roots. During the search you will use this information to evaluate the bounds without knowing the exact radii of the disks, but the deeper you go in the tree, the better you know them.
Perform a search to obtain the tightest upper bound on the distance, then a second search to enumerate all the disks with a lower bound smaller than the tightest upper bound. This will reduce the number of disks to be considered.
You can also use bounding boxes, and store the min/max width/height of the boxes in the tree nodes.
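A minimal sketch of the disk-based bounds (names are illustrative; each shape is assumed to have already been reduced to an enclosing disk with a center and radius):

    import math

    def disk_distance_bounds(c1, r1, c2, r2):
        # Lower and upper bounds on the distance between two shapes,
        # each enclosed in a disk (center c, radius r).
        center_dist = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
        lower = max(center_dist - r1 - r2, 0.0)  # the shapes can be no closer than this
        upper = center_dist + r1 + r2            # and no farther apart than this
        return lower, upper

During the nearest-neighbour search you can discard any disk whose lower bound already exceeds the best upper bound found so far.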

Find smallest circle with maximum density of points from a given set

Given the (lat, lon) coordinates of a group of n locations on the surface of the earth, find a (lat, lon) point c and a value r > 0 such that we maximize the density d of locations per square mile, say, within the circle defined by c and r.
At first I thought maybe you could solve this using linear programming. However, density depends on area, which depends on r squared, a quadratic term, so I don't think the problem is amenable to linear programming.
Is there a known method for solving this kind of thing? Suppose you simplify the problem to (x, y) coordinates on the Cartesian plane. Does that make it easier?
You've got two variables c and r that you're trying to find so as to maximize the density, which is a function of c and r (and the locations, which is a constant). So maybe a hill-climbing, gradient descent, or simulated annealing approach might work? You can make a pretty good guess for your first value. Just use the centroid of the locations. I think the local maximum you reach from there would be a global maximum.
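As a rough sketch of that idea (plain NumPy, Cartesian coordinates; the step sizes, iteration count, and the minimum radius r_min, which keeps the density from blowing up as r shrinks toward zero, are all hypothetical tuning knobs):

    import numpy as np

    def density(points, c, r):
        # points per unit area inside the circle (c, r)
        inside = np.linalg.norm(points - c, axis=1) <= r
        return inside.sum() / (np.pi * r * r)

    def hill_climb_densest_circle(points, r_min=1.0, steps=5000, seed=0):
        rng = np.random.default_rng(seed)
        c = points.mean(axis=0)              # start at the centroid, as suggested above
        r = max(points.std(), r_min)         # a rough starting radius
        best = density(points, c, r)
        for _ in range(steps):
            c_new = c + rng.normal(scale=0.5, size=2)      # perturb the centre
            r_new = max(r + rng.normal(scale=0.5), r_min)  # perturb the radius
            d = density(points, c_new, r_new)
            if d > best:                                   # keep only improvements
                c, r, best = c_new, r_new, d
        return c, r, best

Whether the local maximum this reaches is global depends entirely on how the locations are distributed.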
Steps:
Cluster your points using a density based clustering algorithm1;
Calculate the density of each cluster;
Recursively (or iteratively) sub-cluster the points in the most dense cluster;
The algorithm has to ignore the outliers, putting them in a cluster of their own. This way, all the outliers with high density will be kept and outliers with low density will be weeded out.
Keep track of the cluster with highest density observed till now. Return when you finally reach a cluster made of a single point.
This algorithm will work only when you have clusters like the ones shown below, as the recursive exploration will result in similarly shaped clusters:
The algorithm will fail with awkwardly shaped clusters like this one: even though the triangles are the most densely placed points when you calculate the density over the donut shape, they will report a far lower density with respect to the circle centered at [0, 0]:
1. One density based clustering algorithm that will work for you is DBSCAN.
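A sketch of the recursion described in the steps above, using DBSCAN from scikit-learn (the eps and min_samples values are hypothetical, and density here is approximated as points per bounding-box area, which is a simplification):

    import numpy as np
    from sklearn.cluster import DBSCAN

    def bbox_density(pts):
        # points per unit area of the cluster's bounding box (a crude density proxy)
        if len(pts) < 2:
            return 0.0
        span = pts.max(axis=0) - pts.min(axis=0)
        return len(pts) / max(span[0] * span[1], 1e-9)

    def densest_subcluster(pts, eps=1.0, min_samples=3):
        best_pts, best_density = pts, bbox_density(pts)
        while True:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(best_pts)
            cluster_ids = set(labels[labels != -1])       # -1 marks outliers; ignore them
            if len(cluster_ids) == 0:
                return best_pts, best_density             # everything became noise: stop
            candidate = max((best_pts[labels == k] for k in cluster_ids), key=bbox_density)
            cand_density = bbox_density(candidate)
            if cand_density <= best_density or len(candidate) == len(best_pts):
                return best_pts, best_density             # no denser sub-cluster found
            best_pts, best_density = candidate, cand_density
            eps /= 2                                      # tighten eps to sub-cluster further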

Two closest points in Manhattan distance

I'm wondering about Manhattan distance. It is very specific, and (I don't know if it's a good word) simple. For example, when we are given a set of n points in this metric, it is very easy to find the distance between the two farthest points, in linear time. But is it also easy to find the two closest points?
I heard that there exists a universal algorithm for finding the two closest points in any metric, but it's complicated. I'm wondering if in this situation (Manhattan metric) it is possible to use the special properties of this distance and come up with a simpler algorithm that will be friendlier to implement.
EDIT: n points on a plane, and let's say -10^9 <= x,y <= 10^9 for all points.
Assuming you're talking about n points on a plane, find among the coordinates the minimal and maximal values of the x and y coordinates. Create a matrix sized (maxX-minX) x (maxY-minY), such that every point is representable by a cell in the matrix. Fill the matrix with the n given points (not all cells will be filled; set NaN there, for example). Scan the matrix: the shortest distance is between adjacent filled cells in the matrix (there might be several such pairs).

find a point closest to other points

Given N points (in 2D) with x and y coordinates, you have to find a point P (among the N given points) such that the sum of distances from the other (N-1) points to P is minimum.
For example, given N points p1(x1,y1), p2(x2,y2), ..., pN(xN,yN),
we have to find a point P among p1, p2, ..., pN whose sum of distances from all other points is minimum.
I used a brute force approach, but I need a better one. I also tried finding the median, mean, etc., but that does not work for all cases.
Then I came up with the idea of treating the given points as the vertices of a polygon, finding the centroid of this polygon, and then choosing the given point nearest to the centroid. But I'm not sure whether the centroid minimizes the sum of its distances to the vertices of the polygon, so I'm not sure whether this is a good way. Is there any algorithm for solving this problem?
If your points are nicely distributed and if there are so many of them that brute force (calculating the total distance from each point to every other point) is unappealing the following might give you a good enough answer. By 'nicely distributed' I mean (approximately) uniformly or (approximately) randomly and without marked clustering in multiple locations.
Create a uniform k*k grid, where k is an odd integer, across your space. If your points are nicely distributed the one which you are looking for is (probably) in the central cell of this grid. For all the other cells in the grid count the number of points in each cell and approximate the average position of the points in each cell (either use the cell centre or calculate the average (x,y) for points in the cell).
For each point in the central cell, compute the distance to every other point in the central cell, and the weighted average distance to the points in the other cells. This will, of course, be the distance from the point to the 'average' position of points in the other cells, weighted by the number of points in the other cells.
You'll have to juggle the increased accuracy of higher values for k against the increased computational load and figure out what works best for your points. If the distribution of points across cells is far from uniform then this approach may not be suitable.
This sort of approach is quite widely used in large-scale simulations where points have properties, such as gravity and charge, which operate over distances. Whether it suits your needs, I don't know.
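A rough sketch of that grid approximation (plain NumPy; k and the "nicely distributed" assumption are as described above, and the function only scores points that happen to fall in the central cell):

    import numpy as np

    def approx_best_central_point(points, k=5):
        # Approximate the point in the central grid cell whose total distance
        # to all other points is smallest, using per-cell counts and mean
        # positions for the non-central cells.
        lo, hi = points.min(axis=0), points.max(axis=0)
        cells = np.minimum(((points - lo) / (hi - lo + 1e-12) * k).astype(int), k - 1)
        cell_ids = cells[:, 0] * k + cells[:, 1]
        center_id = (k // 2) * k + (k // 2)

        central = points[cell_ids == center_id]
        if len(central) == 0:
            return None  # the "answer is in the central cell" assumption failed

        # count and average position of the points in every other cell
        others = [(np.sum(cell_ids == cid), points[cell_ids == cid].mean(axis=0))
                  for cid in np.unique(cell_ids) if cid != center_id]

        best_point, best_cost = None, np.inf
        for p in central:
            cost = np.linalg.norm(central - p, axis=1).sum()                 # exact, within the cell
            cost += sum(n * np.linalg.norm(p - mean) for n, mean in others)  # weighted approximation
            if cost < best_cost:
                best_point, best_cost = p, cost
        return best_point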
The point in question is known as the geometric median.
The centroid or center of mass, defined similarly to the geometric median as minimizing the sum of the squares of the distances to each sample, can be found by a simple formula: its coordinates are the averages of the coordinates of the samples. No such formula is known for the geometric median, and it has been shown that no explicit formula, nor an exact algorithm involving only arithmetic operations and kth roots, can exist in general.
I'm not sure if I understand your question, but when you calculate the minimum spanning tree, the sum of the distances along the tree from any point to any other point is minimum.

How to perform spatial partitioning in n-dimensions?

I'm trying to design an implementation of Vector Quantization as a C++ template class that can handle different types and dimensions of vectors (e.g. 16-dimensional vectors of bytes, or 4-D vectors of doubles, etc.).
I've been reading up on the algorithms, and I understand most of it:
here and here
I want to implement the Linde-Buzo-Gray (LBG) Algorithm, but I'm having difficulty figuring out the general algorithm for partitioning the clusters. I think I need to define a plane (hyperplane?) that splits the vectors in a cluster so there is an equal number on each side of the plane.
[edit to add more info]
This is an iterative process, but I think I start by finding the centroid of all the vectors, then use that centroid to define the splitting plane, get the centroid of each of the sides of the plane, continuing until I have the number of clusters needed for the VQ algorithm (iterating to optimize for less distortion along the way). The animation in the first link above shows it nicely.
My questions are:
What is an algorithm to find the plane once I have the centroid?
How can I test a vector to see if it is on either side of that plane?
If you start with one centroid, then you'll have to split it, basically by duplicating it and slightly moving the two copies apart in an arbitrary direction. The plane is just the plane orthogonal to that direction.
But you don't need to compute that plane.
More generally, the region (i) is defined as the set of points which are closer to the centroid c_i than to any other centroid. When you have two centroids, each region is a half space, thus separated by a (hyper)plane.
How do you test a vector x to see which side of the plane it is on (with two centroids)?
Just compute the distances ||x-c1|| and ||x-c2||; the index of the minimum value (1 or 2) will give you the region the point x belongs to.
More generally, if you have n centroids, you would compute all the distances ||x-c_i||; the centroid that x is closest to (i.e., the one for which the distance is minimal) gives you the region x belongs to.
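In code, that test is just an argmin over the centroid distances (a minimal NumPy sketch, working for two centroids or n):

    import numpy as np

    def assign_region(x, centroids):
        # index i of the centroid c_i closest to x, i.e. the region x belongs to
        dists = np.linalg.norm(np.asarray(centroids) - np.asarray(x), axis=1)
        return int(np.argmin(dists))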
I don't quite understand the algorithm, but the second question is easy:
Let's call V a vector which extends from any point on the plane to the point-in-question. Then the point-in-question lies on the same side of the (hyper)plane as the normal N iff V·N > 0
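A tiny sketch of that test (the plane is assumed to be given by any point on it plus the normal N):

    import numpy as np

    def on_normal_side(point, plane_point, normal):
        # True if `point` lies on the same side of the (hyper)plane as the normal
        v = np.asarray(point, dtype=float) - np.asarray(plane_point, dtype=float)
        return float(np.dot(v, normal)) > 0.0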
