I'm wondering about the Manhattan distance. It is very specific and, if that's the right word, simple. For example, when we are given a set of n points in this metric, it is very easy to find the distance between the two farthest points in linear time. But is it also easy to find the two closest points?
I heard that there exists a universal algorithm for finding the two closest points in any metric, but it's complicated. I'm wondering whether in this situation (the Manhattan metric) it is possible to use the special properties of this distance to come up with an easier algorithm that is friendlier to implement.
EDIT: n points on a plane, and let's say -10^9 <= x, y <= 10^9 for all points.
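For reference, the linear-time farthest-pair computation works because |dx| + |dy| = max(|dx + dy|, |dx - dy|), so in the rotated coordinates u = x + y, v = x - y the L1 diameter is just the larger of the two coordinate spreads. A minimal sketch (the function name is mine):

```python
def manhattan_farthest_distance(points):
    """L1 diameter of a set of 2-D points in linear time.

    Uses |dx| + |dy| = max(|dx + dy|, |dx - dy|): the diameter is the
    larger spread of the rotated coordinates u = x + y and v = x - y.
    """
    u = [x + y for x, y in points]
    v = [x - y for x, y in points]
    return max(max(u) - min(u), max(v) - min(v))

print(manhattan_farthest_distance([(0, 0), (3, 1), (-2, 4)]))  # -> 8
```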
Assuming you're talking about n points on a plane, find the minimal and maximal values of the x and y coordinates. Create a matrix of size (maxX - minX) x (maxY - minY), such that every point is representable by a cell in the matrix. Fill the matrix with the n given points (not all cells will be filled; set NaN there, for example). Scan the matrix: the shortest distance is between adjacent filled cells (there might be several such pairs).
Suppose I have n points (in my case, 4 points) in 3 dimensions. I want to determine both the point a that minimizes the (sum of) squared distances to these n points, and the largest difference that can exist between the distances from an arbitrary point b to any two of these n points (i.e. the two "farthest points").
How can this be accomplished most efficiently? I know that, in 2 dimensions and with 3 points, the point minimizing the distance is the centroid of the triangle formed by the 3 points, and the largest difference can be found by taking a point located precisely at one (any?) of the 3 points. It seems the same should hold in 3 dimensions, although I am unsure.
I want to determine both the point that minimizes distance from each of these n points
The centroid minimizes the sum of the squared distances to every point in the set, but it will not minimize the maximum distance (the distance to the farthest point).
I suspect that you are interested in computing the center and radius of the minimal sphere containing every point in the set. This is a classic problem in computational geometry that can be solved in linear time: quite easily in an approximate way, or exactly with the algorithm proposed by Emmerich Welzl.
If the number of points is as small as 4, an approximate solution is to search for the pair of points with maximum distance (there are 6 possible pairs) and take the midpoint as center and half the distance as radius. Then ensure that the other two points are also inside the sphere, or grow it if necessary.
See more information at
https://en.wikipedia.org/wiki/Bounding_sphere
https://en.wikipedia.org/wiki/Smallest-circle_problem
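For completeness, here is a minimal sketch of the approximate approach described above (the growing step is essentially Ritter's algorithm); the function name is mine, and nothing here is specific to 4 points:

```python
import itertools
import math

def approx_bounding_sphere(points):
    """Approximate smallest enclosing sphere of a small 3-D point set."""
    # 1. Find the pair with maximum distance; use its midpoint as the
    #    center and half the distance as the radius.
    p, q = max(itertools.combinations(points, 2),
               key=lambda pair: math.dist(pair[0], pair[1]))
    center = [(a + b) / 2 for a, b in zip(p, q)]
    radius = math.dist(p, q) / 2
    # 2. Grow the sphere if any remaining point falls outside it.
    for s in points:
        d = math.dist(center, s)
        if d > radius:
            radius = (radius + d) / 2          # just enough to cover s
            t = (d - radius) / d               # shift the center toward s
            center = [c + t * (sc - c) for c, sc in zip(center, s)]
    return center, radius

print(approx_bounding_sphere([(0, 0, 0), (2, 0, 0), (1, 1, 0), (1, 0, 3)]))
```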
The largest difference between the distances from a point to two given points is achieved when the three points are collinear and the unknown point is "outside" the segment (there are infinitely many solutions). In this configuration, the difference is just the distance between the two given points.
If you mean to maximize all differences simultaneously (or rather the sum of differences), you must go to infinity in some direction. That direction maximizes the sum of the lengths of the projections of all edges.
Consider this question, related to graph theory:
Let G be a complete undirected graph (every vertex is connected to every other vertex) with an N x N distance matrix. Two "salesmen" travel as follows: the first always visits the nearest unvisited vertex, the second the farthest, until both have visited all the vertices. We must generate a matrix of distances and the starting points for the two salesmen (they can differ) such that:
All the distances are unique (Edit: positive integers).
The distance from a vertex to itself is always 0.
The difference between the total distance covered by the two salesmen must be a specific number, D.
The distance from A to B is equal to the distance from B to A
What efficient algorithms can be useful here? I can only think of backtracking, but I don't see any way to reduce the work done by the program.
Geometry is helpful.
Using the distances of points on a circle seems like it would work, and it seems like you could adjust D by making the circle radius larger or smaller.
Alternatively, really any 2D shape where the distances are all different could probably be used as well. In this case you would scale the shape up or down to obtain the correct D.
Edit: Now that I think about it, the simplest solution may be to simply pick N random 2D points, say with 32-bit integer coordinates to lower the chance of any two distances being too close to equal. If two distances are too close, just pick a different point for one of them until it's valid.
Ideally, you'd then just need to work out a formula for the relationship between D and the scaling factor, which I'm not sure of offhand. Failing that, you could use binary search or interpolation search to find the scaling factor that produces the required D, but that's slower.
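Here is a rough sketch of just the evaluation step: generating a candidate point set and computing D from the two greedy tours. I use squared Euclidean distances so the weights are integers whenever the coordinates are (an assumption on my part), and I omit the uniqueness check and the search over scaling factors; all names are mine.

```python
import random

def greedy_tour(dist, start, pick):
    """Length of the tour that always moves to the unvisited vertex chosen
    by `pick` (min for the 'nearest' salesman, max for the 'farthest')."""
    unvisited = set(range(len(dist))) - {start}
    total, cur = 0, start
    while unvisited:
        nxt = pick(unvisited, key=lambda v: dist[cur][v])
        total += dist[cur][nxt]
        unvisited.discard(nxt)
        cur = nxt
    return total

def evaluate_D(points, start_near=0, start_far=0):
    n = len(points)
    dist = [[(points[i][0] - points[j][0]) ** 2 +
             (points[i][1] - points[j][1]) ** 2
             for j in range(n)] for i in range(n)]
    return abs(greedy_tour(dist, start_far, max) -
               greedy_tour(dist, start_near, min))

pts = [(random.randrange(2 ** 31), random.randrange(2 ** 31)) for _ in range(8)]
print(evaluate_D(pts))
```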
To efficiently find the n nearest neighbors of a point in d-dimensional space, I selected the dimension with the greatest scatter (i.e. the coordinate in which the differences between points are largest). The whole range from the minimal to the maximal value in this dimension was split into k bins. Each bin contains the points whose coordinate (in this dimension) falls within the range of that bin. It was ensured that there are at least 2n points in each bin.
The algorithm for finding the n nearest neighbors of a point x is the following:
Identify the bin kx in which point x lies (its projection, to be precise).
Compute distances between x and all the points in bin kx.
Sort computed distances in ascending order.
Select the first n distances. The points to which these distances were measured are returned as the n nearest neighbors of x.
This algorithm does not work in all cases. When can the algorithm fail to compute the nearest neighbors?
Can anyone propose a modification of the algorithm to ensure proper operation in all cases?
Where KNN fails:
If the data is a jumble of different classes, then KNN will fail, because it will try to find the k nearest neighbours but all the points are effectively random.
Outlier points:
Say you have two clusters of different classes. Then if you have an outlier point as the query, KNN will assign it one of the classes even though the query point is far away from both clusters.
This fails because (any of) the n nearest neighbors of x could be in a different bin than x.
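A concrete instance of this failure, using made-up coordinates: the dimension of greatest scatter is x, split into two bins at x = 5, and the query sits just left of the boundary while its true nearest neighbor sits just right of it:

```python
import math

# Two bins over the x-axis (the dimension of greatest scatter), split at x = 5.
bin0 = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]   # x in [0, 5)
bin1 = [(5.1, 0.0), (7.0, 0.0), (9.0, 0.0)]   # x in [5, 10]

query = (4.9, 0.0)                             # projects into bin 0

best_local = min(bin0, key=lambda p: math.dist(p, query))
best_true = min(bin0 + bin1, key=lambda p: math.dist(p, query))
print(best_local)  # (3.0, 0.0), distance 1.9 -- what the bin-local search returns
print(best_true)   # (5.1, 0.0), distance 0.2 -- the actual nearest neighbor
```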
What do you mean by "not working"? You do understand that what you are doing is only an approximate method?
Try normalising the data before choosing the dimension; otherwise the scatter comparison makes no sense.
The best vector for discrimination or for clustering may not be one of the original dimensions, but some combination of dimensions.
Use PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis) to identify a discriminative dimension.
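If you go the PCA route, the projection onto the first principal component can replace the single "dimension with greatest scatter" before binning. A minimal numpy sketch (the function name is mine):

```python
import numpy as np

def principal_axis_projection(X):
    """Project points onto their first principal component: the direction
    of greatest scatter, which need not be a raw coordinate axis."""
    Xc = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]                             # 1-D coordinates along that axis
```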
Given N points (in 2D) with x and y coordinates, you have to find a point P (among the N given points) such that the sum of the distances from the other (N-1) points to P is minimal.
For example, the N given points are p1(x1,y1), p2(x2,y2), ..., pN(xN,yN).
We have to find a point P among p1, p2, ..., pN whose sum of distances from all the other points is minimal.
I used a brute-force approach, but I need a better one. I also tried the median, the mean, etc., but they do not work for all cases.
Then I came up with the idea of treating the points as the vertices of a polygon and finding the centroid of this polygon, then choosing the given point nearest to the centroid. But I'm not sure whether the centroid minimizes the sum of distances to the vertices of a polygon, so I'm not sure whether this is a good approach. Is there an algorithm for solving this problem?
If your points are nicely distributed, and if there are so many of them that brute force (calculating the total distance from each point to every other point) is unappealing, the following might give you a good enough answer. By 'nicely distributed' I mean (approximately) uniformly or (approximately) randomly distributed, without marked clustering in multiple locations.
Create a uniform k*k grid, where k is an odd integer, across your space. If your points are nicely distributed, the one you are looking for is (probably) in the central cell of this grid. For each of the other cells, count the number of points in it and approximate the average position of those points (either use the cell centre or calculate the average (x,y) of the points in the cell).
For each point in the central cell, compute the exact distance to every other point in the central cell, plus the weighted distance to each of the other cells: that is, the distance from the point to the 'average' position of the points in that cell, weighted by the number of points in it.
You'll have to juggle the increased accuracy of larger values of k against the increased computational load and figure out what works best for your points. If the distribution of points across cells is far from uniform, then this approach may not be suitable.
This sort of approach is quite widely used in large-scale simulations where points have properties, such as gravity and charge, which operate over distances. Whether it suits your needs, I don't know.
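A rough numpy sketch of this scheme, under the same 'nicely distributed' assumption; all names are mine, and the handling of an empty central cell is simplistic (it just returns None):

```python
import numpy as np

def approx_best_point(points, k=5):
    """Approximate the point minimizing total distance via a k*k grid
    (k odd): exact distances inside the central cell, count-weighted
    distances to the average position of every other cell."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    cell = np.minimum(((points - lo) / (hi - lo + 1e-12) * k).astype(int), k - 1)
    cid = cell[:, 0] * k + cell[:, 1]

    counts = np.bincount(cid, minlength=k * k).astype(float)
    cx = np.bincount(cid, weights=points[:, 0], minlength=k * k)
    cy = np.bincount(cid, weights=points[:, 1], minlength=k * k)
    centers = np.stack([cx, cy], axis=1) / np.maximum(counts, 1)[:, None]

    mid = (k // 2) * k + (k // 2)                 # index of the central cell
    candidates = points[cid == mid]

    best, best_total = None, np.inf
    for p in candidates:
        exact = np.linalg.norm(candidates - p, axis=1).sum()
        weights = counts.copy()
        weights[mid] = 0                          # central cell handled exactly
        approx = (weights * np.linalg.norm(centers - p, axis=1)).sum()
        if exact + approx < best_total:
            best, best_total = p, exact + approx
    return best

pts = np.random.rand(10000, 2)
print(approx_best_point(pts))
```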
The point in question is known as the Geometric Median.
The centroid or center of mass, defined similarly to the geometric median as minimizing the sum of the squared distances to each sample, can be found by a simple formula: its coordinates are the averages of the coordinates of the samples. No such formula is known for the geometric median, and it has been shown that no explicit formula, nor an exact algorithm involving only arithmetic operations and k-th roots, can exist in general.
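The standard iterative scheme for the geometric median is Weiszfeld's algorithm; a minimal numpy sketch follows. Since the question restricts P to the input points, you can finish by returning the input point nearest to the iterate, as in the usage lines at the end:

```python
import numpy as np

def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld's fixed-point iteration for the (unconstrained)
    geometric median of an (N, d) array of points."""
    y = points.mean(axis=0)                      # start from the centroid
    for _ in range(iters):
        d = np.linalg.norm(points - y, axis=1)
        if np.any(d < eps):                      # landed on a sample point
            return y
        w = 1.0 / d                              # inverse-distance weights
        y_new = (points * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            return y_new
        y = y_new
    return y

pts = np.random.rand(100, 2)
m = geometric_median(pts)
best_input_point = pts[np.argmin(np.linalg.norm(pts - m, axis=1))]
```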
I'm not sure if I understand your question, but when you calculate the minimum spanning tree, the sum from any point to any other point along the tree is minimal.
Let's say we want to Voronoi-partition a rectangular surface with N points.
The Voronoi tessellation results in N regions corresponding to the N points.
For each region, we calculate its area and divide it by the total area of the whole surface - call these numbers a1, ..., aN. Their sum equals unity.
Suppose now we have a preset list of N numbers, b1, ..., bN, their sum equaling unity.
How can one find a choice (any) of the coordinates of the N points for Voronoi partitioning, such that a1==b1, a2==b2, ..., aN==bN?
Edit:
After thinking about this a bit, maybe Voronoi partitioning isn't the best solution here; the whole point is to come up with a random irregular division of the surface such that the N regions have the appropriate sizes. Voronoi seemed to me like the logical choice, but I may be mistaken.
I'd go for some genetic algorithm.
Here is the basic process:
1) Create 100 sets of random points that belong in your rectangle.
2) For each set, compute the Voronoi diagram and the cell areas
3) For each set, evaluate how well it compares with your preset weights (call it its score)
4) Sort sets of points by score
5) Dump the 50 worst sets
6) Create 50 new sets out of the 50 remaining sets by mixing points and adding some random ones.
7) Jump to step 2 until you meet a stopping condition (score above a threshold, iteration count, time spent, etc.)
You will end up (hopefully) with a "somewhat appropriate" result.
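A rough, numpy-only sketch of this loop; to keep it short, it estimates cell areas by assigning a grid of sample locations to their nearest generating point rather than computing the exact tessellation, and all constants and names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
N, POP, GENS = 5, 100, 200
W, H = 1.0, 1.0                                  # rectangle dimensions
b = np.full(N, 1.0 / N)                          # preset target area fractions

def estimate_areas(pts, res=64):
    """Approximate Voronoi cell areas by assigning a res*res grid of
    sample locations to the nearest generating point."""
    gx, gy = np.meshgrid(np.linspace(0, W, res), np.linspace(0, H, res))
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
    owner = np.argmin(((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1), axis=1)
    return np.bincount(owner, minlength=N) / len(grid)

def score(pts):                                  # lower is better
    return np.abs(estimate_areas(pts) - b).sum()

pop = [rng.random((N, 2)) * [W, H] for _ in range(POP)]    # step 1
for _ in range(GENS):
    pop.sort(key=score)                          # steps 2-4: evaluate and sort
    survivors = pop[:POP // 2]                   # step 5: dump the worst half
    children = []
    while len(children) < POP - len(survivors):  # step 6: mix two parents
        i, j = rng.choice(len(survivors), size=2, replace=False)
        mask = rng.random(N) < 0.5
        child = np.where(mask[:, None], survivors[i], survivors[j])
        child += rng.normal(0.0, 0.01, size=child.shape)   # random mutation
        children.append(np.clip(child, 0.0, [W, H]))
    pop = survivors + children                   # step 7: loop until done

print(score(pop[0]))
```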
If what you are looking for does not necessarily have to be a Voronoi tessellation, and could be a Power diagram, there is a nice algorithm described in the following article:
F. Aurenhammer, F. Hoffmann, and B. Aronov, "Minkowski-type theorems and least-squares clustering," Algorithmica, 20:61-76 (1998).
Their version of the problem is as follows: given N points (p_i) in a polygon P, and a set of non-negative real numbers (a_i) summing to the area of P, find weights (w_i), such that the area of the intersection of the Power cell Pow_w(p_i) with P is exactly a_i. In Section 5 of the paper, they prove that this problem can be written as a convex optimization problem. To implement this approach, you need:
software to compute Power diagrams efficiently, such as CGAL, and
software for convex optimization. I found that quasi-Newton solvers such as L-BFGS give very good results in practice.
I have some code on my webpage that does exactly this, under the name "quadratic optimal transport". However, this code is not very clean nor very well documented, so it might be as fast to implement your own version of the algorithm. You can also look at my SGP2011 paper on this topic, which is available on the same page, for a short description of the implementation of Aurenhammer, Hoffmann and Aronov's algorithm.
Assume coordinates where the rectangle is axis-aligned, with the left edge at x = 0, the right edge at x = 1, and the horizontal bisector at y = 0. Let B(0) = 0 and B(i) = b1 + ... + bi. Put points at ((B(i-1) + B(i))/2, 0). That isn't right: we want the x coordinates xi to satisfy bi = (x(i+1) - x(i-1))/2 for the interior points, with the first and last equations adjusted for the walls, i.e. b1 = (x1 + x2)/2 and bN = 1 - (x(N-1) + xN)/2. This is a tridiagonal system and has an easy solution. Perhaps you don't want such a boring Voronoi diagram, though; it will be a bunch of vertical strips.
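A short sketch of that construction; it uses the observation that the cell boundaries (the midpoints of consecutive points) must land at the cumulative sums of the bi, which fixes every coordinate once x1 is chosen. For very uneven bi the default choice of x1 may push a later xi outside its cell, so some care is needed; the function name is mine:

```python
import numpy as np

def collinear_voronoi_x(b, x1=None):
    """Place x1 < ... < xN on [0, 1] so that the Voronoi cell of xi has
    width bi. The cell boundaries are the midpoints (xi + x(i+1))/2 and
    must sit at the cumulative sums of b; that fixes every coordinate
    once x1 is chosen (default: the middle of the first cell)."""
    b = np.asarray(b, dtype=float)
    m = np.cumsum(b)[:-1]                    # interior cell boundaries
    x = np.empty(len(b))
    x[0] = b[0] / 2 if x1 is None else x1    # one free choice in cell 1
    for i in range(len(m)):
        x[i + 1] = 2 * m[i] - x[i]           # from xi + x(i+1) = 2 m(i)
    return x

print(collinear_voronoi_x([0.1, 0.2, 0.3, 0.4]))   # [0.05 0.15 0.45 0.75]
```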
For a more random-looking diagram, maybe something physics-inspired: drop points randomly, compute the Voronoi diagram, compute the area of each cell, make overweight cells attractive to their neighbors' points and underweight cells repulsive, compute a small displacement for each point, and repeat until equilibrium is reached.
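A crude sketch of that relaxation, reusing the grid-sampling area estimate from the genetic-algorithm sketch above. The update rule (each point drifts toward overweight cells and away from underweight ones) is my own reading of the suggestion, and convergence is not guaranteed:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
b = np.full(N, 1.0 / N)                 # target area fractions
pts = rng.random((N, 2))                # drop points randomly in the unit square

def estimate_areas(pts, res=64):        # same grid-sampling estimate as above
    gx, gy = np.meshgrid(np.linspace(0, 1, res), np.linspace(0, 1, res))
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
    owner = np.argmin(((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1), axis=1)
    return np.bincount(owner, minlength=N) / len(grid)

step = 0.05
for _ in range(500):
    err = estimate_areas(pts) - b       # overweight cells have err > 0
    # Each point drifts toward overweight cells and away from underweight ones.
    delta = ((pts[None, :, :] - pts[:, None, :]) * err[None, :, None]).sum(axis=1)
    pts = np.clip(pts + step * delta, 0.0, 1.0)

print(np.abs(estimate_areas(pts) - b).sum())
```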
The Voronoi tessellation can be computed by computing the minimum spanning tree and removing the longest edges. Each center of a subtree of the MST is then a point of the Voronoi diagram. Thus the Voronoi diagram is a subset of the minimum spanning tree.