Finding Closest Pair using Manhattan Distance - algorithm

I'm trying to implement the Closest Pair algorithm with Manhattan distance. With Euclidean distance it works fine, but with Manhattan distance it gives the wrong result. CLRS Exercise 33.4-3 asks us to replace the Euclidean distance with the Manhattan distance and says that only one line needs to change, but it isn't clear to me what modification is needed in the code below.
lst = [(2,2),(4,2),(5,3)]
min_dist = float("inf")
for i in range(len(lst)):
    for j in range(i + 1, len(lst)):
        dist = abs(lst[i][0] - lst[j][0]) + abs(lst[i][1] - lst[j][1])
        if dist < min_dist:
            min_dist = dist
            global minp1, minp2
            minp1 = lst[i]
            minp2 = lst[j]

I would guess that the outcomes of the program differ between the two distances.
Indeed, with the Euclidean distance the closest pair is (4,2)-(5,3), while with the Manhattan distance both (2,2)-(4,2) and (4,2)-(5,3) are closest pairs. Given your program, you only pick the first one in order of appearance, and the outcome is (2,2)-(4,2). If your program returned all closest pairs, you would have seen (4,2)-(5,3) as well.
But generally speaking, there is no reason for the outcome of the two programs to be the same. For example, in your data, change (5,3) to (5,3.1). To get a concrete idea of how different the two distances are, it may be useful to plot the "unit circle" for each norm; you will see that the Manhattan circle is more square than round.
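To see the tie concretely, here is a quick throwaway sketch (not from the original question) that runs the same brute-force search under both metrics and keeps every pair achieving the minimum; the helper names are mine.

from itertools import combinations

def closest_pairs(points, dist):
    # Return the minimum distance and every pair that achieves it.
    best, pairs = float("inf"), []
    for p, q in combinations(points, 2):
        d = dist(p, q)
        if d < best:
            best, pairs = d, [(p, q)]
        elif d == best:
            pairs.append((p, q))
    return best, pairs

manhattan = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
euclidean = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

print(closest_pairs([(2, 2), (4, 2), (5, 3)], manhattan))  # two pairs tie at distance 2
print(closest_pairs([(2, 2), (4, 2), (5, 3)], euclidean))  # only (4,2)-(5,3)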

Related

Correct implementation of weighted K-Nearest Neighbors

From what I understood, the classical KNN algorithm works like this (for discrete data):
Let x be the point you want to classify
Let dist(a,b) be the Euclidean distance between points a and b
Iterate through the training set points pᵢ, taking the distances dist(pᵢ,x)
Classify x as the most frequent class between the K points closest (according to dist) to x.
How would I introduce weights on this classic KNN? I read that more importance should be given to nearer points, and I read this, but couldn't understand how this would apply to discrete data.
For me, first of all, using argmax doesn't make any sense, and if the weight acts by increasing the distance, then it would make the distance worse. Sorry if I'm talking nonsense.
Consider a simple example with three classifications (red, green, blue) and the six nearest neighbors denoted by R, G, B. I'll make this linear to simplify visualization and arithmetic.
R B G x G R R
The points listed with distance are
class dist
R 3
B 2
G 1
G 1
R 2
R 3
Thus, if we're using unweighted nearest neighbours, the simple "voting" algorithm is 3-2-1 in favor of Red. However, with the weighted influences, we have ...
red_total   = 1/3^2 + 1/2^2 + 1/3^2 = 1/9 + 1/4 + 1/9 ≈ 0.47
blue_total  = 1/2^2                 = 0.25
green_total = 1/1^2 + 1/1^2         = 2.00
... and x winds up as Green due to proximity.
That lower-delta function is merely the classification function; in this simple example, it returns red | green | blue. In a more complex example, ... well, I'll leave that to later tutorials.
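As a quick check of the arithmetic above, here is a throwaway snippet (mine, not from the answer):

from collections import defaultdict

# Class label and distance of each of the six neighbours in the linear example.
neighbours = [("R", 3), ("B", 2), ("G", 1), ("G", 1), ("R", 2), ("R", 3)]

votes = defaultdict(float)
for label, d in neighbours:
    votes[label] += 1.0 / d ** 2    # inverse-square-distance weight

print(dict(votes))                  # {'R': 0.47..., 'B': 0.25, 'G': 2.0}
print(max(votes, key=votes.get))    # 'G' -- x winds up Green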
Okay, off the bat let me say I am not a fan of the link you provided: it uses images for its equations, and the notation in the images differs from that in the text.
So leaving that aside, let's look at the regular k-NN algorithm. Regular k-NN is actually just a special case of weighted k-NN: you assign a weight of 1 to the k nearest neighbors and 0 to the rest.
Let wqj denote the weight associated with a point j relative to a point q.
Let yj be the class label associated with data point j. For simplicity, let us assume we are classifying birds as either crows, hens or turkeys => discrete classes. So for all j, yj ∈ {crow, hen, turkey}.
A good weight metric is the inverse of the distance, whatever that distance is: Euclidean, Mahalanobis, etc.
Given all this, the class label yq you would associate with the point q you are trying to predict would be the sum of the wqj · yj terms divided by the sum of all the weights. You do not have to do the division if you normalize the weights first.
You would end up with an expression of the form somevalue1 · crow + somevalue2 · hen + somevalue3 · turkey.
One of these classes will have a higher somevalue. The class with the highest value is what you will predict for point q.
For the purpose of training, you can factor in the error any way you want. Since the classes are discrete, there are a limited number of simple ways you can adjust the weights to improve accuracy.
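Here is a minimal sketch of that weighted vote in Python, assuming Euclidean distance and inverse-distance weights; the function and data names are mine, not from the answer.

import math
from collections import defaultdict

def weighted_knn_predict(train, query, k):
    # train: list of ((x, y), label) pairs; query: (x, y); returns a label.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = math.dist(point, query)
        if d == 0:
            return label            # exact hit wins outright
        votes[label] += 1.0 / d     # inverse-distance weight
    return max(votes, key=votes.get)

# Toy usage: three 'crow' points near the origin, two 'hen' points far away.
train = [((0, 0), "crow"), ((1, 0), "crow"), ((0, 1), "crow"),
         ((5, 5), "hen"), ((6, 5), "hen")]
print(weighted_knn_predict(train, (0.5, 0.5), k=4))   # -> 'crow'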

Maximum minimum manhattan distance

Input:
A set of points
Coordinates are non-negative integer type.
Integer k
Output:
A point P(x, y) (in or not in the given set) whose Manhattan distance to the closest point of the set is maximal, and max(x, y) <= k
My (naive) solution:
For every (x, y) in the grid that contains the given set
BFS to find closest point to (x, y)
...
return maximum;
But I feel this runs very slowly for a large grid; please help me design a better algorithm (or the code / pseudocode) to solve this problem.
Instead of looping over every (x, y) in the grid, should I only need to loop over the median x and y values?
P.S: Sorry for my English
EDIT:
example:
Given P1(x1,y1), P2(x2,y2), P3(x3,y3). Find P(x,y) such that min{dist(P,P1), dist(P,P2),
dist(P,P3)} is maximal
Yes, you can do it better. I'm not sure if my solution is optimal, but it's better than yours.
Instead of doing a separate BFS for every point in the grid, do a 'cumulative' BFS from all the input points at once.
You start with a 2-dimensional array dist[k][k] with cells initialized to +inf, or to zero if there is an input point in that cell; then from every point P in the input you try to go in every possible direction. The further you are from the start point, the bigger the integer you put in the array dist. If a cell in dist already holds a value but you can reach it in a smaller number of steps (a smaller integer), you overwrite it.
In the end, when no more moves can be done, you scan the array dist to find the cell with maximum value. This is your point.
I think this would work quite well in practice.
For k = 3, assuming 1 <= x,y <= k, P1 = (1,1), P2 = (1,3), P3 = (2,2)
dist would look like this in the beginning:
0, +inf, +inf,
+inf, 0, +inf,
0, +inf, +inf,
in the next step it would be:
0, 1, +inf,
1, 0, 1,
0, 1, +inf,
and in the next step it would be:
0, 1, 2,
1, 0, 1,
0, 1, 2,
so the output is P = (3,1) or (3,3)
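Here is a minimal Python sketch of this cumulative BFS, using 1-based coordinates 1..k as in the example; the function name and details are mine.

from collections import deque

def farthest_from_set(points, k):
    # Multi-source BFS over the k x k grid; returns (best_point, its distance).
    INF = float("inf")
    dist = [[INF] * (k + 1) for _ in range(k + 1)]    # dist[y][x], 1-based
    queue = deque()
    for x, y in points:
        dist[y][x] = 0
        queue.append((x, y))
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 1 <= nx <= k and 1 <= ny <= k and dist[ny][nx] > dist[y][x] + 1:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((nx, ny))
    # Scan dist for the cell with the maximum value.
    best = max(((x, y) for x in range(1, k + 1) for y in range(1, k + 1)),
               key=lambda p: dist[p[1]][p[0]])
    return best, dist[best[1]][best[0]]

print(farthest_from_set([(1, 1), (1, 3), (2, 2)], 3))   # -> ((3, 1), 2); (3, 3) is equally far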
If k is not too large and you need to find a point with integer coordinates, you should do as another answer suggested - calculate minimum distances for all points on the grid using a BFS starting from all the given points at once.
A faster solution, for large k, and probably the only one which can find a point with float coordinates, is as follows. It has complexity O(n log n log k).
Search for the resulting maximum distance using dichotomy (binary search). You have to check whether there is any point inside the square [0, k] x [0, k] which is at least a given distance away from all points in the given set. Suppose you can do that check fast enough for any distance. It is obvious that if there is such a point for some distance R, there will always be some point for all smaller distances r < R; for example, the very same point would do. Thus you can search for the maximum distance using a binary search procedure.
Now, how to quickly check for the existence of (and also find) a point which is at least r units away from all given points. You should draw "Manhattan spheres of radius r" around all given points. These are the sets of points at most r units away from a given point; they are squares tilted by 45 degrees with diagonal equal to 2r. Now rotate the picture by 45 degrees, and all the squares become axis-parallel. You can then check for the existence of a point outside these squares using a sweeping-line algorithm: sort all vertical edges of the squares and process them one by one from left to right. A left border adds a segment mark to the sweeping line, a right border erases it, and you have to check whether there is any unmarked point on the line. You can implement this with a segment tree. Finally, you have to check whether there is any unmarked point on the line inside the initial square [0,k] x [0,k].
So, again, the overall solution is a binary search on r. Inside it, you check whether there is any point at least r units away from all given points. Do that by constructing the "Manhattan spheres of radius r" and then scanning them with a diagonal line from the top-left corner to the bottom-right. While moving the line, you store in the segment tree the number of open spheres at each point of the line. Between the opening and closing of any sphere the line does not change, and if there is a free point there, it means you have found a valid point for distance r.
The binary search contributes log k to the complexity. Each checking procedure is n log n for sorting the squares' borders, and n log k (n log n?) for processing them all.
A Voronoi diagram would be another fast solution, and it could also find a non-integer answer. But it is much, much harder to implement, even for the Manhattan metric.
First try
We can turn the 2D problem into a 1D problem by projecting onto the lines y=x and y=-x. If the points are (x1,y1) and (x2,y2) then the Manhattan distance is abs(x1-x2)+abs(y1-y2). Change coordinates to a u-v system with basis U = (1,1), V = (1,-1). Ignoring the 1/sqrt(2) scaling, the coordinates of the two points in this basis are u1 = x1+y1, v1 = x1-y1, u2 = x2+y2, v2 = x2-y2, and the Manhattan distance is the larger of abs(u1-u2) and abs(v1-v2).
How does this help? We can just work with the 1D u-values of the points: sort by u-value, loop through the sorted points, and find the largest difference between consecutive points. Do the same for the v-values.
Calculating the u,v coords is O(n), quicksorting is O(n log n), looping through the sorted list is O(n).
Alas, this does not work well. It fails if we have the points (-10,0), (10,0), (0,-10), (0,10). Let's try a
Voronoi diagram
Construct a Voronoi diagram using the Manhattan distance. This can be calculated in O(n log n) using Fortune's algorithm: https://en.wikipedia.org/wiki/Fortune%27s_algorithm
The vertices in the diagram are the points which have maximum distance from their nearest sites. There is pseudo-code for the algorithm on the Wikipedia page. You might need to adapt it for the Manhattan distance.

Minimize maximum manhattan distance of a point to a set of points

For 3 points in 2D :
P1(x1,y1),
P2(x2,y2),
P3(x3,y3)
I need to find a point P(x,y), such that the maximum of the manhattan distances
max(dist(P,P1),
dist(P,P2),
dist(P,P3))
will be minimal.
Any ideas about the algorithm?
I would really prefer an exact algorithm.
There is an exact, noniterative algorithm for the problem; as Knoothe pointed out, the Manhattan distance is rotationally equivalent to the Chebyshev distance, and P is trivially computable for the Chebyshev distance as the mean of the extreme coordinates.
The points reachable from P within the Manhattan distance x form a diamond around P. Therefore, we need to find the minimum diamond that encloses all points, and its center will be P.
If we rotate the coordinate system by 45 degrees, the diamond is a square. Therefore, the problem can be reduced to finding the smallest enclosing square of the points.
The center of a smallest enclosing square can be found as the center of the smallest enclosing rectangle (which is trivially computed as the max and min of the coordinates). There are infinitely many smallest enclosing squares, since you can shift the center along the shorter edge of the minimum rectangle and still have a minimal enclosing square. For our purposes, we can simply use the one whose center coincides with the center of the enclosing rectangle.
So, in algorithmic form:
Rotate and scale the coordinate system by assigning x' = x/sqrt(2) - y/sqrt(2), y' = x/sqrt(2) + y/sqrt(2)
Compute x'_c = (max(x'_i) + min(x'_i))/2, y'_c = (max(y'_i) + min(y'_i))/2
Rotate back with x_c = x'_c/sqrt(2) + y'_c/sqrt(2), y_c = - x'_c/sqrt(2) + y'_c/sqrt(2)
Then x_c and y_c give the coordinates of P.
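A short Python sketch of those three steps (the function name is mine):

import math

def min_max_manhattan(points):
    r2 = math.sqrt(2)
    # Step 1: rotate into x', y' coordinates (Manhattan becomes Chebyshev).
    xs = [x / r2 - y / r2 for x, y in points]
    ys = [x / r2 + y / r2 for x, y in points]
    # Step 2: the center of the bounding box minimizes the max Chebyshev distance.
    xc = (max(xs) + min(xs)) / 2
    yc = (max(ys) + min(ys)) / 2
    # Step 3: rotate back.
    return xc / r2 + yc / r2, -xc / r2 + yc / r2

print(min_max_manhattan([(0, 0), (0, 20), (10, 10)]))   # roughly (0.0, 10.0)

For the counter-example points (0,0), (0,20), (10,10) mentioned further down, this returns (0, 10), whose maximum Manhattan distance of 10 is optimal: the pair (0,0)-(0,20) alone forces a maximum of at least 10.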
If an approximate solution is okay, you could try a simple optimization algorithm. Here's an example, in Python
import random
def opt(*points):
    best, dist = (0, 0), 99999999
    for i in range(10000):
        new = best[0] + random.gauss(0, .5), best[1] + random.gauss(0, .5)
        dist_new = max(abs(new[0] - qx) + abs(new[1] - qy) for qx, qy in points)
        if dist_new < dist:
            best, dist = new, dist_new
            print(new, dist_new)
    return best, dist
Explanation: We start with the point (0, 0), or any other random point, and modify it a few thousand times, each time keeping the better of the new and the previously best point. Gradually, this will approximate the optimum.
Note that simply picking the mean or median of the three points, or solving for x and y independently does not work when minimizing the maximum manhattan distance. Counter-example: Consider the points (0,0), (0,20) and (10,10), or (0,0), (0,1) and (0,100). If we pick the mean of the most separated points, this would yield (10,5) for the first example, and if we take the median this would be (0,1) for the second example, which both have a higher maximum manhattan distance than the optimum.
Update: Looks like solving for x and y independently and taking the mean of the most distant points does in fact work, provided that one does some pre- and postprocessing, as pointed out by thiton.

Algorithm to find the closest 3 points that when triangulated cover another point

Picture a canvas that has a bunch of points randomly dispersed around it. Now pick one of those points. How would you find the closest 3 points to it such that if you drew a triangle connecting those points it would cover the chosen point?
Clarification: By "closest", I mean minimum sum of distances to the point.
This is mostly out of curiosity. I thought it would be a good way to estimate the "value" of a point if it is unknown but the surrounding points are known. With 3 surrounding points you could interpolate the value. I haven't heard of a problem like this before; it doesn't seem trivial, so I thought it might be a fun exercise, even if it's not the best way to estimate something.
Your problem description is ambiguous. Which triangle are you after in this figure, the red one or the blue one?
The blue triangle is closer based on lexicographic comparison of the distances of the points, while the red triangle is closer based on the sum of the distances of the points.
Edit: you clarified it to make it clear that you want the sum of distances to be minimized (the red triangle).
So, how about this sketch algorithm?
1. Assume that the chosen point is at the origin (makes the description of the algorithm easy).
2. Sort the points by distance from the origin: P(1) is closest, P(n) is farthest.
3. Start with i = 3, s = ∞.
4. For each triple of points P(a), P(b), P(i) with a < b < i, if the triangle contains the origin, let s = min(s, |P(a)| + |P(b)| + |P(i)|).
5. If s ≤ |P(1)| + |P(2)| + |P(i)|, stop.
6. If i = n, stop.
7. Otherwise, increment i and go back to step 4.
Obviously this is O(n³) in the worst case.
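Here is a rough Python sketch of those steps, taking "closest" to mean Euclidean distance; the function names are mine, not from the answer.

import math
from itertools import combinations

def contains_origin(a, b, c):
    # The origin is inside (or on) triangle abc iff it lies on the same side of
    # all three directed edges; for the edge a->b that orientation sign is cross(a, b).
    def cross(p, q):
        return p[0] * q[1] - p[1] * q[0]
    d1, d2, d3 = cross(a, b), cross(b, c), cross(c, a)
    return not ((d1 < 0 or d2 < 0 or d3 < 0) and (d1 > 0 or d2 > 0 or d3 > 0))

def closest_covering_triangle(points, chosen):
    # Steps 1-2: translate so the chosen point is the origin, sort by distance.
    pts = sorted(((x - chosen[0], y - chosen[1]) for x, y in points),
                 key=lambda p: math.hypot(*p))
    d = [math.hypot(*p) for p in pts]
    best, best_sum = None, float("inf")
    for i in range(2, len(pts)):
        # Step 5: nothing using pts[i] (or anything farther) can beat best_sum now.
        if best_sum <= d[0] + d[1] + d[i]:
            break
        # Step 4: try all triangles whose farthest vertex is pts[i].
        for a, b in combinations(range(i), 2):
            if contains_origin(pts[a], pts[b], pts[i]):
                s = d[a] + d[b] + d[i]
                if s < best_sum:
                    best_sum, best = s, (pts[a], pts[b], pts[i])
    return best, best_sum    # vertices are given relative to the chosen point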
Here's a sketch of another algorithm. Consider all pairs of points (A, B). For a third point to make a triangle containing the origin, it must lie in the grey shaded region in this figure:
By representing the points in polar coordinates (r, θ) and sorting them according to θ, it is straightforward to examine all these points and pick the closest one to the origin.
This is also O(n³) in the worst case, but a sensible order of visiting pairs (A, B) should yield an early exit in many problem instances.
Just a warning on the iterative method. You may find a triangle with 3 "near" points whose "length" is greater than that of another triangle obtained by adding a more distant point to the set. Sorry, can't post this as a comment.
See the graph.
The red triangle has a perimeter near 4 R, while the black one has 3*sqrt(3) R ≈ 5.2 R.
Like #thejh suggests, sort your points by distance from the chosen point.
Starting with the first 3 points, look for a triangle covering the chosen point.
If no triangle is found, expand your range to include the next closest point, and try all combinations.
Once a triangle is found, you don't necessarily have the final answer. However, you have now limited the final set of points to check. The furthest possible point to check would be at a distance equal to the sum of the distances of the first triangle found. Any further than this, and the sum of the distances is guaranteed to exceed the first triangle that was found.
Increase your range of points to include the last point whose distance <= the sum of the distances of the first triangle found.
Now check all combinations, and the answer is the triangle found from this set with the minimal sum of distances.
second shot
subsolution (analytic geometry basics, skip if you are familiar with this): finding points of the opposite half-plane
Example: Let's have two points: A=[a,b]=[2,3] and B=[c,d]=[4,1]. Find vector u = A-B = (2-4,3-1) = (-2,2). This vector is parallel to AB line, so is the vector (-1,1). The equation for this line is defined by vector u and point in AB (i.e. A):
X = 2 -1*t
Y = 3 +1*t
Where t is any real number. Get rid of t:
t = 2 - X
Y = 3 + t = 3 + (2 - X) = 5 - X
X + Y - 5 = 0
Any point that satisfies this equation is on the line.
Now let's have another point to define the half-plane, i.e. C=[1,1], we get:
X + Y - 5 = 1 + 1 - 5 < 0
Any point for which the expression has the opposite sign is in the other half-plane; these are the points with:
X + Y - 5 > 0
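A tiny sketch of that half-plane test with the example numbers above (the helper name is mine):

def side(a, b, p):
    # Sign of p relative to the line through a and b:
    # zero means on the line, equal signs mean the same half-plane.
    (ax, ay), (bx, by), (px, py) = a, b, p
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

A, B, C = (2, 3), (4, 1), (1, 1)
print(side(A, B, C))        # negative, matching X + Y - 5 < 0 for C
print(side(A, B, (5, 5)))   # positive -> (5, 5) lies in the opposite half-plane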
solution: finding the minimum triangle that fits the point S
1. Find the closest point P as min(sqrt((Xp - Xs)^2 + (Yp - Ys)^2)).
2. Find the perpendicular vector to SP as u = (-Yp+Ys, Xp-Xs).
3. Find the two closest points A, B from the opposite half-plane of sigma = pP, where p = Su (see the subsolution), such that A is on a different side of the line q = SP than B (see the final part of the subsolution).
4. Now we have a triangle ABP that covers S: calculate the sum of distances |SP| + |SA| + |SB|.
5. Find the next closest point to S and continue from step 1. If the sum of distances is smaller than in previous steps, remember it. Stop if |SP| is greater than the smallest sum of distances found so far, or if no more points are available.
I hope this diagram makes it clear.
This is my first shot:
split the space into quadrants with the picked point at the [0,0] coords
find the closest point from each quadrant (so you have 4 points)
any triangle from these points should be small enough (but not necessarily the smallest)
Take the closest N=3 points. Check whether the triangle fits. If not, increment N by one and try out all combinations. Do that until something fits or nothing does.

Efficient algorithm for finding spheres farthest apart in large collection

I've got a collection of 10000 - 100000 spheres, and I need to find the ones farthest apart.
One simple way to do this is to simply compare all the spheres to each other and store the biggest distance, but this feels like a real resource hog of an algorithm.
The Spheres are stored in the following way:
Sphere (float x, float y, float z, float radius);
The method Sphere::distanceTo(Sphere &s) returns the distance between the two center points of the spheres.
Example:
Sphere *spheres;
float biggestDistance = 0;
for (int i = 0; i < nOfSpheres; i++) {
    for (int j = 0; j < nOfSpheres; j++) {
        if (spheres[i].distanceTo(spheres[j]) > biggestDistance) {
            biggestDistance = spheres[i].distanceTo(spheres[j]);
        }
    }
}
What I'm looking for is an algorithm that somehow loops through all the possible combinations in a smarter way, if there is any.
The project is written in C++ (which it has to be), so any solutions that only work in languages other than C/C++ are of less interest.
The largest distance between any two points in a set S of points is called the diameter. Finding the diameter of a set of points is a well-known problem in computational geometry. In general, there are two steps here:
Find the three-dimensional convex hull composed of the center of each sphere -- say, using the quickhull implementation in CGAL.
Find the points on the hull that are farthest apart. (A point in the interior of the hull cannot be an endpoint of the diameter; otherwise it would have to lie on the hull, which is a contradiction.)
With quickhull, you can do the first step in O(n log n) average-case and O(n^2) worst-case running time. (In practice, quickhull significantly outperforms all other known algorithms.) It is possible to guarantee a better worst-case bound if you can guarantee certain properties about the ordering of the spheres, but that is a different topic.
The second step can be done in Ω(h log h), where h is the number of points on the hull. In the worst case, h = n (every point is on the hull), but that's pretty unlikely if you have thousands of random spheres. In general, h will be much smaller than n. Here's an overview of this method.
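The project is C++ (CGAL's quickhull is the natural fit there, as mentioned in step 1), but purely to illustrate the two steps, here is a small Python sketch using scipy's ConvexHull; it brute-forces the pairs among the hull vertices rather than using the O(h log h) method mentioned above, and the names are mine.

import numpy as np
from scipy.spatial import ConvexHull

def diameter_endpoints(centers):
    # centers: (n, 3) array of sphere centers; returns the two farthest-apart centers.
    hull = ConvexHull(centers)            # step 1: 3D convex hull of the centers
    pts = centers[hull.vertices]          # only hull vertices can realize the diameter
    best, pair = -1.0, (None, None)
    for i in range(len(pts)):             # step 2: search pairs of hull vertices only
        d = np.linalg.norm(pts[i + 1:] - pts[i], axis=1)
        if len(d) and d.max() > best:
            best, pair = d.max(), (pts[i], pts[i + 1 + d.argmax()])
    return pair, best

centers = np.random.rand(10000, 3)        # toy data
(p, q), d = diameter_endpoints(centers)
print(d)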
Could you perhaps store these spheres in a BSP Tree? If that's acceptable, then you could start by looking for nodes of the tree containing spheres which are furthest apart. Then you can continue down the tree until you get to individual spheres.
Your problem looks like something that could be solved using graphs. Since the distance from Sphere A to Sphere B is the same as the distance from Sphere B to Sphere A, you can minimize the number of comparisons you have to make.
I think what you're looking at here is known as an adjacency list. You can build one up and then traverse it to find the longest distance.
Another approach you can use will still give you an O(n^2) but you can minimize the number of comparisons you have to make. You can store the result of your calculation into a hash table where the key is the name of the edge (so AB would hold the length from A to B). Before you perform your distance calculation, check to see if AB or BA exists in the hash table.
EDIT
Using the adjacency-list method (which is basically a Breadth-First Search) you get O(b^d) or worst-case O(|E| + |V|) complexity.
Paul got my brain thinking and you can optimize a bit by changing
for (int j=0; j < nOfSpheres; j++)
to
for (int j=i+1; j < nOfSpheres; j++)
You don't need to compare sphere A to B AND B to A. This halves the number of comparisons, although the search is still O(n^2) overall.
--- Addition -------
Another thing that makes this calculation expensive is the distanceTo calculation.
distance = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
That's a lot of math. You can trim it down by checking whether
(x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2 > maxdist^2
This removes the sqrt until the end.
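The question requires C++, but as a quick language-neutral sketch of both optimizations together (inner loop starting at i + 1, squared distances inside the loop), here is a short Python version with names of my choosing:

import math

def farthest_pair(centers):
    # centers: list of (x, y, z) sphere centers; returns (distance, index pair).
    best_sq, pair = -1.0, (None, None)
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):          # skip the (j, i) duplicates
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            dz = centers[i][2] - centers[j][2]
            d_sq = dx * dx + dy * dy + dz * dz        # no sqrt inside the loop
            if d_sq > best_sq:
                best_sq, pair = d_sq, (i, j)
    return math.sqrt(best_sq), pair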
