Given n points in an array with x and y coordinates, we have to tell the number of distinct circles that can be formed by taking three points at a time.
The only logic I could think of was to take three points, find their circumcentre, and count the circles having a common circumcentre, but I could not write efficient code for it. Can someone please help with a better logic, or with how to implement the same logic?
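For what it's worth, here is a minimal O(n^3) sketch in Python of that approach (all names are mine, and integer coordinates are assumed so exact rational arithmetic can be used). One subtlety: a circle is identified by its centre and its radius together, so the set below keys on both rather than on the circumcentre alone.

from fractions import Fraction
from itertools import combinations

def circumcentre(a, b, c):
    # Solve for the point equidistant from a, b, c; exact rationals avoid
    # floating-point errors when grouping equal centres later.
    ax, ay = map(Fraction, a)
    bx, by = map(Fraction, b)
    cx, cy = map(Fraction, c)
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        return None  # collinear points: no circumcircle
    ux = ((ax*ax + ay*ay) * (by - cy) + (bx*bx + by*by) * (cy - ay)
          + (cx*cx + cy*cy) * (ay - by)) / d
    uy = ((ax*ax + ay*ay) * (cx - bx) + (bx*bx + by*by) * (ax - cx)
          + (cx*cx + cy*cy) * (bx - ax)) / d
    return ux, uy

def count_distinct_circles(points):
    circles = set()
    for a, b, c in combinations(points, 3):
        centre = circumcentre(a, b, c)
        if centre is not None:
            ux, uy = centre
            r2 = (ux - a[0]) ** 2 + (uy - a[1]) ** 2  # radius squared, exact
            circles.add((ux, uy, r2))
    return len(circles)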
Suppose we have n points in a bounded region of the plane. The problem is to divide it into 4 regions (with a horizontal and a vertical line) such that the sum of a metric over the regions is minimized.
The metric can be, for example, the sum of the distances between the points in each region, or any other measure of the spread of the points. See the figure below.
I don't know whether any clustering algorithm might help me tackle this problem, or whether, for instance, it can be formulated as a simple optimization problem in which the decision variables are the "axes".
I believe this can be formulated as a MIP (Mixed Integer Programming) problem.
Let's introduce 4 quadrants A, B, C, D: A is upper right, B is lower right, etc. Then define a binary variable
delta(i,k) = 1 if point i is in quadrant k
           = 0 otherwise
and continuous variables
Lx, Ly : coordinates of the lines
Obviously we have:
sum(k, delta(i,k)) = 1
xlo <= Lx <= xup
ylo <= Ly <= yup
where xlo, xup are the minimum and maximum x-coordinates (and ylo, yup likewise for y). Next we need to implement implications like:
delta(i,'A') = 1 ==> x(i)>=Lx and y(i)>=Ly
delta(i,'B') = 1 ==> x(i)>=Lx and y(i)<=Ly
delta(i,'C') = 1 ==> x(i)<=Lx and y(i)<=Ly
delta(i,'D') = 1 ==> x(i)<=Lx and y(i)>=Ly
These can be handled by so-called indicator constraints or written as linear inequalities, e.g.
x(i) <= Lx + (delta(i,'A')+delta(i,'B'))*(xup-xlo)
Similarly for the others. Finally, the objective is
min sum((i,j,k), delta(i,k)*delta(j,k)*d(i,j))
where d(i,j) is the distance between points i and j. This objective can be linearized as well.
After applying a few tricks, I could prove global optimality for 100 random points in about 40 seconds using CPLEX. This approach is not really suited to large datasets (the computation time increases quickly as the number of points grows).
I suspect this cannot be shoe-horned into a convex problem. Also, I am not sure this objective is really what you want: it will try to make all clusters about the same size (adding a point to a large cluster introduces many new distances in the objective, while adding a point to a small cluster is cheap). Maybe an average distance per cluster is a better measure (but that makes the linearization more difficult).
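For concreteness, here is a small sketch of this formulation in PuLP (my choice of modelling library; the original run used CPLEX). Big-M inequalities of the kind shown above stand in for indicator constraints, and auxiliary variables w linearize the products delta(i,k)*delta(j,k). The random point data and all names are placeholders, and the default CBC solver will be much slower than CPLEX.

import itertools
import random
import pulp

random.seed(0)
n = 12  # keep small: there are O(n^2) linearization variables
pts = [(random.random(), random.random()) for _ in range(n)]
xlo, xup = min(p[0] for p in pts), max(p[0] for p in pts)
ylo, yup = min(p[1] for p in pts), max(p[1] for p in pts)
Mx, My = xup - xlo, yup - ylo  # big-M constants

quads = ['A', 'B', 'C', 'D']  # A upper right, B lower right, C lower left, D upper left
prob = pulp.LpProblem("four_quadrants", pulp.LpMinimize)
delta = pulp.LpVariable.dicts("delta", (range(n), quads), cat="Binary")
Lx = pulp.LpVariable("Lx", xlo, xup)
Ly = pulp.LpVariable("Ly", ylo, yup)

for i, (x, y) in enumerate(pts):
    prob += pulp.lpSum(delta[i][k] for k in quads) == 1  # each point in one quadrant
    # Big-M versions of the implications, e.g. point in A or B ==> x >= Lx:
    prob += x >= Lx - (delta[i]['C'] + delta[i]['D']) * Mx
    prob += x <= Lx + (delta[i]['A'] + delta[i]['B']) * Mx
    prob += y >= Ly - (delta[i]['B'] + delta[i]['C']) * My
    prob += y <= Ly + (delta[i]['A'] + delta[i]['D']) * My

# Linearize delta(i,k)*delta(j,k) with w >= delta(i,k) + delta(j,k) - 1, w >= 0;
# since d(i,j) >= 0 and we minimize, w is pushed down to the product's value.
def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

obj = []
for i, j in itertools.combinations(range(n), 2):
    for k in quads:
        w = pulp.LpVariable(f"w_{i}_{j}_{k}", lowBound=0)
        prob += w >= delta[i][k] + delta[j][k] - 1
        obj.append(dist(pts[i], pts[j]) * w)
prob += pulp.lpSum(obj)

prob.solve()  # default CBC; expect it to be far slower than the CPLEX run above
print("Lx =", Lx.value(), "Ly =", Ly.value())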
Note: this is probably incorrect. I will try to add another answer.
The one-dimensional version of minimising the sum of squared differences is convex. If you start with the line at the far left and move it to the right, each point crossed by the line stops accumulating differences with the points to its right and starts accumulating differences with the points to its left. As the line moves, the differences to the left increase and the differences to the right decrease, so the cost decreases monotonically, possibly passes a single point that can sit on either side of the line, and then increases monotonically.
I believe that the one dimensional problem of clustering points on a line is convex, but I no longer believe that the problem of drawing a single vertical line in the best position is convex. I worry about sets of points that vary in y co-ordinate so that the left hand points are mostly high up, the right hand points are mostly low down, and the intermediate points alternate between high up and low down. If this is not convex, the part of the answer that tries to extend to two dimensions fails.
So for the one dimensional version of the problem you can pick any point and work out in time O(n) whether that point should be to the left or right of the best dividing line. So by binary chop you can find the best line in time O(n log n).
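Here is a sketch of that one-dimensional routine in Python (my code, and it assumes the convexity claim holds despite the caveat above). Prefix sums make each candidate split O(1) to evaluate via sum_{i<j}(xi - xj)^2 = m*sum(x^2) - (sum x)^2, and a binary chop on the discrete gradient finds the minimum after the O(n log n) sort.

from itertools import accumulate

def best_1d_split(xs):
    # Split sorted xs into a left part xs[:k] and a right part xs[k:],
    # minimizing the sum over both parts of pairwise squared differences.
    xs = sorted(xs)
    n = len(xs)
    s1 = [0.0] + list(accumulate(xs))                 # prefix sums of x
    s2 = [0.0] + list(accumulate(x * x for x in xs))  # prefix sums of x^2

    def side(lo, hi):  # sum over pairs i<j of (xi - xj)^2 within xs[lo:hi]
        m = hi - lo
        sx, sx2 = s1[hi] - s1[lo], s2[hi] - s2[lo]
        return m * sx2 - sx * sx

    def cost(k):
        return side(0, k) + side(k, n)

    lo, hi = 0, n  # binary chop on the discrete gradient of a convex sequence
    while lo < hi:
        mid = (lo + hi) // 2
        if cost(mid) <= cost(mid + 1):
            hi = mid
        else:
            lo = mid + 1
    return lo, cost(lo)  # split index and its cost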
I don't know whether the two dimensional version is convex or not but you can try all possible positions for the horizontal line and, for each position, solve for the position of the vertical line using a similar approach as for the one dimensional problem (now you have the sum of two convex functions to worry about, but this is still convex, so that's OK). Therefore you solve at most O(n) one-dimensional problems, giving cost O(n^2 log n).
If the points aren't very strangely distributed, I would expect that you could save a lot of time by using the solution of the one dimensional problem at the previous iteration as a first estimate of the position of solution for the next iteration. Given a starting point x, you find out if this is to the left or right of the solution. If it is to the left of the solution, go 1, 2, 4, 8... steps away to find a point to the right of the solution and then run binary chop. Hopefully this two-stage chop is faster than starting a binary chop of the whole array from scratch.
Here's another attempt. Lay out a grid so that, except in the case of ties, each point is the only point in its column and the only point in its row. Assuming no ties in any direction, this grid has N rows, N columns, and N^2 cells. If there are ties the grid is smaller, which makes life easier.
Separating the cells with a horizontal and a vertical line pretty much amounts to picking out a cell of the grid and saying that cell is the one just above and just to the right of where the lines cross, so there are roughly O(N^2) possible divisions, and we can calculate the metric for each. I claim that when the metric is the sum of squares of distances between points in a cluster, each division can be evaluated at essentially constant cost, so the whole cost of checking every possibility is O(N^2).
The metric within a rectangle formed by the dividing lines is SUM_i,j[ (X_i - X_j)^2 + (Y_i-Y_j)^2]. We can calculate the X contributions and the Y contributions separately. If you do some algebra (which is easier if you first subtract a constant so that everything sums to zero) you will find that the metric contribution from a co-ordinate is linear in the variance of that co-ordinate. So we want to calculate the variances of the X and Y co-ordinates within the rectangles formed by each division. https://en.wikipedia.org/wiki/Algebraic_formula_for_the_variance gives us an identity which tells us that we can work out the variance given SUM_i Xi and SUM_i Xi^2 for each rectangle (and the corresponding information for the y co-ordinate). This calculation can be inaccurate due to floating point rounding error, but I am going to ignore that here.
Given a value associated with each cell of a grid, we want to make it easy to work out the sum of those values within rectangles. We can create partial sums along each row, transforming 0 1 2 3 4 5 into 0 1 3 6 10 15, so that each cell in a row contains the sum of all the cells to its left, including itself. If we take these values and do partial sums up each column, we have just worked out, for each cell, the sum of the rectangle whose top right corner lies in that cell and which extends to the bottom and left sides of the grid. The values in the far right column then give us the sums of the full-width slabs at and below each level.
By subtracting off rectangles we already know how to calculate, we can find the value of a rectangle which lies at the right hand side and the bottom of the grid. Similar subtractions let us work out first the value of the rectangles to the left and right of any vertical line we choose, and then complete the set of four rectangles formed by two crossing lines through any cell of the grid. The expensive part is working out the partial sums, but we only have to do that once, at a cost of O(N^2). The subtractions and lookups needed to evaluate any particular metric have only constant cost; we do one for each of O(N^2) cells, which is still O(N^2) overall.
(So we can find the best clustering in O(N^2) time by working out the metrics associated with all possible clusterings in O(N^2) time and choosing the best.)
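Here is a compact sketch of the whole scheme in Python (my code), taking the metric to be the sum over ordered pairs of squared distances, via the identity SUM_i,j (X_i - X_j)^2 = 2m*SUM X^2 - 2*(SUM X)^2 for a rectangle holding m points. It builds five prefix-sum grids (count, sum x, sum x^2, sum y, sum y^2) and then scans all O(N^2) placements of the two lines.

def best_split(points):
    # Grid indices: one column per distinct x, one row per distinct y.
    xs = sorted(set(x for x, _ in points))
    ys = sorted(set(y for _, y in points))
    nx, ny = len(xs), len(ys)
    col = {x: c for c, x in enumerate(xs)}
    row = {y: r for r, y in enumerate(ys)}

    # One grid per accumulated quantity: count, sum x, sum x^2, sum y, sum y^2.
    grids = [[[0.0] * nx for _ in range(ny)] for _ in range(5)]
    for x, y in points:
        r, c = row[y], col[x]
        for g, v in zip(grids, (1.0, x, x * x, y, y * y)):
            g[r][c] += v

    # In-place 2D prefix sums: g[r][c] = sum over rows <= r and cols <= c.
    for g in grids:
        for r in range(ny):
            for c in range(nx):
                g[r][c] += (g[r][c - 1] if c else 0.0) \
                         + (g[r - 1][c] if r else 0.0) \
                         - (g[r - 1][c - 1] if r and c else 0.0)

    def rect(g, r0, r1, c0, c1):  # sum over the inclusive cell range
        s = g[r1][c1]
        if r0: s -= g[r0 - 1][c1]
        if c0: s -= g[r1][c0 - 1]
        if r0 and c0: s += g[r0 - 1][c0 - 1]
        return s

    def metric(r0, r1, c0, c1):  # 2m*sum(x^2) - 2(sum x)^2, plus same for y
        m, sx, sx2, sy, sy2 = (rect(g, r0, r1, c0, c1) for g in grids)
        return 2.0 * m * (sx2 + sy2) - 2.0 * (sx * sx + sy * sy)

    best = None
    for r in range(ny + 1):      # horizontal line: rows < r vs rows >= r
        for c in range(nx + 1):  # vertical line: cols < c vs cols >= c
            total = 0.0
            if r and c:           total += metric(0, r - 1, 0, c - 1)
            if r and c < nx:      total += metric(0, r - 1, c, nx - 1)
            if r < ny and c:      total += metric(r, ny - 1, 0, c - 1)
            if r < ny and c < nx: total += metric(r, ny - 1, c, nx - 1)
            if best is None or total < best[0]:
                best = (total, r, c)
    return best  # (metric value, row split, column split)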
I am trying to solve the following question which was part of a programming contest:
PROBLEM ID : CIELLAND
Chef Ciel is developing a new island for her restaurants. On the island, Ciel intends to build N restaurants, and the coordinates of the i-th restaurant will be (xi, yi). In addition, Ciel is going to create K roads, whose locations are not decided yet. Each road must be an infinitely long straight line.
Let di be the distance between the i-th restaurant and the nearest road to it. Ciel would like to create K roads that minimize max(d1, d2, ..., dN). Your task is to calculate that minimal value of max(d1, d2, ..., dN).
Any ideas as to how I should approach it? Also, the contest editorial is out ( http://www.codechef.com/wiki/march-2012-cook-problem-editorials ) but I cannot understand the solution.
Any help regarding the approach to be followed would be much appreciated.
At a high level, they are reformulating the problem so that it is easier to solve. By casting it in the light below, they limit the number of possible lines to consider.
Problem A: There are N circles. The center of the i-th circle is (xi, yi) and all circles have radius R. Suppose we can draw X lines such that every circle intersects at least one line. What is the minimal X?
To explain further, let's rephrase Problem A in words: the restaurants are sticklers for rules, and there is a rule that says all restaurants must agree on a single maximum distance from a road; this will be R. The circle formed by a restaurant's location and R represents the region a line needs to intersect to satisfy this requirement. The new problem asks for the minimum number of roads that can do this.
If this is not possible with at most K roads, then something has to change. We can't add roads, per the original problem, but we can modify R. This is where binary search comes in; but first we have to solve Problem A.
Now, let's consider solving Problem A. First, the lines can be limited to common tangents of two circles: if a line intersects some (at least 2) circles, we can move the line so that it still intersects the same circles and becomes one of their common tangents. If a line intersects fewer than 2 circles, it is useless (but be careful of the case N = 1). There are at most 4 common tangents to any two circles, so we consider at most 2 * N * (N-1) lines.
The important part is this: we need to find lines that intersect more than one circle. At most four lines for each pair of circles need be considered; check the source code for the implementation.
The next big step is the dynamic programming, which finds the minimum number of lines needed to cover all the circles. The 'mask' is a bitmask recording which circles have been hit as each line is considered.
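Here is a rough sketch of that DP step in Python, assuming line_masks already contains, for each candidate tangent line, a bitmask of the circles it intersects (the geometric enumeration above produces those masks). It is only feasible while the number of circles is small enough to fit in a bitmask.

def min_lines_to_cover(n_circles, line_masks):
    # dp[mask] = fewest lines whose union of intersected circles is `mask`.
    FULL = (1 << n_circles) - 1
    INF = float("inf")
    dp = [INF] * (FULL + 1)
    dp[0] = 0
    for mask in range(FULL + 1):
        if dp[mask] == INF:
            continue
        for lm in line_masks:
            new = mask | lm
            if dp[mask] + 1 < dp[new]:
                dp[new] = dp[mask] + 1
    return dp[FULL]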
This solves Problem A, but now we have to convert back. Remember R? We can binary search for the minimum R such that X <= K. In terms of my reformulation of Problem A, it's the smallest distance all restaurants will agree to while still being serviced by a road.
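In code, the outer loop might look like the sketch below, where min_lines(R) is an assumed helper that solves Problem A for radius R (tangent enumeration plus the DP above) and hi is any radius known to be feasible.

def minimal_max_distance(min_lines, K, hi=1e7):
    # min_lines(R) is non-increasing in R, so feasibility of "X <= K" is
    # monotone and we can halve the bracket [lo, hi] until it is tight.
    lo = 0.0
    for _ in range(60):  # ~60 halvings give ample floating-point precision
        mid = (lo + hi) / 2
        if min_lines(mid) <= K:
            hi = mid
        else:
            lo = mid
    return hi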
Hope that helps. Tricky, but an interesting problem.
You should be able to solve it as a k-means clustering problem. Initially, seed with a bunch of lines; then iteratively update the assignment of points to lines and the optimal line given its points.
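A rough sketch of that heuristic (my code, NumPy assumed): each "centroid" is a line, fitted to its cluster by total least squares, i.e. the line through the cluster mean along its principal direction. Note this is a local heuristic that optimizes squared distances, not the exact minimax answer the editorial computes.

import numpy as np

def point_line_dist(pts, mean, direction):
    # Perpendicular distance of each point to the line through `mean`
    # along the unit vector `direction`.
    diff = pts - mean
    along = diff @ direction
    return np.sqrt(np.maximum(0.0, (diff ** 2).sum(axis=1) - along ** 2))

def fit_line(pts):
    # Total-least-squares line: through the mean, along the principal direction.
    mean = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - mean)
    return mean, vt[0]

def k_lines(pts, k, iters=50, seed=0):
    # pts: (n, 2) array. Seed k lines through random pairs of points, then
    # alternate assignment and refitting, k-means style.
    rng = np.random.default_rng(seed)
    lines = []
    for _ in range(k):
        i, j = rng.choice(len(pts), size=2, replace=False)
        d = pts[j] - pts[i]
        lines.append((pts[i], d / np.linalg.norm(d)))
    for _ in range(iters):
        dists = np.stack([point_line_dist(pts, m, d) for m, d in lines])
        assign = dists.argmin(axis=0)
        for c in range(k):
            members = pts[assign == c]
            if len(members) >= 2:
                lines[c] = fit_line(members)
    dists = np.stack([point_line_dist(pts, m, d) for m, d in lines])
    return lines, dists.min(axis=0).max()  # lines, and the max(d_i) achieved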
I came across this problem wherein there are a number of houses on a 2-D grid (their coordinates are given), and we essentially have to find which house can be used as a meeting point so that the total distance traveled by everyone is minimized. Let us assume that a step along the x- or y-axis costs 1 unit and a step to a diagonal neighbour costs (say) 1.2 units.
I cannot really think of a good optimization algorithm for this.
P.S: Not a homework problem. And I am only looking for an algorithm (not code) and if possible, its proof.
P.S #2: I am not looking for the Exhaustive solution. Believe it or not, that did strike me :)
As already pointed out, in the case of Manhattan distance the median gives a solution. This is an immediate consequence of the well-known fact that the median minimizes the mean absolute deviation:
E|X-c| >= E|X-median(X)| for any constant c.
And here you can find an example of the proof for the discrete case:
https://stats.stackexchange.com/questions/7307/mean-and-median-properties/7315#7315
This is probably really inefficient, but loop through all the houses, and within that loop go through all the other houses (nested for loops), using the distance formula to find the distance between the two. Then you have the distance between every pair of houses. A quick and easy way to find the best house is to add up everyone's walking distance to each candidate house; the house with the least total walking distance is the meeting place of choice.
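A tiny sketch of that brute force (names are mine), using the step costs from the question: 1 for a straight step and 1.2 for a diagonal, so a diagonal is worth taking whenever possible since 1.2 < 2.

def step_dist(a, b):
    # Cost between two grid points: take min(dx, dy) diagonal steps at 1.2,
    # then |dx - dy| straight steps at 1.
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return 1.2 * min(dx, dy) + abs(dx - dy)

def best_meeting_house(houses):
    # O(n^2): total walking distance to each candidate house, take the min.
    return min(houses, key=lambda h: sum(step_dist(h, o) for o in houses))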
I have been bugged by the same problem for some time now. The solution is the obvious consensus of the earlier posts: find the median (mx, my) coordinate-wise, and then find the closest point among the given N points; that is the meeting place. To see why this is actually the solution, first consider the distance
d = sum(|xi - x|) + sum(|yi - y|) over all 1 <= i <= N,
which is separable in x and y. Hence we can solve the 1-D case for x and y independently. I will skip over the explanation given above and conclude that (mx, my) is the best solution if we consider all possible points. The bigger challenge is to prove that we may move from (mx, my) to the closest (xi, yi) such that (xi, yi) is one of the given points, without increasing the distance. The proof goes:
Consider the sorted x-coordinates (for the sake of the proof), with X1 < X2 < ... < Xn and Xj < mx < X(j+1) where j = N/2. Now let's move mx one step to the left, that is, mx' = mx - 1. Hence d' = |X1 - mx + 1| + ... + |Xj - mx + 1| + |X(j+1) - mx + 1| + ... + |Xn - mx + 1|. Moving to mx - 1 increases the N/2 terms with k >= j+1 and decreases the N/2 terms with k <= j, so the net effect is zero: (mx - 1, my) gives the same total distance. This means there is a whole region Xj < mx < X(j+1), Yj < my < Y(j+1) in which the distance does not change, so we can pick the closest given point in it, and that is the answer.
I have ignored the subtle even/odd cases, but I hope the math works itself out once you see the basic proof.
This is my first post; do help me improve my writing skills.
Your distance metric is weird. You'd expect travelling along the diagonal to cost at least sqrt(2) ~= 1.41 times as much as travelling along a component direction, because by the Pythagorean theorem that's how much further a straight line along the diagonal is.
If you insisted on Manhattan distance (no diagonals allowed), then you'd want to pick the house closest to (median(x), median(y)) of the houses.
Consider the 1D case: you have a bunch of points on a line, and you want to pick the meeting spot. For concreteness/simplicity, let's say there are 5 houses, none duplicated.
Consider what happens as the meeting spot drifts away from the median to the right. For every unit moved until you pass the 4th house in left-to-right order, 3 people have to take an additional step to the right and 2 people get to take one less step to the left, so the cost goes up by 1. Once you pass the 4th house, 4 people have to take an additional step to the right and a single person gets to take one less step to the left, so the cost increases by 3. An identical argument holds as you move the meeting spot to the left of the median. Moving away from the median always increases the cost.
The argument generalizes to any number of people, with or without duplicate houses, and even to an arbitrary number of dimensions, so long as you aren't allowed to use the diagonal.
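Putting the consensus of these answers into a short sketch (names mine, Manhattan metric): take the coordinate-wise median, then return the house nearest to it.

def median_meeting_house(houses):
    # Coordinate-wise median, then the house nearest to it in Manhattan
    # distance, per the argument above (valid for the no-diagonal metric).
    xs = sorted(x for x, _ in houses)
    ys = sorted(y for _, y in houses)
    mx, my = xs[len(xs) // 2], ys[len(ys) // 2]
    return min(houses, key=lambda h: abs(h[0] - mx) + abs(h[1] - my))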
Your problem is known as Optimal Meeting Point finding. The following paper gives an efficient approximate algorithm:
http://www.cse.ust.hk/~wilfred/paper/vldb11.pdf
Well, you could brute-force it. Take each house and calculate the distance to each other house. Sum the distances up for each individual house. Then just grab the house with the lowest sum.
My question might be a little strange: I've "developed" an algorithm and don't know whether a similar algorithm already exists.
The situation: I've got a track defined by track points (2D). The track points represent turns, for instance; between the track points there are only straight lines. Now I'm given a set of coordinates in this 2D space. I calculate the distance from the first track point to the new coordinates, and the distance for the interval between the first two track points. If the distance to the measured coordinates is shorter than the distance from the first to the second track point, I assume that the point lies within this interval, and I do a linear interpolation on it. If it's bigger, I check against the next interval.
So it's basically taking interval distances and trying to fit the measured point in there. I'm trying to track an object moving approximately along this track.
Does this sound familiar to somebody? Can somebody suggest a similar existing algorithm?
EDIT: From what I've stated so far, I want to clarify that a position is associated with at most one pair of track points (one segment). Consider the fine ASCII drawing Jonathan made:
The position X is found to be within the segment between track points 1 and 2 (S12). Now the next position is Y, which is not considered close enough to be on S12, so I move on to S23 and check whether it's in there.
If it is, I won't check S12 for any later value, because I already found a value in the next segment; the algorithm "doesn't look back".
But if it doesn't find the right segment from there on, because the value happened to be too far away from the current segment yet still further away from any other segment, I will drop the value, and the search for the next position will start back in S12 again.
The loop still remains a problem. Suppose I get Y in S23 and then skip two or three positions (because they are too far off); I might lose track. I could place a position in S34 when the object is actually already in S56.
Maybe I can come up with some average speed to vaguely tell in which segment it should be.
It seems the bigger the segments are, the better the chance of making the right decision.
What concerns me about the algorithm you've described is that it is 'greedy' and could choose the 'wrong' track segment (or, at least, a track segment that is not the closest to the point).
Time to push ASCII art to the limits. Consider the following path (numbers represent the sequence in the list of track points), and the coordinate X (and, later, Y).
1-------------2
              |
              |  Y
            X |
        5-----+-----6
        |     |
        |     |
        4-----3
How are we supposed to interpret your description?
[C]alculate the distance from the first track point to the new coordinates and the distance for the interval for the first two track points. If the distance to the measured coordinates is shorter than the distance from the first to the second track point, [assume] that this point lies in between this interval; [...] [i]f it's bigger, [...] check with the next interval.
I think the first sentence means:
Calculate the distance from TP1 (track point 1) to TP2 - call it D12.
Calculate the distance from TP1 to X (call it D1X) and from TP2 to X (call it D2X).
The tricky part is the interpretation of the conditional sentence.
My impression is that if either D1X or D2X is less than D12, then X will be assumed to be on (or closest to) the track segment from TP1 to TP2 (call it segment S12).
Looking at the position of X in the diagram, it is moderately clear that both D1X and D2X are smaller than D12, so my interpretation of your algorithm would interpret X as being associated with S12, yet X is clearly closer to S23 or S56 than it is to S12 (but those are discarded without even being considered).
Have I misunderstood something about your algorithm?
Thinking about it a bit: what I've interpreted your algorithm to mean is that if the point X lies within either the circle of radius D12 centred at TP1 or the circle of radius D12 centred at TP2, then you associate X with S12. However, if we also consider point Y, the algorithm I suggest you are using would also associate it with S12.
If the algorithm is refined to say MAX(D1Y, D2Y) < D12, then it does not consider Y as being related to S12. However, X is probably still considered to be related to S12 rather than S23 or S56.
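For comparison, the non-greedy baseline is to project the position onto every segment and take the closest; a small sketch (names are mine), costing O(number of segments) per position:

def point_segment_dist(p, a, b):
    # Exact distance from p to segment ab: clamp the projection of p onto
    # the infinite line through a and b to the parameter range [0, 1].
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        t = 0.0  # degenerate segment: a == b
    else:
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    cx, cy = ax + t * dx, ay + t * dy  # closest point on the segment
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5, t

def nearest_segment(p, track_points):
    best = None
    for i in range(len(track_points) - 1):
        d, t = point_segment_dist(p, track_points[i], track_points[i + 1])
        if best is None or d < best[0]:
            best = (d, i, t)  # distance, segment index, interpolation parameter
    return best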
The first part of this algorithm reminds me of moving through a discretised space. An example of representing such a space is the Z-order space-filling curve. I've used this technique to represent a quadtree, the data structure for an adaptive mesh refinement code I once worked on, and used an algorithm very like the one you describe to traverse the grid and determine distances between particles.
The similarity may not be immediately obvious. Since you are only concerned with interval locations, you are effectively treating all points on an interval as equivalent in this step. This is the same as choosing a space which only has discretised points: you're effectively "snapping" your points to a grid.
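To illustrate, Z-order just interleaves the bits of a cell's x and y indices, so that cells close in space tend to be close along the curve; a minimal sketch:

def morton(x, y, bits=16):
    # Interleave the bits of non-negative cell indices x and y.
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code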