Clustering a sparse matrix along the diagonal using row permutations only - algorithm

Given the binary pattern of a square sparse matrix, how can you move all non-zero elements towards the diagonal using row permutations only? One possible cost function is the sum of the two-norm distances between each non-zero element and the diagonal.

This is more of an algorithms question, but one simple way would be to take a greedy approach: keep evaluating which row swap would yield the greatest improvement in your cost function, apply it, and repeat until the cost function stabilizes.
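A minimal sketch of that greedy loop, in Python with NumPy, assuming a dense 0/1 array and using |i - j| per non-zero element (proportional to its perpendicular distance to the diagonal) as the cost:

    import numpy as np

    def row_cost(row, pos):
        # Cost contribution of this row's non-zeros if the row sat at index pos;
        # |pos - c| is proportional to the distance from (pos, c) to the diagonal.
        cols = np.nonzero(row)[0]
        return np.abs(pos - cols).sum()

    def greedy_diagonalize(A):
        # Repeatedly apply the single row swap with the best cost improvement.
        A = A.copy()
        n = A.shape[0]
        while True:
            best_delta, best_pair = 0, None
            for i in range(n):
                for j in range(i + 1, n):
                    delta = (row_cost(A[i], j) + row_cost(A[j], i)
                             - row_cost(A[i], i) - row_cost(A[j], j))
                    if delta < best_delta:
                        best_delta, best_pair = delta, (i, j)
            if best_pair is None:      # no swap improves the cost: stabilized
                return A
            i, j = best_pair
            A[[i, j]] = A[[j, i]]

Since the cost is integer-valued and strictly decreases with every accepted swap, the loop terminates; like any greedy scheme it may stop at a local optimum.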

Related

Find a region with maximum sum of top-K points

My problem is: we have N points in a 2D space, each point with a positive weight. Given a query consisting of two real numbers a, b and one integer k, find the position of an a x b rectangle, with edges parallel to the axes, such that the sum of weights of the top-k points (i.e. the k points with highest weights) covered by the rectangle is maximized.
Any suggestion is appreciated.
P.S.:
There are two related problems, which are already well-studied:
Maximum region sum: find the rectangle position with the highest total weight sum. Complexity: O(N log N).
Top-k query for orthogonal ranges: find the top-k points in a given rectangle. Complexity: O(log(N)^2 + k).
You can reduce this problem to choosing two points that pin the rectangle: one on its right edge (the rightmost) and one on its top edge (the topmost). So effectively you can select every pair of points and calculate the top-k weight of the resulting rectangle (which according to you is O(log(N)^2 + k)). Complexity: O(N^2 * (log(N)^2 + k)).
Now, a given pair of points might not be valid: they might be too far apart, or one point may be both to the right of and above the other. So, in practice, this will run much faster.
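A brute-force sketch of this reduction in Python (points as (x, y, weight) tuples; a plain linear scan stands in for the O(log(N)^2 + k) top-k range query, so this version is cubic rather than quadratic in N):

    import heapq

    def best_topk_rectangle(points, a, b, k):
        # Anchor the rectangle's right edge on one point's x and its top
        # edge on another point's y, then score the top-k weights inside.
        best = (float('-inf'), None)
        for (px, _, _) in points:            # candidate right edge: x = px
            for (_, qy, _) in points:        # candidate top edge:  y = qy
                x_lo, x_hi = px - a, px
                y_lo, y_hi = qy - b, qy
                inside = [w for (x, y, w) in points
                          if x_lo <= x <= x_hi and y_lo <= y <= y_hi]
                if inside:
                    score = sum(heapq.nlargest(k, inside))
                    best = max(best, (score, (x_lo, y_lo)))
        return best    # (best top-k weight sum, lower-left corner)

Invalid pairs simply produce rectangles covering fewer (or no) points, so they never win; an optimal rectangle can always be slid until its right and top edges each touch a point, which is why enumerating pairs suffices.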
My guess is the optimal solution will be a variation of the maximum region sum problem. Could you point me to a link describing that algorithm?
A non-optimal answer is the following:
1. Generate all the possible k-plets of points (there are N × (N-1) × … × (N-k+1) of them, so this is O(N^k) and can be done via recursion).
2. Filter this list down by eliminating all k-plets which are not enclosed in an a×b rectangle: this is O(k N^k) at worst.
3. Find the k-plet which has the maximum weight: this is O(k N^(k-1)) at worst.
Thus, this algorithm is O(k N^k).
Improving the algorithm
Step 2 can be integrated into step 1 by stopping the branch recursion as soon as a partial set of points is already too large to fit. This does not change the need to scan the elements at least once, but it can reduce their number significantly: think of cases where there is no solution because all points are separated by more than the size of the rectangle; that can be detected in O(N^2).
Also, the permutation generator in step 1 can be made to return the points in order of x or y coordinate, by pre-sorting the point array accordingly. This is useful because it lets us discard many more possibilities up front. Suppose the array is sorted by y coordinate, so the k-plets returned will be ordered by y coordinate. Now, if we are discarding a branch because it contains a point whose y coordinate falls outside the maximal rectangle, we can also discard all the following sibling branches, because their y coordinates will be greater than or equal to the current one, which is already out of bounds.
This adds O(N log N) for the sort, but the improvement can be quite significant in many cases -- again, when there are many outliers. The sort coordinate should be the one for which the rectangle side is smallest relative to the corresponding extent of the 2D field -- by which I mean the maximum coordinate minus the minimum coordinate over all points.
Finally, if all the points lie within an a×b rectangle, then the algorithm still performs as O(k N^k). If this is a concrete possibility, it should be checked up front with an easy O(N) loop; if it holds, it is enough to return the k points with the highest weights, which can also be done in O(N).
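A sketch of the recursive enumeration with both prunings (Python; points as (x, y, weight) tuples, pre-sorted by y inside the function):

    def best_kplet(points, a, b, k):
        pts = sorted(points, key=lambda p: p[1])     # sort by y coordinate
        best = [float('-inf'), None]

        def rec(start, sel, weight):
            if len(sel) == k:
                if weight > best[0]:
                    best[:] = [weight, list(sel)]
                return
            for i in range(start, len(pts)):
                cand = sel + [pts[i]]
                if pts[i][1] - cand[0][1] > b:
                    break      # y-sorted: every later sibling is out of bounds too
                xs = [p[0] for p in cand]
                if max(xs) - min(xs) > a:
                    continue   # x-span violated; x order is not guaranteed, so only skip
                rec(i + 1, cand, weight + pts[i][2])

        rec(0, [], 0.0)
        return best            # [best weight sum, the k points achieving it]

The break implements the sibling pruning described above, and the per-branch span check integrates the filtering of step 2 into the recursion of step 1.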

Find an algorithm that minimizes the maximum distance between two sets, better than the greedy algorithm

Here is the interesting but complicated problem:
Suppose we have two sets of points. One set A contains points on some spatial grid, like a regular 1D or 3D grid. The other set B contains points that are randomly spaced and is of the same size as set A. Mathematically, we could order the two sets and construct a matrix of the distances between A and B; for example, A(i, j) would be the distance between point i of A and point j of B.
Given some ordering, we have a matrix, and the diagonal element (i, i) of that matrix is the distance between point i of A and point i of B. The problem is how to find a good reordering/indexing such that the maximum of these distances is as small as possible -- in matrix form, how to find a reordering such that the largest diagonal element is as small as possible.
Notes from myself:
Suppose set A corresponds to the rows of the matrix, and set B to its columns. Reordering the matrix then means permuting its rows/columns, so our problem is equivalent to finding a permutation that minimizes the largest diagonal element.
A greedy algorithm may be a choice, but I am trying to find an ideally optimal reordering that minimizes the largest diagonal element.
The reordering you are referring to is essentially a correspondence problem, i.e. you are trying to find the closest match for each point in the other set. The greedy algorithm will work fine. The distance you are looking for is commonly referred to as the Hausdorff distance.
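For illustration, a common greedy variant in Python: repeatedly commit the globally closest unmatched pair. Note this is a heuristic; for the stated min-max objective it is not guaranteed to be optimal:

    import math

    def greedy_match(A, B):
        # Sort all cross-set pairs by distance and commit each pair
        # whose two endpoints are still unmatched.
        pairs = sorted((math.dist(p, q), i, j)
                       for i, p in enumerate(A)
                       for j, q in enumerate(B))
        used_a, used_b, match = set(), set(), {}
        for d, i, j in pairs:
            if i not in used_a and j not in used_b:
                match[i] = j
                used_a.add(i)
                used_b.add(j)
        return match    # the max matched distance is the greedy bottleneck value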

Best matching in a bipartite graph (e.g. associating labels with points on a plot)

I am trying to extract semantics from graphical xy plots where the points are plotted and some or all have a label. The label is plotted "near the point" so that a human can normally understand which label goes with which point. For example, in this plot it is clear which label (number) belongs to which point (*), and an algorithm based on Euclidean distance would work. (The labels and points have no semantic ordering - e.g. a scatterplot.)
*1
*2
*3
*4
In congested plots the authoring software/human may place the label in different directions to avoid overlap. For example in
1**2
**4
3
A human reader can normally work out which label is associated with which point.
One solution I'd accept would be to create a Euclidean distance matrix and shuffle the rows to minimize some function (e.g. the sum of squares of the distances on the diagonal, or another heuristic). In the second example (with the points labelled a, b, c, d clockwise from the NW corner), the labelled plot and the distance matrix (to 1 d.p.) are:
1ab2
 dc4
3

    a    b    c    d
1  1.0  2.0  2.2  1.4
2  2.0  1.0  1.4  2.2
3  2.0  2.2  1.4  1.0
4  2.2  1.4  1.0  2.0
and we need the labelling a1, b2, c4, d3. Swapping rows 3 and 4 gives the minimum sum on the diagonal. Here's a more complex example where simply picking the nearest point may fail:
*1*2*5
**4
3 *6
If this is solved then I shall need to go to cases where the number of labels may be smaller or larger than the number of points.
If the algorithm is standard then I would appreciate a pointer to open-source Java (e.g. JAMA or Apache Commons Math).
NOTE: This SO answer, Associating nearby points with a path, doesn't quite work as an answer, because there the path through the points is given.
You have a complete bipartite graph in which one part is the numbers and the other is the points. The weight of an edge in this graph is the Euclidean distance between a number and a point. Your task is to find the matching with minimal weight.
This is a known problem and has a well-known solution: the Hungarian algorithm.
From Wiki:
We are given a nonnegative n×n matrix, where the element in the i-th row and j-th column represents the cost of assigning the j-th point to the i-th number. We have to find an assignment of the points to the numbers that has minimum cost. If the goal is to find the assignment that yields the maximum cost, the problem can be altered to fit the setting by replacing each cost with the maximum cost minus that cost.
The algorithm is easier to describe if we formulate the problem using a bipartite graph. We have a complete bipartite graph G = (S, T; E) with n number vertices (S) and n point vertices (T), and each edge has a nonnegative cost c(i,j). We want to find a perfect matching with minimum cost. The Hungarian method is a combinatorial optimization algorithm which solves the assignment problem in polynomial time and which anticipated later primal-dual methods.
For the detailed algorithm and code you can take a look at the TopCoder article; this PDF may also be useful, and there is a media file describing it. (This video explains why the Hungarian algorithm works.)
The algorithm:
Step 1: Prepare the cost matrix. If the cost matrix is not square, add a dummy row (or column) whose elements are all zero.
Step 2: Subtract the minimum element of each row from all the elements of that row.
Step 3: Further modify the resulting matrix by subtracting the minimum element of each column from all the elements of that column. This gives the modified matrix.
Step 4: Draw the minimum number of horizontal and vertical lines required to cover all the zeros in the resulting matrix, and let that minimum number of lines be N. There are two possible cases:
Case 1: if N = n, where n is the order of the matrix, an optimal assignment can be made; make the assignment to get the required solution.
Case 2: if N < n, proceed to step 5.
Step 5: Determine the smallest element not covered by the N lines. Subtract this minimum element from all uncovered elements, and add it to the elements lying at the intersections of the horizontal and vertical lines. This gives the second modified matrix.
Step 6: Repeat steps 3 and 4 until case 1 of step 4 is reached.
Step 7 (making the zero assignments): Examine the rows successively until a row with exactly one zero is found. Circle (o) that zero to make the assignment, then mark a cross (x) over all zeros lying in the column of the circled zero, showing that they cannot be considered for future assignments. Continue in this manner until all the zeros have been examined, then repeat the same procedure for the columns.
Step 8: Repeat step 7 until one of the following situations arises: (i) no unmarked zero is left, in which case the process ends; or (ii) more than one unmarked zero lies in some column or row, in which case circle one of the unmarked zeros arbitrarily and mark a cross over the remaining zeros in its row and column. Repeat until no unmarked zero is left in the matrix.
Step 9: There is now exactly one circled zero in each row and each column of the matrix; the assignment corresponding to these circled zeros is the optimal assignment.
For details, see the Wikipedia article and http://www.ams.jhu.edu/~castello/362/Handouts/hungarian.pdf
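If you mainly need the optimal assignment rather than the algorithm's internals, an off-the-shelf solver does the job. A small sketch using SciPy's linear_sum_assignment, which solves the same assignment problem (Python, with made-up coordinates for illustration; the question's Java context would need an equivalent routine):

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    # label and point coordinates (hypothetical values for illustration)
    labels = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
    points = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

    cost = cdist(labels, points)                    # n x n Euclidean cost matrix
    row_ind, col_ind = linear_sum_assignment(cost)  # optimal assignment
    for r, c in zip(row_ind, col_ind):
        print(f"label {r} -> point {c} (distance {cost[r, c]:.2f})")
    print("total cost:", cost[row_ind, col_ind].sum())

The same cost-matrix construction feeds any Hungarian implementation, including the Java libraries mentioned in the question.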
If I have understood your question, each of the examples you show has a unique best solution that minimizes the sum of the squares of the distances between points and labels. There is an exponential number of mappings between points and labels, but perhaps you can try the following:
In polynomial time, compute the distance from each label to each point. In a general graph you would have to solve the all-pairs shortest-path problem. Here, as Mikola points out, you can just do it with a doubly nested loop, using coordinate geometry: pick either the Euclidean distance or the Manhattan distance.
In polynomial time, find the minimum-cost bipartite matching between points and labels. The solution to this problem will give you a matching between points and labels that minimizes the total distance.
All algorithms (shortest paths, Euclidean distance, min-cost bipartite matching) are standard and can be found on Wikipedia.
What is slightly nonstandard is the case where you find more than one bipartite matching with minimum cost. If that happens, you can try them all and see whether one of them minimizes the sum of the distances squared. If there are still ties, I recommend you treat horizontal distance as slightly shorter than vertical distance and run the algorithm again. If you still have ties, there may not be a unique solution, or you may want to treat a label to the right as slightly "closer" than a label to the left.
But when there is a unique solution, all-pairs shortest paths followed by minimum-cost bipartite matching should find it.
I haven't seen a standard algorithm for this case, so I would solve the problem pragmatically:
Assuming that the label belonging to a point is always the nearest one (possibly tied with others), you could base your approach on a region-growing algorithm (see the animated GIF). Iterate through every point (red) for each growing step (the circle around each number label); the growing step is determined by the minimal distance between a point and a label.
Use temporary lists for points and labels. Each time you find a definite pair, remove the corresponding point and label. Simply skip a label if it has more than one nearest point (here: label 2); once other point-label combinations have been resolved (here: label 3), it can be mapped in a later iteration.
After an iteration without any progress due to genuinely ambiguous situations, you can define a tie-breaking rule to resolve them (e.g. prefer top over bottom, left over right).
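A rough Python sketch of one reading of this procedure, where only uncontested nearest-point claims are committed in each round:

    import math
    from collections import Counter

    def region_grow_assign(labels, points):
        # labels, points: lists of (x, y). Each round, every remaining label
        # claims its nearest remaining point; only uncontested claims are
        # committed, and contested ones are retried once competitors are gone.
        labels = dict(enumerate(labels))
        points = dict(enumerate(points))
        assignment = {}
        while labels and points:
            nearest = {}
            for li, l in labels.items():
                nearest[li] = min(points, key=lambda pi, l=l: math.dist(l, points[pi]))
            claims = Counter(nearest.values())
            committed = [(li, pi) for li, pi in nearest.items() if claims[pi] == 1]
            if not committed:
                break    # only ambiguous ties remain; apply a preference rule here
            for li, pi in committed:
                assignment[li] = pi
                del labels[li]
                del points[pi]
        return assignment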

Intersection of N rectangles

I'm looking for an algorithm to solve this problem:
Given N rectangles in the Cartesian plane, find out whether the intersection of those rectangles is empty or not. Each rectangle can lie in any direction (its edges need not be parallel to Ox and Oy).
Do you have any suggestion for solving this problem? :) I can think of testing each pair of rectangles for intersection, but that is O(N*N) and quite slow :(
Abstract
Either sort the rectangles according to their smallest X value, or store your rectangles in an R-tree and search it.
Straight-forward approach (with sorting)
Let us denote by low_x() the smallest (leftmost) X value of a rectangle, and by high_x() the highest (rightmost) X value of a rectangle.
Algorithm:
Sort the rectangles according to low_x().   # O(n log n)
For each rectangle in the sorted array:     # O(n)
    Find its highest X point.               # O(1)
    Compare it with all rectangles whose low_x() is smaller than this.high_x()   # O(log n)
Complexity analysis
This should run in O(n log n) on uniformly distributed rectangles.
The worst case would be O(n^2), for example when the rectangles do not intersect but are stacked one above another (so their x-ranges all overlap). In this case, generalize the algorithm to have low_y() and high_y() too.
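A Python sketch of the sorted sweep, using a separating-axis test as the exact pairwise check so that rotated rectangles (given as lists of four corner points) are handled:

    def intersects(r1, r2):
        # Separating-axis test for two convex quadrilaterals.
        for poly in (r1, r2):
            n = len(poly)
            for i in range(n):
                # axis = normal of edge poly[i] -> poly[i+1]
                ax = -(poly[(i + 1) % n][1] - poly[i][1])
                ay = poly[(i + 1) % n][0] - poly[i][0]
                proj1 = [ax * x + ay * y for x, y in r1]
                proj2 = [ax * x + ay * y for x, y in r2]
                if max(proj1) < min(proj2) or max(proj2) < min(proj1):
                    return False       # found a separating axis
        return True

    def any_intersection(rects):
        # Sweep rectangles by leftmost x; each one is tested only against
        # earlier rectangles whose x-range can still overlap it.
        def low_x(r):  return min(x for x, _ in r)
        def high_x(r): return max(x for x, _ in r)

        active = []
        for r in sorted(rects, key=low_x):
            active = [a for a in active if high_x(a) >= low_x(r)]
            if any(intersects(a, r) for a in active):
                return True
            active.append(r)
        return False

This reports whether some pair of rectangles intersects, which is what the sweep above detects; the filtering of the active list is exactly the low_x()/high_x() comparison from the algorithm.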
Data-structure approach: R-Trees
R-trees (a spatial generalization of B-trees) are one of the best ways to store geospatial data, and can be useful in this problem. Simply store your rectangles in an R-tree, and you can spot intersections with a straightforward O(n log n) complexity. (n searches, log n time for each).
Observation 1: given a polygon A and a rectangle B, the intersection A ∩ B can be computed by four intersections with the half-planes corresponding to the edges of B.
Observation 2: cutting a convex polygon with a half-plane yields a convex polygon, and the first rectangle is a convex polygon. Each such cut increases the number of vertices by at most one.
Observation 3: the signed distance of the vertices of a convex polygon to a straight line is a unimodal function.
Here is a sketch of the algorithm:
Maintain the current partial intersection D in a balanced binary tree in a CCW order.
When cutting with a half-plane defined by a line L, find the two edges in D that intersect L. This can be done in logarithmic time through a clever binary or ternary search exploiting the unimodality of the signed distance to L. (This is the part I don't exactly remember.) Remove all the vertices on one side of L from D, and insert the intersection points into D.
Repeat for all edges L of all rectangles.
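A simple Python sketch of the cutting step, using the straightforward O(vertices) clip instead of the logarithmic search described above (rectangles as counter-clockwise lists of four corners):

    def clip_halfplane(poly, a, b, c):
        # Keep the part of convex CCW polygon `poly` with a*x + b*y + c >= 0.
        def side(p):
            return a * p[0] + b * p[1] + c
        out = []
        for i, p in enumerate(poly):
            q = poly[(i + 1) % len(poly)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                out.append(p)
            if (sp >= 0) != (sq >= 0):   # the edge crosses the boundary line
                t = sp / (sp - sq)
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
        return out

    def rectangles_intersection(rects):
        # Start from the first rectangle and cut with the half-plane to the
        # left of every directed edge of every other rectangle.
        poly = list(rects[0])
        for rect in rects[1:]:
            for i in range(4):
                p, q = rect[i], rect[(i + 1) % 4]
                a, b = -(q[1] - p[1]), q[0] - p[0]   # inward normal of edge p->q
                c = -(a * p[0] + b * p[1])
                poly = clip_halfplane(poly, a, b, c)
                if not poly:
                    return []        # intersection of all rectangles is empty
        return poly

With the O(vertices) clip, each cut is linear in the current vertex count; the logarithmic search described in the sketch above improves each cut to O(log v).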
This seems like a good application of Klee's measure problem. Basically, if you read http://en.wikipedia.org/wiki/Klee%27s_measure_problem, there are lower bounds of O(n log n) on the runtime of the best algorithms that can be found for rectilinear intersections.
I think you should use something like the sweep line algorithm: finding intersections is one of its applications. Also, have a look at the answers to this question.
Since the rectangles need not be parallel to the axes, it is easier to transform the problem into an already solved one: computing the intersections of the borders of the rectangles.
Build a set S which contains all borders, together with the rectangle they belong to; you get a set of tuples of the form ((x_start, y_start), (x_end, y_end), r_n), where r_n is of course the ID of the corresponding rectangle.
now use a sweep line algorithm to find the intersections of those lines
The sweep line stops at every x-coordinate in S, i.e. at all start values and all end values. For every new start coordinate, put the corresponding line into a temporary set I; for every new end coordinate, remove the corresponding line from I.
Besides adding new lines to I, you check each new line for intersections with the lines currently in I. If two lines intersect, the corresponding rectangles intersect, too.
You can find a detailed explanation of this algorithm here.
The runtime is O(n*log(n) + c*log(n)), where c is the number of intersection points of the lines in I.
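A Python sketch of this border sweep, with the standard orientation-based segment intersection test and a plain list standing in for the set I:

    def segments_intersect(s1, s2):
        # Classic orientation test for closed segments.
        def orient(a, b, c):
            v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
            return (v > 0) - (v < 0)
        def on_seg(a, b, c):
            return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0]) and
                    min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))
        (p1, p2), (p3, p4) = s1, s2
        d1, d2 = orient(p3, p4, p1), orient(p3, p4, p2)
        d3, d4 = orient(p1, p2, p3), orient(p1, p2, p4)
        if d1 != d2 and d3 != d4:
            return True
        return any(d == 0 and on_seg(*args) for d, args in
                   [(d1, (p3, p4, p1)), (d2, (p3, p4, p2)),
                    (d3, (p1, p2, p3)), (d4, (p1, p2, p4))])

    def intersecting_rectangles(borders):
        # borders: list of (segment, rect_id), segment = ((x1, y1), (x2, y2)).
        events = []
        for seg, rid in borders:
            lo, hi = sorted(seg)                  # order endpoints by x
            events.append((lo[0], 0, seg, rid))   # 0 = start event
            events.append((hi[0], 1, seg, rid))   # 1 = end event
        events.sort(key=lambda e: (e[0], e[1]))   # starts before ends at equal x
        active, hits = [], []
        for _, kind, seg, rid in events:
            if kind == 0:
                hits += [(rid, orid) for oseg, orid in active
                         if orid != rid and segments_intersect(seg, oseg)]
                active.append((seg, rid))
            else:
                active.remove((seg, rid))
        return hits    # pairs of rectangle IDs whose borders cross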
Pick the smallest rectangle from the set (or any rectangle) and go over each point within it. If one of its points also lies within all the other rectangles, the intersection is not empty. If no point of it lies within ALL the other rectangles, the intersection is empty.

How can I sample the parametric boundary of an object by N points resulting in equal arc-length parts?

The parametric boundary of an object can be extracted in Matlab using the bwtraceboundary function. It returns a Q-by-2 matrix B, where Q is the number of boundary pixels of the object and the first and second columns store the row and column coordinates of the boundary pixels, respectively.
What I want to do is to sample this boundary of Q elements with N points that divide the original boundary into segments of equal arc length.
A straightforward solution I thought of consists of computing the length L of the boundary by summing the distances between all pairs of consecutive boundary pixels; those distances are either 1 or sqrt(2). Then I divide L by N to find the desired arc length. Finally, I iterate over the boundary again, summing the distances between consecutive boundary pixels; whenever the running sum is greater than or equal to the desired arc length, the current boundary pixel is chosen as one of the N that compose the sampled boundary.
Is that a good solution? Is there a more efficient/simple solution?
Over the years, I have seen this question a seemingly vast number of times. So I wrote a little tool that will do exactly that: sample a piecewise linear or even a curvilinear (spline) arc in a general number of dimensions so that the successive points are at a uniform or specified distance along that arc.
In the case of merely piecewise linear arcs, this is rather easy. You sum up the total arc length of the curve, then interpolate in arc length; since the curve is known to be piecewise linear, this only requires linear interpolation of position as a function of cumulative arc length.
In the case of a curved arc, it is most easily done as the solution of a system of ordinary differential equations, watching for events along the way. ODE45 does this nicely.
You can use interparc, found on the MATLAB Central File Exchange, to do this for you; or, if you wish to learn to do it yourself for the simple piecewise linear case, read through the first part of the code, where I do the piecewise linear arc-length interpolation. A nice thing is that the linear case is done in fully vectorized form, so no explicit loops are necessary.
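For the piecewise linear case, the same interpolation is only a few lines of NumPy; a Python sketch of what interparc does in the linear case (not the File Exchange code itself):

    import numpy as np

    def resample_equal_arclength(B, N):
        # B: Q-by-2 array of boundary coordinates (e.g. from bwtraceboundary);
        # returns N points equally spaced in cumulative arc length.
        B = np.asarray(B, dtype=float)
        seg = np.linalg.norm(np.diff(B, axis=0), axis=1)   # segment lengths (1 or sqrt(2))
        s = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative arc length
        targets = np.linspace(0.0, s[-1], N)
        # interpolate each coordinate as a function of cumulative arc length
        return np.column_stack([np.interp(targets, s, B[:, 0]),
                                np.interp(targets, s, B[:, 1])])

For a closed boundary, append the first pixel to the end of B before resampling so the final segment back to the start is included.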
