permuting the rows and columns of a matrix for clustering [closed] - matrix

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
i have a distance matrix that is 1000x1000 in dimension and symmetric with 0s along the diagonal. i want to form groupings of distances (clusters) by simultaneously reordering the rows and columns of the matrix. this is like reordering a matrix before you visualize its clusters with a heatmap. i feel like this should be an easy problem, but i am not having much luck finding code that does the permutations online. can anyone help?

Here is one approach that came to mind:
"Sparsify" the matrix so that only "sufficiently close" neighbors have a nonzero value in the matrix.
Use a Cuthill-McKee algorithm to compress the bandwidth of the sparse matrix.
Do a symmetric reordering of the original matrix using the results from Step 2.
Example
I will use Octave (everything I am doing should also work in Matlab) since it has a Reverse Cuthill-McKee (RCM) implementation built in.
First, we need to generate a distance matrix. This function creates a random set of points and their distance matrix:
function [x, y, A] = make_rand_dist_matrix(n)
x = rand(n, 1);
y = rand(n, 1);
A = sqrt((repmat(x, 1, n) - repmat(x', n, 1)).^2 +
(repmat(y, 1, n) - repmat(y', n, 1)).^2);
end
Let's use that to generate and visualize a 100-point example.
[x, y, A] = make_rand_dist_matrix(100);
surf(A);
Viewing the surface plot from above gets the image below (yours will be different, of course).
Warm colors represent greater distances than cool colors. Row (or column, if you prefer) i in the matrix contains the distances between point i and all points. The distance between point i and point j is in entry A(i, j). Our goal is to reorder the matrix so that the row corresponding to point i is near rows corresponding to points a short distance from i.
A simple way to sparsify A is to make all entries greater than some threshold zero, and that is what is done below, although more sophisticated approaches may prove more effective.
B = A < 0.2; % sparsify A -- only values less than 0.2 are nonzeros in B
p = symrcm(B); % compute reordering by Reverse Cuthill-McKee
surf(A(p, p)); % visualize reordered distance matrix
The matrix is now ordered in a way that brings nearby points closer together in the matrix. This result is not optimal, of course. Sparse matrix bandwidth compression is computed using heuristics, and RCM is a very simple approach. As I mentioned above, more sophisticated approaches for producing the sparse matrix may give better results, and different algorithms may also yield better results for the problem.
Just for Fun
Another way to look at what happened is to plot the points and connect a pair of points if their corresponding rows in the matrix are adjacent. Your goal is to have the lines connecting pairs of points that are near each other. For a more dramatic effect, we use a larger set of points than above.
[x, y, A] = make_rand_dist_matrix(2000);
plot(x, y); % plot the points in their initial, random order
Clearly, connections are all over the place and are occurring over a wide variety of distances.
B = A < 0.2; % sparsify A
p = symrcm(B);
plot(x(p), y(p)) % plot the reordered points
After reordering, the connections tend to be over much smaller distances and much more orderly.

Two Matlab functions do this: symrcm and
symamd.
Note that there is no unique solution to this problem. Clustering is another approach.

Related

How many paths of length n with the same start and end point can be found on a hexagonal grid?

Given this question, what about the special case when the start point and end point are the same?
Another change in my case is that we must move at every step. How many such paths can be found and what would be the most efficient approach? I guess this would be a random walk of some sort?
My think so far is, since we must always return to our starting point, thinking about n/2 might be easier. At every step, except at step n/2, we have 6 choices. At n/2 we have a different amount of choices depending on if n is even or odd. We also have a different amount of choices depending on where we are (what previous choices we made). For example if n is even and we went straight out, we only have one choice at n/2, going back. But if n is even and we didn't go straight out, we have more choices.
It is all the cases at this turning point that I have trouble getting straight.
Am I on the right track?
To be clear, I just want to count the paths. So I guess we are looking for some conditioned permutation?
This version of the combinatorial problem looks like it actually has a short formula as an answer.
Nevertheless, the general version, both this and the original question's, can be solved by dynamic programming in O (n^3) time and O (n^2) memory.
Consider a hexagonal grid which spans at least n steps in all directions from the target cell.
Introduce a coordinate system, so that every cell has coordinates of the form (x, y).
Let f (k, x, y) be the number of ways to arrive at cell (x, y) from the starting cell after making exactly k steps.
These can be computed either recursively or iteratively:
f (k, x, y) is just the sum of f (k-1, x', y') for the six neighboring cells (x', y').
The base case is f (0, xs, ys) = 1 for the starting cell (xs, ys), and f (0, x, y) = 0 for every other cell (x, y).
The answer for your particular problem is the value f (n, xs, ys).
The general structure of an iterative solution is as follows:
let f be an array [0..n] [-n-1..n+1] [-n-1..n+1] (all inclusive) of integers
f[0][*][*] = 0
f[0][xs][ys] = 1
for k = 1, 2, ..., n:
for x = -n, ..., n:
for y = -n, ..., n:
f[k][x][y] =
f[k-1][x-1][y] +
f[k-1][x][y-1] +
f[k-1][x+1][y] +
f[k-1][x][y+1]
answer = f[n][xs][ys]
OK, I cheated here: the solution above is for a rectangular grid, where the cell (x, y) has four neighbors.
The six neighbors of a hexagon depend on how exactly we introduce a coordinate system.
I'd prefer other coordinate systems than the one in the original question.
This link gives an overview of the possibilities, and here is a short summary of that page on StackExchange, to protect against link rot.
My personal preference would be axial coordinates.
Note that, if we allow standing still instead of moving to one of the neighbors, that just adds one more term, f[k-1][x][y], to the formula.
The same goes for using triangular, rectangular, or hexagonal grid, for using 4 or 8 or some other subset of neighbors in a grid, and so on.
If you want to arrive to some other target cell (xt, yt), that is also covered: the answer is the value f[n][xt][yt].
Similarly, if you have multiple start or target cells, and you can start and finish at any of them, just alter the base case or sum the answers in the cells.
The general layout of the solution remains the same.
This obviously works in n * (2n+1) * (2n+1) * number-of-neighbors, which is O(n^3) for any constant number of neighbors (4 or 6 or 8...) a cell may have in our particular problem.
Finally, note that, at step k of the main loop, we need only two layers of the array f: f[k-1] is the source layer, and f[k] is the target layer.
So, instead of storing all layers for the whole time, we can store just two layers, as we don't need more: one for odd k and one for even k.
Using only two layers is as simple as changing all f[k] and f[k-1] to f[k%2] and f[(k-1)%2], respectively.
This lowers the memory requirement from O(n^3) down to O(n^2), as advertised in the beginning.
For a more mathematical solution, here are some steps that would perhaps lead to one.
First, consider the following problem: what is the number of ways to go from (xs, ys) to (xt, yt) in n steps, each step moving one square north, west, south, or east?
To arrive from x = xs to x = xt, we need H = |xt - xs| steps in the right direction (without loss of generality, let it be east).
Similarly, we need V = |yt - ys| steps in another right direction to get to the desired y coordinate (let it be south).
We are left with k = n - H - V "free" steps, which can be split arbitrarily into pairs of north-south steps and pairs of east-west steps.
Obviously, if k is odd or negative, the answer is zero.
So, for each possible split k = 2h + 2v of "free" steps into horizontal and vertical steps, what we have to do is construct a path of H+h steps east, h steps west, V+v steps south, and v steps north. These steps can be done in any order.
The number of such sequences is a multinomial coefficient, and is equal to n! / (H+h)! / h! / (V+v)! / v!.
To finally get the answer, just sum these over all possible h and v such that k = 2h + 2v.
This solution calculates the answer in O(n) if we precalculate the factorials, also in O(n), and consider all arithmetic operations to take O(1) time.
For a hexagonal grid, a complicating feature is that there is no such clear separation into horizontal and vertical steps.
Still, given the starting cell and the number of steps in each of the six directions, we can find the final cell, regardless of the order of these steps.
So, a solution can go as follows:
Enumerate all possible partitions of n into six summands a1, ..., a6.
For each such partition, find the final cell.
For each partition where the final cell is the cell we want, add multinomial coefficient n! / a1! / ... / a6! to the answer.
Just so, this takes O(n^6) time and O(1) memory.
By carefully studying the relations between different directions on a hexagonal grid, perhaps we can actually consider only the partitions which arrive at the target cell, and completely ignore all other partitions.
If so, this solution can be optimized into at least some O(n^3) or O(n^2) time, maybe further with decent algebraic skills.

Searching the k nearest elements [duplicate]

This question already has answers here:
Finding K-nearest neighbors and its implementation
(1 answer)
Efficiently compute pairwise squared Euclidean distance in Matlab
(2 answers)
Closed 5 years ago.
I have a bunch (more or less 3500) of vectors with 4096 components and I need a fast method to see, given an input of another vector with the same length, which are the nearest N.
I would like to use some matlab functions to do that. Is this ok for what I need?
https://uk.mathworks.com/help/stats/classificationknn-class.html
What you are suggesting is a clustering function, which should make N clusters out of all your vectors. Not sure this is what you want. If you simply want N minimum distances between the bunch of vectors, you can do it manually easy enough. Something like:
distances = matrixOfvectors - yourVector; % repmat(your...) if you have older Matlab.
[val, pos] = sort(sum(distances.^2, 2)); % Sum might need 1 instead of 2, depends whether vectors are rows or columns.
minVectors = pos(1:N); % Take indices of N nearest to get which vectors are the closest.
If N is small, say 3 or less, it would be slightly faster to avoid sort and just simply compare each new vector with 2nd biggest first, then with 1st or 3rd depending on the outcome.

Algorithm for finding all combinations of (x,y,z,j) that satisfy w+x = y+j, where w,x,y,j are integers between -N...N inclusive

I'm working on a problem that requires an array (dA[j], j=-N..N) to be calculated from the values of another array (A[i], i=-N..N) based on a conservation of momentum rule (x+y=z+j). This means that for a given index j for all the valid combinations of (x,y,z) I calculate A[x]A[y]A[z]. dA[j] is equal to the sum of these values.
I'm currently precomputing the valid indices for each dA[j] by looping x=-N...+N,y=-N...+N and calculating z=x+y-j and storing the indices if abs(z) <= N.
Is there a more efficient method of computing this?
The reason I ask is that in future I'd like to also be able to efficiently find for each dA[j] all the terms that have a specific A[i]. Essentially to be able to compute the Jacobian of dA[j] with respect to dA[i].
Update
For the sake of completeness I figured out a way of doing this without any if statements: if you parametrize the equation x+y=z+j given that j is a constant you get the equation for a plane. The constraint that x,y,z need to be integers between -N..N create boundaries on this plane. The points that define this boundary are functions of N and j. So all you have to do is loop over your parametrized variables (s,t) within these boundaries and you'll generate all the valid points by using the vectors defined by the plane (s*u + t*v + j*[0,0,1]).
For example, if you choose u=[1,0,-1] and v=[0,1,1] all the valid solutions for every value of j are bounded by a 6 sided polygon with points (-N,-N),(-N,-j),(j,N),(N,N),(N,-j), and (j,-N).
So for each j, you go through all (2N)^2 combinations to find the correct x's and y's such that x+y= z+j; the running time of your application (per j) is O(N^2). I don't think your current idea is bad (and after playing with some pseudocode for this, I couldn't improve it significantly). I would like to note that once you've picked a j and a z, there is at most 2N choices for x's and y's. So overall, the best algorithm would still complete in O(N^2).
But consider the following improvement by a factor of 2 (for the overall program, not per j): if z+j= x+y, then (-z)+(-j)= (-x)+(-y) also.

Labelling a grid using n labels, where every label neighbours every other label

I am trying to create a grid with n separate labels, where each cell is labelled with one of the n labels such that all labels neighbour (edge-wise) all other labels somewhere in the grid (I don't care where). Labels are free to appear as many times as necessary, and I'd like the grid to be as small as possible. As an example, here's a grid for five labels, 1 to 5:
3 2 4
5 1 3
2 4 5
While generating this by hand is not too bad for small numbers of labels, it appears to be very hard to generate a grid of reasonable size for larger numbers and so I'm looking to write a program to generate them, without having to resort to a brute-force search. I imagine this must have been investigated before, but the closest I've found are De Bruijn tori, which are not quite what I'm looking for. Any help would be appreciated.
EDIT: Thanks to Benawii for the following improved description:
"Given an integer n, generate the smallest possible matrix where for every pair (x,y) where x≠y and x,y ∈ {1,...,n} there exists a pair of adjacent cells in the matrix whose values are x and y."
You can experiment with a simple greedy algorithm.
I don't think that I'm able to give you a strict mathematical prove, at least because the question is not strictly defined, but the algorithm is quite intuitive.
First, if you have 1...K numbers (K labels) then you need at least K*(K-1)/2 adjacent cells (connections) for full coverage. A matrix of size NxM generates (N-1)*M+(M-1)*N=2*N*M-(N+M) connections.
Since you didn't mention what you understand under 'smallest matrix', let's assume that you meant the area. In that case it is obvious that for the given area the square matrix will generate bigger number of connections because it has more 'inner' cells adjacent to 4 others. For example, for area 16 the matrix 4x4 is better than 2x8. 'Better' is intuitive - more connections and more chances to reach the goal. So lets use target square matrixes and expand them if needed. The above formula will become 2*N*(N-1).
Then we can experiment with the following greedy algorithm:
For input number K find the N such that 2*N*(N-1)>K*(K-1)/2. A simple school equation.
Keep an adjacency matrix M, set M[i][i]=1 for all i, and 0 for the rest of the pairs.
Initialize a resulting matrix R of size NxN, fill with 'empty value' markers, for example with -1.
Start from top-left corner and iterate right-down:
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
R[i][j];
for each such R[i][j] (which is -1 now) find such a value which will 'fit best'. Again, 'fit best' is an intuitive definition, here we understand such a value that will contribute to a new unused connection. For that reason create the set of already filled cell neighbor numbers - S, its size is 2 at most (upper and left neighbor). Then find first k such that M[x][k]=0 for both numbers x in S. If no such number then try at least one new connection, if no number even for one then both neighbors are completely covered, put some number from uncovered here - probably the one in 'worst situation' - such x where Sum(M[x][i]) is the smallest. You should also choose the one in 'worst situation' when there are several ones to choose from in any case.
After setting the value for R[i][j] don't forget to mark the new connections with numbers x from S - M[R[i][j]][x] = M[x][R[i][j]] = 1.
If the matrix is filled and there are still unmarked connections in M then append another row to the matrix and continue. If all the connections are found before the end then remove extra rows.
You can check this algorithm and see what will happen. Step 5 is the place for playing around, particularly in guessing which one to choose in equal situation (several numbers can be in equally 'worst situation').
Example:
for K=6 we need 15 connections:
N=4, we need 4x4 square matrix. The theory says that 4x3 matrix has 17 connections, so it can possibly fit, but we will try 4x4.
Here is the output of the algorithm above:
1234
5615
2413
36**
I'm not sure if you can do by 4x3, maybe yes... :)

Fastest way to check if a vector increases matrix rank

Given an n-by-m matrix A, with it being guaranteed that n>m=rank(A), and given a n-by-1 column v, what is the fastest way to check if [A v] has rank strictly bigger than A?
For my application, A is sparse, n is about 2^12, and m is anywhere in 1:n-1.
Comparing rank(full([A v])) takes about a second on my machine, and I need to do it tens of thousands of times, so I would be very happy to discover a quicker way.
There is no need to do repeated solves IF you can afford to do ONE computation of the null space. Just one call to null will suffice. Given a new vector V, if the dot product with V and the nullspace basis is non-zero, then V will increase the rank of the matrix. For example, suppose we have the matrix M, which of course has a rank of 2.
M = [1 1;2 2;3 1;4 2];
nullM = null(M')';
Will a new column vector [1;1;1;1] increase the rank if we appended it to M?
nullM*[1;1;1;1]
ans =
-0.0321573705742971
-0.602164651199413
Yes, since it has a non-zero projection on at least one of the basis vectors in nullM.
How about this vector:
nullM*[0;0;1;1]
ans =
1.11022302462516e-16
2.22044604925031e-16
In this case, both numbers are essentially zero, so the vector in question would not have increased the rank of M.
The point is, only a simple matrix-vector multiplication is necessary once the null space basis has been generated. If your matrix is too large (and the matrix nearly of full rank) that a call to null will fail here, then you will need to do more work. However, n = 4096 is not excessively large as long as the matrix does not have too many columns.
One alternative if null is too much is a call to svds, to find those singular vectors that are essentially zero. These will form the nullspace basis that we need.
I would use sprank for sparse matrixes. Check it out, it might be faster than any other method.
Edit : As pointed out correctly by #IanHincks, it is not the rank. I am leaving the answer here, just in case someone else will need it in the future.
Maybe you can try to solve the system A*x=v, if it has a solution that means that the rank does not increase.
x=(B\A)';
norm(A*x-B) %% if this is small then the rank does not increase

Resources