Clustering a boolean matrix in Matlab - algorithm

Suppose we have a Boolean matrix such as the following:
0 0 1 0 0 1 0
1 1 0 0 1 0 0
0 0 0 0 0 1 1
0 0 0 0 0 1 0
0 0 0 0 0 1 1
interpreted this way: each row is a fruit and each column is a person. A '1' in position (i, j) indicates that person j would like to eat fruit i.
I would like to 'cluster' this matrix, creating sub-matrices that indicate subsets of people competing for subsets of fruit. In the example above I would like to see in output:
0 0 1 0 0 1 0
0 0 0 0 0 0 0
0 0 0 0 0 1 1
0 0 0 0 0 1 0
0 0 0 0 0 1 1
and
0 0 0 0 0 0 0
1 1 0 0 1 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Is there a simple way to do this, for example, in Matlab?
Thanks.

The description is way too informal and engineering something based on a single example is probably not a good idea.
However: if the example just shows a 2-partition (which is my interpretation), this can easily be achieved by:
- Create undirected graph G with one vertex for each row
- Iterate over all "N choose 2" row pairs (= nested i,j loop skipping symmetric duplicates)
- If the pair (rowA, rowB) shares some 1 in a column -> add edge (rowA, rowB) to G
- Compute all "connected components" of G
Any sane graph-lib will provide the primitives needed.
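The steps above can be sketched in a few lines of Python without a graph library. This is only a minimal sketch: the names cluster_rows and submatrix are made up for illustration, and a small union-find stands in for the "connected components" primitive that a graph library would provide.

```python
def cluster_rows(matrix):
    """Group row indices so that rows sharing a 1 in some column end up together."""
    n = len(matrix)
    parent = list(range(n))  # union-find forest over row indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Nested i, j loop over all row pairs, skipping symmetric duplicates.
    for i in range(n):
        for j in range(i + 1, n):
            if any(a and b for a, b in zip(matrix[i], matrix[j])):
                parent[find(i)] = find(j)  # "add edge (rowA, rowB)"

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def submatrix(matrix, rows):
    """Copy of matrix in which every row outside `rows` is zeroed."""
    keep = set(rows)
    return [row[:] if i in keep else [0] * len(row) for i, row in enumerate(matrix)]


fruit = [
    [0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 1],
]
clusters = cluster_rows(fruit)
for rows in clusters:
    print(rows, submatrix(fruit, rows))
```

On the example from the question this groups rows 0, 2, 3 and 4 together and leaves row 1 on its own, which matches the two matrices asked for.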


Algorithm for iteratively testing 2d grid connectiveness

Let's say that I have a 2D grid in which each cell can hold either a zero or a one. The grid starts off full of zeros and then ones are progressively added. At each step, I want to verify that adding the next one will not prevent the zeros from forming one connected component (using a 4-connected grid with north, east, south, and west neighbors).
What is a fast algorithm that will iteratively test a 2D grid for connectedness?
Currently I am using a flood fill at each iteration, but I feel there should be a faster algorithm that uses information from previous iterations.
Additionally, the method that places the ones will sometimes unplace the ones even if they don't disconnect the grid, so the algorithm I'm looking for needs to be able to handle that.
This is inspired by Kruskal's algorithm for maze generation.
I am defining the neighborhood of a square as its 8 surrounding squares, including the outside of the grid (the neighborhood of a corner square is its 3 surrounding squares plus the outside, so 4 "squares" total).
Put the 1s in sets so that any two neighboring 1s belong to the same set. Treat the outside of the grid as one big 1 (which means the first set contains it). When adding a 1, you only need to check its neighbors.
Below are all the possible cases. To make it easier to visualize, I'll number the sets starting from 1 and use the set number instead of the 1 in each square that contains a 1. The outside belongs to the set numbered 1. You can also use this to simplify the implementation. The brackets indicate the newly placed 1.
If the new 1 has no neighboring 1, then it belongs to a new set.
0 0 0 0 0
0 2 0 0 0
0 0 0[3]0
0 0 0 0 0
0 0 1 0 0
If it has one neighboring 1, then it belongs to the same set.
0 0 0 0 0
0 2 0 0 0
0 0[2]0 0
0 0 0 0 0
0 0 1 0 0
If it has multiple neighboring 1s, and all the neighbors that belong to the same set are directly adjacent to one another, then you can merge the sets and the new 1 belongs to the resulting set. You don't need to check for a disconnection.
 0 0 0 0 0       0 0 0 0 0
 0 2 0 0 0       0 1 0 0 0
 0 0[3]1 0  ->   0 0[1]1 0
 0 0 1 1 0       0 0 1 1 0
 0 0 1 0 0       0 0 1 0 0

 0 0 0 0 0       0 0 0 0 0
 0 2 0 0 0       0 1 0 0 0
 0 2 0 1 0  ->   0 1 0 1 0
[3]0 0 1 0      [1]0 0 1 0
 1 1 1 0 0       1 1 1 0 0
If it has multiple neighboring 1s of the same set, but they are not all direct neighbors, then you have a disconnection.
 0 0 0 0 0       0 0 0 0 0   <- first group of 0s
 0 2 0 0 0       0 1 0 0 0
 0 0[3]1 0  ->   0 0[1]1 0
 0 1 0 1 1       0 1 0 1 1
 1 0 0 0 0       1 0 0 0 0   <- second group of 0s

 0 0 0 0 0   <- first group of 0s
 0 0 1 0 0
 0 1 0 1 1
[1]1 0 0 0
 0 0 0 0 0   <- second group of 0s

 0 0 0 0 0               0 0 0 0 0
 0 2 0 0 0               0 1 0 0 0
 0 2 0 1 0      ->       0 1 0 1 0
[3]0 0 1 0              [1]0 0 1 0
 0{1}1 0 0   lone 0 ->   0{1}1 0 0
In this last example, the 1 marked {1} and the outside technically are neighbors, but not from the point of view of the newly placed 1.
In the general case, when removing a 1 that has multiple neighboring 1s, you need to check whether they are still connected after the removal (for example, by running a pathfinder between them). If not, separate them in different sets.
If you know the 0s are all connected, then you can check locally: removing a 1 will not split the set it belongs to if its neighbors are all direct neighbors (careful with the outside, though). It will if there are multiple "gaps" in its neighborhood.
In the special case where you only remove the 1s in the reverse order you added them, you can keep track of which newly added 1s join multiple sets (and even what the sets are at that moment, if you need). These will split their set when you remove them later on.
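The placement cases above can be sketched roughly as follows in Python. This is one possible reading of the answer, not a definitive implementation: the class name OnesTracker, the OUTSIDE marker and both method names are made up for illustration, the 0s are assumed to be 4-connected before each placement, only 0s at the north/east/south/west positions are counted as "gaps", and removing 1s is not handled at all.

```python
OUTSIDE = "outside"  # hypothetical id for the region outside the grid


class OnesTracker:
    # 8-neighborhood in clockwise order: N, NE, E, SE, S, SW, W, NW
    RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.grid = [[0] * cols for _ in range(rows)]
        self.parent = {OUTSIDE: OUTSIDE}  # union-find over the 1s plus the outside

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def _ring(self, r, c):
        """Set representative of each neighbor, or None where the neighbor is a 0."""
        out = []
        for dr, dc in self.RING:
            rr, cc = r + dr, c + dc
            if not (0 <= rr < self.rows and 0 <= cc < self.cols):
                out.append(self.find(OUTSIDE))  # the outside counts as one big 1
            elif self.grid[rr][cc]:
                out.append(self.find((rr, cc)))
            else:
                out.append(None)
        return out

    def would_disconnect(self, r, c):
        """True if placing a 1 at (r, c) would split the 0s (assumed connected now)."""
        ring = self._ring(r, c)
        gaps = [i for i in (0, 2, 4, 6) if ring[i] is None]
        if len(gaps) < 2:
            return False
        blocks = []  # one set id per run of 1s separating two consecutive gaps
        for k, i in enumerate(gaps):
            j = gaps[(k + 1) % len(gaps)]
            between = [ring[p % 8] for p in range(i + 1, i + 1 + (j - i - 1) % 8)]
            ones = [s for s in between if s is not None]
            if ones:
                blocks.append(ones[0])
        # The same set on both sides of a gap closes a loop around some 0s.
        return len(blocks) != len(set(blocks))

    def add(self, r, c):
        """Place a 1 at (r, c) and merge it with every neighboring set."""
        self.grid[r][c] = 1
        self.parent[(r, c)] = (r, c)
        for root in self._ring(r, c):
            if root is not None:
                self.parent[self.find(root)] = self.find((r, c))


# First "disconnection" example from the answer: a 1 at row 2, column 2.
t = OnesTracker(5, 5)
for cell in [(1, 1), (2, 3), (3, 1), (3, 3), (3, 4), (4, 0)]:
    t.add(*cell)
print(t.would_disconnect(2, 2))  # True
```

Each query only looks at the eight neighbors and does a handful of union-find lookups, so it avoids a flood fill as long as 1s are only added.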

Find cluster boundary in given matrix

A given matrix contains many clusters. A cluster is represented by 1's.
For Example:
0 1 1 1 0
1 1 1 0 0
0 0 0 1 1
1 1 0 0 1
In this example, there are 3 clusters (connected 1's horizontally or vertically).
Now suppose that the matrix is very big and contains a great many clusters.
My question is how to find the boundary of all the clusters.
For example given matrix:
0 0 1 1 0
0 1 1 1 1
0 0 1 1 0
0 0 0 1 0
Now output should be coordinates of bold locations:
0 0 1 1 0
0 1 1 1 1
0 0 1 1 0
0 0 0 1 0
Consider matrix is huge with many such clusters, suggest optimized way of finding boundaries of all clusters.
A 1 is on the boundary of some cluster only when it has a 0 as one of its immediate neighbors or when it's on the outline of the matrix, so the naive method would be outputting the coordinates of 1's on the outline, then going through all the other matrix cells and outputting the coordinates of 1's satisfying the said condition.
It doesn't look like there is a way to optimize this algorithm, though.
Suppose we have a huge matrix with only one random 1 inside it. There is no way to find it without going through each and every cell out there.
There is also no way to skip 1's of the same cluster by following the outline of it.
Consider this example:
0 0 0 1 1 1 1 1 1 1
0 0 0 1 0 0 0 0 0 1
0 1 1 1 1 1 0 1 0 1
0 0 0 0 0 1 0 0 0 1
0 0 0 0 0 1 1 1 1 1
One cluster can easily fit inside another one, so following the outline is not an option.
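The naive scan described in this answer might look like the following Python sketch. The function name boundary_cells is made up, and "immediate neighbors" is assumed to mean the four horizontal and vertical neighbors, in line with how the question defines clusters.

```python
def boundary_cells(grid):
    """Coordinates (row, col) of 1s that touch a 0 or the edge of the matrix."""
    rows, cols = len(grid), len(grid[0])
    result = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < rows and 0 <= cc < cols)
                if outside or grid[rr][cc] == 0:
                    result.append((r, c))
                    break  # one 0 or one edge is enough
    return result


example = [
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
]
print(boundary_cells(example))
# [(0, 2), (0, 3), (1, 1), (1, 4), (2, 2), (2, 3), (3, 3)]
```

Every cell is visited once and at most four neighbors are inspected per cell, so the scan is linear in the size of the matrix.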

How to create a symmetric matrix of 1's and 0's with constant row and column sum

I'm trying to find an elegant algorithm for creating an N x N matrix of 1's and 0's, under the restrictions:
- each row and each column must sum to Q (to be picked freely)
- the diagonal must be 0's
- the matrix must be symmetrical.
It is not strictly necessary for the matrix to be random (both random and non-random solutions are interesting, however), so for Q even, simply making each row a circular shift of the vector
[0 1 1 0 ... 0 0 0 ... 0 1 1] (for Q=4)
is a valid solution.
However, how to do this for Q odd? Or how to do it for Q even, but in a random fashion?
For those curious, I'm trying to test some phenomena on abstract networks.
I apologize if this has already been answered before, but none of the questions I could find had the symmetric restriction, which seems to make it much more complicated. I don't have a proof that such a matrix always exists, but I do assume so.
The object that you're trying to construct is known more canonically as an undirected d-regular graph (where d = Q). By the handshaking theorem, N and Q cannot both be odd. If Q is even, then connect vertex v to v + k modulo N for k in {-Q/2, -Q/2 + 1, ..., -1, 1, ..., Q/2 - 1, Q/2}. If Q is odd, then N is even. Construct a (Q - 1)-regular graph as before and then add connections from v to v + N/2 modulo N.
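The deterministic construction above can be written down directly. Below is a minimal Python sketch; the function name regular_circulant is made up, and it assumes 0 <= Q < N in addition to the parity condition from the handshaking theorem.

```python
def regular_circulant(n, q):
    """Symmetric 0/1 matrix with zero diagonal whose rows and columns all sum to q."""
    if not 0 <= q < n:
        raise ValueError("need 0 <= q < n")
    if n % 2 == 1 and q % 2 == 1:
        raise ValueError("n and q cannot both be odd")
    adj = [[0] * n for _ in range(n)]
    for v in range(n):
        # Connect v to v + k and v - k (mod n) for k = 1 .. floor(q / 2).
        for k in range(1, q // 2 + 1):
            adj[v][(v + k) % n] = 1
            adj[v][(v - k) % n] = 1
        # For odd q (so n is even), add the connection to the opposite vertex.
        if q % 2 == 1:
            adj[v][(v + n // 2) % n] = 1
    return adj


for row in regular_circulant(10, 5):
    print(*row)
```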
If you want randomness, there's a Markov chain whose limiting distribution is uniform on d-regular graphs. You start with any d-regular graph. Repeatedly pick vertices v, w, x, y at random. Whenever the induced subgraph looks like
v----w
x----y ,
flip it to
v    w
|    |
x    y .
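One step of that chain could be sketched in Python as below. The function name switch_step is made up, and the check that the two diagonal pairs (v, y) and (w, x) are also unconnected is my reading of "induced subgraph" in the picture. The starting graph here is just an 8-cycle chosen for illustration, and the sketch makes no attempt to decide how many steps are enough for mixing.

```python
import random


def switch_step(adj, rng):
    """Try one flip on a symmetric 0/1 adjacency matrix; return True if it happened."""
    v, w, x, y = rng.sample(range(len(adj)), 4)  # four distinct vertices
    only_vw_and_xy = (
        adj[v][w] and adj[x][y]
        and not adj[v][x] and not adj[w][y]
        and not adj[v][y] and not adj[w][x]
    )
    if not only_vw_and_xy:
        return False
    # Replace edges v-w and x-y by v-x and w-y; every degree stays the same.
    for a, b, value in ((v, w, 0), (x, y, 0), (v, x, 1), (w, y, 1)):
        adj[a][b] = adj[b][a] = value
    return True


# Start from any d-regular graph, here a cycle on 8 vertices (d = 2).
n = 8
adj = [[0] * n for _ in range(n)]
for i in range(n):
    adj[i][(i + 1) % n] = adj[(i + 1) % n][i] = 1

rng = random.Random(0)
flips = sum(switch_step(adj, rng) for _ in range(1000))
for row in adj:
    print(*row)
```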
You can perhaps always follow your circular shift algorithm, when possible.
The only condition you need to satisfy while using the circular shift algorithm is to keep the first row symmetric.
That is, place Q 1's in the first row so that entries [0,1] through [0,N-1] (assuming 0-indexed rows and columns, with entry [0,0] equal to 0) read the same forwards and backwards, a simple example being 110010011.
Hence, for N = 10 and Q = 5, you can get many possible arrangements such as:
0 1 0 0 1 1 1 0 0 1
1 0 1 0 0 1 1 1 0 0
0 1 0 1 0 0 1 1 1 0
0 0 1 0 1 0 0 1 1 1
1 0 0 1 0 1 0 0 1 1
1 1 0 0 1 0 1 0 0 1
1 1 1 0 0 1 0 1 0 0
0 1 1 1 0 0 1 0 1 0
0 0 1 1 1 0 0 1 0 1
1 0 0 1 1 1 0 0 1 0
or
0 1 1 0 0 1 0 0 1 1
1 0 1 1 0 0 1 0 0 1
1 1 0 1 1 0 0 1 0 0
0 1 1 0 1 1 0 0 1 0
0 0 1 1 0 1 1 0 0 1
1 0 0 1 1 0 1 1 0 0
0 1 0 0 1 1 0 1 1 0
0 0 1 0 0 1 1 0 1 1
1 0 0 1 0 0 1 1 0 1
1 1 0 0 1 0 0 1 1 0
But as you can see, for odd N (which means even N-1) and odd Q there can't be any such symmetric arrangement. Hope it helped.
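A short Python sketch of this circular shift idea, with made-up helper names (circulant, is_valid_first_row), using the first row of the second example above:

```python
def circulant(first_row):
    """Matrix whose row i is first_row shifted right by i positions (circularly)."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def is_valid_first_row(first_row):
    """Zero on the diagonal, and the remaining entries read the same both ways."""
    n = len(first_row)
    return first_row[0] == 0 and all(first_row[k] == first_row[-k] for k in range(1, n))


first = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]  # 0 followed by 110010011
matrix = circulant(first)
print(is_valid_first_row(first))  # True
for row in matrix:
    print(*row)
```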

How can I find a solution of binary matrix equation AX = B?

Given an m*n binary matrix A, m*p binary matrix B, where n > m what is an efficient algorithm to compute X such that AX=B?
For example:
A =
1 1 0 0 1 1 0 1 0 0
1 1 0 0 1 0 1 0 0 1
0 1 1 0 1 0 1 0 1 0
1 1 1 1 1 0 0 1 1 0
0 1 1 0 1 0 1 1 1 0
B =
0 1 0 1 1 0 1 1 0 1 0 0 1 0
0 0 1 0 1 1 0 0 0 1 0 1 0 0
0 1 1 0 0 0 1 1 0 0 1 1 0 0
0 0 1 1 1 1 0 0 0 1 1 0 0 0
1 0 0 1 0 0 1 0 1 0 0 1 1 0
Note, when I say binary matrix I mean matrix defined over the field Z_2, that is, where all arithmetic is mod 2.
If it is of any interest, this is a problem I am facing in generating suitable matrices for a random error correction code.
You can do it with row reduction: Place B to the right of A, and then swap rows (in the whole thing) to get a 1 in row 0, col 0; then xor that row to any other row that has a '1' in column 0, so you have only a single 1 in column 0. Then move to the next column; if [1,1] is zero then swap row 1 with a later row that has a 1 there, then xor rows to make it the only 1 in the column. Assuming 'A' is a square matrix and a solution exists, you will eventually have converted A to the identity matrix, and B will have been replaced with the solution X of AX=B.
If n > m, you have a system with more unknowns than equations, so you can solve for some of the unknowns, and set the others to zero. During the row reduction, if there are no values in a column which have a '1' to use (below the rows already reduced) you can skip that column and make the corresponding unknown zero (you can do this at most n-m times).
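The row reduction described above can be sketched in Python as follows. The function name solve_gf2 is made up, matrices are assumed to be lists of lists of 0/1 integers, and returning None for an inconsistent system is a choice made for this sketch rather than something the answer specifies.

```python
def solve_gf2(A, B):
    """Return one X with A X = B over Z_2 (free unknowns set to 0), or None."""
    m, n, p = len(A), len(A[0]), len(B[0])
    M = [list(A[i]) + list(B[i]) for i in range(m)]  # place B to the right of A
    pivots = []  # pivots[r] = column of the leading 1 in reduced row r
    row = 0
    for col in range(n):
        if row == m:
            break
        pivot = next((r for r in range(row, m) if M[r][col]), None)
        if pivot is None:
            continue  # no usable 1 in this column: skip it, its unknown stays 0
        M[row], M[pivot] = M[pivot], M[row]
        for r in range(m):
            if r != row and M[r][col]:
                M[r] = [a ^ b for a, b in zip(M[r], M[row])]  # xor = add mod 2
        pivots.append(col)
        row += 1
    # A row that is all zero on the A side but not on the B side has no solution.
    if any(any(M[r][n:]) for r in range(row, m)):
        return None
    X = [[0] * p for _ in range(n)]
    for r, col in enumerate(pivots):
        X[col] = M[r][n:]
    return X


A = [[1, 1, 0],
     [0, 1, 1]]
B = [[1, 0],
     [1, 1]]
X = solve_gf2(A, B)
print(X)  # [[0, 1], [1, 1], [0, 0]]
```

The result can be checked by multiplying A by X and reducing every entry mod 2.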

Fast Connected Component Identification in Undirected Graphs in R

Given a node x in an undirected graph that is known to be part of a connected component, I seek to find all nodes belonging to the component of x.
My current implementation identifies all components in the undirected graph and is therefore inefficient for large graphs. I currently use connectedComp from ggm library to do this, but would rather run a BFS from RBGL starting at node x and terminating once its component is fully explored. Any suggestions on how to do this? Also, any information on parallel graph algorithm implementations that can be called from R would be appreciated.
library("ggm")
x <- 2
> graph
1 2 3 4 5 6 7 8 9 10
1 0 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 1 0 0 0 0
3 0 1 0 0 0 1 1 1 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 1 1 0 0 0 0 0 0 0
7 0 0 1 0 0 0 0 0 0 0
8 0 0 1 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
graph_object <- as(graph, "graphNEL")
# All connected components of graph using connectedComp function:
comp_list <- connectedComp(graph_object)
> comp_list
$`1`
[1] "1"
$`2`
[1] "2" "3" "6" "7" "8"
$`3`
[1] "4"
$`4`
[1] "5"
$`5`
[1] "9"
$`6`
[1] "10"
# Extract adjacency matrix of component containing x:
comp_x <- seq_along(comp_list)[sapply(comp_list, FUN=function(list) x %in% list)]
> comp_x
[1] 2
comp_x_list <- comp_list[[comp_x]]
> comp_x_list
[1] "2" "3" "6" "7" "8"
comp_x <- graph[comp_x_list, comp_x_list]
> comp_x
2 3 6 7 8
2 0 1 1 0 0
3 1 0 1 1 1
6 1 1 0 0 0
7 0 1 0 0 0
8 0 1 0 0 0
In my opinion, preprocessing the graph with union-find will give you the best results.
It would be faster if you stored the graph as a list of edges instead of an adjacency matrix.
If you need a parallel solution, then you should read about BFS in Hadoop.
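For a single query, the BFS that the question asks about only touches the component of x. Here is a minimal sketch, written in Python rather than R purely for illustration, with a made-up function name (component_of) and the graph stored as an edge list, as suggested above:

```python
from collections import deque


def component_of(adjacency, start):
    """All nodes reachable from start, found by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(seen)


# Edge list read off the adjacency matrix in the question.
edges = [(2, 3), (2, 6), (3, 6), (3, 7), (3, 8)]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, []).append(b)
    adjacency.setdefault(b, []).append(a)

print(component_of(adjacency, 2))  # [2, 3, 6, 7, 8]
```

The search stops as soon as the component of the start node is exhausted, so isolated nodes and other components are never visited. If many different start nodes will be queried on the same graph, the union-find preprocessing is probably the better fit.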
