Finding the minimum number of subrectangles in a binary array - algorithm

I'm trying to program an algorithm to find the minimum number of rectangles needed to cover every "1" in a randomly generated binary (0/1) matrix. I've found some information online, but there are a few caveats with this specific problem that have me stumped.
Each group should contain the largest possible number of 'ones' and no blank cells.
The number of 'ones' in a group must be a power of 2, i.e. a group can contain 16, 8, 4, 2, or 1 'ones'.
Grouping is carried out in decreasing order, meaning one has to try to group for 8 (octet) first, then for 4 (quad), followed by 2, and lastly 1 (isolated 'ones').
Grouping is done either horizontally or vertically, in rectangles/squares. Diagonal grouping of 'ones' is not permitted.
The same element(s) may repeat in multiple groups only if this increases the size of the group.
The elements around the edges of the matrix, including the four corners, are considered to be adjacent and can be grouped together.
I haven't had any issues generating the matrix, and I've read other sources on algorithms for finding the biggest subrectangle in a matrix, but none of them handle all of these requirements, especially not overlapping groups. My biggest problem is the last requirement: figuring out how to express the outer edges being adjacent algorithmically.
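For the wrap-around requirement, one common trick is to compute neighbours modulo the grid dimensions, so the matrix behaves like a torus and the edge/corner adjacency falls out automatically. A minimal Python sketch (the function name and 4-neighbour convention are my own, not from any particular source):

def neighbors(r, c, rows, cols):
    """4-neighbours on a toroidal grid: rows and columns wrap around,
    which is exactly the Karnaugh-map notion of edge adjacency."""
    return [((r - 1) % rows, c), ((r + 1) % rows, c),
            (r, (c - 1) % cols), (r, (c + 1) % cols)]

print(neighbors(0, 0, 4, 4))  # [(3, 0), (1, 0), (0, 3), (0, 1)]

The same % rows / % cols arithmetic works when sliding candidate rectangles across the map, so a group can straddle the edges and corners.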

Once I learned the actual term for this type of problem, Karnaugh map, I was able to look up other resources on it. Thanks, Mathur.
I used this website as a reference, specifically for the C++ implementation.
https://www.codeproject.com/Articles/649849/A-Cplusplus-Karnaugh-Map-Minimizer-Infinite-Variab

Related

Find every possible permutation of bits in a 2D array that has a single group of contiguous 1s

Below I have represented 2 permutations of bits in a 2D bit array (1s are red). The matrix on the left has a single group of contiguous 1s but the right matrix has 2.
I would like to loop through every possible permutation of binary values in such an array that has a single group of contiguous 1s. I am aware that for a 10×7 grid like the one above there are 2^(10 × 7) = 2^70 permutations when you include non-contiguous ones, but my hope is that by excluding non-contiguous permutations I will be able to go through them all in reasonable CPU time.
Speaking of reasonableness, I am also interested in an algorithm to determine how many permutations are contiguous.
My question is similar to, but different from, these:
2D Bit matrix with every possible combination
Finding Contiguous Areas of Bits in 2D Bit Array
Any help is appreciated. I'm a bit stuck.
So, I found that the OEIS (Online Encyclopedia of Integer Sequences) has a sequence for n = 0..7 for the "number of nonzero n × n binary arrays with all 1's connected" (A059525). They provide no formula, though, except for grids fixed at 1 cell wide (triangular numbers), or 2 or 3 cells wide. There's a sequence for 4 × n too, but no formula.
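For small grids you can sanity-check counts like A059525 by brute force. A Python sketch I'm adding for illustration (only feasible for tiny grids, since it enumerates all 2^(w*h) patterns):

from itertools import product

def count_connected(w, h):
    """Count nonzero w-by-h binary arrays whose 1s form a single
    4-connected group (cf. A059525 for the square case)."""
    total = 0
    for bits in product((0, 1), repeat=w * h):
        ones = {(i % w, i // w) for i, b in enumerate(bits) if b}
        if not ones:
            continue
        # flood fill from one cell; connected iff the fill reaches every 1
        stack, seen = [next(iter(ones))], set()
        while stack:
            x, y = stack.pop()
            if (x, y) in seen:
                continue
            seen.add((x, y))
            stack += [p for p in ((x+1, y), (x-1, y), (x, y+1), (x, y-1))
                      if p in ones]
        total += (seen == ones)
    return total

print([count_connected(n, n) for n in (1, 2)])  # [1, 13], matching A059525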
There are two approaches I can think of. One is to iterate through all possible bit patterns, devise a test for non-contiguous groups, and find some method for skipping over large regions guaranteed to be non-contiguous.
A second approach is to attempt to build all sets of contiguous groups so you don't need to test. This is the approach I would take:
Let n = width * height
Enumerate blocks left to right, top to bottom from 0 to n - 1
Fix a block at position 0.
Generate all contiguous solutions between 1 and n blocks extending from position 0
Omit position 0 and fix a block at position 1
Find all contiguous solutions between 1 and n - 1 blocks extending from position 1
Continue until you've reached position n (a sketch of this enumeration appears after the placement rules below)
You can place your pieces according to the following rules, backtracking for the next placement at each depth:
To left of most recently placed piece if placed in row above prior piece provided that no other neighbors exist for that vacancy.
Above left-most available piece in row of most recently placed piece if no other neighbors exist for that vacancy.
To right of most recently placed piece (if adjacent piece exists)
On the next row, farthest left vacancy such that upper row has a piece above any contiguous right remaining neighbors
Next move for any backtracked position is first available move to the right of, or in the row below, the backtracked position (obeying prior 4 rules)
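Here is a minimal Python sketch of steps 3 to 7 above (my own code; it fixes each start cell in turn and grows connected subsets from it, forbidding cells below the start index and cells already skipped, so each subset is produced exactly once). It does not implement the five placement rules, which are an optimisation over the same idea:

def connected_subsets(width, height):
    """Yield every nonempty 4-connected set of cell indices exactly once.
    Cells are numbered left to right, top to bottom, from 0 to n-1."""
    n = width * height
    adj = [[] for _ in range(n)]
    for c in range(n):
        x, y = c % width, c // width
        if x + 1 < width:  adj[c].append(c + 1)
        if x > 0:          adj[c].append(c - 1)
        if y + 1 < height: adj[c].append(c + width)
        if y > 0:          adj[c].append(c - width)

    def grow(start, subset, candidates, forbidden):
        yield frozenset(subset)
        candidates = set(candidates)
        forbidden = set(forbidden)
        while candidates:
            v = candidates.pop()
            # new candidates: unseen neighbours of v above the start index
            ext = {u for u in adj[v]
                   if u > start and u not in subset and u not in forbidden}
            yield from grow(start, subset | {v}, candidates | ext, forbidden)
            forbidden.add(v)  # v is excluded from all later branches

    for s in range(n):  # fix each start cell in turn (steps 3-7 above)
        yield from grow(s, {s}, {u for u in adj[s] if u > s}, set())

print(sum(1 for _ in connected_subsets(2, 2)))  # 13 connected patterns

For a 10×7 grid this is still a lot of subsets, but you only pay for the contiguous ones.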

Is there an algorithm to extract the minimum number of Cartesian products from a set of formulas?

For example, we have a set of formulas as below:
B*2*j
B*3*i
B*3*j
C*2*j
C*3*i
C*3*j
D*2*i
D*2*j
D*3*i
D*3*j
And we could have three Cartesian products to represent the formulas above:
D*(2+3)*(i+j)
(B+C)*3*(i+j)
(B+C)*2*j
So the total number is 3. And we could also have:
3*(B+C+D)*(i+j)
2*(B+C)*j
2*D*(i+j)
which is also 3.
I want to ask: is there an algorithm to determine the minimum number of Cartesian products needed to represent a set of formulas, and to come up with these products?
First, I'll write a set of formulas as terms separated by +, since the transformation you're looking for makes sense algebraically (apart from the fact that you don't want to combine numbers like 2+3 into 5).
The basic operation that you have available is factorising: combining two terms like ABC+ABD into AB(C+D). Based on your comment, you can only generate new factors that consist of a sum of single-factor terms, like C+D in the previous example; you're not allowed to factorise e.g. ABCD+ABDE into AB(CD+DE).
You can factorise 2 k-factor terms if and only if they share exactly k-1 factors. (E.g. k=3 in my ABC+ABD example.) Every such factorisation reduces the number of terms in the set by 1: 2 are removed and 1 is added back in.
Doing this multiple times works when combining 3 or more terms: ABC+ABD+ABE can first be factorised into AB(C+D)+ABE and then those 2 terms factorised again into AB(C+D+E). Notice that it doesn't matter in which order we list terms in a sum or factors in a product, and nor does it matter in which order we perform factorisation steps when building a factor containing 3 or more terms.
We can then frame the problem as a search problem in a graph, in which the start vertex corresponds to the original formula (B*2*j + B*3*i + ... + D*3*j in your example) and from each vertex v there emanate arcs to its child vertices, which each correspond to the result of performing some factorisation on v. v will have a child vertex for each possible factorisation that could be performed on it; if there are m terms in v, then this means it could have up to m(m-1)/2 children in the worst case, because it could be that all m terms share a full complement of k-1 factors, meaning that any pair of them could be combined.
If a vertex has no pair of terms that can be combined via factorisation then it is a "leaf" -- it has no children, and can't be processed further. What we want to find is a leaf vertex that has the fewest number of terms. Since every factorisation, corresponding to an arc in the graph, reduces the number of terms by 1, this is equivalent to searching for a deepest-possible vertex. This can be done using DFS or BFS. Note however that the same expression (vertex) can be generated many times over using this approach, so it will be crucial for performance to maintain a hashtable seen that records all expressions that have already been processed; then if we visit a vertex, try to generate a child for it, and see that this child is already in seen, we avoid visiting this child a second time.
To mitigate against the phenomenon of the same expression being generated via multiple different orderings of the same set of factorisations, you can add a rule: order v's child factorisations somehow, so that if there are n children they correspond to factorisations 1, 2, ..., n in this ordering, and record in a separate "already skipped" field in each child vertex the set of earlier (in the ordering) factorisations that were skipped over to generate this child. Then, when visiting a vertex, avoid generating any of its "already skipped" factorisations as children, since doing so would create a vertex that is identical to some other existing vertex (by performing the same pair of operations in reverse order).
There are probably other speedups available that will reduce the number of duplicate vertices that are generated in the first place, but this should be enough to get results for small problems.
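To make the search concrete, here is a small Python sketch of the DFS with a seen set (my own representation, not from the question: a state is a frozenset of terms, a term is a frozenset of factors, and a factor is a frozenset of the atoms summed in it; the skip-ordering optimisation is omitted for brevity):

from itertools import combinations

def min_factorisation(raw_terms):
    """DFS over all factorisations; returns a state with the fewest terms.
    raw_terms: e.g. [('B','2','j'), ('B','3','i'), ...]"""
    start = frozenset(frozenset(frozenset([a]) for a in t) for t in raw_terms)
    seen = {start}
    best = [start]

    def dfs(state):
        if len(state) < len(best[0]):
            best[0] = state
        for t1, t2 in combinations(state, 2):
            common = t1 & t2
            if len(t1) != len(t2) or len(common) != len(t1) - 1:
                continue  # must share exactly k-1 factors
            (f1,), (f2,) = t1 - common, t2 - common
            child = (state - {t1, t2}) | {common | {f1 | f2}}
            if child not in seen:
                seen.add(child)
                dfs(child)

    dfs(start)
    return [sorted(sorted(f) for f in term) for term in best[0]]

terms = [('B','2','j'), ('B','3','i'), ('B','3','j'), ('C','2','j'),
         ('C','3','i'), ('C','3','j'), ('D','2','i'), ('D','2','j'),
         ('D','3','i'), ('D','3','j')]
for term in min_factorisation(terms):
    parts = ['(' + '+'.join(f) + ')' if len(f) > 1 else f[0] for f in term]
    print(' * '.join(parts))
# prints 3 products, e.g. (2+3) * D * (i+j)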
Write down your sum in matrix form. Then what you are asking for is the rank of that matrix, and a corresponding decomposition into dyadic products. This decomposition is far from unique.
            [ 3 5 ]   [ i ]
[ B C D ] * [ 3 5 ] * [ j ]
            [ 5 5 ]
As one can see, the matrix in the middle has full rank 2.
If you intend to use 2 and 3 also as variables, then you are asking to decompose a tensor of order 3 into a minimum number of terms that factorize, i.e., that are tensor products of vectors.
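As a quick illustration (assuming numpy is available, and treating 2 and 3 as numbers so that coefficients add up):

import numpy as np

# Rows are B, C, D; columns are i, j.  Entry (r, c) sums the numeric
# factors of all terms containing that letter/index pair, e.g. the
# B row gets 3 under i (from B*3*i) and 2+3 = 5 under j.
M = np.array([[3, 5],
              [3, 5],
              [5, 5]])
print(np.linalg.matrix_rank(M))  # 2: two dyadic products suffice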

How to find the minimum number of groups needed for grouping a list of elements?

For example:
There is a list of elements, e.g. a sorted integer list {1,2,3,4,6,8,11}.
I want to find the minimum number of groups such that the members of each group have a qualified difference between each other, e.g. less than or equal to 2.
In this case, we can see the answer should be 4, and one possible solution is {1,2,3},{4,6},{8},{11}.
How to write the algorithm to find the minimum number?
Furthermore, we can consider this in 2 dimensions:
There is a list of points {(3,3),(3,6),(6,9)}. What is the minimum number of squares with side length 3 needed to cover all those points? The answer should be 2 here. But how do I find it by programming?
I am asking this question for solving the question Square Fields in Google Code Jam Practice Contest, but any answer solving the first part of this question is good enough for me.
For 1-D: sort the elements in ascending order and greedily form groups, starting each new group at the smallest element not yet covered by any group and adding every element within the qualified difference of it.
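A minimal Python sketch of that greedy pass (function name mine):

def min_groups(values, max_diff=2):
    """Greedy 1-D grouping: sort, then start a new group whenever the
    next element is more than max_diff above the group's first element."""
    groups = []
    for v in sorted(values):
        if not groups or v - groups[-1][0] > max_diff:
            groups.append([v])  # v becomes the new group's left bound
        else:
            groups[-1].append(v)
    return groups

print(min_groups([1, 2, 3, 4, 6, 8, 11]))  # [[1, 2, 3], [4, 6], [8], [11]]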

Solving ACM ICPC - SEERC 2009

I have been sitting on this for almost a week now. Here is the question in a PDF format.
I could only think of one idea so far but it failed. The idea was to recursively create all connected subgraphs which works in O(num_of_connected_subgraphs), but that is way too slow.
I would really appreciate someone giving me a direction. I'm inclined to think that the only way is dynamic programming, but I can't seem to figure out how to do it.
OK, here is a conceptual description for the algorithm that I came up with:
Form an array of the (x,y) board map from -7 to 7 in both dimensions and place the opponent's pieces on it.
Starting with the first row (lowest Y value, -N):
enumerate all possible combinations of the 2nd player's pieces on the row, eliminating only those that conflict with the opponent's pieces.
for each combination on this row:
--group connected pieces into separate networks and number these networks starting with 1, ascending
--encode the row as a vector using:
= 0 for any unoccupied or opponent position
= (1-8) for the network group that piece/position is in
--give each such grouping a COUNT of 1, and add it to a dictionary/hashset using the encoded vector as its key
Now, for each succeeding row, in ascending order {y=y+1}:
For every entry in the previous row's dictionary:
--If the entry has exactly 1 group, add its COUNT to TOTAL
--enumerate all possible combinations of the 2nd player's pieces on the current row, eliminating only those that conflict with the opponent's pieces. (change:) you should skip the initial combination (where all entries are zero) for this step, as the step above actually covers it. For each such combination on the current row:
+ produce a grouping vector as described above
+ compare the current row's group-vector to the previous row's group-vector from the dictionary:
++ if there are any group-*numbers* from the previous row's vector that are not adjacent to any groups in the current row's vector, *for at least one value of X*, then skip to the next combination.
++ any groups for the current row that are adjacent to any groups of the previous row acquire the lowest such group number
++ any groups for the current row that are not adjacent to any groups of the previous row are assigned an unused group number
+ Re-Normalize the group-number assignments for the current row's combination (**) and encode the vector, giving it a COUNT equal to the previous row-vector's COUNT
+ Add the current row's vector to the dictionary for the current row, using its encoded vector as the key. If it already exists, then add its COUNT to the COUNT for the pre-existing entry
Finally, for every entry in the dictionary for the last row:
If the entry has exactly one group, then add its COUNT to TOTAL
**: Re-Normalizing simply means re-assigning the group numbers so as to eliminate any permutations in the grouping pattern. Specifically, new group numbers should be assigned in increasing order, from left to right, starting from one. So for example, if your grouping vector looked like this after matching it against the previous row:
2 0 5 5 0 3 0 5 0 7 ...
it should be re-mapped to this normal form:
1 0 2 2 0 3 0 2 0 4 ...
Note that as in this example, after the first row, the groupings can be discontiguous. This relationship must be preserved, so the two groups of "5"s are re-mapped to the same number ("2") in the re-normalization.
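As an illustration, the re-normalization step might look like this in Python (a sketch, assuming the grouping vector is a plain list of ints):

def normalize(groups):
    """Re-map group numbers so new numbers appear in increasing order
    from left to right, starting at 1; 0 (empty) is preserved, and
    discontiguous runs of the same old number keep the same new number."""
    mapping = {0: 0}
    out = []
    for g in groups:
        if g not in mapping:
            mapping[g] = max(mapping.values()) + 1
        out.append(mapping[g])
    return out

print(normalize([2, 0, 5, 5, 0, 3, 0, 5, 0, 7]))
# -> [1, 0, 2, 2, 0, 3, 0, 2, 0, 4]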
OK, a couple of notes:
A. I think that this approach is correct, but I am really not certain, so it will definitely need some vetting, etc.
B. Although it is long, it's still pretty sketchy. Each individual step is non-trivial in itself.
C. Although there are plenty of individual optimization opportunities, the overall algorithm is still pretty complicated. It is a lot better than brute-force, but even so, my back-of-the-napkin estimate is still around (2.5 to 10)*10^11 operations for N=7.
So it's probably tractable, but still a long way off from doing 74 cases in 3 seconds. I haven't read all of the detail of Peter de Revaz's answer, but his idea of rotating the "diamond" might be workable for my algorithm. Although it would increase the complexity of the inner loop, it may drop the size of the dictionaries (and thus the number of grouping-vectors to compare against) by as much as 100x, though it's really hard to tell without actually trying it.
Note also that there isn't any dynamic programming here. I couldn't come up with an easy way to leverage it, so that might still be an avenue for improvement.
OK, I enumerated all possible valid grouping-vectors to get a better estimate of (C) above, which lowered it to about 3.5*10^9 operations for N=7. That's much better, but still about an order of magnitude over what you probably need to finish 74 tests in 3 seconds. That does depend on the tests, though; if most of them are smaller than N=7, it might be able to make it.
Here is a rough sketch of an approach for this problem.
First note that the lattice points need |x|+|y| < N, which results in a diamond shape going from coordinates 0,6 to 6,0 i.e. with 7 points on each side.
If you imagine rotating this diamond by 45 degrees, you will end up with a 7*7 square lattice, which may be easier to think about. (Although note that there are also intermediate 6-high columns.)
For example, for N=3 the original lattice points are:
..A..
.BCD.
EFGHI
.JKL.
..M..
Which rotate to
A D I
C H
B G L
F K
E J M
On the (possibly rotated) lattice I would attempt to solve by dynamic programming the problem of counting the number of ways of placing armies in the first x columns such that the last column is a certain string (plus a boolean flag to say whether some points have been placed yet).
The string contains a digit for each lattice point.
0 represents an empty location
1 represents an isolated point
2 represents the first of a new connected group
3 represents an intermediate in a connected group
4 represents the last in a connected group
During the algorithm the strings can represent shapes containing multiple connected groups, but we reject any transformations that leave an orphaned connected group.
When you have placed all columns you need to only count strings which have at most one connected group.
For example, the string for the first 5 columns of the shape below is:
....+ = 2
..+++ = 3
..+.. = 0
..+.+ = 1
..+.. = 0
..+++ = 3
..+++ = 4
The middle + is currently unconnected, but may become connected by a later column, so it still needs to be tracked. (In this diagram I am also assuming up/down/left/right 4-connectivity. The rotated lattice should really use a diagonal connectivity, but I find that a bit harder to visualise and I am not entirely sure it is still a valid approach with this connectivity.)
I appreciate that this answer is not complete (and could do with lots more pictures/explanation), but perhaps it will prompt someone else to provide a more complete solution.

Find the "largest" dense sub matrix in a large sparse matrix

Given a large sparse matrix (say 10k+ by 1M+) I need to find a subset, not necessarily contiguous, of the rows and columns that form a dense matrix (all non-zero elements). I want this submatrix to be as large as possible (not the largest sum, but the largest number of elements) within some aspect-ratio constraints.
Are there any known exact or approximate solutions to this problem?
A quick scan on Google seems to give a lot of close-but-not-exactly results. What terms should I be looking for?
edit: Just to clarify; the submatrix need not be contiguous. In fact the row and column order is completely arbitrary, so adjacency is completely irrelevant.
A thought based on Chad Okere's idea
Order the rows from largest count to smallest count (not necessary but might help perf)
Select two rows that have a "large" overlap
Add all other rows that won't reduce the overlap
Record that set
Add whatever row reduces the overlap by the least
Repeat at #3 until the result gets too small
Start over at #2 with a different starting pair
Continue until you decide the result is good enough
I assume you want something like this. You have a matrix like
1100101
1110101
0100101
You want columns 1,2,5,7 and rows 1 and 2, right? That submatrix would be 4x2 with 8 elements. Or you could go with columns 2,5,7 and rows 1,2,3, which would be a 3x3 matrix.
If you want an 'approximate' method, you could start with a single non-zero element, then go on to find another non-zero element and add it to your list of rows and columns. At some point you'll run into a non-zero element whose row and column, if added to your collection, would mean the collection is no longer entirely non-zero.
So for the above matrix, if you added 1,1 and 2,2 you would have rows 1,2 and columns 1,2 in your collection. If you tried to add 3,7 it would cause a problem because 3,1 is zero. So you couldn't add it. You could add 2,5 and 2,7 though, creating the 4x2 submatrix.
You would basically iterate until you can't find any more new rows and columns to add. That would get you to a local maximum. You could store the result and start again with another start point (perhaps one that didn't fit into your current solution).
Then just stop when you can't find any more after a while.
That, obviously, would take a long time, but I don't know if you'll be able to do it any more quickly.
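A minimal Python/numpy sketch of this greedy growing idea (my own variant: it adds whole rows and columns that keep the selection all-nonzero, rather than one element at a time):

import numpy as np

def grow_dense(A, seed):
    """Greedily grow row/column sets around a nonzero seed element,
    adding a row (or column) only if it is nonzero on every selected
    column (row).  Stops at a local maximum."""
    r0, c0 = seed
    rows, cols = {r0}, {c0}
    changed = True
    while changed:
        changed = False
        for r in range(A.shape[0]):
            if r not in rows and A[r, sorted(cols)].all():
                rows.add(r); changed = True
        for c in range(A.shape[1]):
            if c not in cols and A[sorted(rows), c].all():
                cols.add(c); changed = True
    return sorted(rows), sorted(cols)

A = np.array([[1,1,0,0,1,0,1],
              [1,1,1,0,1,0,1],
              [0,1,0,0,1,0,1]])
print(grow_dense(A, (0, 0)))  # ([0, 1], [0, 1, 4, 6]) -> the 4x2 block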
I know you aren't working on this anymore, but I thought someone might have the same question as me in the future.
So, after realizing this is an NP-hard problem (by reduction from MAX-CLIQUE) I decided to come up with a heuristic that has worked well for me so far:
Given an N x M binary/boolean matrix, find a large dense submatrix:
Part I: Generate reasonable candidate submatrices
Consider each of the N rows to be an M-dimensional binary vector, v_i, where i = 1 to N
Compute a distance matrix for the N vectors using the Hamming distance
Use the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm to cluster vectors
Initially, each of the v_i vectors is a singleton cluster. Step 3 above (clustering) gives the order that the vectors should be combined into submatrices. So each internal node in the hierarchical clustering tree is a candidate submatrix.
Part II: Score and rank candidate submatrices
For each candidate submatrix, calculate D, the number of elements left in the dense part of the submatrix after eliminating any column with one or more zeros.
Select the submatrix that maximizes D
I also had some considerations regarding the minimum number of rows that needed to be preserved from the initial full matrix, and I would discard any candidate submatrices that did not meet this criterion before selecting the submatrix with the maximum D value.
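For what it's worth, here is a rough scipy sketch of Parts I and II (my own code, not the answerer's; pdist with the Hamming metric plus average linkage is UPGMA):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

def best_dense_submatrix(A):
    """Part I: UPGMA-cluster the rows by Hamming distance; every internal
    node of the tree is a candidate row subset.  Part II: score each
    candidate by D = (#rows) * (#columns with no zeros), keep the best."""
    n = A.shape[0]
    Z = linkage(pdist(A, metric='hamming'), method='average')
    members = {i: [i] for i in range(n)}          # leaf clusters
    best, best_D = None, -1
    for k, (a, b, _dist, _size) in enumerate(Z):
        rows = members[int(a)] + members[int(b)]  # cluster id is n + k
        members[n + k] = rows
        dense_cols = np.flatnonzero(A[rows].all(axis=0))
        D = len(rows) * len(dense_cols)
        if D > best_D:
            best, best_D = (sorted(rows), dense_cols.tolist()), D
    return best, best_D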
Is this a Netflix problem?
MATLAB or some other sparse matrix libraries might have ways to handle it.
Is your intent to write your own?
Maybe the 1D approach for each row would help you. The algorithm might look like this:
Loop over each row
Find the index of the first non-zero element in the row
Find the index of the last non-zero element, and store both, giving the span between non-zero columns for each row
Sort the rows from largest to smallest span between non-zero columns.
At this point I start getting fuzzy (sorry, not an algorithm designer). I'd try looping over each row, lining up the indexes of the starting point, looking for the maximum non-zero run of column indexes that I could.
You don't specify whether or not the dense matrix has to be square. I'll assume not.
I don't know how efficient this is or what its Big-O behavior would be. But it's a brute force method to start with.
EDIT: This is NOT the same as the problem below. My bad...
But based on the last comment below, it might be equivalent to the following:
Find the furthest vertically separated pair of zero points that have no zero point between them.
Find the furthest horizontally separated pair of zero points that have no zeros between them ?
Then the horizontal region you're looking for is the rectangle that fits between these two pairs of points?
This exact problem is discussed in a gem of a book called "Programming Pearls" by Jon Bentley, and, as I recall, although there is a solution in one dimension, there is no easy answer for the 2-d or higher dimensional variants ...
The 1-D problem is, effectively, to find the largest sum of a contiguous subset of a set of numbers:
iterate through the elements, keeping track of a running total from a specific previous element, and the maximum subtotal seen so far (and the start and end elements that generated it). At each element, if the running subtotal is greater than the max total seen so far, the max seen so far and the end element are reset. If the running total goes below zero, the start element is reset to the current element and the running total is reset to zero.
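In code, that 1-D scan is the standard Kadane algorithm; a sketch (not Bentley's listing):

def max_subarray(xs):
    """Kadane's algorithm: largest sum of a contiguous subarray,
    with the (start, end) indices that produce it."""
    best, best_range = xs[0], (0, 0)
    running, start = 0, 0
    for i, x in enumerate(xs):
        if running < 0:  # a negative prefix can only hurt; restart here
            running, start = 0, i
        running += x
        if running > best:
            best, best_range = running, (start, i)
    return best, best_range

print(max_subarray([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # (6, (3, 6))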
The 2-D problem came from an attempt to build a visual image-processing algorithm, which was attempting to find, within a stream of brightness values representing pixels in a 2-color image, the "brightest" rectangular area within the image; i.e., find the contained 2-D submatrix with the highest sum of brightness values, where "brightness" was measured by the difference between the pixel's brightness value and the overall average brightness of the entire image (so many elements had negative values).
EDIT: To look up the 1-D solution I dredged up my copy of the 2nd edition of this book, and in it, Jon Bentley says "The 2-D version remains unsolved as this edition goes to print..." which was in 1999.
