Find smallest sum of values in matrix using row index and colum index once - algorithm

So I want to find the smallest values in a matrix in the following way.
[[ 1000. 930. 940. 740.]
[ 1000. 1000. 990. 670.]
M1= [ 1000. 1000. 1000. 680.]
[ 1000. 1000. 1000. 1000.]]
The sum of 2 matrix values should be chosen in such a way that the indexes are used once 0,1,2,3. But also the sum of matrix values should be minimized.
So in this case the solution would be M1[2][3] and M1[0][1].
Incorrect would be M1[2][3] and M1[1][3], which hase a lower sum but is does not contain unique index numbers.
The solution should work for NxN matrices, N is even. So for 8x8 matrix, i want to find 4 elements. So that the index Numbers. 0,1,2,3,4,5,6,7 are uses once. So four matrix values.
Another constraint is that the matrix contains only values of intrest in the upper trangle matrix. So were the matrix elements are 1000, these elements can be ignored in finding the minimum sum.
I have tried to alter the Hungarian algorithm, but this was not successful.
Does anybody know of an algorithm that does what I want? Maybe a python package wich I can abuse
Or has a smart solution which would help, I have to do this matrix with about 200X200 elements max.

I will say a solution that is probably not the fastest but it may work.
You can build a graph this way:
the graph will contain (N×N+1) vertexes, which represent the indexes of the matrix and a new one, which will be the source
the source will be connected to all other vertexes with a distance equivalent to the value of the index each of them represents.
then you must connect each vertex (except the source) to every other vertex that is possible to go to (for example, M1[1][2] can go to M1[0][3] but not to M1[1][3]). The distance from any vertex to a vertex V will correspond to the value of V in the matrix.
after you build this graph, you should walk on it K steps (being K the number of possible matrix' indexes you will consider, for example, 2 in a 4x4 matrix like your example).
For each step you take, you store in a stack and in 2 hashes the last position you were (the first to store all rows already used, the second to store all columns already used) and you mark the vertex you get into.
Always you get into a vertex, you should check if is possible to stay in it by using the hashes (theoretically O(1) checking), and if is possible, you add that value to the current sum, otherwise you go to the previous position (stored in the stack) and remove the weight you added when you went into the current vertex.
You should also store a global variable and always you walk K steps, you check if the current sum is smaller than the global sum, and if it is, you change it.
After you walk all possible ways, the global sum will be your answer.
Hope this helps :)

Related

Algorithm challenge to merge sets

Given n sets of numbers. Each set contains some numbers from 1 to 100. How to select sets to merge into the longest set under a special rule, only two non-overlapping sets can merge. [1,2,3] can merge with [4,5] but not [3,4]. What will be an efficient algorithm to merge into the longest set.
My first attempt is to form an n by n matrix. Each row/column represents a set. Entry(i,j) equals to 1 if two sets overlap, entry(i,i) stores the length of set i. Then the questions becomes can we perform row and column operations at the same time to create a diagonal sub-matrix on top left corner whose trace is as large as possible.
However, I got stuck in how to efficiently perform row and column operations to form such a diagonal sub-matrix on top left corner.
As already pointed out in the comments (maximum coverage problem) you have a NP-hart problem. Luckily, matlab offers solvers for integer linear programming.
So we try to reduce the problem to the form:
min f*x subject to Ax<=b , 0<=x
There are n sets, we can encode a set as a vector of 0s and 1s. For example (1,1,1,0,0,...) would represent {1,2,3} and (0,0,1,1,0,0...) - {3,4}.
Every column of A represents a set. A(i,j)=1 means that the i-th element is in the j-th set, A(i,j)=0 means that the i-th element is not in the j-th set.
Now, x represents the sets we select: if x_j=1 than the set j is selected, if x_j=0 - than not selected!
As every element must be at most in one set, we choose b=(1, 1, 1, ..., 1): If we take two sets which contain the i-th element, than the i-th element of (Ax) would be at least 2.
The only question left is what is f? We try to maximize the number of elements in the union, so we choose f_j=-|set_j| (minus owing to min<->max conversion), with |set_j| - number of elements in the j-th set.
Putting it all in matlab we get:
f=-sum(A)
xopt=intlinprog(f.',1:n,A,ones(m,1),[],[],zeros(n,1),ones(n,1))
f.' - cost function as column
1:n - all n elements of x are integers
A - encodes n sets
ones(m,1) - b=(1,1,1...), there are m=100 elements
[],[] - there are no constrains of the form Aeq*x=beq
zeros(n,1), 0<=x must hold
ones(n,1), x<=1 follows already from others constrains, but maybe it will help the solver a little bit
You can represent sets as bit fields. A bitwise and operation yielding zero would indicate non-overlapping sets. Depending on the width of the underlying data type, you may need to perform multiple and operations. For example, with a 64 bit machine word size, I'd need two words to cover 1 to 100 as a bit field.

how to apply dynamic programming in finding the minimum cost to create the tower in a field

you are given an N X M rectangular field with bottom left point at the origin. You have to construct a tower with square base in the field. There are trees in the field with associated cost to uproot them. So you have to minimize the number of trees uprooted to minimize the cost of constructing the tower.
Example Input:
N = 4
M = 3
Lenght of side of Tower = 1
Number of Trees in the field = 4
1 3 5
3 3 4
2 2 1
2 1 2
The 4 rows in the Input are the coordinates of the tree with cost for uprooting as the third integer.
Tree coinciding with the edge of the tower is considered as placed inside the tower and have to be uprooted as well.
I'm facing problem in formulating the Dynamic Programming relation for this problem
thanks
It sounds like your problem boils down to: find the KxK subblock of an MxN matrix with the smallest sum. You can solve this problem efficiently (proportional to the size of your input) by using an integral transform. Of course, this doesn't necessarily help you with your dynamic programming issue -- I'm not sure this solution is equivalent to any dynamic programming formulation....
At any rate, for each index pair (a,b) of your original matrix M, compute an "integral transform" matrix I[a,b] = sum[i<=a, j<=b](M[i,j]). This is computable by traversing the matrix in order, referring to the value computed from the previous row/column. (with a bit of thought, you can also do this efficiently with a sparse matrix)
Then, you can compute the sum of any subblock (a1..a2, b1..b2) in constant time as I[a2,b2] - I[a1-1,b2] - I[a2,b1-1] + I[a1-1,b1-1]. Iterating through all KxK subblocks to find the smallest sum will then take time proportional to the size of your original matrix also.
Since the original problem is phrased as a list of integral coordinates (and, presumably, expects the tower location to be output as an integral coordinate pair), you likely do need to represent your field as a sparse matrix for an efficient solution -- this involves sorting your trees' coordinates in lexicographic order (e.g. first by x-coordinate, then by y-coordinate). Note that this sorting step may take O(L log L) for input of size L, dominating the following steps, which take only O(L) in the size of the input.
Also note that, due to the problem specifying that "trees coinciding with the edge of the tower are uprooted...", a tower with edge length K actually corresponds to an (K+1)x(K+1) subblock.

How to find minimal subrectangle with at least K 1's in a 0-1 matrix

I have encountered an inoridinary problem that given a NxM 0-1 matrix and a number K(<=NxM) and I have to find a minimal subrectangle area of that 0-1 matrix with at least K 1's in inside that subrectangle. Furthermore it's area(the product of both dimensions) should be minimized.
For example:
00000
01010
00100
01010
00000
K = 3
So I can find a subrectangle with minimal area 6 that contains 3 1's inside.
10
01
10
NOTE that the target subrectangle that I mean should contains consecutive numbers of rows and columns from the original 0-1 matrix.
Compute cumulative sum of rows R[i,j] and columns C[i,j].
For top-left corner (i,j) of each possible sub-rectangle:
Starting from a single-row sub-rectangle (n=i),
Search the last possible column for this sub-rectangle (m).
While m>=j:
While there are more than 'k' "ones" in this sub-rectangle:
If this is the smallest sub-rectangle so far, remember it.
Remove column (--m).
This decreases the number of "ones" by C[m+1,n]-C[m+1,j-1].
Add next row (++n).
This increases the number of "ones" by R[m,n]-R[i-1,n].
Time complexity is O(NM(N+M)).
Two nested loops may be optimized by changing linear search to binary search (to process skinny sub-rectangles faster).
Also it is possible (after adding a row/a column to the sub-rectangle) to decrease in O(1) time the number of columns/rows in such a way that the area of this sub-rectangle is not larger than the area of the best-so-far sub-rectangle.
Both these optimizations require calculation of any sub-rectangle weight in O(1). To make it possible, pre-calculate cumulative sum of all elements for sub-rectangles [1..i,1..j] (X[i,j]). Then the weight of any sub-rectangle [i..m,j..n] is computed as X[m,n]-X[i-1,n]-X[m,j-1]+X[i-1,j-1].
Compute cumulative sum of columns C[i,j].
For any starting row (k) of possible sub-rectangle:
For any ending row (l) of possible sub-rectangle:
Starting column (m = 1).
Ending column (n = 1).
While n is not out-of-bounds
While there are less than 'k' "ones" in sub-rectangle [k..l,m..n]:
Add column (++n).
This increases the number of "ones" by C[l,n]-C[k-1,n].
If this is the smallest sub-rectangle so far, remember it.
Remove column (++m).
This decreases the number of "ones" by C[l,m-1]-C[k-1,m-1].
Time complexity is O(N2M).
Loop by 'l' may be terminated when all sub-rectangles, processed inside it, are single-column sub-rectangles (too many rows) or when no sub-rectangles, processed inside it, contain enough "ones" (not enough rows).
The problem is NP-hard because the clique decision problem can be reduced to it. So there is no algorithm that is more efficient than the brute-force approach of trying all the possible submatrices (unless P=NP).
The clique decision problem can be reduced to your problem in the following way:
Let the matrix be the adjacency matrix of the graph.
Set K=L^2, where L is the size of the clique we are looking for.
Solve your problem on this input. The graph contains an L-clique iff the solution to your problem is an LxL submatrix containing only ones (which can be checked in polynomial time).
Off the top of my head, you can make a list of the coordinate pairs(?) of all ones in the matrix, find the (smallest) containing sub-rectangles for each K-combination among them*, then pick the smallest of those.
* which is defined by the smallest and largest row and column indices in the K-combination.

From an interview: Removing rows and columns in an n×n matrix to maximize the sum of remaining values

Given an n×n matrix of real numbers. You are allowed to erase any number (from 0 to n) of rows and any number (from 0 to n) of columns, and after that the sum of the remaining entries is computed. Come up with an algorithm which finds out which rows and columns to erase in order to maximize that sum.
The problem is NP-hard. (So you should not expect a polynomial-time algorithm for solving this problem. There could still be (non-polynomial time) algorithms that are slightly better than brute-force, though.) The idea behind the proof of NP-hardness is that if we could solve this problem, then we could solve the the clique problem in a general graph. (The maximum-clique problem is to find the largest set of pairwise connected vertices in a graph.)
Specifically, given any graph with n vertices, let's form the matrix A with entries a[i][j] as follows:
a[i][j] = 1 for i == j (the diagonal entries)
a[i][j] = 0 if the edge (i,j) is present in the graph (and i≠j)
a[i][j] = -n-1 if the edge (i,j) is not present in the graph.
Now suppose we solve the problem of removing some rows and columns (or equivalently, keeping some rows and columns) so that the sum of the entries in the matrix is maximized. Then the answer gives the maximum clique in the graph:
Claim: In any optimal solution, there is no row i and column j kept for which the edge (i,j) is not present in the graph. Proof: Since a[i][j] = -n-1 and the sum of all the positive entries is at most n, picking (i,j) would lead to a negative sum. (Note that deleting all rows and columns would give a better sum, of 0.)
Claim: In (some) optimal solution, the set of rows and columns kept is the same. This is because starting with any optimal solution, we can simply remove all rows i for which column i has not been kept, and vice-versa. Note that since the only positive entries are the diagonal ones, we do not decrease the sum (and by the previous claim, we do not increase it either).
All of which means that if the graph has a maximum clique of size k, then our matrix problem has a solution with sum k, and vice-versa. Therefore, if we could solve our initial problem in polynomial time, then the clique problem would also be solved in polynomial time. This proves that the initial problem is NP-hard. (Actually, it is easy to see that the decision version of the initial problem — is there a way of removing some rows and columns so that the sum is at least k — is in NP, so the (decision version of the) initial problem is actually NP-complete.)
Well the brute force method goes something like this:
For n rows there are 2n subsets.
For n columns there are 2n subsets.
For an n x n matrix there are 22n subsets.
0 elements is a valid subset but obviously if you have 0 rows or 0 columns the total is 0 so there are really 22n-2+1 subsets but that's no different.
So you can work out each combination by brute force as an O(an) algorithm. Fast. :)
It would be quicker to work out what the maximum possible value is and you do that by adding up all the positive numbers in the grid. If those numbers happen to form a valid sub-matrix (meaning you can create that set by removing rows and/or columns) then there's your answer.
Implicit in this is that if none of the numbers are negative then the complete matrix is, by definition, the answer.
Also, knowing what the highest possible maximum is possibly allows you to shortcut the brute force evaluation since if you get any combination equal to that maximum then that is your answer and you can stop checking.
Also if all the numbers are non-positive, the answer is the maximum value as you can reduce the matrix to a 1 x 1 matrix with that 1 value in it, by definition.
Here's an idea: construct 2n-1 n x m matrices where 1 <= m <= n. Process them one after the other. For each n x m matrix you can calculate:
The highest possible maximum sum (as per above); and
Whether no numbers are positive allowing you to shortcut the answer.
if (1) is below the currently calculate highest maximum sum then you can discard this n x m matrix. If (2) is true then you just need a simple comparison to the current highest maximum sum.
This is generally referred to as a pruning technique.
What's more you can start by saying that the highest number in the n x n matrix is the starting highest maximum sum since obviously it can be a 1 x 1 matrix.
I'm sure you could tweak this into a (slightly more) efficient recursive tree-based search algorithm with the above tests effectively allowing you to eliminate (hopefully many) unnecessary searches.
We can improve on Cletus's generalized brute-force solution by modelling this as a directed graph. The initial matrix is the start node of the graph; its leaves are all the matrices missing one row or column, and so forth. It's a graph rather than a tree, because the node for the matrix without both the first column and row will have two parents - the nodes with just the first column or row missing.
We can optimize our solution by turning the graph into a tree: There's never any point exploring a submatrix with a column or row deleted that comes before the one we deleted to get to the current node, as that submatrix will be arrived at anyway.
This is still a brute-force search, of course - but we've eliminated the duplicate cases where we remove the same rows in different orders.
Here's an example implementation in Python:
def maximize_sum(m):
frontier = [(m, 0, False)]
best = None
best_score = 0
while frontier:
current, startidx, cols_done = frontier.pop()
score = matrix_sum(current)
if score > best_score or not best:
best = current
best_score = score
w, h = matrix_size(current)
if not cols_done:
for x in range(startidx, w):
frontier.append((delete_column(current, x), x, False))
startidx = 0
for y in range(startidx, h):
frontier.append((delete_row(current, y), y, True))
return best_score, best
And here's the output on 280Z28's example matrix:
>>> m = ((1, 1, 3), (1, -89, 101), (1, 102, -99))
>>> maximize_sum(m)
(106, [(1, 3), (1, 101)])
Since nobody asked for an efficient algorithm, use brute force: generate every possible matrix that can be created by removing rows and/or columns from the original matrix, choose the best one. A slightly more efficent version, which most likely can be proved to still be correct, is to generate only those variants where the removed rows and columns contain at least one negative value.
To try it in a simple way:
We need the valid subset of the set of entries {A00, A01, A02, ..., A0n, A10, ...,Ann} which max. sum.
First compute all subsets (the power set).
A valid subset is a member of the power set that for each two contained entries Aij and A(i+x)(j+y), contains also the elements A(i+x)j and Ai(j+y) (which are the remaining corners of the rectangle spanned by Aij and A(i+x)(j+y)).
Aij ...
. .
. .
... A(i+x)(j+y)
By that you can eliminate the invalid ones from the power set and find the one with the biggest sum in the remaining.
I'm sure it can be improved by improving an algorithm for power set generation in order to generate only valid subsets and by that avoiding step 2 (adjusting the power set).
I think there are some angles of attack that might improve upon brute force.
memoization, since there are many distinct sequences of edits that will arrive at the same submatrix.
dynamic programming. Because the search space of matrices is highly redundant, my intuition is that there would be a DP formulation that can save a lot of repeated work
I think there's a heuristic approach, but I can't quite nail it down:
if there's one negative number, you can either take the matrix as it is, remove the column of the negative number, or remove its row; I don't think any other "moves" result in a higher sum. For two negative numbers, your options are: remove neither, remove one, remove the other, or remove both (where the act of removal is either by axing the row or the column).
Now suppose the matrix has only one positive number and the rest are all <=0. You clearly want to remove everything but the positive entry. For a matrix with only 2 positive entries and the rest <= 0, the options are: do nothing, whittle down to one, whittle down to the other, or whittle down to both (resulting in a 1x2, 2x1, or 2x2 matrix).
In general this last option falls apart (imagine a matrix with 50 positives & 50 negatives), but depending on your data (few negatives or few positives) it could provide a shortcut.
Create an n-by-1 vector RowSums, and an n-by-1 vector ColumnSums. Initialize them to the row and column sums of the original matrix. O(n²)
If any row or column has a negative sum, remove edit: the one with the minimum such and update the sums in the other direction to reflect their new values. O(n)
Stop when no row or column has a sum less than zero.
This is an iterative variation improving on another answer. It operates in O(n²) time, but fails for some cases mentioned in other answers, which is the complexity limit for this problem (there are n² entries in the matrix, and to even find the minimum you have to examine each cell once).
Edit: The following matrix has no negative rows or columns, but is also not maximized, and my algorithm doesn't catch it.
1 1 3 goal 1 3
1 -89 101 ===> 1 101
1 102 -99
The following matrix does have negative rows and columns, but my algorithm selects the wrong ones for removal.
-5 1 -5 goal 1
1 1 1 ===> 1
-10 2 -10 2
mine
===> 1 1 1
Compute the sum of each row and column. This can be done in O(m) (where m = n^2)
While there are rows or columns that sum to negative remove the row or column that has the lowest sum that is less than zero. Then recompute the sum of each row/column.
The general idea is that as long as there is a row or a column that sums to nevative, removing it will result in a greater overall value. You need to remove them one at a time and recompute because in removing that one row/column you are affecting the sums of the other rows/columns and they may or may not have negative sums any more.
This will produce an optimally maximum result. Runtime is O(mn) or O(n^3)
I cannot really produce an algorithm on top of my head, but to me it 'smells' like dynamic programming, if it serves as a start point.
Big Edit: I honestly don't think there's a way to assess a matrix and determine it is maximized, unless it is completely positive.
Maybe it needs to branch, and fathom all elimination paths. You never no when a costly elimination will enable a number of better eliminations later. We can short circuit if it's found the theoretical maximum, but other than any algorithm would have to be able to step forward and back. I've adapted my original solution to achieve this behaviour with recursion.
Double Secret Edit: It would also make great strides to reduce to complexity if each iteration didn't need to find all negative elements. Considering that they don't change much between calls, it makes more sense to just pass their positions to the next iteration.
Takes a matrix, the list of current negative elements in the matrix, and the theoretical maximum of the initial matrix. Returns the matrix's maximum sum and the list of moves required to get there. In my mind move list contains a list of moves denoting the row/column removed from the result of the previous operation.
Ie: r1,r1
Would translate
-1 1 0 1 1 1
-4 1 -4 5 7 1
1 2 4 ===>
5 7 1
Return if sum of matrix is the theoretical maximum
Find the positions of all negative elements unless an empty set was passed in.
Compute sum of matrix and store it along side an empty move list.
For negative each element:
Calculate the sum of that element's row and column.
clone the matrix and eliminate which ever collection has the minimum sum (row/column) from that clone, note that action as a move list.
clone the list of negative elements and remove any that are effected by the action taken in the previous step.
Recursively call this algorithm providing the cloned matrix, the updated negative element list and the theoretical maximum. Append the moves list returned to the move list for the action that produced the matrix passed to the recursive call.
If the returned value of the recursive call is greater than the stored sum, replace it and store the returned move list.
Return the stored sum and move list.
I'm not sure if it's better or worse than the brute force method, but it handles all the test cases now. Even those where the maximum contains negative values.
This is an optimization problem and can be solved approximately by an iterative algorithm based on simulated annealing:
Notation: C is number of columns.
For J iterations:
Look at each column and compute the absolute benefit of toggling it (turn it off if it's currently on or turn it on if it's currently off). That gives you C values, e.g. -3, 1, 4. A greedy deterministic solution would just pick the last action (toggle the last column to get a benefit of 4) because it locally improves the objective. But that might lock us into a local optimum. Instead, we probabilistically pick one of the three actions, with probabilities proportional to the benefits. To do this, transform them into a probability distribution by putting them through a Sigmoid function and normalizing. (Or use exp() instead of sigmoid()?) So for -3, 1, 4 you get 0.05, 0.73, 0.98 from the sigmoid and 0.03, 0.42, 0.56 after normalizing. Now pick the action according to the probability distribution, e.g. toggle the last column with probability 0.56, toggle the second column with probability 0.42, or toggle the first column with the tiny probability 0.03.
Do the same procedure for the rows, resulting in toggling one of the rows.
Iterate for J iterations until convergence.
We may also, in early iterations, make each of these probability distributions more uniform, so that we don't get locked into bad decisions early on. So we'd raise the unnormalized probabilities to a power 1/T, where T is high in early iterations and is slowly decreased until it approaches 0. For example, 0.05, 0.73, 0.98 from above, raised to 1/10 results in 0.74, 0.97, 1.0, which after normalization is 0.27, 0.36, 0.37 (so it's much more uniform than the original 0.05, 0.73, 0.98).
It's clearly NP-Complete (as outlined above). Given this, if I had to propose the best algorithm I could for the problem:
Try some iterations of quadratic integer programming, formulating the problem as: SUM_ij a_ij x_i y_j, with the x_i and y_j variables constrained to be either 0 or 1. For some matrices I think this will find a solution quickly, for the hardest cases it would be no better than brute force (and not much would be).
In parallel (and using most of the CPU), use a approximate search algorithm to generate increasingly better solutions. Simulating Annealing was suggested in another answer, but having done research on similar combinatorial optimisation problems, my experience is that tabu search would find good solutions faster. This is probably close to optimal in terms of wandering between distinct "potentially better" solutions in the shortest time, if you use the trick of incrementally updating the costs of single changes (see my paper "Graph domination, tabu search and the football pool problem").
Use the best solution so far from the second above to steer the first by avoiding searching possibilities that have lower bounds worse than it.
Obviously this isn't guaranteed to find the maximal solution. But, it generally would when this is feasible, and it would provide a very good locally maximal solution otherwise. If someone had a practical situation requiring such optimisation, this is the solution that I'd think would work best.
Stopping at identifying that a problem is likely to be NP-Complete will not look good in a job interview! (Unless the job is in complexity theory, but even then I wouldn't.) You need to suggest good approaches - that is the point of a question like this. To see what you can come up with under pressure, because the real world often requires tackling such things.
yes, it's NP-complete problem.
It's hard to easily find the best sub-matrix,but we can easily to find some better sub-matrix.
Assume that we give m random points in the matrix as "feeds". then let them to automatically extend by the rules like :
if add one new row or column to the feed-matrix, ensure that the sum will be incrementive.
,then we can compare m sub-matrix to find the best one.
Let's say n = 10.
Brute force (all possible sets of rows x all possible sets of columns) takes
2^10 * 2^10 =~ 1,000,000 nodes.
My first approach was to consider this a tree search, and use
the sum of positive entries is an upper bound for every node in the subtree
as a pruning method. Combined with a greedy algorithm to cheaply generate good initial bounds, this yielded answers in about 80,000 nodes on average.
but there is a better way ! i later realised that
Fix some choice of rows X.
Working out the optimal columns for this set of rows is now trivial (keep a column if its sum of its entries in the rows X is positive, otherwise discard it).
So we can just brute force over all possible choices of rows; this takes 2^10 = 1024 nodes.
Adding the pruning method brought this down to 600 nodes on average.
Keeping 'column-sums' and incrementally updating them when traversing the tree of row-sets should allow the calculations (sum of matrix etc) at each node to be O(n) instead of O(n^2). Giving a total complexity of O(n * 2^n)
For slightly less than optimal solution, I think this is a PTIME, PSPACE complexity issue.
The GREEDY algorithm could run as follows:
Load the matrix into memory and compute row totals. After that run the main loop,
1) Delete the smallest row,
2) Subtract the newly omitted values from the old row totals
--> Break when there are no more negative rows.
Point two is a subtle detail: subtracted two rows/columns has time complexity n.
While re-summing all but two columns has n^2 time complexity!
Take each row and each column and compute the sum. For a 2x2 matrix this will be:
2 1
3 -10
Row(0) = 3
Row(1) = -7
Col(0) = 5
Col(1) = -9
Compose a new matrix
Cost to take row Cost to take column
3 5
-7 -9
Take out whatever you need to, then start again.
You just look for negative values on the new matrix. Those are values that actually substract from the overall matrix value. It terminates when there're no more negative "SUMS" values to take out (therefore all columns and rows SUM something to the final result)
In an nxn matrix that would be O(n^2)Log(n) I think
function pruneMatrix(matrix) {
max = -inf;
bestRowBitField = null;
bestColBitField = null;
for(rowBitField=0; rowBitField<2^matrix.height; rowBitField++) {
for (colBitField=0; colBitField<2^matrix.width; colBitField++) {
sum = calcSum(matrix, rowBitField, colBitField);
if (sum > max) {
max = sum;
bestRowBitField = rowBitField;
bestColBitField = colBitField;
}
}
}
return removeFieldsFromMatrix(bestRowBitField, bestColBitField);
}
function calcSumForCombination(matrix, rowBitField, colBitField) {
sum = 0;
for(i=0; i<matrix.height; i++) {
for(j=0; j<matrix.width; j++) {
if (rowBitField & 1<<i && colBitField & 1<<j) {
sum += matrix[i][j];
}
}
}
return sum;
}

Find the "largest" dense sub matrix in a large sparse matrix

Given a large sparse matrix (say 10k+ by 1M+) I need to find a subset, not necessarily continuous, of the rows and columns that form a dense matrix (all non-zero elements). I want this sub matrix to be as large as possible (not the largest sum, but the largest number of elements) within some aspect ratio constraints.
Are there any known exact or aproxamate solutions to this problem?
A quick scan on Google seems to give a lot of close-but-not-exactly results. What terms should I be looking for?
edit: Just to clarify; the sub matrix need not be continuous. In fact the row and column order is completely arbitrary so adjacency is completely irrelevant.
A thought based on Chad Okere's idea
Order the rows from largest count to smallest count (not necessary but might help perf)
Select two rows that have a "large" overlap
Add all other rows that won't reduce the overlap
Record that set
Add whatever row reduces the overlap by the least
Repeat at #3 until the result gets to small
Start over at #2 with a different starting pair
Continue until you decide the result is good enough
I assume you want something like this. You have a matrix like
1100101
1110101
0100101
You want columns 1,2,5,7 and rows 1 and 2, right? That submatrix would 4x2 with 8 elements. Or you could go with columns 1,5,7 with rows 1,2,3 which would be a 3x3 matrix.
If you want an 'approximate' method, you could start with a single non-zero element, then go on to find another non-zero element and add it to your list of rows and columns. At some point you'll run into a non-zero element that, if it's rows and columns were added to your collection, your collection would no longer be entirely non-zero.
So for the above matrix, if you added 1,1 and 2,2 you would have rows 1,2 and columns 1,2 in your collection. If you tried to add 3,7 it would cause a problem because 1,3 is zero. So you couldn't add it. You could add 2,5 and 2,7 though. Creating the 4x2 submatrix.
You would basically iterate until you can't find any more new rows and columns to add. That would get you too a local minimum. You could store the result and start again with another start point (perhaps one that didn't fit into your current solution).
Then just stop when you can't find any more after a while.
That, obviously, would take a long time, but I don't know if you'll be able to do it any more quickly.
I know you aren't working on this anymore, but I thought someone might have the same question as me in the future.
So, after realizing this is an NP-hard problem (by reduction to MAX-CLIQUE) I decided to come up with a heuristic that has worked well for me so far:
Given an N x M binary/boolean matrix, find a large dense submatrix:
Part I: Generate reasonable candidate submatrices
Consider each of the N rows to be a M-dimensional binary vector, v_i, where i=1 to N
Compute a distance matrix for the N vectors using the Hamming distance
Use the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm to cluster vectors
Initially, each of the v_i vectors is a singleton cluster. Step 3 above (clustering) gives the order that the vectors should be combined into submatrices. So each internal node in the hierarchical clustering tree is a candidate submatrix.
Part II: Score and rank candidate submatrices
For each submatrix, calculate D, the number of elements in the dense subset of the vectors for the submatrix by eliminating any column with one or more zeros.
Select the submatrix that maximizes D
I also had some considerations regarding the min number of rows that needed to be preserved from the initial full matrix, and I would discard any candidate submatrices that did not meet this criteria before selecting a submatrix with max D value.
Is this a Netflix problem?
MATLAB or some other sparse matrix libraries might have ways to handle it.
Is your intent to write your own?
Maybe the 1D approach for each row would help you. The algorithm might look like this:
Loop over each row
Find the index of the first non-zero element
Find the index of the non-zero row element with the largest span between non-zero columns in each row and store both.
Sort the rows from largest to smallest span between non-zero columns.
At this point I start getting fuzzy (sorry, not an algorithm designer). I'd try looping over each row, lining up the indexes of the starting point, looking for the maximum non-zero run of column indexes that I could.
You don't specify whether or not the dense matrix has to be square. I'll assume not.
I don't know how efficient this is or what its Big-O behavior would be. But it's a brute force method to start with.
EDIT. This is NOT the same as the problem below.. My bad...
But based on the last comment below, it might be equivilent to the following:
Find the furthest vertically separated pair of zero points that have no zero point between them.
Find the furthest horizontally separated pair of zero points that have no zeros between them ?
Then the horizontal region you're looking for is the rectangle that fits between these two pairs of points?
This exact problem is discussed in a gem of a book called "Programming Pearls" by Jon Bentley, and, as I recall, although there is a solution in one dimension, there is no easy answer for the 2-d or higher dimensional variants ...
The 1=D problem is, effectively, find the largest sum of a contiguous subset of a set of numbers:
iterate through the elements, keeping track of a running total from a specific previous element, and the maximum subtotal seen so far (and the start and end elemnt that generateds it)... At each element, if the maxrunning subtotal is greater than the max total seen so far, the max seen so far and endelemnt are reset... If the max running total goes below zero, the start element is reset to the current element and the running total is reset to zero ...
The 2-D problem came from an attempt to generate a visual image processing algorithm, which was attempting to find, within a stream of brightnesss values representing pixels in a 2-color image, find the "brightest" rectangular area within the image. i.e., find the contained 2-D sub-matrix with the highest sum of brightness values, where "Brightness" was measured by the difference between the pixel's brighness value and the overall average brightness of the entire image (so many elements had negative values)
EDIT: To look up the 1-D solution I dredged up my copy of the 2nd edition of this book, and in it, Jon Bentley says "The 2-D version remains unsolved as this edition goes to print..." which was in 1999.

Resources