Find the "largest" dense sub matrix in a large sparse matrix - algorithm

Given a large sparse matrix (say 10k+ by 1M+), I need to find a subset, not necessarily contiguous, of the rows and columns that forms a dense matrix (all non-zero elements). I want this submatrix to be as large as possible (not the largest sum, but the largest number of elements) within some aspect ratio constraints.
Are there any known exact or approximate solutions to this problem?
A quick scan on Google seems to give a lot of close-but-not-exactly results. What terms should I be looking for?
edit: Just to clarify: the submatrix need not be contiguous. In fact the row and column order is completely arbitrary, so adjacency is irrelevant.
A thought based on Chad Okere's idea
1. Order the rows from largest count to smallest count (not necessary but might help perf)
2. Select two rows that have a "large" overlap
3. Add all other rows that won't reduce the overlap
4. Record that set
5. Add whatever row reduces the overlap by the least
6. Repeat at #3 until the result gets too small
7. Start over at #2 with a different starting pair
8. Continue until you decide the result is good enough
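For illustration, a rough Python sketch of the steps above, assuming the matrix is given as a list rows of per-row sets of non-zero column indices; the names (rows, min_cols) and the exhaustive loop over starting pairs are my simplifications, not part of the original idea. It returns (element count, (row set, column set)).

from itertools import combinations

def dense_submatrix_candidates(rows, min_cols=2):
    # rows: list of sets, rows[i] = column indices where row i is non-zero
    order = sorted(range(len(rows)), key=lambda i: -len(rows[i]))  # step 1
    best = (0, None)
    # step 2: here every pair is tried, which is only feasible for modest N;
    # in practice you would sample or restrict the starting pairs
    for a, b in combinations(order, 2):
        overlap = rows[a] & rows[b]
        if len(overlap) < min_cols:
            continue
        chosen = {a, b}
        remaining = [i for i in order if i not in chosen]
        while len(overlap) >= min_cols:
            # step 3: add every row that keeps the overlap unchanged
            for i in list(remaining):
                if overlap <= rows[i]:
                    chosen.add(i)
                    remaining.remove(i)
            # step 4: record this candidate
            size = len(chosen) * len(overlap)
            if size > best[0]:
                best = (size, (set(chosen), set(overlap)))
            if not remaining:
                break
            # step 5: add the row that shrinks the overlap the least
            i = max(remaining, key=lambda r: len(overlap & rows[r]))
            overlap &= rows[i]
            chosen.add(i)
            remaining.remove(i)
    return best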

I assume you want something like this. You have a matrix like
1100101
1110101
0100101
You want columns 1,2,5,7 and rows 1 and 2, right? That submatrix would be 4x2 with 8 elements. Or you could go with columns 2,5,7 and rows 1,2,3, which would be a 3x3 matrix.
If you want an 'approximate' method, you could start with a single non-zero element, then go on to find another non-zero element and add it to your list of rows and columns. At some point you'll run into a non-zero element that, if its rows and columns were added to your collection, your collection would no longer be entirely non-zero.
So for the above matrix, if you added 1,1 and 2,2 you would have rows 1,2 and columns 1,2 in your collection. If you tried to add 3,7 it would cause a problem because 3,1 is zero, so you couldn't add it. You could add 2,5 and 2,7 though, creating the 4x2 submatrix.
You would basically iterate until you can't find any more new rows and columns to add. That would get you to a local optimum. You could store the result and start again with another start point (perhaps one that didn't fit into your current solution).
Then just stop when you can't find any more after a while.
That, obviously, would take a long time, but I don't know if you'll be able to do it any more quickly.
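A small Python sketch of that grow-from-a-seed idea, assuming a dict nonzero mapping each row index to its set of non-zero column indices (the function and argument names are made up for illustration):

def grow_from_seed(nonzero, seed_row, seed_col):
    # nonzero: dict mapping row index -> set of non-zero column indices
    rows, cols = {seed_row}, {seed_col}
    changed = True
    while changed:
        changed = False
        for r, rcols in nonzero.items():
            for c in rcols:
                if r in rows and c in cols:
                    continue
                new_rows, new_cols = rows | {r}, cols | {c}
                # accept element (r, c) only if the enlarged block stays all non-zero
                if all(new_cols <= nonzero[x] for x in new_rows):
                    rows, cols = new_rows, new_cols
                    changed = True
    return rows, cols

Starting from different seeds and keeping the best len(rows) * len(cols) gives the restart loop described above.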

I know you aren't working on this anymore, but I thought someone might have the same question as me in the future.
So, after realizing this is an NP-hard problem (by reduction from MAX-CLIQUE), I decided to come up with a heuristic that has worked well for me so far:
Given an N x M binary/boolean matrix, find a large dense submatrix:
Part I: Generate reasonable candidate submatrices
1. Consider each of the N rows to be an M-dimensional binary vector, v_i, where i = 1 to N
2. Compute a distance matrix for the N vectors using the Hamming distance
3. Use the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm to cluster the vectors
Initially, each of the v_i vectors is a singleton cluster. Step 3 above (clustering) gives the order that the vectors should be combined into submatrices. So each internal node in the hierarchical clustering tree is a candidate submatrix.
Part II: Score and rank candidate submatrices
1. For each candidate submatrix, eliminate any column with one or more zeros and calculate D, the number of elements in the remaining dense submatrix
2. Select the submatrix that maximizes D
I also had some considerations regarding the minimum number of rows that needed to be preserved from the initial full matrix, and I would discard any candidate submatrix that did not meet this criterion before selecting the submatrix with the maximum D value.
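A sketch of this heuristic with SciPy (pdist with the Hamming metric, and linkage with method='average', which is UPGMA); min_rows stands in for the minimum-row constraint mentioned above and is not part of the original description:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

def best_dense_submatrix(B, min_rows=2):
    # B: N x M binary numpy array
    n = B.shape[0]
    Z = linkage(pdist(B, metric='hamming'), method='average')  # UPGMA
    members = {i: [i] for i in range(n)}            # leaf clusters
    best_score, best_rows, best_cols = 0, None, None
    for k, (a, b, _, _) in enumerate(Z):
        rows = members[int(a)] + members[int(b)]    # internal node = candidate
        members[n + k] = rows
        if len(rows) < min_rows:
            continue
        cols = np.flatnonzero(B[rows].all(axis=0))  # drop columns containing zeros
        score = len(rows) * len(cols)               # D = dense element count
        if score > best_score:
            best_score, best_rows, best_cols = score, rows, cols
    return best_rows, best_cols, best_score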

Is this a Netflix problem?
MATLAB or some other sparse matrix libraries might have ways to handle it.
Is your intent to write your own?
Maybe the 1D approach for each row would help you. The algorithm might look like this:
Loop over each row
Find the index of the first non-zero element
Find the index of the last non-zero element, and store both indices (their difference is that row's span between non-zero columns).
Sort the rows from largest to smallest span between non-zero columns.
At this point I start getting fuzzy (sorry, not an algorithm designer). I'd try looping over each row, lining up the indexes of the starting point, looking for the maximum non-zero run of column indexes that I could.
You don't specify whether or not the dense matrix has to be square. I'll assume not.
I don't know how efficient this is or what its Big-O behavior would be. But it's a brute force method to start with.
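A minimal sketch of the per-row bookkeeping described in the list above, assuming each row is given as a sorted list of its non-zero column indices (the name rows_by_span is mine):

def rows_by_span(rows):
    # rows: list of sorted lists of non-zero column indices, one per row
    info = []
    for r, cols in enumerate(rows):
        if not cols:
            continue
        first, last = cols[0], cols[-1]
        info.append((last - first, first, last, r))
    # largest span between first and last non-zero column comes first
    return sorted(info, reverse=True)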

EDIT: This is NOT the same as the problem below. My bad...
But based on the last comment below, it might be equivalent to the following:
Find the furthest vertically separated pair of zero points that have no zero point between them.
Find the furthest horizontally separated pair of zero points that have no zeros between them?
Then the region you're looking for is the rectangle that fits between these two pairs of points?
This exact problem is discussed in a gem of a book called "Programming Pearls" by Jon Bentley, and, as I recall, although there is a solution in one dimension, there is no easy answer for the 2-d or higher dimensional variants ...
The 1-D problem is, effectively: find the largest sum of a contiguous subsequence of a sequence of numbers:
iterate through the elements, keeping track of a running total from a specific previous element, and of the maximum subtotal seen so far (along with the start and end elements that generated it). At each element, if the running subtotal is greater than the max total seen so far, the max seen so far and its end element are updated. If the running total goes below zero, the start element is reset to the current element and the running total is reset to zero.
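That 1-D scan is the classic maximum-subarray (Kadane) algorithm; for reference, a short Python version of the scan just described:

def max_subarray(xs):
    # returns (best sum, start index, end index inclusive)
    best, best_range = xs[0], (0, 0)
    running, start = 0, 0
    for i, x in enumerate(xs):
        if running < 0:          # a negative running total can't help: restart here
            running, start = 0, i
        running += x
        if running > best:
            best, best_range = running, (start, i)
    return best, best_range[0], best_range[1]

For example, max_subarray([-2, 1, -3, 4, -1, 2, 1, -5, 4]) returns (6, 3, 6).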
The 2-D problem came from an attempt at an image processing algorithm which tried to find, within a stream of brightness values representing pixels in a 2-color image, the "brightest" rectangular area within the image, i.e., the contained 2-D submatrix with the highest sum of brightness values, where "brightness" was measured by the difference between the pixel's brightness value and the overall average brightness of the entire image (so many elements had negative values).
EDIT: To look up the 1-D solution I dredged up my copy of the 2nd edition of this book, and in it, Jon Bentley says "The 2-D version remains unsolved as this edition goes to print..." which was in 1999.

Related

Find smallest sum of values in matrix using row index and column index once

So I want to find the smallest values in a matrix in the following way.
M1 = [[ 1000.  930.  940.  740.]
      [ 1000. 1000.  990.  670.]
      [ 1000. 1000. 1000.  680.]
      [ 1000. 1000. 1000. 1000.]]
Two matrix values should be chosen in such a way that each of the indexes 0,1,2,3 is used only once, while the sum of the chosen matrix values is minimized.
So in this case the solution would be M1[2][3] and M1[0][1].
Incorrect would be M1[2][3] and M1[1][3], which has a lower sum but does not contain unique index numbers.
The solution should work for NxN matrices, where N is even. So for an 8x8 matrix I want to find 4 elements, such that each of the index numbers 0,1,2,3,4,5,6,7 is used once.
Another constraint is that the matrix contains values of interest only in the upper triangle. So where the matrix elements are 1000, these elements can be ignored in finding the minimum sum.
I have tried to alter the Hungarian algorithm, but this was not successful.
Does anybody know of an algorithm that does what I want? Maybe a Python package which I can abuse, or a smart solution that would help. I have to do this for matrices of about 200x200 elements max.
I will suggest a solution that is probably not the fastest, but it may work.
You can build a graph this way:
the graph will contain N×N+1 vertices: one for each index of the matrix, plus a new one, which will be the source
the source will be connected to every other vertex, with a distance equal to the matrix value at the index that vertex represents.
then you must connect each vertex (except the source) to every other vertex it is possible to go to (for example, M1[1][2] can go to M1[0][3] but not to M1[1][3]). The distance from any vertex to a vertex V corresponds to the value of V in the matrix.
after you build this graph, you should walk K steps on it (K being the number of matrix values you will pick, for example 2 in a 4x4 matrix like your example).
For each step you take, you store the last position you were at in a stack and in 2 hashes (the first storing all rows already used, the second storing all columns already used), and you mark the vertex you enter.
Whenever you enter a vertex, you check whether it is possible to stay in it by using the hashes (theoretically O(1) checking); if it is possible, you add that value to the current sum, otherwise you go back to the previous position (stored in the stack) and remove the weight you added when you entered the current vertex.
You should also keep a global minimum, and whenever you have walked K steps you check whether the current sum is smaller than the global minimum; if it is, you update it.
After you have walked all possible ways, the global minimum is your answer.
Hope this helps :)
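A rough Python sketch of the exhaustive search just described, done with recursion instead of an explicit stack; used is the set of already-consumed indices and IGNORE marks the 1000 placeholder cells (both names are mine):

import math

IGNORE = 1000.0  # placeholder for cells that are not of interest

def min_unique_index_sum(M):
    n = len(M)
    best = [math.inf]

    def walk(used, total):
        if len(used) == n:                                   # every index used once
            best[0] = min(best[0], total)
            return
        i = next(k for k in range(n) if k not in used)       # smallest free index
        for j in range(i + 1, n):                            # upper triangle only
            if j not in used and M[i][j] != IGNORE:
                walk(used | {i, j}, total + M[i][j])

    walk(frozenset(), 0.0)
    return best[0]

With the M1 above it returns 1610.0 (M1[0][1] + M1[2][3]). This exhaustive walk blows up for 200x200; since the task is a minimum-weight perfect matching, a library routine (e.g. NetworkX's min_weight_matching) would be the practical route at that size.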

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an NxN grid with an array of lamp coordinates. Each lamp provides illumination to every square on its x axis, every square on its y axis, and every square that lies on its diagonals (think of a queen in chess). Given an array of query coordinates, determine whether that point is illuminated or not. The catch is that when checking a query, all lamps adjacent to, or on, that query get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9
It seems like I can get one but not both. I tried to get this down to logarithmic time but I can't seem to find a solution. I can reduce the space complexity but it's not that fast, exponential in fact. Where should I focus on instead, speed or space? Also, if you have any input as to how you would solve this problem please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs lie on by using the first point as the "origin" for both nw-se and ne-sw. The diagonals through this point are both numbered zero. The nw-se diagonal numbers increase per pixel in, e.g., the northeast direction and decrease (going negative) to the southwest. Similarly, the ne-sw diagonals are numbered increasing in, e.g., the northwest direction and decreasing (negative) to the southeast.
Given the origin, it's easy to write constant time functions that go from (x,y) coordinates to the respective diagonal numbers.
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne diag #). You don't need to store these explicitly. Rather you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty, so the query point is lit. If it's only 3 or fewer, it's a constant time operation to see if they're adjacent.
This solution requires the input points to be represented in 4 lists. Since they need to be represented in at least one list anyway, you can argue that this algorithm requires only a constant factor of extra space with respect to the input (i.e. the same sort of cost as mergesort).
Run time is expected constant per query point for 4 hash table lookups.
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient and easiest to run it on one big machine. With a billion lampposts and careful data structure choices, it wouldn't be hard to implement with 24 bytes per lamppost in a language with unboxed structures like C. So a ~32 GB RAM machine ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10-core machine ought to do a billion queries in well less than a minute.
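For illustration, a compact Python sketch of the four-map lookup described above (map and function names are made up; x - y and x + y serve as the two diagonal numbers):

from collections import defaultdict

def build_maps(lamps):
    xMap, yMap, nwSeMap, swNeMap = (defaultdict(list) for _ in range(4))
    for (x, y) in lamps:
        xMap[x].append((x, y))
        yMap[y].append((x, y))
        nwSeMap[x - y].append((x, y))   # one diagonal family
        swNeMap[x + y].append((x, y))   # the other diagonal family
    return xMap, yMap, nwSeMap, swNeMap

def is_lit(query, maps):
    qx, qy = query
    xMap, yMap, nwSeMap, swNeMap = maps
    for lamps_on_line in (xMap[qx], yMap[qy], nwSeMap[qx - qy], swNeMap[qx + qy]):
        # at most 3 cells of any one line are adjacent to (or on) the query,
        # so more than 3 lamps on the line means at least one stays on
        if len(lamps_on_line) > 3:
            return True
        # otherwise, any non-adjacent lamp on the line keeps the query lit
        if any(abs(lx - qx) > 1 or abs(ly - qy) > 1 for lx, ly in lamps_on_line):
            return True
    return False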
There is a very easy answer which works:
Create a grid of NxN.
For each lamp, increment the count of all the cells that are supposed to be illuminated by that lamp.
For each query, check whether the cell at that query has a value > 0.
For each lamp adjacent to (or on) the query, find all the cells it illuminates and reduce their count by 1.
This worked fine, but failed the size limit when trying a 10000 x 10000 grid.
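For reference, the counting part of that approach in Python (the O(N^2) grid is exactly what hits the size limit):

def build_counts(N, lamps):
    # grid[x][y] counts how many lamp rows/columns/diagonals pass through (x, y)
    grid = [[0] * N for _ in range(N)]
    for lx, ly in lamps:
        for i in range(N):
            grid[lx][i] += 1                      # the lamp's row
            grid[i][ly] += 1                      # the lamp's column
            d = i - lx
            if 0 <= ly + d < N:
                grid[i][ly + d] += 1              # one diagonal
            if 0 <= ly - d < N:
                grid[i][ly - d] += 1              # the other diagonal
    return grid

# a query cell is lit if its count is positive; turning off a lamp adjacent
# to the query is the same loop with -= 1
def is_lit(grid, qx, qy):
    return grid[qx][qy] > 0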

Obtaining the minimum number of tails of coins after flipping entire rows or columns multiple times

If coins are placed on a grid and only an entire row or column can be flipped, how can we flip the coins to obtain the minimum number of tails?
I tried using a greedy approach, in which I flip a row or column where the number of tails is greater than the number of heads and repeat the process until no further change is possible. But I found that this approach sometimes does not give me an optimal solution.
HHT
THH
THT
For example, if the coins are placed like the above and I flip the coins in the manner below, the obtained value is 3, but actually the answer is 2.
1. Flip the row 3
HHT
THH
HTH
2. Then there exists no row or column where the number of tails is greater than that of heads.
3. But if I flip column 3 and then column 1, there exists a solution whose value is 2.
THH
HHT
HHH
So, I think the above algorithm doesn't work. What approach and what algorithm should I use?
First let us notice that there is no point in flipping the same row or column twice or more (a better solution always flips each row/column either zero times or once), and the order in which we flip rows or columns is irrelevant, so we can describe a solution as a bit array of length 2N: one bit per row and one bit per column, on if we flip that row/column once, off if we flip it zero times.
So we need to search 2^(2N) possible solutions, preferring solutions with more zeros (fewer flips).
Secondly let us notice that for one solution there are four possible states of a coin:
1. The coin was not flipped (0 flips)
2. The coin was flipped by its row (1 flip)
3. The coin was flipped by its column (1 flip)
4. The coin was flipped by both its row and column (2 flips)
Notice that states 1 and 4 result in the original value of the coin
Also notice that states 2 and 3 result in the opposite of the original value of the coin
Start by expressing the original state of the coins as a binary matrix (B), the 2N-bit field as 2 binary vectors (R, C), the total number of tails as a function f(B, R, C), and the total number of set bits (flips) as a function g(R, C).
So your goal is to minimize f while, secondarily, minimizing g.
Consider: if we first fix our R configuration (which rows we will flip), how can we solve the problem just for C (which columns we will flip)? Put another way, consider the simpler problem of only being allowed to flip columns, not rows. How would you solve this? (hint: DP) Can you extend this strategy back to the full problem now?
Not sure about the complete algorithm, but one thing you should definitely try to exploit here is the large number of symmetries in your problem.
A lot of different coin configurations will actually be equivalent, so you can rotate or mirror your configuration without altering the problem. Most importantly, since you can reverse the whole set by flipping all rows, looking for the minimum number of tails is equivalent to looking for the minimum number of heads.
In your case, starting from
HHT
THH
THT
flipping the middle column gives
HTT
TTH
TTT
and you're done (you then have to flip every row, of course, if what you really need is the minimum-tails configuration rather than the minimum-heads one).
An obvious solution is to try all possibilities of flipping a row or a column. There are O(2^(2N)) such possibilities. However, we can solve the problem in O(N^2 * 2^N) with a combination of greedy + brute force.
Generate all possibilities of flipping the rows (O(2^N)) and for each of these, flip each column that has more tails than heads. Take the solution that gives you the minimum tails.
This should work. I will add more details about why a bit later.
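For reference, a direct Python sketch of this enumerate-row-flips-then-greedy-columns approach (the grid given as a list of 'H'/'T' strings):

from itertools import product

def min_tails(grid):
    n = len(grid)
    # 1 = tails, 0 = heads
    B = [[1 if c == 'T' else 0 for c in row] for row in grid]
    best = n * n
    for row_flips in product((0, 1), repeat=n):        # 2^N row choices
        tails = 0
        for col in range(n):
            # tails in this column after the chosen row flips
            t = sum(B[r][col] ^ row_flips[r] for r in range(n))
            tails += min(t, n - t)                     # flip the column iff it helps
        best = min(best, tails)
    return best

For the example grid above, min_tails(["HHT", "THH", "THT"]) returns 2.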
One approach would be to use http://en.wikipedia.org/wiki/Branch_and_bound, alternately considering new vertical lines and new horizontal lines. There is also some symmetry you can remove - if you flip all the horizontal lines and all the vertical lines, you will end up back where you started, so with branch and bound you might as well arbitrarily assume that the leftmost vertical line is never flipped.
HHT
THH
THT
In this example, if we assume that the leftmost vertical line is not flipped, then if we branch on the lowest horizontal line we know the value of the leftmost lowest coin, so we have two possible partial solutions - one in which that single known coin is fixed at tails, and one in which it is fixed at heads. If we recurse first to try and extend the partial solution in which the single known coin is heads and find that we can extend this to a solution that produces no tails, then we can discard all the partial solutions produced by extending the other, because all its descendants must have at least one tail.
I would next branch on the leftmost but one vertical line, which will give us another known coin, and continue branching alternately horizontally and vertically.
This will be a feasible way of finding an exact solution if there is a nearly perfect solution or if the table is very small. Otherwise you will have to stop it early or have it skip credible solutions to get the problem finished in a reasonable time, and you will probably not get the exact best answer.

Finding the count of cells in a given 2D array satisfying the given constraints

Given a 2-D array starting at (0,0) and proceeding to infinity along the positive x and y axes, and given a number k>0, find the number of cells reachable from (0,0) such that at every moment sum of digits of x + sum of digits of y <= k. Moves can be up, down, left or right, with x,y >= 0. DFS gives answers but is not sufficient for large values of k. Can anyone help me with a better algorithm for this?
I think they asked you to calculate the number of cells (x,y) reachable with k>=x+y. If x=1 for example, then y can take any number between 0 and k-1 and the sum would be <=k. The total number of possibilities can be calculated by
sum(sum(1,y=0..k-x),x=0..k) = 1/2*k²+3/2*k+1
That should be able to do the trick for large k.
I am somewhat confused by the "digits" in your question. The digits make up the index, like three 9s make 999. The sum of digits for the cell (999,888) would be 51. If you allowed the sum of digits to be 10^9, then you could potentially have ~10^8 digits in an index, resulting in something around 10^(10^8) entries, well beyond normal sizes for a table. I am therefore assuming my first interpretation. If that's not correct, could you explain it a bit more?
EDIT:
okay, so my answer is not going to solve it. I'm afraid I don't see a nice formula or answer. I would approach it as a coloring/marking problem and mark all valid cells, then use some other technique to make sure all the parts are connected/to count them.
I have tried to come up with something but it's too messy. Basically I would try and mark large parts at once based on the index and k. If k=20, you can mark the cell range (0,0..299) at once (as any index in that range has an index sum no larger than that of 299) and continue to check the rest of the range. I start with 299 by fixing the 2 last digits to their maximum value and looking for the max value for the first digit. Then continue that process for the remaining hundreds (300-999) and only fix the last digit to end up with 300..389 and 390..398. However, you can already see that it's a mess... (nevertheless I wanted to give it to you; you might get a better idea)
Another thing you can see immediately is that your problem is symmetric in the indices, so any valid cell (x,y) tells you there's another valid cell (y,x). In a marking scheme / DFS / BFS this can be exploited.
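For moderate k, the marking idea can be sketched as a plain BFS in Python (brute force; it does not yet exploit the (x,y)/(y,x) symmetry):

from collections import deque

def digit_sum(n):
    s = 0
    while n:
        s += n % 10
        n //= 10
    return s

def count_reachable(k):
    # BFS from (0, 0) over cells with digit_sum(x) + digit_sum(y) <= k
    ok = lambda x, y: x >= 0 and y >= 0 and digit_sum(x) + digit_sum(y) <= k
    seen = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (nx, ny) not in seen and ok(nx, ny):
                seen.add((nx, ny))
                queue.append((nx, ny))
    return len(seen)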

Get most unique text from a group of text

I have a number of texts, for example 100.
I would like to keep the 10 most unique among them. I made a 100x100 matrix where I compared each text against the others with the Levenshtein algorithm.
Is there an algorithm to select the 10 most unique?
EDIT:
What I want is the N most unique texts: the N texts that maximize the distances among them, regardless of the 1st element of my set.
I want the most unique because I will publish these texts to the web and I want to avoid near duplicates.
A long comment rather than an answer ...
I don't think you've specified your requirement(s) clearly enough. How do you select the 1st element of your set of 10 strings? Is it the string with the largest distance from any other string (in which case you are looking for the largest element in your array), or the one with the largest distance from all the other strings (in which case you are looking for the largest row- or column-sum in the array)?
Moving on to the N (or 10 as you suggest) most distant strings, you have a number of choices.
You could select the N largest distances in the array. I suspect, not having seen your data, that it is likely that the string which is furthest from any other string may also be furthest away from several other strings too -- I mean you may find that several of the N largest entries in your array occur in the same row or column.
You could simply select the N strings with the largest row sums.
Or perhaps you are looking for a cluster of N strings which maximises the distance between all the strings in that cluster and all the strings in the remaining 100-N strings. This might lead you towards looking at, rather obviously, clustering algorithms.
I suggest you clarify your requirements and edit your question.
Since this looks like an eigenvalue problem, I would try to execute the power iteration on the matrix, and reject the 90 highest values from the resulting vector. The power iteration normally converges very fast, within about ten iterations. BTW: this solution assumes a similarity matrix. If the entries of your matrix are a measure of dissimilarity ("distance"), you might need to use their inverses instead.
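For illustration, a NumPy sketch of that suggestion, assuming S is a symmetric, non-negative similarity matrix and we keep the 10 texts with the smallest components in the dominant eigenvector (names are mine):

import numpy as np

def pick_most_unique(S, keep=10, iters=50):
    # S: n x n similarity matrix (symmetric, non-negative)
    n = S.shape[0]
    v = np.ones(n) / n
    for _ in range(iters):                 # power iteration
        v = S @ v
        v /= np.linalg.norm(v)
    # small component = weakly similar to everything else = "most unique"
    return np.argsort(v)[:keep]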
