From an interview: Removing rows and columns in an n×n matrix to maximize the sum of remaining values - algorithm

Given an n×n matrix of real numbers. You are allowed to erase any number (from 0 to n) of rows and any number (from 0 to n) of columns, and after that the sum of the remaining entries is computed. Come up with an algorithm which finds out which rows and columns to erase in order to maximize that sum.

The problem is NP-hard. (So you should not expect a polynomial-time algorithm for solving this problem. There could still be (non-polynomial time) algorithms that are slightly better than brute-force, though.) The idea behind the proof of NP-hardness is that if we could solve this problem, then we could solve the the clique problem in a general graph. (The maximum-clique problem is to find the largest set of pairwise connected vertices in a graph.)
Specifically, given any graph with n vertices, let's form the matrix A with entries a[i][j] as follows:
a[i][j] = 1 for i == j (the diagonal entries)
a[i][j] = 0 if the edge (i,j) is present in the graph (and i≠j)
a[i][j] = -n-1 if the edge (i,j) is not present in the graph.
Now suppose we solve the problem of removing some rows and columns (or equivalently, keeping some rows and columns) so that the sum of the entries in the matrix is maximized. Then the answer gives the maximum clique in the graph:
Claim: In any optimal solution, there is no row i and column j kept for which the edge (i,j) is not present in the graph. Proof: Since a[i][j] = -n-1 and the sum of all the positive entries is at most n, picking (i,j) would lead to a negative sum. (Note that deleting all rows and columns would give a better sum, of 0.)
Claim: In (some) optimal solution, the set of rows and columns kept is the same. This is because starting with any optimal solution, we can simply remove all rows i for which column i has not been kept, and vice-versa. Note that since the only positive entries are the diagonal ones, we do not decrease the sum (and by the previous claim, we do not increase it either).
All of which means that if the graph has a maximum clique of size k, then our matrix problem has a solution with sum k, and vice-versa. Therefore, if we could solve our initial problem in polynomial time, then the clique problem would also be solved in polynomial time. This proves that the initial problem is NP-hard. (Actually, it is easy to see that the decision version of the initial problem — is there a way of removing some rows and columns so that the sum is at least k — is in NP, so the (decision version of the) initial problem is actually NP-complete.)

Well the brute force method goes something like this:
For n rows there are 2n subsets.
For n columns there are 2n subsets.
For an n x n matrix there are 22n subsets.
0 elements is a valid subset but obviously if you have 0 rows or 0 columns the total is 0 so there are really 22n-2+1 subsets but that's no different.
So you can work out each combination by brute force as an O(an) algorithm. Fast. :)
It would be quicker to work out what the maximum possible value is and you do that by adding up all the positive numbers in the grid. If those numbers happen to form a valid sub-matrix (meaning you can create that set by removing rows and/or columns) then there's your answer.
Implicit in this is that if none of the numbers are negative then the complete matrix is, by definition, the answer.
Also, knowing what the highest possible maximum is possibly allows you to shortcut the brute force evaluation since if you get any combination equal to that maximum then that is your answer and you can stop checking.
Also if all the numbers are non-positive, the answer is the maximum value as you can reduce the matrix to a 1 x 1 matrix with that 1 value in it, by definition.
Here's an idea: construct 2n-1 n x m matrices where 1 <= m <= n. Process them one after the other. For each n x m matrix you can calculate:
The highest possible maximum sum (as per above); and
Whether no numbers are positive allowing you to shortcut the answer.
if (1) is below the currently calculate highest maximum sum then you can discard this n x m matrix. If (2) is true then you just need a simple comparison to the current highest maximum sum.
This is generally referred to as a pruning technique.
What's more you can start by saying that the highest number in the n x n matrix is the starting highest maximum sum since obviously it can be a 1 x 1 matrix.
I'm sure you could tweak this into a (slightly more) efficient recursive tree-based search algorithm with the above tests effectively allowing you to eliminate (hopefully many) unnecessary searches.

We can improve on Cletus's generalized brute-force solution by modelling this as a directed graph. The initial matrix is the start node of the graph; its leaves are all the matrices missing one row or column, and so forth. It's a graph rather than a tree, because the node for the matrix without both the first column and row will have two parents - the nodes with just the first column or row missing.
We can optimize our solution by turning the graph into a tree: There's never any point exploring a submatrix with a column or row deleted that comes before the one we deleted to get to the current node, as that submatrix will be arrived at anyway.
This is still a brute-force search, of course - but we've eliminated the duplicate cases where we remove the same rows in different orders.
Here's an example implementation in Python:
def maximize_sum(m):
frontier = [(m, 0, False)]
best = None
best_score = 0
while frontier:
current, startidx, cols_done = frontier.pop()
score = matrix_sum(current)
if score > best_score or not best:
best = current
best_score = score
w, h = matrix_size(current)
if not cols_done:
for x in range(startidx, w):
frontier.append((delete_column(current, x), x, False))
startidx = 0
for y in range(startidx, h):
frontier.append((delete_row(current, y), y, True))
return best_score, best
And here's the output on 280Z28's example matrix:
>>> m = ((1, 1, 3), (1, -89, 101), (1, 102, -99))
>>> maximize_sum(m)
(106, [(1, 3), (1, 101)])

Since nobody asked for an efficient algorithm, use brute force: generate every possible matrix that can be created by removing rows and/or columns from the original matrix, choose the best one. A slightly more efficent version, which most likely can be proved to still be correct, is to generate only those variants where the removed rows and columns contain at least one negative value.

To try it in a simple way:
We need the valid subset of the set of entries {A00, A01, A02, ..., A0n, A10, ...,Ann} which max. sum.
First compute all subsets (the power set).
A valid subset is a member of the power set that for each two contained entries Aij and A(i+x)(j+y), contains also the elements A(i+x)j and Ai(j+y) (which are the remaining corners of the rectangle spanned by Aij and A(i+x)(j+y)).
Aij ...
. .
. .
... A(i+x)(j+y)
By that you can eliminate the invalid ones from the power set and find the one with the biggest sum in the remaining.
I'm sure it can be improved by improving an algorithm for power set generation in order to generate only valid subsets and by that avoiding step 2 (adjusting the power set).

I think there are some angles of attack that might improve upon brute force.
memoization, since there are many distinct sequences of edits that will arrive at the same submatrix.
dynamic programming. Because the search space of matrices is highly redundant, my intuition is that there would be a DP formulation that can save a lot of repeated work
I think there's a heuristic approach, but I can't quite nail it down:
if there's one negative number, you can either take the matrix as it is, remove the column of the negative number, or remove its row; I don't think any other "moves" result in a higher sum. For two negative numbers, your options are: remove neither, remove one, remove the other, or remove both (where the act of removal is either by axing the row or the column).
Now suppose the matrix has only one positive number and the rest are all <=0. You clearly want to remove everything but the positive entry. For a matrix with only 2 positive entries and the rest <= 0, the options are: do nothing, whittle down to one, whittle down to the other, or whittle down to both (resulting in a 1x2, 2x1, or 2x2 matrix).
In general this last option falls apart (imagine a matrix with 50 positives & 50 negatives), but depending on your data (few negatives or few positives) it could provide a shortcut.

Create an n-by-1 vector RowSums, and an n-by-1 vector ColumnSums. Initialize them to the row and column sums of the original matrix. O(n²)
If any row or column has a negative sum, remove edit: the one with the minimum such and update the sums in the other direction to reflect their new values. O(n)
Stop when no row or column has a sum less than zero.
This is an iterative variation improving on another answer. It operates in O(n²) time, but fails for some cases mentioned in other answers, which is the complexity limit for this problem (there are n² entries in the matrix, and to even find the minimum you have to examine each cell once).
Edit: The following matrix has no negative rows or columns, but is also not maximized, and my algorithm doesn't catch it.
1 1 3 goal 1 3
1 -89 101 ===> 1 101
1 102 -99
The following matrix does have negative rows and columns, but my algorithm selects the wrong ones for removal.
-5 1 -5 goal 1
1 1 1 ===> 1
-10 2 -10 2
mine
===> 1 1 1

Compute the sum of each row and column. This can be done in O(m) (where m = n^2)
While there are rows or columns that sum to negative remove the row or column that has the lowest sum that is less than zero. Then recompute the sum of each row/column.
The general idea is that as long as there is a row or a column that sums to nevative, removing it will result in a greater overall value. You need to remove them one at a time and recompute because in removing that one row/column you are affecting the sums of the other rows/columns and they may or may not have negative sums any more.
This will produce an optimally maximum result. Runtime is O(mn) or O(n^3)

I cannot really produce an algorithm on top of my head, but to me it 'smells' like dynamic programming, if it serves as a start point.

Big Edit: I honestly don't think there's a way to assess a matrix and determine it is maximized, unless it is completely positive.
Maybe it needs to branch, and fathom all elimination paths. You never no when a costly elimination will enable a number of better eliminations later. We can short circuit if it's found the theoretical maximum, but other than any algorithm would have to be able to step forward and back. I've adapted my original solution to achieve this behaviour with recursion.
Double Secret Edit: It would also make great strides to reduce to complexity if each iteration didn't need to find all negative elements. Considering that they don't change much between calls, it makes more sense to just pass their positions to the next iteration.
Takes a matrix, the list of current negative elements in the matrix, and the theoretical maximum of the initial matrix. Returns the matrix's maximum sum and the list of moves required to get there. In my mind move list contains a list of moves denoting the row/column removed from the result of the previous operation.
Ie: r1,r1
Would translate
-1 1 0 1 1 1
-4 1 -4 5 7 1
1 2 4 ===>
5 7 1
Return if sum of matrix is the theoretical maximum
Find the positions of all negative elements unless an empty set was passed in.
Compute sum of matrix and store it along side an empty move list.
For negative each element:
Calculate the sum of that element's row and column.
clone the matrix and eliminate which ever collection has the minimum sum (row/column) from that clone, note that action as a move list.
clone the list of negative elements and remove any that are effected by the action taken in the previous step.
Recursively call this algorithm providing the cloned matrix, the updated negative element list and the theoretical maximum. Append the moves list returned to the move list for the action that produced the matrix passed to the recursive call.
If the returned value of the recursive call is greater than the stored sum, replace it and store the returned move list.
Return the stored sum and move list.
I'm not sure if it's better or worse than the brute force method, but it handles all the test cases now. Even those where the maximum contains negative values.

This is an optimization problem and can be solved approximately by an iterative algorithm based on simulated annealing:
Notation: C is number of columns.
For J iterations:
Look at each column and compute the absolute benefit of toggling it (turn it off if it's currently on or turn it on if it's currently off). That gives you C values, e.g. -3, 1, 4. A greedy deterministic solution would just pick the last action (toggle the last column to get a benefit of 4) because it locally improves the objective. But that might lock us into a local optimum. Instead, we probabilistically pick one of the three actions, with probabilities proportional to the benefits. To do this, transform them into a probability distribution by putting them through a Sigmoid function and normalizing. (Or use exp() instead of sigmoid()?) So for -3, 1, 4 you get 0.05, 0.73, 0.98 from the sigmoid and 0.03, 0.42, 0.56 after normalizing. Now pick the action according to the probability distribution, e.g. toggle the last column with probability 0.56, toggle the second column with probability 0.42, or toggle the first column with the tiny probability 0.03.
Do the same procedure for the rows, resulting in toggling one of the rows.
Iterate for J iterations until convergence.
We may also, in early iterations, make each of these probability distributions more uniform, so that we don't get locked into bad decisions early on. So we'd raise the unnormalized probabilities to a power 1/T, where T is high in early iterations and is slowly decreased until it approaches 0. For example, 0.05, 0.73, 0.98 from above, raised to 1/10 results in 0.74, 0.97, 1.0, which after normalization is 0.27, 0.36, 0.37 (so it's much more uniform than the original 0.05, 0.73, 0.98).

It's clearly NP-Complete (as outlined above). Given this, if I had to propose the best algorithm I could for the problem:
Try some iterations of quadratic integer programming, formulating the problem as: SUM_ij a_ij x_i y_j, with the x_i and y_j variables constrained to be either 0 or 1. For some matrices I think this will find a solution quickly, for the hardest cases it would be no better than brute force (and not much would be).
In parallel (and using most of the CPU), use a approximate search algorithm to generate increasingly better solutions. Simulating Annealing was suggested in another answer, but having done research on similar combinatorial optimisation problems, my experience is that tabu search would find good solutions faster. This is probably close to optimal in terms of wandering between distinct "potentially better" solutions in the shortest time, if you use the trick of incrementally updating the costs of single changes (see my paper "Graph domination, tabu search and the football pool problem").
Use the best solution so far from the second above to steer the first by avoiding searching possibilities that have lower bounds worse than it.
Obviously this isn't guaranteed to find the maximal solution. But, it generally would when this is feasible, and it would provide a very good locally maximal solution otherwise. If someone had a practical situation requiring such optimisation, this is the solution that I'd think would work best.
Stopping at identifying that a problem is likely to be NP-Complete will not look good in a job interview! (Unless the job is in complexity theory, but even then I wouldn't.) You need to suggest good approaches - that is the point of a question like this. To see what you can come up with under pressure, because the real world often requires tackling such things.

yes, it's NP-complete problem.
It's hard to easily find the best sub-matrix,but we can easily to find some better sub-matrix.
Assume that we give m random points in the matrix as "feeds". then let them to automatically extend by the rules like :
if add one new row or column to the feed-matrix, ensure that the sum will be incrementive.
,then we can compare m sub-matrix to find the best one.

Let's say n = 10.
Brute force (all possible sets of rows x all possible sets of columns) takes
2^10 * 2^10 =~ 1,000,000 nodes.
My first approach was to consider this a tree search, and use
the sum of positive entries is an upper bound for every node in the subtree
as a pruning method. Combined with a greedy algorithm to cheaply generate good initial bounds, this yielded answers in about 80,000 nodes on average.
but there is a better way ! i later realised that
Fix some choice of rows X.
Working out the optimal columns for this set of rows is now trivial (keep a column if its sum of its entries in the rows X is positive, otherwise discard it).
So we can just brute force over all possible choices of rows; this takes 2^10 = 1024 nodes.
Adding the pruning method brought this down to 600 nodes on average.
Keeping 'column-sums' and incrementally updating them when traversing the tree of row-sets should allow the calculations (sum of matrix etc) at each node to be O(n) instead of O(n^2). Giving a total complexity of O(n * 2^n)

For slightly less than optimal solution, I think this is a PTIME, PSPACE complexity issue.
The GREEDY algorithm could run as follows:
Load the matrix into memory and compute row totals. After that run the main loop,
1) Delete the smallest row,
2) Subtract the newly omitted values from the old row totals
--> Break when there are no more negative rows.
Point two is a subtle detail: subtracted two rows/columns has time complexity n.
While re-summing all but two columns has n^2 time complexity!

Take each row and each column and compute the sum. For a 2x2 matrix this will be:
2 1
3 -10
Row(0) = 3
Row(1) = -7
Col(0) = 5
Col(1) = -9
Compose a new matrix
Cost to take row Cost to take column
3 5
-7 -9
Take out whatever you need to, then start again.
You just look for negative values on the new matrix. Those are values that actually substract from the overall matrix value. It terminates when there're no more negative "SUMS" values to take out (therefore all columns and rows SUM something to the final result)
In an nxn matrix that would be O(n^2)Log(n) I think

function pruneMatrix(matrix) {
max = -inf;
bestRowBitField = null;
bestColBitField = null;
for(rowBitField=0; rowBitField<2^matrix.height; rowBitField++) {
for (colBitField=0; colBitField<2^matrix.width; colBitField++) {
sum = calcSum(matrix, rowBitField, colBitField);
if (sum > max) {
max = sum;
bestRowBitField = rowBitField;
bestColBitField = colBitField;
}
}
}
return removeFieldsFromMatrix(bestRowBitField, bestColBitField);
}
function calcSumForCombination(matrix, rowBitField, colBitField) {
sum = 0;
for(i=0; i<matrix.height; i++) {
for(j=0; j<matrix.width; j++) {
if (rowBitField & 1<<i && colBitField & 1<<j) {
sum += matrix[i][j];
}
}
}
return sum;
}

Related

Find minimum steps to convert all elements to zero

You are given an array of positive integers of size N. You can choose any positive number x such that x<=max(Array) and subtract it from all elements of the array greater than and equal to x.
This operation has a cost A[i]-x for A[i]>=x. The total cost for a particular step is the
sum(A[i]-x). A step is only valid if the sum(A[i]-x) is less than or equal to a given number K.
For all the valid steps find the minimum number of steps to make all elements of the array zero.
0<=i<10^5
0<=x<=10^5
0<k<10^5
Can anybody help me with any approach? DP will not work due to high constraints.
Just some general exploratory thoughts.
First, there should be a constraint on N. If N is 3, this is much easier than if it is 100. The naive brute force approach is going to be O(k^N)
Next, you are right that DP will not work with these constraints.
For a greedy approach, I would want to minimize the number of distinct non-zero values, and not maximize how much I took. Our worst case approach is take out the largest each time, for N steps. If you can get 2 pairs of entries to both match, then that shortened our approach.
The obvious thing to try if you can is an A* search. However that requires a LOWER bound (not upper). The best naive lower bound that I can see is ceil(log_2(count_distinct_values)). Unless you're incredibly lucky and the problem can be solved that quickly, this is unlikely to narrow your search enough to be helpful.
I'm curious what trick makes this problem actually doable.
I do have an idea. But it is going to take some thought to make it work. Naively we want to take each choice for x and explore the paths that way. And this is a problem because there are 10^5 choices for x. After 2 choices we have a problem, and after 3 we are definitely not going to be able to do it.
BUT instead consider the possible orders of the array elements (with ties both possible and encouraged) and the resulting inequalities on the range of choices that could have been made. And now instead of having to store a 10^5 choices of x we only need store the distinct orderings we get, and what inequalities there are on the range of choices that get us there. As long as N < 10, the number of weak orderings is something that we can deal with if we're clever.
It would take a bunch of work to flesh out this idea though.
I may be totally wrong, and if so, please tell me and I'm going to delete my thoughts: maybe there is an opportunity if we translate the problem into another form?
You are given an array A of positive integers of size N.
Calculate the histogram H of this array.
The highest populated slot of this histogram has index m ( == max(A)).
Find the shortest sequence of selections of x for:
Select an index x <= m which satisfies sum(H[i]*(i-x)) <= K for i = x+1 .. m (search for suitable x starts from m down)
Add H[x .. m] to H[0 .. m-x]
Set the new m as the highest populated index in H[0 .. x-1] (we ignore everything from H[x] up)
Repeat until m == 0
If there is only a "good" but not optimal solution sought for, I could imagine that some kind of spectral analysis of H could hint towards favorable x selections so that maxima in the histogram pile upon other maxima in the reduction step.

Google Interview : Find the maximum sum of a polygon [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
Given a polygon with N vertexes and N edges. There is an int number(could be negative) on every vertex and an operation in set (*,+) on every edge. Every time, we remove an edge E from the polygon, merge the two vertexes linked by the edge (V1,V2) to a new vertex with value: V1 op(E) V2. The last case would be two vertexes with two edges, the result is the bigger one.
Return the max result value can be gotten from a given polygon.
For the last case we might not need two merge as the other number could be negative, so in that case we would just return the larger number.
How I am approaching the problem:
p[i,j] denotes the maximum value we can obtain by merging nodes from labelled i to j.
p[i,i] = v[i] -- base case
p[i,j] = p[i,k] operator in between p[k+1,j] , for k between i to j-1.
and then p[0,n] will be my answer.
Second point , i will have to start from all the vertices and do the same as above as this will be cyclic n vertices n edges.
The time complexity for this is n^3 *n i.e n^4 .
Can i do better then this ?
As you have identified (tagged) correctly, this indeed is very similar to the matrix multiplication problem (in what order do I multiply matrixes in order to do it quickly).
This can be solved polynomially using a dynamic algorithm.
I'm going to instead solve a similar, more classic (and identical) problem, given a formula with numbers, addition and multiplications, what way of parenthesizing it gives the maximal value, for example
6+1 * 2 becomes (6+1)*2 which is more than 6+(1*2).
Let us denote our input a1 to an real numbers and o(1),...o(n-1) either * or +. Our approach will work as follows, we will observe the subproblem F(i,j) which represents the maximal formula (after parenthasizing) for a1,...aj. We will create a table of such subproblems and observe that F(1,n) is exactly the result we were looking for.
Define
F(i,j)
- If i>j return 0 //no sub-formula of negative length
- If i=j return ai // the maximal formula for one number is the number
- If i<j return the maximal value for all m between i (including) and j (not included) of:
F(i,m) (o(m)) F(m+1,j) //check all places for possible parenthasis insertion
This goes through all possible options. TProof of correctness is done by induction on the size n=j-i and is pretty trivial.
Lets go through runtime analysis:
If we do not save the values dynamically for smaller subproblems this runs pretty slow, however we can make this algorithm perform relatively fast in O(n^3)
We create a n*n table T in which the cell at index i,j contains F(i,j) filling F(i,i) and F(i,j) for j smaller than i is done in O(1) for each cell since we can calculate these values directly, then we go diagonally and fill F(i+1,i+1) (which we can do quickly since we already know all the previous values in the recursive formula), we repeat this n times for n diagonals (all the diagonals in the table really) and filling each cell takes (O(n)), since each cell has O(n) cells we fill each diagonals in O(n^2) meaning we fill all the table in O(n^3). After filling the table we obviously know F(1,n) which is the solution to your problem.
Now back to your problem
If you translate the polygon into n different formulas (one for starting at each vertex) and run the algorithm for formula values on it, you get exactly the value you want.
I think you can reduce the need for a brute force search. For example: if there is a chain of
x + y + z
You can replace it with a single vertex whose value is the sum, you can't do better than that. You need to do the multiplying after the addition when you're dealing with +ve integers. So if it's all positive then simply reduce all + chains and then mutliply.
So that leaves the cases where there are -ve numbers. Seems to me that the strategy for a single -ve number is pretty obvious, for two -ve numbers there are a few cases (remembering that - x - is positive) and for more than 2 -ve numbers it seems to get tricky :-)

Partition a set into k groups with minimum number of moves

You have a set of n objects for which integer positions are given. A group of objects is a set of objects at the same position (not necessarily all the objects at that position: there might be multiple groups at a single position). The objects can be moved to the left or right, and the goal is to move these objects so as to form k groups, and to do so with the minimum distance moved.
For example:
With initial positions at [4,4,7], and k = 3: the minimum cost is 0.
[4,4,7] and k = 2: minimum cost is 0
[1,2,5,7] and k = 2: minimum cost is 1 + 2 = 3
I've been trying to use a greedy approach (by calculating which move would be shortest) but that wouldn't work because every move involves two elements which could be moved either way. I haven't been able to formulate a dynamic programming approach as yet but I'm working on it.
This problem is a one-dimensional instance of the k-medians problem, which can be stated as follows. Given a set of points x_1...x_n, partition these points into k sets S_1...S_k and choose k locations y_1...y_k in a way that minimizes the sum over all x_i of |x_i - y_f(i)|, where y_f(i) is the location corresponding of the set to which x_i is assigned.
Due to the fact that the median is the population minimizer for absolute distance (i.e. L_1 norm), it follows that each location y_j will be the median of the elements x in the corresponding set S_j (hence the name k-medians). Since you are looking at integer values, there is the technicality that if S_j contains an even number of elements, the median might not be an integer, but in such cases choosing either the next integer above or below the median will give the same sum of absolute distances.
The standard heuristic for solving k-medians (and the related and more common k-means problem) is iterative, but this is not guaranteed to produce an optimal or even good solution. Solving the k-medians problem for general metric spaces is NP-hard, and finding efficient approximations for k-medians is an open research problem. Googling "k-medians approximation", for example, will lead to a bunch of papers giving approximation schemes.
http://www.cis.upenn.edu/~sudipto/mypapers/kmedian_jcss.pdf
http://graphics.stanford.edu/courses/cs468-06-winter/Papers/arr-clustering.pdf
In one dimension things become easier, and you can use a dynamic programming approach. A DP solution to the related one-dimensional k-means problem is described in this paper, and the source code in R is available here. See the paper for details, but the idea is essentially the same as what #SajalJain proposed, and can easily be adapted to solve the k-medians problem rather than k-means. For j<=k and m<=n let D(j,m) denote the cost of an optimal j-medians solution to x_1...x_m, where the x_i are assumed to be in sorted order. We have the recurrence
D(j,m) = min (D(j-1,q) + Cost(x_{q+1},...,x_m)
where q ranges from j-1 to m-1 and Cost is equal to the sum of absolute distances from the median. With a naive O(n) implementation of Cost, this would yield an O(n^3k) DP solution to the whole problem. However, this can be improved to O(n^2k) due to the fact that the Cost can be updated in constant time rather than computed from scratch every time, using the fact that, for a sorted sequence:
Cost(x_1,...,x_h) = Cost(x_2,...,x_h) + median(x_1...x_h)-x_1 if h is odd
Cost(x_1,...,x_h) = Cost(x_2,...,x_h) + median(x_2...x_h)-x_1 if h is even
See the writeup for more details. Except for the fact that the update of the Cost function is different, the implementation will be the same for k-medians as for k-means.
http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf
as I understand, the problems is:
we have n points on a line.
we want to place k position on the line. I call them destinations.
move each of n points to one of the k destinations so the sum of distances is minimum. I call this sum, total cost.
destinations can overlap.
An obvious fact is that for each point we should look for the nearest destinations on the left and the nearest destinations on the right and choose the nearest.
Another important fact is all destinations should be on the points. because we can move them on the line to right or to left to reach a point without increasing total distance.
By these facts consider following DP solution:
DP[i][j] means the minimum total cost needed for the first i point, when we can use only j destinations, and have to put a destination on the i-th point.
to calculate DP[i][j] fix the destination before the i-th point (we have i choice), and for each choice (for example k-th point) calculate the distance needed for points between the i-th point and the new point added (k-th point). add this with DP[k][j - 1] and find the minimum for all k.
the calculation of initial states (e.g. j = 1) and final answer is left as an exercise!
Task 0 - sort the position of the objects in non-decreasing order
Let us define 'center' as the position of the object where it is shifted to.
Now we have two observations;
For N positions the 'center' would be the position which is nearest to the mean of these N positions. Example, let 1,3,6,10 be the positions. Then mean = 5. Nearest position is 6. Hence the center for these elements is 6. This gives us the position with minimum cost of moving when all elements need to be grouped into 1 group.
Let N positions be grouped into K groups "optimally". When N+1 th object is added, then it will disturb only the K th group, i.e, first K-1 groups will remain unchanged.
From these observations, we build a dynamic programming approach.
Let Cost[i][k] and Center[i][k] be two 2D arrays.
Cost[i][k] = minimum cost when first 'i' objects are partitioned into 'k' groups
Center[i][k] stores the center of the 'i-th' object when Cost[i][k] is computed.
Let {L} be the elements from i-L,i-L+1,..i-1 which have the same center.
(Center[i-L][k] = Center[i-L+1][k] = ... = Center[i-1][k]) These are the only objects that need to be considered in the computation for i-th element (from observation 2)
Now
Cost[i][k] will be
min(Cost[i-1][k-1] , Cost[i-L-1][k-1] + computecost(i-L, i-L+1, ... ,i))
Update Center[i-L ... i][k]
computecost() can be found trivially by finding the center (from observation 1)
Time Complexity:
Sorting O(NlogN)
Total Cost Computation Matrix = Total elements * Computecost = O(NK * N)
Total = O(NlogN + N*NK) = O(N*NK)
Let's look at k=1.
For k=1 and n odd, all points should move to the center point. For k=1 and n even, all points should move to either of the center points or any spot between them. By 'center' I mean in terms of number of points to either side, i.e. the median.
You can see this because if you select a target spot, x, with more points to its right than it's left, then a new target 1 to the right of x would result in a cost reduction (unless there is exactly one more point to the right than the left and the target spot is a point, in which case n is even and the target is on/between the two center points).
If your points are already sorted, this is an O(1) operation. If not, I believe it's O(n) (via an order statistic algorithm).
Once you've found the spot that all points are moving to, it's O(n) to find the cost.
Thus regardless of whether the points are sorted or not, this is O(n).

Generating Random Matrix With Pairwise Distinct Rows and Columns

I need to randomly generate an NxN matrix of integers in the range 1 to K inclusive such that all rows and columns individually have the property that their elements are pairwise distinct.
For example for N=2 and K=3
This is ok:
1 2
2 1
This is not:
1 3
1 2
(Notice that if K < N this is impossible)
When K is sufficiently larger than N an efficient enough algorithm is just to generate a random matrix of 1..K integers, check that each row and each column is pairwise distinct, and if it isn't try again.
But what about the case where K is not much larger than N?
This is not a full answer, but a warning about an intuitive solution that does not work.
I am assuming that by "randomly generate" you mean with uniform probability on all existing such matrices.
For N=2 and K=3, here are the possible matrices, up to permutations of the set [1..K]:
1 2 1 2 1 2
2 1 2 3 3 1
(since we are ignoring permutations of the set [1..K], we can assume wlog that the first line is 1 2).
Now, an intuitive (but incorrect) strategy would be to draw the matrix entries one by one, ensuring for each entry that it is distinct from the other entries on the same line or column.
To see why it's incorrect, consider that we have drawn this:
1 2
x .
and we are now drawing x. x can be 2 or 3, but if we gave each possibility the probability 1/2, then the matrix
1 2
3 1
would get probability 1/2 of being drawn at the end, while it should have only probability 1/3.
Here is a (textual) solution. I don't think it provides good randomness, but nevertherless it could be ok for your application.
Let's generate a matrix in the range [0;K-1] (you will do +1 for all elements if you want to) with the following algorithm:
Generate the first line with any random method you want.
Each number will be the first element of a random sequence calculated in such a manner that you are guarranteed to have no duplicate in subsequent rows, that is for any distinct column x and y, you will have x[i]!=y[i] for all i in [0;N-1].
Compute each row for the previous one.
All the algorithm is based on the random generator with the property I mentioned. With a quick search, I found that the Inversive congruential generator meets this requirement. It seems to be easy to implement. It works if K is prime; if K is not prime, see on the same page 'Compound Inversive Generators'. Maybe it will be a little tricky to handle with perfect squares or cubic numbers (your problem sound like sudoku :-) ), but I think it is possible by creating compound generators with prime factors of K and different parametrization. For all generators, the first element of each column is the seed.
Whatever the value of K, the complexity is only depending on N and is O(N^2).
Deterministically generate a matrix having the desired property for rows and columns. Provided K > N, this can easily be done by starting the ith row with i, and filling in the rest of the row with i+1, i+2, etc., wrapping back to 1 after K. Other algorithms are possible.
Randomly permute columns, then randomly permute rows.
Let's show that permuting rows (i.e. picking up entire rows and assembling a new matrix from them in some order, with each row possibly in a different vertical position) leaves the desired properties intact for both rows and columns, assuming they were true before. The same reasoning then holds for column permutations, and for any sequence of permutations of either kind.
Trivially, permuting rows cannot change the property that, within each row, no element appears more than once.
The effect of permuting rows on a particular column is to reorder the elements within that column. This holds for any column, and since reordering elements cannot produce duplicate elements where there were none before, permuting rows cannot change the property that, within each column, no element appears more than once.
I'm not certain whether this algorithm is capable of generating all possible satisfying matrices, or if it does, whether it will generate all possible satisfying matrices with equal probability. Another interesting question that I don't have an answer for is: How many rounds of row-permutation-then-column-permutation are needed? More precisely, is any finite sequence of row-perm-then-column-perm rounds equivalent to a bounded number of (or in particular, one) row-perm-then-column-perm round? If so then nothing is gained by further permutations after the first row and column permutations. Perhaps someone with a stronger mathematics background can comment. But it may be good enough in any case.

Revisit: 2D Array Sorted Along X and Y Axis

So, this is a common interview question. There's already a topic up, which I have read, but it's dead, and no answer was ever accepted. On top of that, my interests lie in a slightly more constrained form of the question, with a couple practical applications.
Given a two dimensional array such that:
Elements are unique.
Elements are sorted along the x-axis and the y-axis.
Neither sort predominates, so neither sort is a secondary sorting parameter.
As a result, the diagonal is also sorted.
All of the sorts can be thought of as moving in the same direction. That is to say that they are all ascending, or that they are all descending.
Technically, I think as long as you have a >/=/< comparator, any total ordering should work.
Elements are numeric types, with a single-cycle comparator.
Thus, memory operations are the dominating factor in a big-O analysis.
How do you find an element? Only worst case analysis matters.
Solutions I am aware of:
A variety of approaches that are:
O(nlog(n)), where you approach each row separately.
O(nlog(n)) with strong best and average performance.
One that is O(n+m):
Start in a non-extreme corner, which we will assume is the bottom right.
Let the target be J. Cur Pos is M.
If M is greater than J, move left.
If M is less than J, move up.
If you can do neither, you are done, and J is not present.
If M is equal to J, you are done.
Originally found elsewhere, most recently stolen from here.
And I believe I've seen one with a worst-case O(n+m) but a optimal case of nearly O(log(n)).
What I am curious about:
Right now, I have proved to my satisfaction that naive partitioning attack always devolves to nlog(n). Partitioning attacks in general appear to have a optimal worst-case of O(n+m), and most do not terminate early in cases of absence. I was also wondering, as a result, if an interpolation probe might not be better than a binary probe, and thus it occurred to me that one might think of this as a set intersection problem with a weak interaction between sets. My mind cast immediately towards Baeza-Yates intersection, but I haven't had time to draft an adaptation of that approach. However, given my suspicions that optimality of a O(N+M) worst case is provable, I thought I'd just go ahead and ask here, to see if anyone could bash together a counter-argument, or pull together a recurrence relation for interpolation search.
Here's a proof that it has to be at least Omega(min(n,m)). Let n >= m. Then consider the matrix which has all 0s at (i,j) where i+j < m, all 2s where i+j >= m, except for a single (i,j) with i+j = m which has a 1. This is a valid input matrix, and there are m possible placements for the 1. No query into the array (other than the actual location of the 1) can distinguish among those m possible placements. So you'll have to check all m locations in the worst case, and at least m/2 expected locations for any randomized algorithm.
One of your assumptions was that matrix elements have to be unique, and I didn't do that. It is easy to fix, however, because you just pick a big number X=n*m, replace all 0s with unique numbers less than X, all 2s with unique numbers greater than X, and 1 with X.
And because it is also Omega(lg n) (counting argument), it is Omega(m + lg n) where n>=m.
An optimal O(m+n) solution is to start at the top-left corner, that has minimal value. Move diagonally downwards to the right until you hit an element whose value >= value of the given element. If the element's value is equal to that of the given element, return found as true.
Otherwise, from here we can proceed in two ways.
Strategy 1:
Move up in the column and search for the given element until we reach the end. If found, return found as true
Move left in the row and search for the given element until we reach the end. If found, return found as true
return found as false
Strategy 2:
Let i denote the row index and j denote the column index of the diagonal element we have stopped at. (Here, we have i = j, BTW). Let k = 1.
Repeat the below steps until i-k >= 0
Search if a[i-k][j] is equal to the given element. if yes, return found as true.
Search if a[i][j-k] is equal to the given element. if yes, return found as true.
Increment k
1 2 4 5 6
2 3 5 7 8
4 6 8 9 10
5 8 9 10 11

Resources