Iterative random weighted choice [duplicate]

Suppose that I have an n-sided loaded die, where each side k has some probability pk of coming up when I roll it. I’m curious if there is a good data structure for storing this information statically (i.e., for a fixed set of probabilities), so that I can efficiently simulate a random roll of the die.
Currently, I have an O(lg n) solution for this problem. The idea is to store a table of the cumulative probability of the first k sides for all k, then generate a random real number in the range [0, 1) and perform a binary search over the table to get the largest index whose cumulative value is no greater than the chosen value.
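The cumulative-table idea described above can be sketched in Python roughly as follows. This is only an illustration, not the asker's actual code: the names `build_cdf` and `roll` are hypothetical, sides are numbered from 0, and the optional `u` argument exists only so the lookup can be checked with fixed inputs.

```python
import bisect
import itertools
import random

def build_cdf(probs):
    """Return the running totals p_0, p_0 + p_1, ... (one-time O(n) setup)."""
    return list(itertools.accumulate(probs))

def roll(cdf, u=None):
    """Return a 0-based side index via binary search (O(log n) per roll)."""
    if u is None:
        # Scale by the last total in case the inputs do not sum to exactly 1.
        u = random.random() * cdf[-1]
    # Index of the first cumulative total that is strictly greater than u.
    index = bisect.bisect_right(cdf, u)
    # Guard against floating-point round-off at the very top of the range.
    return min(index, len(cdf) - 1)

cdf = build_cdf([0.2, 0.5, 0.3])
print(roll(cdf, 0.10), roll(cdf, 0.45), roll(cdf, 0.95))  # prints "0 1 2"
```

Note that the search cost is the same whatever the probabilities are, which is exactly the non-adaptive behaviour the question is about.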
I rather like this solution, but it seems odd that the runtime doesn’t take the probabilities into account. In particular, in the extreme cases of one side always coming up or the values being uniformly distributed, it’s possible to generate the result of the roll in O(1) using a naive approach, while my solution will still take logarithmically many steps.
Does anyone have any suggestions for how to solve this problem in a way that is somehow “adaptive” in its runtime?
Update: Based on the answers to this question, I have written up an article describing many approaches to this problem, along with their analyses. It looks like Vose’s implementation of the alias method gives Θ(n) preprocessing time and O(1) time per die roll, which is truly impressive. Hopefully this is a useful addition to the information contained in the answers!

You are looking for the alias method, which provides an O(1) method for sampling from a fixed discrete probability distribution (assuming you can access entries in an array of length n in constant time) with a one-time O(n) set-up. You can find it documented in chapter 3 (PDF) of "Non-Uniform Random Variate Generation" by Luc Devroye.
The idea is to take your array of probabilities pk and produce three new n-element arrays, qk, ak, and bk. Each qk is a probability between 0 and 1, and each ak and bk is an integer between 1 and n.
We generate random numbers between 1 and n by generating two random numbers, r and s, between 0 and 1. Let i = floor(r*n)+1. If qi < s then return ai else return bi. The work in the alias method is in figuring out how to produce qk, ak and bk.
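As a rough illustration of the set-up step, here is a sketch of Vose's variant of the alias method (the one mentioned in the question's update). It is an assumption-laden sketch rather than the construction from Devroye's book: it uses two arrays (`prob`, `alias`) instead of the three arrays qk, ak, bk described above, numbers outcomes from 0, and the names `build_alias` and `alias_roll` are made up for this example.

```python
import random

def build_alias(probs):
    """O(n) preprocessing: split n * p_k into n columns of height 1."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [0.0] * n   # chance of keeping the column's own outcome
    alias = [0] * n    # outcome returned otherwise
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Outcome l donated (1 - scaled[s]) of its mass to column s.
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Whatever is left has height 1 up to round-off.
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def alias_roll(prob, alias, r=None, s=None):
    """O(1) sampling: pick a column, then choose between it and its alias."""
    r = random.random() if r is None else r
    s = random.random() if s is None else s
    i = int(r * len(prob))
    return i if s < prob[i] else alias[i]

prob, alias = build_alias([0.2, 0.5, 0.3])
print(alias_roll(prob, alias))  # one of 0, 1, 2
```

The optional `r` and `s` arguments are there only to make the sampler deterministic for checking; in normal use they would be left out.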

Use a balanced binary search tree (or binary search in an array) and get O(log n) complexity. Have one node for each die result and have the keys be the interval that will trigger that result.
function get_result(node, seed):
    if seed < node.interval.start:
        return get_result(node.left_child, seed)
    else if seed < node.interval.end:
        // start <= seed < end
        return node.result
    else:
        return get_result(node.right_child, seed)
The good thing about this solution is that it is very simple to implement but still has good complexity.

I'm thinking of granulating your table.
Instead of having a table with the cumulative for each die value, you could create an integer array of length xN, where x is ideally a high number to increase accuracy of the probability.
Populate this array using the index (normalized by xN) as the cumulative value and, in each 'slot' in the array, store the would-be dice roll if this index comes up.
Maybe I could explain more easily with an example:
Using a three-sided die: P(1) = 0.2, P(2) = 0.5, P(3) = 0.3
Create an array; in this case I will choose a simple length, say 10 (that is, x = 3.33333).
arr[0] = 1,
arr[1] = 1,
arr[2] = 2,
arr[3] = 2,
arr[4] = 2,
arr[5] = 2,
arr[6] = 2,
arr[7] = 3,
arr[8] = 3,
arr[9] = 3
Then to get a roll, just pick a random integer from 0 to 9 and simply access that index.
This method might lose accuracy, but increase x and the accuracy will be sufficient.
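A minimal sketch of this table approach, under some assumptions of mine: the hypothetical `build_table` fills slot j with the outcome whose cumulative interval contains the midpoint (j + 0.5) / size, which is one of several reasonable ways to round, and outcomes are numbered from 1 as in the example above.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def build_table(probs, size):
    """Fill `size` slots with 1-based outcomes in proportion to `probs`."""
    cdf = list(accumulate(probs))
    last = len(probs) - 1
    return [min(bisect_right(cdf, (j + 0.5) / size), last) + 1
            for j in range(size)]

def table_roll(table):
    """O(1) roll: pick a uniformly random slot."""
    return table[random.randrange(len(table))]

table = build_table([0.2, 0.5, 0.3], 10)
print(table)  # prints [1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
```

The trade-off is memory for speed: probabilities that are not multiples of 1/size are only approximated, so a larger table is needed for finer accuracy.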

There are many ways to generate a random integer with a custom distribution (also known as a discrete distribution). The choice depends on many things, including the number of integers to choose from, the shape of the distribution, and whether the distribution will change over time.
One of the simplest ways to choose an integer with a custom weight function f(x) is the rejection sampling method. The following assumes that the highest possible value of f is max and each weight is 0 or greater. The time complexity for rejection sampling is constant on average, but depends greatly on the shape of the distribution and has a worst case of running forever. To choose an integer in [1, k] using rejection sampling:
1. Choose a uniform random integer i in [1, k].
2. With probability f(i)/max, return i. Otherwise, go to step 1. (For example, if all the weights are integers greater than 0, choose a uniform random integer in [1, max] and if that number is f(i) or less, return i, or go to step 1 otherwise.)
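The two steps above might look like this in Python. This is a sketch under my own assumptions: the weights are passed as a list (so f(i) is `weights[i - 1]`), at least one weight is positive, and the name `rejection_sample` and its `rng` parameter are hypothetical.

```python
import random

def rejection_sample(weights, rng=random):
    """Return an integer in [1, k] with probability proportional to its weight."""
    k = len(weights)
    max_w = max(weights)
    if max_w <= 0:
        raise ValueError("at least one weight must be positive")
    while True:
        i = rng.randint(1, k)                      # step 1: uniform candidate
        if rng.random() * max_w < weights[i - 1]:  # step 2: accept with prob. f(i)/max
            return i

print(rejection_sample([1, 3, 0, 2]))  # 1, 2 or 4; never 3, whose weight is 0
```

The expected number of iterations is max divided by the average weight, which is why a distribution with one very large weight and many small ones makes this method slow.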
Other algorithms have an average sampling time that doesn't depend so greatly on the distribution (usually either constant or logarithmic), but often require you to precalculate the weights in a setup step and store them in a data structure. Some of them are also economical in terms of the number of random bits they use on average. Many of these algorithms were introduced after 2011, and they include—
The Bringmann–Larsen succinct data structure ("Succinct Sampling from Discrete Distributions", 2012),
Yunpeng Tang's multi-level search ("An Empirical Study of Random Sampling Methods for Changing Discrete Distributions", 2019), and
the Fast Loaded Dice Roller (2020).
Other algorithms include the alias method (already mentioned in your article), the Knuth–Yao algorithm, the MVN data structure, and more. See my section "Weighted Choice With Replacement" for a survey.

Related

Sample with given probability

I stumbled upon a basic discrete math/probability question and I wanted to get some ideas for improvements over my solution.
Assume you are given a collection (an alphabet, the natural numbers, etc.). How do you ensure that you draw a certain value X from this collection with a given probability P?
I'll explain my naïve solution with an example:
Collection = {A, B}
X = A, P = 1/4
We build an array v = [A, B, B, B] and we use a rand function to uniformly sample the indices of the array, i.e., {0, 1, 2, 3}
This approach works, but isn't efficient: the smaller P, the bigger the memory storage of v. Hence, I was wondering what ideas the stackoverflow community might have in improving this.
Thanks!

Partition the interval [0,1] into disjoint intervals whose union is [0,1]. Make the size of each interval correspond to the probability of selecting the matching event. Then simply sample uniformly from [0,1], determine which interval the result lies in, and look up the selection that corresponds to that interval. In your example, this would result in the following 2 intervals: [0,1/4) and [1/4,1]. Generate a random uniform value from [0,1]. If your sample lies in the first interval then your selection is X = A; if it lies in the other interval then X = B.

Your proposed solution is indeed not great, and the most general and efficient way to solve it is as mathematician1975 states (this is known as the inverse CDF method). For your specific problem, which is multinomial sampling, you can also use a series of draws from binomial distributions to sample from your collection. This is often more intuitive if you're not familiar with sampling methods.
If the first item in the collection has probability p_1, sample uniformly in the interval [0, 1]. If the sample is less than p_1, return item 1. Otherwise, renormalise the remaining outcomes by 1-p_1 and repeat the process with the next possible outcome. After each unsuccessful sampling, renormalise the remaining outcomes by one minus the total probability of the rejected outcomes, so that the sum of the remaining outcomes is 1. If you get to the last outcome, return it with probability 1. The result of the process will be random samples distributed according to your original vector.
This method is using the fact that individual components of a multinomial are binomially distributed, and any sub vector of the multinomial is also multinomial with parameters given by the renormalisation I describe above.
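The sequential procedure above can be sketched as follows. The function name `sequential_sample` and the `rng` parameter are assumptions for illustration; instead of rescaling every remaining probability, the sketch scales the uniform draw by the probability mass that is still left, which is equivalent.

```python
import random

def sequential_sample(probs, rng=random):
    """Return a 0-based index drawn according to `probs` (assumed to sum to 1)."""
    remaining = 1.0  # probability mass not yet rejected
    for k, p in enumerate(probs[:-1]):
        # Accept outcome k with conditional probability p / remaining.
        if rng.random() * remaining < p:
            return k
        remaining -= p
    # Reaching the last outcome means it is returned with probability 1.
    return len(probs) - 1

print(sequential_sample([0.25, 0.75]))  # 0 about a quarter of the time, else 1
```

This takes O(n) time per sample in the worst case, so it is mainly attractive for its simplicity or when the most likely outcomes are listed first.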

Biasing random number generator to some integer n with deviation b

Given an integer range R = [a, b] (where a >=0 and b <= 100), a bias integer n in R, and some deviation b, what formula can I use to skew a random number generator towards n?
So for example if I had the numbers 1 through 10 inclusively and I don't specify a bias number, then I should in theory have equal chances of randomly drawing one of them.
But if I do give a specific bias number (say, 3), then the number generator should be drawing 3 more frequently than the other numbers.
And if I specify a deviation of say 2 in addition to the bias number, then the number generator should be drawing from 1 through 5 more frequently than 6 through 10.
What algorithm can I use to achieve this?
I'm using Ruby if it makes it any easier/harder.

i think the simplest route is to sample from a normal (aka gaussian) distribution with the properties you want, and then transform the result:
generate a normal value with given mean and sd
round to nearest integer
if outside given range (normal can generate values over the entire range from -infinity to +infinity), discard and repeat
if you need to generate a normal from a uniform the simplest transform is "box-muller".
there are some details you may need to worry about. in particular, box muller is limited in range (it doesn't generate extremely unlikely values, ever). so if you give a very narrow range then you will never get the full range of values. other transforms are not as limited - i'd suggest using whatever ruby provides (look for "normal" or "gaussian").
also, be careful to round the value. 2.6 to 3.4 should all become 3, for example. if you simply discard the decimal (so 3.0 to 3.999 become 3) you will be biased.
if you're really concerned with efficiency, and don't want to discard values, you can simply invent something. one way to cheat is to mix a uniform variate with the bias value (so 9/10 times generate the uniform, 1/10 times return 3, say). in some cases, where you only care about average of the sample, that can be sufficient.
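a minimal sketch of the generate / round / discard loop described above, written in python rather than ruby for consistency with the other examples on this page. the function name `biased_int` and its parameters are hypothetical; `random.gauss` is python's standard-library normal generator.

```python
import random

def biased_int(a, b, mean, sd, rng=random):
    """Return an integer in [a, b], concentrated around `mean`."""
    while True:
        x = round(rng.gauss(mean, sd))  # round to nearest, do not truncate
        if a <= x <= b:                 # discard out-of-range values and retry
            return x

print(biased_int(1, 10, mean=3, sd=2))  # usually somewhere in 1..5
```

if the mean sits far outside [a, b] relative to sd, almost every draw is discarded and the loop becomes very slow, so it is worth validating the inputs in real code.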

For the first part "But if I do give a specific bias number (say, 3), then the number generator should be drawing 3 a more frequently than the other numbers.", a very easy solution:
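import random

def randBias(a, b, biasedNum=None, bias=0):
    # Draw from an extended range [a, b + bias]; the extra `bias` slots
    # all map to biasedNum, making it more likely than the other values.
    x = random.randint(a, b + bias)
    if x <= b:
        return x
    else:
        return biasedNum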
For the second part, I would say it depends on the task. In a case where you need to generate a billion random numbers from the same distribution, I would calculate the probability of the numbers explicitly and use a weighted random number generator (see Random weighted choice).

If you want a unimodal distribution (where the bias is just concentrated in one particular value of your range of numbers, for example, as you state, 3), then the answer provided by andrew cooke is good, mostly because it allows you to fine-tune the deviation very accurately.
If however you wish to make several biases, for instance you want a trimodal distribution with the numbers a, (a+b)/2 and b more frequent than others, then you would do well to implement weighted random selection.
A simple algorithm for this was given in a recent question on StackOverflow; its complexity is linear. Using such an algorithm, you would simply maintain a list, initially containing {a, a+1, a+2,..., b-1, b} (so of size b-a+1), and when you want to add a bias towards X, you would add several copies of X to the list, depending on how much you want to bias. Then you pick a random item from the list.
If you want something more efficient, the most efficient method is called the "Alias method", which was implemented very clearly in Python by Denis Bzowy; once your array has been preprocessed, it runs in constant time (but that means that you can't update the biases anymore once you've done the preprocessing, or you would have to reprocess the table).
The downside with both techniques is that, unlike with the Gaussian distribution, biasing towards X will not also bias somewhat towards X-1 and X+1. To simulate this effect you would have to do something such as
def addBias(x, L):
    # Return a new list with extra copies of x and of its neighbours.
    L = L + [x, x, x, x, x]
    L = L + [x+2]
    L = L + [x+1, x+1]
    L = L + [x-1, x-1, x-1]
    L = L + [x-2]
    return L

From an interview: Removing rows and columns in an n×n matrix to maximize the sum of remaining values

Given an n×n matrix of real numbers. You are allowed to erase any number (from 0 to n) of rows and any number (from 0 to n) of columns, and after that the sum of the remaining entries is computed. Come up with an algorithm which finds out which rows and columns to erase in order to maximize that sum.

The problem is NP-hard. (So you should not expect a polynomial-time algorithm for solving this problem. There could still be (non-polynomial time) algorithms that are slightly better than brute-force, though.) The idea behind the proof of NP-hardness is that if we could solve this problem, then we could solve the clique problem in a general graph. (The maximum-clique problem is to find the largest set of pairwise connected vertices in a graph.)
Specifically, given any graph with n vertices, let's form the matrix A with entries a[i][j] as follows:
a[i][j] = 1 for i == j (the diagonal entries)
a[i][j] = 0 if the edge (i,j) is present in the graph (and i≠j)
a[i][j] = -n-1 if the edge (i,j) is not present in the graph.
Now suppose we solve the problem of removing some rows and columns (or equivalently, keeping some rows and columns) so that the sum of the entries in the matrix is maximized. Then the answer gives the maximum clique in the graph:
Claim: In any optimal solution, there is no row i and column j kept for which the edge (i,j) is not present in the graph. Proof: Since a[i][j] = -n-1 and the sum of all the positive entries is at most n, picking (i,j) would lead to a negative sum. (Note that deleting all rows and columns would give a better sum, of 0.)
Claim: In (some) optimal solution, the set of rows and columns kept is the same. This is because starting with any optimal solution, we can simply remove all rows i for which column i has not been kept, and vice-versa. Note that since the only positive entries are the diagonal ones, we do not decrease the sum (and by the previous claim, we do not increase it either).
All of which means that if the graph has a maximum clique of size k, then our matrix problem has a solution with sum k, and vice-versa. Therefore, if we could solve our initial problem in polynomial time, then the clique problem would also be solved in polynomial time. This proves that the initial problem is NP-hard. (Actually, it is easy to see that the decision version of the initial problem — is there a way of removing some rows and columns so that the sum is at least k — is in NP, so the (decision version of the) initial problem is actually NP-complete.)
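To make the reduction concrete, here is a small sketch that builds the matrix A for a given graph and evaluates the sum for a chosen set of kept vertices. The helper names `clique_matrix` and `kept_sum` are hypothetical, vertices are numbered from 0, and the code only illustrates the construction; it does not solve either problem.

```python
def clique_matrix(n, edges):
    """Build the n x n reduction matrix for a graph on vertices 0..n-1."""
    edge_set = {frozenset(e) for e in edges}
    penalty = -(n + 1)  # entry for a missing edge
    return [
        [1 if i == j else (0 if frozenset((i, j)) in edge_set else penalty)
         for j in range(n)]
        for i in range(n)
    ]

def kept_sum(a, kept):
    """Sum of the entries left when only rows and columns in `kept` remain."""
    return sum(a[i][j] for i in kept for j in kept)

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]  # triangle 0-1-2 plus the edge 2-3
a = clique_matrix(4, edges)
print(kept_sum(a, [0, 1, 2]))  # prints 3: the triangle is a clique of size 3
print(kept_sum(a, [0, 3]))     # prints -8: vertices 0 and 3 are not adjacent
```

Keeping a clique of size k yields a sum of exactly k, while keeping any non-adjacent pair drags the sum below zero, which is the content of the two claims above.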

Well the brute force method goes something like this:
For n rows there are 2^n subsets.
For n columns there are 2^n subsets.
For an n x n matrix there are 2^(2n) subsets.
0 elements is a valid subset but obviously if you have 0 rows or 0 columns the total is 0, so there are really (2^n - 1)^2 + 1 distinct cases, but that's no different.
So you can work out each combination by brute force as an O(a^n) algorithm. Fast. :)
It would be quicker to work out what the maximum possible value is and you do that by adding up all the positive numbers in the grid. If those numbers happen to form a valid sub-matrix (meaning you can create that set by removing rows and/or columns) then there's your answer.
Implicit in this is that if none of the numbers are negative then the complete matrix is, by definition, the answer.
Also, knowing what the highest possible maximum is possibly allows you to shortcut the brute force evaluation since if you get any combination equal to that maximum then that is your answer and you can stop checking.
Also if all the numbers are non-positive, the answer is the maximum value as you can reduce the matrix to a 1 x 1 matrix with that 1 value in it, by definition.
Here's an idea: construct 2^n - 1 n x m matrices where 1 <= m <= n. Process them one after the other. For each n x m matrix you can calculate:
1. The highest possible maximum sum (as per above); and
2. Whether no numbers are positive, allowing you to shortcut the answer.
If (1) is below the currently calculated highest maximum sum then you can discard this n x m matrix. If (2) is true then you just need a simple comparison to the current highest maximum sum.
This is generally referred to as a pruning technique.
What's more you can start by saying that the highest number in the n x n matrix is the starting highest maximum sum since obviously it can be a 1 x 1 matrix.
I'm sure you could tweak this into a (slightly more) efficient recursive tree-based search algorithm with the above tests effectively allowing you to eliminate (hopefully many) unnecessary searches.

We can improve on Cletus's generalized brute-force solution by modelling this as a directed graph. The initial matrix is the start node of the graph; its children are all the matrices missing one row or column, and so forth. It's a graph rather than a tree, because the node for the matrix without both the first column and row will have two parents: the nodes with just the first column or row missing.
We can optimize our solution by turning the graph into a tree: There's never any point exploring a submatrix with a column or row deleted that comes before the one we deleted to get to the current node, as that submatrix will be arrived at anyway.
This is still a brute-force search, of course - but we've eliminated the duplicate cases where we remove the same rows in different orders.
Here's an example implementation in Python:
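# The original answer does not show these helpers; the versions below are assumed.
def matrix_sum(m):
    return sum(sum(row) for row in m)
def matrix_size(m):
    return (len(m[0]) if m else 0, len(m))  # (width, height)
def delete_column(m, x):
    return [row[:x] + row[x + 1:] for row in m]
def delete_row(m, y):
    return [row for i, row in enumerate(m) if i != y]

def maximize_sum(m):
    # Each frontier entry: (matrix, first index still allowed, columns finished?)
    frontier = [(m, 0, False)]
    best = None
    best_score = 0
    while frontier:
        current, startidx, cols_done = frontier.pop()
        score = matrix_sum(current)
        # "best is None" rather than "not best": an empty matrix is falsy too.
        if score > best_score or best is None:
            best = current
            best_score = score
        w, h = matrix_size(current)
        if not cols_done:
            for x in range(startidx, w):
                frontier.append((delete_column(current, x), x, False))
            startidx = 0
        for y in range(startidx, h):
            frontier.append((delete_row(current, y), y, True))
    return best_score, best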
And here's the output on 280Z28's example matrix:
>>> m = ((1, 1, 3), (1, -89, 101), (1, 102, -99))
>>> maximize_sum(m)
(106, [(1, 3), (1, 101)])

Since nobody asked for an efficient algorithm, use brute force: generate every possible matrix that can be created by removing rows and/or columns from the original matrix, and choose the best one. A slightly more efficient version, which most likely can be proved to still be correct, is to generate only those variants where the removed rows and columns contain at least one negative value.

To try it in a simple way:
We need the valid subset of the set of entries {A00, A01, A02, ..., A0n, A10, ...,Ann} that has the maximum sum.
First compute all subsets (the power set).
A valid subset is a member of the power set that for each two contained entries Aij and A(i+x)(j+y), contains also the elements A(i+x)j and Ai(j+y) (which are the remaining corners of the rectangle spanned by Aij and A(i+x)(j+y)).
Aij ...
. .
. .
... A(i+x)(j+y)
By that you can eliminate the invalid ones from the power set and find the one with the biggest sum in the remaining.
I'm sure it can be improved by improving an algorithm for power set generation in order to generate only valid subsets and by that avoiding step 2 (adjusting the power set).
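The validity condition above (every pair of chosen entries must bring along the other two corners of its rectangle) can be checked with a few lines of Python. This is a sketch only: entries are represented as hypothetical (row, column) index pairs and `is_valid_subset` is a made-up name.

```python
def is_valid_subset(cells):
    """True if `cells` is exactly (some set of rows) x (some set of columns)."""
    cells = set(cells)
    return all(
        (i2, j1) in cells and (i1, j2) in cells
        for (i1, j1) in cells
        for (i2, j2) in cells
    )

print(is_valid_subset({(0, 0), (0, 2), (1, 0), (1, 2)}))  # prints True
print(is_valid_subset({(0, 0), (1, 1)}))                  # prints False
```

Filtering the full power set this way means examining 2^(n^2) candidates, so generating row and column subsets directly, as the last paragraph suggests, is far cheaper.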

I think there are some angles of attack that might improve upon brute force.
memoization, since there are many distinct sequences of edits that will arrive at the same submatrix.
dynamic programming. Because the search space of matrices is highly redundant, my intuition is that there would be a DP formulation that can save a lot of repeated work
I think there's a heuristic approach, but I can't quite nail it down:
if there's one negative number, you can either take the matrix as it is, remove the column of the negative number, or remove its row; I don't think any other "moves" result in a higher sum. For two negative numbers, your options are: remove neither, remove one, remove the other, or remove both (where the act of removal is either by axing the row or the column).
Now suppose the matrix has only one positive number and the rest are all <=0. You clearly want to remove everything but the positive entry. For a matrix with only 2 positive entries and the rest <= 0, the options are: do nothing, whittle down to one, whittle down to the other, or whittle down to both (resulting in a 1x2, 2x1, or 2x2 matrix).
In general this last option falls apart (imagine a matrix with 50 positives & 50 negatives), but depending on your data (few negatives or few positives) it could provide a shortcut.

Create an n-by-1 vector RowSums, and an n-by-1 vector ColumnSums. Initialize them to the row and column sums of the original matrix. O(n²)
If any row or column has a negative sum, remove the one with the minimum such sum (edit) and update the sums in the other direction to reflect their new values. O(n)
Stop when no row or column has a sum less than zero.
This is an iterative variation improving on another answer. It operates in O(n²) time, which is the complexity limit for this problem (there are n² entries in the matrix, and to even find the minimum you have to examine each cell once), but it fails for some cases mentioned in other answers.
Edit: The following matrix has no negative rows or columns, but is also not maximized, and my algorithm doesn't catch it.
  1    1    3     goal     1    3
  1  -89  101     ===>     1  101
  1  102  -99
The following matrix does have negative rows and columns, but my algorithm selects the wrong ones for removal.
 -5   1   -5     goal     1
  1   1    1     ===>     1
-10   2  -10              2

                 mine
                 ===>     1 1 1
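The greedy procedure and both failure cases can be reproduced with the sketch below. It is an illustration, not the answerer's code: the name `greedy_remove` is hypothetical, ties are broken in favour of rows, and for brevity the sums are recomputed from scratch each round instead of being updated incrementally in O(n).

```python
def greedy_remove(m):
    """Repeatedly drop the row or column with the most negative sum."""
    rows = list(range(len(m)))
    cols = list(range(len(m[0])))
    while rows and cols:
        row_sums = {r: sum(m[r][c] for c in cols) for r in rows}
        col_sums = {c: sum(m[r][c] for r in rows) for c in cols}
        r_min = min(rows, key=row_sums.get)
        c_min = min(cols, key=col_sums.get)
        if min(row_sums[r_min], col_sums[c_min]) >= 0:
            break  # nothing negative is left to remove
        if row_sums[r_min] <= col_sums[c_min]:
            rows.remove(r_min)
        else:
            cols.remove(c_min)
    total = sum(m[r][c] for r in rows for c in cols)
    return total, rows, cols

print(greedy_remove([[1, 1, 3], [1, -89, 101], [1, 102, -99]]))
# prints (22, [0, 1, 2], [0, 1, 2]); the best possible sum is 106
print(greedy_remove([[-5, 1, -5], [1, 1, 1], [-10, 2, -10]]))
# prints (3, [1], [0, 1, 2]); keeping only the middle column gives 4
```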

Compute the sum of each row and column. This can be done in O(m) (where m = n^2)
While there are rows or columns that sum to negative remove the row or column that has the lowest sum that is less than zero. Then recompute the sum of each row/column.
The general idea is that as long as there is a row or a column that sums to negative, removing it will result in a greater overall value. You need to remove them one at a time and recompute because in removing that one row/column you are affecting the sums of the other rows/columns and they may or may not have negative sums any more.
This will produce an optimally maximum result. Runtime is O(mn) or O(n^3)

I cannot really produce an algorithm off the top of my head, but to me it 'smells' like dynamic programming, if that serves as a starting point.

Big Edit: I honestly don't think there's a way to assess a matrix and determine it is maximized, unless it is completely positive.
Maybe it needs to branch, and fathom all elimination paths. You never know when a costly elimination will enable a number of better eliminations later. We can short circuit if it's found the theoretical maximum, but other than that any algorithm would have to be able to step forward and back. I've adapted my original solution to achieve this behaviour with recursion.
Double Secret Edit: It would also make great strides to reduce the complexity if each iteration didn't need to find all negative elements. Considering that they don't change much between calls, it makes more sense to just pass their positions to the next iteration.
Takes a matrix, the list of current negative elements in the matrix, and the theoretical maximum of the initial matrix. Returns the matrix's maximum sum and the list of moves required to get there. In my mind move list contains a list of moves denoting the row/column removed from the result of the previous operation.
I.e., r1,r1 would translate:
-1  1  0           1  1  1
-4  1 -4           5  7  1
 1  2  4    ===>
 5  7  1
1. Return if the sum of the matrix is the theoretical maximum.
2. Find the positions of all negative elements unless an empty set was passed in.
3. Compute the sum of the matrix and store it alongside an empty move list.
4. For each negative element:
   a. Calculate the sum of that element's row and column.
   b. Clone the matrix and eliminate whichever collection (row/column) has the minimum sum from that clone; note that action as a move list.
   c. Clone the list of negative elements and remove any that are affected by the action taken in the previous step.
   d. Recursively call this algorithm providing the cloned matrix, the updated negative element list and the theoretical maximum. Append the moves list returned to the move list for the action that produced the matrix passed to the recursive call.
   e. If the returned value of the recursive call is greater than the stored sum, replace it and store the returned move list.
5. Return the stored sum and move list.
I'm not sure if it's better or worse than the brute force method, but it handles all the test cases now. Even those where the maximum contains negative values.

This is an optimization problem and can be solved approximately by an iterative algorithm based on simulated annealing:
Notation: C is number of columns.
For J iterations:
Look at each column and compute the absolute benefit of toggling it (turn it off if it's currently on or turn it on if it's currently off). That gives you C values, e.g. -3, 1, 4. A greedy deterministic solution would just pick the last action (toggle the last column to get a benefit of 4) because it locally improves the objective. But that might lock us into a local optimum. Instead, we probabilistically pick one of the three actions, with probabilities proportional to the benefits. To do this, transform them into a probability distribution by putting them through a Sigmoid function and normalizing. (Or use exp() instead of sigmoid()?) So for -3, 1, 4 you get 0.05, 0.73, 0.98 from the sigmoid and 0.03, 0.42, 0.56 after normalizing. Now pick the action according to the probability distribution, e.g. toggle the last column with probability 0.56, toggle the second column with probability 0.42, or toggle the first column with the tiny probability 0.03.
Do the same procedure for the rows, resulting in toggling one of the rows.
Iterate for J iterations until convergence.
We may also, in early iterations, make each of these probability distributions more uniform, so that we don't get locked into bad decisions early on. So we'd raise the unnormalized probabilities to a power 1/T, where T is high in early iterations and is slowly decreased until it approaches 0. For example, 0.05, 0.73, 0.98 from above, raised to 1/10 results in 0.74, 0.97, 1.0, which after normalization is 0.27, 0.36, 0.37 (so it's much more uniform than the original 0.05, 0.73, 0.98).
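The sigmoid, temperature and normalization steps described above can be checked numerically with a short sketch. The function name `toggle_probabilities` and its `temperature` parameter are hypothetical names chosen for this example.

```python
import math

def toggle_probabilities(benefits, temperature=1.0):
    """Turn per-column (or per-row) benefits into a probability distribution."""
    scores = [1.0 / (1.0 + math.exp(-b)) for b in benefits]  # sigmoid
    tempered = [s ** (1.0 / temperature) for s in scores]    # flatten when T > 1
    total = sum(tempered)
    return [t / total for t in tempered]

print([round(p, 2) for p in toggle_probabilities([-3, 1, 4])])
# prints [0.03, 0.42, 0.56]
print([round(p, 2) for p in toggle_probabilities([-3, 1, 4], temperature=10)])
# prints [0.27, 0.36, 0.37]
```

An action could then be drawn with something like `random.choices(range(len(benefits)), weights=probabilities)` from Python's standard library.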

It's clearly NP-Complete (as outlined above). Given this, if I had to propose the best algorithm I could for the problem:
Try some iterations of quadratic integer programming, formulating the problem as: SUM_ij a_ij x_i y_j, with the x_i and y_j variables constrained to be either 0 or 1. For some matrices I think this will find a solution quickly, for the hardest cases it would be no better than brute force (and not much would be).
In parallel (and using most of the CPU), use an approximate search algorithm to generate increasingly better solutions. Simulated annealing was suggested in another answer, but having done research on similar combinatorial optimisation problems, my experience is that tabu search would find good solutions faster. This is probably close to optimal in terms of wandering between distinct "potentially better" solutions in the shortest time, if you use the trick of incrementally updating the costs of single changes (see my paper "Graph domination, tabu search and the football pool problem").
Use the best solution so far from the second above to steer the first by avoiding searching possibilities that have lower bounds worse than it.
Obviously this isn't guaranteed to find the maximal solution. But, it generally would when this is feasible, and it would provide a very good locally maximal solution otherwise. If someone had a practical situation requiring such optimisation, this is the solution that I'd think would work best.
Stopping at identifying that a problem is likely to be NP-Complete will not look good in a job interview! (Unless the job is in complexity theory, but even then I wouldn't.) You need to suggest good approaches - that is the point of a question like this. To see what you can come up with under pressure, because the real world often requires tackling such things.

Yes, it's an NP-complete problem.
It's hard to find the best sub-matrix, but we can easily find some better sub-matrix.
Assume that we place m random points in the matrix as "seeds", then let them grow automatically by a rule like:
when adding a new row or column to a seed's sub-matrix, ensure that the sum increases.
Then we can compare the m resulting sub-matrices to find the best one.

Let's say n = 10.
Brute force (all possible sets of rows x all possible sets of columns) takes
2^10 * 2^10 =~ 1,000,000 nodes.
My first approach was to consider this a tree search, and use
the sum of positive entries is an upper bound for every node in the subtree
as a pruning method. Combined with a greedy algorithm to cheaply generate good initial bounds, this yielded answers in about 80,000 nodes on average.
But there is a better way! I later realised the following:
Fix some choice of rows X.
Working out the optimal columns for this set of rows is now trivial (keep a column if the sum of its entries in the rows X is positive, otherwise discard it).
So we can just brute force over all possible choices of rows; this takes 2^10 = 1024 nodes.
Adding the pruning method brought this down to 600 nodes on average.
Keeping 'column-sums' and incrementally updating them when traversing the tree of row-sets should allow the calculations (sum of matrix etc) at each node to be O(n) instead of O(n^2), giving a total complexity of O(n * 2^n).
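Here is a rough Python sketch of the "fix the rows, then pick the columns" idea. It is only an illustration: `best_submatrix` is a name I made up, it allows the empty selection (sum 0), and it recomputes the column sums for every row set, so it runs in O(n^2 * 2^n) rather than using the incremental update or the pruning mentioned above.

```python
def best_submatrix(matrix):
    """Return (best_sum, rows, cols) maximizing the sum of the kept entries.

    Tries every subset of rows; for each one, keeps exactly the columns
    whose sum over the chosen rows is positive.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    best = (0, [], [])  # the empty selection has sum 0
    for mask in range(1 << n_rows):
        rows = [i for i in range(n_rows) if (mask >> i) & 1]
        # Sum of each column restricted to the chosen rows.
        col_sums = [sum(matrix[i][j] for i in rows) for j in range(n_cols)]
        cols = [j for j, s in enumerate(col_sums) if s > 0]
        total = sum(col_sums[j] for j in cols)
        if total > best[0]:
            best = (total, rows, cols)
    return best

print(best_submatrix([[2, 1], [3, -10]]))  # keeps both rows and the first column
```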

For a slightly less than optimal solution, I think this can be done in polynomial time and space.
The GREEDY algorithm could run as follows:
Load the matrix into memory and compute row totals. After that run the main loop,
1) Delete the smallest row,
2) Subtract the newly omitted values from the old row totals
--> Break when there are no more negative rows.
Point two is a subtle detail: subtracting two rows/columns has time complexity n,
while re-summing all but two columns has n^2 time complexity!

Take each row and each column and compute the sum. For a 2x2 matrix this will be:
2 1
3 -10
Row(0) = 3
Row(1) = -7
Col(0) = 5
Col(1) = -9
Compose a new matrix
Cost to take row    Cost to take column
 3                   5
-7                  -9
Take out whatever you need to, then start again.
You just look for negative values in the new matrix. Those are the rows and columns that actually subtract from the overall matrix value. It terminates when there are no more negative "SUMS" values to take out (so every remaining row and column adds something to the final result).
In an n x n matrix that would be O(n^2 log n), I think.

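// Brute force: try every subset of rows and columns (encoded as bit fields)
// and keep the combination with the largest sum.
function pruneMatrix(matrix) {
  const height = matrix.length;
  const width = matrix[0].length;
  let max = -Infinity;
  let bestRowBitField = 0;
  let bestColBitField = 0;
  // 1 << n is 2^n; in JavaScript the ^ operator is XOR, not a power.
  for (let rowBitField = 0; rowBitField < (1 << height); rowBitField++) {
    for (let colBitField = 0; colBitField < (1 << width); colBitField++) {
      const sum = calcSumForCombination(matrix, rowBitField, colBitField);
      if (sum > max) {
        max = sum;
        bestRowBitField = rowBitField;
        bestColBitField = colBitField;
      }
    }
  }
  return removeFieldsFromMatrix(matrix, bestRowBitField, bestColBitField);
}
// Sum of the entries whose row bit and column bit are both set.
function calcSumForCombination(matrix, rowBitField, colBitField) {
  let sum = 0;
  for (let i = 0; i < matrix.length; i++) {
    for (let j = 0; j < matrix[i].length; j++) {
      if ((rowBitField & (1 << i)) && (colBitField & (1 << j))) {
        sum += matrix[i][j];
      }
    }
  }
  return sum;
}
// Keep only the rows and columns whose bits are set.
function removeFieldsFromMatrix(matrix, rowBitField, colBitField) {
  return matrix
    .filter((_, i) => rowBitField & (1 << i))
    .map(row => row.filter((_, j) => colBitField & (1 << j)));
}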

How to calculate or approximate the median of a list without storing the list

I'm trying to calculate the median of a set of values, but I don't want to store all the values as that could blow memory requirements. Is there a way of calculating or approximating the median without storing and sorting all the individual values?
Ideally I would like to write my code a bit like the following
var medianCalculator = new MedianCalculator();
foreach (var value in SourceData)
{
    medianCalculator.Add(value);
}
Console.WriteLine("The median is: {0}", medianCalculator.Median);
All I need is the actual MedianCalculator code!
Update: Some people have asked if the values I'm trying to calculate the median for have known properties. The answer is yes. One value is in 0.5 increments from about -25 to -0.5. The other is also in 0.5 increments from -120 to -60. I guess this means I can use some form of histogram for each value.
Thanks
Nick

If the values are discrete and the number of distinct values isn't too high, you could just accumulate the number of times each value occurs in a histogram, then find the median from the histogram counts (just add up counts from the top and bottom of the histogram until you reach the middle). Or if they're continuous values, you could distribute them into bins - that wouldn't tell you the exact median but it would give you a range, and if you need to know more precisely you could iterate over the list again, examining only the elements in the central bin.
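A small sketch of the histogram idea for discrete values (for instance values in 0.5 increments). The class name `HistogramMedian` is just for illustration; memory grows with the number of distinct values, not with the number of samples.

```python
from collections import Counter

class HistogramMedian:
    """Median of discrete values, storing one counter per distinct value."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def add(self, value):
        self.counts[value] += 1
        self.total += 1

    def median(self):
        if self.total == 0:
            raise ValueError("no data")
        # Zero-based ranks of the middle element(s); they differ only
        # when the number of samples is even.
        lower_rank = (self.total - 1) // 2
        upper_rank = self.total // 2
        seen = 0
        lower = None
        for value in sorted(self.counts):
            seen += self.counts[value]
            if lower is None and seen > lower_rank:
                lower = value
            if seen > upper_rank:
                return (lower + value) / 2

calc = HistogramMedian()
for v in [-1.0, -0.5, -2.0, -0.5, -1.5]:
    calc.add(v)
print("The median is:", calc.median())
```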

There is the 'remedian' statistic. It works by first setting up k arrays, each of length b. Data values are fed into the first array and, when this is full, its median is calculated and stored in the next free position of the second array, after which the first array is re-used. When the second array is full, the median of its values is stored in the next free position of the third array, and so on. You get the idea :)
It's simple and pretty robust. The reference is here...
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Remedian.pdf
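A simplified sketch of the remedian, to make the buffer cascade concrete. The class and method names are mine, and there are two simplifications I should flag: it is meant for at most b**k values, and for incomplete data `estimate()` just takes the median of the highest non-empty buffer rather than combining the partially filled buffers.

```python
from statistics import median

class Remedian:
    """Approximate median using k buffers of b values each."""

    def __init__(self, b=11, k=4):
        self.b = b
        self.buffers = [[] for _ in range(k)]

    def add(self, value):
        level = 0
        self.buffers[0].append(value)
        # When a buffer fills up, push its median one level up and reuse it.
        while (len(self.buffers[level]) == self.b
               and level + 1 < len(self.buffers)):
            m = median(self.buffers[level])
            self.buffers[level].clear()
            level += 1
            self.buffers[level].append(m)

    def estimate(self):
        # Simplification: use the highest buffer that holds anything.
        for buf in reversed(self.buffers):
            if buf:
                return median(buf)
        raise ValueError("no data")

r = Remedian(b=3, k=2)
for v in range(1, 10):
    r.add(v)
print(r.estimate())
```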
Hope this helps
Michael

I use these incremental/recursive mean and median estimators, which both use constant storage:
mean += eta * (sample - mean)
median += eta * sgn(sample - median)
where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.
Also, I modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
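For what it's worth, the update rule can be wrapped up like this. It is only a sketch: the class name, the default `eta` and the `initial` starting value are arbitrary choices of mine, and how close the estimate gets depends on `eta` and on the scale of the data.

```python
def sgn(x):
    """Signum: returns -1, 0 or 1."""
    return (x > 0) - (x < 0)

class IncrementalQuantile:
    """Constant-memory quantile estimate (p=0.5 gives the median)."""

    def __init__(self, p=0.5, eta=0.001, initial=0.0):
        self.p = p
        self.eta = eta
        self.estimate = initial

    def add(self, sample):
        # quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
        self.estimate += self.eta * (
            sgn(sample - self.estimate) + 2.0 * self.p - 1.0
        )
        return self.estimate
```

Usage is one `add(sample)` call per value, reading `estimate` whenever you need it. A larger `eta` converges faster but jitters more around the true quantile.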
I would love to see an incremental mode estimator of a similar form...
(Note: I also posted this to a similar topic here: "On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?)

Here is a crazy approach that you might try. This is a classical problem in streaming algorithms. The rules are
You have limited memory, say O(log n), where n is the number of items in the stream.
You can look at each item once and must decide then and there what to do with it: if you store it, it costs memory; if you throw it away, it is gone forever.
The idea for finding a median is simple. Sample O(1 / a^2 * log(1 / p)) * log(n) elements from the list at random; you can do this via reservoir sampling (see a previous question). Now simply return the median of your sampled elements, using a classical method.
The guarantee is that the rank of the item returned, as a fraction of the list length, will be within (1 +/- a) / 2 with probability at least 1-p. So there is a probability p of failing, which you can lower by sampling more elements. It won't necessarily return the median, or guarantee that the value of the item returned is anywhere close to the median's value, just that when you sort the list the item returned will be close to the middle of the list.
This algorithm uses O(log n) additional space and runs in linear time.
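A short sketch of the sampling approach, using the standard one-pass reservoir sampling scheme. The function names are mine, and the sample size `k` is left to the caller instead of being derived from a and p.

```python
import random
from statistics import median

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

def approximate_median(stream, k, rng=random):
    """Median of a random sample of at most k items from the stream."""
    sample = reservoir_sample(stream, k, rng)
    if not sample:
        raise ValueError("empty stream")
    return median(sample)

print(approximate_median(iter(range(10001)), 1000))
```

If the stream has fewer than `k` items, the whole stream ends up in the reservoir and the result is the exact median.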

This is tricky to get right in general, especially to handle degenerate series that are already sorted, or have a bunch of values at the "start" of the list but the end of the list has values in a different range.
The basic idea of making a histogram is most promising. This lets you accumulate distribution information and answer queries (like median) from it. The median will be approximate since you obviously don't store all values. The storage space is fixed so it will work with whatever length sequence you have.
But you can't just build a histogram from say the first 100 values and use that histogram continually.. the changing data may make that histogram invalid. So you need a dynamic histogram that can change its range and bins on the fly.
Make a structure which has N bins. You'll store the X value of each slot transition (N+1 values total) as well as the population of the bin.
Stream in your data. Record the first N+1 values. If the stream ends before this, great, you have all the values loaded and you can find the exact median and return it. Else use the values to define your first histogram. Just sort the values and use those as bin definitions, each bin having a population of 1. It's OK to have dupes (0 width bins).
Now stream in new values. For each one, binary search to find the bin it belongs to.
In the common case, you just increment the population of that bin and continue.
If your sample is beyond the histogram's edges (highest or lowest), just extend the end bin's range to include it.
When your stream is done, you find the median sample value by finding the bin which has equal population on both sides of it, and linearly interpolating the remaining bin-width.
But that's not enough.. you still need to ADAPT the histogram to the data as it's being streamed in. When a bin gets over-full, you're losing information about that bin's sub distribution.
You can fix this by adapting based on some heuristic... The easiest and most robust one is if a bin reaches some certain threshold population (something like 10*v/N where v=# of values seen so far in the stream, and N is the number of bins), you SPLIT that overfull bin. Add a new value at the midpoint of the bin, give each side half of the original bin's population. But now you have too many bins, so you need to DELETE a bin. A good heuristic for that is to find the bin with the smallest product of population and width. Delete it and merge it with its left or right neighbor (whichever one of the neighbors itself has the smallest product of width and population.). Done!
Note that merging or splitting bins loses information, but that's unavoidable.. you only have fixed storage.
This algorithm is nice in that it will deal with all types of input streams and give good results. If you have the luxury of choosing sample order, a random sample is best, since that minimizes splits and merges.
The algorithm also allows you to query any percentile, not just median, since you have a complete distribution estimate.
I use this method in my own code in many places, mostly for debugging logs.. where some stats that you're recording have unknown distribution. With this algorithm you don't need to guess ahead of time.
The downside is that the unequal bin widths mean you have to do a binary search for each sample, so each update costs O(log N) and processing v samples costs O(v log N).
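To make the steps above concrete, here is my own rough Python sketch of such an adaptive histogram. Everything in it (class name, `split_factor`, helper methods) is illustrative rather than the code I actually use, and `quantile` falls back to a simple nearest-rank lookup while fewer than N+1 values have been seen.

```python
import bisect

class AdaptiveHistogram:
    """Fixed-size histogram with variable-width bins (illustrative sketch)."""

    def __init__(self, n_bins=32, split_factor=10.0):
        self.n_bins = n_bins
        self.split_factor = split_factor
        self.edges = []    # n_bins + 1 sorted bin boundaries once initialized
        self.counts = []   # population of each bin
        self.seen = 0
        self._first = []   # holds the first n_bins + 1 values

    def add(self, x):
        self.seen += 1
        if not self.edges:
            self._first.append(x)
            if len(self._first) == self.n_bins + 1:
                self.edges = sorted(self._first)
                self.counts = [1.0] * self.n_bins
            return
        if x < self.edges[0]:          # extend the lowest bin
            self.edges[0] = x
            i = 0
        elif x > self.edges[-1]:       # extend the highest bin
            self.edges[-1] = x
            i = self.n_bins - 1
        else:
            i = min(bisect.bisect_right(self.edges, x) - 1, self.n_bins - 1)
        self.counts[i] += 1
        if self.counts[i] > self.split_factor * self.seen / self.n_bins:
            self._split(i)
            self._merge_smallest()

    def _split(self, i):
        mid = (self.edges[i] + self.edges[i + 1]) / 2
        half = self.counts[i] / 2
        self.edges.insert(i + 1, mid)
        self.counts[i:i + 1] = [half, half]

    def _weight(self, i):
        return self.counts[i] * (self.edges[i + 1] - self.edges[i])

    def _merge_smallest(self):
        last = len(self.counts) - 1
        j = min(range(last + 1), key=self._weight)
        if j == 0:
            lo = 0
        elif j == last:
            lo = last - 1
        else:  # merge with the neighbor that has the smaller weight
            lo = j - 1 if self._weight(j - 1) <= self._weight(j + 1) else j
        self.counts[lo] += self.counts[lo + 1]
        del self.counts[lo + 1]
        del self.edges[lo + 1]

    def quantile(self, p=0.5):
        if not self.edges:  # too few values for a histogram: nearest rank
            data = sorted(self._first)
            if not data:
                raise ValueError("no data")
            return data[min(int(p * len(data)), len(data) - 1)]
        target = p * sum(self.counts)
        cumulative = 0.0
        for i, c in enumerate(self.counts):
            if cumulative + c >= target:
                frac = (target - cumulative) / c  # linear interpolation
                return self.edges[i] + frac * (self.edges[i + 1] - self.edges[i])
            cumulative += c
        return self.edges[-1]
```

Call `add(x)` for each streamed value and `quantile(0.5)` at any point for the current median estimate.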

David's suggestion seems like the most sensible approach for approximating the median.
A running mean for the same problem is much easier to calculate:
M_n = M_(n-1) + ((V_n - M_(n-1)) / n)
Where M_n is the mean of n values, M_(n-1) is the previous mean, and V_n is the new value.
In other words, the new mean is the existing mean plus the difference between the new value and the mean, divided by the number of values.
In code this would look something like:
new_mean = prev_mean + ((value - prev_mean) / count)
though obviously you may want to consider language-specific stuff like floating-point rounding errors etc.

I don't think it is possible to do without having the list in memory. You can obviously approximate with:
the average, if you know that the data is symmetrically distributed;
or a proper median of a small subset of the data (that fits in memory), if you know that your data has the same distribution across the sample (e.g. that the first item has the same distribution as the last one).

Find the min and max of the list containing N items through a linear scan and name them LowValue and HighValue.
Let MedianIndex = (N+1)/2
1st Order Binary Search:
Repeat the following 4 steps while LowValue < HighValue.
Get MedianValue approximately = ( HighValue + LowValue ) / 2
Get K = NumberOfItemsWhichAreLessThanOrEqualToMedianValue
Is K = MedianIndex? Then return MedianValue
Is K > MedianIndex? Then HighValue = MedianValue, else LowValue = MedianValue
Each iteration needs one pass over the data but no extra memory.
2nd Order Binary Search:
LowIndex=1
HighIndex=N
Repeat the following 5 steps while (LowIndex < HighIndex)
Get approximate DistributionPerUnit = (HighValue-LowValue)/(HighIndex-LowIndex)
Get approximate MedianValue = LowValue + (MedianIndex-LowIndex) * DistributionPerUnit
Get K = NumberOfItemsWhichAreLessThanOrEqualToMedianValue
Is (K = MedianIndex)? Then return MedianValue
Is (K > MedianIndex)? Then HighIndex=K and HighValue=MedianValue, else LowIndex=K and LowValue=MedianValue
It will be faster than 1st order, again without consuming memory.
We can also think of fitting HighValue, LowValue and MedianValue with HighIndex, LowIndex and MedianIndex to a parabola, and get a 3rd Order Binary Search which will be faster than 2nd order without consuming memory, and so on...
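A sketch of the 1st order search in Python, under a few assumptions of mine: the data can be re-read as many times as needed (`make_stream` is a hypothetical callable that returns a fresh iterator each time), `tol` is smaller than the gap between distinct values, and for an even count the lower of the two middle values is returned. The final pass snaps the result to an actual data value, which goes slightly beyond the steps listed above.

```python
def median_by_value_bisection(make_stream, tol=1e-6):
    """Find the median by bisecting on value, one counting pass per step."""
    low = min(make_stream())
    high = max(make_stream())
    n = sum(1 for _ in make_stream())
    median_index = (n + 1) // 2  # 1-based rank of the (lower) median
    while high - low > tol:
        mid = (low + high) / 2
        k = sum(1 for v in make_stream() if v <= mid)
        if k >= median_index:
            high = mid   # the median is at or below mid
        else:
            low = mid    # the median is above mid
    # Snap to the largest data value not exceeding the final bound.
    return max(v for v in make_stream() if v <= high)

data = [5, 1, 9, 3, 7]
print(median_by_value_bisection(lambda: iter(data)))
```

Only a handful of scalars are kept in memory; the cost is roughly log2((max - min) / tol) passes over the data.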

Usually if the input is within a certain range, say 1 to 1 million, it's easy to create an array of counts: read the code for "quantile" and "ibucket" here: http://code.google.com/p/ea-utils/source/browse/trunk/clipper/sam-stats.cpp
This solution can be generalized as an approximation by coercing the input into an integer within some range using a function that you then reverse on the way out, e.g. foo.push((int) input/1000000) and quantile(foo)*1000000.
If your input is an arbitrary double precision number, then you've got to autoscale your histogram as values come in that are out of range (see above).
Or you can use the median-triplets method described in this paper: http://web.cs.wpi.edu/~hofri/medsel.pdf

I picked up the idea of iterative quantile calculation. It is important to have good values for the starting point and eta; these may come from mean and sigma. So I programmed this:
Function QuantileIterative(Var x : Array of Double; n : Integer; p, mean, sigma : Double) : Double;
Var
  eta, quantile, q1, dq : Double;
  i : Integer;
Begin
  // starting point derived from mean and sigma
  quantile := mean + 1.25*sigma*(p-0.5);
  q1 := quantile;
  eta := 0.2*sigma/xy(1+n,0.75); // should not be too large! sets accuracy
  // first pass (open array parameters are zero-based: 0..n-1)
  For i:=0 to n-1 Do
    quantile := quantile + eta * (signum_smooth(x[i] - quantile,eta) + 2*p - 1);
  dq := abs(q1-quantile);
  // second pass if the estimate moved by more than eta
  If dq>eta then
  Begin
    If dq<3*eta then eta := eta/4;
    For i:=0 to n-1 Do
      quantile := quantile + eta * (signum_smooth(x[i] - quantile,eta) + 2*p - 1);
  end;
  QuantileIterative := quantile
end;
As the median for two elements would be the mean, I used a smoothed signum function, and xy() is x^y. Are there ideas to make it better? Of course if we have some more a-priori knowledge we can add code using min and max of the array, skew, etc. For big data you would not use an array perhaps, but for testing it is easier.

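For a large enough list of homogeneous data in random order, this pseudocode can work:
def estimate_median(data):
    minDataPoint = maxDataPoint = midDataPoint = None
    for dataPoint in data:
        if midDataPoint is None:
            # the first value initializes all three trackers
            minDataPoint = maxDataPoint = midDataPoint = dataPoint
            continue
        # find min on the fly
        if minDataPoint > dataPoint:
            minDataPoint = dataPoint
        # find max on the fly
        if maxDataPoint < dataPoint:
            maxDataPoint = dataPoint
        # estimate the median based on the current data
        estimate_mid = (maxDataPoint + minDataPoint) / 2
        # if the **new** dataPoint is closer to the mid, store it
        if abs(midDataPoint - estimate_mid) > abs(dataPoint - estimate_mid):
            midDataPoint = dataPoint
    return midDataPoint
Inspired by @lakshmanaraj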
