Algorithm for making two histograms proportional, minimizing units removed - performance

Imagine you have two histograms with an equal number of bins. N observations are distributed among the bins. Each bin now has between 0 and N observations.
What algorithm would be appropriate for determining the minimum number of observations to remove from both histograms in order to make them proportional? They do not need to be equal in absolute number, only proportional to each other. That is, there must be a common factor by which all the bins in one histogram can be multiplied in order to make it equal to the other histogram.
For example, imagine the following two histograms, where the item i in each histogram refers to the number of observations in bin i for the respective histogram.
Histogram 1: 4, 7, 4, 9
Histogram 2: 2, 0, 2, 1
For these histograms, the solution would be to remove from histogram 1 all 7 observations in bin 2 and another 7 observations from bin 4, such that (histogram 1)*2 = histogram 2.
But what general algorithm could be used to find the subsets of the two histograms that maximized the number of total observations between them while making them proportional? You can drop observations from both histograms or just one.
Thanks!

Seems to me that the problem is equivalent (if you consider each histogram as a N-dimensional vector), to minimizing the Manhattan length |R|, where R=xA-B, A and B are your 'vectors' and x is your proportional scale.
|R| has a single minimum (not necessarily an integer) so you can find it fairly rapidly using a simple bisection algorithm (or something akin to Newton's method).
Then, assuming you want a solution where the proportion is an integer, test the two cases ceil(x), and floor(x), to find which has the smallest Manhattan length (and that is the number of observations you need to remove).
Proof that the problem is not NP-hard:
Consider an inefficient 'solution' whereby you removed all N observations from all the bins. Now both A and B are equal to the 'zero' histogram 0 = (0,0,0,...). The two histograms are equal and thus proportional as 0 = s * 0 for all proportional values s, so a hard maximum for the number of observations to remove is N.
Now assume a more efficient solution exists with assitions/removals < N and a proportional scale s > 2*N (i.e after removal of some observations A = N * B or B=N * A ). If both A = 0 and B = 0, we have the previous solution with N removals (which contradicts the assumption that there are less than N removals). If A = 0 and B ≠ 0 then there is no s <> 0 such that 0 = s * B and no s such that s * 0 = B (with a similar argument for B = 0 and S ≠ 0). So it must be the case that both A ≠ 0 and B ≠ 0. Assume for a moment that A is the histogram to be scaled (so A * s = B), A must have at least one non-zero entry A[i] with minimum value 1 (after removal of extra observations), so when scaled it will have minimum value ≥. Therefore the equivalent entry B[i] must also have at least 2*N observations. But the total number of observations was initially N, so we have needed to add at least N observations to B[i], which contradicts the assumption that the improved solution had less than N additions/removals. So no 'efficient' solution requires a proportional scale greater than N.
So to find an efficient solution requires, at worst, testing the 'best fit' solution for scaling factors in the range 0-N.
The 'best fit' solution for scaling factor s in A = s * B, where A and B have M bins each requires
Sum(i=1 to M) of { Abs(A[i]- s * B[i]) mod s + Abs(A[i]- s * B[i]) div s } additions/removals.
This is an order M operation, so to test for each scaling factor in the range 0-N will be an algorithm of order O(M*N)
I am fairly certain (but haven't got a formal proof), that the scale factor cannot exceed the number of observations in the most filled bin. In practice it is typically very much smaller. For two histograms with two hundred bins and randomly chosen 30-300 observations per bin: if there were Na > Nb total observations in all the bins of A and B respectively the scaling factor was either almost always found in the range Na/Nb-4 < s < Na/Nb + 4, (or s = 0 if Na >> Nb).

Related

Maximize minimum distance between arrays

Lets say that you are given n sorted arrays of numbers and you need to pick one number from each array such that the minimum distance between the n chosen elements is maximized.
Example:
arrays:
[0, 500]
[100, 350]
[200]
2<=n<=10 and every array could have ~10^3-10^4 elements.
In this example the optimal solution to maximize minimum distance is pick numbers: 500, 350, 200 or 0, 200, 350 where min distance is 150 and is the maximum possible of every combination.
I am looking for an algorithm to solve this. I know that I could binary search the max min distance but I can't see how to decide is there is a solution with max min distance of at least d, in order for the binary search to work. I am thinking maybe dynamic programming could help but haven't managed to find a solution with dp.
Of course generating all combination with n elements is not efficient. I have already tried backtracking but it is slow since it tries every combination.
n ≤ 10 suggests that we can take an exponential dependence on n. Here's
an O(2n m n)-time algorithm where m is the total size of the
arrays.
The dynamic programming approach I have in mind is, for each subset of
arrays, calculate all of the pairs (maximum number, minimum distance) on
the efficient frontier, where we have to choose one number from each of
the arrays in the subset. By efficient frontier I mean that if we have
two pairs (a, b) ≠ (c, d) with a ≤ c and b ≥ d, then (c, d) is not on
the efficient frontier. We'll want to keep these frontiers sorted for
fast merges.
The base case with the empty subset is easy: there's one pair, (minimum
distance = ∞, maximum number = −∞).
For every nonempty subset of arrays in some order that extends the
inclusion order, we compute a frontier for each array in the subset,
representing the subset of solutions where that array contributes the
maximum number. Then we merge these frontiers. (Naively this costs us
another factor of log n, which maybe isn't worth the hassle to avoid
given that n ≤ 10, but we can avoid it by merging the arrays once at the
beginning to enable future merges to use bucketing.)
To construct a new frontier from a subset of arrays and another array
also involves a merge. We initialize an iterator at the start of the
frontier (i.e., least maximum number) and an iterator at the start of
the array (i.e., least number). While neither iterator is past the end,
Emit a candidate pair (min(minimum distance, array number − maximum
number), array number).
If the min was less than or equal to minimum distance, increment the
frontier iterator. If the min was less than or equal to array number
− maximum number, increment the array iterator.
Cull the candidate pairs to leave only the efficient frontier. There is
an elegant way to do this in code that is more trouble to explain.
I am going to give an algorithm that for a given distance d, will output whether it is possible to make a selection where the distance between any pair of chosen numbers is at least d. Then, you can binary-search the maximum d for which the algorithm outputs "YES", in order to find the answer to your problem.
Assume the minimum distance d be given. Here is the algorithm:
for every permutation p of size n do:
last := -infinity
ok := true
for p_i in p do:
x := take the smallest element greater than or equal to last+d in the p_i^th array (can be done efficiently with binary search).
if no such x was found; then
ok = false
break
end
last = x
done
if ok; then
return "YES"
end
done
return "NO"
So, we brute-force the order of arrays. Then, for every possible order, we use a greedy method to choose elements from each array, following the order. For example, take the example you gave:
arrays:
[0, 500]
[100, 350]
[200]
and assume d = 150. For the permutation 1 3 2, we first take 0 from the 1st array, then we find the smallest element in the 3rd array that is greater than or equal to 0+150 (it is 200), then we find the smallest element in the 2nd array which is greater than or equal to 200+150 (it is 350). Since we could find an element from every array, the algorithm outputs "YES". But for d = 200 for instance, the algorithm would output "NO" because none of the possible orderings would result in a successful selection.
The complexity for the above algorithm is O(n! * n * log(m)) where m is the maximum number of elements in an array. I believe it would be sufficient, since n is very small. (For m = 10^4, 10! * 10 * 13 ~ 5*10^8. It can be computed under a second on a modern CPU.)
Lets look at an example with optimal choices, x (horizontal arrays A, B, C, D):
A x
B b x b
C x c
D d x
Our recurrence based on range could be: let f(low, excluded) represent the maximum closest distance between two chosen elements (from arrays 1 to n) of the subset without elements in excluded, where low is the lowest chosen element. Then:
(1)
f(low, excluded) when |excluded| = n-1:
max(low)
for low in the only permitted array
(2)
f(low, excluded):
max(
min(
a - low,
f(a, excluded')
)
)
for a ≥ low, a not in excluded'
where excluded' = excluded ∪ {low's array}
We can limit a. For one thing the maximum we can achieve is
(3)
m = (highest - low) / (n - |excluded| - 1)
which means a need not go higher than low + m.
Secondly, we can store results for all f(a, excluded'), keyed by excluded' (we have 2^10 possible keys), each in a decorated binary tree ordered by a. The decoration will be the highest result achievable in the right subtree, meaning we can find the max for all f(v, excluded'), v ≥ a in logarithmic time.
The latter establishes a dominance relationship and clearly we are intetested in both a larger a and a larger f(a, excluded') so as to maximise the min function in (2). Picking an a in the middle, we can use a binary search. If we have:
a - low < max(v, excluded'), v ≥ a
where max(v, excluded') is the lookup
for a in the decorated tree
then we look to the right since max(v, excluded) indicates there's a better answer on the right, where a - low is also larger.
And if we have:
a - low ≥ max(v, excluded), v ≥ a
then we record this candidate and look to the left since to the right, the answer is fixed at max(v, excluded), given that a - low could not decrease.
In order to conduct the binary search on the range, [low, low + m] (see (3)), rather than merge and label all the arrays at the outset, we can keep them separate and compare the closest candidates to mid out of each array we are currently permitted to choose a from. (The trees have the mixed results, keyed by subset.) (The flow of this part is not completely clear to me.)
Worst case with this method, given that n = C is constant seems to be
O(C * array_length * 2^C * C * log(array_length) * log(C * array_length))
C * array_length is the iteration on low
Each low can be paired with 2^C inclusions
C * log(array_length) is the separated binary-search
And log(C * array_length) is the tree lookup
Simplifying:
= O(array_length * log^2(array_length))
although in practice, there could be many dead-end branches that exit early where a full selection wouldn't be possible.
In case, it wasn't clear, the iteration is on a fixed lowest element in the selection. In other words, we want the best f(low, excluded) for all different lows (and excludeds). For bottom-up, we would iterate from the highest value down so our results for a get stored as we iterate.

Find the m by m square that contains the most "conflicting pairs"?

There are two types of units on a 2d plane, green units (G) and red units (R).
The plane is represented as an n by n matrix, each unit is represented as an element in the matrix.
A pair of two units is called a "conflicting pair" if the two are of different colours. The goal is to find the m by m submatrix that contains the most "conflicting pairs".
Example
[R R 0 0 0
R R 0 0 0
0 0 R R 0
0 0 0 G G
0 0 0 G G]
In the above 5 by 5 matrix, the "most conflicting" 3 by 3 submatrix is at the lower right corner, where there are two red units and four green units, which amounts to 8 conflicting pairs within the submatrix.
A naive solution will take O(m^2n^2) for iterating every element in every possible submatrix.
I also thought of using dynamic programming like the Summed-area table algorithm, the time complexity will then be O(n^2), which looks good since it's already O(n^2) for scanning each element once.
However the n by n matrix may be large and sparse and given in a sparse format (like CSR), in that case an O(n^2) algorithm may not be efficient. Any suggeststions on how do I do better for sparse matrices (and dense matrices)?
If you have k non-empty cells (with R or G) then you can solve with time complexity O(k^2) (squeeze the matrix) because optimal submatrix has one non-empty cell on the border of the matrix.
Or time complexity maybe O(k * (log n)^2) if use two dimension sparse segments tree for getting sum on a rectangle.
The answer is given by
idx = argmax SUM(X_r,m) * SUM(X_g,m)
where SUM(X,m) returns a matrix with the summation of units in each m x m window, X_r and X_g are the matrices with only red and green units enabled respectively, and idx is the m x m window with the largest number of conflicting nodes.
The question then becomes can SUM(X,m) be more efficiently calculated for sparse matrices. I think the answer is: it really depends on the structure of X and the value of m.
An obvious way to make use of the sparsity of X is to compute SUM(X,m) by using the identity
SUM(X,m) = transpose(SUM1d( transpose(SUM1d(X,m) ), m )) (1)
where SUM1d(X,m) is the results of summing intervals of length m along rows of X. Clearly, SUM1d can be implemented in O(n) time for each row, and O(n^2) for the entire matrix, in a similar fashion to the Sum-Area-Table algorithm. This yields the same complexity O(n^2) for the entire algorithm. But that is rather uninteresting as it's the same runtime as a Sum-Area-Table algorithm.
What is interesting is asking whether SUM1d(X,m) can be implemented to take advantage of any sparsity of X. It's clear that SUM1d can be implemented to take full advantage of the sparsity of the input matrix; however, depending on the structure of X and the size of m the output matrix may not be sparse.
Assuming, m is much less than n then implementing SUM1d(X,m) as described in eq (1) above can be done in O(nz_row) time where nz_row is the max number of non-zero elements on any of the rows of X. Furthermore, SUM1d(X,m) will produce a sparse matrix, albeit with O(m) less sparsity. Since we assume m is much less than n this is still a sparse matrix and will still translate to efficiency gains.
Therefore, we should expect O(n*nz_row) for the first call to SUM1d in eq (1) and O(n*m*nz_col) for the second call to SUM1d.

Generate random intervals with given density and overlapping

I am in search of algorithm (possibly, approximate) that will generate test data.
We have large integer interval [0, n), n may be up to 10^9. We want to generate a number of smaller intervals (possibly, overlapping) of length k each, all of which fit inside this large interval and also satisfy following properties:
Number of "cells" covered by these intervals divided by n must be equal to density (<=1.0)
Every cell covered by at least one interval is actually covered by overlapping (>=1.0) intervals on average. E.g. degenerate case of overlap_factor=1.0 means that no two intervals intersect.
Interval positions should be distributed uniformly randomly in all other respects
Achieving both (1) and (2) is what makes this problem difficult. The algorithm should produce an array of interval positions.
Image below demonstrates one of the solutions for n=20, k=4, density=0.5, overlapping=1.6:
◼◼◼◼
◼◼◼◼
◼◼◼◼ ◼◼◼◼
↓↓↓
◻◻◻◼◼◼◼◼◼◻◻◻◼◼◼◼◻◻◻◻
0 19
density = 10/20 = 0.5
overlapping = 4*4/10 = 1.6
Real-world applications will opearte with larger values: n ≤ 10^9, k ∈ [1 .. 10^6], density ∈ [0.01 .. 1.0], overlapping ∈ [1.0..5.0].
Because this algorithm is intended to generate test data, approximate solution would be fine.

Naive way to find largest block in a rectangle of 1's and 0's

I'm trying to come up with the bruteforce(naive) solution to find the largest block of 1 or 0 in a rectangle of 1 and 0. I know optimal ways which can do it in O(n) time where n is the total size of rectangle.
1 1 0 1 0 1
1 0 0 0 1 1
1 0 0 0 1 1
1 1 0 1 1 0
In the above rectangle, it is rectangle starting at (Row 2, Col 2) of size 6. I was thinking this..
Go through each element and then find the size it makes by iterating
in all directions from it.
Is it bruteforce? What will be the complexity? I'm going through all elements that is n, but then I'm iterating in all directions, how much will that be?
I'm aware that this question has been asked 100 times, but they talk about optimal solutions. What I'm looking for is a bruteforce solution and its complexity?
The algorithm you described looks somehow like this C code:
//for each entry
for(int x = 0; x < width; ++x)
for(int y = 0; y < height; ++y)
{
char lookFor = entries[x][y];
int area = 0;
for(int row = y; row < height; ++row)
{
if(entries[x, row] != lookFor)
break;
for(int col = x; col < width; ++col)
{
if(entries[col, row] != lookFor)
break;
int currentArea = (col - x + 1) * (row - y + 1);
if(currentArea > area)
{
//save the current rect
}
}
}
}
There are four nested loops. The outer loops will iterate exactly n times (with n being the number of entries). The inner loops will iterate width * f1 and height * f2 times in average (with f1 and f2 being some constant fraction). The rest of the algorithm takes constant time and does not depend on the problem size.
Therefore, the complexity is O(n * f1 * width * f2 * height) = O(n^2), which essentially means "go to each entry and from there, visit each entry again", regardless of whether all entries really need to be checked or just a constant fraction that increases with the problem size.
Edit
The above explanations assume that the entries are not distributed randomly and that for larger fields it is more likely to find larger homogeneous subregions. If this is not the case and the average iteration count for the inner loops does not depend on the field size at all (e.g. for randomly distributed entries), then the resulting time complexity is O(n)
Brute force is generally split into two (sometimes sequential) parts. The first part is generating all possible candidates for solutions to the problem. The second part is testing them to see if they actually are solutions.
Brute force: Assume your rectangle is m x n. Generate all sub-rectangles of size a x b where a is in {1..m} and b is in {1..n}. Set a maximum variable to 0. Test all sub-rectangles to see if it is all 0s and 1s. If it is, let maximum = max(maximum, size(sub-rectangle). Alternatively simply start by testing the larger sub-rectangles and move towards testing smaller sub-rectangles. As soon as you find a sub-rectangle with all 0s or 1s, stop. The time complexity will be the same because in the worst-case for both methods, you will still have to iterate through all sub-rectangles.
Time complexity:
Let's count the number of sub-rectangles generated at each step.
There are m*n subrectangles of size 1 x 1.
There are (m-1)*n subrectangles of size 2 x 1.
There are m*(n-1) subrectangles of size 1 x 2.
There are (m-1)*(n-1) subrectangles of size 2 x 2.
... < and so forth >
There are (m-(m-1))*(n-(n-1)) subrectangles of size m x n.
Thus the formula for counting the number of subrectangles of size a x b from a larger rectangle of size m x n is simply:
number_of_subrectangles_of_size_a_b = (m-a) * (m-b)
If we imagine these numbers laid out in an arithmetic series we get
1*1 + 1*2 + ... + 1*n + 2*1 + ... + m*n
This can be factored to:
(1 + 2 + ... + m) * (1 + 2 + ... + n).
These two arithmetic series converge to the order of O(m2) and O(n2) respectively. Thus generating all sub-rectangles of an m*n rectangle is O(m2n2). Now we look at the testing phase.
After generating all sub-rectangles, testing if each sub-rectangle of size a x b is all 0s or all 1s is O(a * b). Unlike the previous step of generating sub-rectangles of size a x b which scales upwards as a x b decreases, this step increases proportionally with the size of a x b.
e.g.: There is 1 sub-rectangle of size m x n. But testing to see if that rectangle is all 0s or all 1s takes O(m*n) time. Conversely there are m*n sub-rectangles of size 1. But testing to see if those rectangles are all 0s or all 1s takes only O(1) time per rectangle.
What you finally end up for the time complexity is a series like this:
O( (m-(m-1))(n-(n-1))*(mn) + (m-(m-1))(n-(n-2))*m(n-1)... + mn*(m-(m-1))(n-(n-1)) )
Note 2 things here.
The largest term in the series is going to be somewhere close to (m
/ 2)(n / 2)(m / 2) (n / 2) which is O(m2n2)
There are m * n terms total in the series.
Thus the testing phase of the brute-force solution will be O(mn * m2n2) = O(m3n3).
Total time complexity is:
O(generating) + O(testing)
= O(m2n2 + m3n3)
= O(m3n3)
If the area of the given rectangle is say, N, we will have O(N3) time complexity.
Look into "connected components" algorithms for additional ideas. What you've presented as a two-dimensional array of binary values looks just like a binary black & white image. An important exception is that in image processing we typically allow a connected component (a blob of 0s or 1s) to have non-rectangular shapes. Some tweaks to the existing multi-pass and single-pass algorithms should be easy to implement.
http://en.wikipedia.org/wiki/Connected-component_labeling
Although it's a more general solution than you need, you could also run a connected components algorithm to find all connected regions (0s or 1s, background or foreground) and then filter the resulting components (a.k.a. blobs). I'll also mention that for foreground components it's preferable to select for "4-connectivity" rather than "8-connectivity," where the former means connectivity is allowed only at pixels above, below, left, and right of a center pixel, and the latter means connectivity is allowed for any of the eight pixels surrounding a center pixel.
A bit farther afield, for very large 2D arrays it may (just may) help to first reduce the search space by creating what we'd call an "image pyramid," meaning a stack of arrays of progressively smaller size: 1/2 each dimension (filled, as needed), 1/4, 1/8, and so on. A rectangle detectable in a reduced resolution image is a good candidate for being the largest rectangle in a very large image (or 2D array of bits). Although that may not be the best solution for whatever cases you're considering, it's scalable. Some effort would be required, naturally, to determine the cost of scaling the array/image versus the cost of maintaining relatively larger lists of candidate rectangles in the original, large image.
Run-length encoding may help you, too, especially since you're dealing with rectangles instead of connected components of arbitrary shape. Run-length encoding would express each row as stretches or "run lengths" of 0s or 1s. This technique was used to speed up connected component algorithms a decade or two ago, and it's still a reasonable approach.
Anyway, that's not the most direct answer to your question, but I hope it helps somewhat.

Bijection on the integers below x

i'm working on image processing, and i'm writing a parallel algorithm that iterates over all the pixels in an image, and changes the surrounding pixels based on it's value. In this algorithm, minor non-deterministic is acceptable, but i'd rather minimize it by only querying distant pixels simultaneously. Could someone give me an algorithm that bijectively maps the integers below n to the integers below n, in a fast and simple manner, such that two integers that are close to each other before mapping are likely to be far apart after application.
For simplicity let's say n is a power of two. Could you simply reverse the order of the least significant log2(n) bits of the number?
Considering the pixels to be a one dimentional array you could use a hash function j = i*p % n where n is the zero based index of the last pixel and p is a prime number chosen to place the pixel far enough away at each step. % is the remainder operator in C, mathematically I'd write j(i) = i p (mod n).
So if you want to jump at least 10 rows at each iteration, choose p > 10 * w where w is the screen width. You'll want to have a lookup table for p as a function of n and w of course.
Note that j hits every pixel as i goes from 0 to n.
CORRECTION: Use (mod (n + 1)), not (mod n). The last index is n, which cannot be reached using mod n since n (mod n) == 0.
Apart from reverting the bit order, you can use modulo. Say N is a prime number (like 521), so for all x = 0..520 you define a function:
f(x) = x * fac mod N
which is bijection on 0..520. fac is arbitrary number different from 0 and 1. For example for N = 521 and fac = 122 you get the following mapping:
which as you can see is quite uniform and not many numbers are near the diagonal - there are some, but it is a small proportion.

Resources