Different Definitions of Balance Factor - data-structures

I keep seeing the balance factor of a binary tree defined differently.
My textbook and the Wikipedia AVL Tree page define it as:
BalanceFactor(X) = Height(RightSubtree(X)) − Height(LeftSubtree(X))
Yet other places define it in exactly the opposite fashion:
BalanceFactor(X) = Height(LeftSubtree(X)) − Height(RightSubtree(X))
What am I missing?

It actually doesn't matter how you define the balance factor as long as you do so consistently within the same tree. The algorithms for repairing an AVL tree in response to different balance factors are symmetric: the way you fix an imbalance of -2 is the mirror image of the way you fix an imbalance of +2, so there's no real difference between imbalances in the positive versus negative direction. If you were to multiply all the imbalances in the tree by -1, you wouldn't notice anything.
So in a sense, either definition would be fine. You just need to make sure that you're consistent about it so that you don't try doing rotations that expect one set of balance factors when you're using the other.

Actually, both of these balance factor definitions are correct, but you have to stick to the same formula throughout a single tree. You can't use one convention for one node and the other convention for another node: the same tree must use the same balance factor formula everywhere. You also have to take care that no factor leaves the range [-1, +1]; a factor of +2 or -2 means a rotation is needed to restore the invariant.
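For instance, here is a minimal sketch (the Node class and function names are mine, not from any particular textbook) showing that the two conventions only differ by a sign:

class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def height(node):
    return -1 if node is None else 1 + max(height(node.left), height(node.right))

def bf_right_minus_left(node):   # the Wikipedia / textbook convention
    return height(node.right) - height(node.left)

def bf_left_minus_right(node):   # the opposite convention
    return height(node.left) - height(node.right)

# a node whose left subtree is one level deeper than its right subtree
t = Node(left=Node(left=Node()), right=Node())
print(bf_right_minus_left(t), bf_left_minus_right(t))   # -1 1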

Related

Recurrence: T(n/4) + T(n/2) + n^2 with tree methods

I'm trying to solve this exercise with the tree method, but I have doubts about two parts:
1) In the T(?) column, is it correct to use (n^2/2^i) instead of (n/2^i)? I'm asking because this is the part that causes my error.
2) Is the last multiplication correct (the one between the number of nodes and the time per node)? After finding the value of i, I have to build a series that runs from 0 up to the result of that multiplication, right? And should the variable of the series be 2^i (the number of nodes)?
The column for the number of nodes is misleading.
Each node has a cost of (m/k)^2, where k is whatever the denominator of that node is. With the structure you are using, the nodes in each level will have a variety of denominators. For example, your level 2 should contain the nodes [(m/16), (m/8)], [(m/8), (m/4)].
The cost for a level is the sum of the cost of each node in that level. Since each node has a different cost, you cannot multiply the number of nodes by a value to find the cost of a level, you have to add them up individually.
The total cost is the sum of the cost of each level. The result of this calculation may result in a logarithm, or it may not. It depends on the cost of each level and the number of levels.
Hint: Pascal's Triangle
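To see this concretely, here is a small numeric sketch (names are mine) that expands the recursion tree for T(m) = T(m/4) + T(m/2) + m^2 (the answer's m is the question's n) and sums each level's node costs individually, as described above:

def level_costs(m, depth):
    level = [m]                                   # the arguments appearing at the current level
    for i in range(depth + 1):
        print(i, sum(x * x for x in level))       # cost of level i: add the nodes one by one
        level = [x for arg in level for x in (arg / 4, arg / 2)]

level_costs(16.0, 4)
# each level's cost is 5/16 of the previous one ((1/4)^2 + (1/2)^2 = 5/16),
# so the level costs form a decreasing geometric series dominated by the root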

Missing number in binary search tree

Suppose I have an order statistic balanced binary tree that has n different integers as its keys, and I want to write a function find(x) that returns the minimal integer that is not in the tree and is greater than x, in O(log(n)) time.
For example, if the keys in the tree are 6,7,8,10,11,13,14 then find(6)=9, find(8)=9, find(10)=12, find(13)=15.
My idea is to find the max m in O(log(n)) and the index of x (call it i_x) in O(log(n)); then, if i_x = n - (m - x), I can simply return max+1.
By index I mean, for the keys 6,7,8,10,11,13,14, that the index of 6 is 0 and the index of 10 is 3, for example...
But I'm having trouble with the other cases...
According to Wikipedia, an order statistic tree supports these two operations in O(log(n)) time:
Select(i) — find the i'th smallest element stored in the tree in O(log(n))
Rank(x) – find the rank of element x in the tree, i.e. its index in the sorted list of elements of the tree in O(log(n))
Start by getting the rank of x, then select successively higher ranks until you find a place to insert your missing element. But this has an O(n*log(n)) worst case.
So instead, once you have the rank of x, you do a kind of binary search. The basic idea is to test whether there is a gap between two numbers x and y that are both in the tree: there is a gap if and only if rank(y) - rank(x) != y - x.
The general case is: when searching for the number in the interval [lo, hi] (lo and hi are ranks in the tree, mid is the middle rank), if there is a gap between lo and mid then search inside [lo, mid], else search inside [mid, hi].
You will end up finding the number you seek.
However, this solution does not run in log(n) time, but in log^2(n). This is the best I can think of for a general solution.
EDIT:
Well, it's a tough question, I changed my mind several times. Here is what I came up with:
I assume that the left child holds a smaller value and the right child holds a larger value.
Intuition for find(x): start at the root and go down the tree almost as in a standard binary search. If the branch we would normally descend into cannot contain the solution of find(x), then prune it.
We'll go through the basic cases first:
If the node I reach is null, then I am done, and I return the value I was looking for.
If the current value is less than the one I am looking for, I search for x in the right subtree.
If I find the node containing x, then I search for x+1 in the right subtree.
The case where x is in the left subtree is trickier, because the left subtree may contain x, x+1, x+2, x+3, and so on up to y-1, where y is the value stored in the current node. In that case, we want to search for y+1 in the right subtree.
However, if the numbers from x to y-1 are not all in the left subtree (that is, there is a gap), then the answer lies in the left subtree, so we look there for x.
The question is: how do we find out whether the whole sequence from x to y-1 is present in the left subtree?
The algorithm in python looks like this:
def find(node, x):
    if node is None:
        return x
    if node.data < x:
        return find(node.right, x)
    if node.data == x:
        return find(node.right, x + 1)
    if is_full(node.left, x, node.data):  # left subtree holds every integer in [x, node.data - 1]
        return find(node.right, node.data + 1)
    return find(node.left, x)
To get the smallest value strictly greater than x that is not in the tree, the first call is find(root, x+1). If you want the smallest value greater than or equal to x that is not in the tree, the first call is find(root, x).
The is_full method checks whether the left subtree contains every number from x to node.data-1.
Now, using this as a starting point, I believe you can find a suitable solution by yourself, using the fact that the number of nodes contained in each subtree is stored at the subtree's root.
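The answer leaves is_full as an exercise; one possible sketch using the stored subtree sizes (the signature and the count_ge helper are my own choices, and each call costs O(h), so find becomes O(h^2) overall, in line with the log^2(n) remark above):

def count_ge(node, x):
    # number of keys >= x in a size-augmented BST (each node stores .size)
    if node is None:
        return 0
    if node.data < x:
        return count_ge(node.right, x)
    right_size = node.right.size if node.right else 0
    return 1 + right_size + count_ge(node.left, x)   # this node and its whole right subtree qualify

def is_full(left_subtree, x, y):
    # True iff left_subtree contains every integer in [x, y-1]; by the BST
    # property all of its keys are < y, so counting the keys >= x is enough
    return count_ge(left_subtree, x) == y - x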
I faced a similar question.
There were no restrictions about finding greater than some x, simply find the missing element in the BST.
Below is my answer. It is perfectly possible to do this in O(lg(n)) time, under the assumption that the tree is almost balanced. (You might want to consider the proof that the expected height of a randomly built BST on n elements is O(lg(n)).) I use the simpler notation O(h), where h is the height of the tree, so that the two concerns stay separate.
Assumptions and/or requirements:
I augment the data structure: each node stores its count, (size of left subtree + size of right subtree + 1).
Obviously, the count of a single node is 1.
This count is pre-computed and stored at each node.
Kindly pardon my multiple notations for 'not equal to' (=/= and !=).
Also note that the code could be structured a little better if one were writing working code for a machine.
Moreover, I think, at this point in time, that this is correct. I tried as many corner cases as I could think of, and in general it works. Even if there is a counterexample, I don't think it would be that difficult to modify the code to fit that particular case; but please comment with the counterexample, I am interested.
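The original code is not reproduced in the post above; purely as an illustration, here is a minimal sketch (mine, not the answerer's) of how the stored counts can drive an O(h) search, under the extra assumption that the keys are meant to be exactly the integers lo..hi with a single one missing:

def find_missing(node, lo, hi):
    # invariant: exactly one integer in [lo, hi] is absent from this subtree
    if node is None:
        return lo                                 # only reached when lo == hi
    left_size = node.left.size if node.left else 0
    expected_left = node.data - lo                # how many keys lo..node.data-1 there would be
    if left_size < expected_left:
        return find_missing(node.left, lo, node.data - 1)
    return find_missing(node.right, node.data + 1, hi)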

How to efficiently check whether it's height balanced for a massively skewed binary search tree?

I was reading this answer on how to check whether a BST is height-balanced, and was really hooked by the bonus question:
Suppose the tree is massively unbalanced. Like, a million nodes deep on one side and three deep on the other. Is there a scenario in which this algorithm blows the stack? Can you fix the implementation so that it never blows the stack, even when given a massively unbalanced tree?
What would be a good strategy here?
I am thinking of doing a level order traversal and tracking the depth: if a leaf is found and the current node's depth is bigger than the leaf node's depth + 2, then it's not balanced. But how do I combine this with height checking?
Edit: below is the implementation in the linked answer
IsHeightBalanced(tree)
    return (tree is empty) or
           (IsHeightBalanced(tree.left) and
            IsHeightBalanced(tree.right) and
            abs(Height(tree.left) - Height(tree.right)) <= 1)
To review briefly: a tree is defined as being either null or a root node with pointers .left to a left child and .right to a right child, where each child is in turn a tree, the root node appears in neither child, and no node appears in both children. The depth of a node is the number of pointers that must be followed to reach it from the root node. The height of a tree is -1 if it's null or else the maximum depth of a node that appears in it. A leaf is a node whose children are null.
First let me note the two distinct definitions of "balanced" proposed by answerers of the linked question.
EL-balanced A tree is EL-balanced if and only if, for every node v, |height(v.left) - height(v.right)| <= 1.
This is the balance condition for AVL trees.
DF-balanced A tree is DF-balanced if and only if, for every pair of leaves v, w, we have |depth(v) - depth(w)| <= 1. As DF points out, DF-balance for a node implies DF-balance for all of its descendants.
DF-balance is used for no algorithm known to me, though the balance condition for binary heaps is very similar, requiring additionally that the deeper leaves be as far left as possible.
I'm going to outline three approaches to testing balance.
Size bounds for balanced trees
Expand the recursive function to have an extra parameter, maxDepth. For each recursive call, pass maxDepth - 1, so that maxDepth roughly tracks how much stack space is left. If maxDepth reaches 0, report the tree as unbalanced (e.g., by returning "infinity" for the height), since no balanced tree that fits in main memory could possibly be that tall.
This approach relies on an a priori size bound on main memory, which is available in practice if not in all theoretical models, and the fact that no subtrees are shared. (PROTIP: unless you're very careful, your subtrees will be shared at some point during development.) We also need height bounds on balanced trees of at most a given size.
EL-balanced Via mutual induction, we prove a lower bound, L(h), on the number of nodes belonging to an EL-balanced tree of a given height h.
The base cases are
L(-1) = 0
L(0) = 1,
more or less by definition. The inductive case is trickier. An EL-balanced tree of height h > 0 is a node with an EL-balanced child of height h - 1 and another EL-balanced child of height either h - 1 or h - 2. This means that
L(h) = 1 + L(h - 1) + min(L(h - 2), L(h - 1)).
Add 1 to both sides and rearrange.
L(h) + 1 = L(h - 1) + 1 + min(L(h - 2) + 1, L(h - 1) + 1).
A little while later (spoiler), we find that
L(h) + 1 >= phi^(h + 2)/sqrt(5),
where phi = (1 + sqrt(5))/2 ~ 1.618.
maxDepth then should be set to the floor of the base-phi logarithm of the maximum number of nodes, plus a small constant that depends on fenceposty things.
DF-balanced Rather than write out an induction proof, I'm going to appeal to your intuition that the worst case is a complete binary tree with one extra leaf on the bottom. Then the proper setting for maxDepth is the base-2 logarithm of the maximum number of nodes, plus a small constant.
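As a concrete sketch of this first approach (the function names and the exact fencepost constant are mine), the recursion simply carries the depth budget down and bails out with infinity when it runs out:

import math

def bounded_height(tree, max_depth):
    # height of `tree`, or math.inf if the EL-balance condition fails or the
    # depth budget is exhausted (which no balanced tree in memory could do)
    if tree is None:
        return -1
    if max_depth == 0:
        return math.inf
    hl = bounded_height(tree.left, max_depth - 1)
    hr = bounded_height(tree.right, max_depth - 1)
    if math.isinf(hl) or math.isinf(hr) or abs(hl - hr) > 1:
        return math.inf
    return 1 + max(hl, hr)

def is_el_balanced(tree, max_nodes):
    phi = (1 + 5 ** 0.5) / 2
    max_depth = int(math.log(max_nodes + 1, phi)) + 2   # plus a small fencepost constant
    return not math.isinf(bounded_height(tree, max_depth))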
Iterative deepening depth-first search
This is the theoretician's version of the answer above. Because, for some reason, we don't know how much RAM our computer has (and with logarithmic space usage, it's not as though we need a tight bound), we again include the maxDepth parameter, but this time, we use it to truncate the tree implicitly below the specified depth. If the height of the tree comes back below the bound, then we know that the algorithm ran successfully. Alternatively, if the truncated tree is unbalanced, then so is the whole tree. The problem case is when the truncated tree is balanced but with height equal to maxDepth. Then we increase maxDepth and retry.
The simplest retry strategy is to increase maxDepth by 1 every time. Since balanced trees with n nodes have height O(log n), the running time is O(n log n). In fact, for DF-balanced trees, the running time is also O(n), since, except for the last couple traversals, the size of the truncated tree increases by a factor of 2 each time, leading to a geometric series.
Another strategy, doubling maxDepth each time, gives an O(n) running time for EL-balanced trees, since the largest tree of height h, with 2^(h + 1) - 1 nodes, is much smaller than the smallest tree of height 2h, with approximately (phi^2)^h nodes. The downside of doubling is that we may use twice as much stack space. With increase-by-1, however, in the family of minimum-size EL-balanced trees we constructed implicitly in defining L(h), the number of nodes at depth h - k in the tree of height h is polynomial of degree k. Accordingly, the last few scans will incur some superlinear term.
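A sketch of the iterative-deepening variant (again, names are mine; the check distinguishes "certainly unbalanced" from "ran out of budget" so the driver knows when to retry):

import math

def check(tree, budget):
    # returns (height of the tree truncated below `budget`, truncated?); a height of
    # math.inf certifies that the truncated (hence also the full) tree is unbalanced
    if tree is None:
        return -1, False
    if budget == 0:
        return -1, True                 # pretend the subtree ends here, remember we truncated
    hl, tl = check(tree.left, budget - 1)
    hr, tr = check(tree.right, budget - 1)
    if math.isinf(hl) or math.isinf(hr) or abs(hl - hr) > 1:
        return math.inf, tl or tr
    return 1 + max(hl, hr), tl or tr

def is_el_balanced_id(tree):
    budget = 1
    while True:
        h, truncated = check(tree, budget)
        if math.isinf(h):
            return False                # imbalance already visible in the truncated tree
        if not truncated:
            return True                 # whole tree examined within the budget
        budget *= 2                     # the doubling strategy discussed above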
Temporarily mutating pointers
If there are parent pointers, then it's easy to traverse depth-first in place, because the parent pointers can be used to derive the relevant information on the stack in an efficient manner. If we don't have parent pointers but can mutate the tree temporarily, then, for descent into a child, we can cannibalize the pointer to that child to store temporarily the node's parent. The problem is determining on the way up whether we came from a left or a right child. If we can sneak a bit (say because pointers are 2-byte aligned, or because there's a spare bit in the balance factor, or because we're copying the tree for stop-and-copy garbage collection and can determine which arena we're in), then that's one way. Another test assumes that the tree is a binary search tree. It turns out that we don't need additional assumptions, however: Explain Morris inorder tree traversal without using stacks or recursion .
The one fly in the ointment is that this approach only works, as far as I know, on DF-balance, since there's no space on the stack to put the partial results for EL-balance.
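For reference, the Morris traversal mentioned above looks roughly like this (a generic inorder walk, not the balance check itself; adapting it to track leaf depths for the DF-balance test is the part the answer alludes to):

def morris_inorder(root):
    # O(1) extra space: temporarily threads each inorder predecessor's right
    # pointer back to the current node, then removes the thread on the way out
    node = root
    while node is not None:
        if node.left is None:
            yield node.data
            node = node.right
        else:
            pred = node.left
            while pred.right is not None and pred.right is not node:
                pred = pred.right
            if pred.right is None:
                pred.right = node        # create the temporary thread
                node = node.left
            else:
                pred.right = None        # remove the thread, restoring the tree
                yield node.data
                node = node.right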

Binary search for non-uniform distribution

Binary search is highly efficient for uniform distributions: each member of your list has an equal 'hit' probability, which is why you try the center each time.
Is there an efficient algorithm for non-uniform distributions? For example, a distribution following 1/x.
There's a deep connection between binary search and binary trees: a binary tree is basically a "precalculated" binary search, where the cutting points are decided by the structure of the tree rather than chosen as the search runs. And as it turns out, dealing with probability "weights" for each key is sometimes done with binary trees.
One approach is a fairly normal binary search tree, but one that is built in advance using knowledge of the query probabilities.
Niklaus Wirth covered this in his book "Algorithms and Data Structures", in a few variants (one for Pascal, one for Modula 2, one for Oberon), at least one of which is available for download from his web site.
Binary trees aren't always binary search trees, though, and one use of a binary tree is to derive a Huffman compression code.
Either way, the binary tree is constructed by starting with the leaves separate and, at each step, joining the two least likely subtrees into a larger subtree until there's only one subtree left. To efficiently pick the two least likely subtrees at each step, a priority queue data structure is used - perhaps a binary heap.
A binary tree that's built once then never modified can have a number of uses, but one that can be efficiently updated is even more useful. There are some weight-balanced binary tree data structures out there, but I'm not familiar with them. Beware - the term "weight balanced" is commonly used where each node always has weight 1, but subtree weights are approximately balanced. Some of these may be adaptable for varied node weights, but I don't know for certain.
Anyway, for a binary search in an array, the problem is that an arbitrary probability distribution can be used, but only inefficiently. For example, you could keep a running-total-of-weights array. For each iteration of your main binary search, you want the point that lies halfway through the remaining probability mass, so you compute that target total and then search the running-total-of-weights array for it. You get the perfectly weight-balanced next choice for your main binary search, but you had to do a complete binary search into your running-total array to get it.
The principle still works without that inner search, however, if the probability distribution is known in closed form. The idea is the same: you need the integral of your probability distribution (replacing the running-total array), and when you need a mid-point, you choose the point that splits that integral exactly in half. That's more an algebra issue than a programming issue.
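A minimal sketch of the running-total version described above (array and weight names are mine); the inner bisect on the cumulative weights is exactly the extra binary search being complained about:

import bisect
from itertools import accumulate

def weighted_binary_search(keys, weights, x):
    # keys is sorted; weights[i] is the assumed query probability mass of keys[i]
    totals = list(accumulate(weights))            # running total of weights (precompute in practice)
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        left_total = totals[lo - 1] if lo > 0 else 0.0
        half = (left_total + totals[hi]) / 2.0    # halfway through the remaining probability
        mid = bisect.bisect_left(totals, half, lo, hi + 1)
        if keys[mid] == x:
            return mid
        if keys[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1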
One problem with a weighted binary search like this is that the worst-case performance is worse - usually by constant factors but, if the distribution is skewed enough, you may end up with effectively a linear search. If your assumed distribution is correct, the average-case performance is improved despite the occasional slow search, but if your assumed distribution is wrong you could pay for that when many searches are for items that are meant to be unlikely according to that distribution. In the binary tree form, the "unlikely" nodes are further from the root than they would be in a simply balanced (flat probability distribution assumed) binary tree.
A flat probability distribution assumption works very well even when it's completely wrong - the worst case is good, and the best and average cases must be at least that good by definition. The further you move from a flat distribution, the worse things can be if actual query probabilities turn out to be very different from your assumptions.
Let me make it precise. What you want for binary search is:
Given a sorted array A whose values follow a non-uniform distribution,
and given the left and right indexes L and R of the search range,
you want to search for a value X in A.
To apply binary search, we want to find the index M in [L,R]
to use as the next position to look at,
where the value X should have equal chances of lying in either range [L,M-1] or [M+1,R].
In general, you of course want to pick M where you think the value X should be in A,
because even if you miss, half of the total 'chance' is eliminated.
So it seems to me you have some expectation about the distribution.
If you could tell us what exactly you mean by '1/x distribution', then
maybe someone here can build on my suggestion for you.
Let me give a worked example.
I'll use a similar interpretation of '1/x distribution' as #Leonid Volnitsky.
Here is Python code that generates the input array A:
from random import uniform
# Generating input
a,b = 10,20
A = [ 1.0/uniform(a,b) for i in range(10) ]
A.sort()
# example input (rounded)
# A = [0.0513, 0.0552, 0.0562, 0.0574, 0.0576, 0.0602, 0.0616, 0.0721, 0.0728, 0.0880]
Let's assume the value to search for is:
X = 0.0553
Then the estimated index of X is:
= total number of items * cumulative probability distribution up to X
= length(A) * P(x <= X)
So how do we calculate P(x <= X)?
In this case it is simple.
We map X back to a value in [a,b], which we will call
X' = 1/X ~ 18
Hence
P(x <= X) = (b-X')/(b-a)
= (20-18)/(20-10)
= 2/10
So the expected position of X is:
10*(2/10) = 2
Well, and that's pretty damn accurate!
Repeating the process to predict where X lies within each given section of A requires some more work. But I hope this sufficiently illustrates my idea.
I know this might not seem like binary search anymore,
if you can get that close to the answer in just one step.
But admit it, this is what you can do when you know the distribution of the input array.
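To make that concrete in code, here is a hypothetical helper (my names, continuing with the a, b and A generated above) that turns the CDF estimate into a starting index and compares it with the exact insertion point:

import bisect

def estimated_index(A, a, b, X):
    X_prime = 1.0 / X                        # back to the uniform [a, b] domain
    p = (b - X_prime) / (b - a)              # P(x <= X) for data of the form 1/uniform(a, b)
    return min(len(A) - 1, max(0, int(round(len(A) * p))))

X = 0.0553
guess = estimated_index(A, a, b, X)
exact = bisect.bisect_left(A, X)
print(guess, exact)                          # the guess is typically within a step or two of exact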
The purpose of binary search is that, for a sorted array, every time you halve the remaining range you are minimizing the worst case: the worst possible number of checks is log2(entries). If you do some kind of 'uneven' binary search, where you divide the array into a smaller and a larger part, then whenever the element is in the larger part you get worse worst-case behaviour. So I think binary search would still be the best algorithm to use regardless of the expected distribution, simply because it has the best worst-case behaviour.
You have a vector of entries, say [x1, x2, ..., xN], and you know that the queries follow the distribution given by 1/x over the vector you have. This means your queries will take place with that distribution, i.e., on each consult, the entries with the larger weight 1/x are the more likely to be asked for.
With the standard approach, your binary search tree is balanced with respect to the labels, and enforces no policy based on the search distribution. A possible change to this policy would be to relax the constraint of a balanced binary search tree (smaller to the left of the parent node, greater to the right), and instead choose the parent nodes as the ones with the higher probabilities, and their child nodes as the two next most probable elements.
Notice this is not a binary search tree, as you are not dividing your search space in two at every step, but rather a tree rebalanced with respect to your search-pattern distribution. This means your worst case of search may reach O(N). For example, take v = [10, 20, 30, 40, 50, 60]:
        30
       /  \
     20    50
    /     /  \
  10    40    60
Which can be reordered, or, rebalanced, using your function f(x) = 1 / x:
f([10, 20, 30, 40, 50, 60]) = [0.100, 0.050, 0.033, 0.025, 0.020, 0.016]
sort(v, f(v)) = [10, 20, 30, 40, 50, 60]
Into a new search tree, that looks like:
     10  ----------> the most probable of being taken
    /  \              leaving v = [[20, 30], [40, 50, 60]]
  20    30  --------> the most probable of being taken
       /  \           leaving v = [[40, 50], [60]]
     40    50  -----> the most probable of being taken
          /           leaving v = [[60]]
        60
If you search for 10, you only need one comparison, but if you're looking for 60, you'll perform O(N) comparisons, which does not qualify this as binary search. As pointed out by #Steve314, the farther you go from a fully balanced tree, the worse your worst-case search becomes.
I will assume from your description:
X is uniformly distributed
Y = 1/X is your data, which you want to search and which is stored in a sorted table
given a value y, you need to binary search for it in the above table
Binary search usually uses the value in the center of the range (the median). For a uniform distribution it is possible to speed up the search by knowing approximately where in the table we need to look for the searched value.
For example, if we have uniformly distributed values in the [0,1] range and the query is for 0.25, it is best to look not at the center of the range but in the 1st quarter of the range.
To use the same technique for 1/X data, store in the table not Y but the inverse 1/Y, and search not for y but for the inverse value 1/y.
Unweighted binary search isn't even optimal for uniformly distributed keys in expected terms, but it is in worst case terms.
The proportionally weighted binary search (which I have been using for decades) does what you want for uniform data, and it handles other distributions by applying an implicit or explicit transform. The sorted hash table is closely related (and I've known about it for decades but never bothered to try it).
In this discussion I will assume that the data is uniformly selected from 1..N and stored in an array of size N indexed by 1..N. If it has a different distribution, e.g. a Zipfian distribution where the value is proportional to 1/index, you can apply an inverse function to flatten the distribution, or the Fisher transform will often help (see Wikipedia).
Initially you have 1..N as the bounds, but in fact you may know the actual Min..Max. In any case we will assume we always have a closed interval [Min,Max] for the index range [L..R] we are currently searching, and initially this is O(N).
We are looking for key K and want an index I such that
[I-R]/[K-Max] = [L-I]/[Min-K] = [L-R]/[Min-Max], i.e. I = L + [R-L] * [K-Min] / [Max-Min].
Round so that the smaller partition gets larger rather than smaller (to help worst case). The expected absolute and root mean square error is <√[R-L] (based on a Poisson/Skellam or a Random Walk model - see Wikipedia). The expected number of steps is thus O(loglogN).
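A bare-bones sketch of the proportional probe (my code; it omits the rounding rule and the worst-case safeguards discussed below):

def proportional_search(A, key):
    # A is sorted; probe where the key 'should' sit if values were spread uniformly
    lo, hi = 0, len(A) - 1
    while lo <= hi and A[lo] <= key <= A[hi]:
        if A[lo] == A[hi]:
            return lo if A[lo] == key else -1
        # I = L + (R - L) * (K - Min) / (Max - Min)
        i = lo + int((hi - lo) * (key - A[lo]) / (A[hi] - A[lo]))
        if A[i] == key:
            return i
        if A[i] < key:
            lo = i + 1
        else:
            hi = i - 1
    return -1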
The worst case can be constrained to be O(logN) in several ways. First we can decide what constant factor c over the lgN of plain halving we regard as acceptable. Proceeding for loglogN steps as above, and then using halving, will achieve this for any such c > 1.
Alternatively we can modify the standard base b = B = 2 of the logarithm so that b > 2. Suppose we take b = 8; then effectively c ~ b/B. We can then modify the rounding above so that at step k the largest partition must be at most N*b^-k, viz. keep track of the size expected if we eliminate 1/b from consideration at each step, which leads to a worst case of about (b/2)*lgN steps. This will however bring our expected case back to O(logN), as we are only allowed to reduce the small partition by 1/b each time. We can restore the O(loglogN) expectation by using simple up-rounding of the small partition for loglogN steps before applying the restricted rounding. This is appropriate because, within a burst expected to be local to a particular value, the distribution is approximately uniform (that is, for any smooth distribution function, e.g. in this case Skellam, any sufficiently small segment is approximately linear with slope given by its derivative at the centre of the segment).
As for the sorted hash, I thought I read about this in Knuth decades ago, but can't find the reference. The technique involves pushing rather than probing: a (possibly weighted binary) search finds the right place or a gap, and then entries are pushed aside to make room as needed; the hash function must respect the ordering. This pushing can wrap around, so a second pass through the table is needed to pick them all up. It is useful to track Min and Max and their indexes (to get a forward or reverse ordered listing, start at one and track cyclically to the other; they can then also be used instead of 1 and N as the initial brackets for the search as above; otherwise 1 and N can be used as surrogates).
If the load factor alpha is close to 1, then insertion is expected O(√N) for expected O(√N) items, which still amortizes to O(1) on average. This cost is expected to decrease exponentially with alpha - I believe (under Poisson assumptions) that μ ~ σ ~ √[Nexp(α)].
The proportionally weighted binary search above can also be used to improve on the initial probe.

Optimizing cartesian requests with affine costs

I have a cost optimization problem for which I don't know whether there is any literature. It is a bit hard to explain, so I apologize in advance for the length of the question.
There is a server I am accessing that works this way:
a request is made on records (r1, ...rn) and fields (f1, ...fp)
you can only request the Cartesian product (r1, ..., rn) x (f1, ..., fp)
The cost (time and money) associated with such a request is affine in the size of the request:
T((r1, ..., rn) x (f1, ..., fp)) = a + b * n * p
Without loss of generality (just by normalizing), we can assume that b = 1, so the cost is:
T((r1, ..., rn) x (f1, ..., fp)) = a + n * p
I only need to request a subset of the pairs, (r1, f(r1)), ..., (rk, f(rk)); these requests come from the users. My program acts as a middleman between the user and the server (which is external). I get a lot of requests like this (tens of thousands a day).
Graphically, we can think of it as an n x p sparse matrix, whose nonzero values I want to cover with rectangular submatrices:
      r1 r2 r3 ... rp
     -------        ---
f1   | x  x |       |x|
f2   | x    |       ---
     -------
f3
..        -------
fn        | x  x |
          -------
Having:
the number of submatrices being kept reasonable because of the constant cost
all the 'x' must lie within a submatrix
the total area covered must not be too large because of the linear cost
I will call g the sparseness coefficient of my problem (the number of needed pairs over the total number of possible pairs, g = k / (n * p)). I know the coefficient a.
There are some obvious observations:
if a is small, the best solution is to request each (record, field) pair independently, and the total cost is: k * (a + 1) = g * n * p * (a + 1)
if a is large, the best solution is to request the whole Cartesian product, and the total cost is : a + n * p
the second solution is better as soon as g > g_min = (1 + a / (n * p)) / (a + 1)
of course the orders in the Cartesian products are unimportant, so I can transpose the rows and the columns of my matrix to make it more easily coverable, for example:
   f1 f2 f3
r1  x     x
r2     x
r3  x     x
can be reordered as
   f1 f3 f2
r1  x  x
r3  x  x
r2        x
And there is an optimal solution which is to request (f1,f3) x (r1,r3) + (f2) x (r2)
Trying all the solutions and looking for the lowest cost is not an option, because the combinatorics explode:
for each permutation on rows: (n!)
for each permutation on columns: (p!)
for each possible covering of the n x p matrix: (time unknown, but large...)
compute cost of the covering
So I am looking for an approximate solution. I already have some kind of greedy algorithm that finds a covering given a matrix (it begins with unitary cells, then merges them if the proportion of empty cells in the merged block is below some threshold).
To put some numbers in mind: my n is somewhere between 1 and 1000, and my p somewhere between 1 and 200. The coverage pattern is really 'blocky', because the records come in classes for which the requested fields are similar. Unfortunately I can't access the class of a record...
Question 1: Does someone have an idea, a clever simplification, or a reference to a paper that could be useful? As I have a lot of requests, an algorithm that works well on average is what I am looking for (but I can't afford it to work very poorly on some extreme case, for example requesting the whole matrix when n and p are large and the request is in fact quite sparse).
Question 2: In fact, the problem is even more complicated: the cost is in fact more like the form: a + n * (p^b) + c * n' * p', where b is a constant < 1 (once a record is asked for a field, it is not too costly to ask for other fields) and n' * p' = n * p * (1 - g) is the number of cells I don't want to request (because they are invalid, and there is an additional cost in requesting invalid things). I can't even dream of a rapid solution to this problem, but still... an idea anyone?
Selecting the submatrices to cover the requested values is a form of the set covering problem and hence NP-complete. Your problem adds to this already hard problem the fact that the costs of the sets differ.
That you are allowed to permute the rows and columns is not such a big problem, because you can just consider disconnected submatrices: row one, columns four to seven, together with row five, columns four to seven, is a valid set, because you can just swap row two and row five and obtain the connected submatrix spanning row one, column four to row two, column seven. Of course this adds some constraints - not all sets are valid under all permutations - but I don't think this is the biggest problem.
The Wikipedia article gives the inapproximability result that the problem cannot be solved in polynomial time better than to within a factor 0.5 * log2(n), where n is the number of sets. In your case 2^(n * p) is a (quite pessimistic) upper bound for the number of sets, and yields that you can only find a solution to within a factor of 0.5 * n * p in polynomial time (unless P = NP, and ignoring the varying costs).
An optimistic lower bound for the number of sets, ignoring permutations of rows and columns, is 0.5 * n^2 * p^2, yielding a much better factor of log2(n) + log2(p) - 0.5. In consequence you can only expect to find a solution in your worst case of n = 1000 and p = 200 to within a factor of about 17 in the optimistic case and up to a factor of about 100,000 in the pessimistic case (still ignoring the varying costs).
So the best you can do is to use a heuristic algorithm (the Wikipedia article mentions an almost optimal greedy algorithm) and accept that there will be cases where the algorithm performs (very) badly. Or you go the other way and use an optimization algorithm and try to find a good solution by using more time. In this case I would suggest trying A* search.
I'm sure there's a really good algorithm for this out there somewhere, but here are my own intuitive ideas:
Toss-some-rectangles approach:
Determine a "roughly optimal" rectangle size based on a.
Place these rectangles (perhaps randomly) over your required points, until all points are covered.
Now take each rectangle and shrink it as much as possible without "losing" any data points.
Find rectangles close to each other and decide whether combining them would be cheaper than keeping them separate.
Grow
Start off with each point in its own 1x1 rectangle.
Locate all rectangles within n rows/columns (where n may be based on a); see if you can combine them into one rectangle for no cost (or negative cost :D).
Repeat.
Shrink
Start off with one big rectangle, that covers ALL points.
Look for a sub-rectangle which shares a pair of sides with the big one, but contains very few points.
Cut it out of the big one, producing two smaller rectangles.
Repeat.
Quad
Divide the plane into 4 rectangles. For each of these, see if you get a better cost by recursing further, or by just including the whole rectangle.
Now take your rectangles and see if you can merge any of them with little/no cost.
Also: keep in mind that sometimes it will be better to have two overlapping rectangles than one large rectangle which is a superset of them. E.g. the case when two rectangles just overlap in one corner.
Ok, my understanding of the question has changed. New ideas:
Store each row as a long bit-string. AND pairs of bit-strings together, trying to find pairs that maximise the number of 1 bits. Grow these pairs into larger groups (sort and try to match the really big ones with each other). Then construct a request that will hit the largest group and then forget about all those bits. Repeat until everything done. Maybe switch from rows to columns sometimes.
Look for all rows/columns with zero, or few, points in them. "Delete" them temporarily. Now you are looking at what would be covered by a request that leaves them out. Now perhaps apply one of the other techniques, and deal with the ignored rows/cols afterwards. Another way of thinking about this is: deal with the denser points first, and then move on to the sparser ones.
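A tiny sketch of the bit-string pairing step (hypothetical data and names): rows become integer bitmasks, and ANDing picks the pair sharing the most columns.

rows = {                                   # row index -> bitmask of required columns
    0: 0b10110,
    1: 0b10010,
    2: 0b00111,
}
pairs = [(r1, r2) for r1 in rows for r2 in rows if r1 < r2]
best = max(pairs, key=lambda p: bin(rows[p[0]] & rows[p[1]]).count("1"))
print(best, bin(rows[best[0]] & rows[best[1]]))   # the pair to grow into a request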
Since your values are sparse, could it be that many users are asking for similar values? Is caching within your application an option? The requests could be indexed by a hash that is a function of (x,y) position, so that you can easily identify cached sets that fall within the correct area of the grid. Storing the cached sets in a tree, for example, would allow you to find minimum cached subsets that cover the request range very quickly. You can then do a linear lookup on the subset, which is small.
I'd consider the n records (rows) and p fields (cols) mentioned in the user request set as n points in p-dimensional space ({0,1}^p), with the ith coordinate being 1 iff it has an X, and identify a hierarchy of clusters, with the coarsest cluster at the root including all the X. For each node in the clustering hierarchy, consider the product that covers all the columns needed (this is rows(any subnode) x cols(any subnode)). Then decide from the bottom up whether to merge the child coverings (paying for the whole covering) or keep them as separate requests. (The coverings are not of contiguous columns, but exactly the ones needed; i.e. think of a bit vector.)
I agree with Artelius that overlapping product-requests could be cheaper; my hierarchical approach would need improvement to incorporate that.
I've worked a bit on it, and here is an obvious O(n^3) greedy, symmetry-breaking algorithm (records and fields are treated separately), in Python-like pseudo-code.
The idea is trivial: we start with one request per record, and we perform the most worthwhile merge until there is nothing left worth merging. This algo has the obvious disadvantage that it does not allow overlapping requests, but I expect it to work quite well on real-life cases (with the a + n * (p^b) + c * n * p * (1 - g) cost function):
# given are
#   a function cost(request) -> positive real
#   a merge function that takes two pairs of sets (f1, r1) and (f2, r2)
#   and returns ((f1 U f2), (r1 U r2))
#   the needed (record, field) pairs, here as a predicate needed(record, field)

# initialize with one request per record
requests = [({record}, {field for field in fields if needed(record, field)})
            for record in needed_records]
costs = [cost(request) for request in requests]

finished = False
while not finished:  # there might be something to gain
    maximum_gain = 0
    finished = True
    this_step_merge = None
    # loop over all distinct pairs of requests
    for i, request1 in enumerate(requests):
        for request2 in requests[i + 1:]:
            merged_request = merge(request1, request2)
            gain = cost(request1) + cost(request2) - cost(merged_request)
            if gain > maximum_gain:
                maximum_gain = gain
                this_step_merge = (request1, request2, merged_request)
    # if we found at least something to merge, we should continue
    if maximum_gain > 0:
        # so update the list of requests...
        request1, request2, merged_request = this_step_merge
        requests.remove(request1)
        requests.remove(request2)
        # ... and we are not done yet
        requests.append(merged_request)
        finished = False

print(requests)
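For concreteness, here are hypothetical cost and merge helpers matching the a + n * p model from the question (a request here is a (records, fields) pair of sets, as in the initialization above):

A_CONST = 5.0                              # the fixed per-request cost a

def cost(request):
    records, fields = request
    return A_CONST + len(records) * len(fields)

def merge(request1, request2):
    (r1, f1), (r2, f2) = request1, request2
    return (r1 | r2, f1 | f2)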
This is O(n^3 * p) because:
after initialization we start with n requests
the while loop removes exactly one request from the pool at each iteration.
the inner for loop iterates on the (ni^2 - ni) / 2 distinct pairs of requests, with ni going from n to one in the worst case (when we merge everything into one big request).
Can someone help me point out the very bad cases for this algorithm? Does it sound reasonable to use?
It is O(n^3), which is way too costly for large inputs. Any idea how to optimize it?
Thanks in advance!
