I'm learning data structures and algorithms. The book I'm using (Sedgewick) illustrates the divide-and-conquer strategy with 'finding the maximum element'. The algorithm divides an array at its midpoint into two parts, finds the maximum element of each part (recursively), and returns the larger of the two as the maximum element of the whole array.
Below is an exercise question from the book:
Modify the divide-and-conquer program for finding the maximum element in an array (Program 5.6) to divide an array of size N into one part of size k = 2^(⌈lg N⌉ − 1) and another of size N − k (so that the size of at least one of the parts is a power of 2).
Draw the tree corresponding to the recursive calls that your program makes when the array size is 11, similar to the one shown for Program 5.6.
I see that the left subtree of such a binary tree is a perfect binary tree, because the size of the first part is a power of two. What implication is the author hoping I will draw from this?
I suppose one nugget of this exercise lies in the choice of k. It makes the point that if you use this formula for k in a binary recursion, then the underlying tree is "pretty", in the sense that the left subtree of every node (not just the root) is a perfect binary tree.
Of course it is also well-behaved in the "ideal" case when N is a power of 2; k is then simply N/2, and every subtree (not only the left) is a perfect binary tree.
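For concreteness, here is a minimal sketch of the modified recursion in Python (the function name and inclusive-bounds convention are mine, not Sedgewick's):

    def find_max(a, lo, hi):
        """Maximum of a[lo..hi] (inclusive), splitting the range so that
        the left part has size k = 2^(ceil(lg n) - 1)."""
        n = hi - lo + 1
        if n == 1:
            return a[lo]
        # k is the largest power of two strictly less than n, which is
        # exactly 2^(ceil(lg n) - 1); when n is a power of two, k = n/2.
        k = 1
        while 2 * k < n:
            k *= 2
        left = find_max(a, lo, lo + k - 1)
        right = find_max(a, lo + k, hi)
        return left if left >= right else right

Tracing find_max on an 11-element array splits it 8/3, and the 3-element part then splits 2/1, which matches the tree the exercise asks you to draw.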
Consider the following list of tuples:
[(5,4,5), (6,9,6), (3,8,3), (7,9,8)]
I am trying to devise an algorithm to check whether there exists at least one tuple in the list in which every element is greater than or equal to the corresponding element of a given tuple (the needle).
For example, for the given tuple (6,5,7), the algorithm should return True, since every element of the given tuple is less than or equal to the corresponding element of the last tuple in the list, (7,9,8). However, for the given tuple (9,1,9), the algorithm should return False, as there is no tuple in the list whose every element is greater than or equal to the corresponding element of the given tuple. In particular, this is due to the first and third elements (both 9) of the given tuple, each of which is larger than the corresponding element of every tuple in the list.
A naive algorithm would loop through the tuples in the list one by one, with an inner loop over the elements of each tuple. Assuming there are n tuples, each with m elements, this gives a complexity of O(nm).
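In Python, that naive check is a direct transcription of the two loops just described (function and variable names are mine):

    def dominated(needle, haystack):
        """O(n*m): is some tuple in haystack >= needle elementwise?"""
        return any(all(e >= q for e, q in zip(t, needle))
                   for t in haystack)

    haystack = [(5, 4, 5), (6, 9, 6), (3, 8, 3), (7, 9, 8)]
    print(dominated((6, 5, 7), haystack))  # True: (7, 9, 8) dominates it
    print(dominated((9, 1, 9), haystack))  # False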
I am wondering whether it is possible to perform this check with a lower complexity. Pre-processing or any fancy data structure to store the data is allowed!
My original thought was to use some variant of binary search, but I can't seem to find a data structure that avoids falling back to the naive solution once some tuples have been eliminated based on the first element, which means such an algorithm could still end up being O(nm) as well.
Thanks!
Consider the 2-tuple version of this problem. Each tuple (x,y) corresponds to an axis-aligned rectangle on the plane with its upper-right corner at (x,y), extending to (−∞,−∞) at the lower left. The collection corresponds to the union of these rectangles. Given a query point (the needle), we need only determine whether it's in the union. Knowing the boundary is sufficient for this. It's an axis-aligned polyline whose y is monotonically non-increasing with respect to x: a "downward staircase" in the x direction. With any reasonable data structure (e.g. an x-sorted list of the points on the polyline), it's simple to make the decision in O(log n) time for n rectangles. It's not hard to see how to construct the polyline in O(n log n) time by inserting rectangles one at a time, each with O(log n) work.
[Figure: the four dots are the input tuples; the area to the left of and below the blue staircase line corresponds to "True" return values. Tuples A, B, C affect the boundary; tuple D doesn't.]
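Here is a sketch of the 2-tuple case in Python. For simplicity it builds the staircase once up front with a sort-and-sweep rather than by incremental insertion (names are mine):

    import bisect

    class Staircase2D:
        """Non-dominated points kept sorted by x.  Along the staircase,
        x increases while y decreases, so among the points with x >= qx
        the leftmost one has the greatest y."""

        def __init__(self, points):
            # Sweep in descending x, keeping points whose y beats
            # everything seen so far; these are the staircase corners.
            stairs, best_y = [], float('-inf')
            for x, y in sorted(points, reverse=True):
                if y > best_y:
                    stairs.append((x, y))
                    best_y = y
            stairs.reverse()                  # ascending x, descending y
            self.xs = [x for x, _ in stairs]
            self.ys = [y for _, y in stairs]

        def dominated(self, qx, qy):
            """O(log n): is (qx, qy) inside the union of rectangles?"""
            i = bisect.bisect_left(self.xs, qx)   # first point with x >= qx
            return i < len(self.xs) and self.ys[i] >= qy

Building the staircase this way costs O(n log n) for the sort, matching the incremental bound mentioned above.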
So the question is whether this 2-tuple version generalizes nicely to 3-tuples. The union of semi-infinite axis-aligned rectangles becomes a union of semi-infinite rectangular prisms, and the boundary polyline becomes a 3-d surface.
There are a few common ways to represent problems like this. One is an octree. Computing the union of octrees is a well-known standard algorithm and fairly efficient. Querying one for membership requires O(log k) time, where k is the biggest integer coordinate range contained in it. This is likely to be the simplest option, but octrees can be relatively slow and take a lot of space if the integer domain is big.
Another candidate without these weaknesses is a Binary Space Partition (BSP), which can handle arbitrary dimensions. BSPs use (hyper)planes of dimension n−1 to recursively split n-dimensional space. A tree describes the logical relationship of the planes. In this application, you'll need 3 planes per tuple. The intersection of the "True" half-spaces induced by the planes is the "True" semi-infinite prism corresponding to the tuple. Querying a needle means traversing the tree to determine whether you're inside any of the prisms. Average-case behavior of BSPs is very good, but the worst-case size of the tree is terrible: O(n) search time over a tree of size O(2^n). In real applications, tricks are used to find BSPs of modest size at creation time, starting with randomizing the insertion order.
K-d trees are another tree-based space partitioning scheme that could be adapted to this problem. This will take some work, though, because most presentations of k-d trees are concerned with searching for points, not representing regions. They'd have the same worst case behavior as BSPs.
The other bad news is that these algorithms aren't well-suited to tuples much bigger than 3. Trees quickly become too big. Searching high dimensional spaces is hard and a topic of active research. However, since you didn't say anything about tuple length, I'll stop here.
This kind of problem is addressed by spatial indexing systems. There are many data structures that allow your query to be executed efficiently.
Let S be a topologically sorted copy of the original set of n m-tuples. Then we can binary-search for any test tuple in S at a cost of O(m lg n) per search (at most lg n search plies, with at most m comparisons per ply).
Note: suppose there exist tuples P, Q in S such that P ≤ Q (that is, no element of Q is smaller than the corresponding element of P). Then tuple P can be removed from S, since any needle dominated by P is also dominated by Q. In practice this might often cut the size of S to a small multiple of m, which would give O(m lg m) performance; but in the worst case it will provide no reduction at all.
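A quadratic sketch of that pruning step in Python (names mine; as noted, in the worst case nothing can be removed):

    def dominates(q, p):
        """True if q >= p elementwise."""
        return all(qi >= pi for qi, pi in zip(q, p))

    def prune_dominated(tuples):
        """Keep only the maximal tuples: drop any P dominated by a
        different tuple Q.  Duplicates are collapsed first so that two
        equal tuples don't eliminate each other."""
        tuples = list(set(tuples))
        return [p for i, p in enumerate(tuples)
                if not any(j != i and dominates(q, p)
                           for j, q in enumerate(tuples))]

On the example list from the question, only (7,9,8) survives, since it dominates the other three tuples.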
Trying to answer: "all corresponding elements greater than or equal to a given tuple (needle)".
(using y and z for members of the set/haystack, x for the query tuple/needle, and writing x ≼ y when xᵢ ≤ yᵢ for all i, i.e. x is dominated by y)
compute telling summary information like the min, sum, and max of all tuple elements
order criteria by selectivity
weed out dominated tuples
build a k-d tree
top it off with lower and upper bounding boxes:
a tuple lower consisting of the minimum values for each element (if lower dominates x, return True),
and a tuple upper consisting of the maximum values for each element (if x is not dominated by upper, return False).
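The bounding-box pre-check at the end of that list might look like this in Python (a sketch; names mine):

    def make_bounds(tuples):
        """Elementwise minimum and maximum over all tuples."""
        lower = tuple(map(min, zip(*tuples)))
        upper = tuple(map(max, zip(*tuples)))
        return lower, upper

    def quick_check(x, lower, upper):
        """True/False when the bounds already decide the query,
        None when a full search is still needed."""
        if all(xi <= li for xi, li in zip(x, lower)):
            return True    # every tuple dominates x
        if any(xi > ui for xi, ui in zip(x, upper)):
            return False   # that coordinate of x beats every tuple
        return None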
You can intersect any two sorted lists in linear time.
get the in-order (left child, then parent data, then right child) iterators for both AVL trees.
peek at the head of both iterators.
if one iterator is exhausted, return the result set.
if both elements are equal or the union is being computed, add their minimum to the result set.
pop the lowest element (if the iterators are in ascending order). If both are equal, pop both.
This runs in O(n1+n2) and is optimal for the union operation (where you are bound by the output size).
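A sketch of that two-iterator walk in Python, over plain sorted lists for simplicity (the AVL in-order iterators would plug in the same way):

    def merge_sorted(a, b, union=False):
        """Intersection (or union, if union=True) of two sorted
        sequences in O(len(a) + len(b))."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])          # in both: always keep
                i += 1
                j += 1
            elif a[i] < b[j]:
                if union:
                    out.append(a[i])
                i += 1
            else:
                if union:
                    out.append(b[j])
                j += 1
        if union:                         # at most one tail is non-empty
            out.extend(a[i:])
            out.extend(b[j:])
        return out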
Alternatively, you can look at all elements of the smaller tree to see if they are present in the larger tree. This runs in O(n1 log n2).
This is the algorithm Google uses (or considered using) in their BigTable engine to find an intersection:
Get iterators for all sources
Start with pivot = null
loop over all n iterators in sequence until any of them is exhausted.
find the smallest element larger than the pivot in this iterator.
if the element is the pivot
increment the count of the iterators the pivot is in
if this pivot is in all iterators, add the pivot to the result set.
else
reset the count of the iterators the pivot is in
use the found element as the new pivot.
To find an element or the next largest element in a binary tree iterator:
start from the current element
walk up until the current element is larger than the element being searched for, or you are at the root
walk down until you find the element or you can't go to the left
if the current element is smaller than the element being searched for, return null (this iterator is exhausted)
else return the current element
This decays to O(n1+n2) for similarly-sized sets that are perfectly mixed, and to O(n1 log n2) if the second tree is much bigger. If the range of a subtree in one tree does not intersect any node in the other tree / all other trees, then at most one element from this subtree is ever visited (its minimum). This is possibly the fastest algorithm available.
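Here is a simplified sketch of the pivot scheme in Python, using sorted duplicate-free lists, where bisect stands in for the tree iterator's "smallest element >= pivot" search (a real implementation would use the tree walk described above):

    import bisect

    def intersect_many(lists):
        """Pivot-based intersection of k sorted, duplicate-free lists.
        Returns as soon as any list runs out of candidates."""
        k = len(lists)
        if k == 0 or any(not lst for lst in lists):
            return []
        result = []
        pivot, count, i = lists[0][0], 1, 1
        while True:
            lst = lists[i % k]
            j = bisect.bisect_left(lst, pivot)  # smallest element >= pivot
            if j == len(lst):                   # this iterator is exhausted
                return result
            if lst[j] == pivot:
                count += 1
                if count >= k:                  # pivot found in every list
                    result.append(pivot)
                    if j + 1 == len(lst):
                        return result
                    pivot, count = lst[j + 1], 1
            else:
                pivot, count = lst[j], 1        # larger element: new pivot
            i += 1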
Here is a paper with efficient algorithms for finding intersections and unions of AVL trees (or other kinds of trees for that matter, and other operations).
Implementing Sets Efficiently in a Functional Language
I found this paper when I was researching this subject. The algorithms are in Haskell and they are primarily designed for immutable trees, but they will work as well for any kind of tree (though there might be some overhead in some languages). Their performance guarantees are similar to the algorithm presented above.
Loser trees and heaps seem very similar to each other, differing only in a few concepts. In external sorting their function is basically the same: finding the minimal/maximal value among k runs. So are there any significant differences between the two?
For the most part, loser trees and heaps are quite similar. However, there are a few important distinctions. The loser tree, because it provides the loser of each match, will contain repeat nodes. Since the heap is a data-storing structure, it won't contain these redundancies. Another difference between the two is that the loser tree must be a full binary tree (because it is a type of tournament tree), but the heap does not necessarily have to be binary.
Finally, to understand a specific quality of the loser tree, consider the following problem:
Suppose we have k sequences, each of which is sorted in nondecreasing order, that are to be merged into one sequence in nondecreasing order. This can be achieved by repeatedly transferring the element with the smallest key to an output array. The smallest key has to be found from the leading elements in the k sequences. Ordinarily, this would require k − 1 comparisons for each element transferred. However, with a loser tree, this can be reduced to log2 k comparisons per element.
Source: Handbook of Data Structures and Applications, Dinesh Mehta
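For comparison, here is the k-way merge done with a binary heap in Python; a loser tree fills the same role, with roughly one comparison per tree level instead of the up to two per level a heap's sift-down can incur (a sketch, names mine):

    import heapq

    def kway_merge(runs):
        """Merge k sorted runs into one sorted list, transferring the
        smallest leading element each step in O(log k) time."""
        heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
        heapq.heapify(heap)
        out = []
        while heap:
            val, r, i = heapq.heappop(heap)
            out.append(val)
            if i + 1 < len(runs[r]):
                heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
        return out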
I have a problem where I need to store changing data values v_i (integers) for constant keys i (also integers, in some range, say [1;M]). I need to be able to quickly draw a random element weighted by the values v_i, i.e. the probability of drawing key k should be v_k / (sum over i=1..M of v_i).
The best idea I could come up with is to use a binary tree and store, as the value for key k, the partial sum of the values in the subtree rooted at k (keys still in the range [1;M]). Then, whenever a value changes, I need to update its node and all parent nodes in the tree (this takes O(log M) time, since the keys are fixed and thus the binary tree is perfectly balanced). Drawing a random element as above also takes O(log M) time (at each level of the tree, one compares a random number, say in the range (0,1), against the relative weights of the left subtree, the right subtree, and the node itself) and is much faster than the naive algorithm (take a random number r and iterate through the elements to find the k such that sum(i=1..k) < r ≤ sum(i=1..k+1); this takes O(M) time).
The question I now have is how to optimize the placement of the tree nodes in memory in order to minimize cache misses. Since all keys are known and remain constant, this is essentially the order in which I should allocate memory for the tree nodes.
Thanks!!
I don't think there is an optimal filling order for a binary tree beyond something like pre-order, post-order, or in-order filling. Isn't your question really asking how a cache works in general? Unfortunately I don't know that myself; maybe a simpler hash/array structure would be more efficient in your case?
What is the appropriate data structure for this task?
I have a set of N items. N is large.
Each item has a positive weight value associated with it.
I would like to do the following, quickly:
inner loop:
Sample an item, according to its weight.
[process...]
Update the weight of K items, where K << N.
When I say sample by weight, this is different than uniform sampling. An item's likelihood is in proportion to its weight. So if there are two items, and one has weight .8 and one has weight .2, then they have likelihood 80% and 20% respectively.
The number of items N remains fixed. Weights are in a bounded range, say [0, 1].
Weights do not always sum to one.
A naive approach takes O(n) time steps to sample.
Is there an O(log(n)) algorithm?
What is the appropriate data structure for this task?
I believe that red-black trees are inappropriate, since they treat each item as having equal weight.
Actually, you can use (modified) RB-trees for this. Moreover, a modification of any balanced tree (not necessarily binary) will do.
The trick is to store additional information in each node - in your case, it could be the total weight of the subtree rooted at the node, or something like that.
When you update (ie. insert/delete) the tree, you follow the algorithm for your favorite tree. As you change the structure, you just recalculate the sums of the nodes (which is an O(1) operation for eg. rotations and B-tree splits and joins). When you change the weight of an item, you update the sums of the node's ancestors.
When you sample, you run a modified version of search. You get the sum of all weights in the tree (i.e. the sum at the root) and generate a positive random number lower than this. Then you run the search algorithm, where you go to the left child iff the number (which is the quantile you are searching for) is less than the left subtree's sum. If you go to the right child, you subtract the left sum from the quantile.
This description is a little chaotic, but I hope it helps.
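Since the key set in the question is fixed, the same idea also works with an implicit, array-backed complete binary tree instead of an RB-tree. A sketch in Python (0-based keys; names mine):

    import random

    class WeightedSampler:
        """Implicit binary tree over keys 0..m-1: tree[m+i] holds the
        weight of key i, and each internal node holds the sum of its
        two children.  Update and sample are both O(log m)."""

        def __init__(self, weights):
            self.m = len(weights)
            self.tree = [0.0] * self.m + list(weights)
            for node in range(self.m - 1, 0, -1):
                self.tree[node] = (self.tree[2 * node]
                                   + self.tree[2 * node + 1])

        def update(self, key, weight):
            node = self.m + key
            self.tree[node] = weight
            node //= 2
            while node:                   # fix the sums up to the root
                self.tree[node] = (self.tree[2 * node]
                                   + self.tree[2 * node + 1])
                node //= 2

        def sample(self):
            r = random.uniform(0, self.tree[1])  # tree[1] = total weight
            node = 1
            while node < self.m:                 # descend to a leaf
                if r < self.tree[2 * node]:
                    node = 2 * node
                else:
                    r -= self.tree[2 * node]
                    node = 2 * node + 1
            return node - self.m

For example, WeightedSampler([0.2, 0.8]).sample() returns 1 about 80% of the time, matching the two-item example above.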
This is a problem I had to solve for some Monte Carlo simulations. You can see my current `binary tree' at the link below. I have tried to make it fairly STL-like. My tree has a fixed capacity, which you could get around with the red-black tree approach I have talked about trying. It is essentially an implementation of the ideas described by jpalecek.
The method is very robust in coping with inexact floating-point numbers, because it almost always sums and compares quantities of the same magnitude, since they are at the same level in the tree.
http://mopssuite.svn.sourceforge.net/viewvc/mopssuite/utils/trunk/include/binary_tree.hpp?view=markup
I like jpalecek's answer. I would add that the simplest way to get your random number would be (1) Generate u from a uniform(0,1) distribution. (2) Multiply u by the sum of all the weights in the tree.
Since N is fixed, you can solve this using an array, say v, where v[i+1] = v[i] + weight[i], v[1] = weight[0], v[0] = 0. Sample by binary search for the lower bound of a random number uniformly distributed between 0 and the sum of the weights, which is O(log N).
Naive update of K items is O(KN), a more thoughtful one is O(N).
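In Python, this array approach is just a prefix sum plus bisect (a sketch, names mine):

    import bisect
    import random

    def build_prefix(weights):
        """v[0] = 0, v[i+1] = v[i] + weights[i]."""
        v = [0.0]
        for w in weights:
            v.append(v[-1] + w)
        return v

    def sample(v):
        """Draw index i with probability weights[i] / total, O(log N)."""
        r = random.uniform(0, v[-1])
        return bisect.bisect_right(v, r) - 1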
Spoiling yet another interview question by the circle of smug :)