Optimal filling order for binary tree - algorithm

I have a problem where I need to store changing data values v_i (integers) for constant keys i (also integers, in some range, say [1;M]). I need to be able to quickly draw a random element weighted by the values v_i, i.e. the probability of drawing key k should be v_k / (sum(i=1..M) v_i).
The best idea I could come up with is to use a binary tree and store, as the value for key k (still in the range [1;M]), the partial sum of the values in the subtree rooted at k. Then, whenever a value changes, I need to update its node and all parent nodes in the tree, which takes O(log M) time since the keys are fixed and the binary tree is therefore perfectly balanced. Drawing a random element as above also takes O(log M) time: at each level of the tree, one compares the random number, say in the range (0,1), against the relative weights of the left subtree, the right subtree, and the node itself. This is much faster than the naive algorithm (take a random number r, iterate through the elements to find the k with sum(i=1..k-1) < r <= sum(i=1..k); takes O(M) time).
The question I now have is how to optimize the placement of the tree nodes in memory in order to minimize cache misses. Since all keys are known and remain constant, this is essentially the order in which I should allocate memory for the tree nodes.
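To make this concrete, here is a minimal sketch of the structure I have in mind (C++; it uses the usual implicit layout where node k sits at array index k and its children are 2k and 2k+1, but that layout choice is exactly what I am asking about, and the names are only illustrative):

    #include <random>
    #include <vector>

    struct WeightedSampler {
        int M;
        std::vector<long long> w;    // own weight of each key, 1-based
        std::vector<long long> sub;  // sum of weights in the subtree rooted at each key

        explicit WeightedSampler(int M) : M(M), w(M + 1, 0), sub(M + 1, 0) {}

        void update(int k, long long value) {        // O(log M): fix the path up to the root
            long long delta = value - w[k];
            w[k] = value;
            for (int v = k; v >= 1; v /= 2) sub[v] += delta;
        }

        int sample(std::mt19937_64& rng) const {     // O(log M); assumes sub[1] > 0
            long long r = std::uniform_int_distribution<long long>(0, sub[1] - 1)(rng);
            int v = 1;
            while (true) {
                long long leftSum = (2 * v <= M) ? sub[2 * v] : 0;
                if (r < leftSum) { v = 2 * v; continue; }   // descend into the left subtree
                r -= leftSum;
                if (r < w[v]) return v;                     // the node itself is drawn
                r -= w[v];
                v = 2 * v + 1;                              // descend into the right subtree
            }
        }
    };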
Thanks!!

I don't think there is an optimal filling order for a binary tree beyond something like pre-order, post-order, or in-order filling. Isn't your question really asking how a cache works in general? Unfortunately I don't know that myself; maybe a simpler hash/array structure would be more efficient in your case?

Related

Data structure for a set of keys with efficient find-missing

I'm looking for a data structure that supports the following operations for
integer keys k ranging from 0 to M-1.
O(1) or O(log n) insert(k), erase(k), lookup(k).
O(1) or O(log n) for the special operation find_missing_key() which returns any key not currently present in the structure.
O(n) or O(n log n) space. In particular, it should not be O(M).
An obvious implementation would be a "list-of-free-keys" structure, implemented as a heap; but that would take O(M) space. Is there some data structure that fulfills all of the requirements?
Use a binary segment tree.
Each node in the tree represents a range of integers [a,b], and is either a leaf [a,a] or divides into two nodes representing the ranges [a,m] and [m+1, b] where m is (a+b)/2.
Only expand nodes when necessary, so initially we just have a root node for the range [0,M-1] (or [0,M) if you prefer)
In each node, keep a count of how many used/free spots you have in that subtree.
Insertion, lookup, and deletion of x is O(log n): Just keep subdividing until you reach [x,x], and update everything on the path from that node to the root.
find_missing_key is also O(log n): Since you know the size of each segment and how many free elements are in it, you can decide at each node whether to go left or right in order to find a free element.
(EDIT: Incidentally, this also allows you to find the first, or last, or even the i-th free element, at no additional cost.)
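A rough C++ sketch of this lazily expanded tree (the names and the exact interface are mine, not from the answer above):

    #include <memory>

    struct SparseSegTree {
        struct Node {
            long long lo, hi;                    // node covers the key range [lo, hi]
            long long used = 0;                  // how many keys in [lo, hi] are present
            std::unique_ptr<Node> left, right;   // children, created only when needed
            Node(long long a, long long b) : lo(a), hi(b) {}
            long long size() const { return hi - lo + 1; }
        };
        std::unique_ptr<Node> root;
        explicit SparseSegTree(long long M) : root(new Node(0, M - 1)) {}

        void insert(long long k) { setRec(root.get(), k, true); }
        void erase(long long k)  { setRec(root.get(), k, false); }
        bool lookup(long long k) const { return lookupRec(root.get(), k); }

        // Any key not currently present, or -1 if all of [0, M-1] is used.
        long long find_missing_key() const {
            const Node* v = root.get();
            if (v->used == v->size()) return -1;
            while (v->lo != v->hi) {
                long long mid = (v->lo + v->hi) / 2;
                long long leftUsed = v->left ? v->left->used : 0;
                if (leftUsed < mid - v->lo + 1) {          // a free spot in the left half
                    if (!v->left) return v->lo;            // left half never expanded: all free
                    v = v->left.get();
                } else {                                   // otherwise the right half has one
                    if (!v->right) return mid + 1;
                    v = v->right.get();
                }
            }
            return v->lo;
        }

    private:
        // Returns the change in 'used' so that ancestors can adjust their counts.
        long long setRec(Node* v, long long k, bool present) {
            if (v->lo == v->hi) {
                long long delta = (present ? 1 : 0) - v->used;
                v->used += delta;
                return delta;
            }
            long long mid = (v->lo + v->hi) / 2;
            std::unique_ptr<Node>& child = (k <= mid) ? v->left : v->right;
            if (!child)
                child.reset(k <= mid ? new Node(v->lo, mid) : new Node(mid + 1, v->hi));
            long long delta = setRec(child.get(), k, present);
            v->used += delta;
            return delta;
        }
        bool lookupRec(const Node* v, long long k) const {
            if (!v) return false;
            if (v->lo == v->hi) return v->used == 1;
            long long mid = (v->lo + v->hi) / 2;
            return k <= mid ? lookupRec(v->left.get(), k) : lookupRec(v->right.get(), k);
        }
    };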

Kth minimum in a Range

Given an array of integers and some query operations.
The query operations are of 2 types:
1. Update the value at the i-th index to x.
2. Given two integers i and j, find the k-th minimum in that range (i.e. the k-th smallest element among positions i through j, both inclusive).
I can answer range minimum queries using a segment tree but could not do so for the k-th minimum.
Can anyone help me?
Here is an O(polylog n) per query solution that does not actually assume a constant k, so k can vary between queries. The main idea is to use a segment tree, where every node represents an interval of array indices and contains a multiset (balanced binary search tree) of the values in the represented array segment. The update operation is pretty straightforward:
Walk up the segment tree from the leaf (the array index you're updating). You will encounter all nodes that represent an interval of array indices that contain the updated index. At every node, remove the old value from the multiset and insert the new value into the multiset. Complexity: O(log^2 n)
Update the array itself.
We notice that every array element will be in O(log n) multisets, so the total space usage is O(n log n). With linear-time merging of multisets we can build the initial segment tree in O(n log n) as well (there's O(n) work per level).
What about queries? We are given a range [i, j] and a rank k and want to find the k-th smallest element in a[i..j]. How do we do that?
Find a disjoint coverage of the query range using the standard segment tree query procedure. We get O(log n) disjoint nodes, the union of whose multisets is exactly the multiset of values in the query range. Let's call those multisets s_1, ..., s_m (with m <= ceil(log_2 n)). Finding the s_i takes O(log n) time.
Do a select(k) query on the union of s_1, ..., s_m. See below.
So how does the selection algorithm work? There is one really simple algorithm to do this.
We have s_1, ..., s_m and k given and want to find the smallest x in a, such that s_1.rank(x) + ... + s_m.rank(x) >= k - 1, where rank returns the number of elements smaller than x in the respective BBST (this can be implemented in O(log n) if we store subtree sizes).
Let's just use binary search to find x! We walk through the BBST of the root, do a couple of rank queries and check whether their sum is larger than or equal to k. It's a predicate monotone in x, so binary search works. The answer is then the minimum of the successors of x in any of the s_i.
Complexity: O(n log n) preprocessing and O(log^3 n) per query.
So in total we get a runtime of O(n log n + q log^3 n) for q queries. I'm sure we could get it down to O(q log^2 n) with a cleverer selection algorithm.
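For illustration, here is a compact C++ sketch along these lines. It is not exactly the algorithm above: instead of walking the root's BBST, it binary searches over the value range (so a query costs O(log V * log^2 n), where V is the size of the value range), and it uses GNU PBDS order-statistics trees in place of hand-rolled BBSTs with subtree sizes. All names are illustrative:

    #include <climits>
    #include <functional>
    #include <utility>
    #include <vector>
    #include <ext/pb_ds/assoc_container.hpp>
    #include <ext/pb_ds/tree_policy.hpp>

    // (value, position) pairs make duplicate values distinct inside the ordered tree,
    // so this behaves like a multiset that supports rank queries.
    using ordered_multiset = __gnu_pbds::tree<std::pair<int, int>, __gnu_pbds::null_type,
                                              std::less<std::pair<int, int>>,
                                              __gnu_pbds::rb_tree_tag,
                                              __gnu_pbds::tree_order_statistics_node_update>;

    struct KthInRange {
        int n;
        std::vector<int> a;
        std::vector<ordered_multiset> node;      // node v holds the values of its segment

        explicit KthInRange(const std::vector<int>& init)
            : n(init.size()), a(init), node(4 * init.size()) {
            for (int i = 0; i < n; ++i) change(1, 0, n - 1, i, a[i], true);
        }
        void change(int v, int lo, int hi, int pos, int val, bool add) {
            if (add) node[v].insert({val, pos}); else node[v].erase({val, pos});
            if (lo == hi) return;
            int mid = (lo + hi) / 2;
            if (pos <= mid) change(2 * v, lo, mid, pos, val, add);
            else change(2 * v + 1, mid + 1, hi, pos, val, add);
        }
        void update(int pos, int val) {          // O(log^2 n)
            change(1, 0, n - 1, pos, a[pos], false);
            a[pos] = val;
            change(1, 0, n - 1, pos, val, true);
        }
        // number of elements <= x among a[i..j]
        int countLE(int v, int lo, int hi, int i, int j, int x) const {
            if (j < lo || hi < i) return 0;
            if (i <= lo && hi <= j) return node[v].order_of_key({x, INT_MAX});
            int mid = (lo + hi) / 2;
            return countLE(2 * v, lo, mid, i, j, x)
                 + countLE(2 * v + 1, mid + 1, hi, i, j, x);
        }
        // k-th smallest (1-based) in a[i..j]; minV/maxV bound the possible values
        int kth(int i, int j, int k, int minV, int maxV) const {
            while (minV < maxV) {
                int mid = minV + (maxV - minV) / 2;
                if (countLE(1, 0, n - 1, i, j, mid) >= k) maxV = mid; else minV = mid + 1;
            }
            return minV;
        }
    };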
UPDATE: If we are looking for an offline algorithm that can process all queries at once, we can get O((n + q) * log n * log (q + n)) using the following algorithm:
Preprocess all queries, and create a set of all values that ever occurred in the array. The number of those will be at most q + n.
Build a segment tree, but this time not on the array, but on the set of possible values.
Every node in the segment tree represents an interval of values and maintains a set of positions where these values occur.
To answer a query, start at the root of the segment tree. Check how many positions in the left child of the root lie in the query interval (we can do that by doing two searches in the BBST of positions). Let that number be m. If k <= m, recurse into the left child. Otherwise recurse into the right child, with k decremented by m.
For updates, remove the position from the O(log (q + n)) nodes that cover the old value and insert it into the nodes that cover the new value.
The advantage of this approach is that we don't need subtree sizes, so we can implement this with most standard library implementations of balanced binary search trees (e.g. set<int> in C++).
We can turn this into an online algorithm by changing the segment tree out for a weight-balanced tree such as a BB[α] tree. It has logarithmic operations like other balanced binary search trees, but allows us to rebuild an entire subtree from scratch when it becomes unbalanced by charging the rebuilding cost to the operations that must have caused the imbalance.
If this is a programming contest problem, then you might be able to get away with the following O(n log(n) + q n^0.5 log(n)^1.5)-time algorithm. It is set up to use the C++ STL well and has a much better big-O constant than Niklas's (previous?) answer on account of using much less space and indirection.
Divide the array into k chunks of length n/k. Copy each chunk into the corresponding locations of a second array and sort it. To update: copy the chunk that changed into the second array and sort it again (time O((n/k) log(n/k))). To query: copy to a scratch array the at most 2(n/k - 1) elements that belong to a chunk partially overlapping the query interval, and sort them. Use one of the answers to this question to select the element of the requested rank out of the union of the sorted scratch array and the fully overlapping chunks, in time O(k log(n/k)^2). The optimum setting of k in theory is (n/log(n))^0.5. It's possible to shave another log(n)^0.5 factor using the complicated algorithm of Frederickson and Johnson.
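A rough C++ sketch of the chunk layout, with one simplification: the rank-selection step is replaced by a binary search on the value (using upper_bound in the sorted chunks), which is easier to write but asymptotically a bit worse. Names are illustrative:

    #include <algorithm>
    #include <vector>

    struct ChunkedKth {
        int n, B;                                    // B = chunk length (roughly n/k)
        std::vector<int> a;
        std::vector<std::vector<int>> sorted;        // sorted copy of each chunk

        ChunkedKth(const std::vector<int>& init, int chunkLen)
            : n(init.size()), B(chunkLen), a(init), sorted((n + B - 1) / B) {
            for (int c = 0; c < (int)sorted.size(); ++c) rebuild(c);
        }
        void rebuild(int c) {                        // O((n/k) log(n/k))
            sorted[c].assign(a.begin() + c * B, a.begin() + std::min(n, (c + 1) * B));
            std::sort(sorted[c].begin(), sorted[c].end());
        }
        void update(int i, int x) { a[i] = x; rebuild(i / B); }

        // number of elements <= x in a[i..j]
        long long countLE(int i, int j, int x) const {
            long long cnt = 0;
            for (int p = i; p <= j; ) {
                int c = p / B;
                if (p % B == 0 && p + B - 1 <= j) {  // chunk lies fully inside the range
                    cnt += std::upper_bound(sorted[c].begin(), sorted[c].end(), x)
                           - sorted[c].begin();
                    p += B;
                } else {                             // element of a partially covered chunk
                    if (a[p] <= x) ++cnt;
                    ++p;
                }
            }
            return cnt;
        }
        // k-th smallest (1-based) in a[i..j]; lo/hi bound the possible values
        int kth(int i, int j, int k, int lo, int hi) const {
            while (lo < hi) {
                int mid = lo + (hi - lo) / 2;
                if (countLE(i, j, mid) >= k) hi = mid; else lo = mid + 1;
            }
            return lo;
        }
    };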
Perform a modification of bucket sort: create a bucket that contains the numbers in the range you want, then sort this bucket only and find the k-th minimum.
Damn, this solution can't update an element, but at least it finds that k-th element; hopefully it gives you some ideas for a solution that supports updates. Try pointer-based B-trees.
This uses O(n log n) space and O(log^2 n) time per query (O(q log^2 n) over q queries). Below I also explain how to get O(log n) per query.
So, you'll need to do the following:
1) Build a "segment tree" over the given array.
2) For every node, instead of storing one number, store a whole array. The size of that array has to be equal to the number of leaves (array elements) in the node's subtree. That array (as you guessed) has to contain the values of those leaves (the numbers from that segment), but sorted.
3) To build such an array, merge the two arrays of the node's two children in the segment tree. In addition, for every element of the array you have just made (by merging), remember where it came from (which child array, and its position in it), and a pointer to the first following element that was not taken from the same child array.
4) With this structure, you can check how many numbers in some segment S are lower than a given value x. Find (with binary search) the first number in the root node's array that is >= x. Then, using the pointers you have built, you can get the answer to the same question for the two child arrays in O(1). You stop descending at each node whose segment lies entirely inside or entirely outside the given segment S. The time complexity is O(log n): O(log n) to find the first element that is >= x, and O(log n) for all segments of the decomposition of S.
5) Do a binary search over the answer value.
This was the solution with O(log^2 n) per query. But you can reduce it to O(log n):
1) Before doing all of the above, you need to transform the problem. Sort all the numbers and remember, for each, its position in the original array. These positions now form the array you are working on; call that array P.
If the bounds of the query segment are a and b, you need to find the k-th element in P that lies between a and b by value (not by index). That element represents the index of your result in the original array.
2) To find that k-th element, you do a kind of backtracking descent with complexity O(log n). You will be asking for the number of elements between index 0 and (some other index) that are between a and b by value.
3) Suppose that you know the answer to such a question for some segment (0, h). Get the answers to the same type of question for all segments in the tree that begin at h, starting from the largest one. Keep taking those answers as long as the current answer (for the segment (0, h)) plus the answer you just got is greater than k. Then update h. Keep updating h until there is only one segment in the tree that begins at h. That h is the index of the number you are looking for in the problem you have stated.
Getting the answer to such a question for some segment of the tree takes only O(1) time, because you already know the answer for its parent's segment, and using the pointers explained in the first algorithm you can get the answer for the current segment in O(1).

Range query for a semigroup operator (union)

I'm looking to implement an algorithm which, given an array of integers and a list of ranges (intervals) in that array, returns the number of distinct elements in each interval. That is, given the array A and a range [i,j], it returns the size of the set {A[i],A[i+1],...,A[j]}.
Obviously, the naive approach (iterate from i to j and count, ignoring duplicates) is too slow. Range-sum tricks seem inapplicable, since (A U B) - B isn't always equal to A.
I've looked up Range Queries in Wikipedia, and it hints that Yao (in '82) showed an algorithm that does this for semigroup operators (which union seems to be) with linear preprocessing time and space and almost constant query time. The article, unfortunately, is not available freely.
Edit: it appears this exact problem is available at http://www.spoj.com/problems/DQUERY/
There's a rather simple algorithm which uses O(N log N) time and space for preprocessing and O(log N) time per query. First, create a persistent segment tree for answering range sum queries (initially, it should contain zeroes at all positions). Then iterate through all the elements of the given array, keeping track of the latest position of each value. At each iteration, create a new version of the persistent segment tree by putting a 1 at the latest position of the current element (at each iteration only one element's latest position changes, so only one position's value in the segment tree changes, and the update can be done in O(log N)). To answer a query (l, r), you just need to find the sum over the segment (l, r) in the version of the tree that was created when iterating over the r-th element of the initial array.
Hope this algorithm is fast enough.
Upd. There's a little mistake in my explanation: at each step, at most two positions' values in the segment tree might change (because it is necessary to put a 0 at the previous latest position of a value when it is updated). However, it doesn't change the complexity.
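For concreteness, here is a compact C++ sketch of this persistent-segment-tree approach (identifiers are illustrative, not from the answer above):

    #include <map>
    #include <vector>

    struct DistinctInRange {
        struct Node { int left, right, sum; };
        std::vector<Node> t{ {0, 0, 0} };        // node 0 is the shared empty tree
        std::vector<int> roots{ 0 };             // roots[r] = version after a[0..r-1]
        int n;

        explicit DistinctInRange(const std::vector<int>& a) : n(a.size()) {
            std::map<int, int> last;             // latest position seen for each value
            int root = 0;
            for (int r = 0; r < n; ++r) {
                auto it = last.find(a[r]);
                if (it != last.end())
                    root = update(root, 0, n - 1, it->second, -1);  // clear old latest position
                root = update(root, 0, n - 1, r, +1);               // mark the new latest position
                last[a[r]] = r;
                roots.push_back(root);
            }
        }
        // Path-copying update: O(log n) new nodes per call, old versions stay valid.
        int update(int prev, int lo, int hi, int pos, int delta) {
            t.push_back(t[prev]);
            int cur = int(t.size()) - 1;
            t[cur].sum += delta;
            if (lo == hi) return cur;
            int mid = (lo + hi) / 2;
            if (pos <= mid) { int c = update(t[prev].left,  lo, mid,     pos, delta); t[cur].left  = c; }
            else            { int c = update(t[prev].right, mid + 1, hi, pos, delta); t[cur].right = c; }
            return cur;
        }
        int query(int v, int lo, int hi, int l, int r) const {
            if (v == 0 || r < lo || hi < l) return 0;
            if (l <= lo && hi <= r) return t[v].sum;
            int mid = (lo + hi) / 2;
            return query(t[v].left, lo, mid, l, r) + query(t[v].right, mid + 1, hi, l, r);
        }
        // number of distinct values in a[l..r] (0-based, inclusive)
        int distinct(int l, int r) const { return query(roots[r + 1], 0, n - 1, l, r); }
    };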
You can answer any of your queries in constant time by performing a quadratic-time precomputation:
    for (int i = 0; i < n; ++i) {
        std::unordered_set<int> S;              // empty set backed by a hash table
        int C = 0;
        for (int j = i; j < n; ++j) {
            if (S.insert(A[j]).second) ++C;     // A[j] was not yet in S
            answer[i][j] = C;                   // store C as the answer for interval i..j
        }
    }
This algorithm takes quadratic time since for each interval we perform a bounded number of operations, each one taking constant time (note that the set S is backed by a hashtable), and there's a quadratic number of intervals.
If you don't have additional information about the queries (total number of queries, distribution of intervals), you cannot do essentially better, since the total number of intervals is already quadratic.
You can trade the quadratic precomputation for n linear on-the-fly computations: after receiving a query of the form A[i..j], precompute (in O(n) time) the answers for all intervals A[i..k], k >= i. This guarantees that the amortized complexity remains quadratic, and you are not forced to perform the complete quadratic precomputation at the beginning.
Note that the obvious algorithm (the one you call obvious in the statement) is cubic, since you scan every interval completely.
Here is another approach which might be quite closely related to the segment tree. Think of the elements of the array as leaves of a full binary tree. If there are 2^h elements in the array, there are h levels of that full tree. At each internal node of the tree, store the union of the points that lie in the leaves beneath it. Each number in the array appears once in each level (less if there are duplicates), so the cost in space is a factor of log n.
Consider a range A..B of length K. You can work out the union of points in this range by forming the union of the sets associated with leaves and nodes, picking nodes as high up the tree as possible, as long as the subtree beneath those nodes is entirely contained in the range. If you step along the range picking subtrees that are as big as possible, you will find that the size of the subtrees first increases and then decreases, and the number of subtrees required grows only with the logarithm of the size of the range: if at some point you could only take a subtree of size 2^k, it ends on a boundary divisible by 2^(k+1), so the next step gives you the chance of a subtree of size at least 2^(k+1) if your range is big enough.
So the number of semigroup operations required to answer a query is O(log n) - but note that the semigroup operations may be expensive as you may be forming the union of two large sets.
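A small C++ sketch of this tree-of-unions idea, with std::set standing in for the stored sets (illustrative names; the set unions performed at query time are exactly the potentially expensive semigroup operations mentioned above):

    #include <cstddef>
    #include <set>
    #include <vector>

    struct UnionTree {
        int n;
        std::vector<std::set<int>> node;        // node[v] = set of values under v

        explicit UnionTree(const std::vector<int>& a) : n(a.size()), node(4 * a.size()) {
            if (n > 0) build(1, 0, n - 1, a);
        }
        void build(int v, int lo, int hi, const std::vector<int>& a) {
            if (lo == hi) { node[v].insert(a[lo]); return; }
            int mid = (lo + hi) / 2;
            build(2 * v, lo, mid, a);
            build(2 * v + 1, mid + 1, hi, a);
            node[v] = node[2 * v];                                   // union of the two children
            node[v].insert(node[2 * v + 1].begin(), node[2 * v + 1].end());
        }
        // union of the O(log n) canonical subtree sets covering a[i..j]
        std::set<int> query(int v, int lo, int hi, int i, int j) const {
            if (j < lo || hi < i) return {};
            if (i <= lo && hi <= j) return node[v];
            int mid = (lo + hi) / 2;
            std::set<int> left = query(2 * v, lo, mid, i, j);
            std::set<int> right = query(2 * v + 1, mid + 1, hi, i, j);
            left.insert(right.begin(), right.end());
            return left;
        }
        std::size_t distinct(int i, int j) const { return query(1, 0, n - 1, i, j).size(); }
    };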

Sorting an n-element array with O(log n) distinct elements in O(n log log n) worst case time

The problem at hand is what's in the title: give an algorithm which sorts an n-element array with O(log n) distinct elements in O(n log log n) worst-case time. Any ideas?
Further, how do you generally handle arrays with repeated (non-distinct) elements?
O(log(log(n))) time is enough for you to do a primitive operation in a search tree with O(log(n)) elements.
Thus, maintain a balanced search tree of all the distinct elements you have seen so far. Each node in the tree additionally contains a list of all elements you have seen with that key.
Walk through the input elements one by one. For each element, try to insert it into the tree (which takes O(log log n) time). If you find you've already seen an equal element, just insert it into the auxiliary list in the already-existing node.
After traversing the entire list, walk through the tree in order, concatenating the auxiliary lists. (If you take care to insert in the auxiliary lists at the right ends, this is even a stable sort).
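A minimal C++ sketch of this approach, with std::map standing in for the balanced search tree (illustrative names; the per-key vectors play the role of the auxiliary lists):

    #include <map>
    #include <vector>

    std::vector<int> sortFewDistinct(const std::vector<int>& a) {
        // The map has only O(log n) keys, so each operation on it costs O(log log n).
        std::map<int, std::vector<int>> buckets;
        for (int x : a) buckets[x].push_back(x);        // n insertions: O(n log log n)
        std::vector<int> out;
        out.reserve(a.size());
        for (const auto& kv : buckets)                  // in-order walk, concatenate the lists
            out.insert(out.end(), kv.second.begin(), kv.second.end());
        return out;
    }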
A simple log(N)-space solution would be:
find the distinct elements using a balanced tree (log(n) space, n + log(n) == n time)
Then you can use this tree to always pick the correct pivot for quicksort.
I wonder if there is a log(log(N))-space solution.
Some details about using a tree:
You should be able to use a red-black tree (or another type of tree-based sorting structure) whose nodes hold both a value and a counter: say a tuple (value, count).
When you insert a new value you either create a new node or, if a node with that value already exists, increment its count. If you just increment the counter it will take you O(H), where H is the height of the tree, to find the node; if you need to create the node it will also take O(H) to create and position it (the constants are bigger, but it's still O(H)).
This ensures that the tree will have no more than O(log n) nodes (because there are only O(log n) distinct values), so H = O(log log n). This means that each insertion takes O(log log n), and with n insertions the total is O(n log log n).
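And a compact sketch of this (value, count) variant, again with std::map standing in for the red-black tree (illustrative):

    #include <cstddef>
    #include <map>
    #include <vector>

    std::vector<int> sortByCounting(const std::vector<int>& a) {
        std::map<int, std::size_t> count;               // one node per distinct value
        for (int x : a) ++count[x];                     // O(log log n) per element
        std::vector<int> out;
        out.reserve(a.size());
        for (const auto& kv : count)
            out.insert(out.end(), kv.second, kv.first); // emit each value count times
        return out;
    }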

Fast sampling and update of weighted items (data structure like red-black trees?)

What is the appropriate data structure for this task?
I have a set of N items. N is large.
Each item has a positive weight value associated with it.
I would like to do the following, quickly:
inner loop:
Sample an item, according to its weight.
[process...]
Update the weight of K items, where K << N.
When I say sample by weight, this is different than uniform sampling. An item's likelihood is in proportion to its weight. So if there are two items, and one has weight .8 and one has weight .2, then they have likelihood 80% and 20% respectively.
The number of items N remains fixed. Weights are in a bounded range, say [0, 1].
Weights do not always sum to one.
A naive approach takes O(n) time steps to sample.
Is there an O(log(n)) algorithm?
What is the appropriate data structure for this task?
I believe that red-black trees are inappropriate, since they treat each item as having equal weight.
Actually, you can use (modified) RB-trees for this. Moreover, a modification of any balanced tree (not necessarily binary) will do.
The trick is to store additional information in each node - in your case, it could be the total weight of the subtree rooted at the node, or something like that.
When you update (ie. insert/delete) the tree, you follow the algorithm for your favorite tree. As you change the structure, you just recalculate the sums of the nodes (which is an O(1) operation for eg. rotations and B-tree splits and joins). When you change the weight of an item, you update the sums of the node's ancestors.
When you sample, you run a modified version of search. You get the sum of all weights in the trees (ie. sum of the root) and generate a positive random number lower than this. Then, you run the search algorithm, where you go to the left node iff the number (which is a quantile you search for) is less than the sum of the left node. If you go to the right node, you subtract the left sum from the quantile.
This description is a little chaotic, but I hope it helps.
This is a problem I had to solve for some Monte Carlo simulations. You can see my current `binary tree' at the link below. I have tried to make it fairly STL-like. My tree has a fixed capacity, which you could get round with a red-black tree approach, which I have talked about trying. It is essentially an implementation of the ideas described by jpalecek.
The method is very robust in coping with inexact floating point numbers, because it almost always sums and compares quantities of the same magnitude, because they are at the same level in the tree.
http://mopssuite.svn.sourceforge.net/viewvc/mopssuite/utils/trunk/include/binary_tree.hpp?view=markup
I like jpalecek's answer. I would add that the simplest way to get your random number would be (1) Generate u from a uniform(0,1) distribution. (2) Multiply u by the sum of all the weights in the tree.
Since N is fixed, you can solve this by using an array, say v, where v[0] = 0, v[1] = weight[0], and v[i+1] = v[i] + weight[i], and sampling by binary search (which is O(log N)) for the lower bound of a random number uniformly distributed between 0 and the sum of the weights.
Naive update of K items is O(KN), a more thoughtful one is O(N).
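A short C++ sketch of this array approach (names are illustrative; after weight updates the prefix array has to be rebuilt, which is the O(N) part):

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    // v[0] = 0, v[i+1] = v[i] + weight[i]
    std::vector<double> prefixSums(const std::vector<double>& weight) {
        std::vector<double> v(weight.size() + 1, 0.0);
        for (std::size_t i = 0; i < weight.size(); ++i) v[i + 1] = v[i] + weight[i];
        return v;
    }

    // Draw an index with probability proportional to its weight: O(log N).
    int sampleIndex(const std::vector<double>& v, std::mt19937_64& rng) {
        double r = std::uniform_real_distribution<double>(0.0, v.back())(rng);
        // the first prefix sum strictly greater than r marks the chosen item
        return int(std::upper_bound(v.begin(), v.end(), r) - v.begin()) - 1;
    }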
Spoiling yet another interview question by the circle of smug :)
