Data structure for a set of keys with an efficient find-missing-key algorithm

I'm looking for a data structure that supports the following operations for
integer keys k ranging from 0 to M-1.
O(1) or O(log n) insert(k), erase(k), lookup(k).
O(1) or O(log n) for the special operation find_missing_key() which returns any key not currently present in the structure.
O(n) or O(n log n) space; in particular, it should not be O(M).
An obvious implementation would be a "list-of-free-keys" structure, implemented as a heap, but that would take O(M) space. Is there some data structure that fulfills all of the requirements?

Use a binary segment tree.
Each node in the tree represents a range of integers [a,b], and is either a leaf [a,a] or divides into two nodes representing the ranges [a,m] and [m+1, b] where m is (a+b)/2.
Only expand nodes when necessary, so initially we just have a root node for the range [0,M-1] (or [0,M) if you prefer).
In each node, keep a count of how many used/free spots you have in that subtree.
Insertion, lookup, and deletion of x are O(log M): just keep subdividing until you reach [x,x], and update everything on the path from that node to the root.
find_missing_key is also O(log M): since you know the size of each segment and how many free elements are in it, you can decide at each node whether to go left or right to find a free element.
(EDIT: Incidentally, this also allows you to find the first, the last, or even the i-th free element, at no additional cost.)
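Here is a minimal C++ sketch of the lazily expanded tree described above (the class and method names are mine; children are allocated on first touch, so the space is O(n log M) for n stored keys rather than O(M)):

```cpp
#include <cstdint>
#include <memory>

// Lazily expanded segment tree over [0, M). Each node stores how many
// keys are present in its range; children are allocated on demand.
class SparseKeySet {
    struct Node {
        int64_t count = 0;                 // keys present in this range
        std::unique_ptr<Node> left, right; // allocated on first use
    };
    Node root_;
    int64_t M_;

    static void update(Node* n, int64_t lo, int64_t hi, int64_t k, int delta) {
        n->count += delta;
        if (lo == hi) return;
        int64_t mid = lo + (hi - lo) / 2;
        if (k <= mid) {
            if (!n->left) n->left = std::make_unique<Node>();
            update(n->left.get(), lo, mid, k, delta);
        } else {
            if (!n->right) n->right = std::make_unique<Node>();
            update(n->right.get(), mid + 1, hi, k, delta);
        }
    }

public:
    explicit SparseKeySet(int64_t M) : M_(M) {}

    bool lookup(int64_t k) const {
        const Node* n = &root_;
        int64_t lo = 0, hi = M_ - 1;
        while (n && lo < hi) {
            int64_t mid = lo + (hi - lo) / 2;
            if (k <= mid) { n = n->left.get(); hi = mid; }
            else          { n = n->right.get(); lo = mid + 1; }
        }
        return n && n->count > 0;
    }

    void insert(int64_t k) { if (!lookup(k)) update(&root_, 0, M_ - 1, k, +1); }
    void erase(int64_t k)  { if (lookup(k))  update(&root_, 0, M_ - 1, k, -1); }

    // Smallest key not present, or -1 if the set is full. A child range
    // has a free slot iff its count is smaller than the range's size.
    int64_t find_missing_key() const {
        if (root_.count == M_) return -1;
        const Node* n = &root_;
        int64_t lo = 0, hi = M_ - 1;
        while (lo < hi) {
            int64_t mid = lo + (hi - lo) / 2;
            int64_t leftCount = n->left ? n->left->count : 0;
            if (leftCount < mid - lo + 1) {    // free slot on the left
                if (!n->left) return lo;       // fully empty subrange
                n = n->left.get(); hi = mid;
            } else {                           // left full, go right
                if (!n->right) return mid + 1;
                n = n->right.get(); lo = mid + 1;
            }
        }
        return lo;
    }
};
```

A fully absent child simply means "everything in that range is free", which is why find_missing_key can return immediately when it steps into an unallocated subtree.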

Related

AVL tree in-order traversal

Let S be a dynamic set of integers. Let n = |S|. Describe a data structure on S to
support the following operations on S with the required performance guarantees:
• Insert a new element to S in O(log n) time.
• Delete an element from S in O(log n) time.
• Report the k smallest elements of S in O(k) time, for any k satisfying 1 ≤ k ≤ n.
Your structure must consume O(n) space at all times.
Could I simply build an AVL tree and then perform an in-order traversal, printing out the first three elements?
Use a priority queue, which uses O(n) space. Insert and delete are both O(log n) operations. Find-min is O(1); remove-min is O(log n), and you can repeat the find-min/remove-min pair three times.
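For reference, a sketch of that heap approach (my code): each pop is O(log n), so reporting k elements this way is O(k log n), which is why it falls short of the O(k) requirement.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Report the k smallest elements by popping a min-heap. find-min is
// O(1) but each remove-min is O(log n), so the report is O(k log n).
std::vector<int> k_smallest_heap(
        std::priority_queue<int, std::vector<int>, std::greater<int>> pq,
        std::size_t k) {
    std::vector<int> out;
    while (out.size() < k && !pq.empty()) {
        out.push_back(pq.top());   // find-min: O(1)
        pq.pop();                  // remove-min: O(log n)
    }
    return out;   // pq was copied (O(n)); pass by reference to avoid that
}
```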
Your suggested solution does not work in O(k) time. Starting an in-order traversal takes O(log n) time. Even when you stop after you find one element, you still need to get to the left-most leaf first.
I can think of two solutions:
Use a skip list: it has O(log n) insertion and deletion, and the underlying list is sorted. However, those bounds are expected (average-case), not worst-case.
Use a modified AVL-tree, keep a dedicated pointer to the left-most node and update it on insertions and deletions. Also nodes must have pointers to their parent node, so that you could do in-order traversal starting at the left-most node. Or use a threaded tree.
I've decided to store an up-to-date pointer to the left-most node at the root, plus pointers to the predecessor and successor at every node.
Thus every insertion costs O(log n) for the insertion itself plus O(log n) + O(log n) to find both the predecessor and the successor: O(log n) + O(log n) + O(log n) = O(log n).
These pointers also allow the affected nodes to be updated after an insertion in O(1): travel to the predecessor or successor and update their pointers, also in O(1).
Having a pointer to the left-most node at the root means travelling there is O(1), and then following successor pointers and printing the first k nodes takes O(k) time.
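The bookkeeping above is essentially what a threaded tree automates, and it is also what std::set already gives you: the standard requires begin() to be O(1) (the left-most node is effectively the dedicated pointer discussed above), and advancing the iterator k times from the minimum crosses each tree edge at most twice, so the whole walk is O(k). A small sketch, assuming plain int keys:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Report the k smallest elements of a dynamic set in O(k) time.
// std::set is a balanced BST (a red-black tree in the common library
// implementations); begin() is O(1) and the in-order walk from the
// minimum costs O(k) in total.
std::vector<int> k_smallest(const std::set<int>& s, std::size_t k) {
    assert(k <= s.size());
    std::vector<int> out;
    out.reserve(k);
    auto it = s.begin();                        // O(1): left-most node
    for (std::size_t i = 0; i < k; ++i, ++it)
        out.push_back(*it);                     // amortized O(1) per step
    return out;
}

int main() {
    std::set<int> s;                            // insert/erase: O(log n)
    for (int x : {42, 7, 19, 3, 88, 51}) s.insert(x);
    s.erase(19);
    std::vector<int> three = k_smallest(s, 3);  // {3, 7, 42}
    (void)three;
}
```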

Kth minimum in a Range

Given an array of integers and some query operations.
The query operations are of two types:
1. Update the value at index i to x.
2. Given two integers i and j, find the k-th minimum in that range (i.e. among the elements between indices i and j, both inclusive).
I can find the range minimum using a segment tree, but I could not do the same for the k-th minimum.
Can anyone help me?
Here is an O(polylog n) per query solution that does not assume a constant k, so k can vary between queries. The main idea is to use a segment tree, where every node represents an interval of array indices and contains a multiset (balanced binary search tree) of the values in the represented array segment. The update operation is pretty straightforward:
Walk up the segment tree from the leaf (the array index you're updating). You will encounter all nodes that represent an interval of array indices that contain the updated index. At every node, remove the old value from the multiset and insert the new value into the multiset. Complexity: O(log^2 n)
Update the array itself.
We notice that every array element will be in O(log n) multisets, so the total space usage is O(n log n). With linear-time merging of multisets we can build the initial segment tree in O(n log n) as well (there's O(n) work per level).
What about queries? We are given a range [i, j] and a rank k and want to find the k-th smallest element in a[i..j]. How do we do that?
Find a disjoint coverage of the query range using the standard segment tree query procedure. We get O(log n) disjoint nodes, the union of whose multisets is exactly the multiset of values in the query range. Let's call those multisets s_1, ..., s_m (with m <= ceil(log_2 n)). Finding the s_i takes O(log n) time.
Do a select(k) query on the union of s_1, ..., s_m. See below.
So how does the selection algorithm work? There is one really simple algorithm to do this.
We have s_1, ..., s_m and k given and want to find the smallest x in a such that s_1.rank(x) + ... + s_m.rank(x) >= k - 1, where rank returns the number of elements smaller than x in the respective BBST (this can be implemented in O(log n) if we store subtree sizes).
Let's just use binary search to find x! We walk through the BBST of the root, do a couple of rank queries and check whether their sum is larger than or equal to k. It's a predicate monotone in x, so binary search works. The answer is then the minimum of the successors of x in any of the s_i.
Complexity: O(n log n) preprocessing and O(log^3 n) per query.
So in total we get a runtime of O(n log n + q log^3 n) for q queries. I'm sure we could get it down to O(q log^2 n) with a cleverer selection algorithm.
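A sketch of this structure in C++ (my code, with two assumptions worth flagging: std::multiset has no O(log n) rank operation, so I substitute GCC's __gnu_pbds policy tree, which does, keyed on (value, position) pairs so duplicates stay distinct; and the build below just inserts element by element in O(n log^2 n) instead of the O(n log n) merging build described above):

```cpp
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <limits>
#include <utility>
#include <vector>
using namespace __gnu_pbds;

// Order-statistics multiset: (value, position) keys keep duplicates
// distinct; order_of_key gives the rank of a key in O(log n).
using OrderedMultiset = tree<std::pair<int, int>, null_type,
                             std::less<std::pair<int, int>>,
                             rb_tree_tag, tree_order_statistics_node_update>;

struct KthSegTree {
    int n;
    std::vector<int> a;
    std::vector<OrderedMultiset> node;   // one multiset per tree node

    explicit KthSegTree(const std::vector<int>& arr)
        : n((int)arr.size()), a(arr), node(4 * arr.size()) {
        for (int i = 0; i < n; ++i) add(1, 0, n - 1, i, a[i]);
    }

    void add(int v, int lo, int hi, int pos, int val) {
        node[v].insert({val, pos});
        if (lo == hi) return;
        int mid = (lo + hi) / 2;
        if (pos <= mid) add(2 * v, lo, mid, pos, val);
        else            add(2 * v + 1, mid + 1, hi, pos, val);
    }

    void remove(int v, int lo, int hi, int pos, int val) {
        node[v].erase({val, pos});
        if (lo == hi) return;
        int mid = (lo + hi) / 2;
        if (pos <= mid) remove(2 * v, lo, mid, pos, val);
        else            remove(2 * v + 1, mid + 1, hi, pos, val);
    }

    // point update: O(log^2 n) -- one multiset op per tree level
    void update(int pos, int val) {
        remove(1, 0, n - 1, pos, a[pos]);
        a[pos] = val;
        add(1, 0, n - 1, pos, val);
    }

    // collect the O(log n) cover nodes of [ql, qr]
    void cover(int v, int lo, int hi, int ql, int qr, std::vector<int>& out) {
        if (qr < lo || hi < ql) return;
        if (ql <= lo && hi <= qr) { out.push_back(v); return; }
        int mid = (lo + hi) / 2;
        cover(2 * v, lo, mid, ql, qr, out);
        cover(2 * v + 1, mid + 1, hi, ql, qr, out);
    }

    // k-th smallest (1-based) in a[ql..qr]: binary search on the rank
    // within the root's multiset. O(log^3 n) per query.
    int kth(int ql, int qr, int k) {
        std::vector<int> nodes;
        cover(1, 0, n - 1, ql, qr, nodes);
        int lo = 0, hi = n - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            int x = node[1].find_by_order(mid)->first;   // candidate value
            long long cnt = 0;                           // # elems <= x in range
            for (int v : nodes)
                cnt += node[v].order_of_key({x, std::numeric_limits<int>::max()});
            if (cnt >= k) hi = mid; else lo = mid + 1;
        }
        return node[1].find_by_order(lo)->first;
    }
};
```

The query binary-searches over ranks in the root's tree, evaluating "how many elements of the cover nodes are <= x" at each step; that is exactly the monotone predicate described above.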
UPDATE: If we are looking for an offline algorithm that can process all queries at once, we can get O((n + q) * log n * log (q + n)) using the following algorithm:
Preprocess all queries and create a set of all values that ever occurred in the array. There will be at most q + n of those.
Build a segment tree, but this time not on the array, but on the set of possible values.
Every node in the segment tree represents an interval of values and maintains a set of the positions where these values occur.
To answer a query, start at the root of the segment tree. Check how many positions in the left child of the root lie in the query interval (we can do that by doing two searches in the BBST of positions). Let that number be m. If k <= m, recurse into the left child. Otherwise recurse into the right child, with k decremented by m.
For updates, remove the position from the O(log (q + n)) nodes that cover the old value and insert it into the nodes that cover the new value.
The advantage of this approach is that we don't need subtree sizes, so we can implement this with most standard library implementations of balanced binary search trees (e.g. set<int> in C++).
We can turn this into an online algorithm by changing the segment tree out for a weight-balanced tree such as a BB[α] tree. It has logarithmic operations like other balanced binary search trees, but allows us to rebuild an entire subtree from scratch when it becomes unbalanced by charging the rebuilding cost to the operations that must have caused the imbalance.
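A compact sketch of this offline variant (again my code, with one caveat: counting the positions of a node that fall inside [i, j] is a rank query, and with plain std::set that count via std::distance is linear, so this sketch uses the policy tree's order_of_key for the "two searches"):

```cpp
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <algorithm>
#include <vector>
using namespace __gnu_pbds;

using PosSet = tree<int, null_type, std::less<int>,
                    rb_tree_tag, tree_order_statistics_node_update>;

// Segment tree over the sorted set of all values that ever occur;
// each node stores the positions currently holding a value in range.
struct ValueSegTree {
    std::vector<int> vals;    // sorted, de-duplicated values
    std::vector<PosSet> node;
    int U;

    explicit ValueSegTree(std::vector<int> allValues) : vals(std::move(allValues)) {
        std::sort(vals.begin(), vals.end());
        vals.erase(std::unique(vals.begin(), vals.end()), vals.end());
        U = (int)vals.size();
        node.resize(4 * U);
    }

    int index(int value) const {   // rank of value among all known values
        return (int)(std::lower_bound(vals.begin(), vals.end(), value) - vals.begin());
    }

    void change(int v, int lo, int hi, int vi, int pos, bool ins) {
        if (ins) node[v].insert(pos); else node[v].erase(pos);
        if (lo == hi) return;
        int mid = (lo + hi) / 2;
        if (vi <= mid) change(2 * v, lo, mid, vi, pos, ins);
        else           change(2 * v + 1, mid + 1, hi, vi, pos, ins);
    }

    void insert(int pos, int value) { change(1, 0, U - 1, index(value), pos, true); }
    void erase(int pos, int value)  { change(1, 0, U - 1, index(value), pos, false); }

    // # of positions in [i, j] stored at node v -- the "two searches".
    int count(int v, int i, int j) const {
        return (int)(node[v].order_of_key(j + 1) - node[v].order_of_key(i));
    }

    // k-th smallest (1-based) among positions i..j: descend from root.
    int kth(int i, int j, int k) const {
        int v = 1, lo = 0, hi = U - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            int m = count(2 * v, i, j);   // positions in the left child
            if (k <= m) { v = 2 * v; hi = mid; }
            else        { k -= m; v = 2 * v + 1; lo = mid + 1; }
        }
        return vals[lo];
    }
};
```

To process a query stream offline: collect all values first, construct the tree, insert (pos, a[pos]) for every position, then interleave kth(i, j, k) calls with erase-old/insert-new pairs for updates.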
If this is a programming contest problem, then you might be able to get away with the following O(n log(n) + q n^0.5 log(n)^1.5)-time algorithm. It is set up to use the C++ STL well and has a much better big-O constant than Niklas's (previous?) answer on account of using much less space and indirection.
Divide the array into k chunks of length n/k. Copy each chunk into the corresponding locations of a second array and sort it. To update: copy the chunk that changed into the second array and sort it again (time O((n/k) log(n/k))). To query: copy to a scratch array the at most 2(n/k - 1) elements that belong to a chunk partially overlapping the query interval. Sort them. Use one of the answers to this question to select the element of the requested rank out of the union of the sorted scratch array and the fully overlapping chunks, in time O(k log(n/k)^2). The optimal setting of k in theory is (n/log(n))^0.5. It's possible to shave another log(n)^0.5 factor using the complicated algorithm of Frederickson and Johnson.
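A simplified sketch of the chunked structure (my code; it parameterizes by the chunk length rather than the number of chunks, assumes 32-bit integer values, and replaces the Frederickson-Johnson-style selection with a plain binary search over the value range, which costs one extra log factor but is much easier to get right in a contest):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sqrt-decomposition for "k-th minimum in a[i..j]" with point updates.
// Each chunk keeps a sorted shadow copy; a query binary-searches over
// the value range, counting elements <= x with upper_bound in fully
// covered chunks and a direct scan in the (at most two) partial chunks.
struct ChunkedKth {
    int n, c;                              // array size, chunk length
    std::vector<int> a;                    // the live array
    std::vector<std::vector<int>> sorted;  // one sorted copy per chunk

    ChunkedKth(std::vector<int> arr, int chunkLen)
        : n((int)arr.size()), c(chunkLen), a(std::move(arr)) {
        for (int s = 0; s < n; s += c) {
            sorted.emplace_back(a.begin() + s, a.begin() + std::min(s + c, n));
            std::sort(sorted.back().begin(), sorted.back().end());
        }
    }

    void update(int i, int x) {            // re-sort one chunk
        a[i] = x;
        int b = i / c, s = b * c, e = std::min(s + c, n);
        sorted[b].assign(a.begin() + s, a.begin() + e);
        std::sort(sorted[b].begin(), sorted[b].end());
    }

    int64_t countLE(int i, int j, int x) const {  // # of a[i..j] <= x
        int64_t cnt = 0;
        int b = i;
        while (b <= j) {
            int blk = b / c, s = blk * c, e = std::min(s + c, n) - 1;
            if (b == s && e <= j) {        // chunk fully inside the query
                const auto& v = sorted[blk];
                cnt += std::upper_bound(v.begin(), v.end(), x) - v.begin();
                b = e + 1;
            } else {                       // partial chunk: scan directly
                int stop = std::min(j, e);
                for (; b <= stop; ++b) cnt += (a[b] <= x);
            }
        }
        return cnt;
    }

    // k-th minimum (1-based) in a[i..j]; assumes 32-bit int values.
    int kth(int i, int j, int k) const {
        int64_t lo = INT32_MIN, hi = INT32_MAX;
        while (lo < hi) {
            int64_t mid = lo + (hi - lo) / 2;
            if (countLE(i, j, (int)mid) >= k) hi = mid; else lo = mid + 1;
        }
        return (int)lo;
    }
};
```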
Perform a modification of bucket sort: create a bucket that contains the numbers in the range you want, then sort this bucket only and find the k-th minimum.
Damn, this solution can't update an element, but it at least finds that k-th element; here you'll get some ideas so you can think of a solution that supports updates. Try pointer-based B-trees.
This solution uses O(n log n) space and O(q log^2 n) time for q queries; further down I explain how to get O(log n) per query.
So, you'll need to do the following:
1) Build a segment tree over the given array.
2) For every node, instead of storing one number, store a whole array: it has to contain the values from that node's segment, sorted, so its size equals the number of leaves below the node.
3) To make such an array, merge the two arrays of the node's two children. But not only that: for every element of the merged array, remember which child array it came from and its position there, plus a pointer to the first following element that did not come from the same child array.
4) With this structure, you can check how many numbers in some segment S are lower than a given value x. Find (with binary search) the first number in the root node's array that is >= x. Then, using the pointers you have made, you can get the answer to the same question in the two child arrays in O(1) each, and keep descending. You stop this descent at every node whose segment lies entirely inside or entirely outside the given segment S. The time complexity is O(log n): O(log n) to find the first element >= x, plus O(1) for each of the O(log n) segments in the decomposition of S.
5) Do a binary search over the answer value.
That was the solution with O(log^2 n) per query. Here is how you can reduce it to O(log n):
1) Before doing all I wrote above, you need to transform the problem. Sort all the numbers and remember, for each, its position in the original array. These positions now form the array you are working on; call that array P.
If the bounds of the query segment are a and b, you need to find the k-th element of P that is between a and b by value (not by index). That element is the index of your result in the original array.
2) To find that k-th element, you do a kind of backtracking with complexity O(log n). You will be asking for the number of elements between index 0 and some other index whose values are between a and b.
3) Suppose you know the answer to such a question for some prefix (0, h). Get answers to the same kind of question for the segments of the tree that begin at h, starting from the largest one. As long as the answer for (0, h) plus the answer you last got is at least k, move on to the next smaller segment; otherwise extend h past that segment and add its count. Keep updating h until there is only one segment in the tree that begins at h. That h is the index of the number you are looking for in the problem you stated.
Getting the answer to such a question for some segment of the tree takes exactly O(1), because you already know the answer for its parent's segment, and using the pointers I explained in the first algorithm you can get the answer for the current segment in O(1).
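For concreteness, here is a common O(log^2 n)-per-query implementation of the transformed (static, no updates) problem; this is my code, and a top-down variant: instead of walking h forward, it descends the tree of value ranks, counting positions inside [a, b] with two binary searches per level. The merge pointers described above are exactly what turn those binary searches into O(1) lookups.

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// Static k-th minimum in a[a..b] (no updates), O(log^2 n) per query.
// Sort the values; a segment tree over value ranks stores, at each
// node, the original positions of its value range in sorted order.
struct StaticKth {
    int n;
    std::vector<int> vals;               // values in increasing order
    std::vector<std::vector<int>> pos;   // sorted positions per node

    explicit StaticKth(const std::vector<int>& a) : n((int)a.size()) {
        std::vector<std::pair<int, int>> byVal(n);   // (value, position)
        for (int i = 0; i < n; ++i) byVal[i] = {a[i], i};
        std::sort(byVal.begin(), byVal.end());
        vals.resize(n);
        for (int i = 0; i < n; ++i) vals[i] = byVal[i].first;
        pos.assign(4 * n, {});
        build(1, 0, n - 1, byVal);
    }

    void build(int v, int lo, int hi,
               const std::vector<std::pair<int, int>>& byVal) {
        if (lo == hi) { pos[v] = {byVal[lo].second}; return; }
        int mid = (lo + hi) / 2;
        build(2 * v, lo, mid, byVal);
        build(2 * v + 1, mid + 1, hi, byVal);
        std::merge(pos[2 * v].begin(), pos[2 * v].end(),     // O(n) work
                   pos[2 * v + 1].begin(), pos[2 * v + 1].end(),
                   std::back_inserter(pos[v]));              // per level
    }

    // value of the k-th smallest (1-based) among a[a..b]
    int kth(int a, int b, int k) const {
        int v = 1, lo = 0, hi = n - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            const auto& L = pos[2 * v];
            // how many of the smaller-by-value half sit inside [a, b]?
            int m = (int)(std::upper_bound(L.begin(), L.end(), b)
                        - std::lower_bound(L.begin(), L.end(), a));
            if (k <= m) { v = 2 * v; hi = mid; }
            else        { k -= m; v = 2 * v + 1; lo = mid + 1; }
        }
        return vals[lo];
    }
};
```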

Find median in O(1) in binary tree

Suppose I have a balanced BST (binary search tree). Each tree node contains a special field count, which counts all descendants of that node plus the node itself. They call this data structure an order statistics tree.
This data structure supports two operations of O(logN):
rank(x) -- number of elements that are less than x
findByRank(k) -- find the node with rank == k
Now I would like to add a new operation median() to find the median. Can I assume this operation is O(1) if the tree is balanced?
Unless the tree is complete, the median might be a leaf node, so in the general case the cost will be O(log N). I guess there is a data structure with the requested properties and an O(1) findMedian operation (perhaps a skip list plus a pointer to the median node; I'm not sure about the findByRank and rank operations though), but a balanced BST is not one of them.
If the tree is complete (i.e. all levels completely filled), yes you can.
In a balanced order statistics tree, finding the median is O(log N). If it is important to find the median in O(1) time, you can augment the data structure by maintaining a pointer to the median. The catch, of course, is that you would need to update this pointer during each Insert or Delete operation. Updating the pointer would take O(log N) time, but since those operations already take O(log N) time, the extra work of updating the median pointer does not change their big-O cost.
As a practical matter, this only makes sense if you do a lot of "find median" operations compared to the number of insertions/deletions.
If desired, you could reduce the cost of updating the median pointer during Insert/Delete to O(1) by using a (doubly) threaded binary tree, but Insert/Delete would still be O(log N).
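A minimal sketch of the select-by-rank walk that the count fields make possible (the names are mine; this shows the O(log N) path, with the cached-pointer variant left as described above):

```cpp
#include <cstddef>

// Minimal order-statistics lookup: each node knows its subtree size,
// so selecting the element of rank k walks one root-to-node path,
// O(height) = O(log N) in a balanced tree.
struct Node {
    int key;
    std::size_t count = 1;       // nodes in this subtree, incl. itself
    Node *left = nullptr, *right = nullptr;
};

std::size_t size(const Node* n) { return n ? n->count : 0; }

// 0-based rank: returns the node holding the (k+1)-th smallest key.
const Node* findByRank(const Node* n, std::size_t k) {
    while (n) {
        std::size_t l = size(n->left);
        if (k == l) return n;
        if (k < l) n = n->left;
        else { k -= l + 1; n = n->right; }
    }
    return nullptr;              // k out of range
}

// Lower median: O(log N), not O(1), unless a median pointer is cached
// and maintained on every Insert/Delete as described above.
const Node* median(const Node* root) {
    return root ? findByRank(root, (size(root) - 1) / 2) : nullptr;
}
```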

Sorting an n-element array with O(log n) distinct elements in O(n log log n) worst-case time

The problem at hand is what's in the title: give an algorithm that sorts an n-element array with O(log n) distinct elements in O(n log log n) worst-case time. Any ideas?
Further, how do you generally handle arrays with repeated (non-distinct) elements?
O(log(log(n))) time is enough for you to do a primitive operation in a search tree with O(log(n)) elements.
Thus, maintain a balanced search tree of all the distinct elements you have seen so far. Each node in the tree additionally contains a list of all elements you have seen with that key.
Walk through the input elements one by one. For each element, try to insert it into the tree (which takes O(log log n) time). If you find you've already seen an equal element, just insert it into the auxiliary list in the already-existing node.
After traversing the entire list, walk through the tree in order, concatenating the auxiliary lists. (If you take care to append to the auxiliary lists at their tail ends, this is even a stable sort.)
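A short sketch of this answer in C++ (my code, using std::map as the balanced tree; since the map holds one node per distinct value, every operation inside the loop is O(log d) = O(log log n) for d = O(log n) distinct values):

```cpp
#include <map>
#include <vector>

// Sort an array with d = O(log n) distinct values in O(n log d) =
// O(n log log n). Appending equal elements to a per-key list keeps
// the sort stable.
std::vector<int> sort_few_distinct(const std::vector<int>& a) {
    std::map<int, std::vector<int>> buckets;  // balanced BST keyed by value
    for (int x : a)
        buckets[x].push_back(x);              // O(log d) to find the node
    std::vector<int> out;
    out.reserve(a.size());
    for (const auto& [key, list] : buckets)   // in-order: ascending keys
        out.insert(out.end(), list.begin(), list.end());
    return out;
}
```

For plain integers a (value, count) map would do just as well; the per-key lists only matter when you sort records by key and want stability.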
A simple O(log n)-space solution would be:
find the distinct elements using a balanced tree (O(log n) space, O(n log log n) time);
then you can use this tree to always pick the correct pivot for quicksort.
I wonder if there is an O(log log n)-space solution.
Some details about using a tree:
You should be able to use a red-black tree (or another balanced-tree-based sorting approach) with nodes that hold both a value and a counter: a pair (value, count).
When you insert a new value, you either create a new node or increment the count of the node holding that value (if such a node already exists). If you just increment the counter it takes O(H), where H is the height of the tree (to find the node); if you need to create the node it also takes O(H) to create and position it (the constants are bigger, but it's still O(H)).
This ensures that the tree has no more than O(log n) nodes (because there are O(log n) distinct values), so H = O(log log n). This means each insertion takes O(log log n), and with n insertions that is O(n log log n).

Optimal filling order for binary tree

I have a problem where I need to store changing data values v_i (integers) for constant keys i (also integers, in some range, say [1;M]). I need to be able to quickly draw a random key weighted by the values v_i, i.e. the probability of drawing key k should be v_k / (sum(i=1...M) v_i).
The best idea I could come up with is using a binary tree and storing the partial sum over the values in the subtree rooted at k as the value for key k (still in the range [1;M]). Then, whenever a value changes, I need to update its node and all parent nodes in the tree, which takes O(log M) time since the keys are fixed and the binary tree is thus perfectly balanced. Drawing a random element as above also takes O(log M) time (at each level of the tree, one compares a random number, say in the range (0,1), against the relative weights of the left subtree, the right subtree, and the node itself) and is much faster than the naive algorithm (take a random number r, iterate through the elements to find the least k such that sum(i=1...k) v_i >= r * sum(i=1...M) v_i; takes O(M) time).
The question I now have is how to optimize the placement of the tree nodes in memory in order to minimize cache misses. Since all keys are known and remain constant, this is essentially the order in which I should allocate memory for the tree nodes.
Thanks!!
I don't think there is an optimal filling order for a binary tree beyond something like pre-order, post-order, or in-order filling. Isn't your question really asking how a cache works in general? Unfortunately I don't know that myself; maybe a simpler flat hash array would be more efficient in your case?
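One concrete answer to the layout question (my suggestion, not from the answer above): store the partial sums implicitly in a single flat array as a Fenwick (binary indexed) tree. There are no per-node allocations at all, updates and weighted draws are O(log M), and the descent touches a predictable sequence of slots in one contiguous allocation:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Implicit-layout alternative: a Fenwick tree keeps the partial sums
// in one flat array. Keys are 1..M; weights are non-negative integers.
struct WeightedSampler {
    int M, logM;
    std::vector<int64_t> fen;   // fen[i] covers a power-of-two range
    std::vector<int64_t> w;     // current weights, kept for delta updates

    explicit WeightedSampler(int M) : M(M), fen(M + 1, 0), w(M + 1, 0) {
        logM = 0;
        while ((1 << (logM + 1)) <= M) ++logM;
    }

    void set(int k, int64_t v) {         // O(log M) point update
        int64_t d = v - w[k];
        w[k] = v;
        for (int i = k; i <= M; i += i & -i) fen[i] += d;
    }

    int64_t total() const {              // prefix sum over all keys
        int64_t s = 0;
        for (int i = M; i > 0; i -= i & -i) s += fen[i];
        return s;
    }

    // Draw key k with probability w[k] / total() by descending the
    // implicit tree: find the least k whose prefix sum exceeds r.
    // Precondition: total() > 0.
    int sample(std::mt19937_64& rng) const {
        std::uniform_int_distribution<int64_t> dist(0, total() - 1);
        int64_t r = dist(rng);
        int k = 0;
        for (int step = 1 << logM; step > 0; step >>= 1)
            if (k + step <= M && fen[k + step] <= r) {
                k += step;
                r -= fen[k];
            }
        return k + 1;
    }
};
```

The sampling loop is the same "compare against the left weight, then descend" walk described in the question, just performed on the implicit tree encoded by the low bits of the array indices.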
