I'm working on a question from a Data Structures test. I need to suggest a data structure S that complies with the following requirements:
NOTE: S should allow multiple objects with the same key to be inserted.
INSERT(S, k): insert an object with key k into S, with time complexity O(lg n)
DELETE_OLD(S): delete the oldest object in S, with time complexity O(lg n)
DELETE_OLD_MIN(S): delete the oldest object that has the lowest key in S, with time complexity O(lg n)
MAX_COUNT(S): return the key with the maximum frequency (the most common key in S), with time complexity O(lg n)
FREQ_SUM(S, z): find two keys a and b in S such that frequency(a) + frequency(b) = z, with time complexity O(lg n)
I tried some ideas but could not get past the last two.
EDIT: The question "A data structure traversable by both order of insertion and order of magnitude" does NOT answer my question. Please do not mark it as a duplicate. Thank you.
EDIT #2: Example for what freq_sum(S,z) does:
Suppose that one called freq_sum(S,5) over the data structure that contains: 2, 2, 2, 3, 4, 4, 4, 5, 5
The combination 2 and 5 could be a possible answer, because 2 appears 3 times in the structure and 5 appears 2 times, so 3 + 2 = z.
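For illustration only (this ignores the O(lg n) requirement, and the method name and use of a HashMap are mine, not part of the question), a brute-force sketch of what FREQ_SUM has to find:

import java.util.*;

// Brute-force illustration of FREQ_SUM semantics (not O(lg n)):
// count how often each key occurs, then look for two distinct keys
// whose frequencies add up to z.
class FreqSumDemo {
    static boolean freqSumExists(int[] keys, int z) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (int k : keys) freq.merge(k, 1, Integer::sum);
        List<Integer> distinct = new ArrayList<>(freq.keySet());
        for (int i = 0; i < distinct.size(); i++)
            for (int j = i + 1; j < distinct.size(); j++)
                if (freq.get(distinct.get(i)) + freq.get(distinct.get(j)) == z)
                    return true;   // e.g. keys 2 (3 times) and 5 (2 times) for z = 5
        return false;
    }

    public static void main(String[] args) {
        System.out.println(freqSumExists(new int[]{2, 2, 2, 3, 4, 4, 4, 5, 5}, 5)); // true
    }
}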
You could use a Red-Black Tree to accomplish this.
Red-Black Trees are fast data structures that satisfy the requirements you have stated above (the last two would require slight modification to the structure).
You would simply have to allow for duplicate keys, since Red-Black Trees follow the properties of Binary Search Trees. Here is an example of a BST allowing duplicate keys.
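As a minimal sketch of one common approach (my own illustration, not necessarily the one in the referenced example): keep a single node per key and store an occurrence count in it.

// Sketch: a plain (unbalanced) BST that handles duplicates by counting.
// A red-black tree would do the same, plus rebalancing on insert.
class CountingBst {
    static class Node {
        int key, count = 1;
        Node left, right;
        Node(int k) { key = k; }
    }
    Node root;

    void insert(int key) {
        if (root == null) { root = new Node(key); return; }
        Node cur = root;
        while (true) {
            if (key == cur.key) { cur.count++; return; }      // duplicate: bump the count
            if (key < cur.key) {
                if (cur.left == null) { cur.left = new Node(key); return; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = new Node(key); return; }
                cur = cur.right;
            }
        }
    }
}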
Red-Black Trees are sufficient to maintain running times of:
Search: O(log N)
Insert: O(log N)
Delete: O(log N)
Space:  O(N)
EDIT:
You could implement a self-balancing binary search tree, with modifications to allow duplicate keys and to find the oldest key (see the reference above). Building on this, a Splay Tree can meet all of your requirements with an amortized runtime of O(log N).
For finding FREQ_SUM(S,z):
Search runs in amortized O(log N) time, and you are searching for two nodes in the tree, so you end up with a runtime of O(2 log N). Since scalar constants are ignored in asymptotic analysis, this is still O(log N). Finding the node for z is another O(log N) search.
This is the fundamental runtime of a search utilizing a Binary Search Tree, which the Splay tree is built on.
By using the Split operation, you can return two new trees: one that contains all of the elements less than or equal to x, and the other that contains all of the elements greater than x.
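As a sketch of the split idea on a plain, unbalanced BST (a real splay or red-black implementation would also restore balance; the class and method names here are mine):

// Split a BST around x into (keys <= x, keys > x).
class SplitDemo {
    static class Node {
        int key;
        Node left, right;
        Node(int k) { key = k; }
    }

    // Returns a two-element array: [tree with keys <= x, tree with keys > x].
    static Node[] split(Node root, int x) {
        if (root == null) return new Node[]{null, null};
        if (root.key <= x) {
            Node[] parts = split(root.right, x);  // everything in root.left is already <= x
            root.right = parts[0];
            return new Node[]{root, parts[1]};
        } else {
            Node[] parts = split(root.left, x);   // everything in root.right is already > x
            root.left = parts[1];
            return new Node[]{parts[0], root};
        }
    }
}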
Related
For the time efficiency of inserting into a binary search tree,
I know that the best/average case of insertion is O(log n), whereas the worst case is O(n).
What I'm wondering is whether there is any way to ensure that we always get the best/average case when inserting, besides implementing an AVL tree (balanced BST)?
Thanks!
There is no guaranteed log n complexity without balancing a binary search tree. While searching/inserting/deleting, you have to navigate through the tree to reach the right position and perform the operation. The key question is: how many steps are needed to get to that position? If the BST is balanced, you can expect about 2^(i-1) nodes at level i. This means that if the tree has k levels (k is called the height of the tree), the number of nodes is 1 + 2 + 4 + ... + 2^(k-1) = 2^k - 1 = n, which gives k ≈ log n, and that is the number of steps needed to navigate from the root to a leaf.
Having said that, there are various implementations of balanced BSTs. You mentioned AVL; another very popular one is the red-black tree, which is used e.g. in C++ to implement std::map and in Java to implement TreeMap.
The worst case, O(n), can happen when you don't balance the BST and it degenerates into a linked list. Clearly, to reach the end of that list (which is the worst case), you have to iterate through the whole list, and this requires n steps.
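For example, in Java you get this balancing for free from java.util.TreeMap, which is backed by a red-black tree (the class name SortedInsertDemo is just for this sketch):

import java.util.TreeMap;

// java.util.TreeMap is a red-black tree, so inserts stay O(log n)
// even when the keys arrive in sorted order (no degeneration into a list).
public class SortedInsertDemo {
    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            map.put(i, "value-" + i);   // sorted insertion order
        }
        System.out.println(map.firstKey() + " .. " + map.lastKey()); // 0 .. 9
    }
}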
The easiest way is to store the two trees in two arrays, merge them, and build a new red-black tree from the sorted array, which takes O(m + n) time.
Is there an algorithm with less time complexity?
You can merge two red-black trees in time O(m log(n/m + 1)) where n and m are the input sizes and, WLOG, m ≤ n. Notice that this bound is tighter than O(m+n). Here's some intuition:
When the two trees are similar in size (m ≈ n), the bound is approximately O(m) = O(n) = O(n + m).
When one tree is far larger than the other (m ≪ n, say m is a small constant), the bound is approximately O(log n).
You can find a brief description of the algorithm here. A more in-depth description which generalizes to other balancing schemes (AVL, BB[α], Treap, ...) can be found in a recent paper.
I think that if you have generic sets (so a generic red-black tree), you can't use the solution suggested by Sam Westrick, because it assumes that all elements in the first set are less than all elements in the second set. Cormen (the best book for learning algorithms and data structures) also requires this condition for joining two red-black trees.
Since you need to compare each element of both red-black trees (of sizes m and n), you will have to accept at least O(m + n) time complexity. (There is a way to do it with O(1) space complexity, but that is a separate concern that has nothing to do with your question.) If you do not iterate over and check each element in each red-black tree, you cannot guarantee that the new red-black tree will be sorted. I can think of another way of merging two red-black trees, called "In-Place Merge using a DLL" (doubly linked list), but this one also results in O(m + n) time complexity:
1. Convert the two given red-black trees into doubly linked lists: O(m + n) time.
2. Merge the two sorted linked lists: O(m + n) time.
3. Build a balanced red-black tree from the merged list created in step 2: O(m + n) time.
The time complexity of this method is also O(m + n).
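A minimal sketch of those three steps (using plain Java lists in place of an actual doubly linked list, just to show the O(m + n) flow; the class and method names are mine):

import java.util.*;

class MergeTreesDemo {
    static class Node { int key; Node left, right; Node(int k) { key = k; } }

    // Step 1: an in-order traversal flattens each tree into a sorted list, O(m) / O(n).
    static void inorder(Node t, List<Integer> out) {
        if (t == null) return;
        inorder(t.left, out);
        out.add(t.key);
        inorder(t.right, out);
    }

    // Step 2: merge two sorted lists, O(m + n).
    static List<Integer> mergeSorted(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size())
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    // Step 3: build a balanced tree (height O(log(m + n))) from the sorted list, O(m + n).
    // A real red-black tree would additionally color the nodes.
    static Node build(List<Integer> sorted, int lo, int hi) {
        if (lo > hi) return null;
        int mid = (lo + hi) >>> 1;
        Node root = new Node(sorted.get(mid));
        root.left = build(sorted, lo, mid - 1);
        root.right = build(sorted, mid + 1, hi);
        return root;
    }
}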
So, because you have to compare the elements of each tree with the elements of the other tree, you end up with at least O(m + n).
Consider a scenario where the data to be inserted is always in order, i.e. (1, 5, 12, 20, ...) with A[i] >= A[i-1], or (1000, 900, 20, 1, -2, ...) with A[i] <= A[i-1].
To support such a dataset, is it more efficient to use a binary search tree or an array?
(Side note: I am just trying to run some naive analysis for a timed hash map of type (K, T, V) and the time is always in order. I am debating using Map<K, BST<T,V>> vs Map<K, Array<T,V>>.)
As I understand, the following worst-case costs apply:

          Array     BST
Space     O(n)      O(n)
Search    O(log n)  O(n)
Max/Min   O(1)      O(1) *
Insert    O(1) **   O(n)
Delete    O(n)      O(n)

*: with Max/Min pointers
**: amortized time complexity
Q: To be clearer about the question: which of these two data structures should I be using for such a scenario? Please feel free to discuss other data structures, like self-balancing BSTs, etc.
EDIT:
Please note I didn't consider the complexity for a balanced binary search tree (red-black tree, etc.). As mentioned, this is a naive analysis using a plain binary search tree.
Deletion has been updated to O(n) (I hadn't considered the time to search for the node).
Max/Min for a skewed BST will cost O(n), but it's also possible to store pointers for max and min, so the overall time complexity will be O(1).
See the table below, which will help you choose. Note that I am assuming two things:
1) Data will always come in sorted order (you mentioned this), i.e. if 1000 is the last value inserted, new data will always be greater than 1000. If data does not come in sorted order, insertion can take O(log n), but deletion will not change.
2) Your "array" is actually something like java.util.ArrayList; in short, its length is mutable (otherwise it is unfair to compare a mutable and an immutable data structure). If it is a plain fixed-size array, deletion will take amortized O(log n) {O(log n) to search and O(1) to delete, amortized when you need to create a new array} and insertion will take amortized O(1) {you occasionally need to create a new array}.
          ArrayList  BST
Space     O(n)       O(n)
Search    O(log n)   O(log n) {optimized from O(n)}
Max/Min   O(1)       O(log n) {instead of O(1): you need to traverse down to a leaf}
Insert    O(1)       O(log n) {optimized from O(n)}
Delete    O(log n)   O(log n)
So, based on this, ArrayList seems better.
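As a sketch of that conclusion (the class and method names are mine), assuming timestamps only ever arrive in non-decreasing order:

import java.util.*;

// Append-only, time-ordered storage: O(1) amortized insert at the end,
// O(log n) search by binary search, O(1) oldest/newest.
class TimedSeries<V> {
    private final List<Long> times = new ArrayList<>();
    private final List<V> values = new ArrayList<>();

    // Insert: amortized O(1), since new timestamps always go at the end.
    void insert(long time, V value) {
        times.add(time);
        values.add(value);
    }

    // Search by timestamp: O(log n) binary search over the sorted times.
    V get(long time) {
        int i = Collections.binarySearch(times, time);
        return i >= 0 ? values.get(i) : null;
    }

    // Max/Min: O(1), first and last entries.
    V oldest() { return values.isEmpty() ? null : values.get(0); }
    V newest() { return values.isEmpty() ? null : values.get(values.size() - 1); }
}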
The problem is online
Details: the length of the array is <= 35000, the number of insertions is <= 35000, the number of assignments is <= 70000, and the number of queries is <= 70000; time limit: 10 s (Java: 20 s).
The vague solution I found online says that I need to maintain intervals using a scapegoat tree and in each node of the scapegoat tree, maintain a functional interval tree to query the kth largest element. I do know how to do the second step, but I don't know how to do the first one.
Let's suppose that we have (semantically) an array like
0: 31337
1: 42
2: 314159
3: 9000
4: 100
We have a scapegoat tree where the array entries are ordered by index. Each node of the tree stores the number of left-descendants so that we can search efficiently by index. (This makes the scapegoat implementation simpler too.)
          9000(3)
         /       \
     42(1)       100(0)
    /     \
31337(0)   314159(0)
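Selecting the entry at a given array index with those left-descendant counts could look like this sketch (the field and method names are illustrative, and maintenance of leftCount on insert/rebuild is omitted):

// Order-statistic lookup by array index, O(height) per query.
class IndexNode {
    int value;
    int leftCount;              // number of nodes in the left subtree
    IndexNode left, right;

    static IndexNode selectByIndex(IndexNode t, int i) {
        while (t != null) {
            if (i < t.leftCount) {
                t = t.left;                 // index lies in the left subtree
            } else if (i == t.leftCount) {
                return t;                   // this node holds array index i
            } else {
                i -= t.leftCount + 1;       // skip the left subtree and this node
                t = t.right;
            }
        }
        return null;                        // index out of range
    }
}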
For each subtree, we also maintain a value-ordered BST of values that it contains. This BST can be a scapegoat tree and also has left-descendant counts for implementing selection.
31337: {31337}
42: {42, 31337, 314159}
314159: {314159}
9000: {42, 100, 9000, 31337, 314159}
100: {100}
To insert, we insert into the scapegoat tree, updating the left-descendant counts and inserting the new value into the BSTs as we walk down. The amortized insertion cost is O(log^2 n) if we reconstitute the BSTs in linear time (proof: each value belongs to O(log n) BSTs, so scapegoating is O(log n) per node touched, for a total of O(log^2 n); inserting into O(log n) BSTs above the scapegoated node is O(log^2 n)). To update, we have to delete/insert from the BSTs (O(log^2 n)).
The query path is where things get ugly. Identifying the O(log n) BSTs and singleton sets whose union is the array section is the easy part. The hard part is actually doing the selection. Binary search will yield O(log^3 n)-time queries, because we have O(log n) rounds of selecting in O(log n) arrays, each with a selection cost of O(log n). Perhaps the Frederickson--Johnson algorithm points to an answer, but it's complicated even for arrays.
Partial answer:
The basic key to performance in such cases is a data structure (aka collection) which supports the required operations with O(log n) complexity (or better).
In your case you need insertions and lookups (called assignments and queries in your question).
Because you also ask for the "largest" element, you need a sorted collection. (This rules out hash-based collections, which have O(1) complexity but no ordering.)
So you should start with binary trees or tries.
It is currently impossible to give more details because your question is too vague.
Is there a type of set-like data structure supporting merging in O(log n) time and k-th element search in O(log n) time, where n is the size of the set?
You might try a Fibonacci heap, which does merge in constant amortized time and decrease-key in constant amortized time. Most of the time, such a heap is used for repeatedly pulling the minimum value, so a check-for-membership function isn't implemented. However, it is simple enough to add one by reusing the decrease-key logic, just without the decrease portion.
If k is a constant, then any meldable heap will do this, including leftist heaps, skew heaps, pairing heaps, and Fibonacci heaps. Both merging and getting the first element in these structures typically take O(1) or O(lg n) amortized time, so at most O(k lg n) overall.
Note, however, that getting to the k-th element may be destructive, in the sense that the first k-1 items may have to be removed from the heap.
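For example, with Java's binary heap (java.util.PriorityQueue) you would destructively remove the first k-1 elements (the class and method names here are mine):

import java.util.PriorityQueue;

// Destructive k-th smallest from a binary heap: remove the first k-1
// elements, then poll the k-th. Costs O(k log n) and empties part of
// the heap unless you re-insert the removed items afterwards.
class KthFromHeap {
    static int kthSmallest(PriorityQueue<Integer> heap, int k) {
        for (int i = 1; i < k; i++) heap.poll();
        return heap.poll();
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(java.util.List.of(5, 1, 4, 2, 3));
        System.out.println(kthSmallest(heap, 3)); // prints 3
    }
}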
If you're willing to accept amortization, you could achieve the desired bounds of O(lg n) time for both meld and search by using a binary search tree to represent each set. Melding two trees of sizes m and n requires O(m log(n / m)) time, where m < n. If you use amortized analysis and charge the cost of the merge to the elements of the smaller set, at most O(lg n) is charged to each element over the course of all of the operations. Selecting the k-th element of each set takes O(lg n) time as well (e.g. by storing subtree sizes in the nodes).
I think you could also use a collection of sorted arrays to represent each set, but the amortization argument is a little trickier.
As stated in the other answers, you can use heaps, but getting O(lg n) for both meld and select requires some work.
Finger trees can do this and some more operations:
http://en.wikipedia.org/wiki/Finger_tree
There may be something even better if you are not restricted to purely functional (aka "persistent") data structures, where "persistent" does not mean "backed up on non-volatile disk storage" but rather "all previous versions of the data structure remain available even after adding new elements".