I have a question about AVL trees. Let's assume I created an AVL tree of integers. How do I need to manage insertion into my tree so that I can take out the longest sequence of consecutive numbers? (Insertion has to have complexity O(log n).) For example:
      10
     /  \
    7    12
   / \
  6   8
In this case the longest sequence will be 6, 7, 8, so in my function void sequence(int* low, int* high) I'll do *low = 6, *high = 8.
The complexity of the sequence() function has to be O(1).
Thanks in advance for any ideas.
Actually, if you build an interval list (or something very like it) and then store its components in an AVL tree, you could probably do okay. The thing is that you don't just want a given sequence, you want the longest sequence: the longest run of lexically immediately adjacent keys*, to be more exact. Which, curiously, I think is quite hard without bashing up a custom metric to build your AVL tree on. I guess if the comparator for the AVL tree built on the interval list were f(length-of-interval), you could get it in O(log n), or maybe faster if your AVL implementation has a fast max/min.
I'm terribly sorry, I was hoping to be more help, but the fact that you have to use an AVL tree is a little troubling. I'm wondering if there's a trick one could pull involving subtrees, but I'm simply seeing no good way to make such an approach O(1) without so much preprocessing as to be a joke. Something with Bloom filters might work? A sketch of the interval idea follows the footnotes below.
* Some total orderings can create similar runs, but not all have a meaningful concept of immediate adjacency in their... well... phase space, I guess?**
** My lackluster formal education is really biting me right about now.
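To make the interval idea concrete, here is a minimal sketch (insert-only, my own names, and it assumes each integer is inserted at most once). It uses a std::map for the endpoint lookups purely to keep the sketch short; the AVL-on-intervals described above would play the same role. Each update is O(log n), and the longest run is cached so sequence() stays O(1):

#include <map>

std::map<int, int> runs;         // low endpoint of each run of consecutive keys -> high endpoint
int bestLow = 0, bestHigh = -1;  // cached longest run (empty until the first insert)

void noteInsert(int x) {         // call alongside the normal AVL insert
    int lo = x, hi = x;
    auto r = runs.find(x + 1);   // absorb a run starting just above x
    if (r != runs.end()) { hi = r->second; runs.erase(r); }
    auto l = runs.lower_bound(x);
    if (l != runs.begin() && (--l)->second == x - 1) {  // absorb a run ending just below x
        lo = l->first;
        runs.erase(l);
    }
    runs[lo] = hi;
    if (hi - lo > bestHigh - bestLow) { bestLow = lo; bestHigh = hi; }
}

void sequence(int *low, int *high) { *low = bestLow; *high = bestHigh; }  // O(1)

Running the question's example (inserting 10, 7, 12, 6, 8) leaves sequence() reporting *low = 6, *high = 8.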
The basic insertion and rotation in AVL trees guarantees O(log n) performance.
Coming to the second part of your question: to report your sequence, you first need to find (or traverse to) the "low" element in your AVL tree, and that by itself will take at most O(log n).
So O(1) sequence() complexity is out the window... If O(1) is a must, then maybe an AVL tree is not your data structure here.
I want to try executing the worst-case sequence on a splay tree.
But what is the worst-case sequence for splay trees?
And is there any way to calculate this sequence easily, given the keys that are inserted into the tree?
Can anyone help me with this?
Unless someone corrects me, I'm going to go with "no one actually knows what the worst-case series of operations is on a splay tree or what the complexity is in that case."
While we do know many results about the efficiency of splay trees, we actually don't know all that much about how to bound the time complexity of a splay tree. There's a conjecture called the dynamic optimality conjecture that says that in the worst case, any sufficiently long series of operations on a splay tree will take no more than a constant factor more time than the best possible self-adjusting binary search tree on that series of operations. One of the challenges in trying to prove this is that no one actually knows how to determine the cost of the best possible BST on all inputs. Another is that finding upper bounds on the runtimes of various input combinations to splay trees is hard - as of now, no one knows whether it takes time O(n) to treat a splay tree as a deque!
Hope this helps!
I don't know if an attempt at an answer after more than five years is of any use to you, but sorry, I finished my Master's in CS only recently :-) In the wake of that, I played around with exactly your question.
Consider the sequence S(3,2) (it should be obvious how S(m,n) works generally if you graph it): S(3,2) = [5,13,6,14,3,15,4,16,1,17,2,18,11,19,12,20,9,21,10,22,7,23,8,24]. Splay is so lousy on this sequence type that the competitive ratio r to the "Greedy Future" algorithm (see Demaine) tends to 2 in the limit S(∞,∞). I was never able to get over 2, even though Greedy Future is also not completely optimal and I could shave off a few operations.
[Figure omitted. Legend: black, grey, blue: S(7,4); purple, orange, red: points Splay must access too. Shown in the Demaine formulation.]
But note that your question is somewhat ill-defined! If you ask for the absolutely worst sequence, then take e.g. the bit-reversal sequence; ANY tree algorithm needs O(n log n) for that. But if you ask for the competitive ratio r as implied in templatetypedef's answer, then indeed nobody knows (but I would make bets on r = 2, see above).
Feel free to email me for details, I'm easily googled.
If I don't know the probabilities of accessing each element, but I'm sure that some elements will be accessed far more often than others, I will use a splay tree. What should I use if I already know all the probabilities? I assume there should be some data structure that is better than splay trees for this case.
I'm trying to imagine all the cases where and when I should use each type of search tree. Maybe someone can post links to articles comparing all the search trees and similar structures?
EDIT: I'd still like to have O(log n) as the worst case, but on average it should be faster. Splay trees are a good example, but I'd like to predefine the configuration of the tree.
For example, I have an array of elements to store, [a1, a2, ..., an], and the probabilities for each element, [p1, p2, ..., pn], which define how often I will access each element. I can create a splay tree, add each element to it (O(n log n)), and then access them with the given probabilities to shape the desired tree. So if I have probabilities [1/2, 1/4, 1/4], I need to splay the first element so that it ends up near the top. So I need to order the elements by probability and splay them from the lowest to the highest access probability. That also takes O(n log n). So the overall time of building such a tree is O(n log n) with a big constant. My goal is to lower this number.
I don't mind using something other than a search tree, but I'd like the time to be lower than in the case of a splay tree. And I want search, insert, and delete to be in the range of O(log n) amortized.
Edit: I didn't see that you wanted to update the tree dynamically - the below algorithm requires all elements and probabilities to be known in advance. I'll leave the post up in case someone in such a situation comes along.
If you happen to be in possession of the third edition of Introduction to Algorithms by Cormen et al., it describes a dynamic programming algorithm for creating optimal binary search trees when you know all of the probabilities.
Here is a rough outline of the algorithm: First, sort the elements (on element value, not probability). We don't yet know which element should be the root of the tree, but we know that all elements that will be to the left of the root in the tree will be to the left of that element in the list, and vice versa for the elements to the right of the root. If we choose the element at index k to be the root, we get two subproblems: how to construct an optimal tree for the elements 0 through k-1, and for the elements k+1 through n-1. Solve these problems recursively, so that you know the expected cost for a search in a tree where the root is element k. Do this for all possible choices of k, and you will find which tree is the best one. Use dynamic programming or memoization in order to save computation time.
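For reference, a condensed, hedged sketch of that dynamic program in C++ (function and variable names are mine, and it is simplified to successful-search probabilities only, so it is not the full CLRS version with dummy keys). cost[i][j] is the expected search cost of an optimal tree over the sorted elements i..j, and every possible root r is tried:

#include <algorithm>
#include <limits>
#include <vector>

// p[i] = access probability of the i-th smallest element (elements pre-sorted by value).
// Returns the expected cost of an optimal BST; O(n^3) time, O(n^2) space.
double optimalBST(const std::vector<double> &p) {
    if (p.empty()) return 0.0;
    int n = (int)p.size();
    std::vector<std::vector<double>> w(n, std::vector<double>(n, 0.0));     // w[i][j] = sum of p[i..j]
    std::vector<std::vector<double>> cost(n, std::vector<double>(n, 0.0));  // optimal expected cost
    for (int i = 0; i < n; i++) {
        w[i][i] = cost[i][i] = p[i];
        for (int j = i + 1; j < n; j++)
            w[i][j] = w[i][j - 1] + p[j];
    }
    for (int len = 2; len <= n; len++) {
        for (int i = 0; i + len - 1 < n; i++) {
            int j = i + len - 1;
            cost[i][j] = std::numeric_limits<double>::infinity();
            for (int r = i; r <= j; r++) {  // try each element as the root
                double left = (r > i) ? cost[i][r - 1] : 0.0;
                double right = (r < j) ? cost[r + 1][j] : 0.0;
                cost[i][j] = std::min(cost[i][j], left + right + w[i][j]);
            }
        }
    }
    return cost[0][n - 1];
}

As written this is O(n^3); Knuth's classic optimization of the same recurrence brings it down to O(n^2).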
Use a hash table.
You never mentioned needing ordered iteration, and by sacrificing this you can achieve amortized O(1) insert/access complexity, better than O(log n).
Specifically, use a hash table with linked list buckets, and use the move-to-front optimization. What this means is each time you search a bucket (linked list) with more than one item, you move the item found to the front of that bucket. The next time you access this element, it will already be at the front.
If you know the access probabilities, you can further refine the technique. When inserting a new element into a bucket, don't insert it onto the front, but rather insert such that you maintain most-probable-first order. Note the move-to-front technique will tend to perform this sort implicitly already, but you can help it bootstrap more quickly.
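As a rough illustration (hypothetical names; a real table would also grow its bucket array), the move-to-front bucket search might look like this:

#include <functional>
#include <list>
#include <vector>

struct MtfHashSet {
    std::vector<std::list<int>> buckets;
    MtfHashSet() : buckets(64) {}  // fixed bucket count to keep the sketch short

    std::list<int> &bucketFor(int key) {
        return buckets[std::hash<int>{}(key) % buckets.size()];
    }

    void insert(int key) { bucketFor(key).push_front(key); }

    bool contains(int key) {
        std::list<int> &b = bucketFor(key);
        for (auto it = b.begin(); it != b.end(); ++it) {
            if (*it == key) {
                b.splice(b.begin(), b, it);  // move-to-front: hot keys stay near the head
                return true;
            }
        }
        return false;
    }
};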
If your tree is not going to change once created, you probably should use a hash table or tango tree:
http://en.wikipedia.org/wiki/Tango_tree
Hash tables, when not overloaded, are O(1) lookup, degrading to O(n) when overloaded.
Tango trees, once constructed, are O(log log n) lookup. They do not support deletion or insertion.
There's also something known as a "perfect hash" that might be good for your use.
I have just finished a job interview where I was struggling with this question, which seems to me like a very hard question to give in a 15-minute interview.
The question was:
Write a function, which given a stream of integers (unordered), builds a balanced search tree.
Now, you can't wait for the input to end (it's a stream), so you need to balance the tree on the fly.
My first answer was to use a red-black tree, which of course does the job, but I have to assume they didn't expect me to implement a red-black tree in 15 minutes.
So, is there any simple solution to this problem I'm not aware of?
Thanks,
Dave
I personally think that the best way to do this would be to go for a randomized binary search tree like a treap. This doesn't absolutely guarantee that the tree will be balanced, but with high probability the tree will have a good balance factor. A treap works by augmenting each element of the tree with a uniformly random number, then ensuring that the tree is a binary search tree with respect to the keys and a heap with respect to the uniform random values. Insertion into a treap is extremely easy:
Pick a random number to assign to the newly-added element.
Insert the element into the BST using standard BST insertion.
While the newly-inserted element's random value is greater than the random value of its parent (treating the random values as a max-heap), perform a tree rotation to bring the new element above its parent.
That last step is the only really hard one, but if you had some time to work it out on a whiteboard I'm pretty sure that you could implement this on-the-fly in an interview.
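For what it's worth, a minimal C++ sketch of that insertion (struct layout and names are my own, not any particular library's); the random values form a max-heap, so the new node is rotated up while its random value beats its parent's:

#include <cstdlib>  // rand()

struct Node {
    int key, priority;
    Node *left, *right;
    Node(int k) : key(k), priority(std::rand()), left(nullptr), right(nullptr) {}
};

Node *rotateRight(Node *y) {  // left child rises above y
    Node *x = y->left;
    y->left = x->right;
    x->right = y;
    return x;
}

Node *rotateLeft(Node *x) {  // right child rises above x
    Node *y = x->right;
    x->right = y->left;
    y->left = x;
    return y;
}

Node *insert(Node *root, int key) {
    if (!root) return new Node(key);
    if (key < root->key) {
        root->left = insert(root->left, key);
        if (root->left->priority > root->priority)  // heap property violated?
            root = rotateRight(root);
    } else {
        root->right = insert(root->right, key);
        if (root->right->priority > root->priority)
            root = rotateLeft(root);
    }
    return root;
}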
Another option that might work would be to use a splay tree. It's another type of fast BST that can be implemented assuming you have a standard BST insert function and the ability to do tree rotations. Importantly, splay trees are extremely fast in practice, and it's known that they are (to within a constant factor) at least as good as any other static binary search tree.
Depending on what's meant by "search tree," you could also consider storing the integers in some structure optimized for lookup of integers. For example, you could use a bitwise trie to store the integers, which supports lookup in time proportional to the number of bits in a machine word. This can be implemented quite nicely using a recursive function to look over the bits, and doesn't require any sort of rotations. If you needed to blast out an implementation in fifteen minutes, and if the interviewer allows you to deviate from the standard binary search trees, then this might be a great solution.
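For illustration, a hedged sketch of such a bitwise trie over 32-bit unsigned integers (names are mine, and the loops are written iteratively where recursion would do just as well):

#include <cstdint>

struct TrieNode {
    TrieNode *child[2] = {nullptr, nullptr};
    bool present = false;  // marks the end of a stored integer
};

void trieInsert(TrieNode *root, uint32_t x) {
    for (int b = 31; b >= 0; b--) {  // walk from the most significant bit down
        int bit = (x >> b) & 1;
        if (!root->child[bit]) root->child[bit] = new TrieNode();
        root = root->child[bit];
    }
    root->present = true;
}

bool trieContains(TrieNode *root, uint32_t x) {
    for (int b = 31; b >= 0 && root; b--)
        root = root->child[(x >> b) & 1];
    return root && root->present;
}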
Hope this helps!
AA Trees are a bit simpler than Red-Black trees, but I couldn't implement one off the top of my head.
One of the simplest balanced binary search trees is the BB(α) tree. You pick the constant α, which says how unbalanced the tree may get. At all times, #descendants(child) <= (1-α) × #descendants(node) must hold. You treat it as a normal binary search tree, but when the formula no longer holds at some node, you just rebuild that part of the tree from scratch, so that it is perfectly balanced.
The amortized time complexity for insertion or deletion is still O(log N), just as with other balanced binary trees.
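A hedged sketch of that rebuild step in C++ (scapegoat-tree style; names and layout are mine): flatten the offending subtree in order, then rebuild it perfectly balanced. Checking the α-condition on the way back up from each insert is what makes the amortized bound work out:

#include <vector>

struct Node {
    int key;
    int size;  // number of nodes in this subtree, maintained on insert/delete
    Node *left, *right;
};

// In-order flatten of the subtree into a sorted vector of nodes.
void flatten(Node *t, std::vector<Node *> &out) {
    if (!t) return;
    flatten(t->left, out);
    out.push_back(t);
    flatten(t->right, out);
}

// Rebuild a perfectly balanced subtree from v[lo, hi).
Node *build(std::vector<Node *> &v, int lo, int hi) {
    if (lo >= hi) return nullptr;
    int mid = (lo + hi) / 2;
    Node *root = v[mid];
    root->left = build(v, lo, mid);
    root->right = build(v, mid + 1, hi);
    root->size = hi - lo;
    return root;
}

Node *rebuild(Node *t) {
    std::vector<Node *> v;
    flatten(t, v);
    return build(v, 0, (int)v.size());
}

// The α-condition: neither child may hold more than (1-α) of the subtree.
bool unbalanced(const Node *t, double alpha) {
    int ls = t->left ? t->left->size : 0;
    int rs = t->right ? t->right->size : 0;
    return ls > (1 - alpha) * t->size || rs > (1 - alpha) * t->size;
}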
We always see that operations on a (binary search) tree have O(log n) worst-case running time because the tree height is log n. I wonder: if we are told that an algorithm has a running time that is a function of log n, e.g. m + n log n, can we conclude that it must involve an (augmented) tree?
EDIT:
Thanks to your comments, I now realize divide-and-conquer and binary trees are very similar visually/conceptually. I had never made a connection between the two. But I can think of a case where an O(log n) algorithm is not divide-and-conquer, yet involves a tree that has none of the properties of a BST/AVL/red-black tree.
That's the disjoint-set data structure with Find/Union operations, whose running time is O(N + M log N), with N being the number of elements and M the number of Find operations.
Please let me know if I'm missing something, but I cannot see how divide-and-conquer comes into play here. I just see that in this (disjoint-set) case there is a tree with no BST property and a running time that is a function of log N. So my question is about why I can or cannot generalize from this case.
What you have is exactly backwards. O(lg N) generally means some sort of divide-and-conquer algorithm, and one common way of implementing divide and conquer is a binary tree. While binary trees are a substantial subset of all divide-and-conquer algorithms, they are still only a subset.
In some cases, you can transform other divide-and-conquer algorithms fairly directly into binary trees (e.g., comments on another answer have already made an attempt at claiming binary search is similar). Just for another obvious example, however, a multiway tree (e.g. a B-tree, B+ tree, or B* tree), while clearly a tree, is just as clearly not a binary tree.
Again, if you want to badly enough, you can stretch the point and say that a multiway tree can be represented as a sort of warped version of a binary tree. If you want to, you can probably stretch all the exceptions to the point of saying that all of them are (at least something like) binary trees. At least to me, however, all that does is make "binary tree" synonymous with "divide and conquer". In other words, all you accomplish is warping the vocabulary and essentially obliterating a term that's both distinct and useful.
No; you can also binary-search a sorted array, for instance. But don't take my word for it: http://en.wikipedia.org/wiki/Binary_search_algorithm
As a counter-example (in C++):

#include <cmath>

int countLogSteps(const int *a, int n) {  // the array itself is never read; only its length matters
    int y = 0;
    for (int x = 0; x <= (int)std::log2(n); x++)
        y = y + 1;
    return y;
}
The run time is O(log(n)), but no tree here!
The answer is no. Binary search of a sorted array is O(log(n)).
Algorithms taking logarithmic time are commonly found in operations on binary trees.
Examples of O(log n):
Finding an item in a sorted array with binary search (bisection).
Finding an item in a balanced search tree.
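For concreteness, a standard iterative binary search; logarithmic time, and no tree anywhere:

// Returns the index of 'key' in the sorted array a[0..n-1], or -1 if absent.
int binarySearch(const int *a, int n, int key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;  // avoids overflow of (lo + hi)
        if (a[mid] == key) return mid;
        if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}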
As O(log(n)) is only an upper bound, all O(1) algorithms, like function (a, b) { return a + b; }, satisfy the condition as well.
But I have to agree that all Theta(log(n)) algorithms kinda look like tree algorithms, or at least can be abstracted to a tree.
Short Answer:
Just because an algorithm has log(n) as part of its analysis does not mean that a tree is involved. For example, the following is a very simple algorithm that is O(log(n)):

for (int i = 1; i < n; i = i * 2)
    printf("hello\n");
As you can see, no tree was involved. John also provides a good example of how binary search can be done on a sorted array. Both take O(log(n)) time, and there are other code examples that could be created or referenced. So don't make assumptions based on the asymptotic time complexity; look at the code to know for sure.
More On Trees:
Just because an algorithm involves "trees" doesn't imply O(log n) either. You need to know the tree type and how the operation affects the tree.
Some Examples:
Example 1)
Inserting into or searching the following unbalanced tree would be O(n). [Figure of an unbalanced tree omitted.]
Example 2)
Inserting into or searching the following balanced trees would both be O(log(n)).
[Figures omitted: a balanced binary tree and a balanced tree of degree 3.]
Additional Comments
If the trees you are using don't have a way to "balance", then there is a good chance that your operations will be O(n) time, not O(log n). If you use trees that are self-balancing, then inserts normally take more time, as the balancing of the tree normally occurs during the insert phase.
I have a set of double-precision data and I need the list to be always sorted. What is the best algorithm to sort the data as it is being added?
By best I mean lowest big-O in the data count on average, lowest big-O in the worst case, and lowest big-O in the space needed, in that order if possible.
The set size is really variable, from small (around 30 elements) to lots of data (10M+).
Building a self-balancing binary tree like a red-black tree or AVL tree will allow for Θ(lg n) insertion and removal, and Θ(n) retrieval of all elements in sorted order (by doing a depth-first traversal), with Θ(n) memory usage. The implementation is somewhat complex, but they're efficient, and most languages will have library implementations, so they're a good first choice in most cases.
Additionally, retrieving the i-th element can be done by annotating each edge (or, equivalently, each node) in the tree with the total number of nodes below it. Then one can find the i-th element in Θ(lg n) time and Θ(1) space with something like:
node *find_index(node *root, int i) {
    while (root) {
        if (i == root->left_count)          // the i-th element is this node
            return root;
        else if (i < root->left_count)      // it lies in the left subtree
            root = root->left;
        else {                              // skip the left subtree and this node
            i -= root->left_count + 1;
            root = root->right;
        }
    }
    return NULL;                            // i >= number of nodes
}
An implementation that supports this can be found in debian's libavl; unfortunately, the maintainer's site seems down, but it can be retrieved from debian's servers.
The structure that is used for indexes of database programs is a B+ Tree. It is a balanced bucketed n-ary tree.
From Wikipedia:
For a b-order B+ tree with h levels of index:
The maximum number of records stored is n = b^h
The minimum number of keys is 2(b/2)^(h-1)
The space required to store the tree is O(n)
Inserting a record requires O(log_b(n)) operations in the worst case
Finding a record requires O(log_b(n)) operations in the worst case
Removing a (previously located) record requires O(log_b(n)) operations in the worst case
Performing a range query with k elements occurring within the range requires O(log_b(n) + k) operations in the worst case
I use this in my program. You can add your data to the structure as it comes and you can always traverse it in order, front to back or back to front, or search quickly for any value. If you don't find the value, you will have the insertion point where you can add the value.
You can optimize the structure for your program by playing around with b, the size of the buckets.
An interesting presentation about B+ trees: Tree-Structured Indexes
You can get the entire code in C++.
Edit: Now I see your comment that your requirement to know the "i-th sorted element in the set" is an important one. All of a sudden, that makes many data structures less than optimal.
You are probably best off with a SortedList or, even better, a SortedDictionary. See the article: Squeezing more performance from SortedList. SortedList has a GetKey function that will return the i-th element; SortedDictionary does not offer indexed access, so you would have to enumerate to the i-th element.
Likely a heap. Heaps are only O(log N) to add new data, and you can pop off all N results in sorted order in O(N log N) time.
If you always need the whole list sorted after every insertion, then there are not many options other than an insertion sort. That will likely be O(N^2), though with the (huge) hassle of linked skip lists you can make it O(N log N).
I would use a heap/priority queue. Worst case is same as average case for runtime. Next element can be found in O(log n) time.
Here is a templatized C# implementation that I derived from this code.
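In C++ terms (as opposed to the C# linked above), the same idea is only a few lines with std::priority_queue; a minimal sketch:

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

int main() {
    // Min-heap of doubles: push is O(log n), top is O(1), pop is O(log n).
    std::priority_queue<double, std::vector<double>, std::greater<double>> pq;
    for (double x : {3.5, 1.2, 7.8, 0.4}) pq.push(x);
    while (!pq.empty()) {  // draining yields ascending order: O(n log n) total
        std::printf("%g\n", pq.top());
        pq.pop();
    }
}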
If you just need to know the i-th smallest element, as it says in the comments, use the BFPRT algorithm, which is named after the last names of its authors: Blum, Floyd, Pratt, Rivest, and Tarjan, and is generally agreed to be the biggest concentration of big computer science brains in the same paper. It is O(n) worst-case.
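If a library call is acceptable instead of hand-rolling BFPRT, std::nth_element does in-place selection (typically via introselect, linear on average); a quick sketch:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v = {9.1, 2.3, 7.7, 4.4, 5.0};
    std::size_t i = 2;  // want the i-th smallest, 0-based
    std::nth_element(v.begin(), v.begin() + i, v.end());
    std::printf("element %zu in sorted order: %g\n", i, v[i]);  // everything before index i is <= v[i]
}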
OK, you want your data sorted, but you need to extract it via an index number.
Start with a basic tree, such as the aforementioned red-black tree.
Modify the tree algorithm such that, as you insert or delete elements, every node encountered along the way keeps a count of the number of elements under each branch.
Then, when you are extracting data from the tree, you can calculate the index as you go, and you know which branch to take based on whether it is greater or less than the index you are trying to extract.
One other consideration: 10M+ elements in a tree that uses dynamic memory allocation will suck up a lot of memory overhead. That is, the pointers may take up more space than your actual data, plus whatever other members are used to implement the data structure. This will lead to serious memory fragmentation and, in the worst cases, degrade the system's overall performance (churning data back and forth from virtual memory). You might want to consider a combination of block and dynamic memory allocation: something where you sort the tree into blocks of data, thus reducing the memory overhead.
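One hedged way to do that block allocation (illustrative only; names are mine): carve nodes out of large contiguous slabs instead of calling new per node. This sketch never frees individual nodes, which suits insert-heavy workloads:

#include <cstddef>
#include <vector>

struct TreeNode { double value; TreeNode *left, *right; };

// Simple chunked pool: nodes come from 4096-node slabs, cutting
// per-allocation overhead and heap fragmentation.
class NodePool {
    static constexpr std::size_t kChunk = 4096;
    std::vector<TreeNode *> chunks;
    std::size_t used = kChunk;  // forces a fresh slab on the first alloc
public:
    TreeNode *alloc() {
        if (used == kChunk) {
            chunks.push_back(new TreeNode[kChunk]);
            used = 0;
        }
        return &chunks.back()[used++];  // caller initializes the fields
    }
    ~NodePool() {
        for (TreeNode *c : chunks) delete[] c;
    }
};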
Check out the comparison of sorting algorithms in Wikipedia.
Randomized jump lists are interesting as well.
They require less space than BSTs and skip lists.
Insertion and deletion are O(log n).
By a "set of double data," do you mean a set of real-valued numbers? One of the more commonly used algorithms for that is a heap sort, I'd check that out. Most of its operations are O( n * log(n) ), which is pretty good but doesn't meet all of your criteria. The advantages of heapsort is that it's reasonably simple to code on your own, and many languages provide libraries to manage a sorted heap.