Intuition behind splay tree (self balancing trees) - algorithm

I am reading the basics of splay trees. The amortized cost of an operation is O(log n) over n operations. The rough idea is that when you access a node, you splay it, i.e. you move it to the root, so that it can be accessed quickly next time; and if the node was deep, splaying also improves the balance of the tree.
I don't understand how the tree can perform amortized O(log n) for this sample input:
Say a tree of n nodes is already built, and my next n operations are n reads. I access a deep node, say at depth n. This takes O(n). True, after this access the tree will become balanced. But say that every time I access the currently deepest node. Each such access will never cost less than O(log n). Then how can we ever compensate for the first costly O(n) operation and bring the amortized cost of each read down to O(log n)?
Thanks.

Assuming your analysis is correct and the operations are O(log(n)) per access and O(n) the first time...
If you always access the bottommost element (using some kind of worst-case oracle), a sequence of a accesses will take O(a*log(n) + n). And thus the amortized cost per operation is O((a*log(n) + n)/a)=O(log(n) + n/a) or just O(log(n)) as the number of accesses grows large.
This is the definition of asymptotic average-case performance/time/space, also called "amortized performance/time/space". You are accidentally thinking that a single O(n) step means all steps are at least O(n); one such step is only a constant amount of work in the long run; the O(...) is hiding what's really going on, which is taking the limit of [total amount of work]/[queries]=[average ("amortized") work per query].
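To see the n/a term vanish numerically, here is a minimal Python sketch that simply evaluates the bound from the formula above. It assumes all hidden constants are 1, purely for illustration, and `amortized_bound` is a hypothetical helper name, not something from the question.

```python
import math

def amortized_bound(n, a):
    """Per-access bound (a*log2(n) + n) / a, with all hidden constants taken as 1."""
    return (a * math.log2(n) + n) / a

n = 1024
for a in (1, 32, 1024, 1024 * 1024):
    # As a grows, the n/a term shrinks and only the log2(n) term remains.
    print(f"a = {a:>7}: about {amortized_bound(n, a):.3f} steps per access")
```

With a single access the O(n) term dominates; once the number of accesses is comparable to n, the per-access figure is already within a small constant of log2(n).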
You wrote: "This will never be less than O(log n)."
It has to be in order to get O(log n) average performance. To get intuition, the following website may be good: http://users.informatik.uni-halle.de/~jopsi/dinf504/chap4.shtml specifically the image http://users.informatik.uni-halle.de/~jopsi/dinf504/splay_example.gif -- it seems that while performing the O(n) operations, you move the path you searched scrunching it towards the top of the tree. You probably only have a finite number of such O(n) operations to perform until the entire tree is balanced.
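The scenario from the question can also be tried directly. The following is a rough simulation sketch, not a production splay tree: it assumes a bottom-up splay with parent pointers, starts from a left-going chain of n nodes, and plays the "worst-case oracle" by always accessing a currently deepest node. All names (`Node`, `rotate_up`, `adversarial_costs`, and so on) are made up for this illustration, and the cost of an access is taken to be the number of nodes on the search path.

```python
import math

class Node:
    __slots__ = ("key", "left", "right", "parent")

    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def rotate_up(x):
    """Rotate x above its parent while preserving the search-tree order."""
    p = x.parent
    g = p.parent
    if p.left is x:
        p.left, x.right = x.right, p
        if p.left is not None:
            p.left.parent = p
    else:
        p.right, x.left = x.left, p
        if p.right is not None:
            p.right.parent = p
    p.parent, x.parent = x, g
    if g is not None:
        if g.left is p:
            g.left = x
        else:
            g.right = x

def splay(x):
    """Move x to the root using zig, zig-zig and zig-zag steps."""
    while x.parent is not None:
        p = x.parent
        g = p.parent
        if g is None:
            rotate_up(x)                      # zig
        elif (g.left is p) == (p.left is x):
            rotate_up(p)                      # zig-zig: rotate the parent first
            rotate_up(x)
        else:
            rotate_up(x)                      # zig-zag: rotate x twice
            rotate_up(x)
    return x

def build_left_chain(n):
    """Worst-case start: keys n, n-1, ..., 1 on a single left-going path."""
    root = cur = Node(n)
    for key in range(n - 1, 0, -1):
        cur.left = Node(key)
        cur.left.parent = cur
        cur = cur.left
    return root

def deepest_node(root):
    """Return (node, number of nodes on the root-to-node path) for a deepest node."""
    best, best_len = root, 1
    stack = [(root, 1)]
    while stack:
        node, length = stack.pop()
        if length > best_len:
            best, best_len = node, length
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, length + 1))
    return best, best_len

def adversarial_costs(n, accesses):
    """Always access a currently deepest node; record the path length of each access."""
    root = build_left_chain(n)
    costs = []
    for _ in range(accesses):
        node, length = deepest_node(root)
        costs.append(length)
        root = splay(node)
    return costs

costs = adversarial_costs(1024, 1024)
print("first access:", costs[0])
print("average over all accesses:", sum(costs) / len(costs))
print("log2(n):", math.log2(1024))
```

The first access pays for the whole chain, but the average over all accesses stays within a small constant factor of log2(n), which is what the amortized bound promises.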
Here's another way to think about it:
Consider an unbalanced binary search tree. You can spend O(n) time balancing it. Assuming you don't add elements to it*, it takes O(log(n)) amortized time per query to fetch an element. The balancing setup cost is included in the amortized cost because it is effectively a constant which, as demonstrated in the equations in the answer, disappears (is dwarfed) by the infinite amount of work you are doing. (*if you do add elements to it, you need a self-balancing binary search tree, one of which is a splay tree)

Related

Amortized cost of insert/remove on min-heap

I ran into an interview question recently. No additional information is given in the question (maybe the default implementation should be assumed).
An arbitrary sequence of n insert and remove operations on an initially empty min-heap
(the location of the element to delete is known) has an amortized cost of:
A) insert O(1), remove O(log n)
B) insert O(log n), remove O(1)
The option (B) is correct.
I was surprised when I saw the answer sheet. I know this is tricky, maybe because the heap starts empty, or because the location of the element to delete is known, but I don't know why (A) is false. Why is (B) true?
When assigning amortized costs to operations on a data structure, you need to ensure that, for any sequence of operations performed, the sum of the amortized costs is always at least as big as the sum of the actual costs of those operations.
So let's take Option (A), which assigns an amortized cost of O(1) to insertions and an amortized cost of O(log n) to deletions. The question we have to ask is the following: is it true that for any sequence of operations on an empty binary heap, the real cost of those operations is upper-bounded by the amortized cost of those operations? And in this case, the answer is no. Imagine that you do a sequence purely of n insertions into the heap. The actual cost of performing these operations can be Θ(n log n) if each element has to bubble all the way up to the top of the heap. However, the amortized cost of those operations, with this accounting scheme, would be O(n), since we did n operations and pretended that each one cost O(1) time. Therefore, this amortized accounting scheme doesn't work, since it will let us underestimate the work that we're doing.
On the other hand, let's look at Option (B), where we assign O(log n) as our amortized insertion cost and O(1) as our amortized remove cost. Now, can we find a sequence of n operations where the real cost of those operations exceeds the amortized costs? In this case, the answer is no. Here's one way to see this. We've set the amortized cost of an insertion to be O(log n), which matches its real cost, and so the only way that we could end up underestimating the total is with our amortized cost of a deletion (O(1)), which is lower than the true cost of a deletion. However, that's not a problem here. In order for us to be able to do a delete operation, we have to have previously inserted the element that we're deleting. The combined real cost of the insertion and the deletion is O(log n) + O(log n) = O(log n), and the combined amortized cost of the insertion and the deletion is O(log n) + O(1) = O(log n). So in that sense, pretending that deletions are faster doesn't change our overall cost.
A nice intuitive way to see why the second approach works but the first one doesn't is to think about what amortized analysis is all about. The intuition behind amortization is to charge earlier operations a bit more so that future operations appear to take less time. In the case of the second accounting scheme, that's exactly what we're doing: we're shifting the cost of the deletion of an element from the binary heap back onto the cost of inserting that element into the heap in the first place. In that way, since we're only shifting work backwards, the sum of the amortized costs can't be lower than the sum of the real costs. On the other hand, in the first case, we're shifting work forward in time by making deletions pay for insertions. But that's a problem, because if we do a bunch of insertions and then never do the corresponding deletions we'll have shifted the work to operations that don't exist.
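The insert-only counterexample can be checked with a small sketch. It assumes a plain list-based binary min-heap and counts sift-up swaps as the "real cost"; `sift_up_swaps` and `total_insert_swaps` are hypothetical names used only here. Inserting keys in decreasing order forces every new key to bubble all the way to the root.

```python
def sift_up_swaps(heap, value):
    """Append value to a list-based binary min-heap; return the number of swaps performed."""
    heap.append(value)
    i, swaps = len(heap) - 1, 0
    while i > 0:
        parent = (i - 1) // 2
        if heap[parent] <= heap[i]:
            break                      # heap order restored
        heap[parent], heap[i] = heap[i], heap[parent]
        i, swaps = parent, swaps + 1
    return swaps

def total_insert_swaps(n):
    """Total swaps when inserting n, n-1, ..., 1 into an initially empty heap."""
    heap = []
    return sum(sift_up_swaps(heap, key) for key in range(n, 0, -1))

for n in (15, 255, 1023):
    # An O(1)-per-insert budget would only allow about n units of work in total.
    print(n, total_insert_swaps(n))
```

The swap total grows like n log n rather than n, so a budget of O(1) per insertion cannot cover a sequence that never performs the deletions that were supposed to pay for it.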
Because the heap is initially empty, you can't have more deletes than inserts.
An amortized cost of O(1) per deletion and O(log N) per insertion is exactly the same as an amortized cost of O(log N) for both inserts and deletes, because you can just count the deletion cost when you do the corresponding insert.
It does not work the other way around. Since you can have more inserts than deletes, there might not be enough deletes to pay the cost of each insert.

Running time complexity for binary search tree

I already know that if you try to find the item with a particular key, the worst-case running time is O(n), where n is the number of nodes. If you try to print out all the data items in order of their keys, the worst-case running time is O(n). If you try to search for a particular data item (without knowing its key), the worst-case running time is O(n). However, what if the keys and data are both integers, and the input items were randomly scrambled before they were inserted? Will the worst-case running times still be the same?
In the worst case, yes. A randomly built BST with n nodes has a 2^(n-1) / n! chance of being built degenerately, which is extremely rare as n gets to any reasonable size but still possible. In that case, a lookup might take Θ(n) time because the search might need to descend all the way down to the deepest leaf.
On expectation, though, the tree height will be Θ(log n), so lookups will take expected O(log n) time.
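Both claims can be sanity-checked with a short sketch. It assumes an unbalanced BST built by plain insertion (nodes stored as small lists), counts by brute force how many insertion orders of n distinct keys give a path-shaped tree, and then compares the height after sorted insertion with the height after one shuffled insertion. The helper names and the fixed random seed are assumptions made for the illustration.

```python
import random
from itertools import permutations
from math import factorial

def bst_height(keys):
    """Height (in nodes) of the BST obtained by inserting keys in the given order."""
    root = None                          # each node is a list [key, left, right]
    height = 0
    for key in keys:
        depth = 1
        if root is None:
            root = [key, None, None]
        else:
            node = root
            while True:
                side = 1 if key < node[0] else 2
                depth += 1               # depth of the child we are about to visit
                if node[side] is None:
                    node[side] = [key, None, None]
                    break
                node = node[side]
        height = max(height, depth)
    return height

def degenerate_fraction(n):
    """Return (number of insertion orders giving height n, n!)."""
    bad = sum(1 for order in permutations(range(n)) if bst_height(order) == n)
    return bad, factorial(n)

for n in range(1, 8):
    bad, total = degenerate_fraction(n)
    print(f"n={n}: {bad} of {total} orders are degenerate, 2^(n-1) = {2 ** (n - 1)}")

keys = list(range(1000))
sorted_height = bst_height(keys)         # sorted input: one long path
random.Random(0).shuffle(keys)
shuffled_height = bst_height(keys)       # shuffled input: much shallower
print("sorted:", sorted_height, "shuffled:", shuffled_height)
```

The brute-force count only works for tiny n, but it matches the 2^(n-1) / n! figure, and the shuffled tree comes out far shallower than the sorted one.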
The time to print a tree is independent of the shape of the tree, by the way. It's always Θ(n).
Hope this helps!
You might not be able to change the worst-case running time of a normal BST. However, if you randomize the input (in less than O(log n) time, if you're targeting O(log n) overall), then the chance of that worst case occurring is very small. See mathematical analysis here.
In case you are interested in guaranteed O(log n) time, you can use balanced BSTs such as red-black trees. However, the time to print will still be O(n), as you still need to visit each and every node before you can print it.

LSM Tree lookup time

What's the worst case time complexity in a log-structured merge tree for a simple search query (like querying a single WHERE clause)?
Is it O(log N)? O(N*Log N)? Something else?
How about a query with multiple conditions, like searching on multiple WHERE clauses in a key-value database?
The wikipedia page on LSM trees is currently lacking this info.
And I'm trying to make sense of the original paper.
I have been wondering the same.
If you have a series of trees, getting smaller by a constant factor each time, and you need to search them all for a single key, the cost seems to be O(log(N)^2).
Say the first (binary) tree takes log_2(N) branches to reach a node. The second might be half the size, and take (log_2(N) - 1) branches to find a node. The smallest tree will be some O(1) constant in size and there are roughly log_2(N) trees total. Summing the series gives O(log_2(N)^2).
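The series in this argument can be summed explicitly. The sketch below assumes N is a power of two and that the trees have sizes N, N/2, N/4, ..., 1, each costing log_2(size) branches to search; `total_branches` is a hypothetical helper name.

```python
def total_branches(n):
    """Sum of floor(log2(size)) over trees of size n, n/2, n/4, ..., 1."""
    total, size = 0, n
    while size >= 1:
        total += size.bit_length() - 1   # exact integer log2 for size >= 1
        size //= 2
    return total

for k in (10, 20, 30):
    n = 2 ** k
    # For N = 2^k the sum is k + (k-1) + ... + 0 = k(k+1)/2, i.e. O(log(N)^2).
    print(k, total_branches(n), k * (k + 1) // 2)
```

So under these assumptions the lookup cost is quadratic in log N, as the answer estimates.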
However, I'm wondering if there is some more clever scheme where arbitrary single-key lookups, insertions or deletions have amortized cost O(log(N)), but haven't been able to find an answer (yet).
For a simple search indexed by a LSM tree, it is O(log n). This is because the biggest tree in the LSM tree is a B tree, which is O(log n), and the other trees are subsets of B trees or in the case of in memory trees, more efficient trees, which are no worse than O(log n). The number of trees is a constant, so it doesn't affect the order of the search time.

Implementation of priority queue by AVL Tree data structure

Priority queue:
Basic operations:
Insertion
Delete (delete the minimum element)
Goal: to provide an efficient running time (order of growth) for the above operations.
Implementation of a priority queue by:
Linked list:
Insertion takes O(n) when inserting at the end, O(1) when inserting at the head.
Delete (finding the minimum and deleting it) takes O(n).
BST:
Insertion/deletion of the minimum: O(log n) in the average case, O(n) in the worst case.
AVL tree:
Insertion/deletion/searching: O(log n) in all cases.
My confusion is this:
Why don't we use an AVL tree to implement a priority queue? Why do we go for a binary heap, when we know that an AVL tree supports insertion, deletion and searching in O(log n) in the worst case?
Complexity isn't everything, there are other considerations for actual performance.
For most purposes, most people don't even use an AVL tree as a balanced tree (Red-Black trees are more common as far as I've seen), let alone as a priority queue.
This is not to say that AVL trees are useless, I quite like them. But they do have a relatively expensive insert. What AVL trees are good for (beating even Red-Black trees) is doing lots and lots of lookups without modification. This is not what you need for a priority queue.
As a separate consideration -- never mind your O(log n) insert for a binary heap, a Fibonacci heap has O(1) insert and amortized O(log N) delete-minimum. There are a lot of data structures to choose from with slightly different trade-offs, so you wouldn't expect to see everyone just pick the first thing that satisfies your (quite brief) criteria.
A binary heap is not a binary search tree (BST). A BST that is severely unbalanced, i.e. has degenerated into a list, will indeed take O(n) time. Heaps are always O(log(n)) or better. IIRC Sedgewick claimed O(1) average time for array-based heaps.
Why not AVL? Because it maintains too much order in the structure. Too much order means too much effort goes into maintaining that order. The less order we can get away with, the better; it will usually translate to faster operations. For example, RBTs (red-black trees) are better than AVL trees in this respect: they are only almost balanced, which saves operations while still ensuring O(log(n)) time.
But any search tree is a totally ordered structure, so heaps are generally better for this purpose, because they only ensure that the minimal element is on top. They are only partially ordered.
Because in a binary heap the minimum element is the root.
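For comparison, this is roughly what a heap-backed priority queue looks like in practice. The sketch uses Python's standard `heapq` module, which keeps a binary min-heap inside an ordinary list; the task names and priorities are made-up sample data.

```python
import heapq

queue = []
for priority, task in [(3, "write report"), (1, "fix outage"), (2, "review patch")]:
    heapq.heappush(queue, (priority, task))    # O(log n) insert

smallest = queue[0]                            # the minimum is always at index 0 (the root)

served = []
while queue:
    served.append(heapq.heappop(queue))        # O(log n) delete-minimum

print(smallest)
print(served)
```

Only the root position is guaranteed; the rest of the list is just partially ordered, which is all a priority queue needs.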

Is there one type of set-like data structure supporting merging in O(logn) time and k-th search in O(logn) time?(n is the size of this set)

Is there one type of set-like data structure supporting merging in O(logn) time and k-th element search in O(logn) time? n is the size of this set.
You might try a Fibonacci heap which does merge in constant amortized time and decrease key in constant amortized time. Most of the time, such a heap is used for operations where you are repeatedly pulling the minimum value, so a check-for-membership function isn't implemented. However, it is simple enough to add one using the decrease key logic, and simply removing the decrease portion.
If k is a constant, then any meldable heap will do this, including leftist heaps, skew heaps, pairing heaps and Fibonacci heaps. Both merging and getting the first element in these structures typically take O(1) or O(lg n) amortized time, so O(k lg n) maximum.
Note, however, that getting to the k'th element may be destructive in the sense that the first k-1 items may have to be removed from the heap.
If you're willing to accept amortization, you could achieve the desired bounds of O(lg n) time for both meld and search by using a binary search tree to represent each set. Melding two trees of size m and n together requires time O(m log(n / m)) where m < n. If you use amortized analysis and charge the cost of the merge to the elements of the smaller set, at most O(lg n) is charged to each element over the course of all of the operations. Selecting the kth element of each set takes O(lg n) time as well.
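The charging argument (always pay with the elements of the smaller set) can be illustrated with a rough sketch. Note the assumptions: it uses Python hash sets instead of binary search trees, so it does not support selecting the kth element, and it only counts how many times each element is moved; the function names and the random merge order are made up for the illustration.

```python
import math
import random

def merge_small_into_large(a, b, moves):
    """Move every element of the smaller set into the larger one, counting moves per element."""
    if len(a) < len(b):
        a, b = b, a
    for x in b:
        moves[x] += 1        # x now lives in a set at least twice as large as before
        a.add(x)
    return a

def simulate(n, seed=0):
    """Start from n singleton sets and merge random pairs until one set remains."""
    rng = random.Random(seed)
    sets = [{i} for i in range(n)]
    moves = [0] * n
    while len(sets) > 1:
        i, j = rng.sample(range(len(sets)), 2)
        merged = merge_small_into_large(sets[i], sets[j], moves)
        sets = [s for k, s in enumerate(sets) if k not in (i, j)]
        sets.append(merged)
    return moves

moves = simulate(1024)
print("most moves for any single element:", max(moves))
print("log2(n):", math.log2(1024))
```

Each move at least doubles the size of the set an element belongs to, so no element can be moved more than log2(n) times, whatever order the merges happen in.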
I think you could also use a collection of sorted arrays to represent each set, but the amortization argument is a little trickier.
As stated in the other answers, you can use heaps, but getting O(lg n) for both meld and select requires some work.
Finger trees can do this and some more operations:
http://en.wikipedia.org/wiki/Finger_tree
There may be something even better if you are not restricted to purely functional data structures (also known as "persistent" data structures, which here does not mean "backed up on non-volatile disk storage" but "all previous 'versions' of the data structure remain available even after 'adding' additional elements").
