LSM Tree lookup time - big-o

What's the worst case time complexity in a log-structured merge tree for a simple search query (like querying a single WHERE clause)?
Is it O(log N)? O(N log N)? Something else?
How about for a query with multiple WHERE clauses in a key-value database?
The Wikipedia page on LSM trees is currently lacking this info.
And I'm trying to make sense of the original paper.

I have been wondering the same.
If you have a series of trees, getting smaller by a constant factor each time, and you need to search them all for a single key, the cost seems to be O(log(N)^2).
Say the first (binary) tree takes log_2(N) branches to reach a node. The second might be half the size, and take (log_2(N) - 1) branches to find a node. The smallest tree will be some O(1) constant in size and there are roughly log_2(N) trees total. Summing the series gives O(log_2(N)^2).
However, I'm wondering if there is some more clever scheme where arbitrary single-key lookups, insertions or deletions have amortized cost O(log(N)), but haven't been able to find an answer (yet).
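A quick sanity check of that summation (a toy calculation in Python under the same assumptions, not taken from the LSM paper): searching a cascade of balanced trees of sizes N, N/2, N/4, ..., 1, at roughly log2(size) + 1 comparisons each, grows like log2(N)^2.

    import math

    def cascade_lookup_cost(n):
        """Comparisons to search every tree in a cascade of sizes n, n/2, ..., 1,
        assuming each tree is balanced (about log2(size) + 1 comparisons each)."""
        total, size = 0, n
        while size >= 1:
            total += int(math.log2(size)) + 1
            size //= 2
        return total

    for n in (2**10, 2**20, 2**30):
        print(n, cascade_lookup_cost(n), round(math.log2(n) ** 2 / 2))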

For a simple search indexed by an LSM tree, it is O(log n). This is because the biggest tree in the LSM tree is a B-tree, which is O(log n), and the other trees are subsets of B-trees or, in the case of the in-memory trees, more efficient structures, which are no worse than O(log n). The number of trees is a constant, so it doesn't affect the order of the search time.
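As an illustration of why a constant number of component trees keeps the lookup logarithmic, here is a minimal sketch (not a real LSM engine; real implementations add Bloom filters, block indexes, compaction, and so on; the class and method names here are made up for the example).

    import bisect

    class TinyLSM:
        """Toy model: an in-memory table plus a constant number of sorted runs."""
        def __init__(self, runs):
            self.memtable = {}                                       # newest data
            self.run_keys = [[k for k, _ in run] for run in runs]    # sorted keys per run
            self.run_vals = [[v for _, v in run] for run in runs]

        def get(self, key):
            if key in self.memtable:                                 # O(1) expected
                return self.memtable[key]
            for keys, vals in zip(self.run_keys, self.run_vals):     # constant number of runs
                i = bisect.bisect_left(keys, key)                    # O(log |run|) per run
                if i < len(keys) and keys[i] == key:
                    return vals[i]
            return None                                              # key not present

    db = TinyLSM([[(1, "a"), (5, "b"), (9, "c")],    # newer run shadows older ones
                  [(2, "x"), (5, "old")]])
    db.memtable[7] = "m"
    print(db.get(7), db.get(5), db.get(2), db.get(42))               # m b x None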

Related

Best way to join two red-black trees

The easiest way is to store the two trees in two arrays, merge them, and build a new red-black tree from the sorted array, which takes O(m + n) time.
Is there an algorithm with less time complexity?
You can merge two red-black trees in time O(m log(n/m + 1)) where n and m are the input sizes and, WLOG, m ≤ n. Notice that this bound is tighter than O(m+n). Here's some intuition:
When the two trees are similar in size (m ≈ n), the bound is approximately O(m) = O(n) = O(n + m).
When one tree is significantly larger than the other (m ≪ n), the bound is approximately O(log n).
You can find a brief description of the algorithm here. A more in-depth description which generalizes to other balancing schemes (AVL, BB[α], Treap, ...) can be found in a recent paper.
I think that if you have generic sets (so a generic red-black tree) you can't use the solution suggested by Sam Westrick, because it assumes that all elements in the first set are less than the elements in the second set. Cormen (the best book to learn algorithms and data structures) also specifies this condition for joining two red-black trees.
Because you need to compare elements across the two Red-Black Trees of sizes m and n, you will have to deal with a minimum of O(m+n) time complexity. (There is a way to do it with O(1) space complexity, but that is a separate matter from your question.) If you are not iterating over and checking each element in each Red-Black Tree, you cannot guarantee that your new Red-Black Tree will be sorted. I can think of another way of merging two Red-Black Trees, called "in-place merge using a doubly linked list", but it also results in O(m+n) time complexity:
Convert the two given Red-Black Trees into doubly linked lists, which has O(m+n) time complexity.
Merge the two sorted Linked Lists, which has O(m+n) time complexity.
Build a Balanced Red-Black Tree from the merged list created in step 2, which has O(m+n) time complexity.
Time complexity of this method is also O(m+n).
So, because you have to compare the elements of each tree with the elements of the other tree, you will end up with at least O(m+n).
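A minimal sketch of those three steps with plain BST nodes (Python lists stand in for the doubly linked lists, and the red-black recoloring of the rebuilt tree is omitted; this only shows the O(m+n) shape of the method):

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def inorder(root, out):                   # step 1: flatten each tree, O(size)
        if root is not None:
            inorder(root.left, out)
            out.append(root.key)
            inorder(root.right, out)

    def merge_sorted(a, b):                   # step 2: merge two sorted lists, O(m+n)
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        return out + a[i:] + b[j:]

    def build_balanced(keys, lo=0, hi=None):  # step 3: rebuild a balanced tree, O(m+n)
        if hi is None:
            hi = len(keys)
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        root = Node(keys[mid])
        root.left = build_balanced(keys, lo, mid)
        root.right = build_balanced(keys, mid + 1, hi)
        return root

    def merge_trees(t1, t2):
        xs, ys = [], []
        inorder(t1, xs)
        inorder(t2, ys)
        return build_balanced(merge_sorted(xs, ys))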

Time complexity of binary search in a slightly unbalanced binary tree

The best-case running time for binary search is O(log(n)), if the binary tree is balanced. The worst case would be if the binary tree is so unbalanced that it basically represents a linked list. In that case the running time of a binary search would be O(n).
However, what if the tree is only slightly unbalanced, as is the case for this tree:
Best case would still be O(log n) if I am not mistaken. But what would be the worst case?
Typically, when we say something like "the cost of looking up an element in a balanced binary search tree is O(log n)," what we mean is "in the worst case, we have to do O(log n) work in the course of performing a search on a balanced binary search tree." And since we're talking about big-O notation here, the previous statement is meant to be taken about balanced trees in general rather than a specific concrete tree.
If you have a specific BST in mind, you can work out the maximum number of comparisons required to find any element. Just find the deepest node in the tree, then imagine searching for a value that's bigger than that value but smaller than the next value in the tree. That will cause you to walk all the way down the tree as deeply as possible, making the maximum number of comparisons possible (specifically, h + 1 of them, where h is the height of the tree).
To be able to talk about the big-O cost of performing lookups in a tree, you'd need to talk about a family of trees of different numbers of nodes. You could imagine "kinda balanced" trees whose depth is Θ(√n), for example, where lookups would take time O(√n), for example. However, it's uncommon to encounter trees like that in practice, since generally you'd either (1) have a totally imbalanced tree or (2) use some sort of balanced tree that would prevent the height from getting that high.
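As a concrete check of the h + 1 figure above, here is a small sketch that computes the height of a specific BST (the example tree is made up for illustration):

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def height(root):
        """Height in edges of the longest root-to-leaf path (-1 for an empty tree)."""
        if root is None:
            return -1
        return 1 + max(height(root.left), height(root.right))

    # An example slightly-unbalanced tree
    root = Node(8,
                Node(4, Node(2), Node(6, Node(5))),
                Node(12))
    print(height(root) + 1)   # 4 -- the maximum comparisons for any lookup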
In a sorted array of n values, the run-time of binary search for a value is O(log n) in the worst case. In the best case, the element you are searching for is at the exact middle, and the search finishes in constant time. In the average case, too, the run-time is O(log n).
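For reference, a minimal iterative binary search over a sorted array:

    def binary_search(a, target):
        """Iterative binary search over a sorted list; O(log n) worst case."""
        lo, hi = 0, len(a) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if a[mid] == target:
                return mid          # best case: target is at the middle
            elif a[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1                   # not found after ~log2(n) halvings

    print(binary_search([1, 3, 5, 7, 9, 11], 7))   # 3
    print(binary_search([1, 3, 5, 7, 9, 11], 4))   # -1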

Time complexity in BST

In what situation does searching for a term using a binary search tree require time that is linear in the size of the term vocabulary (say M)? How can we ensure a worst-case time complexity of O(log M)?
A complete binary tree is one in which every level, except possibly the last, is completely filled. The worst-case search performance is the height of the tree, which in this case would be O(lg M), assuming M vocabulary terms in the tree.
One way to ensure this performance would be to use a self-balancing tree, e.g. a red-black tree.
Since binary search is a divide-and-conquer algorithm, we can ensure O(log M) if the tree is balanced, with an equal number of terms under the sub-trees of any node. O(log M) basically means that time goes up linearly while M goes up exponentially: if it takes 1 second to search a balanced binary tree of 10 nodes, it'd take 2 seconds to search an equally balanced tree with 100 nodes, 3 seconds for 1000 nodes, and so on.
But if the binary search tree is extremely unbalanced, to the point where it looks a lot like a linked list, we would have to go through every node, requiring a time complexity that is linear in M.
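A small sketch of that degenerate case: inserting already-sorted keys into a plain (non-self-balancing) BST produces a linked-list shape, so a search for the last key touches every node.

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def insert(root, key):
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def search_steps(root, key):
        """Number of nodes visited while searching for key."""
        steps, node = 0, root
        while node is not None:
            steps += 1
            if key == node.key:
                return steps
            node = node.left if key < node.key else node.right
        return steps

    root = None
    for k in range(1, 101):          # sorted insertions -> right-leaning chain
        root = insert(root, k)
    print(search_steps(root, 100))   # 100 nodes visited: linear in M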

Implementation of priority queue by AVL Tree data structure

Priority queue:
Basic operations: Insert
Delete (delete the minimum element)
Goal: to provide efficient running time (order of growth) for the above operations.
Implementation of a priority queue by:
Linked list: insertion takes O(n) if inserting at the tail, O(1) if inserting at the head.
Delete (finding the minimum and deleting it) takes O(n).
BST:
Insertion/deletion of the minimum: O(log n) in the average case, O(n) in the worst case.
AVL tree:
Insertion/deletion/searching: O(log n) in all cases.
My confusion is here:
Why haven't we used an AVL tree to implement a priority queue? Why did we go for a binary heap, when we know that in an AVL tree we can do insertion/deletion/searching in O(log n) in the worst case?
Complexity isn't everything, there are other considerations for actual performance.
For most purposes, most people don't even use an AVL tree as a balanced tree (Red-Black trees are more common as far as I've seen), let alone as a priority queue.
This is not to say that AVL trees are useless, I quite like them. But they do have a relatively expensive insert. What AVL trees are good for (beating even Red-Black trees) is doing lots and lots of lookups without modification. This is not what you need for a priority queue.
As a separate consideration -- never mind your O(log n) insert for a binary heap, a Fibonacci heap has O(1) insert and O(log n) delete-minimum. There are a lot of data structures to choose from with slightly different trade-offs, so you wouldn't expect everyone to just pick the first thing that satisfies your (quite brief) criteria.
A binary heap is not a binary search tree (BST). A BST that is severely unbalanced / has deteriorated into a list will indeed take O(n) time; heap operations are always O(log(n)) or better. IIRC Sedgewick claimed O(1) average time for array-based heaps.
Why not AVL? Because it maintains too much order in a structure. Too much order means, too much effort went into maintaining that order. The less order we can get away with, the better - it will usually translate to faster operations. For example, RBTs are better than AVL trees. RBTs, red-black trees, are almost balanced trees - they save operations while still ensuring O(log(n)) time.
But any tree is totally-ordered structure, so heaps are generally better, because they only ensure that the minimal element is on top. They are only partially ordered.
Because in a binary heap the minimum element is the root.
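For comparison, here is a binary-heap-backed priority queue using Python's heapq module (a minimal sketch): insert and delete-min are O(log n), and the minimum is always at the root, i.e. index 0 of the underlying array.

    import heapq

    pq = []
    for x in (5, 1, 9, 3):
        heapq.heappush(pq, x)         # O(log n) per insert
    print(pq[0])                      # 1 -- minimum at the root, O(1) peek
    print(heapq.heappop(pq))          # 1 -- delete-min, O(log n)
    print(heapq.heappop(pq))          # 3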

Intuition behind splay tree (self balancing trees)

I am reading the basics of splay trees. The amortized cost of an operation is O(log n) over n operations. The rough basic idea is that when you access a node, you splay it, i.e. you bring it to the root, so that the next access to it is quick; and if the node was deep, splaying it also improves the balance of the tree.
I don't understand how the tree can perform amortized O(log n) for this sample input:
Say a tree of n nodes is already built. My next n operations are n reads. I access a deep node, say at depth n. This takes O(n). True, after this access the tree will become more balanced. But say every time I access the currently deepest node. This will never be less than O(log n). Then how can we ever compensate for the first costly O(n) operation and bring the amortized cost of each read down to O(log n)?
Thanks.
Assuming your analysis is correct and the operations are O(log(n)) per access and O(n) the first time...
If you always access the bottommost element (using some kind of worst-case oracle), a sequence of a accesses will take O(a*log(n) + n). And thus the amortized cost per operation is O((a*log(n) + n)/a)=O(log(n) + n/a) or just O(log(n)) as the number of accesses grows large.
This is the definition of asymptotic average-case performance/time/space, also called "amortized performance/time/space". You are accidentally thinking that a single O(n) step means all steps are at least O(n); one such step is only a constant amount of work in the long run; the O(...) is hiding what's really going on, which is taking the limit of [total amount of work]/[queries]=[average ("amortized") work per query].
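Plugging some purely illustrative numbers into the O((a*log(n) + n)/a) bound above:

    import math

    n = 1_000_000                                 # nodes in the tree
    for a in (1, 100, 10_000, 1_000_000):         # number of accesses
        avg = (a * math.log2(n) + n) / a          # upper bound on average cost per access
        print(a, round(avg, 1))
    # as a grows, the average approaches log2(n) ~= 19.9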
This will never be less than O(log n).
It has to be in order to get O(log n) average performance. To get intuition, the following website may be good: http://users.informatik.uni-halle.de/~jopsi/dinf504/chap4.shtml specifically the image http://users.informatik.uni-halle.de/~jopsi/dinf504/splay_example.gif -- it seems that while performing the O(n) operations, you move the path you searched scrunching it towards the top of the tree. You probably only have a finite number of such O(n) operations to perform until the entire tree is balanced.
Here's another way to think about it:
Consider an unbalanced binary search tree. You can spend O(n) time balancing it. Assuming you don't add elements to it*, it takes O(log(n)) amortized time per query to fetch an element. The balancing setup cost is included in the amortized cost because it is effectively a constant which, as demonstrated in the equations in the answer, disappears (is dwarfed) by the infinite amount of work you are doing. (*if you do add elements to it, you need a self-balancing binary search tree, one of which is a splay tree)
