Splay tree: worst-case sequence - data-structures

I want to try executing the worst-case sequence of operations on a splay tree.
But what is the worst-case sequence for splay trees?
And is there any way to calculate this sequence easily, given the keys that are inserted into the tree?
Can anyone help me with this?

Unless someone corrects me, I'm going to go with "no one actually knows what the worst-case series of operations is on a splay tree or what the complexity is in that case."
While we do know many results about the efficiency of splay trees, we actually don't know all that much about how to tightly bound the worst-case time complexity of a splay tree. There's a conjecture called the dynamic optimality conjecture that says that, in the worst case, any sufficiently long series of operations on a splay tree will take no more than a constant factor longer than the best possible self-adjusting binary search tree on that series of operations. One of the challenges in trying to prove this is that no one actually knows how to determine the cost of the best possible BST on all inputs. Another is that finding upper bounds on the runtimes of various input combinations to splay trees is hard - as of now, no one even knows whether it takes time O(n) to use a splay tree as a deque!
Hope this helps!

I don't know if an attempt at an answer after more than five years is of any use to you, but sorry, I only finished my Master's in CS recently :-) In the course of it, I played around with exactly your question.
Consider the sequence S(3,2) (it should be obvious how S(m,n) works in general if you graph it): S(3,2)=[5,13,6,14,3,15,4,16,1,17,2,18,11,19,12,20,9,21,10,22,7,23,8,24]. Splay is so lousy on this sequence type that the competitive ratio r relative to the "Greedy Future" algorithm (see Demaine) approaches 2 in the limit S(infty,infty). I was never able to get over 2, even though Greedy Future is also not completely optimal and I could shave off a few operations.
[Figure omitted. Legend: black, grey, blue: S(7,4); purple, orange, red: points Splay must access too. Shown in the Demaine geometric formulation.]
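For concreteness, here is my reconstruction of a generator for S(m,n) in C++. This is purely my reading of the pattern from the S(3,2) example above (the first half [1..2mn] walked in n blocks of m descending adjacent pairs, interleaved with the second half [2mn+1..4mn] in ascending order), so double-check it against your own graph of the sequence:

    #include <vector>

    // Reconstruction of S(m,n) from the example above; reproduces
    // S(3,2) = [5,13,6,14,3,15,4,16,1,17,2,18,11,19,12,20,9,21,10,22,7,23,8,24].
    std::vector<int> S(int m, int n) {
        std::vector<int> firstHalf;                // values from [1 .. 2mn]
        for (int b = 0; b < n; ++b)                // n blocks of width 2m
            for (int p = m - 1; p >= 0; --p) {     // adjacent pairs, descending
                firstHalf.push_back(2 * m * b + 2 * p + 1);
                firstHalf.push_back(2 * m * b + 2 * p + 2);
            }
        std::vector<int> seq;
        for (int i = 0; i < 2 * m * n; ++i) {      // interleave with [2mn+1 .. 4mn]
            seq.push_back(firstHalf[i]);
            seq.push_back(2 * m * n + i + 1);
        }
        return seq;
    }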
But note that your question is somewhat ill-defined! If you ask for the absolutely worst sequence, then take e.g. the bit-reversal sequence; ANY BST algorithm needs Omega(n log n) for that. But if you ask for the competitive ratio r, as implied in templatetypedef's answer, then indeed nobody knows (though I would bet on r=2, see above).
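If you want to experiment with the bit-reversal sequence, generating it is straightforward; a minimal sketch for n = 2^k keys:

    #include <vector>

    // Access sequence 0 .. 2^k - 1 in bit-reversed order, the classic
    // Omega(n log n) lower-bound instance for any BST algorithm.
    std::vector<int> bitReversal(int k) {
        int n = 1 << k;
        std::vector<int> seq(n);
        for (int i = 0; i < n; ++i) {
            int rev = 0;
            for (int b = 0; b < k; ++b)            // mirror the k bits of i
                if (i & (1 << b)) rev |= 1 << (k - 1 - b);
            seq[i] = rev;
        }
        return seq;
    }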
Feel free to email me for details, I'm easily googled.

Related

How to reduce the auxiliary memory of the below two binary-tree-related problems [grandparent- and uncle-related problems]

I was recently asked to solve a binary tree traversal problem where the aim is to find the sum of all nodes in a binary tree where the node is odd and its uncle is also odd. I came up with a solution, below, which is O(n) in algorithmic complexity (one full traversal of the tree) and O(h) in auxiliary memory usage. If the binary tree happens to be a height-balanced BST, then it can be argued that the auxiliary memory complexity is O(log n).
My solution is a variation on the problem of identifying all root-to-leaf paths. That problem and its solution can be found here:
https://github.com/anandkulkarnisg/BinaryTree/blob/main/set2/rootleaf.cpp
The solution to the odd node with odd uncle problem is given here:
https://github.com/anandkulkarnisg/BinaryTree/blob/main/set5/sumodduncle.cpp
The interviewer agreed that the algorithmic complexity is obvious, since one traversal is definitely needed, and it is O(n). But he argued that the auxiliary memory complexity can be made much better than O(h), and he did not say what the approach was. I have been thinking about this for two weeks now and haven't got a better solution yet.
I cleared the interview, by the way, and was offered a role that I am considering now, but I still don't know what the better approach to tuning the auxiliary memory is here. Could it be O(1)? That sounds impossible, unless we somehow keep track, at every node, of only the parent and grandparent, which would be O(1). Is that possible?
https://github.com/anandkulkarnisg/BinaryTree/blob/main/set5/sumodduncle.cpp In this code module, the solution is invoked as below:
long sumAlt=findSumOddUncle(uniqueBstPtr);
This is the O(1) solution, because all variables are passed via pointer and only the sum, which accumulates the total across the recursive calls, is passed along. Tested and works as expected.
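For what it's worth, here is a compact recursive formulation of the traversal (my own sketch, not the code in the linked repo): do the accounting at the grandparent, where both the uncle and the grandchildren are directly visible. Note that the recursion stack itself is still O(h), which seems to be the crux of the interviewer's objection:

    struct Node {
        long val;
        Node *left, *right;
    };

    // Sum of all odd nodes whose uncle is also odd. Each node with an uncle
    // is a grandchild of exactly one node, so it is counted exactly once.
    long sumOddUncle(const Node* g) {
        if (!g) return 0;
        long sum = 0;
        const Node *l = g->left, *r = g->right;
        if (l && r) {                   // children of l have uncle r, and vice versa
            if (r->val % 2 != 0) {
                if (l->left  && l->left->val  % 2 != 0) sum += l->left->val;
                if (l->right && l->right->val % 2 != 0) sum += l->right->val;
            }
            if (l->val % 2 != 0) {
                if (r->left  && r->left->val  % 2 != 0) sum += r->left->val;
                if (r->right && r->right->val % 2 != 0) sum += r->right->val;
            }
        }
        return sum + sumOddUncle(l) + sumOddUncle(r);
    }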

Time complexity for greedily coded Huffman tree

Edited to clarify the question
I'm about to turn in a laboratory project at uni on two well-known compression algorithms (Huffman coding and Lempel-Ziv 77). My implementation of Huffman coding is the usual greedy approach, in which the tree is built with the following steps (a code sketch of this loop follows the list):
1. Calculate frequencies for all unique characters and place them in a minimum heap
2. While there are more than two nodes in the heap:
2.1 Take a value from the minimum heap and place it as the left child
2.2 Take a value from the minimum heap and place it as the right child
2.3 Create a parent node with a frequency of 'frequency of the left child + frequency of the right child'
2.4 Place the parent node into the minimum heap and continue
3. Finalize the tree with a root node (its children are the last two nodes in the minimum heap)
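In code, the construction might look like the sketch below (my own illustration, not the project's actual code; I fold the finalization step 3 into the loop condition, which is equivalent):

    #include <queue>
    #include <utility>
    #include <vector>

    struct HuffNode {
        long freq;
        char symbol;                     // meaningful only at leaves
        HuffNode *left, *right;
    };

    struct ByFreq {                      // orders the priority queue as a min-heap
        bool operator()(const HuffNode* a, const HuffNode* b) const {
            return a->freq > b->freq;
        }
    };

    // Build the Huffman tree; assumes at least one (symbol, frequency) pair.
    HuffNode* buildHuffman(const std::vector<std::pair<char, long>>& freqs) {
        std::priority_queue<HuffNode*, std::vector<HuffNode*>, ByFreq> heap;
        for (auto [sym, f] : freqs)
            heap.push(new HuffNode{f, sym, nullptr, nullptr});
        while (heap.size() > 1) {
            HuffNode* left = heap.top(); heap.pop();   // two smallest frequencies
            HuffNode* right = heap.top(); heap.pop();
            heap.push(new HuffNode{left->freq + right->freq, 0, left, right});
        }
        return heap.top();                             // the root
    }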
I've been able to find good sources on the time complexities of all the other steps for both algorithms, except for the decompression phase of Huffman coding. My current understanding is that even though this implementation of the Huffman tree is unbalanced, and in the worst case has a longest path of length n-1 (where n equals the count of unique characters), the traversal times are balanced out by the frequencies of the different characters.
In an unbalanced Huffman tree, the leaf nodes with higher frequencies are visited more often and have shorter traversal paths. This balances the total time cost, and my intuition says that the total time approaches a complexity of O(k log n), where k is the length of the uncompressed content and n is the number of unique characters in the Huffman tree.
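A decode loop makes the cost accounting concrete: each decoded character costs one root-to-leaf walk, so the total work is proportional to the total number of compressed bits, i.e. the sum over all k characters of their code lengths. A minimal sketch (my own illustration; HuffNode is a hypothetical leaf-marked tree node, not your project's type):

    #include <string>
    #include <vector>

    struct HuffNode {
        char symbol;                     // valid only at leaves
        HuffNode *left, *right;          // both null at leaves
    };

    // Walk the tree bit by bit; total iterations = number of compressed bits.
    std::string decode(const HuffNode* root, const std::vector<bool>& bits) {
        std::string out;
        const HuffNode* cur = root;
        for (bool bit : bits) {
            cur = bit ? cur->right : cur->left;
            if (!cur->left && !cur->right) {  // reached a leaf: emit, restart at root
                out += cur->symbol;
                cur = root;
            }
        }
        return out;
    }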
I'd feel much more comfortable if I had reliable source to reference regarding the time complexity of the decompression phase that either support my intuition or counters it. If anyone has good tips on books or not-super-difficult-to-read-articles that cover this particular question, I'd very much appreciate the help! I've put quite a lot of effort into my project and I don't think this one detail is going to make a big dent, even if I don't find the answer in time. I'm mainly just super-fascinated by the topic and want to learn as much as possible.
And just in case anyone ever stumbles onto this question while working on a similar one, here's my project. Keep in mind that it's a student project, so it's good to keep a critical mind while reviewing it.
Generating a Huffman code is O(n) for n symbols being coded, when the frequencies are sorted. The sorting in general takes O(n log n), so that is the time complexity for the Huffman algorithm.
The priority queue in your example is one way to do the sorting. There are other approaches where the sorting is done before doing Huffman coding.
Generally, LZ77 and Huffman coding are combined.

AVL tree management

I have a question about AVL trees. Let's assume I created an AVL tree of integers. How do I need to manage insertion into my tree to be able to extract the longest run of consecutive numbers (insertion has to have complexity O(log n))? For example:
      10
     /  \
    7    12
   / \
  6   8
In this case the longest run is 6,7,8, so in my function void sequence(int* low, int* high) I'll set *low = 6 and *high = 8...
The complexity of the function (sequence) has to be O(1).
Thanks in advance for any ideas.
Actually, if you build an interval list or something very like it, then store its components in an AVL tree, you could probably do okay. The thing is that you don't just want a given run, you want the longest run; the longest run of lexically immediately adjacent keys*, to be more exact. Which, curiously, is quite hard, I think, without bashing up a custom metric to build your AVL tree on. I guess if your comparator for the AVL tree built on the interval list was f(length-of-interval), you could get it in O(log n), or maybe faster if your AVL implementation has a fast max/min.
I'm terribly sorry, I was hoping to be more help, but the fact that we have to use an AVL tree is a little troubling. I'm wondering if there's a trick one could pull involving sub-trees, but I'm simply seeing no good way to make such an approach O(1) without so much preprocessing as to be a joke. Something with Bloom filters might work?
* Some total orderings can create similar runs, but not all have a meaningful concept of immediate adjacency in their... well... phase space, I guess?**
**My lackluster formal education is really biting me right about now.
The basic insertion & rotation in AVL trees guarantees O(log n) performance.
Coming to the second part of your question: to report your sequence, you first need to find (or traverse to) the "low" element in your AVL tree, and that will itself take you AT MOST O(log n).
So O(1) sequence() complexity is out the window... If O(1) is a must, then maybe an AVL tree is not your data structure here.
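That said, if you are allowed to maintain a little auxiliary state alongside the AVL tree during insertion, O(1) sequence() becomes possible with a standard trick: keep a hash map from run endpoints to run lengths and cache the best run seen so far. A sketch under the assumption that duplicate keys are ignored (my own illustration, consistent with the point above that the AVL tree alone won't do it):

    #include <unordered_map>

    // Maintained alongside the AVL tree: each insert does O(1) expected extra
    // work, and sequence() just returns the cached bounds of the longest run.
    class RunTracker {
        std::unordered_map<int, int> runLen;          // run endpoint -> run length
        int bestLow = 0, bestHigh = 0, bestLen = 0;   // undefined until first insert
    public:
        void insert(int x) {
            if (runLen.count(x)) return;              // duplicate key: nothing to do
            int left  = runLen.count(x - 1) ? runLen[x - 1] : 0;
            int right = runLen.count(x + 1) ? runLen[x + 1] : 0;
            int len = left + right + 1;
            runLen[x] = len;
            runLen[x - left]  = len;                  // update the merged run's endpoints
            runLen[x + right] = len;
            if (len > bestLen) {
                bestLen = len;
                bestLow = x - left;
                bestHigh = x + right;
            }
        }
        void sequence(int* low, int* high) const {    // O(1), as required
            *low = bestLow;
            *high = bestHigh;
        }
    };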

Using red black trees for sorting

The worst-case running time of insertion into a red-black tree is O(lg n), and if I perform an in-order walk on the tree, I essentially visit each node, so the total worst-case runtime to print the sorted collection would be O(n lg n).
I am curious why red-black trees are not preferred for sorting over quicksort (whose average-case running time is O(n lg n)).
I see that maybe it's because red-black trees do not sort in-place, but I am not sure, so maybe someone could help.
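For concreteness, the tree sort described above might look like this sketch (using std::multiset, which the mainstream C++ standard library implementations build on a red-black tree):

    #include <set>
    #include <vector>

    // n inserts at O(lg n) each, then an O(n) in-order walk: O(n lg n) total.
    std::vector<int> treeSort(const std::vector<int>& in) {
        std::multiset<int> tree(in.begin(), in.end());      // the n insertions
        return std::vector<int>(tree.begin(), tree.end());  // in-order walk
    }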
Knowing which sort algorithm performs better really depends on your data and situation.
If you are talking in general/practical terms, quicksort (the variant where you select the pivot randomly, or just pick a fixed one, making the worst case Omega(n^2)) might be better than red-black trees because (not necessarily in order of importance):
Quicksort is in-place. This keeps your memory footprint low. Say this quicksort routine was part of a program which deals with a lot of data. If you kept using large amounts of memory, your OS could start swapping your process memory and thrash your performance.
Quicksort's memory accesses are localized. This plays well with caching and swapping.
Quicksort can be easily parallelized (probably more relevant these days).
If you were to try and optimize binary tree sorting (using binary tree without balancing) by using an array instead, you will end up doing something like Quicksort!
Red-black trees have memory overheads. You have to allocate nodes, possibly many times, and your memory requirements with trees are double or triple those of arrays.
After sorting, say you wanted the 1045th (say) element: you would need to maintain order statistics in your tree (extra memory cost because of this) and you would have O(log n) access time!
Red-black trees have overheads just to access the next element (pointer lookups).
Red-black trees do not play well with the cache, and the pointer accesses could induce more swapping.
Rotation in red-black trees will increase the constant factor in the O(n log n).
Perhaps the most important reason (though not valid if you have a library etc. available): quicksort is very simple to understand and implement. Even a school kid can understand it!
I would say: try to measure both implementations and see what happens!
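For instance, a quick harness along these lines (my own sketch; numbers will vary by machine and standard library):

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <random>
    #include <set>
    #include <vector>

    // Times std::sort (introsort, quicksort-based) against a red-black tree
    // sort via std::multiset on the same random data.
    int main() {
        std::mt19937 rng(42);
        std::vector<int> data(1000000);
        for (int& x : data) x = static_cast<int>(rng());

        auto a = data;
        auto t0 = std::chrono::steady_clock::now();
        std::sort(a.begin(), a.end());
        auto t1 = std::chrono::steady_clock::now();

        std::multiset<int> tree(data.begin(), data.end());  // n tree insertions
        std::vector<int> b(tree.begin(), tree.end());       // in-order walk
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::cout << "std::sort: " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n"
                  << "tree sort: " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n";
    }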
Also, Bob Sedgewick did a thesis on quicksort! It might be worth reading.
There are plenty of sorting algorithms which are worst case O(n log n) - for example, merge sort. The reason quicksort is preferred is because it is faster in practice, even though algorithmically it may not be as good as some other algorithms.
Often in-built sorts use a combination of various methods depending on the values of n.
There are many cases where red-black trees are not bad for sorting. My testing showed, compared to natural merge sort, that red-black trees excel where:
Trees are better for dups:
In all the tests where dups need to be eliminated, the tree algorithm is better. This is not astonishing, since the tree can be kept very small from the beginning, whereas algorithms designed for in-place array sorting might pass around larger segments for a longer time.
Trees are better for random data:
In all the tests with random data, the tree algorithm is better. This is also not astonishing, since in a tree the distance between elements is shorter and shifting is not necessary. So repeatedly inserting into a tree could need less effort than sorting an array.
So we get the impression that natural merge sort only excels in the ascending and descending special cases, which can't even be said for quicksort.
Gist with the test cases here.
P.S.: it should be noted that using trees for sorting is non-trivial. One has to provide not only an insert routine but also a routine that can linearize the tree back into an array. We are currently using a get_last and a predecessor routine, which doesn't need a stack. But these routines are not O(1), since they contain loops.
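A sketch of what such a stack-free linearization can look like with parent pointers (my own illustration, not the gist's actual code; note the loops in both routines, which is why they are not O(1)):

    struct TNode {
        int key;
        TNode *left, *right, *parent;
    };

    TNode* getLast(TNode* root) {                 // maximum = rightmost node
        while (root && root->right) root = root->right;
        return root;
    }

    TNode* predecessor(TNode* n) {
        if (n->left) {                            // maximum of the left subtree
            n = n->left;
            while (n->right) n = n->right;
            return n;
        }
        while (n->parent && n->parent->left == n) // climb while we are a left child
            n = n->parent;
        return n->parent;                         // null once we pass the minimum
    }

    // Fill out[0..n-1] in ascending order by walking backwards from the maximum.
    void linearize(TNode* root, int* out, int n) {
        TNode* cur = getLast(root);
        for (int i = n - 1; i >= 0 && cur; --i, cur = predecessor(cur))
            out[i] = cur->key;
    }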
Big-O time complexity measures do not usually take into account scalar factors, e.g., O(2n) and O(4n) are usually just reduced to O(n). Time complexity analysis is based on operational steps at an algorithmic level, not at a strict programming level, i.e., no source code or native machine instruction considerations.
Quicksort is generally faster than tree-based sorting since (1) the methods have the same algorithmic average time complexity, and (2) lookup and swapping operations require fewer program instructions and data accesses when working with simple arrays than with red-black trees, even if the tree uses an underlying array-based implementation. Maintaining the red-black tree constraints requires additional operational steps and data field storage/accesses (node colors) beyond the simple partition-exchange steps of a quicksort.
The net result is that red-black trees have higher scalar coefficients than quicksort, which are obscured by the standard O(n log n) average time complexity result.
Some other practical considerations related to machine architectures are briefly discussed in the Quicksort article on Wikipedia.
Generally, the running times of O(n lg n) algorithms can be expanded to A*n lg n + B, where A and B are constants. There are many algorithmic proofs that show the coefficients for quicksort are smaller than those of other algorithms. That holds in the best case (quicksort performs horribly on sorted data with a naive pivot choice).
In my opinion, the best way to explain the difference between all the sorting routines (for people who are confused about how quicksort is faster in practice than other sorting algorithms) is this: imagine you are running on a very slow computer where one comparison operation takes 1 hour and one shifting operation takes 2 hours (the hours are just to make clear how important each operation's cost is). Of all the sorting operations, quicksort needs very few comparisons and very little swapping of elements. That is the main reason quicksort is faster.

How do I find the fundamental operation when calculating run-time complexity?

I am trying to get the worst-case run-time complexity order for a couple of algorithms I created. However, I have run into a problem: I keep tending to select the wrong fundamental operation, or the wrong number of fundamental operations, for an algorithm.
To me it appears that the selection of the fundamental operation is more of an art than a science. After googling and reading my textbooks, I still have not found a good definition. So far I have defined it as "an operation that always occurs within an algorithm's execution", such as a comparison or an array manipulation.
But algorithms often have many comparisons that are always executed, so which operation do you pick?
I agree that to some degree it's an art, so you should always clarify when writing documentation, etc. But usually it's a "visit" to the underlying data structure. So, like you said, for an array it's a comparison or a swap, for a hash map it may be a manual examination of a key, for a graph it's a visit to a vertex or edge, etc.
Even practicing complexity theorists have disagreements about this sort of thing, so what follows may be a bit subjective: http://blog.computationalcomplexity.org/2009/05/shaving-logs-with-unit-cost.html
The purpose of big-O notation is to summarize the efficiency of an algorithm for the reader. In practical contexts, I am most concerned with how many clock cycles an algorithm takes, assuming that the big-O constant is neither extremely small nor large (and ignoring the effects of the memory hierarchy); this is the "unit-cost" model alluded to in the linked post.
The reason to count comparisons for sorting algorithms is that the cost of a comparison depends on the type of the input data. You could say that a sorting algorithm takes O(c n log n) cycles where c is the expense of a comparison, but it's simpler in this case to count comparisons instead because the other work performed by the algorithm is O(n log n). There's a sorting algorithm that sorts the concatenation of n sorted arrays of length n in n^2 log n steps and n^2 comparisons; here, I would expect that the number of comparisons and the computational overhead be stated separately, because neither necessarily dominates the other.
This only works when you have actually implemented the algorithm, but you could just use a profiler to see which operation is the bottleneck. That's a practical point of view. In theory, some assume that everything that is not the fundamental operation runs in zero time.
The somewhat simple definition I have heard is:
The operation which is executed at least as many times as any other operation in the algorithm.
For example, in a sorting algorithm, these tend to be comparisons rather than assignments, as you almost always have to visit and 'check' an element before you re-order it, but the check may not result in a re-ordering. So there will always be at least as many comparisons as assignments.
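That invariant is easy to see in an instrumented sort. A sketch (my own illustration) counting both operations in insertion sort, where every element move in the inner loop is preceded by a comparison:

    #include <vector>

    // Counts comparisons and inner-loop element moves; comparisons always
    // run at least as often, since a move happens only after a comparison.
    void insertionSort(std::vector<int>& a, long& comparisons, long& moves) {
        for (std::size_t i = 1; i < a.size(); ++i) {
            int key = a[i];
            std::size_t j = i;
            while (j > 0) {
                ++comparisons;              // every iteration compares a[j-1] to key
                if (a[j - 1] <= key) break; // compared, but no move needed
                a[j] = a[j - 1];            // move only after a comparison
                ++moves;
                --j;
            }
            a[j] = key;
        }
    }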
