Time complexity for greedily coded Huffman tree - algorithm

Edited to clarify the question
I'm about to turn in a laboratory project at Uni on two well known compression algorithms (Huffman coding and Lempel-Ziv 77). My implementation of Huffman coding is similar to the greedy approach, in which the tree is built with the following steps:
1. Calculate frequencies for all unique characters and place them in a minimum heap
2. While there are more than two nodes in heap:
2.1 Take a value from the minimum heap and place it as the left child
2.2 Take a value from the minimum heap and place it as the right child
2.3 Create a parent node with a frequency of 'frequency of the left child + frequency of the right child'
2.4 Place the parent node to the minimum heap and continue
3. Finalize the tree with a root node (children are the last two nodes in the minimum heap)
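The steps above can be sketched in Python roughly as follows. This is a minimal illustration and not the project's actual code: the names `build_huffman_tree` and `code_lengths` and the tuple layout of the nodes are assumptions made here, the input is assumed to be non-empty, and step 3 is folded into the loop by merging until a single node remains.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(text):
    """Build a Huffman tree for a non-empty string.

    A leaf is (frequency, symbol); an internal node is (frequency, left, right).
    """
    tiebreak = count()  # unique counter so the heap never has to compare nodes
    heap = [(freq, next(tiebreak), (freq, sym)) for sym, freq in Counter(text).items()]
    heapq.heapify(heap)                      # step 1: min-heap keyed on frequency
    while len(heap) > 1:                     # steps 2 and 3: merge down to one root
        f1, _, left = heapq.heappop(heap)    # step 2.1: smallest becomes left child
        f2, _, right = heapq.heappop(heap)   # step 2.2: next smallest becomes right child
        parent = (f1 + f2, left, right)      # step 2.3: parent carries the summed frequency
        heapq.heappush(heap, (f1 + f2, next(tiebreak), parent))  # step 2.4
    return heap[0][2]


def code_lengths(node, depth=0, out=None):
    """Map each symbol to the length of its code (its depth in the tree)."""
    if out is None:
        out = {}
    if len(node) == 2:                       # leaf: (frequency, symbol)
        out[node[1]] = max(depth, 1)         # a lone symbol still needs one bit
    else:                                    # internal: (frequency, left, right)
        code_lengths(node[1], depth + 1, out)
        code_lengths(node[2], depth + 1, out)
    return out


root = build_huffman_tree("aaaabbc")
lengths = code_lengths(root)
```

A decoder that walks the tree follows one edge per encoded bit, so its running time is proportional to the number of compressed bits, that is, the sum over all symbols of frequency times code length.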
I've been able to find good sources on the time complexities of all other steps for both algorithms, except for the decompression phase of Huffman coding. My current understanding is that even though a Huffman tree built this way is unbalanced and in the worst case has a longest path of length n-1 (in which n equals the count of unique characters), the traversal times are balanced by the frequencies of the different characters.
In an unbalanced Huffman tree the leaf nodes with higher frequencies are more common and have shorter traversal paths. This balances the total time cost, and my intuition states that the total time would approach a time complexity of O(k log n), in which k is the length of the uncompressed content and n is the number of unique characters in the Huffman tree.
I'd feel much more comfortable if I had a reliable source to reference regarding the time complexity of the decompression phase, one that either supports my intuition or counters it. If anyone has good tips on books or not-super-difficult-to-read articles that cover this particular question, I'd very much appreciate the help! I've put quite a lot of effort into my project and I don't think this one detail is going to make a big dent, even if I don't find the answer in time. I'm mainly just super-fascinated by the topic and want to learn as much as possible.
And just in case anyone ever stumbles onto this question while working on something similar, here's my project. Keep in mind that it's a student project and it's good to keep a critical mind while reviewing it.

Generating a Huffman code is O(n) for n symbols being coded, when the frequencies are sorted. The sorting in general takes O(n log n), so that is the time complexity for the Huffman algorithm.
The priority queue in your example is one way to do the sorting. There are other approaches where the sorting is done before doing Huffman coding.
Generally, LZ77 and Huffman coding are combined.

Related

Why does array implemented heap have constant runtime in add in practice?

Above is the runtime chart that I got from one of the lectures on array-implemented min-heaps.
It says that add is constant in practice, with a worst-case time of log N.
Is there a concrete reason why, in practice, add runs in constant time?
More precisely, the average time complexity of insertion is constant. This applies for repeated insertion or random heaps, basically as long as you're not inserting smaller and smaller values all the time or otherwise making sure that it always has the worst-case complexity.
Very quickly and vaguely explained, the intuition is that basically about half of the nodes of a heap are leaf nodes, and about half of the rest are their parents, etc. So you are very likely to end up at a constant height from the leaf nodes in the tree (which translates to a constant number of updates), and this probability makes up for the rare log(n) scenarios.
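A small experiment makes this intuition visible. The sketch below is an illustration under assumptions, not the lecture's implementation: `sift_up_swaps` and `average_swaps` are hypothetical helper names, and the heap is a plain Python list used as an array-backed min-heap. It counts how many swaps each insertion performs, once for values in random order and once for strictly decreasing values, which is the worst case mentioned above.

```python
import random


def sift_up_swaps(heap, value):
    """Insert value into an array-backed min-heap and return the number of swaps."""
    heap.append(value)
    i = len(heap) - 1
    swaps = 0
    while i > 0:
        parent = (i - 1) // 2
        if heap[parent] <= heap[i]:
            break                      # heap property holds, stop early
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent
        swaps += 1
    return swaps


def average_swaps(values):
    """Average number of swaps per insertion when inserting values one by one."""
    heap = []
    return sum(sift_up_swaps(heap, v) for v in values) / len(values)


rng = random.Random(0)                 # fixed seed so runs are repeatable
n = 1 << 14
random_order = [rng.random() for _ in range(n)]
descending = list(range(n, 0, -1))     # every insert is a new minimum

print("random order:", average_swaps(random_order))
print("descending:  ", average_swaps(descending))
```

With random input the average stays at a small constant regardless of n, while the descending input pays roughly log2(n) swaps per insertion.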
If you are interested in formal proofs, this paper should interest you:
Average case analysis of heap building by repeated insertion (accessible here on the Wayback Machine).
I also found this Stack Overflow answer for you, which gives some more intuition on the problem.

How to reduce the auxiliary memory of the two binary tree problems below (grandparent and uncle related problems)

I was recently asked to solve a binary tree traversal problem where the aim is to find the sum of all nodes in a binary tree where the node is odd and its uncle is also odd. I came up with the solution below, which is O(n) in algorithmic complexity (one full traversal of the tree) and uses O(h) auxiliary memory. If and only if the binary tree happens to be a BST and height balanced, it can be argued that the auxiliary memory complexity will be O(log(n)).
My solution is a variation on the path identification of all root to leaf problem. This problem and its solution can be found here.
https://github.com/anandkulkarnisg/BinaryTree/blob/main/set2/rootleaf.cpp
The solution to the odd node with odd uncle is given here.
https://github.com/anandkulkarnisg/BinaryTree/blob/main/set5/sumodduncle.cpp
The interviewer agreed that the algorithmic complexity is obvious as one traversal is definitely needed and it is O(n). But he argued that the auxiliary memory complexity can be designed much better than O(h) and he did not tell what the approach was. I have been thinking about this for 2 weeks now and haven't got a better solution yet.
I cleared the interview, by the way, and was offered a role that I am considering now, but I still don't know what the better approach to reducing auxiliary memory is here. Can it be O(1)? That doesn't sound possible unless we somehow keep track of only the parent and grandparent at every node, which would be O(1). Is that possible?
In the code module https://github.com/anandkulkarnisg/BinaryTree/blob/main/set5/sumodduncle.cpp, the solution using the invocation below
long sumAlt=findSumOddUncle(uniqueBstPtr);
is, I believe, the O(1) solution, because all variables are passed via pointer and only the sum, which accumulates the total across the recursive calls, is passed along. It is tested and works as expected.
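To make the idea of carrying only the nearby relatives concrete, here is a minimal Python sketch. It is not a translation of the linked C++ file: the `Node` class and the function `sum_odd_with_odd_uncle` are hypothetical names chosen for illustration. Each call receives the current node's sibling and uncle, so no path or list of ancestors is stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def sum_odd_with_odd_uncle(node, sibling=None, uncle=None):
    """Sum the values of odd nodes whose uncle (the parent's sibling) is also odd."""
    if node is None:
        return 0
    total = 0
    if node.value % 2 == 1 and uncle is not None and uncle.value % 2 == 1:
        total += node.value
    # For both children, the uncle is this node's sibling,
    # and each child's sibling is the other child.
    total += sum_odd_with_odd_uncle(node.left, node.right, sibling)
    total += sum_odd_with_odd_uncle(node.right, node.left, sibling)
    return total


#         1
#       /   \
#      3     5
#     / \   /
#    7   4 9
tree = Node(1, Node(3, Node(7), Node(4)), Node(5, Node(9)))
result = sum_odd_with_odd_uncle(tree)
```

Each call keeps only a constant amount of extra state, but the recursion itself still occupies one stack frame per level, so the auxiliary memory remains O(h) when the call stack is counted. Getting below that would presumably need parent pointers in the nodes or a threading technique such as Morris traversal.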

Splay tree: worst-case sequence

I want to try executing the worst-case sequence on a splay tree.
But what is the worst-case sequence on splay trees?
And is there any way to calculate this sequence easily, given the keys which are inserted into the tree?
Can anyone help me with this?
Unless someone corrects me, I'm going to go with "no one actually knows what the worst-case series of operations is on a splay tree or what the complexity is in that case."
While we do know many results about the efficiency of splay trees, we actually don't know all that much about how to bound the time complexity of a splay tree. There's a conjecture called the dynamic optimality conjecture that says that in the worst case, any sufficiently long series of operations on a splay tree will take no more than a constant factor more time than the best possible self-adjusting binary search tree on that series of operations. One of the challenges we're having in trying to prove this is that no one actually knows how to determine the cost of the best possible BST on all inputs. Another is that finding upper bounds on the runtimes of various input combinations to splay trees is hard - as of now, no one knows whether it takes time O(n) to treat a splay tree as a deque!
Hope this helps!
I don't know if an attempt at an answer after more than five years is of any use to you, but sorry, I completed my Master's in CS only recently :-) In the wake of that, I played around with exactly your question.
Consider the sequence S(3,2) (it should be obvious how S(m,n) works generally if you graph it): S(3,2)=[5,13,6,14,3,15,4,16,1,17,2,18,11,19,12,20,9,21,10,22,7,23,8,24]. Splay is so lousy on this sequence type that the competitive ratio r to the "Greedy Future" algorithm (see Demaine) is S[infty,infty]=2. I was never able to get over 2 even though Greedy Future is also not completely optimal and I could shave off a few operations.
(Legend: black,grey,blue: S(7,4); purple,orange,red: Splay must access these points too. Shown in the Demaine formulation.)
But note that your question is somewhat ill-defined! If you ask for the absolutely worst sequence, then take e.g. the bit-reversal sequence: ANY binary search tree algorithm needs Ω(n log n) for that. But if you ask for the competitive ratio r, as implied in templatetypedef's answer, then indeed nobody knows (but I would make bets on r=2, see above).
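For anyone who wants to try the bit-reversal sequence on their own splay tree, it can be generated in a few lines. This is a small sketch; the function name `bit_reversal_sequence` is an assumption made here, and the keys are the integers 0 to 2^bits - 1 accessed in the order obtained by reversing the binary digits of 0, 1, 2, and so on.

```python
def bit_reversal_sequence(bits):
    """Return the keys 0 .. 2**bits - 1 in bit-reversal order."""
    n = 1 << bits
    # Write i with exactly `bits` binary digits, reverse the digits, read it back.
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


print(bit_reversal_sequence(3))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Consecutive accesses jump between distant parts of the key space, which is what makes the sequence expensive for any binary search tree.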
Feel free to email me for details, I'm easily googled.

Balanced binary trees versus indexed skiplists

Not sure if the question should be here or on programmers (or some other SE site), but I was curious about the relevant differences between balanced binary trees and indexable skiplists. The issue came up in the context of this question. From the wikipedia:
Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.
Don't the space requirements of a skiplist depend on the depth of the hierarchy? And aren't binary trees easier to use, at least for searching (granted, insertion and deletion in balanced BSTs can be tricky)? Are there other advantages/disadvantages to skiplists?
(Some parts of your question (ease of use, simplicity, etc.) are a bit subjective and I'll answer them at the end of this post.)
Let's look at space usage. First, let's suppose that you have a binary search tree with n nodes. What's the total space usage required? Well, each node stores some data plus two pointers. You might also need some amount of information to maintain balance information. This means that the total space usage is
n * (2 * sizeof(pointer) + sizeof(data) + sizeof(balance information))
So let's think about an equivalent skiplist. You are absolutely right that the real amount of memory used by a skiplist depends on the heights of the nodes, but we can talk about the expected amount of space used by a skiplist. Typically, you pick the height of a node in a skiplist by starting at 1, then repeatedly flipping a fair coin, incrementing the height as long as you flip heads and stopping as soon as you flip tails. Given this setup, what is the expected number of pointers inside a skiplist?
An interesting result from probability theory is that if you have a series of independent events with probability p, you need approximately 1 / p trials (on expectation) before that event will occur. In our coin-flipping example, we're flipping a coin until it comes up tails, and since the coin is a fair coin (comes up heads with probability 50%), the expected number of trials necessary before we flip tails is 2. Since that last flip ends the growth, the expected number of times a node grows in a skiplist is 1. Therefore, on expectation, we would expect an average node to have only two pointers in it - one initial pointer and one added pointer. This means that the expected total space usage is
n * (2 * sizeof(pointer) + sizeof(data))
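The expected value of two pointers per node is easy to check empirically. The following is a minimal simulation of the coin-flipping rule described above; `random_height` and `average_height` are hypothetical names, and a fair coin is modeled with Python's `random` module using a fixed seed.

```python
import random


def random_height(rng):
    """Start at height 1 and grow by one for every heads flipped before the first tails."""
    height = 1
    while rng.random() < 0.5:   # heads with probability 1/2
        height += 1
    return height


def average_height(n, seed=0):
    """Average number of pointers per node over n simulated skiplist nodes."""
    rng = random.Random(seed)
    return sum(random_height(rng) for _ in range(n)) / n


print(average_height(200_000))  # close to 2
```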
Compare this to the size of a node in a balanced binary search tree. If there is a nonzero amount of space required to store balance information, the skiplist will indeed use (on expectation) less memory than the balanced BST. Note that many types of balanced BSTs (e.g. treaps) require a lot of balance information, while others (red/black trees, AVL trees) have balance information but can hide that information in the low-order bits of their pointers, while others (splay trees) don't have any balance information at all. Therefore, this isn't a guaranteed win, but in many cases the skiplist will use less space.
As to your other questions about simplicity, ease, etc: that really depends. I personally find the code to look up an element in a BST far easier than the code to do lookups in a skiplist. However, the rotation logic in balanced BSTs is often substantially more complicated than the insertion/deletion logic in a skiplist; try seeing if you can rattle off all possible rotation cases in a red/black tree without consulting a reference, or see if you can remember all the zig/zag versus zag/zag cases from a splay tree. In that sense, it can be a bit easier to memorize the logic for inserting or deleting from a skiplist.
Hope this helps!
And aren't binary trees easier to use, at least for searching (granted, insertion and deletion in balanced BSTs can be tricky)?
Trees are "more recursive" (trees and subtrees) and SkipLists are "more iterative" (levels in an array). Of course, it depends on implementation, but SkipLists can also be very useful for practical applications.
It's easier to search in trees because you don't have to iterate levels in an array.
Are there other advantages/disadvantages to skiplists?
SkipLists are "easier" to implement. This is a little relative, but it's easier to implement a fully functional SkipList than the deletion and balancing operations of a balanced binary tree.
Trees can be persistent (better for functional programming).
It's easier to delete items from SkipLists than internal nodes in a binary tree.
It's easier to add items to binary trees (keeping the balance is another issue)
Binary Trees are deterministic, so it's easier to study and analyze them.
My tip: If you have time, you must use a Balanced Binary Tree. If you have little time, use a Skip List. If you have no time, use a Library.
Something not mentioned so far is that skip lists can be advantageous for concurrent operations. If you read the source of ConcurrentSkipListMap, authored by Doug Lea... dig into the comments. It mentions:
there are no known efficient lock-free insertion and deletion algorithms for search trees. The immutability of the "down" links of index nodes (as opposed to mutable "left" fields in true trees) makes this tractable using only CAS operations.
You're right that this isn't the perfect forum.
The comment you quoted was written by the author of the original skip list paper: not exactly an unbiased assertion. It's been 23 years, and red-black trees still seem to be more prevalent than skip lists. An exception is the Redis key-value database, which includes skip lists as one option among its data structures.
Skip lists are very cool. But the only space advantage I've been able to show in the general randomized case is no need to store balance flags: two bits per value. This is assuming the hierarchy is dense enough to replicate binary tree performance. You can chalk this up as the price of determinism (versus randomization). A nice feature of skip lists is that you can use less dense hierarchies to trade constant factors of speed for space.
Side note: it's not often discussed that if you don't need to traverse in sorted order, you can randomize unbalanced binary trees by just enciphering the keys (i.e. mapping to a pseudo-random cipher text with something very simple like RC4). Such trees are absolutely trivial to implement.

Why are Fibonacci numbers significant in computer science?

Fibonacci numbers have become a popular introduction to recursion for Computer Science students and there's a strong argument that they persist within nature. For these reasons, many of us are familiar with them.
They also appear elsewhere in Computer Science, in surprisingly efficient data structures and algorithms based upon the sequence.
There are two main examples that come to mind:
Fibonacci heaps, which have better amortized running time than binomial heaps.
Fibonacci search, which shares O(log N) running time with binary search on an ordered array.
Is there some special property of these numbers that gives them an advantage over other numerical sequences? Is it a spatial quality? What other possible applications could they have?
It seems strange to me as there are many natural number sequences that occur in other recursive problems, but I've never seen a Catalan heap.
The Fibonacci numbers have all sorts of really nice mathematical properties that make them excellent in computer science. Here's a few:
They grow exponentially fast. One interesting data structure in which the Fibonacci series comes up is the AVL tree, a form of self-balancing binary tree. The intuition behind this tree is that each node maintains a balance factor so that the heights of the left and right subtree differ by at most one. Because of this, the minimum number of nodes necessary to get an AVL tree of height h is defined by a recurrence that looks like N(h + 2) ~= N(h) + N(h + 1), which looks a lot like the Fibonacci series. If you work out the math, you can show that the minimum number of nodes necessary to get an AVL tree of height h is F(h + 2) - 1. Because the Fibonacci series grows exponentially fast, this means that the height of an AVL tree is at most logarithmic in the number of nodes, giving you the O(lg n) lookup time we know and love about balanced binary trees. In fact, if you can bound the size of some structure with a Fibonacci number, you're likely to get an O(lg n) runtime on some operation. This is the real reason that Fibonacci heaps are called Fibonacci heaps - the proof that the number of heaps after a dequeue-min is O(lg n) involves bounding the number of nodes you can have in a certain depth with a Fibonacci number.
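The F(h + 2) - 1 formula can be checked numerically. The sketch below assumes one particular convention that the text does not spell out: height is counted in levels, so the empty tree has height 0 and a single node has height 1, and F(1) = F(2) = 1. Under that convention the exact recurrence is N(h) = N(h - 1) + N(h - 2) + 1. The function names are chosen here for illustration.

```python
def min_avl_nodes(h):
    """Minimum number of nodes in an AVL tree of height h (height counted in levels)."""
    a, b = 0, 1                  # N(0) = 0 (empty tree), N(1) = 1 (single node)
    for _ in range(h):
        a, b = b, a + b + 1      # root + sparsest subtrees of heights h-1 and h-2
    return a


def fib(n):
    """n-th Fibonacci number with F(0) = 0 and F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


for h in range(8):
    print(h, min_avl_nodes(h), fib(h + 2) - 1)   # the last two columns agree
```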
Any positive integer can be written as the sum of distinct Fibonacci numbers. This property of the Fibonacci numbers is critical to getting Fibonacci search working at all; if you couldn't add together distinct Fibonacci numbers into any possible number, this search wouldn't work. Contrast this with a lot of other series, like 3^n or the Catalan numbers. This is also partially why a lot of algorithms like powers of two, I think.
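A greedy procedure is enough to find such a decomposition: repeatedly subtract the largest Fibonacci number that still fits. The sketch below is one way to write it; the function name `fibonacci_sum` is an assumption, and the sequence is started at 1, 2 so that the value 1 is not listed twice.

```python
def fibonacci_sum(n):
    """Write a positive integer n as a sum of distinct Fibonacci numbers (greedy)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):     # largest first
        if f <= n:
            parts.append(f)
            n -= f
    return parts


print(fibonacci_sum(100))  # [89, 8, 3]
```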
The Fibonacci numbers are efficiently computable. If the series couldn't be generated extremely efficiently (you can get the first n terms in O(n) or any arbitrary term in O(lg n) arithmetic operations), then a lot of the algorithms that use them wouldn't be practical. Generating Catalan numbers is pretty computationally tricky, IIRC. On top of this, the Fibonacci numbers have a nice property where, given any two consecutive Fibonacci numbers, let's say F(k) and F(k + 1), we can easily compute the next or previous Fibonacci number by adding the two values (F(k) + F(k + 1) = F(k + 2)) or subtracting them (F(k + 1) - F(k) = F(k - 1)). This property is exploited in several algorithms, in conjunction with the sum-of-Fibonacci-numbers property above, to break apart numbers into the sum of Fibonacci numbers. For example, Fibonacci search uses this to locate values in memory, while a similar algorithm can be used to quickly and efficiently compute logarithms.
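One common way to reach an arbitrary term in O(lg n) arithmetic steps is the fast-doubling method, which rests on the identities F(2k) = F(k) * (2 * F(k + 1) - F(k)) and F(2k + 1) = F(k)^2 + F(k + 1)^2. The sketch below is an illustration of that technique, which the text above does not name; `fib_pair` is a hypothetical function name.

```python
def fib_pair(n):
    """Return (F(n), F(n + 1)) using fast doubling, with F(0) = 0 and F(1) = 1."""
    if n == 0:
        return (0, 1)
    a, b = fib_pair(n // 2)      # a = F(k), b = F(k + 1) with k = n // 2
    c = a * (2 * b - a)          # F(2k)
    d = a * a + b * b            # F(2k + 1)
    if n % 2:
        return (d, c + d)        # n = 2k + 1: (F(2k + 1), F(2k + 2))
    return (c, d)                # n = 2k:     (F(2k), F(2k + 1))


print(fib_pair(10)[0])  # 55
```

The recursion halves n at every step, so only about log2(n) levels are needed, although the numbers themselves grow and each multiplication becomes more expensive for very large n.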
They're pedagogically useful. Teaching recursion is tricky, and the Fibonacci series is a great way to introduce it. You can talk about straight recursion, about memoization, or about dynamic programming when introducing the series. Additionally, the amazing closed-form for the Fibonacci numbers is often taught as an exercise in induction or in the analysis of infinite series, and the related matrix equation for Fibonacci numbers is commonly introduced in linear algebra as a motivation behind eigenvectors and eigenvalues. I think that this is one of the reasons that they're so high-profile in introductory classes.
I'm sure there are more reasons than just this, but I'm sure that some of these reasons are the main factors. Hope this helps!
The greatest common divisor is another piece of magic; see this for many more. But Fibonacci numbers are easy to calculate, and the sequence also has a specific name. By comparison, the natural numbers 1,2,3,4,5 have plenty of structure too: all primes are among them, the sum of 1..n is computable, each one can be produced from the others, ... but no one pays much attention to them :)
One important thing I forgot about it is Golden Ratio, which has very important impact in real life (for example you like wide monitors :)
If you have an algorithm that can be successfully explained in a simple and concise manner with understandable examples in CS and nature, what better teaching tool could someone come up with?
Fibonacci sequences are indeed found everywhere in nature/life. They're useful at modeling growth of animal populations, plant cell growth, snowflake shape, plant shape, cryptography, and of course computer science. I've heard it being referred to as the DNA pattern of nature.
Fibonacci heaps have already been mentioned; the number of children of each node in the heap is at most O(log n). Also, the subtree rooted at a node with m children has size at least the (m+2)th Fibonacci number.
Torrent-like protocols which use a system of nodes and supernodes use the Fibonacci sequence to decide when a new supernode is needed and how many subnodes it will manage. They do node management based on the Fibonacci spiral (golden ratio). See the photo below for how nodes are split/merged (partitioned from one large square into smaller ones and vice versa). See photo: http://smartpei.typepad.com/.a/6a00d83451db7969e20115704556bd970b-pi
Some occurrences in nature
http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/sneezewort.GIF
http://img.blogster.com/view/anacoana/post-uploads/finger.gif
http://jwilson.coe.uga.edu/EMAT6680/Simmons/6690Pictures/pinecone3yellow.gif
http://2.bp.blogspot.com/-X5II-IhjXuU/TVbHrpmRnLI/AAAAAAAAABU/nv73Y9Ylkkw/s320/amazing_fun_featured_2561778790105101600S600x600Q85_200907231856306879.jpg
I don't think there's a definitive answer, but one possibility is this: dividing a set S into two partitions S1 and S2, one of which is then divided into two sub-partitions S11 and S12, one of which has the same size as S2, is a likely approach in many algorithms, and it can sometimes be described numerically as a Fibonacci sequence.
Let me add another data structure to yours: Fibonacci trees. They are interesting because the calculation of the next position in the tree can be done by mere addition of the previous nodes:
http://xw2k.nist.gov/dads/html/fibonacciTree.html
It ties well in with the discussion by templatetypedef on AVL-trees (an AVL tree can at worst have fibonacci structure). I've also seen buffers extended in fibonacci-steps rather than powers of two in some cases.
Just to add a bit of trivia, Fibonacci numbers describe the breeding of rabbits. You start with (1, 1), two rabbits, and then their population grows exponentially.
Their computation as a power of [[0,1],[1,1]] matrix can be considered as the most primitive problem of Operational Research (sort of like Prisoner's Dilemma is the most primitive problem of Game Theory).
Symbols with frequencies that are successive fibonacci numbers create maximum depth huffman trees, which trees correspond to source symbols being encoded with maximum length binary codes. Non-fibonacci source symbol frequencies create more balanced trees, with shorter codes. The code length has direct implications in the description complexity of the finite state machine that is responsible for decoding a given huffman code.
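The depth claim can be reproduced with a short experiment. The sketch below is an illustration only: `huffman_code_lengths` is a hypothetical helper that runs the usual heap-based merge and records how deep each symbol ends up, and the two frequency lists are made-up inputs rather than the images discussed here, so it says nothing about the bit counts in the conjecture that follows.

```python
import heapq


def huffman_code_lengths(freqs):
    """Return the Huffman code length of each symbol, given at least two frequencies."""
    # Heap entries: (frequency, tiebreak, indices of the symbols in this subtree).
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    tiebreak = len(freqs)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1          # every merge pushes these symbols one level deeper
        heapq.heappush(heap, (f1 + f2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


fib_lengths = huffman_code_lengths([1, 1, 2, 3, 5, 8])      # successive Fibonacci numbers
uniform_lengths = huffman_code_lengths([1, 1, 1, 1, 1, 1])  # all symbols equally likely

print(max(fib_lengths), max(uniform_lengths))
```

With six symbols, the Fibonacci frequencies produce a longest code of n - 1 = 5 bits, while the uniform frequencies give a longest code of 3 bits.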
Conjecture: The 1st(fib) image will be compressed to 38bits, while the 2nd(uniform) with 50bits. It seems that the closer your source symbol frequencies are to fibonacci numbers the shorter the final binary sequence, the better the compression, maybe optimal in the huffman model.
Further Reading:
Buro, M. (1993). On the maximum length of Huffman codes. Information Processing Letters, 45(5), 219-223. doi:10.1016/0020-0190(93)90207-p
For me, this is about order and space coordinates.
The Fibonacci sequence can be used as a clock.
The Fibonacci sequence allows the golden number to be calculated decimal by decimal.
The golden number multiplied by itself gives exactly the golden number + 1.
So we can certainly cut an integer into a series of integers, of units, by using for example the indexes.
I made a first naive version in Python (a proof of concept; the code is to be updated).
https://gitlab.com/numbers/Numbers/-/blob/main/ranging.py
So we can frame, count and coordinate the calculation steps and the memory spaces to this perfectly periodic reference frame (in time) and thus make it a kind of universal multiplication table equivalent. For me it is explicitly a mapping.
The idea is to eventually propose a ternary code with explicit management of the memory spaces according to the Fibonacci calculation step, and then to find all our numbers there.
Once done, to use this mapping, this universal table, this filter : to check the concordance, the consistency, the periodicity of complex computable operations, such as the wheeler experiment, sinus, gravity etc...
It sounds pretentious when you say it like that. It is not. Nobody create the golden number or Fibonacci. They are here, they are given like fruits on a tree.

Resources