What would be the length of the shortest code and the longest Huffman code for n characters with Fibonacci frequencies?
From what I understand, if we build the tree, it will look like one long branch with a single leaf hanging off at each level, from the root down to the deepest pair of leaves. When we have combined the n-2 smallest frequencies into one node, that node's frequency is F[n]-1, and F[n] > F[n]-1 > F[n-1] (F[n-1] is the least remaining frequency and the new node, F[n]-1, is the second least), which, by induction, applies at every step.
The tree we create is clearly an unbalanced tree, which, I assume, is not good.
If this is the optimal way to create a tree, what would be the length of the longest way to create it? If it is not the optimal way, then what would be the length of the shortest way?
I am new to computer science and I would really appreciate a good explanation.
The shortest code would be length 1 and the longest would be length n-1. There would be two symbols with length n-1, and there is one symbol for each length in 1..n-2.
There is only one optimal tree, and that's it. There is nothing bad about it being unbalanced. In fact it has to be that way to use the fewest bits to code those symbols with those frequencies.
I have no idea what you mean by the "shortest" or "longest" way.
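The claimed code lengths can be checked directly by running Huffman's algorithm on Fibonacci frequencies. A quick Python sketch (the function name and the tuple tiebreaker are mine; the tiebreaker only keeps the heap comparisons well-defined):

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Run Huffman's algorithm and return the code length of each symbol."""
    tiebreak = count()
    # Heap entries: (frequency, tiebreaker, tuple of symbol indices under this node)
    heap = [(f, next(tiebreak), (i,)) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        for sym in a + b:  # every symbol under a merged node gains one bit
            lengths[sym] += 1
        heapq.heappush(heap, (fa + fb, next(tiebreak), a + b))
    return lengths

# The first 8 Fibonacci numbers as frequencies (n = 8)
print(huffman_code_lengths([1, 1, 2, 3, 5, 8, 13, 21]))
# -> [7, 7, 6, 5, 4, 3, 2, 1]
```

As claimed: two codes of length n-1 = 7, and one code of each length from n-2 down to 1.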
Last week in an interview I was asked the above question, and as expected I wasn't able to answer it correctly. Later, when I checked, I saw that it's a dynamic-programming-based algorithm. I am not proficient in dynamic programming, but suppose I were to design this algorithm: how should I approach it?
Suppose I take an idea from other divide-and-conquer algorithms like MergeSort and design the solution something like:
Divide the sequence in two equal halves.
Find the longest increasing sub-sequence in two halves
Join the two halves.
Obviously there are missing pieces, but how do I move forward from here?
Your proposal won't work, because the longest sequences in both halves usually won't be contiguous, and there could exist a longer sequence when you join the halves.
You can fix this as follows:
in both halves, find the longest increasing sub-sequence, let L and R;
in both halves, find the longest increasing sub-sequence which is left-aligned, let LL and RL;
in both halves, find the longest increasing sub-sequence which is right-aligned, let LR and RR;
for the longest, keep the longest of L, R, LR+RL if the latter forms an increasing sequence;
for the left-aligned, keep LL or the whole left sub-sequence + RL if this forms an increasing sub-sequence;
for the right-aligned, keep RR or LR + the whole right sub-sequence if this forms an increasing sub-sequence.
All these operations are done in a single recursive process. When you concatenate two sub-sequences, checking whether they form an increasing sub-sequence only requires comparing the facing elements.
Update:
This "fix" was not thoroughly checked.
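Under the interpretation that "left-aligned"/"right-aligned" means runs touching the ends of the slice, the merge above can be sketched as follows. Names are mine, and note what the facing-element check actually supports: this computes the longest *contiguous* increasing run, not the general longest increasing subsequence:

```python
def longest_run(a):
    """Divide-and-conquer sketch of the merge described above.
    Returns (best, left_aligned, right_aligned) run lengths for the slice,
    where 'aligned' runs touch the left/right end of the slice."""
    if len(a) == 1:
        return 1, 1, 1
    mid = len(a) // 2
    left, right = a[:mid], a[mid:]
    bl, ll, lr = longest_run(left)    # L, LL, LR in the text
    br, rl, rr = longest_run(right)   # R, RL, RR in the text
    joins = left[-1] < right[0]       # compare the facing elements
    best = max(bl, br, lr + rl if joins else 0)
    # extend across the boundary only when a whole half is increasing
    left_aligned = len(left) + rl if (ll == len(left) and joins) else ll
    right_aligned = len(right) + lr if (rr == len(right) and joins) else rr
    return best, left_aligned, right_aligned

print(longest_run([1, 2, 3, 1, 2, 3, 4, 5])[0])  # -> 5 (the run 1,2,3,4,5)
```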
Suppose we have a balanced binary search tree T holding n numbers. We are given two
numbers L and H and wish to sum up all the numbers in T that lie between L and H. Suppose
there are m such numbers in T. Can someone explain how to work out the time taken to compute the sum?
I'll leave you to work out the full details, but here's a start. The algorithm will go:
Find the smallest number in the tree that's greater than L. You can do that in log time.
Walk the tree, each time moving to the next largest, and adding it to a running total.
Stop when you reach a number that's at least H.
I've assumed that "lie between" means "strictly between", but you might want weak inequalities in steps 1 and 3.
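The walk above can also be expressed as a recursive range sum with subtree pruning. A sketch with an assumed minimal node class, using strict inequalities as in the answer; this visits O(log n + m) nodes on a balanced tree:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def range_sum(node, lo, hi):
    """Sum the keys strictly between lo and hi, pruning subtrees
    that cannot contain any in-range key."""
    if node is None:
        return 0
    if node.key <= lo:   # everything in the left subtree is also <= lo
        return range_sum(node.right, lo, hi)
    if node.key >= hi:   # everything in the right subtree is also >= hi
        return range_sum(node.left, lo, hi)
    return node.key + range_sum(node.left, lo, hi) + range_sum(node.right, lo, hi)

root = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))
print(range_sum(root, 1, 6))  # -> 14  (2 + 3 + 4 + 5)
```

Switching to weak inequalities just means changing `<=`/`>=` to `<`/`>` in the two pruning tests.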
I was going through Cormen's 'Algorithms Unlocked'. In chapter 6, on shortest path algorithms, on inserting data into a binary heap, I find this: "Since the path to the root has at most floor(lg(n)) edges, at most floor(lg(n))-1 exchanges occur, and so INSERT takes O(lg(n)) time." Now, I know the resulting complexity of insertion in a binary heap is as mentioned, but about the number of exchanges in the worst case: should it not be floor(lg(n)) instead of floor(lg(n))-1? The book's errata says nothing regarding this, so I was wondering if I missed something.
Thanks and Regards,
Aditya
You can easily show that it's floor(lg(n)). Consider this binary heap:
  3
 / \
5   7
To insert the value 1, you first add it to the end of the heap:
    3
   / \
  5   7
 /
1
So there are 4 items in the heap. It's going to take two swaps to move the item 1 to the root. floor(lg(4)) is equal to 2.
floor(lg(n)) is the correct expression for the maximum number of edges on a path between a leaf and the root, and when you do swaps, you may end up doing one swap for each edge. So floor(lg(n)) is the correct answer for the worst-case number of swaps. The author most likely confused the number of edges on the path with the number of VERTICES on the path when they were writing. If you have V vertices on the path between the leaf and the root, then V-1 is the number of edges so V-1 is the number of swaps you might do in the worst-case.
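You can confirm the count by instrumenting the sift-up step (a small sketch; the helper name is mine):

```python
import math

def sift_up_swaps(heap, value):
    """Append `value` to a min-heap stored as a list, sift it up,
    and return how many swaps were performed."""
    heap.append(value)
    i, swaps = len(heap) - 1, 0
    while i > 0 and heap[(i - 1) // 2] > heap[i]:
        parent = (i - 1) // 2
        heap[parent], heap[i] = heap[i], heap[parent]
        i, swaps = parent, swaps + 1
    return swaps

heap = [3, 5, 7]
print(sift_up_swaps(heap, 1))            # -> 2 swaps
print(math.floor(math.log2(len(heap))))  # -> 2, i.e. floor(lg 4)
```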
Check if 2 tree nodes are related (i.e. ancestor-descendant)
solve it in O(1) time, with O(N) space (N = # of nodes)
pre-processing is allowed
That's it. My solution (approach) follows below. Stop here if you want to think about it yourself first.
For pre-processing I decided to do a pre-order traversal (recursively visit the root first, then the children) and give a label to each node.
Let me explain the labels in detail. Each label is a sequence of comma-separated natural numbers like "1,2,1,4,5"; the length of the sequence equals the node's depth + 1. E.g. the label of the root is "1", the root's children have labels "1,1", "1,2", "1,3", etc. Next-level nodes have labels like "1,1,1", "1,1,2", ..., "1,2,1", "1,2,2", ...
Assume that the "order number" of a node is its 1-based index in the children list of its parent.
Common rule: a node's label consists of its parent's label, followed by a comma and the node's "order number".
Thus, to answer whether two nodes are related (i.e. ancestor-descendant) in O(1), I check whether the label of one of them is a prefix of the other's label. Though I'm not sure such labels can be considered to occupy O(N) space.
Any critique with fixes, or an alternative approach, is welcome.
You can do it in O(n) preprocessing time, and O(n) space, with O(1) query time, if you store the preorder number and postorder number for each vertex and use this fact:
For two given nodes x and y of a tree T, x is an ancestor of y if and
only if x occurs before y in the preorder traversal of T and after y
in the post-order traversal.
(From this page: http://www.cs.arizona.edu/xiss/numbering.htm)
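That fact turns into a single DFS preprocessing pass plus an O(1) check. A sketch, assuming the tree is given as a child-adjacency dict (the representation and names are mine):

```python
def preprocess(tree, root):
    """Assign preorder and postorder numbers to every node.
    `tree` maps each node to a list of its children."""
    pre, post = {}, {}
    counter = [0, 0]

    def dfs(u):
        pre[u] = counter[0]; counter[0] += 1
        for v in tree.get(u, []):
            dfs(v)
        post[u] = counter[1]; counter[1] += 1

    dfs(root)
    return pre, post

def is_ancestor(pre, post, x, y):
    """x is a proper ancestor of y iff x comes before y in preorder
    and after y in postorder."""
    return pre[x] < pre[y] and post[x] > post[y]

tree = {'a': ['b', 'c'], 'b': ['d']}
pre, post = preprocess(tree, 'a')
print(is_ancestor(pre, post, 'a', 'd'))  # -> True
print(is_ancestor(pre, post, 'b', 'c'))  # -> False
```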
What you did in the worst case is Theta(d) where d is the depth of the higher node, and so is not O(1). Space is also not O(n).
if you consider a tree where a node at depth n/2 has n/2 children (say), each of those children gets a label of length proportional to n, so setting the labels takes O(n*n) time and space. So this labeling scheme won't work.
There are linear-time lowest common ancestor algorithms (at least offline). For instance, have a look here. You can also look at Tarjan's offline LCA algorithm. Note that these articles require that you know in advance the pairs for which you will be performing the LCA. I think there are also online algorithms with linear precomputation time, but they are very complex. For instance, there is a linear-precomputation-time algorithm for the range minimum query problem; as far as I remember, that solution passes through the LCA problem twice. The problem with the algorithm is that it has such a large constant that it requires enormous input to actually be faster than the O(n*log(n)) algorithm.
There is a much simpler approach that requires O(n*log(n)) additional memory and again answers in constant time.
Hope this helps.
I feel that they are very similar to each other, except for some concepts. In external sorting their function is basically the same: to find the minimal/maximal value among k runs. So are there significant differences between the two?
For the most part, loser trees and heaps are quite similar. However, there are a few important distinctions. The loser tree, because it provides the loser of each match, will contain repeat nodes. Since the heap is a data-storing structure, it won't contain these redundancies. Another difference between the two is that the loser tree must be a full binary tree (because it is a type of tournament tree), but the heap does not necessarily have to be binary.
Finally, to understand a specific quality of the loser tree, consider the following problem:
Suppose we have k sequences, each of which is sorted in nondecreasing order, that are to be merged into one sequence in nondecreasing order. This can be achieved by repeatedly transferring the element with the smallest key to an output array. The smallest key has to be found from the leading elements in the k sequences. Ordinarily, this would require k − 1 comparisons for each element transferred. However, with a loser tree, this can be reduced to log2 k comparisons per element.
Source: Handbook of Data Structures and Applications, Dinesh Mehta
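For comparison, here is that k-way merge done with a binary min-heap (Python's heapq); a loser tree would play the same role of finding the next-smallest leading element, but with fewer comparisons per transferred element:

```python
import heapq

def k_way_merge(runs):
    """Merge k sorted sequences into one sorted list using a min-heap.
    Heap entries are (value, run index, position within the run)."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):  # refill from the run we just consumed
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(k_way_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```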