Ambiguity in Huffman Tree - algorithm

I want to create a Huffman tree for four symbols a,b,c,d occurring with frequencies 5,4,3,2.
For simplicity, I'll ignore the symbol labels themselves and focus only on the frequency labels.
The first step, is to create 4 single vertex trees whose roots are labelled with the frequencies 5,4,3,2.
Next, merge the two single vertex trees whose roots are labelled 2,3 to get a three vertex tree whose root is labelled 3+2=5 and whose children are labelled 2,3.
The next step is to merge the tree labelled 4 with one of the two other trees, both of whose roots are labeled with frequency 5.
Which one to we choose?
The two choices lead to very different trees and Huffman codes. Specifically, one choice leads to a code with codeword lengths 1,2,3,3 and the other choice leads to codeword lengths 2,2,2,2.
In this particular case, the former code can't possibly be correct because its expected codeword length is longer ... this would violate the famous optimality of Huffman codes.
What is a general prescription for constructing trees that always yields an optimal Huffman code?

Mark Adler clarified this. For the record,
The expected codeword length for both codes proposed above are exactly 2 bits.
For the first code:
$5/14 * 1 + 4/14 * 2 + 3/14 * 3 + 2/14 * 3 = 2$
For the second code:
$5/14 * 2 + 4/14 * 2 + 3/14 * 2 + 2/14 * 2 = 2$
I guess that whichever way I break ties, the resulting code is optimal!
Thanks Mark!

Related

How to determine if two binary trees are equal or different

The picture below is the case of different binary trees that can be made with 3 nodes.
But why is the following case not included in the number of cases?
Is it the same case as the third case from the left in the picture above? If so, I think the parent-child relationship will be different.
I'd appreciate it if you could tell me how to determine how binary trees are equal to or different from each other.
You mean 3 nodes. Your case is not included because the cases all use A as the root node, so it is easier to demonstrate the different possible combinations using always the same elements in the same order, i.e. A as root, then B -> C on top and symmetrically C -> B at the bottom.
Using B or C as the root, you could achieve the same number of variations. This number can be computed using the following formula:
In this case, n=3, so F(0)*F(2) + F(1)*F(1) + F(2)*F(0) = 2 + 1 + 2 = 5
F(0) = 1 (empty is considered one variation)
F(1) = 1
F(2) = 2
So in your picture, only one row is actually relevant if that is supposed to be a binary search tree. Also note that there is a difference between a binary tree and a binary search tree. From first article below:
As we know, the BST is an ordered data structure that allows no
duplicate values. However, Binary Tree allows values to be repeated
twice or more. Furthermore, Binary Tree is unordered.
References
https://www.baeldung.com/cs/calculate-number-different-bst
https://en.wikipedia.org/wiki/Binary_tree#Using_graph_theory_concepts
https://encyclopediaofmath.org/wiki/Binary_tree

Touching segments

Can anyone please suggest me algorithm for this.
You are given starting and the ending points of N segments over the x-axis.
How many of these segments can be touched, even on their edges, by exactly two lines perpendicular to them?
Sample Input :
3
5
2 3
1 3
1 5
3 4
4 5
5
1 2
1 3
2 3
1 4
1 5
3
1 2
3 4
5 6
Sample Output :
Case 1: 5
Case 2: 5
Case 3: 2
Explanation :
Case 1: We will draw two lines (parallel to Y-axis) crossing X-axis at point 2 and 4. These two lines will touch all the five segments.
Case 2: We can touch all the points even with one line crossing X-axis at 2.
Case 3: It is not possible to touch more than two points in this case.
Constraints:
1 ≤ N ≤ 10^5
0 ≤ a < b ≤ 10^9
Let assume that we have a data structure that supports the following operations efficiently:
Add a segment.
Delete a segment.
Return the maximum number of segments that cover one point(that is, the "best" point).
If have such a structure, we can get use the initial problem efficiently in the following manner:
Let's create an array of events(one event for the start of each segment and one for the end) and sort by the x-coordinate.
Add all segments to the magical data structure.
Iterate over all events and do the following: when a segment start, add one to the number of currently covered segments and remove it from that data structure. When a segment ends, subtract one from the number of currently covered segment and add this segment to the magical data structure. After each event, update the answer with the value of the number of currently covered segments(it shows how many segments are covered by the point which corresponds to the current event) plus the maximum returned by the data structure described above(it shows how we can choose another point in the best possible way).
If this data structure can perform all given operations in O(log n), then we have an O(n log n) solution(we sort the events and make one pass over the sorted array making a constant number of queries to this data structure for each event).
So how can we implement this data structure? Well, a segment tree works fine here. Adding a segment is adding one to a specific range. Removing a segment is subtracting one from all elements in a specific range. Get ting the maximum is just a standard maximum operation on a segment tree. So we need a segment tree that supports two operations: add a constant to a range and get maximum for the entire tree. It can be done in O(log n) time per query.
One more note: a standard segment tree requires coordinates to be small. We may assume that they never exceed 2 * n(if it is not the case, we can compress them).
An O(N*max(logN, M)) solution, where M is the medium segment size, implemented in Common Lisp: touching-segments.lisp.
The idea is to first calculate from left to right at every interesting point the number of segments that would be touched by a line there (open-left-to-right on the lisp code). Cost: O(NlogN)
Then, from right to left it calculates, again at every interesting point P, the best location for a line considering segments fully to the right of P (open-right-to-left on the lisp code). Cost O(N*max(logN, M))
Then it is just a matter of looking for the point where the sum of both values tops. Cost O(N).
The code is barely tested and may contain bugs. Also, I have not bothered to handle edge cases as when the number of segments is zero.
The problem can be solved in O(Nlog(N)) time per test case.
Observe that there is an optimal placement of two vertical lines each of which go through some segment endpoints
Compress segments' coordinates. More info at What is coordinate compression?
Build a sorted set of segment endpoints X
Sort segments [a_i,b_i] by a_i
Let Q be a priority queue which stores right endpoints of segments processed so far
Let T be a max interval tree built over x-coordinates. Some useful reading atWhat are some sources (books, etc.) from where I can learn about Interval, Segment, Range trees?
For each segment make [a_i,b_i]-range increment-by-1 query to T. It allows to find maximum number of segments covering some x in [a,b]
Iterate over elements x of X. For each x process segments (not already processed) with x >= a_i. The processing includes pushing b_i to Q and making [a_i,b_i]-range increment-by-(-1) query to T. After removing from Q all elements < x, A= Q.size is equal to number of segments covering x. B = T.rmq(x + 1, M) returns maximum number of segments that do not cover x and cover some fixed y > x. A + B is a candidate for an answer.
Source:
http://www.quora.com/What-are-the-intended-solutions-for-the-Touching-segments-and-the-Smallest-String-and-Regex-problems-from-the-Cisco-Software-Challenge-held-on-Hackerrank

Huffman Tree with Max-Height, Nice Questions?

I ran into a nice question in one Solution of Homework in DS course.
which of the following (for large n) create the most height for Huffman Tree. the elements of each sequence in following option shows the frequencies of character in input text and not shown the characters.
1) sequence of n equal numbers
2) sequence of n consecutive Fibonacci numbers.
3) sequence <1,2,3,...,n>
4) sequence <1^2,2^2,3^2,...,n^2>
Anyone could say, why this solution select (2)? thanks to anyone.
Let's analyze the various options here.
A sequence of N equal numbers means a balanced tree will be created with the actual symbols at the bottom leaf nodes.
A sequence 1-N has the property that as you start grouping the two lowest element their sum will quickly rise above other elements, here's an example:
As you can see, the groups from 4+5 and 7+8 did not by themselves contribute to the height of the tree.
After grouping the two 3-nodes into a 6, nodes 4 and 5 are the next in line, which means that each new group formed won't contribute to its height. Most will, but not all, and that's the important fact.
A sequence using squares (note: squares as in the third sequence in the question, 1^2, 2^2, 3^2, 4^2, ..., N^2, not square diagram elements) has somewhat the same behavior as a sequence of 1-N, some of the time other elements than the one that was just formed will be used, which cuts down on the height:
As you can see here, the same happened to 36+49, it did not contribute to the height of the tree.
However, the fibonacci sequence is different. As you group the two lowest nodes, their sum will at most topple the next item but not more than one of them, which means that each new group being formed will be used in the next as well, so that each new group formed will contribute to the height of the tree. This is different from the other 3 examples.

Is there a algorithm to extract the minimum number of Cartesian products from a set of formulas?

For example, we have a set of formulas as below:
B*2*j
B*3*i
B*3*j
C*2*j
C*3*i
C*3*j
D*2*i
D*2*j
D*3*i
D*3*j
And we could have three Cartesian products to represent the formulas above:
D*(2+3)*(i+j)
(B+c)*3*(i+j)
(B+C)*2*j
So the total number is 3. And we could also have:
3*(B+C+D)*(i+j)
2*(B+C)*D
2*D*(i+j)
which is also 3.
I wanna ask that is there a algorithm to determine the minimum number of Cartesian products from a set of formulas? And also come up with these products?
First, I'll write a set of formulas as terms separated by +, since the transformation you're looking for makes sense algebraically (apart from the fact that you don't want to combine numbers like 2+3 into 5).
The basic operation that you have available is factorising: combining two terms like ABC+ABD into AB(C+D). Based on your comment, you can only generate new factors that consist of a sum of single-factor terms, like C+D in the previous example; you're not allowed to factorise e.g. ABCD+ABDE into AB(CD+DE).
You can factorise 2 k-factor terms if and only if they share exactly k-1 factors. (E.g. k=3 in my ABC+ABD example.) Every such factorisation reduces the number of terms in the set by 1: 2 are removed and 1 is added back in.
Doing this multiple times works when combining 3 or more terms: ABC+ABD+ABE can first be factorised into AB(C+D)+ABE and then those 2 terms factorised again into AB(C+D+E). Notice that it doesn't matter in which order we list terms in a sum or factors in a product, and nor does it matter in which order we perform factorisation steps when building a factor containing 3 or more terms.
We can then frame the problem as a search problem in a graph, in which the start vertex corresponds to the original formula (B*2*j + B*3*i + ... + D*3*j in your example) and from each vertex v there emanate arcs to its child vertices, which each correspond to the result of performing some factorisation on v. v will have a child vertex for each possible factorisation that could be performed on it; if there are m terms in v, then this means it could have up to m(m-1)/2 children in the worst case, because it could be that all m terms share a full complement of k-1 factors, meaning that any pair of them could be combined.
If a vertex has no pair of terms that can be combined via factorisation then it is a "leaf" -- it has no children, and can't be processed further. What we want to find is a leaf vertex that has the fewest number of terms. Since every factorisation, corresponding to an arc in the graph, reduces the number of terms by 1, this is equivalent to searching for a deepest-possible vertex. This can be done using DFS or BFS. Note however that the same expression (vertex) can be generated many times over using this approach, so it will be crucial for performance to maintain a hashtable seen that records all expressions that have already been processed; then if we visit a vertex, try to generate a child for it, and see that this child is already in seen, we avoid visiting this child a second time.
To mitigate against the phenomenon of the same expression being generated via multiple different orderings of the same set of factorisations, you can add a rule: order v's child factorisations somehow, so that if there are n children they correspond to factorisations 1, 2, ..., n in this ordering, and record in a separate "already skipped" field in each child vertex the set of earlier (in the ordering) factorisations that were skipped over to generate this child. Then, when visiting a vertex, avoid generating any of its "already skipped" factorisations as children, since doing so would create a vertex that is identical to some other existing vertex (by performing the same pair of operations in reverse order).
There are probably other speedups available that will reduce the number of duplicate vertices that are generated in the first place, but this should be enough to get results for small problems.
Write down you sum in matrix form. Then what you are asking for is the rank of that matrix, and a corresponding decomposition into dyadic products. This decomposition is far from unique.
[ 3 5 ] [ i ]
[ B C D ] * | 3 5 | * [ j ]
[ 5 5 ]
As one can see, the matrix in the middle has full rank 2
If you intend to use 2 and 3 also as variables, then you are asking to decompose a tensor of order 3 into a minimum number of terms that factorize, i.e., that are tensor products of vectors.

Eliminating symmetry from graphs

I have an algorithmic problem in which I have derived a transfer matrix between a lot of states. The next step is to exponentiate it, but it is very large, so I need to do some reductions on it. Specifically it contains a lot of symmetry. Below are some examples on how many nodes can be eliminated by simple observations.
My question is whether there is an algorithm to efficiently eliminate symmetry in digraphs, similarly to the way I've done it manually below.
In all cases the initial vector has the same value for all nodes.
In the first example we see that b, c, d and e all receive values from a and one of each other. Hence they will always contain an identical value, and we can merge them.
In this example we quickly spot, that the graph is identical from the point of view of a, b, c and d. Also for their respective sidenodes, it doesn't matter to which inner node it is attached. Hence we can reduce the graph down to only two states.
Update: Some people were reasonable enough not quite sure what was meant by "State transfer matrix". The idea here is, that you can split a combinatorial problem up into a number of state types for each n in your recurrence. The matrix then tell you how to get from n-1 to n.
Usually you are only interested about the value of one of your states, but you need to calculate the others as well, so you can always get to the next level. In some cases however, multiple states are symmetrical, meaning they will always have the same value. Obviously it's quite a waste to calculate all of these, so we want to reduce the graph until all nodes are "unique".
Below is an example of the transfer matrix for the reduced graph in example 1.
[S_a(n)] [1 1 1] [S_a(n-1)]
[S_f(n)] = [1 0 0]*[S_f(n-1)]
[S_B(n)] [4 0 1] [S_B(n-1)]
Any suggestions or references to papers are appreciated.
Brendan McKay's nauty ( http://cs.anu.edu.au/~bdm/nauty/) is the best tool I know of for computing automorphisms of graphs. It may be too expensive to compute the whole automorphism group of your graph, but you might be able to reuse some of the algorithms described in McKay's paper "Practical Graph Isomorphism" (linked from the nauty page).
I'll just add an extra answer building on what userOVER9000 suggested, if anybody else are interested.
The below is an example of using nauty on Example 2, through the dreadnaut tool.
$ ./dreadnaut
Dreadnaut version 2.4 (64 bits).
> n=8 d g -- Starting a new 8-node digraph
0 : 1 3 4; -- Entering edge data
1 : 0 2 5;
2 : 3 1 6;
3 : 0 2 7;
4 : 0;
5 : 1;
6 : 2;
7 : 3;
> cx -- Calling nauty
(1 3)(5 7)
level 2: 6 orbits; 5 fixed; index 2
(0 1)(2 3)(4 5)(6 7)
level 1: 2 orbits; 4 fixed; index 4
2 orbits; grpsize=8; 2 gens; 6 nodes; maxlev=3
tctotal=8; canupdates=1; cpu time = 0.00 seconds
> o -- Output "orbits"
0:3; 4:7;
Notice it suggests joining nodes 0:3 which are a:d in Example 2 and 4:7 which are e:h.
The nauty algorithm is not well documented, but the authors describe it as exponential worst case, n^2 average.
Computing symmetries seems to be a bit of a second order problem. Taking just a,b,c and d in your second graph, the symmetry would have to be expressed
a(b,c,d) = b(a,d,c)
and all its permutations, or some such. Consider a second subgraph a', b', c', d' added to it. Again, we have the symmetries, but parameterised differently.
For computing people (rather than math people), could we express the problem like so?
Each graph node contains a set of letters. At each iteration, all of the letters in each node are copied to its neighbours by the arrows (some arrows take more than one iteration and can be treated as a pipe of anonymous nodes).
We are trying to find efficient ways of determining things such as
* what letters each set/node contains after N iterations.
* for each node the N after which its set no longer changes.
* what sets of nodes wind up containing the same sets of letters (equivalence class)
?

Resources