Size constraints on DEFLATE Huffman trees

Reading through RFC 1951, it seems like a valid dynamic Huffman tree is not necessarily a full binary tree - for example, the tree specified by the bit lengths (2, 2, 2) has a node with a single child:
    ____
   /    \
  0      1
 / \    /
0   1  0
This causes an issue when attempting to allocate enough memory for the full tree in one allocation, as there is not necessarily any upper bound on the number of nodes in an arbitrary binary tree.
Do the constraints in the DEFLATE standard imply an upper bound on the size of a dynamic Huffman tree?

RFC 1951 defines "Huffman code" to be what is produced by the construction of an optimal prefix code. Therefore the Huffman codes used in a valid deflate stream must be complete, in order to be optimal. Your example is not complete, since the code 11 is unused. A (1, 2, 2) code would be complete, with no unused codes.
There is one exception to this rule in RFC 1951, which is that a solitary distance code is coded as one-bit, not zero bits:
If only one distance code is used, it is encoded using one bit, not
zero bits; in this case there is a single code length of one, with one
unused code.
zlib's inflate rejects any deflate stream with an invalid Huffman code.
There is one more subtlety, which is that the Huffman codes in deflate are length-limited to 15 bits, enforced in the encoding of the codes. This does not change the fact that the codes need to be complete.
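As an illustration, completeness can be checked directly from the code lengths with a Kraft-style count, with no tree at all (a sketch of mine, not zlib's actual code, though inflate does an equivalent running count internally). Scaled so a complete code sums to exactly 1 << MAX_BITS, a positive gap means unused codes and a negative gap means an impossible, over-subscribed code:

#include <stdio.h>

#define MAX_BITS 15

/* returns <0 over-subscribed (invalid), 0 complete, >0 incomplete */
static long kraft_gap(const int *lens, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (lens[i] > 0)
            sum += 1L << (MAX_BITS - lens[i]);
    return (1L << MAX_BITS) - sum;
}

int main(void) {
    int bad[]  = { 2, 2, 2 };   /* the (2,2,2) example: incomplete */
    int good[] = { 1, 2, 2 };   /* complete */
    printf("%ld %ld\n", kraft_gap(bad, 3), kraft_gap(good, 3));
    /* prints "8192 0": (2,2,2) leaves a gap (the unused code 11) */
    return 0;
}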
For a complete prefix code, the number of internal nodes is n-1, where n is the number of symbols coded.
Due to the length-limit subtlety, there is a limit on the number of nodes even for non-optimal prefix codes. The worst case would be to assign the maximum number of bits to all of the symbols. Then just construct the tree and count the nodes. What you end up with is a flat code tree for the symbols, with the base of that connected to the root with a line of single branch nodes to the root. So actually, not many more nodes than for an optimal prefix code, due to the way a canonical code is constructed.
As an example, if all 30 distance codes have length 15, the canonical tree is a small pyramid of internal nodes fanning out to the 30 leaves at depth 15, hanging from a chain of ten single-child nodes descending from the root. Instead of the 29 internal nodes of an optimal prefix code, this one has 40 internal nodes.
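To make the counting concrete, here is a sketch (mine, not taken from zlib) that computes the number of internal nodes of the canonical tree implied by a set of code lengths, walking up from the deepest level; the node count at each level is the ceiling of half the nodes and leaves of the level below:

#include <stdio.h>

#define MAX_BITS 15

static int internal_nodes(const int *lens, int n) {
    int leaves[MAX_BITS + 1] = {0};   /* leaves[l] = codes of length l */
    for (int i = 0; i < n; i++)
        leaves[lens[i]]++;
    int nodes = 0, level = 0;         /* level = internal nodes at level l-1 */
    for (int l = MAX_BITS; l > 0; l--) {
        level = (level + leaves[l] + 1) / 2;   /* ceiling: parents at l-1 */
        nodes += level;
    }
    return nodes;                     /* includes the root */
}

int main(void) {
    int lens[30];
    for (int i = 0; i < 30; i++)
        lens[i] = 15;                 /* all 30 distance codes length 15 */
    printf("%d internal nodes\n", internal_nodes(lens, 30));  /* prints 40 */
    return 0;
}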

Related

Huffman code with very large variety of elements

I am trying to implement the Huffman algorithm, as taught in my math class. However, I noticed that in the worst-case scenario the produced codewords can be longer than those of the original character encoding.
Currently I only have flowcharts to show. This is strictly a theoretical question.
The issue is that the tree-building algorithm does not guarantee a balanced tree in which the leaves are evenly distributed, which would minimize the tree height and therefore the produced codeword lengths.
If the element frequencies follow a sequence such as the Fibonacci numbers,
N(1) = 1, N(2) = 2, N(i) = N(i-1) + N(i-2),
so that N(k-1) < N(k) < N(k+1) always holds, then the following algorithm builds a tree with n - 1 levels, where n is the number of elements present (a sketch follows the worst-case figure below):
pop the smaller two elements from list
make a new node
assign both popped nodes to either branch of the new node
the weight of the new node is the sum of the weight of its two branches
insert new node in list at the appropriate location according to weight
repeat until only one node remains
Given that the length of a Huffman codeword equals the depth of its element's leaf, you could potentially need more bits than the uncompressed character encoding.
Imagine encoding a file written in English. Most of the time there will be more than 10 different characters present. These characters would use 8 bits each in ASCII or UTF-8. However, to encode them using Huffman codes, in the worst-case scenario you would need up to 9 bits, which is worse than the original encoding.
This problem would be exacerbated with larger sets or if you used combinations of characters where the number of possible combinations would be C(n, k), where n is the number of elements present and k is the number of characters to be represented by one codeword.
Obviously, there is something I am missing. I would dearly appreciate an explanation on how to solve this, or links to quality resources where I can learn more. Thank you.
Worst-case scenario:
 /\
A  \
   /\
  B  \
     /\
    C  …
       /\
      Y  Z
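A quick sketch of the algorithm above, applied to Fibonacci-like weights, shows the degenerate n - 1 levels (all names here are illustrative; a plain array with linear scans stands in for the sorted list):

#include <stdio.h>
#include <stdlib.h>

struct node {
    long weight;
    struct node *left, *right;         /* NULL for leaves */
};

static struct node *make(long w, struct node *l, struct node *r) {
    struct node *n = malloc(sizeof *n);
    n->weight = w; n->left = l; n->right = r;
    return n;
}

static int height(const struct node *n) {
    if (n == NULL) return -1;
    int hl = height(n->left), hr = height(n->right);
    return 1 + (hl > hr ? hl : hr);
}

int main(void) {
    enum { N = 10 };
    struct node *list[N];
    long f1 = 1, f2 = 2;
    for (int i = 0; i < N; i++) {      /* frequencies 1, 2, 3, 5, 8, ... */
        list[i] = make(f1, NULL, NULL);
        long next = f1 + f2; f1 = f2; f2 = next;
    }
    int count = N;
    while (count > 1) {
        int a = 0, b = 1;              /* indices of the two smallest */
        if (list[b]->weight < list[a]->weight) { int t = a; a = b; b = t; }
        for (int i = 2; i < count; i++) {
            if (list[i]->weight < list[a]->weight)      { b = a; a = i; }
            else if (list[i]->weight < list[b]->weight) { b = i; }
        }
        struct node *na = list[a], *nb = list[b];
        int hi = a > b ? a : b, lo = a < b ? a : b;
        list[hi] = list[--count];      /* pop both, push the merged node */
        list[lo] = list[--count];
        list[count++] = make(na->weight + nb->weight, na, nb);
    }
    printf("height = %d, n - 1 = %d\n", height(list[0]), N - 1);  /* 9, 9 */
    return 0;
}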

Is there any tree for an optimal prefix code other than the Huffman tree? Will its height be the same as that of the Huffman tree?

I know that the Huffman tree is a kind of tree used for optimal prefix codes, but is there any other kind of tree for optimal prefix codes? If there is, will its height be the same as the Huffman tree's?
Many thanks!
Huffman trees are constructed recursively by taking the two currently lowest probability symbols and combining them.
If there are other symbols with the same low probability, then these symbols can be combined instead.
This means that the final tree is not uniquely defined and there are multiple optimal prefix codes with potentially different heights.
For example, consider the symbols and probabilities below:
A 1/3
B 1/3
C 1/6
D 1/6
Can be encoded as:
A 0
B 10
C 110
D 111
or
A 00
B 01
C 10
D 11
Both encodings have an expected number of bits per symbol equal to 2, but different heights.
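Checking: the first code costs (1/3)*1 + (1/3)*2 + (1/6)*3 + (1/6)*3 = 2 bits per symbol, and the second uses 2 bits for every symbol, so both are optimal.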
However, all optimal prefix codes can be constructed by the Huffman algorithm for a suitable choice of ordering with respect to probability ties.
Within the constraints of the Huffman code problem, i.e. representing each symbol by a prefix-unique sequence of bits, there is exactly one optimal total number of bits that can be achieved, and the Huffman algorithm achieves it. There are other approaches that arrive at the same answer.
As Peter de Rivaz noted, in certain special cases the Huffman algorithm has more than one choice at some steps of which two minimum-probability subtrees to combine, which can result in different trees. So the tree height/depth you mention is not unique, but the total number of bits (the sum of the bit lengths of the symbols, weighted by their probabilities) is always the same.

Predict Huffman compression ratio without constructing the tree

I have a binary file and I know the number of occurrences of every symbol in it. I need to predict the length of the compressed file if I were to compress it with the Huffman algorithm. I am only interested in the hypothetical output length, not in the codes for the individual symbols, so constructing a Huffman tree seems redundant.
As an illustration, I need to get something like "A binary string of 38 bits which contains 4 a's, 5 b's and 10 c's can be compressed down to 28 bits.", except both the file and the alphabet size are much larger.
The basic question is: can it be done without constructing the tree?
Looking at the greedy algorithm: http://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
it seems the tree can be constructed in O(n log n) time, where n is the number of distinct symbols in the file. This is not bad asymptotically, but it requires memory allocation for the tree nodes and does a lot of work that, in my case, goes to waste.
The lower bound on the average number of bits per symbol in the compressed file is simply the entropy, H = -sum(p(x)*log2(p(x))) over all symbols x in the input, where p(x) = freq(x)/filesize. From this, compressed length (lower bound) = filesize * H. Unfortunately this bound is not achievable in most cases, because code lengths are integral rather than fractional, so in practice the Huffman tree has to be constructed to get the exact compressed size. Still, the entropy bound gives an upper bound on the amount of compression possible, and can be used to decide whether Huffman coding is worth applying at all.
You can upper-bound the average number of bits per symbol in Huffman coding by H(p1, p2, ..., pn) + 1, where H is the entropy and each pi is the probability of symbol i occurring in the input. Multiplying this value by the input size N gives an approximate length of the encoded output.
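As a sketch (function and variable names are mine), computing both bounds for the 4 a's, 5 b's, 10 c's example from the question, whose true Huffman-coded size is 28 bits:

#include <math.h>
#include <stdio.h>

/* entropy in bits/symbol; freq[i] = count of symbol i, n = alphabet size
   (compile with -lm) */
static double entropy_bits(const long *freq, int n) {
    long total = 0;
    for (int i = 0; i < n; i++)
        total += freq[i];
    double h = 0.0;
    for (int i = 0; i < n; i++) {
        if (freq[i] == 0) continue;
        double p = (double)freq[i] / total;
        h -= p * log2(p);
    }
    return h;
}

int main(void) {
    long freq[] = { 4, 5, 10 };        /* 4 a's, 5 b's, 10 c's */
    long filesize = 19;
    double h = entropy_bits(freq, 3);
    printf("lower bound %.1f bits, upper bound %.1f bits\n",
           filesize * h, filesize * (h + 1));
    /* prints roughly 27.9 and 46.9; the actual Huffman size is 28 bits */
    return 0;
}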
You could easily modify the algorithm to build the binary tree in an array. The root node is at index 0, its left node at index 1 and right node at index 2. In general, a node's children will be at (index*2) + 1 and (index * 2) + 2. Granted, this still requires memory allocation, but if you know how many symbols you have you can compute how many nodes will be in the tree. So it's a single array allocation.
I don't see where the work that's done really goes to waste. You have to keep track of the combining logic somehow, and doing it in a tree as shown is pretty simple. I know that you're just looking for the final answer--the length of each symbol--but you can't get that answer without doing the work.
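That said, if only the total length is needed, the combining logic can be tracked with a bare array of weights and no node structures at all: each merge of the two smallest weights creates one internal node, and the total compressed size equals the sum of all those merged weights. A sketch (mine, not from either answer above), again on the question's example:

#include <stdio.h>

int main(void) {
    long w[] = { 4, 5, 10 };   /* the 4 a's, 5 b's, 10 c's example */
    int n = 3;
    long total = 0;            /* running sum of merged (internal) weights */
    while (n > 1) {
        int a = 0, b = 1;      /* indices of the two smallest weights */
        if (w[b] < w[a]) { int t = a; a = b; b = t; }
        for (int i = 2; i < n; i++) {
            if (w[i] < w[a])      { b = a; a = i; }
            else if (w[i] < w[b]) { b = i; }
        }
        long merged = w[a] + w[b];
        total += merged;       /* each merge adds one internal node */
        w[a] = merged;         /* replace the pair with their sum */
        w[b] = w[--n];
    }
    printf("%ld bits\n", total);   /* prints 28 */
    return 0;
}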

What is the advantage of a full binary tree for Huffman code?

I am studying Huffman coding for bit-encoding a stream of characters, and read that an optimal code would be represented by a full binary tree, where each distinct character is represented by a leaf and every internal node has exactly two children.
I want to know why the full binary tree is the optimal choice here. In other words, what is the advantage of a full binary tree here?
This is not a choice, but rather equivalence.
Optimal Huffman codes are decoded by a finite state machine, in which
each state has exactly two exits (the next bit being 0 or 1)
each state has exactly one entry
all states containing output symbols are stop states, and
all stop states contain output symbols
This is equivalent to a search tree where
all internal nodes have exactly two children
all nodes have exactly one parent
all nodes containing output symbols are leaf nodes, and
all leaf nodes contain output symbols
There are non-optimal Huffman codes as well, which have stop states / leaf nodes that do not contain output symbols. Such a binary tree would not be full.
Proof by contradiction:
Let us say that a tree T which is not a full binary tree provides optimal Huffman codes for the given characters and their frequencies. As T is not a full binary tree, there exists a node N which has only one child C.
Construct a new binary tree T' by replacing N with C. The depth of every leaf under C is reduced by 1 in T' compared to T, so T' provides a better solution than T, which contradicts the assumption that T is optimal.
    T            T'
   / \          / \
  .   N        .   C
     /
    C
You asked why a full binary tree. That is actually three questions.
If you're asking about "full", then it must be full for any correctly generated Huffman code.
If you're asking about "binary", every encountered bit in a Huffman code has two possibilities, 0 or 1, so each node must have two branches.
If however you're asking about "tree", you do not need to represent the code as a tree at all. There are many representations that not only represent the code completely, but also that facilitate both a shorter representation in the compressed stream and faster decoding, than a tree would.
Examples are using a canonical Huffman code, and representing it simply as the counts of symbols at each bit length, and a list of corresponding symbols. This is used in the puff.c code. Or you can generate a set of tables that decode several bits at a time in stages, which is used in zlib's inflate. There are others.
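As a sketch, here is the canonical construction given in RFC 1951 itself (section 3.2.2), which computes the codes from the lengths alone: count the codes of each length, compute the smallest code for each length, then hand out consecutive codes:

#include <stdio.h>

#define MAX_BITS 15

/* lens[i] = code length of symbol i (0 = unused); codes[i] receives the
   canonical code for symbol i, read MSB-first in lens[i] bits. */
static void canonical_codes(const int *lens, int n, unsigned *codes) {
    int bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};

    for (int i = 0; i < n; i++)
        bl_count[lens[i]]++;
    bl_count[0] = 0;

    unsigned code = 0;                 /* smallest code of each length */
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    for (int i = 0; i < n; i++)        /* consecutive codes per length */
        if (lens[i] != 0)
            codes[i] = next_code[lens[i]]++;
}

int main(void) {
    int lens[] = { 1, 2, 2 };          /* the complete (1,2,2) code above */
    unsigned codes[3];
    canonical_codes(lens, 3, codes);
    for (int i = 0; i < 3; i++) {
        printf("symbol %d: ", i);
        for (int b = lens[i] - 1; b >= 0; b--)
            putchar('0' + ((codes[i] >> b) & 1));
        putchar('\n');                 /* prints 0, 10, 11 */
    }
    return 0;
}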

How to decode huffman tree?

Is there a better way than just going left or right based on the input bit, 0 or 1?
There are some papers on efficient decoding algorithms for Huffman Trees. Personally, I only used one of them, for academic reasons, but it was a long time ago. The title of the paper was "A Memory Efficient and Fast Huffman Decoding Algorithm" by Hong-Chung Chen, Yue-Li Wang and Yu-Feng Lan.
The algorithm gives results in O(log n) time. In order to use this algorithm, you have to construct a table with all the symbols of your tree (leaves), and for each symbol you have to specify a weight:
w(i) = 2 ^ (h - l)
where h is the height of Huffman Tree and l is the level of the symbol, and a count:
count(i) = count(i-1) + w(i)
The count of root, count(0), is equal to its weight.
When you have all that, there are 3 simple steps in the algorithm, that are described in the paper.
I don't know if this is what you were looking for.
Yes, there is, and you can use lookup tables.
Note that you will use quite a bit of memory to store these tables, and you will either have to ship that table with the data (probably negating the effect of the compression altogether) or construct the table before decompressing (which would negate some, if not all, of the speedup you gain from it.)
Here's how the table would work.
Let's say, for a sample portion of compressed data that the bit-sequences for symbols are as follows:
1 -> A
01 -> B
00 -> C
What you do now is to produce a table, indexed by a byte (since you need to read 1 byte minimum to decode the first symbol), containing entries like this:
key       symbol  bit-shift
1xxxxxxx  A       1
01xxxxxx  B       2
00xxxxxx  C       2
The x's mean you need to store entries for all possible combinations of those x's, all mapping to that data; the bit-shift column is the number of bits the symbol's code actually consumes. For the first row, this means you would construct a table where every byte key with the high bit set maps to A/1.
The table would have entries for all 256 key values: half of them map to A/1, and a quarter each to B/2 and C/2, according to the rules above.
Of course, if the longest bit sequence for a symbol is 9-16 bits, you need a table keyed by a 16-bit integer, and so on.
Now, when you decode, here's how you would do this:
read first and second byte as key, append them together
loop:
look up the first 8 bits of the current key in the table
emit symbol found in table
shift the key bitwise to the left the number of bits specified in the table
if less than 8 bits left in the key, get next byte and append to key
when at end, just pad with 0-bytes, and as with all Huffman decompression you need to know how many symbols to emit before you start.
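Put together, here is a sketch of that loop in C (the table layout and names are my own illustration), using the A/B/C code above and a 16-bit window; note the shift per symbol is the number of bits that symbol's code consumed:

#include <stdint.h>
#include <stdio.h>

struct entry { char symbol; int bits; };   /* decoded symbol, bits consumed */

int main(void) {
    /* build the 256-entry table: index = the next 8 bits of the stream */
    struct entry table[256];
    for (int k = 0; k < 256; k++) {
        if (k & 0x80)      table[k] = (struct entry){ 'A', 1 };  /* 1xxxxxxx */
        else if (k & 0x40) table[k] = (struct entry){ 'B', 2 };  /* 01xxxxxx */
        else               table[k] = (struct entry){ 'C', 2 };  /* 00xxxxxx */
    }

    /* "ABCA" encodes as 1 01 00 1, zero-padded to the byte 10100100 */
    const uint8_t data[] = { 0xA4, 0x00 };   /* second byte is padding */
    int nsymbols = 4;            /* must be known before decoding starts */

    uint32_t key = ((uint32_t)data[0] << 8) | data[1]; /* 16-bit window */
    int avail = 16, pos = 2;

    for (int i = 0; i < nsymbols; i++) {
        struct entry e = table[(key >> 8) & 0xFF];  /* top 8 bits */
        putchar(e.symbol);
        key = (key << e.bits) & 0xFFFF;             /* drop consumed bits */
        avail -= e.bits;
        if (avail < 8 && pos < (int)sizeof data) {  /* refill the window */
            key |= (uint32_t)data[pos++] << (8 - avail);
            avail += 8;
        }
    }
    putchar('\n');                                  /* prints ABCA */
    return 0;
}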
Sure: instead of binary trees you can use k-ary trees, so a walk takes O(log_k n) steps instead of O(log_2 n). It's not much of a speedup.
If the maximum code length is small (say, < 8 bits) or you've got lots of memory, you can use a straight lookup table and get O(1).
