Huffman code with very large variety of elements - algorithm

I am trying to implement the Huffman algorithm, as taught in my math class. However, I noticed that in the worst-case scenario the produced codewords can be longer than those of the original character encoding.
Currently I only have flowcharts to show, so this is strictly a theoretical question.
The issue is that the tree-building algorithm does not guarantee a balanced tree with evenly distributed leaves, which would minimize the tree height and therefore the produced codeword lengths.
If the element frequencies follow a sequence such as Fibonacci's,
N1 = 1, N2 = 2, Ni = Ni-1 + Ni-2,
then Nk-1 < Nk < Nk+1 always holds, and the following tree-building algorithm (sketched in code after the steps below) needs n - 1 tree levels, where n is the number of elements present:
pop the smaller two elements from list
make a new node
assign both popped nodes to either branch of the new node
the weight of the new node is the sum of the weight of its two branches
insert new node in list at the appropriate location according to weight
repeat until only one node remains
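A minimal Python sketch of these steps (names are mine; a heap stands in for the sorted list), showing that Fibonacci-like frequencies degenerate the tree to depth n - 1:

import heapq

def huffman_depths(freqs):
    # Return the depth (codeword length) of each element, given its frequency list.
    # Heap entries are (weight, tie_breaker, node); a node is a leaf index or a pair.
    heap = [(w, i, i) for i, w in enumerate(freqs)]
    heapq.heapify(heap)
    next_id = len(freqs)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)      # pop the two smallest weights
        w2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (n1, n2)))   # new internal node
        next_id += 1
    depths = {}
    def walk(node, depth):
        if isinstance(node, tuple):          # internal node: recurse into both branches
            walk(node[0], depth + 1)
            walk(node[1], depth + 1)
        else:                                # leaf: record its depth
            depths[node] = depth
    walk(heap[0][2], 0)
    return depths

fib = [1, 2, 3, 5, 8, 13, 21, 34]            # Fibonacci-like frequencies, n = 8
print(max(huffman_depths(fib).values()))     # prints 7, i.e. n - 1 levels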
Given that a Huffman codeword's length equals the depth at which its element sits in the tree, you could potentially need more bits per symbol than the uncompressed encoding uses.
Imagine encoding a file written in English. Most of the time there will be more than 10 different characters present. These characters would use 8 bits each in ASCII or UTF-8. However, to encode them using Huffman, in the worst-case scenario you would need 9 bits or more, which is worse than the original encoding.
This problem would be exacerbated with larger sets or if you used combinations of characters where the number of possible combinations would be C(n, k), where n is the number of elements present and k is the number of characters to be represented by one codeword.
Obviously, there is something I am missing. I would dearly appreciate an explanation on how to solve this, or links to quality resources where I can learn more. Thank you.
Worst-case scenario:
/\
A \
  /\
  B \
    /\
    C …
      /\
      Y Z

Related

Big-O time to generate all letter-replacement words?

Let's say you have a word, such as 'cook', and you want to generate a graph of all possible words that could be made from that word by replacing each letter with all the other letters. An important restriction: you can ask the dictionary if a collection of letters is a word, but that is the limit of your interface to the dictionary. You cannot just ask the dictionary for all n-letter words.
I would imagine this would be a recursive algorithm generating a DAG such as follows:
              cook
            /   |   \
        aook   ...   zook
       /  |  \      /  |  \
   aaok  ...  azok zaok ... zzok
And so on. Obviously in reality many of these permutations would be rejected as not being real words, but eventually you would have a DAG that contains all 'words' that can be generated. The height of the graph would be the length of the input word plus 1.
In the worst case, each permutation would be a word. The first level has 1 word, the second level 25, the next level (25*25) and so on. Thus if n is the length of the word, am I correct in thinking this would mean the algorithm has a worst-case time complexity of 25^n, and also a worst-case storage complexity of 25^n?
It is probably better not to view it as a tree, but as a graph.
Let n be the input size, which is the length of the words.
Let W be all the (meaningful) words of length n.
Construct a simple graph G as follows: the vertex set of G is W; two words w1 and w2 have an edge connecting them if and only if they differ by exactly one character.
Then what you want to do is the following:
Given a word w in W, find the connected component of w in the graph G.
This is typically done with depth-first search (DFS) or breadth-first search (BFS). Your algorithm is kind of BFS (but not quite correct: each time you should generate all neighbours of a word, not just those with one place replaced).
Since we may assume n is small, the time complexity is theoretically linear in the size of the result (although with a very large big-O constant).
However, you will also need a comparable amount of memory to record which words have already been checked.
If you generate only words that can be found in the dictionary, then the time complexity is limited by the dictionary size, so it should be O(min(|D|, 25^n)).
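To illustrate the BFS described above, here is a small Python sketch (my own; is_word is a hypothetical stand-in for the dictionary membership test the question allows):

from collections import deque
import string

def connected_words(start, is_word):
    # BFS over the graph whose vertices are valid words of len(start), with an
    # edge between two words iff they differ in exactly one character.
    # is_word is the dictionary membership test the question allows.
    seen = {start}
    queue = deque([start])
    while queue:
        w = queue.popleft()
        for pos in range(len(w)):                  # every position ...
            for c in string.ascii_lowercase:       # ... every replacement letter
                if c == w[pos]:
                    continue
                candidate = w[:pos] + c + w[pos + 1:]
                if candidate not in seen and is_word(candidate):
                    seen.add(candidate)
                    queue.append(candidate)
    return seen

toy_dictionary = {"cook", "book", "look", "loot", "boot"}
print(connected_words("cook", toy_dictionary.__contains__))
# {'cook', 'book', 'look', 'boot', 'loot'} (in some order)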

Huffman Tree with Max-Height, Nice Questions?

I ran into a nice question in a homework solution for a DS course:
Which of the following sequences (for large n) creates the greatest height for a Huffman tree? The elements of each sequence are the frequencies of the characters in the input text; the characters themselves are not shown.
1) sequence of n equal numbers
2) sequence of n consecutive Fibonacci numbers.
3) sequence <1,2,3,...,n>
4) sequence <1^2,2^2,3^2,...,n^2>
Can anyone explain why the solution selects (2)? Thanks.
Let's analyze the various options here.
A sequence of N equal numbers means a balanced tree will be created with the actual symbols at the bottom leaf nodes.
A sequence 1..N has the property that as you start grouping the two lowest elements, their sum will quickly rise above other elements; here's an example:
As you can see, the groups formed from 4+5 and 7+8 did not by themselves contribute to the height of the tree.
After grouping the two 3-nodes into a 6, nodes 4 and 5 are next in line, which means that not every newly formed group contributes to the tree's height. Most will, but not all, and that's the important fact.
A sequence using squares (note: squares as in the sequence 1^2, 2^2, 3^2, 4^2, ..., N^2 from the question, not square diagram elements) has somewhat the same behavior as the sequence 1..N: some of the time, elements other than the one just formed will be used, which cuts down on the height:
As you can see here, the same happened to 36+49, it did not contribute to the height of the tree.
However, the Fibonacci sequence is different. As you group the two lowest nodes, their sum may exceed the next item in the sequence but never more than that single item, which means that each newly formed group is used again in the very next merge, so every new group contributes to the height of the tree. This is different from the other three examples.
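For a concrete check, here is a small Python sketch (mine, not part of the original answer) that computes the height of the tree produced by the greedy merging for each of the four options; tie-breaking between equal weights can shift the exact heights slightly, but the Fibonacci sequence reliably yields height n - 1 while the others stay much lower:

import heapq

def huffman_height(freqs):
    # Height of the tree produced by repeatedly merging the two lightest nodes.
    heap = [(w, 0) for w in freqs]                 # (weight, subtree height)
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, h1 = heapq.heappop(heap)
        w2, h2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, max(h1, h2) + 1))
    return heap[0][1]

n = 20
fib = [1, 2]
while len(fib) < n:
    fib.append(fib[-1] + fib[-2])

sequences = {
    "equal": [1] * n,
    "fibonacci": fib,
    "1..n": list(range(1, n + 1)),
    "squares": [i * i for i in range(1, n + 1)],
}
for name, seq in sequences.items():
    print(name, huffman_height(seq))               # fibonacci prints n - 1 = 19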

Predict Huffman compression ratio without constructing the tree

I have a binary file and I know the number of occurrences of every symbol in it. I need to predict the length of the compressed file if I were to compress it using the Huffman algorithm. I am only interested in the hypothetical output length and not in the codes for individual symbols, so constructing the Huffman tree seems redundant.
As an illustration, I need to get something like "A binary string of 38 bits which contains 4 a's, 5 b's and 10 c's can be compressed down to 28 bits.", except both the file and the alphabet size are much larger.
The basic question is: can it be done without constructing the tree?
Looking at the greedy algorithm: http://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
it seems the tree can be constructed in n*log(n) time where n is the number of distinct symbols in the file. This is not bad asymptotically, but requires memory allocation for tree nodes and does a lot of work which in my case goes to waste.
The lower bound on the average number of bits per symbol in the compressed file is the entropy H = -sum(p(x) * log2(p(x))) over all symbols x in the input, where p(x) = freq(x) / filesize. Using this, compressed length (lower bound) = filesize * H. This is the lower bound on the compressed size of the file. Unfortunately, this entropy bound is not achievable in most cases, because codeword lengths are integral rather than fractional, so in practice the Huffman tree does need to be constructed to get the exact compressed size. Still, the optimal (entropy) size can be used as a bound on how much compression is possible, and to decide whether to use Huffman at all.
You can upper-bound the average bit count per symbol in Huffman coding by
H(p1, p2, ..., pn) + 1, where H is the entropy and each pi is the probability of symbol i occurring in the input. If you multiply this value by the input size N, it gives you an approximate length of the encoded output.
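For what it's worth, here is a small Python sketch (names are mine) that computes both bounds directly from the symbol counts, without building any tree:

import math

def huffman_length_bounds(freqs):
    # Lower/upper bounds (in bits) on the Huffman-coded length of a file whose
    # symbol counts are given in freqs, computed without building any tree.
    total = sum(freqs)
    entropy = -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)
    return total * entropy, total * (entropy + 1)

# The a/b/c example from the question: the true Huffman size (28 bits) falls
# between the two bounds.
print(huffman_length_bounds([4, 5, 10]))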
You could easily modify the algorithm to build the binary tree in an array. The root node is at index 0, its left child at index 1 and right child at index 2. In general, a node's children will be at (index * 2) + 1 and (index * 2) + 2. Granted, this still requires memory allocation, but if you know how many symbols you have, you can compute how many nodes will be in the tree. So it's a single array allocation.
I don't see where the work that's done really goes to waste. You have to keep track of the combining logic somehow, and doing it in a tree as shown is pretty simple. I know that you're just looking for the final answer--the length of each symbol--but you can't get that answer without doing the work.
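As a side note, if only the total compressed length is needed (not the per-symbol codes), the combining work can be done without allocating any tree nodes at all: the total encoded length equals the sum of all merged weights, because each merge effectively adds one bit to every symbol it covers. A rough Python sketch of that idea (mine, not from the answers above):

import heapq

def huffman_compressed_bits(freqs):
    # Exact Huffman-coded length in bits: sum(freq * codeword_length) over all
    # symbols equals the sum of all merged (internal node) weights.
    if len(freqs) == 1:
        return freqs[0]          # convention: a lone symbol still costs 1 bit each
    heap = list(freqs)
    heapq.heapify(heap)
    total_bits = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total_bits += merged     # every symbol under this merge gains one bit
        heapq.heappush(heap, merged)
    return total_bits

print(huffman_compressed_bits([4, 5, 10]))   # 28, matching the example in the question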

Efficient algorithm for eliminating nodes in "graph"?

Suppose I have a graph with 2^N - 1 nodes, numbered 1 to 2^N - 1. Node i "depends on" node j if all the bits in the binary representation of j that are 1 are also 1 in the binary representation of i. So, for instance, if N=3, then node 7 depends on all other nodes. Node 6 depends on nodes 4 and 2.
The problem is eliminating nodes. I can eliminate a node if no other nodes depend on it. No nodes depend on 7; so I can eliminate 7. After eliminating 7, I can eliminate 6, 5, and 3, etc. What I'd like is to find an efficient algorithm for listing all the possible unique elimination paths. (that is, 7-6-5 is the same as 7-5-6, so we only need to list one of the two). I have a dumb algorithm already, but I think there must be a better way.
I have three related questions:
Does this problem have a general name?
What's the best way to solve it?
Is there a general formula for the number of unique elimination paths?
Edit: I should note that a node cannot depend on itself, by definition.
Edit2: Let S = {s_1, s_2, s_3,...,s_m} be the set of all m valid elimination paths. s_i and s_j are "equivalent" (for my purposes) iff the two eliminations s_i and s_j would lead to the same graph after elimination. I suppose to be clearer I could say that what I want is the set of all unique graphs resulting from valid elimination steps.
Edit3: Note that elimination paths may be different lengths. For N=2, the 5 valid elimination paths are (),(3),(3,2),(3,1),(3,2,1). For N=3, there are 19 unique paths.
Edit4: Re: my application - the application is in statistics. Given N factors, there are 2^N - 1 possible terms in a statistical model (see http://en.wikipedia.org/wiki/Analysis_of_variance#ANOVA_for_multiple_factors) that can contain the main effects (the factors alone) and the various (2-, 3-, ... way) interactions between the factors. But an interaction can only be present in a model if all of its sub-interactions (or main effects) are present. For three factors a, b, and c, for example, the 3-way interaction a:b:c can only be present if all the constituent two-way interactions (a:b, a:c, b:c) are present (and likewise for the two-ways). Thus, the model a + b + c + a:b + a:b:c would not be allowed. I'm looking for a quick way to generate all valid models.
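As an aside, the validity condition from Edit4 (every sub-term of a term must itself be in the model) is easy to state in code; here is a small illustrative Python sketch (names are mine, purely hypothetical):

from itertools import combinations

def is_valid_model(terms):
    # A model is a set of terms; each term is a set of factors.
    # Validity: every non-empty proper sub-term of each term must also be a term.
    terms = {frozenset(t) for t in terms}
    for t in terms:
        for r in range(1, len(t)):
            for sub in combinations(t, r):
                if frozenset(sub) not in terms:
                    return False
    return True

# a + b + c + a:b + a:b:c is invalid (a:c and b:c are missing) ...
print(is_valid_model([{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "b", "c"}]))        # False
# ... while the model with all two-way interactions plus a:b:c is valid.
print(is_valid_model([{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}, {"b", "c"},
                      {"a", "b", "c"}]))                                          # True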
It seems easier to think about this in terms of sets: you are looking for families of subsets of {1, ..., N} such that for each set in the family all of its subsets are also present. Each such family is determined by its inclusion-wise maximal sets, which must be pairwise incomparable (none contains another). Families of pairwise incomparable sets are called Sperner families. So you are looking for Sperner families, plus the union of all the subsets in the family. Possibly known algorithms for enumerating Sperner families or antichains in general are useful; without knowing what you actually want to do with them, it's hard to tell.
Thanks to @FalkHüffner's answer, I saw that what I wanted to do was equivalent to finding monotonic Boolean functions of N arguments. If you look at the figure on the Wikipedia page for Dedekind numbers (http://en.wikipedia.org/wiki/Dedekind_number), the figure expresses the problem graphically. There is an algorithm for generating monotonic Boolean functions (http://www.mathpages.com/home/kmath094.htm) and it is quite simple to implement.
For my purposes, I use the algorithm, then eliminate the first column and last row of the resulting binary arrays. Starting from the top row down, each row has a 1 in the ith column if one can eliminate the ith node.
Thanks!
You can build a "heap", in which at depth X are all the nodes with X zeros in their binary representation.
Then, starting from the bottom layer, connect each item to a random parent at the layer above, until you get a single-component graph.
Note that this graph is a tree, i.e., each node except for the root has exactly one parent.
Then, traverse the tree (starting from the root) and count the total number of paths in it.
UPDATE:
The method above is bad, because you cannot just pick a random parent for a given item - you have a limited number of items from which you can pick a "legal" parent... But I'm leaving this method here for other people to give their opinion (perhaps it is not "that bad").
In any case, why don't you take your graph, extract a spanning tree (you can use Prim's algorithm or Kruskal's algorithm for finding a minimum spanning tree), and then count the number of paths in it?

How to efficiently find a contiguous range of used/free slots from a Fenwick tree

Assume that I am tracking the usage of slots in a Fenwick tree. As an example, let's consider tracking 32 slots, leading to the Fenwick tree layout shown in the image below. The numbers in the grid indicate the index in the underlying array of counts manipulated by the Fenwick tree, and the value in each cell is the number of "used" items in that segment (e.g. array cell 23 stores the number of used slots in the range [16-23]). The items at the lowest level (cells 0, 2, 4, ...) can only have the value 1 (used slot) or 0 (free slot).
What I am looking for is an efficient algorithm to find the first range of a given number of contiguous free slots.
To illustrate, suppose I have the Fenwick tree shown in the image below in which a total of 9 slots are used (note that the light gray numbers are just added for clarity, not actually stored in the tree's array cells).
Now I would like to find e.g. the first contiguous range of 10 free slots, which should find this range:
I can't seem to find an efficient way of doing this, and it is giving me a bit of a headache. Note, that as the required amount of storage space is critical for my purposes, I do not wish to extend the design to be a segment tree.
Any thoughts and suggestions on an O(log N) type of solution would be very welcome.
EDIT
Time for an update after bounty period has expired. Thanks for all comments, questions, suggestions and answers. They have made me think things over again, taught me a lot and pointed out to me (once again; one day I may learn this lesson) that I should focus more on the issue I want to solve when asking questions.
Since @Erik P was the only one that provided a reasonable answer to the question that included the requested code/pseudo code, he will receive the bounty.
He also pointed out correctly that O(log N) search using this structure is not going to be possible. Kudos to @DanBjorge for providing a proof that made me think about worst case performance.
The comment and answer of @EvgenyKluev made me realize I should have formulated my question differently. In fact I was already doing in large part what he suggested (see https://gist.github.com/anonymous/7594508 - which shows where I got stuck before posting this question), and asked this question hoping there would be an efficient way to search contiguous ranges, thereby preventing changing this design to a segment tree (which would require an additional 1024 bytes). It appears however that such a change might be the smart thing to do.
For anyone interested, a binary encoded Fenwick tree matching the example used in this question (32 slot fenwick tree encoded in 64 bits) can be found here: https://gist.github.com/anonymous/7594245.
I think the easiest way to implement all the desired functionality with O(log N) time complexity, and at the same time minimize memory requirements, is to use a bit vector to store all the 0/1 (free/used) values. The bit vector can substitute for the 6 lowest levels of both the Fenwick tree and the segment tree (if implemented as 64-bit integers). So the height of these trees may be reduced by 6, and the space requirements for each of these trees would be 64 (or 32) times less than usual.
The segment tree may be implemented as an implicit binary tree sitting in an array (just like the well-known max-heap implementation): the root node is at index 1, the left child of the node at index i is placed at 2*i, the right child at 2*i+1. This means twice as much space is needed compared to the Fenwick tree, but since the tree height is cut by 6 levels, that's not a big problem.
Each segment tree node should store a single value: the length of the longest contiguous sequence of "free" slots starting at a point covered by this node (or zero if there is no such starting point). This makes the search for the first range of a given number of contiguous zeros very simple: start from the root, then choose the left child if it contains a value greater than or equal to the required one, otherwise choose the right child. After arriving at some leaf node, check the corresponding word of the bit vector (for a run of zeros in the middle of the word).
Update operations are more complicated. When changing a value to "used", check the appropriate word of the bit vector; if it is empty, ascend the segment tree to find a nonzero value for some left descendant, then descend the tree to get to the rightmost leaf with this value, then determine how the newly added slot splits the "free" interval into two halves, then update all parent nodes for both the added slot and the starting node of the interval being split, and also set a bit in the bit vector. Changing a value to "free" may be implemented similarly.
If obtaining the number of nonzero items in some range is also needed, implement a Fenwick tree over the same bit vector (but separate from the segment tree). There is nothing special in the Fenwick tree implementation except that adding together the 6 lowest nodes is replaced by a "population count" operation on a word of the bit vector. For an example of using a Fenwick tree together with a bit vector, see the first solution for Magic Board on CodeChef.
All the necessary operations on the bit vector may be implemented quite efficiently using various bitwise tricks. For some of them (leading/trailing zero count and population count) you could use either compiler intrinsics or assembler instructions (depending on the target architecture).
If the bit vector is implemented with 64-bit words and the tree nodes with 32-bit words, both trees occupy 150% space in addition to the bit vector. This may be significantly reduced if each leaf node corresponds not to a single bit vector word but to a small range (4 or 8 words). For 8 words, the additional space needed for the trees would be only 20% of the bit vector size. This makes the implementation slightly more complicated, but if properly optimized, performance should be approximately the same as in the one-word-per-leaf variant. For very large data sets, performance is likely to be better (because bit vector computations are more cache-friendly than walking the trees).
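To make the search step concrete, here is a rough Python sketch of the segment-tree descent described above, simplified to one slot per leaf (rather than one 64-bit word per leaf) and with a naive rebuild instead of the incremental updates; it is only meant to illustrate the idea:

def build_nodes(bits):
    # Naively precompute, for each node of an implicit segment tree over bits
    # (0 = free, 1 = used), the longest run of zeros *starting* inside that node's
    # range; the run may extend past the node's right edge, as described above.
    n = len(bits)                       # assume n is a power of two
    run = [0] * (n + 1)
    for p in range(n - 1, -1, -1):      # run[p] = zeros starting at position p
        run[p] = 0 if bits[p] else run[p + 1] + 1
    nodes = [0] * (2 * n)               # implicit tree: root at 1, children 2i, 2i+1
    for i in range(n):
        nodes[n + i] = run[i]           # leaves cover single slots
    for i in range(n - 1, 0, -1):
        nodes[i] = max(nodes[2 * i], nodes[2 * i + 1])
    return nodes

def find_free_range(nodes, k):
    # Index of the first run of >= k contiguous free slots, or -1 if none exists.
    n = len(nodes) // 2
    if nodes[1] < k:
        return -1
    i = 1
    while i < n:                        # descend, preferring the left child
        i = 2 * i if nodes[2 * i] >= k else 2 * i + 1
    return i - n

bits = [1, 0, 0, 1, 0, 0, 0, 0]
print(find_free_range(build_nodes(bits), 3))   # 4: slots 4..7 are free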
As mcdowella suggests in their answer, let K2 = K/2, rounding up, and let M be the smallest power of 2 that is >= K2. A promising approach would be to search for contiguous blocks of K2 zeroes fully contained in one size-M block, and once we've found those, check neighbouring size-M blocks to see if they contain sufficient adjacent zeroes. For the initial scan, if the number of 0s in a block is < K2, clearly we can skip it, and if the number of 0s is >= K2 and the size of the block is >= 2*M, we can look at both sub-blocks.
This suggests the following code. Below, A[0 .. N-1] is the Fenwick tree array; N is assumed to be a power of 2. I'm assuming that you're counting empty slots rather than nonempty ones; if you prefer to count nonempty slots, it's easy enough to transform from the one to the other.
initialize q as a stack data structure of triples of integers
push (N-1, N, A[N-1]) onto q
# An entry (i, j, z) represents the block [i-j+1 .. i] of length j, which
# contains z zeroes; we start with one block representing the whole array.
# We maintain the invariant that i always has at least as many trailing ones
# in its binary representation as j has trailing zeroes. (**)
initialize r as an empty list of pairs of integers
while q is not empty:
    pop an entry (i,j,z) off q
    if z < K2:
        next
    if FW(i) >= K:
        first_half := i - j/2
        # change this if you want to count nonempty slots:
        first_half_zeroes := A[first_half]
        # Because of invariant (**) above, first_half always has exactly
        # the right number of trailing 1 bits in its binary representation
        # that A[first_half] counts elements of the interval
        # [i-j+1 .. first_half].
        push (i, j/2, z - first_half_zeroes) onto q
        push (first_half, j/2, first_half_zeroes) onto q
    else:
        process_block(i, j, z)
This lets us process all size-M blocks with at least K/2 zeroes in order. You could even randomize the order in which you push the first and second half onto q in order to get the blocks in a random order, which might be nice to combat the situation where the first half of your array fills up much more quickly than the latter half.
Now we need to discuss how to process a single block. If z = j, then the block is entirely filled with 0s and we can look both left and right to add zeroes. Otherwise, we need to find out if it starts with >= K/2 contiguous zeroes, and if so with how many exactly, and then check if the previous block ends with a suitable number of zeroes. Similarly, we check if the block ends with >= K/2 contiguous zeroes, and if so with how many exactly, and then check if the next block starts with a suitable number of zeroes. So we will need a procedure to find the number of zeroes a block starts or ends with, possibly with a shortcut if it's at least a or at most b. To be precise: let ends_with_zeroes(i, j, min, max) be a procedure that returns the number of zeroes that the block [i-j+1 .. i] ends with, with a shortcut to return max if the result will be more than max and min if the result will be less than min. Similarly for starts_with_zeroes(i, j, min, max).
def process_block(i, j, z):
    if j == z:
        if i > j:
            a := ends_with_zeroes(i-j, j, 0, K-z)
        else:
            a := 0
        if i < N-1:
            b := starts_with_zeroes(i+j, j, K-z-a-1, K-z-a)
        else:
            b := 0
        if b >= K-z-a:
            print "Found: starting at ", i - j - a + 1
        return
    # If the block doesn't start or end with K2 zeroes but overlaps with a
    # correct solution anyway, we don't need to find it here -- we'll find it
    # starting from the adjacent block.
    a := starts_with_zeroes(i, j, K2-1, j)
    if i > j and a >= K2:
        b := ends_with_zeroes(i-j, j, K-a-1, K-a)
        if b >= K-a:
            print "Found: starting at ", i - j - a + 1
        # Since z < 2*K2, and j != z, we know this block doesn't end with K2
        # zeroes, so we can safely return.
        return
    a := ends_with_zeroes(i, j, K2-1, j)
    if i < N-1 and a >= K2:
        b := starts_with_zeroes(i+j, j, K-a-1, K-a)
        if b >= K-a:
            print "Found: starting at ", i - a + 1
Note that in the second case where we find a solution, it may be possible to move the starting point left a bit further. You could check for that separately if you need the very first position that it could start.
Now all that's left is to implement starts_with_zeroes and ends_with_zeroes. In order to check that the block starts with at least min zeroes, we can test that it starts with 2^h zeroes (where 2^h <= min) by checking the appropriate Fenwick entry; then similarly check if it starts with 2^H zeroes where 2^H >= max to short cut the other way (except if max = j, it is trickier to find the right count from the Fenwick tree); then find the precise number.
def starts_with_zeroes(i, j, min, max):
    start := i-j
    h2 := 1
    while h2 * 2 <= min:
        h2 := h2 * 2
        if A[start + h2] < h2:
            return min
    # Now h2 = 2^h in the text.
    # If you insist, you can do the above operation faster with bit twiddling
    # to get the 2log of min (in which case, for more info google it).
    while h2 < max and A[start + 2*h2] == 2*h2:
        h2 := 2*h2
    if h2 == j:
        # Walk up the Fenwick tree to determine the exact number of zeroes
        # in interval [start+1 .. i]. (Not implemented, but easy.) Let this
        # number be z.
        if z < j:
            h2 := h2 / 2
    if h2 >= max:
        return max
    # Now we know that [start+1 .. start+h2] is all zeroes, but somewhere in
    # [start+h2+1 .. start+2*h2] there is a one.
    # Maintain invariant: the interval [start+1 .. start+h2] is all zeroes,
    # and there is a one in [start+h2+1 .. start+h2+step].
    step := h2
    while step > 1:
        step := step / 2
        if A[start + h2 + step] == step:
            h2 := h2 + step
    return h2
As you see, starts_with_zeroes is pretty bottom-up. For ends_with_zeroes, I think you'd want to do a more top-down approach, since examining the second half of something in a Fenwick tree is a little trickier. You should be able to do a similar type of binary search-style iteration.
This algorithm is definitely not O(log(N)), and I have a hunch that this is unavoidable. The Fenwick tree simply doesn't give information that is that good for your question. However, I think this algorithm will perform fairly well in practice if suitable intervals are fairly common.
One quick check, when searching for a range of K contiguous slots, is to find the largest power of two less than or equal to K/2. Any K contiguous zero slots must contain at least one Fenwick-aligned range of slots of size <= K/2 that is entirely filled with zeros. You could search the Fenwick tree from the top for such chunks of aligned zeros and then look for the first one that can be extended to produce a range of K contiguous zeros.
In your example the lowest level contains 0s or 1s and the upper level contains sums of descendants. Finding stretches of 0s would be easier if the lowest level contained 0s where you are currently writing 1s and a count of the number of contiguous zeros to the left where you are currently writing zeros, and the upper levels contained the maximum value of any descendant. Updating would mean more work, especially if you had long strings of zeros being created and destroyed, but you could find the leftmost string of zeros of length at least K with a single search to the left branching left where the max value was at least K. Actually here a lot of the update work is done creating and destroying runs of 1,2,3,4... on the lowest level. Perhaps if you left the lowest level as originally defined and did a case by case analysis of the effects of modifications you could have the upper levels displaying the longest stretch of zeros starting at any descendant of a given node - for quick search - and get reasonable update cost.
@Erik covered a reasonable-sounding algorithm. However, note that this problem has a lower complexity bound of Ω(N/K) in the worst case.
Proof:
Consider a reduced version of the problem where:
N and K are both powers of 2
N > 2K >= 4
Suppose your input array is made up of (N/2K) chunks of size 2K. One chunk is of the form K 0s followed by K 1s, every other chunk is the string "10" repeated K times. There are (N/2K) such arrays, each with exactly one solution to the problem (the beginning of the one "special" chunk).
Let n = log2(N), k = log2(K). Let us also define the root node of the tree as being at level 0 and the leaf nodes as being at level n of the tree.
Note that, due to our array being made up of aligned chunks of size 2K, level n-k of the tree is simply going to be made up of the number of 1s in each chunk. However, each of our chunks has the same number of 1s in it. This means that every node at level n-k will be identical, which in turn means that every node at level <= n-k will also be identical.
What this means is that the tree contains no information that can disambiguate the "special" chunk until you start analyzing level n-k+1 and lower. But since all but 2 of the (N/K) nodes at that level are identical, this means that in the worst case you'll have to examine O(N/K) nodes in order to disambiguate the solution from the rest of the nodes.
