Predict Huffman compression ratio without constructing the tree - algorithm

I have a binary file and I know the number of occurrences of every symbol in it. I need to predict the length of the compressed file IF I were to compress it using the Huffman algorithm. I am only interested in the hypothetical output length and not in the codes for individual symbols, so constructing the Huffman tree seems redundant.
As an illustration I need to get something like "A binary string of 38 bits which contains 4 a's, 5 b's and 10 c's can be compressed down to 28 bits.", except both the file and the alphabet size are much larger.
The basic question is: can it be done without constructing the tree?
Looking at the greedy algorithm: http://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
it seems the tree can be constructed in n*log(n) time where n is the number of distinct symbols in the file. This is not bad asymptotically, but requires memory allocation for tree nodes and does a lot of work which in my case goes to waste.

The lower bound on the average number of bits per symbol in the compressed file is the entropy H = -sum(p(x) * log2(p(x))) over all symbols x in the input, where p(x) = freq(x)/filesize. The lower bound on the compressed length is therefore filesize * H. Unfortunately this optimum is not achievable in most cases, because code lengths must be whole bits rather than fractions of a bit, so in practice the Huffman tree has to be constructed to get the exact compressed size. Still, the optimal size is useful as a bound on the best possible compression and for deciding whether to use Huffman at all.

You can upper-bound the average number of bits per symbol in Huffman coding by H(p1, p2, ..., pn) + 1, where H is the entropy and each pi is the probability of symbol i occurring in the input. Multiplying this value by the input size N gives an approximate (upper-bound) length for the encoded output.
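Putting the two bounds together, here's a minimal sketch in Python (the function name and the dict-of-counts input format are just for illustration):

from math import log2

def huffman_length_bounds(freq):
    """Entropy-based bounds on the Huffman-coded length, in bits.

    freq: dict mapping each symbol to its number of occurrences.
    Returns (lower_bound, upper_bound) where lower = N*H and upper = N*(H+1).
    """
    n = sum(freq.values())
    entropy = -sum((c / n) * log2(c / n) for c in freq.values() if c)
    return n * entropy, n * (entropy + 1)

# Example from the question: 4 a's, 5 b's, 10 c's.
lo, hi = huffman_length_bounds({'a': 4, 'b': 5, 'c': 10})
print(lo, hi)   # ~27.9 and ~46.9 bits; the exact Huffman length here is 28 bits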

You could easily modify the algorithm to build the binary tree in an array. The root node is at index 0, its left node at index 1 and right node at index 2. In general, a node's children will be at (index*2) + 1 and (index * 2) + 2. Granted, this still requires memory allocation, but if you know how many symbols you have you can compute how many nodes will be in the tree. So it's a single array allocation.
I don't see where the work that's done really goes to waste. You have to keep track of the combining logic somehow, and doing it in a tree as shown is pretty simple. I know that you're just looking for the final answer--the length of each symbol--but you can't get that answer without doing the work.
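For what it's worth, if you only want the exact number of bits, without storing codes or allocating explicit tree nodes, you can exploit the standard fact that the total Huffman-coded length equals the sum of the weights of all merged (internal) nodes. A minimal sketch, assuming the counts arrive as a dict:

import heapq

def huffman_compressed_bits(freq):
    """Exact Huffman-coded length in bits, computed from the frequencies alone."""
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b           # each merge adds its combined weight to the total length
        heapq.heappush(heap, a + b)
    return total

print(huffman_compressed_bits({'a': 4, 'b': 5, 'c': 10}))   # 28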

Related

Huffman code with very large variety of elements

I am trying to implement the Huffman algorithm, as taught in my math class. However, I noticed that in the worst-case scenario, the produced code would be larger than the actual charset encoding.
Currently I only have flowcharts to show. This is strictly a theoretical question.
The issue is that the tree-building algorithm does not guarantee a balanced tree in which the leaves are evenly distributed, which would minimize the tree height and therefore the produced codeword lengths.
If the element frequencies follow a sequence such as Fibonacci's,
N1 = 1, N2 = 2, Ni = Ni-1 + Ni-2,
so that Nk-1 < Nk < Nk+1 always holds,
then using the following algorithm to build the tree requires n - 1 tree levels, where n is the number of elements present:
pop the two smallest elements from the list
make a new node
assign both popped nodes to either branch of the new node
the weight of the new node is the sum of the weight of its two branches
insert new node in list at the appropriate location according to weight
repeat until only one node remains
Given that Huffman codeword lengths have a 1:1 relation to the depth at which an element sits in the tree, you could potentially need more bits per symbol than the uncompressed encoding uses.
Imagine encoding a file written in English. Most of the time there will be more than 10 different characters present. These characters would use 8 bits in ASCII or UTF-8. However, to encode them using Huffman, in the worst-case scenario you would need up to 9 bits, which is worse than the original encoding.
This problem would be exacerbated with larger sets or if you used combinations of characters where the number of possible combinations would be C(n, k), where n is the number of elements present and k is the number of characters to be represented by one codeword.
Obviously, there is something I am missing. I would dearly appreciate an explanation on how to solve this, or links to quality resources where I can learn more. Thank you.
Worst-case scenario:
  /\
 A  \
     /\
    B  \
        /\
       C  …
           /\
          Y  Z
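For what it's worth, here is a small sketch that reproduces the degenerate case described above, running the merge procedure while tracking only the depth of each subtree (the helper name is illustrative):

import heapq

def max_codeword_length(freqs):
    """Length of the longest Huffman codeword for the given frequencies."""
    # Each heap entry carries (weight, depth of its deepest leaf so far).
    heap = [(f, 0) for f in freqs]
    heapq.heapify(heap)
    while len(heap) > 1:
        wa, da = heapq.heappop(heap)
        wb, db = heapq.heappop(heap)
        heapq.heappush(heap, (wa + wb, max(da, db) + 1))
    return heap[0][1]

fib = [1, 2]
while len(fib) < 10:
    fib.append(fib[-1] + fib[-2])
print(max_codeword_length(fib))   # 9, i.e. n - 1 levels for n = 10 elements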

Given a continuous stream of words, remove the duplicates

I was asked this question recently.
Given a continuous stream of words, remove the duplicates while reading the input.
Example:
Input: This is next stream of question see it is a question
Output: This next stream of see it is a question
Starting from the end, question as well as is have already appeared once, so their earlier occurrences are ignored.
My solution:
Use hashing in this scenario for each word coming through stream.
If there is a collision, then ignore that word.
It's definitely not a good solution. I was asked to optimize it.
What is the best approach to solve this problem?
Hashing isn't a particularly bad solution.
It gives expected O(wordLength) lookup time, but O(wordLength * wordCount) in the worst case, and uses O(maxWordLength * wordCount) space.
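For reference, the hashing approach really is just a set; here is a minimal sketch of the streaming version, which keeps the first occurrence of each word (the natural behaviour when you cannot buffer the stream):

def dedupe_stream(words):
    """Yield each word the first time it is seen, dropping later duplicates."""
    seen = set()
    for word in words:
        if word not in seen:
            seen.add(word)
            yield word

stream = "This is next stream of question see it is a question".split()
print(" ".join(dedupe_stream(stream)))
# This is next stream of question see it a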
Alternatives:
Trie
A trie is a tree data structure where each edge corresponds to a letter and the path from the root defines the value of the node.
This will give O(wordLength) lookup time and uses O(wordCount * maxWordLength) space, although the actual space usage may be lower, as repeated prefixes (e.g. the shared te of tea and ten) only use space once.
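A minimal sketch of a trie used this way, with plain dicts as the child maps (the '$' end-of-word marker is just a convention chosen here):

class Trie:
    def __init__(self):
        self.root = {}

    def add_if_new(self, word):
        """Insert word; return True if it was not present before."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        is_new = '$' not in node          # '$' marks end-of-word
        node['$'] = True
        return is_new

trie = Trie()
for w in ["tea", "ten", "tea"]:
    print(w, trie.add_if_new(w))   # tea True, ten True, tea False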
Binary search tree
A binary search tree is a tree data structure where each node in the subtree rooted at the left child is smaller than its parent, and similarly all nodes to the right are greater.
A self-balancing one gives O(wordLength * log wordCount) lookup time and uses O(wordCount * maxWordLength) space.
Bloom filter
A bloom filter is a data structure consisting of some number of bits and a few hash functions, each of which maps a word to a bit; adding a word sets the bit produced by each hash function, and a query checks whether any of those bits is unset.
This uses less space than the above solutions, but at the cost of false positives - some words will be marked as duplicates that aren't.
Specifically, it uses 1.44 log2(1/e) bits per key, where e is the false positive rate, giving O(wordCount) space usage, but with an incredibly low constant factor.
This will give O(wordLength) lookup time.
(Figure, not reproduced here: an example of a Bloom filter representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3.)

Binary search for non-uniform distribution

The binary search is highly efficient for uniform distributions. Each member of your list has equal 'hit' probability. That's why you try the center each time.
Is there an efficient algorithm for non-uniform distributions? E.g. a distribution following 1/x.
There's a deep connection between binary search and binary trees - a binary tree is basically a "precalculated" binary search, where the cutting points are decided by the structure of the tree rather than chosen as the search runs. And as it turns out, dealing with probability "weights" for each key is sometimes done with binary trees.
One reason is that it's a fairly normal binary search, but known in advance, complete with knowledge of the query probabilities.
Niklaus Wirth covered this in his book "Algorithms and Data Structures", in a few variants (one for Pascal, one for Modula 2, one for Oberon), at least one of which is available for download from his web site.
Binary trees aren't always binary search trees, though, and one use of a binary tree is to derive a Huffman compression code.
Either way, the binary tree is constructed by starting with the leaves separate and, at each step, joining the two least likely subtrees into a larger subtree until there's only one subtree left. To efficiently pick the two least likely subtrees at each step, a priority queue data structure is used - perhaps a binary heap.
A binary tree that's built once then never modified can have a number of uses, but one that can be efficiently updated is even more useful. There are some weight-balanced binary tree data structures out there, but I'm not familiar with them. Beware - the term "weight balanced" is commonly used where each node always has weight 1, but subtree weights are approximately balanced. Some of these may be adaptable for varied node weights, but I don't know for certain.
Anyway, for a binary search in an array, the problem is that an arbitrary probability distribution can be used, but only inefficiently. For example, you could have a running-total-of-weights array. For each iteration of your binary search, you want the point that is half way through the probability distribution, so you compute that target weight and then search the running-total-of-weights array for it. You get the perfectly weight-balanced next choice for your main binary search, but you had to do a complete binary search into your running-total array to get it.
The principle works, however, if you can determine that weighted mid-point without searching for a known probability distribution. The principle is the same - you need the integral of your probability distribution (replacing the running total array) and when you need a mid-point, you choose it to get an exact centre value for the integral. That's more an algebra issue than a programming issue.
One problem with a weighted binary search like this is that the worst-case performance is worse - usually by constant factors but, if the distribution is skewed enough, you may end up with effectively a linear search. If your assumed distribution is correct, the average-case performance is improved despite the occasional slow search, but if your assumed distribution is wrong you could pay for that when many searches are for items that are meant to be unlikely according to that distribution. In the binary tree form, the "unlikely" nodes are further from the root than they would be in a simply balanced (flat probability distribution assumed) binary tree.
A flat probability distribution assumption works very well even when it's completely wrong - the worst case is good, and the best and average cases must be at least that good by definition. The further you move from a flat distribution, the worse things can be if actual query probabilities turn out to be very different from your assumptions.
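A sketch of that running-total-of-weights idea, assuming a sorted keys list and a parallel weights list (names are illustrative); the inner bisect into the prefix sums is exactly the extra binary search just described:

from bisect import bisect_left
from itertools import accumulate

def weighted_binary_search(keys, weights, target):
    """keys is sorted; weights[i] is the assumed query weight of keys[i]."""
    prefix = list(accumulate(weights))            # running total of weights
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        # Weight mass strictly below lo, and up to and including hi.
        below = prefix[lo - 1] if lo > 0 else 0
        half = (below + prefix[hi]) / 2.0
        mid = bisect_left(prefix, half, lo, hi + 1)   # weighted midpoint
        if keys[mid] == target:
            return mid
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(weighted_binary_search([10, 20, 30, 40], [4, 3, 2, 1], 20))   # 1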
Let me make it precise. What you want for binary search is:
Given array A which is sorted, but have non-uniform distribution
Given left & right index L & R of search range
Want to search for a value X in A
To apply binary search, we want to find the index M in [L,R]
as the next position to look at.
Where the value X should have equal chances to be in either range [L,M-1] or [M+1,R]
In general, you of course want to pick M where you think the value X should be in A.
Because even if you miss, half of the total 'chance' would be eliminated.
So it seems to me you have some expectation about the distribution.
If you could tell us what exactly you mean by '1/x distribution', then
maybe someone here can help build on my suggestion for you.
Let me give a worked example.
I'll use a similar interpretation of '1/x distribution' as @Leonid Volnitsky.
Here is Python code that generates the input array A:
from random import uniform
# Generating input
a,b = 10,20
A = [ 1.0/uniform(a,b) for i in range(10) ]
A.sort()
# example input (rounded)
# A = [0.0513, 0.0552, 0.0562, 0.0574, 0.0576, 0.0602, 0.0616, 0.0721, 0.0728, 0.0880]
Let's assume the value to search for is:
X = 0.0553
Then the estimated index of X is:
= total number of items * cumulative probability distribution up to X
= length(A) * P(x <= X)
So how to calculate P(x <= X) ?
In this case it is simple.
We invert X back to a value in [a,b], which we will call
X' = 1/X ~ 18
Hence
P(x <= X) = (b-X')/(b-a)
= (20-18)/(20-10)
= 2/10
So the expected position of X is:
10*(2/10) = 2
Well, and that's pretty damn accurate!
To repeat the process of predicting where X is in each given section of A requires some more work. But I hope this sufficiently illustrates my idea.
I know this might not seem like a binary search anymore
if you can get that close to the answer in just one step.
But admit it, this is what you can do if you know the distribution of input array.
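The same estimate as a tiny helper, continuing from the array A generated above (names are illustrative):

def estimated_index(A, X, a, b):
    # P(x <= X) for x = 1/uniform(a, b) is (b - 1/X) / (b - a)
    return round(len(A) * (b - 1.0 / X) / (b - a))

print(estimated_index(A, 0.0553, 10, 20))   # 2, matching the hand calculation above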
The purpose of a binary search is that, for a sorted array, every time you halve the array you are minimizing the worst case, i.e. the worst possible number of checks you can do is log2(entries). If you do some kind of an 'uneven' binary search, where you divide the array into a smaller and a larger half, then if the element is always in the larger half you can have worse worst-case behaviour. So, I think binary search would still be the best algorithm to use regardless of the expected distribution, just because it has the best worst-case behaviour.
You have a vector of entries, say [x1, x2, ..., xN], and you know that the queries are distributed with probability 1/x over the values in that vector. This means your queries will follow that distribution, i.e., on each consult you are more likely to ask for the elements with smaller values (larger 1/x).
A standard binary search tree is balanced with respect to the keys, but it enforces no policy based on the query probabilities. A possible change to this policy would be to relax the constraint of a balanced binary search tree -- smaller to the left of the parent node, greater to the right -- and instead choose the parent nodes as the ones with higher probabilities, and their child nodes as the next most probable elements.
Notice this is not a binary search tree, as you are not dividing your search space in two at every step, but rather a tree rebalanced with respect to your query distribution. This means your worst-case search may reach O(N). For example, having v = [10, 20, 30, 40, 50, 60]:
        30
       /  \
     20    50
    /     /  \
  10    40    60
Which can be reordered, or, rebalanced, using your function f(x) = 1 / x:
f([10, 20, 30, 40, 50, 60]) = [0.100, 0.050, 0.033, 0.025, 0.020, 0.016]
sort(v, f(v)) = [10, 20, 30, 40, 50, 60]
Into a new search tree, that looks like:
          10  ----------------> the most probable of being taken
         /  \                   leaving v = [[20, 30], [40, 50, 60]]
       20    30  -------------> the most probable of being taken
      /  \                      leaving v = [[40, 50], [60]]
    40    50  ----------------> the most probable of being taken
    /                           leaving v = [[60]]
  60
If you search for 10, you only need one comparison, but if you're looking for 60, you'll perform O(N) comparisons, which does not qualify this as a binary search. As pointed out by @Steve314, the farther you go from a fully balanced tree, the worse your worst-case search will be.
I will assume from your description:
X is uniformly distributed
Y=1/X is your data which you want to search, and it is stored in a sorted table
given value y, you need to binary search it in the above table
Binary search usually uses the value in the center of the range (the median). For a uniform distribution it is possible to speed up the search by knowing approximately where in the table we need to look for the searched value.
For example, if we have uniformly distributed values in the [0,1] range and the query is for 0.25, it is best to look not at the center of the range but at the first quarter of the range.
To use the same technique for 1/X data, store in the table not Y but the inverse 1/Y, and search not for y but for the inverse value 1/y.
Unweighted binary search isn't even optimal for uniformly distributed keys in expected terms, but it is in worst case terms.
The proportionally weighted binary search (which I have been using for decades) does what you want for uniform data, and by applying an implicit or explicit transform for other distributions. The sorted hash table is closely related (and I've known about this for decades but never bothered to try it).
In this discussion I will assume that the data is uniformly selected from 1..N and stored in an array of size N indexed by 1..N. If it has a different distribution, e.g. a Zipfian distribution where the value is proportional to 1/index, you can apply an inverse function to flatten the distribution, or the Fisher transform will often help (see Wikipedia).
Initially you have 1..N as the bounds, but in fact you may know the actual Min..Max. In any case we will assume we always have a closed interval [Min,Max] for the index range [L..R] we are currently searching, and initially its size is O(N).
We are looking for key K and want an index I so that
(I-L)/(K-Min) = (R-I)/(Max-K) = (R-L)/(Max-Min), i.e. I = L + (R-L)*(K-Min)/(Max-Min).
Round so that the smaller partition gets larger rather than smaller (to help worst case). The expected absolute and root mean square error is <√[R-L] (based on a Poisson/Skellam or a Random Walk model - see Wikipedia). The expected number of steps is thus O(loglogN).
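For concreteness, a minimal sketch of that probe rule in Python, assuming the stored values are roughly uniform on [Min, Max] (the distribution transform mentioned above would be applied first); it omits the worst-case safeguards discussed next:

def weighted_search(A, K):
    """Interpolation-style search in sorted A for key K; returns index or -1."""
    L, R = 0, len(A) - 1
    while L <= R and A[L] <= K <= A[R]:
        if A[R] == A[L]:                      # avoid division by zero
            I = L
        else:
            # I = L + (R - L) * (K - Min) / (Max - Min) on the current bracket
            I = L + int((R - L) * (K - A[L]) / (A[R] - A[L]))
        if A[I] == K:
            return I
        if A[I] < K:
            L = I + 1
        else:
            R = I - 1
    return -1

print(weighted_search(list(range(0, 1000, 7)), 693))   # 99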
The worst case can be constrained to be O(logN) in several ways. First we can decide what constant c we regard as acceptable, perhaps requiring at most c·lg N steps; proceeding for loglogN steps as above, and then using halving, will achieve this for any such c.
Alternatively we can modify the standard base b=B=2 of the logarithm so that b>2. Suppose we take b=8; then effectively c~b/B. We can then modify the rounding above so that at step k the largest partition must be at most N*b^-k, viz. keep track of the size expected if we eliminate 1/b from consideration each step, which leads to a worst case of (b/2) lgN. This will however bring our expected case back to O(log N), as we are only allowed to reduce the small partition by 1/b each time. We can restore the O(loglog N) expectation by using simple uprounding of the small partition for loglogN steps before applying the restricted rounding. This is appropriate because within a burst expected to be local to a particular value, the distribution is approximately uniform (that is, for any smooth distribution function, e.g. in this case Skellam, any sufficiently small segment is approximately linear with slope given by its derivative at the centre of the segment).
As for the sorted hash, I thought I read about this in Knuth decades ago, but can't find the reference. The technique involves pushing rather than probing - (possibly weighted binary) search to find the right place or a gap then pushing aside to make room as needed, and the hash function must respect the ordering. This pushing can wrap around and so a second pass through the table is needed to pick them all up - it is useful to track Min and Max and their indexes (to get forward or reverse ordered listing start at one and track cyclically to the other; they can then also be used instead of 1 and N as initial brackets for the search as above; otherwise 1 and N can be used as surrogates).
If the load factor alpha is close to 1, then insertion is expected O(√N) for expected O(√N) items, which still amortizes to O(1) on average. This cost is expected to decrease exponentially with alpha - I believe (under Poisson assumptions) that μ ~ σ ~ √[Nexp(α)].
The above proportionally weighted binary search can be used to improve on the initial probe.

Finding Most Compressible Vector Within Given Bounds?

I've reduced a compression problem I am working on to the following:
You are given as input two n-length vectors of floating point values:
float64 L1, L2, ..., Ln;
float64 U1, U2, ..., Un;
Such that for all i
0.0 <= Li <= Ui <= 1.0
(By the way, n is large: ~10^9)
The algorithm takes L and U as input and uses them to generate a program.
When executed the generated program outputs an n-length vector X:
float64 X1, X2, ..., Xn;
Such that for all i:
Li <= Xi <= Ui
The generated program can output any such X that fits these bounds.
For example a generated program could simply store L as an array and output it. (Notice this would take 64n bits of space to store L and then a little extra for the program to output it)
The goal is to make the generated program (including data) as small as possible, given L and U.
For example, suppose it happens that every element of L is less than 0.3 and every element of U is greater than 0.4; then the generated program could just be:
for i in 1 to n
    output 0.35
Which would be tiny.
Can anyone suggest a strategy, algorithm or architecture to tackle this?
This simple heuristic is very fast and should provide very good compression if the bounds allow for a very good compression:
Prepare an arbitrary (virtual) binary search tree over all candidate values. float64s share the sorting order with signed int64s, so you can arbitrarily prefer (have nearer to the root) the values with more trailing zeroes.
For each pair of bounds:
    start at the root.
    While the current node is larger than both bounds OR smaller than both bounds,
        descend down the tree.
    append the current node into the vector.
For the tree mentioned above, this means:
For each pair of bounds:
    find the (unique) number within the specified range that has as few significant bits as possible. That is, find the first bit where both bounds differ; set it to 1 and all following bits to 0; if the bit that's set to 1 is the sign bit, set it to 0 instead.
Then you can feed this to a deflating library to compress it (and build a self-extracting archive).
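A minimal sketch of the per-element step just described, assuming the bounds are non-negative so the raw bit patterns order the same way as the floats (the tiny L and U lists here are only stand-ins for the real input vectors):

import struct
import zlib

def bits(x: float) -> int:
    """Raw IEEE-754 bit pattern of a float64 as an unsigned int."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def from_bits(b: int) -> float:
    return struct.unpack('<d', struct.pack('<Q', b))[0]

def simplest_in_range(lo: float, hi: float) -> float:
    """Value in [lo, hi] whose bit pattern has the fewest significant bits.

    Assumes 0.0 <= lo <= hi, so bit-pattern order matches float order.
    """
    a, b = bits(lo), bits(hi)
    if a == b:
        return lo
    p = (a ^ b).bit_length() - 1        # first (highest) bit where the bounds differ
    if p == 63:                         # that bit is the sign bit: use +0.0 instead
        return 0.0
    return from_bits((b >> p) << p)     # shared prefix, a 1 at bit p, zeros below

# Tiny stand-ins for the question's bound vectors L and U.
L = [0.10, 0.25, 0.00, 0.70]
U = [0.30, 0.26, 1.00, 0.80]
X = [simplest_in_range(l, u) for l, u in zip(L, U)]
blob = zlib.compress(struct.pack('<%dd' % len(X), *X))   # deflate the packed result
print(X, len(blob))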
A better compression might be possible to achieve if you analyse the data and build a different binary search tree. Since the data set is very large and arrives as a stream of data, it might not be feasible, but this is one such heuristic:
while the output is not fully defined:
    find any value that fits within the most undecided-for bounds:
        sort all bounds together:
            bounds with lower value sort before bounds with higher value.
            lower bounds sort before upper bounds with the same value.
            indistinguishable bounds are grouped together.
        calculate the running total of open intervals.
        pick the largest total. Either the upper or the lower bound will do. You could even try to make a "smart choice" by splitting the interval with the least amount of significant bits.
    set this value as the output for all positions where it can be used.
Instead of recalculating the sort order, you could cache the sort order and only remove from that, or even cache the running total as well (or switch from recalculating the running total to caching the running total at runtime). This does not change the result, only improve the running time.
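A direct, unoptimized transcription of this heuristic, without the caching mentioned above (all names are illustrative):

def greedy_fill(L, U):
    """Pick one output value per round, covering as many undecided intervals as possible."""
    n = len(L)
    X = [None] * n
    undecided = set(range(n))
    while undecided:
        # Sort all bounds together: lower bounds sort before upper bounds at equal values.
        events = []
        for i in undecided:
            events.append((L[i], 0))   # 0 = lower bound opens an interval
            events.append((U[i], 1))   # 1 = upper bound closes it
        events.sort()
        # Running total of open intervals; remember where it peaks.
        best_value, best_open, open_now = None, 0, 0
        for value, kind in events:
            if kind == 0:
                open_now += 1
                if open_now > best_open:
                    best_open, best_value = open_now, value
            else:
                open_now -= 1
        # Use that value for every undecided position it fits.
        for i in list(undecided):
            if L[i] <= best_value <= U[i]:
                X[i] = best_value
                undecided.discard(i)
    return X

print(greedy_fill([0.10, 0.25, 0.00, 0.70], [0.30, 0.26, 1.00, 0.80]))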

Compression of sequence of integers providing random access

I have a sequence of n integers in a small range [0,k) and all the integers have the same frequency f (so the size of the sequence is n=f∗k). What I'm trying to do now is to compress this sequence while providing random access (what is the i-th integer). The time to achieve random access doesn't have to be O(1). I'm more interested in achieving high compression at the expense of higher random access times.
I haven't tried with Huffman coding since it assigns codes based on frequencies (and all my frequencies are the same). Perhaps I'm missing some simple encoding for this particular case.
Any help or pointers would be appreciated.
Thanks in advance.
PS: Already asked in cs.stackexchange, but asking here also for better coverage, sorry.
If all your integers have the same frequency, then a fair approximation to optimal compression will be ceil(log2(k)) bits per integer. You can access a bit-array of these in constant time.
If k is painfully small (like 3), the above method may waste a fair amount of space. But, you can combine a fixed number of your small integers into a base-k number, which can fit more efficiently into a fixed number of bits (you may also be able to fit the result conveniently into a standard-sized word). In any case, you can also access this coding in constant time.
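A minimal sketch of that base-k packing for k = 3, five values per byte (the constants and names are just for illustration):

K = 3            # the "painfully small" alphabet size
D = 5            # digits per byte: 3**5 = 243 <= 256, so 8 bits hold 5 values

def pack(values):
    """Pack a sequence of ints in [0, K) into bytes, D values per byte."""
    out = bytearray()
    for j in range(0, len(values), D):
        word = 0
        for v in reversed(values[j:j + D]):
            word = word * K + v
        out.append(word)
    return bytes(out)

def get(packed, i):
    """Random access to the i-th value in O(1)."""
    return (packed[i // D] // K ** (i % D)) % K

data = [0, 2, 1, 1, 0, 2, 2, 0, 1]
packed = pack(data)
print(len(packed), [get(packed, i) for i in range(len(data))])
# 2 bytes instead of 9 at one byte per value, i.e. 1.6 bits/value instead of 2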
If your integers don't have the same frequency, optimal compression may yield variable bit rates from different parts of your input, so the simple array access won't work. In that case, good random-access performance would require an index structure: break your compressed data into convenient sized chunks, which can each be decompressed sequentially, but this time is bounded by the chunk size.
If the frequency of each number is exactly the same, you may be able to save some space by taking advantage of this -- but it may not be enough to be worthwhile.
The entropy of n random numbers in range [0,k) is n log2(k), which is log2(k) bits per number; this is the number of bits it takes to encode your numbers without taking advantage of the exact frequency.
The entropy of distinguishable permutations of f copies each of k elements (where n=f*k) is:
log2( n!/(f!)^k ) = log2(n!) - k * log2(f!)
Applying Stirling's approximation (which is good here only if n and f are large), yields:
~ n log2(n) - n log2(e) - k ( f log2(f) - f log2(e) )
= n log2(n) - n log2(e) - n log2(f) + n log2(e)
= n ( log2(n) - log2(f) )
= n log2(n/f)
= n log2(k)
What this means is that, if n is large and k is small, you will not gain a significant amount of space by taking advantage of the exact frequency of your input.
The total error from the Stirling approximation above is O(log2(n) + k log2(f)), which is O(log2(n)/n + log2(f)/f) per number encoded. This does mean that if your k is so large that your f is small (i.e., each distinct number only has a small number of copies), you may be able to save some space with a clever encoding. However, the question specifies that k is, in fact, small.
If you work out the number of possible different combinations and take its log base 2, you can find the best possible compression, and I don't think it will be that great in your case. With 16 numbers of frequency 1 the number of possible messages is 16!, and Excel tells me log base 2 of 16! is 44.25, whereas storing them as 4-bit codes would only take 64 bits. (Where there is more than one of each kind you want the multinomial coefficient: http://mathworld.wolfram.com/MultinomialCoefficient.html)
I think you will have a problem mixing random access into this, because the only information you have is that there are fixed numbers of each type of element in the whole sequence. That's not a lot of information for the whole sequence, and it says almost nothing about the first half of the sequence in isolation, because you could well have more of some number in the first half and fewer in the second half.
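A quick check of the 16! figure above, plus the multinomial count for f copies of each of k values:

from math import factorial, log2

print(log2(factorial(16)))          # ~44.25 bits vs 64 bits for sixteen 4-bit codes

def multiset_bits(k, f):
    """log2 of the number of distinguishable orderings of f copies of each of k values."""
    n = k * f
    return log2(factorial(n)) - k * log2(factorial(f))

print(multiset_bits(16, 1))         # same 44.25 when every frequency is 1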
