Canonical Huffman algorithm: storing the code table

I am implementing the canonical Huffman algorithm and have several questions about the theoretical part, namely about storing the information needed for decoding. The proposed method is to transmit the alphabet symbols and the lengths of their canonical codes together with the encoded data, because to restore the canonical table we only need the code lengths.
Example: the string "bbbaacd". Canonical codes: b 0 (1 bit), a 10 (2), c 110 (3), d 111 (3), i.e. the decoding data is b1a2c3d3. This raises several questions.
1) Is it necessary to transmit this table in the same file as the data encoded with it (at the end / beginning of the file)? Are there any real examples?
2) If yes, and the data itself contains numbers, how do you tell which entry in the table is an alphabet symbol (a number) and which is a code length (a number of bits)?
3) And finally, how do you tell where the border between the table and the encoded data is?
If everything is stored in separate files (which, in my opinion, is simpler and more logical), then the last two questions disappear by themselves.

Normally the code lengths are stored in the same file as the data. If you limit the code length to 15 bits, for example, then the first 128 bytes in your file can be the lengths of the codes for all 256 byte values (4 bits each, using 0 for byte values that do not occur).
A slightly more sophisticated way would be to Huffman-encode those lengths themselves. You can use a fixed tree for that... or you can use a dynamic tree, in which case you first write out the lengths of the codes used for the length symbols.
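For illustration, here is a minimal sketch (mine, not from the answer above) of how a decoder can rebuild the canonical codes from the (symbol, length) pairs alone, using the b1a2c3d3 table from the question: symbols are sorted by (length, symbol), and each code is the previous code plus one, left-shifted whenever the length grows.

    def canonical_codes(lengths):
        # lengths: {symbol: code length}; rebuilds the canonical codes.
        codes = {}
        code = 0
        prev_len = 0
        for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            code <<= length - prev_len      # append zeros when the length grows
            codes[sym] = format(code, '0%db' % length)
            code += 1
            prev_len = length
        return codes

    print(canonical_codes({'b': 1, 'a': 2, 'c': 3, 'd': 3}))
    # {'b': '0', 'a': '10', 'c': '110', 'd': '111'}

This is why transmitting only the lengths is enough: encoder and decoder agree on the same rule for assigning the actual bit patterns.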

Related

How does the smaz compression library work?

I'm currently working on a short text compression project based on my language. As a beginner, I know some basic compression algorithms like LZW, but I still don't understand how smaz works. I have 2 questions:
How does smaz work?
How to build the codebook and reversed codebook?
Can anyone explain it for me?
Thank you very much.
Trying to answer your questions:
How does smaz work?
According to [1],
Smaz has a hard-wired constant built-in codebook of 254 common English words, word fragments, bigrams, and the lowercase letters (except j, k, q). The inner loop of the Smaz decoder is very simple:
Fetch the next byte X from the compressed file.
Is X == 254? Single byte literal: fetch the next byte L, and pass it straight through to the decoded text.
Is X == 255? Literal string: fetch the next byte L, then pass the following L+1 bytes straight through to the decoded text.
Any other value of X: look up the X'th "word" in the codebook (that "word" can be from 1 to 5 letters), and copy that word to the decoded text.
Repeat until there are no more compressed bytes left in the compressed file.
Because the codebook is constant, the Smaz decoder is unable to "learn" new words and compress them, no matter how often they appear in the original text.
This page could be helpful to understand the code.
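Putting the quoted steps together, the decode loop could look roughly like this in Python; the CODEBOOK below is a tiny placeholder, not smaz's real 254-entry table:

    # Placeholder codebook; the real smaz table has 254 hard-wired entries.
    CODEBOOK = ["the", " ", "e", "a", "s", "ing", " of ", "and"]

    def decode(data: bytes) -> bytes:
        out = bytearray()
        i = 0
        while i < len(data):
            x = data[i]
            if x == 254:                    # single byte literal
                out.append(data[i + 1])
                i += 2
            elif x == 255:                  # literal string of L + 1 bytes
                length = data[i + 1] + 1
                out += data[i + 2:i + 2 + length]
                i += 2 + length
            else:                           # codebook entry number x
                out += CODEBOOK[x].encode('ascii')
                i += 1
        return bytes(out)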
How to build the codebook and reversed codebook?
The TODO file in the repository and the author's comments on Reddit point out that the dictionary was generated by an unreleased Ruby script. Also, the author explains:
btw what the Ruby program does is to consider all the possible substrings, and even all the possible separated words, and build a table of frequencies, then adjust the weight based on the string length, and finally hand-tune the table to compress specific things very well. I added by hand the "http://" and ".com" tokens for example, removing the final two entries.
An alternative for your project could be the shoco library, which supports generating a custom compression model based on your language.
The smaz source is only 178 lines, and just 99 lines without comments and the codebook tables. You should take a look to see how it works.
Smaz is a pretty simple codebook-based compression (like the LZW you already know). The library contains a table of the most popular terms in English (lines 5-51 for the compression table and 56-76 for decompression) and replaces those terms with their indexes in the compressed string, and does the reverse to decompress.
For example, the string "the end" would be compressed to about 58% of its size, because a term like "the" becomes a one-byte index into the compression table. So a 7-byte string becomes a 4-byte string.
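To illustrate the compression side, here is a toy greedy longest-match sketch with a made-up codebook (smaz's real tables and matching logic differ; this only shows the idea of replacing terms with one-byte indexes):

    # Hypothetical toy codebook, not smaz's real tables.
    CODEBOOK = ["the", " end", " ", "e", "n", "d"]
    LOOKUP = sorted(CODEBOOK, key=len, reverse=True)   # try longest terms first

    def compress(text: str) -> bytes:
        out = bytearray()
        i = 0
        while i < len(text):
            for term in LOOKUP:
                if text.startswith(term, i):
                    out.append(CODEBOOK.index(term))   # one-byte index
                    i += len(term)
                    break
            else:
                out += bytes([254, ord(text[i])])      # fall back to a literal
                i += 1
        return bytes(out)

    print(len("the end"), len(compress("the end")))    # 7 -> 2 with this toy table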

What are the design decisions behind Google Maps encoded polyline algorithm format?

Several Google Maps products have the notion of polylines, which in terms of underlying data is basically just a sequence of lat/lng points that might for example manifest in a line drawn on a map. The Google Map developer libraries make use of an encoded polyline format that churns out an ASCII string representing the points making up the polyline. This encoded format is then typically decoded with a built in function of the Google libraries or a function written by a third party that implements the decoding algorithm.
The algorithm for encoding polyline points is described in the Encoded Polyline Algorithm Format document. What is not described is the rationale for implementing the algorithm this way, and the significance of each of the individual steps. I'm interested to know whether the thinking/purpose behind implementing the algorithm this way is publicly described anywhere. Two example questions:
Do some of the steps have a quantifiable impact on compression and how does this impact vary as a function of the delta between points?
Is the summing of values with ASCII 63 a compatibility hack of some sort?
But just in general, I'm looking for a description to go along with the algorithm explaining why it is implemented the way it is.
Update: this blog post from James Snook also makes the 'valid ASCII' range argument and reasons sensibly about the other steps I wondered about, e.g. the left shift before storing, which makes room for the sign bit as the first bit.
Here are some explanations I found; I'm not sure everything is 100% correct.
One double value is stored in multiple 5-bit chunks, and 0x20 (binary '0010 0000') is used as an indication that the next 5-bit entry still belongs to the current double.
0x1f (binary '0001 1111') is used as a bit mask to throw away the other bits.
I expect that 5 bits are used because the deltas of the lats or lons typically fall in this range, so that every double value takes only about 5 bits on average over many examples (but I haven't verified this).
Now, compression is done by assuming nearby double values are very close, so taking the difference gives something near 0 and the result fits in a few bytes. This result is then stored dynamically: store 5 bits, and if the value is longer, mark it with 0x20 and store the next 5 bits, and so on. So I guess you could tweak the compression by trying 6 or 4 bits, but 5 seems to be a practically reasonable choice.
Now regarding the magic 63: this is 0x3f, binary 0011 1111. I'm not sure why they add it. I thought adding 63 gives 'better' ASCII characters (e.g. ones allowed in XML or in a URL), since we skip e.g. 62, which is '>'; but is 63, which is '?', really better? At least the first ASCII characters are not printable and have to be avoided. Note that if one used 64, one would hit the ASCII char 127 for the maximum value of 31 (31 + 64 + 32), and this char is not defined in HTML4. Or is it because a signed char goes from -128 to 127 and we need to store the negative numbers as positive, thus adding the maximum possible negative number?
Just for me: here is a link to an official Java implementation with Apache License
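Putting the pieces above together, here is a rough Python sketch of the documented encoding steps (scale by 1e5, take deltas, left-shift to make room for the sign, emit 5-bit chunks with 0x20 as the continuation flag, add 63); the function names are mine:

    def encode_value(delta):
        # delta is the scaled integer difference to the previous coordinate
        v = delta << 1
        if delta < 0:
            v = ~v                          # fold the sign into the lowest bit
        chunks = []
        while v >= 0x20:
            chunks.append(chr(((v & 0x1f) | 0x20) + 63))   # 0x20 = "more follows"
            v >>= 5
        chunks.append(chr(v + 63))
        return ''.join(chunks)

    def encode_polyline(points):
        out, prev_lat, prev_lng = [], 0, 0
        for lat, lng in points:
            ilat, ilng = int(round(lat * 1e5)), int(round(lng * 1e5))
            out.append(encode_value(ilat - prev_lat))
            out.append(encode_value(ilng - prev_lng))
            prev_lat, prev_lng = ilat, ilng
        return ''.join(out)

    # The example point from Google's documentation:
    print(encode_polyline([(38.5, -120.2)]))    # '_p~iF~ps|U'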

Why is Huffman coding good?

I am not asking how Huffman coding is working, but instead, I want to know why it is good.
I have the following two questions:
Q1
I understand that the ultimate purpose of Huffman coding is to give certain chars fewer bits, so space is saved. What I don't understand is why the number of bits chosen for a char should be related to the char's frequency.
Huffman Encoding Trees says
It is sometimes advantageous to use variable-length codes, in which different symbols may be represented by different numbers of bits. For example, Morse code does not use the same number of dots and dashes for each letter of the alphabet. In particular, E, the most frequent letter, is represented by a single dot.
So in Morse code, E can be represented by a single dot because it is the most frequent letter. But why? Why can it be a dot just because it is most frequent?
Q2
Why are the probability / statistics of the chars so important to Huffman coding?
What happens if the statistics table is wrong?
If you assign fewer bits, or shorter code words, to the most frequently used symbols, you will save a lot of storage space.
Suppose you want to assign 26 unique codes to the English alphabet and want to store an English novel (only letters) in terms of these codes. You will require less memory if you assign short codes to the most frequently occurring characters.
You might have observed that postal codes and STD codes for important cities are usually shorter (as they are used very often). This is a very fundamental concept in information theory.
Huffman encoding gives prefix codes.
Construction of Huffman tree:
A greedy approach to constructing a Huffman tree for n characters is as follows:
Place the n characters in n single-node sub-trees.
Repeatedly combine the two least-weight nodes into a tree, whose root node is assigned the sum of the two leaf node weights as its weight.
Do this until you get a single tree.
For example, consider a binary tree where E and T have high weights (as they occur very frequently).
It is a prefix tree. To get the Huffman code for any character, start from the node corresponding to that character and backtrack until you reach the root node.
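For illustration, here is a compact sketch of this greedy construction using a binary heap (the structure and names are mine, not from the answer):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        freq = Counter(text)
        # heap entries are (weight, tie-breaker, tree); a leaf is just a character
        heap = [(w, i, ch) for i, (ch, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:                # merge the two lightest subtrees
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, count, (left, right)))
            count += 1
        codes = {}
        def walk(node, prefix):             # read codes off the root-to-leaf paths
            if isinstance(node, tuple):
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
            else:
                codes[node] = prefix or '0'
        walk(heap[0][2], '')
        return codes

    print(huffman_codes("bbbaacd"))   # e.g. {'b': '0', 'a': '10', 'c': '110', 'd': '111'}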
Indeed, an E could be, say, three dashes followed by two dots. When you make your own encoding, you get to decide. If your goal is to encode a certain text so that the result is as short as possible, you should choose short codes for the most frequent characters. The Huffman algorithm ensures that we get the optimal codes for a specific text.
If the frequency table is somehow wrong, the Huffman algorithm will still give you a valid encoding, but the encoded text would be longer than it could have been if you had used a correct frequency table. This is usually not a problem, because we usually create the frequency table based on the actual text that is to be encoded, so the frequency table will be "perfect" for the text that we are going to encode.
Well... you want to assign shorter codes to the symbols which appear more frequently... Huffman encoding works on just this simple assumption. :-)
You compute the frequency of all symbols, sort them, and start assigning bit codes to each one: the more frequent a symbol is, the shorter the code you assign to it. Simple as that.
The big question is: how large should the window in which we compute these frequencies be? Should it be as large as the entire file, or smaller? And if the latter, how large? Most Huffman encoders have some sort of "test run" in which they estimate the best window size, a little like TCP/IP does with its window sizes.
Huffman codes provide two benefits:
they are space efficient given some corpus
they are prefix codes
Given some set of documents, for instance, encoding those documents as Huffman codes is the most space efficient way of encoding them, thus saving space. This however only applies to that set of documents, as the codes you end up with depend on the probabilities of the tokens/symbols in the original set of documents. The statistics are important because the symbols with the highest probability (frequency) are given the shortest codes. Thus the symbols most likely to occur in your data use the least amount of bits in the encoding, making the coding efficient.
The prefix code part is useful because it means that no code is the prefix of another. In Morse code, for instance, A = dot dash and J = dot dash dash dash; how do you know where to stop reading a code? This increases the inefficiency of transmitting data using Morse, as you need a special symbol (a pause) to signify the end of transmission of one code. Compare that to Huffman codes, where each code word is unique: as soon as you discover the encoding for a symbol in the input, you know that it is the transmitted symbol, because it is guaranteed not to be the prefix of some other symbol.
It's the dual effect of having the most frequent characters using the shortest bit sequences that gives you the savings.
For a concrete example, let's say you have a piece of text that consists of 1024 e characters and 1024 of all other characters combined.
With 8 bits for code, that's a full 2048 bytes used in uncompressed form.
Now let's say we represent e as a single 1-bit and every other letter as a 0-bit followed by its original 8 bits (a very primitive form of Huffman).
You can see that half the characters have been expanded from 8 bits to 9, giving 9216 bits, or 1152 bytes. However, the e characters have been reduced from 8 bits to 1, meaning they take up 1024 bits, or 128 bytes.
The total bytes used is therefore 1152 + 128, or 1280 bytes, representing a compression ratio of 62.5%.
You can use a fixed encoding scheme based on the likely frequencies of characters (such as English text), or you can use adaptive Huffman encoding which changes the encoding scheme as characters are processed and frequencies are adjusted. While the former may be okay for input which has high probability of matching frequencies, the latter can adapt to any input.
The statistics table can't be wrong, because in general the Huffman algorithm analyzes the whole text at the beginning and builds frequency statistics for the given text, while Morse has a static symbol-to-code map.
The Huffman algorithm takes advantage of the given text. As an example, even if E is the most frequent letter in English in general, that doesn't mean that E is the most frequent letter in a given text by a given author.
Another advantage of the Huffman algorithm is that you can use it for any alphabet, from {0, 1} up to Chinese characters, while Morse is defined only for English letters.
So in Morse code, "E" can be represented by a single dot, because it is the most frequent letter. But why? Why is it a dot because of its frequency?
"E" can be encoded to any unique code for a specific code dictionary, so it can be "0", we choose it to be short to save memory, so the average bytes used after encode is minimized.
Why is the probability / statistics of the chars so important to Huffman coding? What happens if the statistics table is wrong?
why do we encode? save space right? Space used after encode is freq(wordi)*Length(wordi), it is what we should try to minimize, so we choose to assign words with high prob short code greedly to save space.
If the statistics table is wrong, then the encoding is not the best way to save space.

Encode an array of integers to a short string

Problem:
I want to compress an array of non-negative integers of non-fixed length (but it should be 300 to 400), containing mostly 0's, some 1's, a few 2's. Although unlikely, it is also possible to have bigger numbers.
For example, here is an array of 360 elements:
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.
Goal:
The goal is to compress an array like this into the shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y
What I've tried:
I tried to use "delta encoding", and use zeroes to denote any value higher than 1. For example: {0,0,1,0,0,0,2,0,1} would be denoted as: 2,3,0,1. To decode it, one would read from left to right, and write down "2 zeroes, one, 3 zeroes, one, 0 zeroes, one (this would add to the previous one, and thus have a two), 1 zero, one".
To eliminate the need for delimiters (commas) and thus save more space, I tried to use a single alphanumeric character to denote delta values of 0 to 35 (using 0 to y), while leaving the letter z to mean "35 PLUS the next character". I think this is called "variable bit" encoding or something like that. For example, if there are 40 zeroes in a row, I'd encode it as "z5".
That's as far as I got... the resulting string is still very long (it would be about 20 characters long in the above example). I would ideally want something like 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!
Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-length encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes, then alternating between that and the non-zero values. (A zero-run length of 0 indicates successive non-zero values.)
Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using a smaller number of bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.
You may also want to look into how JPEG-style encoding works. After DCT and quantization, the JPEG entropy encoding problem seems similar to yours.
Finally, if you want to go for maximum compression, you might want to look up arithmetic encoding, which can compress your data arbitrarily close to the statistical minimum entropy.
The above links explain how to compress to a stream of raw bits. In order to convert them to a string of letters and numbers, you will need to add another encoding step, which converts the raw bits to such a string. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse".
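As a rough sketch of that pipeline (RLE on the zero runs, a universal code for the integers, then a printable representation), assuming Elias gamma as the universal code and URL-safe base64 as the output alphabet:

    import base64

    def elias_gamma(n):
        # Elias gamma code for n >= 1: (len - 1) zeros, then n in binary.
        b = bin(n)[2:]
        return '0' * (len(b) - 1) + b

    def encode(values):
        # 1) RLE: alternate zero-run lengths and non-zero values, as described above.
        tokens, run = [], 0
        for v in values:
            if v == 0:
                run += 1
            else:
                tokens += [run, v]
                run = 0
        tokens.append(run)                       # trailing run of zeroes
        # 2) universal-code every token (+1 so that 0 is representable)
        bits = ''.join(elias_gamma(t + 1) for t in tokens)
        # 3) pad to whole bytes and convert to a printable string
        bits += '0' * (-len(bits) % 8)
        raw = int(bits, 2).to_bytes(len(bits) // 8, 'big')
        return base64.urlsafe_b64encode(raw).decode('ascii')

    print(encode([0, 0, 1, 0, 0, 0, 2, 0, 1]))   # the small example from the question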
Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best.
Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike the kind you expect, the result may be an expansion, rather than a compression. You can limit this expansion by providing an alternative, uncompressed format, to be used in such cases...
In your data you have:
14 1s (3.89% of data)
4 2s (1.11%)
1 each of 3, 4 and 5 (0.28% each)
339 0s (94.17%)
Assuming that your numbers are independent of each other and you do not have any other information, the total entropy of your data is 0.407 bits per number, that is 146.4212 bits overall (18.3 bytes). So it is impossible to encode it in 8 bytes.
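For reference, the entropy figure can be reproduced like this (assuming the values are drawn independently with the frequencies above):

    from math import log2

    counts = {0: 339, 1: 14, 2: 4, 3: 1, 4: 1, 5: 1}
    n = sum(counts.values())                              # 360 numbers
    entropy = -sum(c / n * log2(c / n) for c in counts.values())
    print(entropy, entropy * n, entropy * n / 8)          # ~0.407 bits, ~146.4 bits, ~18.3 bytes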

What's the name of this algorithm/routine?

I am writing a utility class which converts strings from one alphabet to another. This is useful in situations where you have a target alphabet you wish to use, with a restriction on the number of characters available. For example, if you can use lower-case letters and numbers, but only 12 characters, it's possible to compress a timestamp from the alphabet 0123456789 -: into abcdefghijklmnopqrstuvwxyz0123456789, so 2010-10-29 13:14:00 might become 5hhyo9v8mk6avy (19 characters reduced to 16).
The class is designed to convert back and forth between alphabets, and also calculate the longest source string that can safely be stored in a target alphabet given a particular number of characters.
I was thinking of publishing this through Google Code; however, I'd obviously like other people to find it and use it - hence the question of what this is called. I've had to use this approach in two separate projects, with Bloomberg and a proprietary system, when you need to generate unique file names of a certain length but want to keep some plaintext, so GUIDs aren't appropriate.
Your examples bear some similarity to a Dictionary coder with a fixed target and source dictionaries. Also worthwhile to look at is Fibonacci coding, which has a fixed target dictionary (of variable-length bits), which is variably targeted.
I think it also depends on whether it is very important that your target alphabet has fixed-width entries - if you allow a fixed alphabet with variable-length codes, your compression ratio will approach the entropy that much more closely! If the source alphabet distribution is known in advance, a static Huffman tree could easily be generated.
Here is a simple algorithm:
Consider that you don't have to transmit the alphabet used for encoding. Also, you don't use (or transmit) the probabilities of the input symbols, as standard compressors do, so we just re-encode the data somehow.
In this case we can consider the input data to be a number represented in a base equal to the cardinality of the input alphabet. We just have to change its representation to another base, which is a simple task.
EDITED example:
input alphabet: ABC, output alphabet: 0123456789
message ABAC will translate to 0102 in base 3, that is 11 (9 + 2) in base 10.
11 written in the output alphabet (base 10): 11
We could have a problem decoding it, because we don't know how many 0s to use at the beginning of the decoded result, so we have to use one of these modifications:
1) encode the size of the compressed data somewhere in the stream, or
2) use a dummy 1 at the start of the stream; this way our example becomes:
10102 (base 3) = 81 + 9 + 2 = 92 (base 10).
Now after decoding we just have to ignore the leading 1 (this also provides basic error detection).
The main problem with this approach is that in most cases (GCD == 1) each new encoded character completely changes the output. This is very inefficient and difficult to implement, and we end up with arithmetic coding as the best solution (actually a simplified version of it).
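A minimal sketch of this base-change idea with the dummy leading 1 (the function name and alphabets are only for illustration); decoding is the same conversion in reverse, dropping the leading 1:

    def convert(message, src_alphabet, dst_alphabet):
        src_base, dst_base = len(src_alphabet), len(dst_alphabet)
        n = 1                                    # the dummy leading 1
        for ch in message:                       # string -> integer in the source base
            n = n * src_base + src_alphabet.index(ch)
        out = []
        while n:                                 # integer -> string in the target base
            n, digit = divmod(n, dst_base)
            out.append(dst_alphabet[digit])
        return ''.join(reversed(out))

    print(convert("ABAC", "ABC", "0123456789"))  # '92', as in the example above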
You probably know about Base64 which does the same thing just usually the other way around. Too bad there are way too many Google results on BaseX or BaseN...

Resources