Repetition-based, pattern-based data compression algorithm

Suppose I have the following string:
ABCADCADCADABC
I want to compress it by finding repeating substrings.
What's an algorithm that gives the optimal compression?
In the above example it should return
AB*1 CAD*3 ABC*1
For comparison, a greedy algorithm might return
ABC*1 ADC*2 AD*1 ABC*1

Depending on whether you prefer fast and simple or high compression ratio you could take a look into the Lempel-Ziv-Welch (LZW) or Lempel-Ziv-Markov chain (LZMA) algorithms. They both keep dictionaries of recurring strings.

This sounds like a job for suffix arrays/trees!
http://en.wikipedia.org/wiki/Suffix_array
You can use a suffix array built over your string to figure out patterns that repeat. For instance, we can build a suffix array over your example as follows (I'm using $ as always coming after every letter, you can sort it so that $ comes before every letter ... either way will work):
ABCADCADCADABC$
ABC$
ADABC$
ADCADABC$
ADCADCADABC$
BCADCADCADABC$
BC$
CADABC$
CADCADABC$
CADCADCADABC$
C$
DABC$
DCADABC$
DCADCADABC$
$
From this, we can more easily see the common patterns in the string. Using the information in this suffix array representation, we can see that CAD is repeated 3x in a local area, and we'd likely use this as our choice for compression. ADC and DCA and so on are not as attractive because they compress less of the string.
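As a rough sketch of the idea (not a production construction), the suffix array above can be reproduced with a naive sort. The function names and the "~" sentinel are assumptions made for this illustration; "~" plays the role of "$" and is chosen only because it sorts after every letter in ASCII.

```python
def suffix_array(s, sentinel="~"):
    # Naive construction: sort suffix start positions by the suffix text.
    # Fine for short strings; real implementations use faster algorithms.
    t = s + sentinel
    return sorted(range(len(t)), key=lambda i: t[i:])


def lcp(a, b):
    # Length of the longest common prefix of two strings.
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_repeat(s, sentinel="~"):
    # Neighbours in the suffix array share the longest prefixes, so the
    # longest repeated substring shows up between two adjacent entries.
    t = s + sentinel
    sa = suffix_array(s, sentinel)
    best = ""
    for i, j in zip(sa, sa[1:]):
        k = lcp(t[i:], t[j:])
        if k > len(best):
            best = t[i:i + k]
    return best


sa = suffix_array("ABCADCADCADABC")
repeat = longest_repeat("ABCADCADCADABC")
```

For the example string the longest repeat is CADCAD, found at two overlapping positions three characters apart, which is exactly what CAD repeated three times looks like in the array.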
http://en.wikipedia.org/wiki/Suffix_tree
Suffix trees are more efficient ways of doing the same task. Once you wrap your head around how to do something using suffix arrays, it's not too far of a jump to go onto suffix trees. In fact, this is used in popular compression algorithms including LZW and BWT (Bzip).

It may not be practically relevant, but for the particular question you ask there is a dynamic programming solution. If you have computed the optimum way to compress the strings of length 1, 2, 3...n-1 starting from the first character, then you can compute the optimum way to compress the string of length n starting from the first character by looking at the last k characters for each possibility k and seeing if they form a multiple of a simple string. If so, compute the cost of compressing the first n-k characters and then expressing the last k characters using a multiple of a string.
So in your example you would finish up by noticing that ABC was a multiple of itself, and that if you expressed this as ABC*1 you could use the answer you had already worked out for the first 11 characters, AB*1 CAD*3, to produce AB*1 CAD*3 ABC*1
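Here is a minimal sketch of that dynamic programme. The question never defines the cost of an encoding, so the cost model is an assumption: each block X*k is charged len(X) + 1 (the pattern plus one unit for the count). The name optimal_blocks is made up for this example.

```python
def optimal_blocks(s):
    # best[i] = cheapest cost of encoding the first i characters,
    # under the ASSUMED cost of len(pattern) + 1 per block.
    n = len(s)
    best = [0] + [float("inf")] * n
    choice = [None] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            length = i - j
            # Try to express s[j:i] as a repetition of its first p chars.
            for p in range(1, length + 1):
                if length % p == 0 and s[j:j + p] * (length // p) == s[j:i]:
                    cost = best[j] + p + 1
                    if cost < best[i]:
                        best[i], choice[i] = cost, (j, p)
                    break  # the smallest valid period is the cheapest
    # Walk the recorded choices backwards to recover the blocks.
    blocks = []
    i = n
    while i > 0:
        j, p = choice[i]
        blocks.append((s[j:j + p], (i - j) // p))
        i = j
    return blocks[::-1]


result = optimal_blocks("ABCADCADCADABC")
```

With this cost model the example string comes out as AB*1 CAD*3 ABC*1. A different cost model (for instance, charging more for large counts) could change which split wins.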

Better still would be:
ABCAD(6,3)(3,11)
where (n,d) is a length and distance back of a match. So (6,3) copies six bytes starting from three bytes back. While that may sound a little odd, by the time it gets three bytes in, the next three bytes it needs have been copied. So CADCAD is appended. The (3,11) causes ABC to be appended.
This is called LZ77 compression. It is what is implemented by zip, gzip, and zlib using the deflate compressed data format. That format not only references previous string matches, but also uses Huffman compression on the literals (e.g. ABCAD) as well as the lengths and distances.
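To make the overlapping copy concrete, here is a small decoder sketch for the (length, distance) notation used above. The token representation (plain strings for literals, tuples for matches) is an assumption for illustration and is not the deflate bit format.

```python
def lz77_decode(tokens):
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            # Literal characters are copied through unchanged.
            out.extend(tok)
        else:
            length, distance = tok
            # Copy one character at a time so that a match may overlap
            # the characters it is producing (length > distance).
            for _ in range(length):
                out.append(out[-distance])
    return "".join(out)


decoded = lz77_decode(["ABCAD", (6, 3), (3, 11)])
```

Copying character by character is what lets (6,3) work: each appended character becomes available as a source three positions later.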

Related

Why Huffman Coding is good?

I am not asking how Huffman coding is working, but instead, I want to know why it is good.
I have the following two questions:
Q1
I understand that the ultimate purpose of Huffman coding is to give certain chars fewer bits, so that space is saved. What I don't understand is why the number of bits chosen for a char should be related to the char's frequency.
Huffman Encoding Trees says
It is sometimes advantageous to use variable-length codes, in which
different symbols may be represented by different numbers of bits. For
example, Morse code does not use the same number of dots and dashes
for each letter of the alphabet. In particular, E, the most frequent
letter, is represented by a single dot.
So in Morse code, E can be represented by a single dot because it is the most frequent letter. But why? Why can it be a dot just because it is most frequent?
Q2
Why are the probabilities / statistics of the chars so important to Huffman coding?
What happens if the statistics table is wrong?
If you assign fewer bits (shorter code words) to the most frequently used symbols, you will save a lot of storage space.
Suppose you want to assign 26 unique codes to the English alphabet and store an English novel (letters only) in terms of these codes. You will need less memory if you assign short codes to the most frequently occurring characters.
You might have observed that postal codes and STD codes for important cities are usually shorter (as they are used very often). This is a very fundamental concept in information theory.
Huffman encoding gives prefix codes.
Construction of Huffman tree:
A greedy approach to constructing a Huffman tree for n characters is as follows:
Place the n characters in n single-node sub-trees.
Combine the two lowest-weight nodes into a tree whose root is assigned the sum of the two nodes' weights.
Repeat this until you get a single tree.
For example, consider a binary tree where E and T have high weights (as they occur very often).
It is a prefix tree. To get the Huffman code for any character, start from the node corresponding to the character and backtrack until you reach the root node; the branch labels collected along the way, read in reverse, form the code.
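The greedy construction above can be sketched with a priority queue. This is an illustrative version, not a reference implementation: the function name, the tie-breaking counter, and the choice of "0" for left branches are all assumptions, and it walks the tree from the root down rather than backtracking from the leaves.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    freq = Counter(text)
    if not freq:
        return {}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Combine the two lowest-weight nodes under a new root.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: a single character
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes("EEEEEEEETTTTAAN")
```

In this sample text E occurs 8 times, T 4 times, A twice and N once, so E ends up with a 1-bit code, T with 2 bits, and A and N with 3 bits each.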
Indeed, an E could be, say, three dashes followed by two dots. When you make your own encoding, you get to decide. If your goal is to encode a certain text so that the result is as short as possible, you should choose short codes for the most frequent characters. The Huffman algorithm ensures that we get an optimal prefix code for a specific text, that is, the best possible among codes that assign each character its own whole number of bits.
If the frequency table is somehow wrong, the Huffman algorithm will still give you a valid encoding, but the encoded text would be longer than it could have been if you had used a correct frequency table. This is usually not a problem, because we usually create the frequency table based on the actual text that is to be encoded, so the frequency table will be "perfect" for the text that we are going to encode.
Well, you want to assign shorter codes to the symbols which appear more frequently. Huffman encoding works on just this simple assumption. :-)
You compute the frequency of all symbols, sort them, and start assigning bit codes to each one. The more frequent a symbol is, the shorter the code you assign to it. Simple as that.
The big question is: how large should the window in which we compute such frequencies be? Should it be as large as the entire file, or smaller? And if smaller, how large? Some Huffman encoders have a sort of "test run" in which they estimate the best window size, a little bit like TCP/IP does with its window sizes.
Huffman codes provide two benefits:
they are space efficient given some corpus
they are prefix codes
Given some set of documents, for instance, encoding those documents with Huffman codes is the most space-efficient way of encoding them symbol by symbol with a prefix code, thus saving space. This however only applies to that set of documents, as the codes you end up with depend on the probability of the tokens/symbols in the original set of documents. The statistics are important because the symbols with the highest probability (frequency) are given the shortest codes. Thus the symbols most likely to be in your data use the fewest bits in the encoding, making the coding efficient.
The prefix code part is useful because it means that no code is the prefix of another. In Morse code, for instance, A = dot dash and J = dot dash dash dash, so how do you know where to break when reading the code? This makes transmitting data in Morse less efficient, as you need a special symbol (a pause) to signal the end of each code. Compare that to Huffman codes, where no code is the prefix of another: as soon as you recognize the encoding for a symbol in the input, you know that it is the transmitted symbol, because it is guaranteed not to be the prefix of some other symbol's code.
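A short sketch of why the prefix property removes the need for separators. The code table below is a made-up example (any prefix-free table would do), and the decoder assumes the input is a valid concatenation of codes.

```python
# Hypothetical prefix-free code table, chosen only for illustration.
CODES = {"E": "0", "T": "10", "A": "110", "N": "111"}


def encode(text):
    return "".join(CODES[ch] for ch in text)


def decode(bits):
    lookup = {code: sym for sym, code in CODES.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        # Because no code is a prefix of another, the first match is
        # always the right one: no pause or delimiter is required.
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return "".join(out)


bits = encode("ETAEN")
text = decode(bits)
```

The same greedy reading would not work for Morse without pauses, because dot dash (A) is a prefix of dot dash dash dash (J).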
It's the dual effect of having the most frequent characters using the shortest bit sequences that gives you the savings.
For a concrete example, let's say you have a piece of text that consists of 1024 e characters and 1024 of all other characters combined.
With 8 bits for code, that's a full 2048 bytes used in uncompressed form.
Now let's say we represent e as a single 1-bit and every other letter as a 0-bit followed by its original 8 bits (a very primitive form of Huffman).
You can see that half the characters have been expanded from 8 bits to 9, giving 9216 bits, or 1152 bytes. However, the e characters have been reduced from 8 bits to 1, meaning they take up 1024 bits, or 128 bytes.
The total bytes used is therefore 1152 + 128, or 1280 bytes, representing a compression ratio of 62.5%.
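The arithmetic in this example can be checked in a few lines. The variable names are invented for the illustration; the numbers are the ones from the text.

```python
E_COUNT = 1024       # occurrences of 'e'
OTHER_COUNT = 1024   # all other characters combined

# Uncompressed: every character takes 8 bits.
original_bytes = (E_COUNT + OTHER_COUNT) * 8 // 8

# Primitive scheme: 'e' -> 1 bit, anything else -> 0-bit + original 8 bits.
e_bytes = E_COUNT * 1 // 8
other_bytes = OTHER_COUNT * 9 // 8
encoded_bytes = e_bytes + other_bytes

ratio = encoded_bytes / original_bytes
```

The saving comes entirely from the frequent symbol: the rare symbols actually grow by one bit each.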
You can use a fixed encoding scheme based on the likely frequencies of characters (such as English text), or you can use adaptive Huffman encoding which changes the encoding scheme as characters are processed and frequencies are adjusted. While the former may be okay for input which has high probability of matching frequencies, the latter can adapt to any input.
The statistics table can't be wrong, because in general the Huffman algorithm analyzes the whole text at the beginning and builds frequency statistics for the given text, while Morse has a static symbol-to-code map.
The Huffman algorithm takes advantage of the given text. As an example, even if E is the most frequent letter in English in general, that doesn't mean that E is the most frequent letter in a given text by a given author.
Another advantage of the Huffman algorithm is that you can use it for any alphabet, from the binary alphabet {0, 1} all the way to Chinese characters, while Morse is defined only for English letters.
So in Morse code, "E" can be represented by a single dot, because it is the most frequent letter. But why? Why is it a dot because of its frequency?
"E" can be encoded as any unique code in a given code dictionary, so it could be "0". We choose it to be short to save memory, so that the average number of bits used after encoding is minimized.
Why is the probability / statistics of the chars so important to Huffman coding? What happens if the statistics table is wrong?
Why do we encode? To save space, right? The space used after encoding is the sum over all words i of freq(word_i) * length(word_i). This is what we should try to minimize, so we greedily assign short codes to words with high probability.
If the statistics table is wrong, then the encoding is no longer the best way to save space.
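To put numbers on that, here is a small sketch comparing the encoded size under a code built from the right statistics with one built from the wrong ones. The sample text and both code-length tables are invented for the illustration; both tables are valid prefix-code lengths.

```python
from collections import Counter


def encoded_bits(text, code_lengths):
    # Total size = sum over symbols of frequency * code length.
    freq = Counter(text)
    return sum(freq[sym] * code_lengths[sym] for sym in freq)


text = "a" * 7 + "b" * 2 + "c"

# Lengths matching the real frequencies: the common 'a' gets 1 bit.
matched = {"a": 1, "b": 2, "c": 2}
# Lengths built from a wrong table that assumed 'c' was most common.
mismatched = {"c": 1, "a": 2, "b": 2}

good = encoded_bits(text, matched)
bad = encoded_bits(text, mismatched)
```

Both encodings decode correctly; the mismatched one is simply longer (19 bits instead of 13 for this text).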

Encode an array of integers to a short string

Problem:
I want to compress an array of non-negative integers of non-fixed length (but it should be 300 to 400), containing mostly 0's, some 1's, a few 2's. Although unlikely, it is also possible to have bigger numbers.
For example, here is an array of 360 elements:
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.
Goal:
The goal is to compress an array like this, into a shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y
What I've tried:
I tried to use "delta encoding", and use zeroes to denote any value higher than 1. For example: {0,0,1,0,0,0,2,0,1} would be denoted as: 2,3,0,1. To decode it, one would read from left to right, and write down "2 zeroes, one, 3 zeroes, one, 0 zeroes, one (this would add to the previous one, and thus have a two), 1 zero, one".
To eliminate the need for delimiters (commas) and thus save more space, I tried to use only one alphanumeric character to denote delta values of 0 to 34 (using 0 to y), while leaving the letter z as "35 PLUS the next character". I think this is called "variable bit" or something like that. For example, if there are 40 zeroes in a row, I'd encode it as "z5".
That's as far as I got... the resultant string is still very long (it would be about 20 characters long in the above example). I would ideally want something like, 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!
Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-length encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes, then alternating between that and the non-zero values. (A zero-run-length of 0 will indicate successive non-zero values...)
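A minimal sketch of that RLE step, assuming the output always starts and ends with a zero-run count so that the decoder can rely on strict alternation. The function names are made up for this example.

```python
def rle_encode(values):
    out = []
    zeros = 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            # Emit the zero run (possibly 0) followed by the value.
            out.extend([zeros, v])
            zeros = 0
    out.append(zeros)  # trailing run of zeroes, possibly empty
    return out


def rle_decode(encoded):
    values = []
    for i, x in enumerate(encoded):
        if i % 2 == 0:
            values.extend([0] * x)  # even positions: zero-run lengths
        else:
            values.append(x)        # odd positions: non-zero values
    return values


sample = [0, 0, 1, 0, 0, 0, 2, 0, 1]
packed = rle_encode(sample)
```

Unlike the scheme in the question, this version stores the non-zero value explicitly, so larger numbers such as 4 or 5 cost one entry rather than several zero-length runs.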
Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using a smaller number of bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.
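Elias gamma coding is one well-known universal code, picked here only as an example (the answer does not say which one to use). It encodes a positive integer n as floor(log2 n) zero bits followed by n in binary; to encode values that may be 0, one would encode n + 1 instead. The decoder below assumes well-formed input.

```python
def elias_gamma_encode(n):
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]
    # The leading zeroes tell the decoder how many bits follow.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        # Read zeros + 1 bits as the binary value.
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


stream = "".join(elias_gamma_encode(n) for n in [1, 2, 5])
```

Small numbers are cheap (1 costs a single bit) while arbitrarily large numbers remain representable, which suits data dominated by small values.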
You may also want to look into how JPEG-style encoding works. After DCT and quantization, the JPEG entropy encoding problem seems similar to yours.
Finally, if you want to go for maximum compression, you might want to look up arithmetic encoding, which can compress your data arbitrarily close to the statistical minimum entropy.
The above links explain how to compress to a stream of raw bits. In order to convert them to a string of letters and numbers, you will need to add another encoding step, which converts the raw bits to such a string. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse".
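One simple way to do that last step is to treat the bit stream as one big integer and write it in base 62. This is a sketch under assumptions: the 62-character alphabet and the guard-bit trick are choices made for this example, not something the answer prescribes.

```python
import string

# Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def bits_to_text(bits):
    # A leading guard bit '1' preserves any leading zero bits.
    n = int("1" + bits, 2)
    chars = []
    while n:
        n, r = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))


def text_to_bits(text):
    n = 0
    for ch in text:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return bin(n)[3:]  # drop the '0b' prefix and the guard bit


encoded = bits_to_text("000101")
```

Each character of a 62-symbol alphabet carries a little under 6 bits, so the string length is roughly the bit count divided by 6.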
Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best.
Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike the kind you expect, the result may be an expansion, rather than a compression. You can limit this expansion by providing an alternative, uncompressed format, to be used in such cases...
In your data you have:
14 1s (3.89% of data)
4 2s (1.11%)
1 3s, 4s and 5s (0.28%)
339 0s (94.17%)
Assuming that your numbers are independent of each other and you do not have any other information, the entropy of your data is 0.407 bits per number, that is 146.4212 bits overall (18.3 bytes). So it is impossible to encode in 8 bytes.
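The figures above can be reproduced with the standard Shannon entropy formula, using the counts listed in this answer (treating each number as an independent draw):

```python
import math

# Counts taken from the answer: value -> number of occurrences.
counts = {0: 339, 1: 14, 2: 4, 3: 1, 4: 1, 5: 1}
total = sum(counts.values())

# Shannon entropy in bits per symbol: -sum(p * log2(p)).
entropy = -sum(c / total * math.log2(c / total) for c in counts.values())

total_bits = entropy * total
total_bytes = total_bits / 8
```

This bound only holds for coders that model the values as independent; a model that exploits structure between neighbouring values could in principle do better.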

if i convert a file's contents into a single large number and express it as a mathematical expression, does it mean I have compressed the file?

Assuming the mathematical expression has fewer characters than the original number.
example-
20880467999847912034355032910578 can be expressed as (23^23 + 11)
this looks like a good compression method. Will it work for compressing large files?
UPDATE- i didn't mean converting a file into a large binary number. lets say i have a text file and i replace all the characters in it with their ascii values. now i have a large number in the decimal number system. i can express it as a mathematical expression like in the example above.
The notion you're looking for is Kolmogorov complexity - it's a measure of how algorithmically incompressible a number is. See this wiki article for a rigorous definition and examples of such numbers.
If you take the contents of a file as a large binary number, and find an expression which evaluates to that number and can be stored more compactly than the number itself, then yes, you have compressed the file.
Unfortunately, for most files, you'll never find such an expression.
Simple logic (see the link posted by @OliCharlesworth) should convince you that it's impossible to find such an expression for all or even most files. Even for files which might have a suitable expression, finding it will be very, very difficult. If you want to convince yourself of this, try this challenge:
Take the following ASCII string:
"Holy Kolmogorov complexity, Batman! Compress this sucker down good and you'll get a pretty penny, my fine lad!"
Interpreted as a binary number, with the high-order digits coming first, that is: 2280899635869589768629811602006623364651019118009864206881173103187172975244099647369151382436996220022807793898568915685059542016541775658916080587423284053601554008368389985872997499032440860090224967472423163775276043175694884234152335588829534778866153948275745.
Try to find a polynomial which evaluates to that number. All the numbers used must be integral, and the total number of decimal digits appearing in the polynomial must be less than 80. If you succeed, I will send you a small cash prize by PayPal.
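For anyone who wants to experiment, here is a sketch of how a string can be turned into such a number and back, reading the bytes with the high-order digits first as described above. The helper names are invented for this example.

```python
def text_to_int(text):
    # Interpret the ASCII bytes as one big-endian binary number.
    return int.from_bytes(text.encode("ascii"), "big")


def int_to_text(n, length):
    # The original length is needed to restore any leading zero bytes.
    return n.to_bytes(length, "big").decode("ascii")


message = "Holy Kolmogorov complexity, Batman!"
number = text_to_int(message)
restored = int_to_text(number, len(message))
```

Going from text to number is trivial; the hard part of the challenge is finding a short expression for the resulting number.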
Yes, by definition. You have correctly defined compression as representing something larger with something smaller.
How do you propose to do this? How often will that work? There's the rub.

Algorithm to Map Strings to Short Replacements

I'm looking at ways to deterministically replace unique strings with unique and optimally short replacements. So I have a finite set of strings, and the best compression I could achieve so far is through an enumeration algorithm, where I order the input set and then replace the strings with an enumeration of char strings over an extended alphabet (a..z, A...Z, aa...zz, aA... zZ, a0...z9, Aa..., aaa...zaa, aaA...zaaA, ....).
This works wonderfully as far as compression is concerned, but has the severe drawback that it is not atomic on any given input string. Rather, its result depends on knowing all input strings right from the start, and on the ordering of the input set.
Does anybody know of an algorithm that has similar compression but doesn't require knowing all input strings upfront? Hashing, for example, would not work for me, as depending on the size of the input set I'd need a hash length of 8-12 for the hashes to be unique, and that would be too long for replacements (currently, the replacement strings are 1-3 chars long for my use cases (<10,000 input strings)). Also, if the theoreticians among us know this is wasted effort, I would be interested to hear :-) .
You could use your enumeration scheme, but sorted by the order in which you first encounter the input strings.
For example, the first string you ever process can be mapped to "a".
The next distinct string would be mapped to "b", etc.
Every time you process a string, you'd need to look it up to see if it has already been mapped.
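A sketch of this first-seen enumeration follows. The class name and the 52-letter alphabet are assumptions for illustration; the enumeration here is a plain a..z, A..Z, aa, ab, ... sequence, which is simpler than the mixed scheme in the question.

```python
import string

# Assumed alphabet: a..z followed by A..Z.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase


def short_id(n):
    # Bijective base-52 numbering: 0 -> 'a', 51 -> 'Z', 52 -> 'aa', ...
    chars = []
    n += 1
    while n > 0:
        n, r = divmod(n - 1, len(ALPHABET))
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))


class Shortener:
    def __init__(self):
        self._ids = {}

    def replace(self, s):
        # Look the string up first; assign the next id only if it is new.
        if s not in self._ids:
            self._ids[s] = short_id(len(self._ids))
        return self._ids[s]


sh = Shortener()
first = sh.replace("some/long/string")
second = sh.replace("another string")
again = sh.replace("some/long/string")
```

The mapping still depends on the order in which strings arrive, so two independent runs only agree if they see new strings in the same order or share the lookup table.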
"Optimally short" depends on the population of strings from which your samples are drawn. In the absence of systematic redundancy in the population, you will find that only a fraction of arbitrary strings can be compressed at all (e.g., consider trying to compress random bit strings).
If you can make assumptions about your data, such as "the strings are expected to be mainly composed of English words" then you can do something simple and effective based on letter frequency (e.g., for English, the relative frequency order is something like ETAOINSHRDLUGCY..., so you would want to use fewer bits to represent Es and more bits to represent uncommon letters like Q).
Cheers.

Algorithm to find string matches in a sliding window

One of the core steps in file compression like ZIP is to use the previous decoded text as a reference source. For example, the encoded stream might say "the next 219 output characters are the same as the characters from the decoded stream 5161 bytes ago." This lets you represent 219 characters with just 3 bytes or so. (There's more to ZIP than that, like Huffman compression, but I'm just talking about the reference matching.)
My question is what the strategy (or strategies) for the string matching algorithm are. Even looking at source code from zlib and the like doesn't seem to give a good description of the compression matching algorithm.
The problem might be stated as: "Given a block of text, say 30K of it, and an input string, find the longest reference in the 30K of text which exactly matches the front of the input string." The algorithm must be efficient when iterated, i.e., the 30K block of text will be updated by deleting some bytes from the front and adding new ones to the rear, and a new match performed.
I'm a lot more interested in discussions of the algorithm(s) to do this, not source code or libraries. (zlib has very good source!) I suspect there may be several approaches with different tradeoffs.
Well, I notice that you go into some detail about the problem but don't mention the information provided in section 4 of RFC 1951 (the specification for the DEFLATE Compressed Data Format, i.e. the format used in ZIP) which leads me to believe you might have missed this resource.
Their basic approach is a chained hash table using three-byte sequences as keys. As long as the chain is not empty, all the entries along it are scanned to a) eliminate false collisions, b) eliminate matches that are too old, and c) pick the longest match out of those remaining.
(Note that their recommendation is shaped by the factor of patents; it may be that they knew of a more effective technique but could not be sure that it was not covered by someone's patent. Personally, I've always wondered why one couldn't find the longest matches by examining the matches for the three-byte sequences that start at the second byte of the incoming data, the third byte, etc. and weeding out matches that don't match up. i.e., if your incoming data is "ABCDEFG..." and you've got hash matches for "ABC" at offsets 100, 302 and 416 but your only hash match for "BCD" is at offset 301, you know that unless you have two entirely coincidental overlapping hash matches -- unlikely -- then 302 is your longest match.)
Also note their recommendation of optional "lazy matching" (which ironically does more work): instead of automatically taking the longest match that starts at the first byte of the incoming data, the compressor checks for an even longer match starting at the next byte. If your incoming data is "ABCDE..." and your only matches in the sliding window are for "ABC" and for "BCDE", you're better off encoding the "A" as a literal byte and the "BCDE" as a match.
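The chained hash table idea can be sketched as follows. This is an illustration only, not zlib's implementation: it uses a Python dict keyed by the actual three-character sequence (so false collisions cannot occur), keeps chains as plain lists, and does greedy matching without the lazy-matching refinement. The function name and the max_chain parameter are assumptions.

```python
from collections import defaultdict

MIN_MATCH = 3  # length of the sequences used as hash keys


def lz77_tokens(data, window=32768, max_chain=64):
    chains = defaultdict(list)  # 3-char key -> positions, oldest first
    tokens = []
    pos, n = 0, len(data)
    while pos < n:
        best_len, best_dist = 0, 0
        key = data[pos:pos + MIN_MATCH]
        if len(key) == MIN_MATCH:
            # Scan the chain from newest to oldest candidate.
            for cand in reversed(chains[key][-max_chain:]):
                if pos - cand > window:
                    break  # everything further back is too old
                length = 0
                while pos + length < n and data[cand + length] == data[pos + length]:
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, pos - cand
        if best_len >= MIN_MATCH:
            tokens.append((best_len, best_dist))
            step = best_len
        else:
            tokens.append(data[pos])  # no usable match: emit a literal
            step = 1
        # Register every position we move past so later input can match it.
        for p in range(pos, pos + step):
            k = data[p:p + MIN_MATCH]
            if len(k) == MIN_MATCH:
                chains[k].append(p)
        pos += step
    return tokens


tokens = lz77_tokens("ABCADCADCADABC")
```

On the short example string this yields five literals followed by the matches (6, 3) and (3, 11). Capping the chain length trades compression for speed, which is the main tuning knob in this family of matchers.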
You could look at the details of the LZMA Algorithm used by 7-zip. The 7-zip author claims to have improved on the algorithm used by zlib et al.
I think you're describing a modified version of the Longest Common Substring Problem.

Resources