I have a text file that's 664KB in size (using ls -l). To my understanding, a file cannot be compressed into anything smaller than the Shannon Source Coding limit without incurring information loss.
I used a program here to calculate the average Shannon entropy of the text file (4.36 bits per character) and multiplied it by the number of characters (converting bits to bytes), which gives roughly 371KB.
Then I used bzip2, which to my understanding is lossless, and found that it compressed the file to 171K. From what I understand, nothing can be compressed below the Shannon limit without losing information, so how can bzip2 losslessly compress the file to a size smaller than that limit? Am I missing something about how the OS encodes the file, perhaps?
The text file I used for this experiment is MIT Classics' The Republic by Plato.
The program I used to calculate the Shannon entropy is this one. It gave me the same result as another program I used to cross-check it.
It is true that, in general, we cannot compress below the Shannon entropy without loss, and it is true that zip-style encodings are lossless.
However, a few points must be considered.
The Shannon entropy is defined with respect to an assumed statistical model of the information (unlike a plain logarithmic count of the possible symbols).
In some particular cases (data that is not totally random, data that obeys certain rules, and so on), it may happen that no statistical model can perfectly capture all the a priori knowledge we have.
However, this is not the most important issue here. Looking at the code you used, it appears that the only statistical information it considers is the frequency of each character. This implicitly assumes that there is no correlation between the characters.
That is clearly a very restrictive assumption, and certainly not a valid one for a text file.
Your compression algorithm, by contrast, is able to exploit the correlation between adjacent characters, which is why it beats your estimate.
It all depends on the model.
The model used in your calculation is that each byte value has some independent probability. This is called "Order 0", since the probability at each byte location depends on zero preceding bytes.
A more sophisticated model would use information from the preceding bytes to generate a probability distribution for the current byte. bzip2 makes use of the context of other bytes, as do all general-purpose lossless compressors such as gzip and xz.
By the way, pigz has a -H or --huffman option which does Order 0 compression (Huffman coding only), and will get close to the Order 0 Shannon limit you are computing.
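For illustration, here is a minimal Python sketch (not the program the asker used; the file path is a placeholder) that puts the Order 0 estimate next to what bzip2 actually achieves:

```python
import bz2
import math
from collections import Counter

def order0_entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy under the Order 0 model: independent byte frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

with open("republic.txt", "rb") as f:  # placeholder path for the text file
    data = f.read()

h = order0_entropy_bits_per_byte(data)
print(f"Order-0 entropy : {h:.2f} bits/byte")
print(f"Order-0 bound   : {h * len(data) / 8 / 1024:.0f} KB")
print(f"bzip2 output    : {len(bz2.compress(data, 9)) / 1024:.0f} KB")
```

The bzip2 figure lands below the Order 0 bound precisely because bzip2's model conditions on context, which the bound above ignores.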
I have been studying LZW compression, and there is one thing I can't satisfy myself with: while building up the dictionary in LZW, its maximum size is usually capped at 4096 entries. Why is that? Also, if the dictionary gets full, it is reset, but what if the next few characters about to be read were present in the dictionary before the reset? Is this a limitation, or is my understanding not correct?
The dictionary size is limited by the output symbol size: 12 bits can encode 4096 distinct values, which is a common choice for lecture notes and simple/assignment implementations.
However, any symbol size larger than the source symbol size can be used: a 16-bit symbol would allow 65,536 dictionary entries. The more bits, the more entries the current dictionary can hold, which can increase the "maximum" compression. Conversely, since each output symbol is larger, it may decrease compression rates, especially for smaller inputs (insufficient time to build up a dictionary) and for data that is more random (reduced ability to reuse the symbols in the dictionary). In practice, 19-20 bits seems to be about the useful limit [2], while 16-bit symbols align naturally to bytes.
It is also possible to use an adaptive symbol size based on log2 of the current number of mapped symbols [1], but this benefit disappears as the data size increases, since the dictionary fills quickly. It is also largely superseded by Huffman coding.
When the dictionary is "reset", it is effectively the same as compressing multiple chunks of data and appending the compressed output: the dictionaries are separate. However, the data is "split" dynamically based on when the dictionary fills up rather than, say, every X bytes of input. Since the symbol size is fixed, it is more efficient to make sure the dictionary is full before deciding whether to reset it.
The primary purpose to reset a dictionary is to avoid “fixating” the symbols to data characteristics in one part of the input that might not be true for later data. A compressor can use a single non-resetting dictionary, reset the dictionary as soon as it is full, reset the dictionary when it’s full and a drop in compression is encountered, etc.: the goal is to achieve the highest compression within the domain/parameters.
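As a concrete illustration of the capped dictionary and a reset-when-full policy, here is a toy LZW encoder sketch (illustrative only; real codecs such as compress also emit a clear code so the decoder can follow the reset):

```python
MAX_ENTRIES = 4096  # 12-bit output symbols

def lzw_encode(data: bytes) -> list[int]:
    """Toy LZW encoder that resets its dictionary as soon as it fills up."""
    def fresh_table():
        return {bytes([i]): i for i in range(256)}

    table = fresh_table()
    out = []
    w = b""
    for b in data:
        c = bytes([b])
        if w + c in table:
            w += c
        else:
            out.append(table[w])
            if len(table) < MAX_ENTRIES:
                table[w + c] = len(table)
            else:
                table = fresh_table()  # dictionary full: start learning phrases again
            w = c
    if w:
        out.append(table[w])
    return out

print(len(lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT")), "codes for 24 input bytes")
```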
Many LZ77/LZ78/LZW variations (and the optimizations they use) are briefly discussed in "On parsing optimality for dictionary-based text compression—the Zip case" by Alessio Langiu; the excerpts there contain lots of juicy details for further research.
1"Improving LZW' by R. Nigel Horspool goes into some details on adaptive symbol sizes. Nigel's "The Effect of Non-Greedy Parsing
in Ziv-Lempel Compression Methods" paper also includes a summary on compress's handling of dictionary resets.
2"The Relative Efficiency of Data Compression by LZW and LZSS" by Yair Wiseman includes a sample graph of symbols sizes vs. compression efficiency. The graph is highly dependent on the data.
I know most compression methods rely on repetition in the data in order to be effective. For example, the string "AAAAAaaaQWERTY" can be represented as "5A3aQWERTY" losslessly, or as something like "8aqwerty" lossily (these are just examples, not actual working methods). As far as I know, all compression algorithms count on repeats of constant strings of characters.
Here comes the problem with the string "abcdefghijklmnopqrstuvwxyz". Nothing repeats here, but as you can probably see, the information in the string can be represented far more compactly: in regex-like form it is "[a-z]", or perhaps "for(x=0;x<26;++x){ascii(97+x)}".
Consider also the string "0149162536496481100121" - it can be represented by "for(x=0;x<12;++x){x*x}".
The string "ABEJQZer" can be represented by "for(x=0;8;++){ascii(64+x*x)}"
The last two are examples of knowing an algorithm that can reproduce the original string. I know that, in general, algorithms (if they are efficient) take far less space than the data they can produce, like SVG images (which contain only drawing instructions) being smaller than the equivalent JPEG.
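To make the idea concrete, here is a tiny sketch where each of the strings above is reproduced by a short generator, so the generator itself acts as the "compressed" form:

```python
alphabet = "".join(chr(97 + x) for x in range(26))     # "abcdefghijklmnopqrstuvwxyz"
squares  = "".join(str(x * x) for x in range(12))      # "0149162536496481100121"
letters  = "".join(chr(65 + x * x) for x in range(8))  # "ABEJQZer"
print(alphabet, squares, letters, sep="\n")
```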
My question is: is there a compression method that takes the data and tries to find efficient algorithms that can represent it? Something like vectorizing a raster image (as http://vectormagic.com/ does), but working on other kinds of data too.
Consider audio data (since it can be compressed lossily): some audio editors' project files (Audacity, for example) contain information like "generate a 120Hz constant frequency with 0.8 amplitude from time 0 to time 2 minutes 45.6 seconds" (Audacity stores this info in XML format). This metadata takes very little memory, and when the project is exported to WAV or MP3, the program "renders" the information into actual samples in the exported format.
In that case the compressor should reverse the rendering process: it should take a WAV or MP3 file, figure out which algorithms can represent the samples (if it is lossy, the algorithms only need to produce some approximation of the samples, the way vectormagic.com approximates an image), and produce a compressed file.
I understand that compression time will be unbelievably long, but are there such (or similar) compression algorithms ?
All compression methods are like that: the output is a set of parameters for a set of algorithms that renders the input, or something close to the input.
For example, the MP3 audio codec breaks the input into blocks of 576 samples, converts each block into frequency-amplitude space, and prunes what a human being cannot hear. The output is equivalent to "during the next 13 milliseconds play frequencies x, y, z with amplitudes a, b, c". This works well for audio data, and the similar approach used in JPEG works well for photographic images.
Similar methods can be applied to the cases you mention. Sequences such as 987654 or 010409162536, for example, are generated by successive values of polynomials, and can be represented by the coefficients of those polynomials: the first as (9, -1) for 9-x, and the second as (1, 2, 1) for 1+2x+x².
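For instance, assuming NumPy is available, recovering and replaying such a polynomial might look like this (a sketch only, not part of any actual codec):

```python
import numpy as np

samples = [9, 8, 7, 6, 5, 4]                # the digits of "987654"
xs = np.arange(len(samples))
coeffs = np.polyfit(xs, samples, deg=1)      # ~[-1, 9], i.e. 9 - x
regenerated = np.polyval(coeffs, xs).round().astype(int)

print(coeffs)       # the compact "compressed" representation
print(regenerated)  # array([9, 8, 7, 6, 5, 4])
```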
The choice of algorithm(s) used to generate the input tends to be fixed for simplicity, and tailored for the use case. For example if you are processing photographic images taken with a digital camera there's little point in even attempting to produce a vectorial output.
When trying to losslessly compress some data, you always start by creating a model. For example, when compressing text in a human language, you assume that there are actually not that many distinct words, and that they repeat over and over. Many algorithms then learn the parameters of that model on the fly: they don't rely on what the words actually are, they find them in the given input. So the algorithm doesn't rely on the particular language used, but it does rely on the fact that the input is human language and follows certain patterns.
In general, there is no perfect algorithm that can compress everything losslessly; this is mathematically proven. For any algorithm there exists some input for which the compressed result is bigger than the input itself.
You can also try data deduplication: http://en.m.wikipedia.org/wiki/Data_deduplication. It is a somewhat different and more intelligent form of data compression.
I have two large text files (CSV, to be precise). Both have exactly the same content, except that the rows in one file are in one order and the rows in the other file are in a different order.
When I compress these two files (programmatically, using DotNetZip), I notice that one of them always comes out considerably bigger; for example, one compressed file is ~7 MB bigger than the other.
My questions are:
How does the order of data in a text file affect compression, and what measures can one take to guarantee the best compression ratio? I presume that having similar rows grouped together (at least in the case of ZIP files, which is what I am using) would help compression, but I am not familiar with the internals of the different compression algorithms and would appreciate a quick explanation on this subject.
Which algorithm handles this sort of scenario better, in the sense that it would achieve the best average compression regardless of the order of the data?
"How" has already been answered. To answer your "which" question:
The larger the window for matching, the less sensitive the algorithm will be to the order. However all compression algorithms will be sensitive to some degree.
gzip has a 32K window, bzip2 a 900K window, and xz an 8MB window. xz can go up to a 64MB window. So xz would be the least sensitive to the order. Matches that are further away will take more bits to code, so you will always get better compression with, for example, sorted records, regardless of the window size. Short windows simply preclude distant matches.
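If you want to measure this on your own data, a rough test along these lines compares the three compressors on sorted versus shuffled rows (the CSV path is a placeholder, and the results depend heavily on the data):

```python
import bz2
import gzip
import lzma
import random

rows = open("data.csv", "rb").read().splitlines(keepends=True)  # placeholder path
shuffled = rows[:]
random.shuffle(shuffled)

for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress), ("xz", lzma.compress)):
    print(name,
          "sorted:",   len(compress(b"".join(sorted(rows)))),
          "shuffled:", len(compress(b"".join(shuffled))))
```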
In some sense, it is the entropy of the file that defines how well it will compress. So yes, the order definitely matters. As a simple example, consider a file filled with the values abcdefgh...zabcd...z repeating over and over. It would compress very well with most algorithms because it is highly ordered. However, if you completely randomize the order (but keep the same count of each letter), you have exactly the same data with a different "meaning": the same data in a different order, and it will not compress nearly as well.
In fact, because I was curious, I just tried that. I filled an array with 100,000 characters a-z repeating, wrote that to a file, then shuffled that array "randomly" and wrote it again. The first file compressed down to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
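That experiment is easy to reproduce; here is a quick sketch with zlib (the exact byte counts will differ from the numbers above, since they depend on the shuffle and the compressor settings):

```python
import random
import string
import zlib

ordered = (string.ascii_lowercase * 4000)[:100_000]  # "abc...zabc...z" repeated
shuffled_list = list(ordered)
random.shuffle(shuffled_list)
shuffled = "".join(shuffled_list)

print(len(zlib.compress(ordered.encode(), 9)))   # tiny: highly ordered input
print(len(zlib.compress(shuffled.encode(), 9)))  # most of the original size
```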
A typical compression algorithm works as follows. Look at a chunk of data. If it's identical to some other recently seen chunk, don't output the current chunk literally, output a reference to that earlier chunk instead.
It surely helps when similar chunks are close together. The algorithm only keeps a limited amount of look-back data in order to keep compression speed reasonable, so even if a chunk of data is identical to some earlier chunk, that earlier chunk may already have been flushed out of the window.
Sure it does. If the input pattern is fixed, the character at each position can be predicted with 100% certainty. Given that both parties know this about their data stream (which essentially amounts to saying that they know the fixed pattern), virtually nothing needs to be communicated: total compression is possible (to communicate finite-length strings, rather than unlimited streams, you would still need to encode the length, but that is somewhat beside the point). If the other party doesn't know the pattern, all you would need to do is encode it once. Total compression is possible because you can encode an unlimited stream with a finite amount of data.
At the other extreme, if you have totally random data - so the stream can be anything, and the next character can always be any valid character - no compression is possible. The stream must be transmitted completely intact for the other party to be able to reconstruct the correct stream.
Finite strings are a little trickier. Since a finite string necessarily contains a fixed number of instances of each character, the probabilities must change once you begin reading off the initial tokens; in that sense, one can read some sort of order into any finite string.
Not sure if this answers your question, but it addresses things a bit more theoretically.
This is a question about byte-pairing for data compression. If byte pairing converts two byte values into a single byte value, splitting the file in half, then taking a gigabyte file and recursing on it 16 times shrinks it to 62,500,000 bytes. My question is: is byte-pairing really efficient? Is the creation of a 5,000,000-iteration loop, to be conservative, efficient? I would like some feedback and some incisive opinions, please.
Dave, what I read was:
"The US patent office no longer grants patents on perpetual motion machines, but has recently granted at least two patents on a mathematically impossible process: compression of truly random data."
I was not implying that the Patent Office was actually considering what I am inquiring about; I was merely commenting on the notion of a "mathematically impossible process." If someone has, in some way, created a method of having a "single" data byte act as a placeholder for 8 individual bytes of data, that would be a consideration for a patent. Now, as to the mathematical impossibility of an 8-to-1 compression method: it is not so much a mathematical impossibility as a matter of the rules and conditions that can be created. As long as there is the rule of 8- or 16-bit representation for storing data on a medium, there are ways to manipulate data that mirror current methods, or to create something by a new way of thinking.
In general, "recursive compression" as you have described it is a mirage: compression doesn't actually work that way.
First, you should realize that all compression algorithms have the potential to expand the input file instead of compressing it. You can demonstrate this by a simple counting argument: note that the compressed version of any file must be different from the compressed version of any other file (or you will not be able to decompress that file properly). Also, for any file size N, there is a fixed number of possible files of size <=N. If any files of size > N are compressible to size <= N, then an equal number of files of size <= N must expand to size >N when "compressed".
Second, "truly random" files are uncompressible. Compression works because the compression algorithm expects to receive files with certain kinds of predictable regularities. However, "truly random" files are by definition unpredictable: every random file is as likely as every other random file of the same length, so they don't compress.
Effectively, you have a model which treats some files as more likely than others; to compress such files, you want to choose shorter output files for the input files which are more likely. Information theory tells us the most efficient way to compress files is to assign each input file of probability P an output file of length ~ log2(1/P) bits. This means that, ideally, every output file of a given length has roughly equal probability, just like "truly random" files.
Among completely random files of a given length, each has probability (0.5)^(#original bits). The optimal length from above is ~ log2(1 / 0.5^(#original bits)) = (#original bits), which is to say, the original length is the best you can do.
Because the output of a good compression algorithm is nearly random, re-compressing the compressed file will get you little to no gain. Any further improvements are effectively "leakage" due to suboptimal modeling and encoding; also, compression algorithms tend to scramble any regularity they don't take advantage of, making further compression of such "leakage" more difficult.
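You can see this stalling behaviour for yourself; here is a quick sketch with bz2 (the input path is a placeholder, and the exact numbers are incidental):

```python
import bz2

data = open("republic.txt", "rb").read()  # placeholder: any large text file
for i in range(4):
    data = bz2.compress(data, 9)
    print(f"pass {i + 1}: {len(data)} bytes")
# Typical result: a large drop on pass 1, then the size stays flat or even
# grows slightly, since each extra pass only adds container overhead.
```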
For a much longer exposition on this topic, with many examples of failed propositions of this type, see the comp.compression FAQ. Claims of "recursive compression" feature prominently.
Can a long ASCII text string be crushed and compressed into a short, hash-like ASCII "checksum" by using a sophisticated mathematical formula/algorithm? Just like air, which can be compressed.
The idea is to compress megabytes of ASCII text into 128 or so bytes by shuffling, then mixing new "patterns" of single "bytes" in turn from the first to the last. When decompressing, the last character is extracted first, and then decompression continues using the formula and the sequential keys from last to first. The sequential keys, the first and last bytes, the fully updated final compiled string, and the total number of bytes that were compressed must all be exactly known.
This is the "terra compression" I was thinking about. Is it possible? Can you explain with examples? I am working on this theory, and it is my own idea.
In general? Absolutely not.
For some specific cases? Yup. A megabyte of ASCII text consisting only of spaces is likely to compress extremely well. Real text will generally compress pretty well... but nowhere near several megabytes down to 128 bytes.
Think about just how many strings - even just strings of valid English words - can fit into several megabytes. Far more than 256^128. They can't all compress down to 128 bytes, by the pigeon-hole principle...
If you have n possible input strings and m possible compressed strings, and m is less than n, then two input strings must map to the same compressed string. This is called the pigeonhole principle and is the fundamental reason why there is a limit on how much you can compress data.
What you are describing is more like a hash function. Many hash functions are designed so that given a hash of a string it is extremely unlikely that you can find another string that gives the same hash. But there is no way that given a hash you can discover the original string. Even if you are able to reverse the hashing operation to produce a valid input that gives that hash, there are infinitely many other inputs that would give the same hash. You wouldn't know which of them is the "correct" one.
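The counting behind the pigeonhole argument is easy to sanity-check; for example, comparing the number of possible 1 MB inputs with the number of possible 128-byte outputs:

```python
n_outputs = 256 ** 128        # every possible 128-byte "compressed" file
n_inputs  = 256 ** 1_000_000  # every possible 1 MB input file

print(n_outputs.bit_length())  # 1025 bits of choice on the output side
print(n_inputs.bit_length())   # 8000001 bits of choice on the input side
# Reversibility would require at least as many outputs as inputs, and there
# are astronomically fewer.
```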
Information theory is the scientific field that addresses questions of this kind. It also lets you calculate the minimum number of bits needed to store a compressed message (with lossless compression). This lower bound is known as the entropy of the message.
The entropy of a piece of text can be estimated using a Markov model. Such a model uses information about how likely a given sequence of characters of the alphabet is.
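As a sketch of that idea, a first-order Markov model estimates the entropy of each character conditioned on the one before it (higher-order models give tighter bounds on real text):

```python
import math
from collections import Counter, defaultdict

def markov1_entropy_bits_per_char(text: str) -> float:
    """H(X_n | X_{n-1}) estimated from adjacent-character counts."""
    context = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        context[prev][cur] += 1
    total = len(text) - 1
    h = 0.0
    for counts in context.values():
        ctx_total = sum(counts.values())
        for c in counts.values():
            h -= (c / total) * math.log2(c / ctx_total)
    return h

sample = "the quick brown fox jumps over the lazy dog " * 50
print(f"{markov1_entropy_bits_per_char(sample):.2f} bits/char")
```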
The air analogy is very wrong.
When you compress air you make the molecules come closer to each other, each molecule is given less space.
When you compress data you cannot make the bits smaller (unless you put your hard drive in a hydraulic press). The closest you can get to actually making bits smaller is increasing the bandwidth of a network, but that is not compression.
Compression is about finding a reversible formula for calculating the data. The "rules" of data compression go something like this:
The algorithm (including any standard starting dictionaries) is shared beforehand and is not included in the compressed data.
All startup parameters must be included in the compressed data, including:
Choice of algorithmic variant
Choice of dictionaries
All compressed data
The algorithm must be able to compress/decompress all possible messages in your domain (like plain text, digits or binary data).
To get a feeling for how compression works, you may study some examples, like run-length encoding and Lempel-Ziv-Welch.
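As a taste, here is a tiny run-length encoder in the spirit of the "AAAAAaaaQWERTY" -> "5A3aQWERTY" example earlier in the thread (a sketch; runs of length 1 are passed through unchanged):

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    out = []
    for ch, group in groupby(text):
        run = len(list(group))
        out.append(f"{run}{ch}" if run > 1 else ch)
    return "".join(out)

print(rle_encode("AAAAAaaaQWERTY"))  # -> 5A3aQWERTY
```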
You may be thinking of fractal compression which effectively works by storing a formula and start values. The formula is iterated a certain number of times and the result is an approximation of the original input.
This allows for high compression but is lossy (output is close to input but not exactly the same) and compression can be very slow. Even so, ratios of 170:1 are about the highest achieved at the moment.
This is a bit off topic, but I'm reminded of the Broloid compression joke thread that appeared on USENET ... back in the days when USENET was still interesting.
Seriously, anyone who claims to have a magical compression algorithm that reduces any megabyte text file to a few hundred bytes is either:
a scammer or click-baiter,
someone who doesn't understand basic information theory, or
both.
You can compress text to a certain degree because it doesn't use all the available byte values (e.g. a-z and A-Z make up 52 out of 256 values). Repeating patterns allow some intelligent storage (zip).
There is no way to store arbitrary large chunks of text in any fixed length number of bytes.
You can compress air, but you won't remove its molecules! Its mass stays the same.