E.g. how can it tell that a 4GB text file can be compressed to, say, 200MB? Obviously, it doesn't read all of the contents in 2 or so seconds... so what kind of predictive algorithm(s) does it use?
They use a variant of Prediction by Partial Matching (PPM) called PPMd.
Have a look at the Wikipedia article on it.
An optimal code needs about -log(p)/log(2) = -log2(p) bits to encode a symbol that occurs with probability p. However, this is a highly theoretical value, and it depends heavily on the data you want to compress. For your data you would record each character and its frequency and insert them into the formula; try it with only 3 distinct characters first. You want to look up Shannon coding.
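A minimal sketch of that calculation in C, for a made-up 3-character alphabet (the counts here are invented; substitute the real frequencies from your own data):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical counts for a 3-character alphabet, e.g. 'a', 'b', 'c'. */
    double counts[3] = { 50.0, 30.0, 20.0 };
    double total = 0.0, entropy = 0.0;

    for (int i = 0; i < 3; i++)
        total += counts[i];

    for (int i = 0; i < 3; i++) {
        double p = counts[i] / total;          /* relative frequency       */
        entropy += -p * log2(p);               /* -p * log2(p) bits/symbol */
    }

    printf("Entropy: %.4f bits per character\n", entropy);
    printf("Estimated minimum size: %.0f bits for %.0f characters\n",
           entropy * total, total);
    return 0;
}

The estimate only tells you how well a code built on these exact frequencies could do; real data rarely matches the assumed model exactly.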
I have a text file that's 664KB in size (using ls -l). To my understanding, a file cannot be compressed into anything smaller than the Shannon Source Coding limit without incurring information loss.
I used a program here to calculate the average Shannon entropy of the text file (4.36 bits per character) and multiplied it by its number of characters. I get 371KB.
Then I used bzip2, which to my understanding is lossless, and found that it compressed the file to 171KB. From what I understand, nothing can be compressed below the Shannon limit without losing information, so how can bzip2 losslessly compress the file to something smaller than that? Am I missing something about how the OS encodes the file, perhaps?
The text file I used for this experiment is the MIT Classics edition of The Republic by Plato.
The program I used to calculate shannon entropy is this one. It gave me the same result as another program I used to cross-check it.
It is true that we generally cannot compress below the Shannon entropy (assuming no loss), and it is true that all zip encodings are lossless.
However, a few points must be considered.
The Shannon entropy (unlike some simpler logarithm-based measures) is defined with respect to a statistical model that is assumed for the information.
In some particular cases (data that is not totally random, that obeys certain rules, ...), it may happen that no statistical model can perfectly capture all the a priori knowledge we have.
However, this is not the most important issue here. Looking at the code you used, it appears that the only statistical information considered is the frequency of each character. This implicitly assumes that there is no correlation between characters.
That is clearly a very restrictive assumption, certainly not valid for a text file.
Your compression algorithm is clearly able to benefit from the correlation between adjacent characters.
It all depends on the model.
The model used in your calculation is that each byte value has some independent probability. This is called "Order 0", since the probability at each byte location depends on zero preceding bytes.
A more sophisticated model would use information from the preceding bytes to generate a probability distribution for the current byte. bzip2 makes use of the context of other bytes, as do all general-purpose lossless compressors such as gzip and xz.
By the way, pigz has a -H or --huffman option which does Order 0 compression (Huffman coding only), and will get close to the Order 0 Shannon limit you are computing.
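To see how much even a single byte of context matters, here is a small sketch (nothing in it comes from bzip2 or pigz; the file name is whatever you pass on the command line) that computes both the Order 0 entropy you calculated and the Order 1 entropy, where each byte is conditioned on the byte before it:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Compare order-0 entropy (independent bytes) with order-1 entropy
 * (each byte conditioned on the previous byte).
 * Usage: ./entropy somefile.txt */
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    static long count0[256];            /* byte frequencies              */
    static long count1[256][256];       /* bigram frequencies            */
    long total = 0;
    int prev = -1, c;

    while ((c = fgetc(f)) != EOF) {
        count0[c]++;
        if (prev >= 0) count1[prev][c]++;
        prev = c;
        total++;
    }
    fclose(f);

    double h0 = 0.0;                    /* order-0 entropy, bits/byte    */
    for (int i = 0; i < 256; i++) {
        if (!count0[i]) continue;
        double p = (double)count0[i] / total;
        h0 -= p * log2(p);
    }

    double h1 = 0.0;                    /* order-1 (conditional) entropy */
    for (int i = 0; i < 256; i++) {
        long ctx = 0;
        for (int j = 0; j < 256; j++) ctx += count1[i][j];
        if (!ctx) continue;
        double pctx = (double)ctx / (total - 1);
        for (int j = 0; j < 256; j++) {
            if (!count1[i][j]) continue;
            double p = (double)count1[i][j] / ctx;
            h1 -= pctx * p * log2(p);
        }
    }

    printf("order-0: %.3f bits/byte -> ~%ld bytes\n", h0, (long)(h0 * total / 8));
    printf("order-1: %.3f bits/byte -> ~%ld bytes\n", h1, (long)(h1 * total / 8));
    return 0;
}

On English text the order-1 figure is already noticeably lower than the order-0 one; bzip2 effectively exploits much longer contexts than this.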
I know most compression methods rely on some data repeating in order to be effective. For example the string "AAAAAaaaQWERTY" can be represented as "5A3aQWERTY" for lossless compression and something like "8aqwerty" for lossy (these are just examples, not actual working methods). As far as I know, all compression algorithms count on repeats of ->constant<- strings of characters.
Here comes the problem with the string "abcdefghijklmnopqrstuvwxyz". Here nothing repeats, but as you can probably see the information in the string can be represented in a far shorter manner. In regex-like notation it would be "[a-z]", or maybe "for(x=0;x<26;++x){ascii(97+x)}".
Consider also the string "0149162536496481100121" - it can be represented by "for(x=0;x<12;++x){x*x}".
The string "ABEJQZer" can be represented by "for(x=0;8;++){ascii(64+x*x)}"
The last two were examples of knowing an algorithm which can reproduce the original string. I know that in general, algorithms (if they are efficient) take far less space than the data they can produce.
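Written out as a runnable sketch, those loops really do regenerate the example strings:

#include <stdio.h>

int main(void) {
    /* "abcdefghijklmnopqrstuvwxyz": 26 consecutive ASCII codes from 97. */
    for (int x = 0; x < 26; x++) putchar(97 + x);
    putchar('\n');

    /* "0 1 4 9 16 25 36 49 64 81 100 121": the first 12 squares. */
    for (int x = 0; x < 12; x++) printf("%d", x * x);
    putchar('\n');

    /* "ABEJQZer": ASCII codes 65 + x*x for x = 0..7. */
    for (int x = 0; x < 8; x++) putchar(65 + x * x);
    putchar('\n');
    return 0;
}

Each loop is only a handful of characters long, yet it stands in for the whole string it produces.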
It is like SVG images (which contain only drawing algorithms in the file), whose size is smaller than a JPEG of the same picture.
My question is: is there a compression method which takes the data and tries to find efficient algorithms that can represent it? Like vectorizing a raster image (as http://vectormagic.com/ does), but working with other kinds of data too.
Consider audio data (since it can be compressed lossily) - some audio editors (Audacity for example) have project files containing information like "generate a 120Hz constant frequency with 0.8 amplitude from time 0 to time 2 minutes 45.6 seconds" (Audacity stores its info in XML format). This metadata takes very little memory, and when the project is exported to WAV or MP3, the program "renders" the information to actual samples in the exported format.
In that case the compressor would have to reverse the rendering process: take a WAV or MP3 file, figure out what algorithms can represent the samples (if it's lossy, the algorithms only have to produce some approximation of the samples - like vectormagic.com approximates the image) and produce the compressed file.
I understand that compression time will be unbelievably long, but are there such (or similar) compression algorithms?
All compression methods are like that: the output is a set of parameters for a fixed set of algorithms that render the input, or something similar to the input.
For example the MP3 audio codec breaks the input into blocks of 576 samples, converts each block into frequency-amplitude space, and prunes what cannot be heard by a human being. The output is equivalent to "during the next 13 milliseconds play frequencies x, y, z with amplitudes a, b, c". This works well for audio data, and the similar approach used in JPEG works well for photographic images.
Similar methods can be applied to the cases you mention. Sequences such as 987654 or 010409162536, for example, are generated by successive values of polynomials, and can be represented as the coefficients of those polynomials: the first one as (9, -1) for 9-x, and the second one as (1, 2, 1) for 1+2x+x².
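To make the polynomial idea concrete, here is a small sketch (not how any real codec works) that uses finite differences, difference-engine style, to reduce such a sequence to a handful of numbers and then rebuild it:

#include <stdio.h>

/* Sketch: reduce a polynomial-generated sequence to the leading entry of each
 * finite-difference row, then rebuild the original values from those few
 * stored numbers alone. */
int main(void) {
    long seq[] = { 1, 4, 9, 16, 25, 36 };   /* 1 + 2x + x^2 for x = 0..5 */
    int n = sizeof seq / sizeof seq[0];

    long work[16], lead[16];
    for (int i = 0; i < n; i++) work[i] = seq[i];

    /* Take successive difference rows until a row is constant. */
    int depth = 0, len = n;
    for (;;) {
        lead[depth] = work[0];
        int constant = 1;
        for (int i = 1; i < len; i++)
            if (work[i] != work[0]) { constant = 0; break; }
        if (constant || len == 1) break;
        for (int i = 0; i + 1 < len; i++) work[i] = work[i + 1] - work[i];
        len--;
        depth++;
    }

    printf("stored %d numbers instead of %d:", depth + 1, n);
    for (int d = 0; d <= depth; d++) printf(" %ld", lead[d]);
    printf("\n");

    /* Rebuild the sequence by repeatedly adding the difference rows back up. */
    long acc[16];
    for (int d = 0; d <= depth; d++) acc[d] = lead[d];
    printf("rebuilt:");
    for (int i = 0; i < n; i++) {
        printf(" %ld", acc[0]);
        for (int d = 0; d < depth; d++) acc[d] += acc[d + 1];
    }
    printf("\n");
    return 0;
}

In this sketch the stored triple (1, 3, 2) plays the same role as the coefficients (1, 2, 1): either form is a tiny "program" that regenerates the whole sequence.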
The choice of algorithm(s) used to generate the input tends to be fixed for simplicity, and tailored for the use case. For example if you are processing photographic images taken with a digital camera there's little point in even attempting to produce a vectorial output.
When trying to losslessly compress some data, you always start by creating a model. For example, when compressing text in a human language, you assume that there are actually not that many distinct words and that they repeat over and over. Many algorithms then try to learn the parameters of the model on the fly: they don't rely on knowing what those words actually are, they try to find them in the given input. So the algorithm doesn't rely on the actual language used, but it does rely on the fact that it is a human language, which follows certain patterns.
In general there is no perfect algorithm that can compress everything losslessly; this is mathematically proven. For any algorithm there exists some data for which the compression result is bigger than the data itself.
You can also try data de-duplication: http://en.m.wikipedia.org/wiki/Data_deduplication. It's a slightly different and more intelligent kind of data compression.
Say I have a number of strings which are quite similar but not absolutely identical.
They can differ more or less, but similarity can be seen by the naked eye.
All lengths are equal, each is 256 bytes. The total number of strings is less than 2^16.
What would be the best compression method for such case?
UPDATE (data format):
I can't share the data but I can describe it quite close to reality:
Imagine a notation (like the LOGO language) which is a sequence of commands for some device that moves and draws on a plane, such as:
U12 - move up 12 steps
D64 - move down 64 steps
C78 - change drawing color to 78
P1 - pen down (start drawing)
and so on.
The whole vocabulary of this language doesn't exceed the size of the English alphabet.
The string then describes a whole picture: "U12C6P1L74D74R74U74P0....".
Imagine now a class of ten thousand children who were told to draw some very specific image with the help of this language, like the flag of their country. We get 10K strings which are all different and all alike at the same time.
Our task is to compress the whole bunch of strings as well as possible.
My suspicion is that there is a way to exploit this similarity and the common length of the strings, whereas plain Huffman coding, for example, won't use it explicitly.
Could you tell us what the data is? Maybe something like a DNA sequence? Like
AGCTGTGCGAGAGAGAGCGGTGGG...
GGCTGTGCGAGCGAGAGCGGTGGG...
CGCTGTGAGAGNGAGAGCGGTGGG...
NGCTGTGCGAGAGAGAGCGGTGGG...
GGCTGTGCGAGTGAGAGCGGTGGG...
... ...
?
Maybe, maybe not. Anyway, here are two levels, or two ways, to think about it:
Huffman coding: see the Wikipedia article.
Stringology: see http://books.google.com.hk/books/about/Jewels_of_stringology.html?id=9NdohJXtIyYC
I think your problem is easy to solve but it is hard to choose the best way. You can design several methods and compare them, using http://en.wikipedia.org/wiki/Data_compression and other tools.
Since you have a fixed width of 256 bytes and it's a power of 2, I would try a Burrows-Wheeler transform or a move-to-front transform with that block size, or maybe double that size. Then you can apply a Huffman code. Maybe you can even try a Hilbert curve over the 256 bytes and then a BWT and MTF?
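For a feel of what the move-to-front step does, here is a minimal sketch (the input is just the example drawing string from the question; nothing is tuned to the 256-byte block size suggested above):

#include <stdio.h>
#include <string.h>

/* Move-to-front transform: each output value is the current position of the
 * input byte in a "recently used" table, after which that byte is moved to
 * the front. Recurring bytes become small numbers. */
static void mtf_encode(const unsigned char *in, unsigned char *out, size_t n) {
    unsigned char table[256];
    for (int i = 0; i < 256; i++) table[i] = (unsigned char)i;

    for (size_t k = 0; k < n; k++) {
        int pos = 0;
        while (table[pos] != in[k]) pos++;        /* find the byte's rank  */
        out[k] = (unsigned char)pos;
        memmove(table + 1, table, pos);           /* shift the prefix down */
        table[0] = in[k];                         /* move byte to front    */
    }
}

int main(void) {
    const unsigned char msg[] = "U12C6P1L74D74R74U74P0";
    unsigned char coded[sizeof msg];
    mtf_encode(msg, coded, sizeof msg - 1);
    for (size_t i = 0; i < sizeof msg - 1; i++) printf("%d ", coded[i]);
    printf("\n");
    return 0;
}

After a BWT groups similar contexts together, a pass like this turns the repeated command bytes into small values near zero, which a Huffman coder then encodes cheaply.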
"The total number of strings is less than 2^16." This is a small, bounded number, which makes your job very easy: Why don't you keep a lookup table (hash table) of all strings previously seen. You can then convert every line of 256 bytes into a two-byte index into this lookup table.
You then have a sequence of 16-bit integers. These integers will contain patterns like "after the pen went down, there is a 90% chance that the next command is to start to draw". If the data contains patterns like this, PPM is your choice. 7-Zip has a high-quality PPM implementation; you can choose it using the GUI or the command line.
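A sketch of the lookup-table step might look like this (the 256-byte record length and the 2^16 bound are taken from the question; the hashing details are just one possible choice). The 16-bit index stream written to stdout is what you would then hand to the PPM coder:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE   256                  /* fixed record length from the question */
#define MAXIDX 65536                /* "less than 2^16" distinct strings     */
#define SLOTS  (MAXIDX * 2)         /* open-addressing table, half full max  */

static unsigned char dict[MAXIDX][LINE];   /* index -> original 256 bytes    */
static int32_t       slot[SLOTS];          /* hash slot -> index, -1 = empty */
static uint32_t      ndict = 0;

static uint32_t hash_line(const unsigned char *p) {
    uint32_t h = 2166136261u;              /* FNV-1a over the 256 bytes      */
    for (int i = 0; i < LINE; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

/* Return the 16-bit index of the line, inserting it if it is new.
 * Assumes fewer than 2^16 distinct lines, as stated in the question. */
static uint16_t intern(const unsigned char *line) {
    uint32_t s = hash_line(line) % SLOTS;
    while (slot[s] != -1) {
        if (memcmp(dict[slot[s]], line, LINE) == 0)
            return (uint16_t)slot[s];      /* seen before: reuse its index   */
        s = (s + 1) % SLOTS;               /* linear probing                 */
    }
    memcpy(dict[ndict], line, LINE);
    slot[s] = (int32_t)ndict;
    return (uint16_t)ndict++;
}

int main(void) {
    memset(slot, -1, sizeof slot);
    unsigned char line[LINE];
    /* Read fixed 256-byte records from stdin, emit one 16-bit index each. */
    while (fread(line, 1, LINE, stdin) == LINE) {
        uint16_t idx = intern(line);
        fwrite(&idx, sizeof idx, 1, stdout);
    }
    return 0;
}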
I'm doing calculations and the resulting text file right now has 288012413 lines, with 4 columns. A sample line:
288012413; 4855 18668 5.5677643628300215
The file is nearly 12 GB.
That's just unreasonable. It's plain text. Is there a more efficient way? I only need about 3 decimal places, but would limiting the precision save much room?
Go ahead and use a MySQL database.
MSSQL Express has a limit of 4 GB.
MS Access has a limit of 4 GB.
So those options are out. I think using a simple database like MySQL or SQLite without indexing will be your best bet. It will probably be faster to access the data through a database anyway, and on top of that the file size may be smaller.
Well,
The first column looks suspiciously like a line number - if this is the case then you can probably just get rid of it saving around 11 characters per line.
If you only need about 3 decimal places then you can round / truncate the last column, potentially saving another 12 characters per line.
I.e. you can get rid of 23 characters per line. That line is 40 characters long, so you can approximately halve your file size.
If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision depending on the type of calculation.
You might also want to look into compressing the file if it is just used for storing the results.
Reducing the 4th field to 3 decimal places should reduce the file to around 8GB.
If it's just array data, I would look into something like HDF5:
http://www.hdfgroup.org/HDF5/
The format is supported by most languages, has built-in compression and is well supported and widely used.
If you are going to use the result as a lookup table, why use ASCII for numeric data? Why not define a struct like so:
struct x {
    long   lineno;   /* first column: the line number */
    short  thing1;   /* second column                 */
    short  thing2;   /* third column                  */
    double value;    /* fourth column, full precision */
};
and write the struct to a binary file? Since all the records are of a known size, advancing through them later is easy.
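A sketch of what that could look like (the file name and the sample values are just placeholders taken from the question):

#include <stdio.h>

struct x {
    long   lineno;
    short  thing1;
    short  thing2;
    double value;
};

int main(void) {
    /* Write a sample record to a binary file. */
    struct x rec = { 288012413L, 4855, 18668, 5.5677643628300215 };
    FILE *f = fopen("results.bin", "wb");
    if (!f) { perror("fopen"); return 1; }
    fwrite(&rec, sizeof rec, 1, f);
    fclose(f);

    /* Because every record has the same size, record i starts at
     * offset i * sizeof(struct x) and can be read back directly. */
    f = fopen("results.bin", "rb");
    if (!f) { perror("fopen"); return 1; }
    struct x back;
    long i = 0;                                   /* record index to read back */
    fseek(f, i * (long)sizeof(struct x), SEEK_SET);
    fread(&back, sizeof back, 1, f);
    fclose(f);

    printf("%ld; %d %d %.16g\n", back.lineno, back.thing1, back.thing2, back.value);
    return 0;
}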
Well, if the files are that big, and you are doing calculations that require any sort of precision with the numbers, you are not going to want a limiter. That might possibly do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use some compression utility, such as GZIP, ZIP, BlakHole, 7ZIP or something like that to compress it.
Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. If you are using Unicode encodings, that will double to quadruple the size of the file vs. ASCII.
Like AShelly, but smaller.
Assuming line #'s are continuous...
struct x {
    short thing1;   /* second column                                       */
    short thing2;   /* third column                                        */
    short value;    /* you said only 3 dp, so store as fixed point n*1000; */
                    /* that leaves about 2 digits left of the point        */
};
save in binary file.
lseek() read() and write() are your friends.
The file will still be large(ish), at around 1.7 GB (288012413 records × 6 bytes).
The most obvious answer is just "split the data". Put the data into different files, e.g. 1 million lines per file. NTFS is quite good at handling hundreds of thousands of files per folder.
Then you've got a number of answers regarding reducing data size.
Next, why keep the data as text if you have a fixed-size structure? Store the numbers in binary - this will reduce the space even more (the text format is very redundant).
Finally, a DBMS can be your best friend. A NoSQL DBMS should work well, though I am not an expert in this area and I don't know which one will hold a trillion records.
If I were you, I would go with the fixed-size binary format, where each record occupies a fixed 16-20 (?) bytes of space. Then even if I keep the data in one file, I can easily determine the position at which I need to start reading. If you need to do lookups (say by column 1) and the data is not re-generated all the time, it may be possible to do a one-time sort by the lookup key after generation -- this would be slow, but as a one-time procedure it would be acceptable.
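A sketch of that fixed-size-record idea, including a lookup by the first column on an already key-sorted file (the field names, sizes and file name are assumptions, not a prescription):

#include <stdint.h>
#include <stdio.h>

/* A fixed 16-byte record (4 + 2 + 2 + 8 bytes on typical platforms).
 * Field names are placeholders for the four columns in the question. */
struct rec {
    int32_t key;     /* column 1, used as the lookup key */
    int16_t a;       /* column 2                         */
    int16_t b;       /* column 3                         */
    double  value;   /* column 4                         */
};

/* Binary search directly in an already key-sorted binary file: because every
 * record has the same size, record i lives at offset i * sizeof(struct rec),
 * so no separate index structure is needed. */
static int lookup(FILE *f, int32_t key, struct rec *out) {
    fseek(f, 0, SEEK_END);
    long n = ftell(f) / (long)sizeof(struct rec);
    long lo = 0, hi = n - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        fseek(f, mid * (long)sizeof(struct rec), SEEK_SET);
        if (fread(out, sizeof *out, 1, f) != 1) return 0;
        if (out->key == key) return 1;
        if (out->key < key) lo = mid + 1; else hi = mid - 1;
    }
    return 0;
}

int main(void) {
    FILE *f = fopen("records.bin", "rb");   /* hypothetical file name */
    if (!f) { perror("fopen"); return 1; }
    struct rec r;
    if (lookup(f, 288012413, &r))
        printf("%d %d %d %.16g\n", (int)r.key, r.a, r.b, r.value);
    fclose(f);
    return 0;
}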
Can a long ASCII text string be crushed and compressed into a short, hash-like ASCII "checksum" by some sophisticated mathematical formula/algorithm - just like air, which can be compressed?
The idea is to compress megabytes of ASCII text into 128 or so bytes by shuffling, then mixing in new "patterns" of single bytes, turn by turn, from the first to the last. When decompressing, the last character is extracted first, and then we keep decompressing using the formula and the sequential keys from the last to the first. The sequential keys, the last and first bytes, the fully updated final compiled string, and the total number of bytes that were compressed must all be known exactly.
This is the "terra compression" I was thinking about. Is it possible? Can you explain with examples? I am working on this theory and it is my own idea.
In general? Absolutely not.
For some specific cases? Yup. A megabyte of ASCII text consisting only of spaces is likely to compress extremely well. Real text will generally compress pretty well... but not in the order of several megabytes into 128 bytes.
Think about just how many strings - even just strings of valid English words - can fit into several megabytes. Far more than 256^128. They can't all compress down to 128 bytes, by the pigeon-hole principle...
If you have n possible input strings and m possible compressed strings, and m is less than n, then two inputs must map to the same compressed string. This is called the pigeonhole principle and is the fundamental reason why there is a limit on how much you can compress data.
What you are describing is more like a hash function. Many hash functions are designed so that given a hash of a string it is extremely unlikely that you can find another string that gives the same hash. But there is no way that given a hash you can discover the original string. Even if you are able to reverse the hashing operation to produce a valid input that gives that hash, there are infinitely many other inputs that would give the same hash. You wouldn't know which of them is the "correct" one.
Information theory is the scientific field which addresses questions of this kind. It also lets you calculate the minimum number of bits needed to store a compressed message (with lossless compression). This lower bound is known as the entropy of the message.
The entropy of a piece of text can be estimated using a Markov model. Such a model uses information about how likely a certain sequence of characters of the alphabet is.
The air analogy is very wrong.
When you compress air you make the molecules come closer to each other, each molecule is given less space.
When you compress data you cannot make the bits smaller (unless you put your hard drive in a hydraulic press). The closest you can get to actually making bits smaller is increasing the bandwidth of a network, but that is not compression.
Compression is about finding a reversible formula for calculating data. The "rules" of data compression go something like this:
The algorithm (including any standard starting dictionaries) is shared beforehand and not included in the compressed data.
All startup parameters must be included in the compressed data, including:
Choice of algorithmic variant
Choice of dictionaries
All compressed data
The algorithm must be able to compress/decompress all possible messages in your domain (like plain text, digits or binary data).
To get a feeling for how compression works you may study some examples, like run-length encoding and Lempel-Ziv-Welch.
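Run-length encoding is the easiest of those to see in full; a minimal sketch, using the example string from an earlier question:

#include <stdio.h>
#include <string.h>

/* Minimal run-length encoding: "AAAAAaaaQWERTY" -> "5A3a1Q1W1E1R1T1Y".
 * Every run becomes a count plus the repeated character, so the exact
 * input can always be rebuilt (the "reversible formula"). */
static void rle_encode(const char *in) {
    size_t n = strlen(in);
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i]) run++;
        printf("%zu%c", run, in[i]);
        i += run;
    }
    printf("\n");
}

int main(void) {
    rle_encode("AAAAAaaaQWERTY");
    return 0;
}

Note that a run of length 1 still costs two characters here, which is a small illustration of the rule above: some inputs necessarily come out larger than they went in.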
You may be thinking of fractal compression which effectively works by storing a formula and start values. The formula is iterated a certain number of times and the result is an approximation of the original input.
This allows for high compression but is lossy (output is close to input but not exactly the same) and compression can be very slow. Even so, ratios of 170:1 are about the highest achieved at the moment.
This is a bit off topic, but I'm reminded of the Broloid compression joke thread that appeared on USENET ... back in the days when USENET was still interesting.
Seriously, anyone who claims to have a magical compression algorithm that reduces any text megabyte file to a few hundred bytes is either:
a scammer or click-baiter,
someone who doesn't understand basic information theory, or
both.
You can compress text to a certain degree because it doesn't use all the available bit patterns (i.e. a-z and A-Z make up only 52 out of 256 byte values). Repeating patterns allow some intelligent storage (zip).
There is no way to store arbitrary large chunks of text in any fixed length number of bytes.
You can compress air, but you won't remove its molecules! Its mass stays the same.