Why gzip does not support splitting - hadoop

I got one paragraph from the book Hadoop: The Definitive Guide, it is as follows:
"DEFLATE
stores data as a series of compressed blocks. The problem is that the start of each block
is not distinguished in any way that would allow a reader positioned at an arbitrary
point in the stream to advance to the beginning of the next block, thereby synchronizing
itself with the stream. For this reason, gzip does not support splitting."
My question is I cannot understand the reason the author has explained about why gzip does not support splitting. Can someone give me a more detail explanation about this?
As my understanding, if the big file is split to 16 blocks. When one mapper begin to read one block, and in this point, 2 situations may happen:
The mapper cannot the block
or it can read it and then process it but does not know where to put the result to the whole stream
Does one of the above situations will happen or none will happen and there is other logic?

In order to split a file into pieces for processing, you need two things:
The pieces need to be able to be processed independently.
You need to be able to find where to split the pieces.
The deflate format in its normal usage supports neither. For 1: the deflate format is inherently serial, with every match referring to previously uncompressed data, itself potentially coming from a similar back reference, perhaps all the way to the beginning of the file.
The description you quote doesn't mention that important point.
Though it is a moot point since you don't have 1, for 2: deflate has no apparent markers in the stream to identify block boundaries. To find block boundaries, you would have to decode all of the bits up to the boundary, which would defeat the purpose of splitting the file for independent processing.
That is the point mentioned in your quoted description.
Though this is all true for a normal deflate stream, not prepared for splitting, you can if you like prepare such a deflate stream. The history can be erased at select breakpoints using Z_FULL_FLUSH, which allows independent decompression from that point. It also inserts a visible marker 00 00 ff ff. That's not a very long marker, and could appear by accident in the compressed data. It could be followed by a second flush to insert a second marker giving nine bytes: 00 00 ff ff 00 00 00 ff ff. That is something that Hadoop could use the split the deflate stream.

Related

How smaz compression library works?

I'm currently working for a short text compression project based on my language. But as a beginner, I also know some basic compression algorithm like LZW. But I still don't understand how smaz works. I have 2 questions:
How does smaz work?
How to build the codebook and reversed codebook?
Can any one explain it for me?
Thank you very much.
trying to answer your questions
How does smaz work?
according [1],
Smaz has a hard-wired constant built-in codebook of 254 common English
words, word fragments, bigrams, and the lowercase letters (except j,
k, q). The inner loop of the Smaz decoder is very simple:
Fetch the next byte X from the compressed file.
Is X == 254? Single byte literal: fetch the next byte L, and pass it straight through to the decoded text.
Is X == 255? Literal string: fetch the next byte L, then pass the following L+1 bytes straight through to the decoded text.
Any other value of X: lookup the X'th "word" in the codebook (that "word" can be from 1 to 5 letters), and copy that word to the decoded
text.
Repeat until there are no more compressed bytes left in the compressed file.
Because the codebook is constant, the Smaz decoder is unable to
"learn" new words and compress them, no matter how often they appear
in the original text.
This page could be helpful to understand the code.
How to build the codebook and reversed codebook?
TODO file in repository and author comments in redit poitns that the dictionary was generated by a unreleased ruby script. Also, the author explains:
btw what the Ruby program does is to consider all the possible substrings, and even all the possible separated words, and build a
table of frequencies, than adjust the weight based on the string
length, and finally hand tuning the table to compress specific things
very well. I added by hand the "http://" and ".com" token for example,
removing the final two entries.
An alternative to your project could be the shoco library which supports generation of a custom compression model based on your language.
The smaz sources is only 178 lines and just 99 lines without comments and codebook tables. You should look to see how it works.
Smaz is pretty simple compression by codebook (like LZW which you know). The library contains table with most popular terms in english (lines 5 - 51 for compression table and 56 -76 for decompression) and replace this terms with indexes in compressed string. And contrary to decompress.
For example, string the end would compressed by 58% becouse if terms the would be one byte index in compression table. So 7 bytes lenght string became 4 bytes length string.

How can I find the encryption/encoding method used on a file given its original version

I have two image files. One is a regular picture [here], while the other is it's foreignly modified counterpart retrieved from a remote server [here]. I have no idea what the server did to this image or the rest of them stored on the server, but they are obviously modified in some way because the second file cannot be read. They are both the exact same picture, even the same byte count, but I can't figure out how to reverse whatever was done. What should I be trying?
Note: The modified file was retrieved from a packet capture as an octet stream. I wrote the raw binary to a file and then base64 decoded it.
Unlike the original JPEG file, the encrypted data is very "random" — every byte value from 0 to 255 appears with almost exactly the same probability. This rules out the possibility of a transposition cipher.
Also, the files are exactly the same length (3,018,705 bytes), which makes it unlikely that a block cipher (like DES) was used.
So that makes a stream cipher (like RC4) the most likely candidate. If this is the case, you can obtain the keystream simply by XORing each byte of the two files together. However, you might find it difficult to figure out the cryptographic key from this data. Good luck with that :-)

DEFLATE Decoding

I am currently reading about the DEFLATE method for encoding/decoding data. I understand that the process is composed of two parts:
i. Replace duplicate information (within a specified window) with a reference back to the previous identical piece.
ii. Use Huffman coding to reduce the size of the most commonly occurring symbols.
I have a question with regards to (i). DEFLATE uses LZ77 which, based on a size window, searches through the information and, if it finds any duplicate information, replaces it with a "pointer". That makes perfect sense.
However, when decoding using LZ77 how does DEFLATE recognize a pointer? (Pointers are length-distance pairs; how can you discern if it's a pointer or just a number that was present in the initial data?)
Reference: http://en.wikipedia.org/wiki/DEFLATE#Duplicate_string_elimination
It's recommended to read the Deflate RFC 1951 specification, which is much more precise, and answer such questions.
What you'll see in => 3.2.5. Compressed blocks (length and distance codes)
"the literal and length alphabets are merged into a single alphabet"
which means that, by simply retrieving the next symbol, you immediately know if it is a literal (0..255), or a match length (257..285), or even an end of block (256). In case of a match length, a reference (offset) must be decoded too. Offset are encoded using a separate tree.

optimizing byte-pair encoding

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.
The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.
The runtime of my trivial implementation was less than stellar, however.
How can this be optimized? Is it possible to do it in a single pass?
This is a summary of my progress so far:
Googling found this little report that links to the original code and cites the source:
Philip Gage, titled 'A New Algorithm
for Data Compression', that appeared
in 'The C Users Journal' - February
1994 edition.
The links to the code on Dr Dobbs site are broken, but that webpage mirrors them.
That code uses a hash table to track the the used digraphs and their counts each pass over the buffer, so as to avoid recomputing fresh each pass.
My test data is enwik8 from the Hutter Prize.
|----------------|-----------------|
| Implementation | Time (min.secs) |
|----------------|-----------------|
| bpev2 | 1.24 | //The current version in the large text benchmark
| bpe_c | 1.07 | //The original version by Gage, using a hashtable
| bpev3 | 0.25 | //Uses a list, custom sort, less memcpy
|----------------|-----------------|
bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold (of 4, which is the smallest we can gain a byte by compressing); this list is sorted and the first subsitution is made.
As the substitutions are made, the statistics are updated; typically each pass there is only around 10 or 20 digraphs changed; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than just always sorting the whole digraph list each pass, since the list is nearly sorted.
The original code moved between a 'tmp' and 'buf' byte buffers; bpev3 just swaps buffer pointers, which is worth about 10 seconds runtime alone.
Given the buffer swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version; I think the hashtable is arguable value, and that a list is a better structure for this problem.
Its sill multi-pass though. And so its not a generally competitive algorithm.
If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of it's larger blocksizes, it performs better than my bpe on on enwik9. Also, the performance gap between the hash-tables and my lists is much closer - I put that down to the march=PentiumPro that the LTCB uses.
There are of course occasions where it is suitable and used; Symbian use it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.
I've done work with optimizing a LZF compression implementation, and some of the same principles I used to improve performance are usable here.
To speed up performance on byte-pair encoding:
Limit the block size to less than 65kB (probably 8-16 kB will be optimal). This guarantees not all bytes will be used, and allows you to hold intermediate processing info in RAM.
Use a hashtable or simple lookup table by short integer (more RAM, but faster) to hold counts for a byte pairs. There are 65,656 2-byte pairs, and BlockSize instances possible (max blocksize 64k). This gives you a table of 128k possible outputs.
Allocate and reuse data structures capable of holding a full compression block, replacement table, byte-pair counts, and output bytes in memory. This sounds wasteful of RAM, but when you consider that your block size is small, it's worth it. Your data should be able to sit entirely in CPU L2 or (worst case) L3 cache. This gives a BIG speed boost.
Do one fast pass over the data to collect counts, THEN worry about creating your replacement table.
Pack bytes into integers or short ints whenever possible (applicable mostly to C/C++). A single entry in the counting table can be represented by an integer (16-bit count, plus byte pair).
Code in JustBasic can be found here complete with input text file.
Just BASIC Files Archive – forum post
EBPE by TomC 02/2014 – Ehanced Byte Pair Encoding
EBPE features two post processes to Byte Pair Encoding
1. Is compressing the dictionary (believed to be a novelty)
A dictionary entry is composed of 3 bytes:
AA – the two char to be replaced by (byte pair)
1 – this single token (tokens are unused symbols)
So "AA1" tells us when decoding that every time we see a "1" in the
data file, replace it with "AA".
While long runs of sequential tokens are possible, let’s look at this
8 token example:
AA1BB3CC4DD5EE6FF7GG8HH9
It is 24 bytes long (8 * 3)
The token 2 is not in the file indicating that it was not an open token to
use, or another way to say it: the 2 was in the original data.
We can see the last 7 tokens 3,4,5,6,7,8,9 are sequential so any time we
see a sequential run of 4 tokens or more, let’s modify our dictionary to be:
AA1BB3<255>CCDDEEFFGGHH<255>
Where the <255> tells us that the tokens for the byte pairs are implied and
are incremented by 1 more than the last token we saw (3). We increment
by one until we see the next <255> indicating an end of run.
The original dictionary was 24 bytes,
The enhanced dictionary is 20 bytes.
I saved 175 bytes using this enhancement on a text file where tokens
128 to 254 would be in sequence as well as others in general, to include
the run created by lowercase pre-processing.
2. Is compressing the data file
Re-using rarely used characters as tokens is nothing new.
After using all of the symbols for compression (except for <255>),
we scan the file and find a single "j" in the file. Let this char do double
duty by:
"<255>j" means this is a literal "j"
"j" is now used as a token for re-compression,
If the j occurred 1 time in the data file, we would need to add 1 <255>
and a 3 byte dictionary entry, so we need to save more than 4 bytes in BPE
for this to be worth it.
If the j occurred 6 times we would need 6 <255> and a 3 byte dictionary
entry so we need to save more than 9 bytes in BPE for this to be worth it.
Depending on if further compression is possible and how many byte pairs remain
in the file, this post process has saved in excess of 100 bytes on test runs.
Note: When decompressing make sure not to decompress every "j".
One needs to look at the prior character to make sure it is not a <255> in order
to decompress. Finally, after all decompression, go ahead and remove the <255>'s
to recreate your original file.
3. What’s next in EBPE?
Unknown at this time
I don't believe this can be done in a single pass unless you find a way to predict, given a byte-pair replacement, if the new byte-pair (after-replacement) will be good for replacement too or not.
Here are my thoughts at first sight. Maybe you already do or have already thought all this.
I would try the following.
Two adjustable parameters:
Number of byte-pair occurrences in chunk of data before to consider replacing it. (So that the dictionary doesn't grow faster than the chunk shrinks.)
Number of replacements by pass before it's probably not worth replacing anymore. (So that the algorithm stops wasting time when there's maybe only 1 or 2 % left to gain.)
I would do passes, as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a count of byte-pairs as I go.
I would play with the two parameters a little and see how it influences compression ratio and speed. Probably that they should change dynamically, according to the length of the chunk to compress (and maybe one or two other things).
Another thing to consider is the data structure used to store the count of each byte-pair during the pass. There very likely is a way to write a custom one which would be faster than generic data structures.
Keep us posted if you try something and get interesting results!
Yes, keep us posted.
guarantee?
BobMcGee gives good advice.
However, I suspect that "Limit the block size to less than 65kB ... . This guarantees not all bytes will be used" is not always true.
I can generate a (highly artificial) binary file less than 1kB long that has a byte pair that repeats 10 times, but cannot be compressed at all with BPE because it uses all 256 bytes -- there are no free bytes that BPE can use to represent the frequent byte pair.
If we limit ourselves to 7 bit ASCII text, we have over 127 free bytes available, so all files that repeat a byte pair enough times can be compressed at least a little by BPE.
However, even then I can (artificially) generate a file that uses only the isgraph() ASCII characters and is less than 30kB long that eventually hits the "no free bytes" limit of BPE, even though there is still a byte pair remaining with over 4 repeats.
single pass
It seems like this algorithm can be slightly tweaked in order to do it in one pass.
Assuming 7 bit ASCII plaintext:
Scan over input text, remembering all pairs of bytes that we have seen in some sort of internal data structure, somehow counting the number of unique byte pairs we have seen so far, and copying each byte to the output (with high bit zero).
Whenever we encounter a repeat, emit a special byte that represents a byte pair (with high bit 1, so we don't confuse literal bytes with byte pairs).
Include in the internal list of byte "pairs" that special byte, so that the compressor can later emit some other special byte that represents this special byte plus a literal byte -- so the net effect of that other special byte is to represent a triplet.
As phkahler pointed out, that sounds practically the same as LZW.
EDIT:
Apparently the "no free bytes" limitation I mentioned above is not, after all, an inherent limitation of all byte pair compressors, since there exists at least one byte pair compressor without that limitation.
Have you seen
"SCZ - Simple Compression Utilities and Library"?
SCZ appears to be a kind of byte pair encoder.
SCZ apparently gives better compression than other byte pair compressors I've seen, because
SCZ doesn't have the "no free bytes" limitation I mentioned above.
If any byte pair BP repeats enough times in the plaintext (or, after a few rounds of iteration, the partially-compressed text),
SCZ can do byte-pair compression, even when the text already includes all 256 bytes.
(SCZ uses a special escape byte E in the compressed text, which indicates that the following byte is intended to represent itself literally, rather than expanded as a byte pair.
This allows some byte M in the compressed text to do double-duty:
The two bytes EM in the compressed text represent M in the plain text.
The byte M (without a preceeding escape byte) in the compressed text represents some byte pair BP in the plain text.
If some byte pair BP occurs many more times than M in the plaintext, then the space saved by representing each BP byte pair as the single byte M in the compressed data is more than the space "lost" by representing each M as the two bytes EM.)
You can also optimize the dictionary so that:
AA1BB2CC3DD4EE5FF6GG7HH8 is a sequential run of 8 token.
Rewrite that as:
AA1<255>BBCCDDEEFFGGHH<255> where the <255> tells the program that each of the following byte pairs (up to the next <255>) are sequential and incremented by one. Works great for text
files and any where there are at least 4 sequential tokens.
save 175 bytes on recent test.
Here is a new BPE(http://encode.ru/threads/1874-Alba).
Example for compile,
gcc -O1 alba.c -o alba.exe
It's faster than default.
There is an O(n) version of byte-pair encoding which I describe here. I am getting a compression speed of ~200kB/second in Java.
the easiest efficient structure is a 2 dimensional array like byte_pair(255,255). Drop the counts in there and modify as the file compresses.

Algorithm to find string matches in a sliding window

One of the core steps in file compression like ZIP is to use the previous decoded text as a reference source. For example, the encoded stream might say "the next 219 output characters are the same as the characters from the decoded stream 5161 bytes ago." This lets you represent 219 characters with just 3 bytes or so. (There's more to ZIP than that, like Huffman compression, but I'm just talking about the reference matching.)
My question is what the strategy(ies) for the string matching algorithm is. Even looking at source code from zlib and such don't seem to give a good description of the compression matching algorithm.
The problem might be stated as: Given a block of text, say 30K of it, and an input string, find the longest reference in the 30K of text which exactly matches the front of the input string." The algorithm must be efficient when iterated, ie, the 30K block of text will be updated by deleting some bytes from the front and adding new ones to the rear and a new match performed.
I'm a lot more interested in discussions of the algorithm(s) to do this, not source code or libraries. (zlib has very good source!) I suspect there may be several approaches with different tradeoffs.
Well, I notice that you go into some detail about the problem but don't mention the information provided in section 4 of RFC 1951 (the specification for the DEFLATE Compressed Data Format, i.e. the format used in ZIP) which leads me to believe you might have missed this resource.
Their basic approach is a chained hash table using three-byte sequences as keys. As long as the chain is not empty, all the entries along it are scanned to a) eliminate false collisions, b) eliminate matches that are too old, and c) pick the longest match out of those remaining.
(Note that their recommendation is shaped by the factor of patents; it may be that they knew of a more effective technique but could not be sure that it was not covered by someone's patent. Personally, I've always wondered why one couldn't find the longest matches by examining the matches for the three-byte sequences that start at the second byte of the incoming data, the third byte, etc. and weeding out matches that don't match up. i.e., if your incoming data is "ABCDEFG..." and you've got hash matches for "ABC" at offsets 100, 302 and 416 but your only hash match for "BCD" is at offset 301, you know that unless you have two entirely coincidental overlapping hash matches -- unlikely -- then 302 is your longest match.)
Also note their recommendation of optional "lazy matching" (which ironically does more work): instead of automatically taking the longest match that starts at the first byte of the incoming data, the compressor checks for an even longer match starting at the next byte. If your incoming data is "ABCDE..." and your only matches in the sliding window are for "ABC" and for "BCDE", you're better off encoding the "A" as a literal byte and the "BCDE" as a match.
You could look at the details of the LZMA Algorithm used by 7-zip. The 7-zip author claims to have improved on the algorithm used by zlib et al.
I think you're describing a modified version of the Longest Common Substring Problem.

Resources