Can adding data to a file compress it by more? - algorithm

Let's say I have a 10MB file that I can compress to 5MB. Are there situations where you could add data to the file and cause it to compress to smaller than 5MB?
Edit: And just to be clear by adding data to the file I mean appending data, not adding to the middle of the file.

To your original question:
Yes.
Notice that every 5MB file appears as a noncontiguous bitwise subsequence of the file (01)^(41943040), which is twice the size and compresses very well. Indeed, you can pick exactly one bit from each of the 41943040 01's.
The trick here is that, from the "padded" file, you can't recover the original. All 5MB files "pad" to the same string of bits.
To your revised question:
Still "yes," but it's close enough to "no" for all practical purposes. I don't think this can happen with gzip. However, if you compress using a Burrows-Wheeler transform followed by run-length encoding, appending to the string can alter the order of the string's cyclic shifts and thereby, by pure coincidence, result in a string that compresses better.

With dictionary type compression, the are models that have an entry for every prefix of any given entry (LZW, for one), and others that do not (e.g., LZMW).
If for the latter an extension of, say, a file of 10^7 bytes can use an entry with a cheaper encoding than any encoding of the prefix to the end of the original file, the compressed file will be shorter.
(With LZMW and 10^7 identical bytes, I expect an entry for two bytes after encoding the first two, another for four, eight, … 2^22. Appending (3*2^22 - 10^7) allows finishing with just one more "2^22 code" - codes for 10^7 bytes left as an …)

Related

How to build unique file names based on its content?

I plan to build a unique file name based on its content. For example, by its SHA256 hash. Files with the same content must have the same name.
The easiest way is to convert hash to a hex string. A file name will be 32 bytes length * 2 = 64 characters. This is pretty long name to operate with. How to make it shorter?
I implemented a sort of "Base32" coding - a vocabulary string that includes digits and 22 letters. I use only five bits of every byte to build file name with 32 characters. Much better.
I am looking for a balance between file name length and low collision probability. If the number of files is expected to be less than 500K, how long should the filename be? 8? 16? 24? 32?
Is there any recommended method to build short unique filenames at all?
If you use an N-bit cryptographic hash on M files, you can estimate the probability of at least one collision to be M2/2N+1
For 500K files, that's about 1/2N-37
Using base32, 16 chars gives probability of collision 1/243 -- a few trillion to 1 odds.
If that won't do, then 24 chars gives 1/283.
If you're willing to check them all and re-generate on collision, then 8 chars is fine.
Number of collisions depend on the content of the files, the hash-algorithm and the length of the hash.
In general: The longer the hash-value is the less likely are collisions (if your content does not especially provoke collisions).
You cannot avoid the possibility of collisions unless you use the content as file-name (or a lossless compression of it).
To shorten the filenames you could allow more different characters for the file-name. (But we aware what characters your OS allows and which you are willing to use).
I would go for a kind of base32 encoding to avoid problems with filesystems that do not distinguish between upper and lower case character.

Hex - Search by bytes to get Offset & Search by Offset to get Bytes

just currently prototyping a little software and currently stuck. I'm trying to create a little program that'll edit a .bin file, and for this I will need to do the following:
Get Bytes by Searching for Offset
Get Offset by searching for Bytes
Write/Update .bin file
I usually use the program HxD to do this manually, but want to get a small automated process in place.
Using hex.EncodeToString returns what I want as the output (Like HxD) however I can't find a way to search for the values by bytes and offests
Could anyone help or have suggestions?
OK, "searching of an offset" is a misnomer because if you have an offset and a medium which supports random access, you just "seek" the known offset there; for files, see os.File.Seek.
Searching is more complex: it consists of converting the user input into something searchable and, well, the searching itself.
Conversion is the process of translation of the human operator's imput to a slice of bytes — for instance, you'd need to convert a string "00 87" to a slice of bytes, []byte{00, 87}.
Such conversion can be done using, say, encoding/hex.Decode after removing any whitespace, which can be done using a multitude of ways.
Searching the file given a slice of bytes can be either simple of complex.
If a file is small (a couple megabytes, on today's hardware), you can just slurp it into memory (for instance, using io.ReadAll) and do a simple search using bytes.Index.
If a file is big, the complexity of the task quickly escalates.
For instance, you could read the file from its beginning to its end using chunks of some sensible size and search for your byte slice in each of them.
But you'd need to watch out for two issues: the slice to search should be smaller than each of such chunks, and two adjacent chunks might contain the sequence to be found positioned right across their "sides" — so that the Nth chunk contains the first part of the pattern at its end and the N+1th chunk contains the rest of it at its beginning.
There exist more advanced approaches to such searching — for instance, using so-called "memory-mapped files" but I'd speculate it's a bit too early to tread these lands, given your question.

How can I find the encryption/encoding method used on a file given its original version

I have two image files. One is a regular picture [here], while the other is it's foreignly modified counterpart retrieved from a remote server [here]. I have no idea what the server did to this image or the rest of them stored on the server, but they are obviously modified in some way because the second file cannot be read. They are both the exact same picture, even the same byte count, but I can't figure out how to reverse whatever was done. What should I be trying?
Note: The modified file was retrieved from a packet capture as an octet stream. I wrote the raw binary to a file and then base64 decoded it.
Unlike the original JPEG file, the encrypted data is very "random" — every byte value from 0 to 255 appears with almost exactly the same probability. This rules out the possibility of a transposition cipher.
Also, the files are exactly the same length (3,018,705 bytes), which makes it unlikely that a block cipher (like DES) was used.
So that makes a stream cipher (like RC4) the most likely candidate. If this is the case, you can obtain the keystream simply by XORing each byte of the two files together. However, you might find it difficult to figure out the cryptographic key from this data. Good luck with that :-)

optimizing byte-pair encoding

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.
The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.
The runtime of my trivial implementation was less than stellar, however.
How can this be optimized? Is it possible to do it in a single pass?
This is a summary of my progress so far:
Googling found this little report that links to the original code and cites the source:
Philip Gage, titled 'A New Algorithm
for Data Compression', that appeared
in 'The C Users Journal' - February
1994 edition.
The links to the code on Dr Dobbs site are broken, but that webpage mirrors them.
That code uses a hash table to track the the used digraphs and their counts each pass over the buffer, so as to avoid recomputing fresh each pass.
My test data is enwik8 from the Hutter Prize.
|----------------|-----------------|
| Implementation | Time (min.secs) |
|----------------|-----------------|
| bpev2 | 1.24 | //The current version in the large text benchmark
| bpe_c | 1.07 | //The original version by Gage, using a hashtable
| bpev3 | 0.25 | //Uses a list, custom sort, less memcpy
|----------------|-----------------|
bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold (of 4, which is the smallest we can gain a byte by compressing); this list is sorted and the first subsitution is made.
As the substitutions are made, the statistics are updated; typically each pass there is only around 10 or 20 digraphs changed; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than just always sorting the whole digraph list each pass, since the list is nearly sorted.
The original code moved between a 'tmp' and 'buf' byte buffers; bpev3 just swaps buffer pointers, which is worth about 10 seconds runtime alone.
Given the buffer swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version; I think the hashtable is arguable value, and that a list is a better structure for this problem.
Its sill multi-pass though. And so its not a generally competitive algorithm.
If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of it's larger blocksizes, it performs better than my bpe on on enwik9. Also, the performance gap between the hash-tables and my lists is much closer - I put that down to the march=PentiumPro that the LTCB uses.
There are of course occasions where it is suitable and used; Symbian use it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.
I've done work with optimizing a LZF compression implementation, and some of the same principles I used to improve performance are usable here.
To speed up performance on byte-pair encoding:
Limit the block size to less than 65kB (probably 8-16 kB will be optimal). This guarantees not all bytes will be used, and allows you to hold intermediate processing info in RAM.
Use a hashtable or simple lookup table by short integer (more RAM, but faster) to hold counts for a byte pairs. There are 65,656 2-byte pairs, and BlockSize instances possible (max blocksize 64k). This gives you a table of 128k possible outputs.
Allocate and reuse data structures capable of holding a full compression block, replacement table, byte-pair counts, and output bytes in memory. This sounds wasteful of RAM, but when you consider that your block size is small, it's worth it. Your data should be able to sit entirely in CPU L2 or (worst case) L3 cache. This gives a BIG speed boost.
Do one fast pass over the data to collect counts, THEN worry about creating your replacement table.
Pack bytes into integers or short ints whenever possible (applicable mostly to C/C++). A single entry in the counting table can be represented by an integer (16-bit count, plus byte pair).
Code in JustBasic can be found here complete with input text file.
Just BASIC Files Archive – forum post
EBPE by TomC 02/2014 – Ehanced Byte Pair Encoding
EBPE features two post processes to Byte Pair Encoding
1. Is compressing the dictionary (believed to be a novelty)
A dictionary entry is composed of 3 bytes:
AA – the two char to be replaced by (byte pair)
1 – this single token (tokens are unused symbols)
So "AA1" tells us when decoding that every time we see a "1" in the
data file, replace it with "AA".
While long runs of sequential tokens are possible, let’s look at this
8 token example:
AA1BB3CC4DD5EE6FF7GG8HH9
It is 24 bytes long (8 * 3)
The token 2 is not in the file indicating that it was not an open token to
use, or another way to say it: the 2 was in the original data.
We can see the last 7 tokens 3,4,5,6,7,8,9 are sequential so any time we
see a sequential run of 4 tokens or more, let’s modify our dictionary to be:
AA1BB3<255>CCDDEEFFGGHH<255>
Where the <255> tells us that the tokens for the byte pairs are implied and
are incremented by 1 more than the last token we saw (3). We increment
by one until we see the next <255> indicating an end of run.
The original dictionary was 24 bytes,
The enhanced dictionary is 20 bytes.
I saved 175 bytes using this enhancement on a text file where tokens
128 to 254 would be in sequence as well as others in general, to include
the run created by lowercase pre-processing.
2. Is compressing the data file
Re-using rarely used characters as tokens is nothing new.
After using all of the symbols for compression (except for <255>),
we scan the file and find a single "j" in the file. Let this char do double
duty by:
"<255>j" means this is a literal "j"
"j" is now used as a token for re-compression,
If the j occurred 1 time in the data file, we would need to add 1 <255>
and a 3 byte dictionary entry, so we need to save more than 4 bytes in BPE
for this to be worth it.
If the j occurred 6 times we would need 6 <255> and a 3 byte dictionary
entry so we need to save more than 9 bytes in BPE for this to be worth it.
Depending on if further compression is possible and how many byte pairs remain
in the file, this post process has saved in excess of 100 bytes on test runs.
Note: When decompressing make sure not to decompress every "j".
One needs to look at the prior character to make sure it is not a <255> in order
to decompress. Finally, after all decompression, go ahead and remove the <255>'s
to recreate your original file.
3. What’s next in EBPE?
Unknown at this time
I don't believe this can be done in a single pass unless you find a way to predict, given a byte-pair replacement, if the new byte-pair (after-replacement) will be good for replacement too or not.
Here are my thoughts at first sight. Maybe you already do or have already thought all this.
I would try the following.
Two adjustable parameters:
Number of byte-pair occurrences in chunk of data before to consider replacing it. (So that the dictionary doesn't grow faster than the chunk shrinks.)
Number of replacements by pass before it's probably not worth replacing anymore. (So that the algorithm stops wasting time when there's maybe only 1 or 2 % left to gain.)
I would do passes, as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a count of byte-pairs as I go.
I would play with the two parameters a little and see how it influences compression ratio and speed. Probably that they should change dynamically, according to the length of the chunk to compress (and maybe one or two other things).
Another thing to consider is the data structure used to store the count of each byte-pair during the pass. There very likely is a way to write a custom one which would be faster than generic data structures.
Keep us posted if you try something and get interesting results!
Yes, keep us posted.
guarantee?
BobMcGee gives good advice.
However, I suspect that "Limit the block size to less than 65kB ... . This guarantees not all bytes will be used" is not always true.
I can generate a (highly artificial) binary file less than 1kB long that has a byte pair that repeats 10 times, but cannot be compressed at all with BPE because it uses all 256 bytes -- there are no free bytes that BPE can use to represent the frequent byte pair.
If we limit ourselves to 7 bit ASCII text, we have over 127 free bytes available, so all files that repeat a byte pair enough times can be compressed at least a little by BPE.
However, even then I can (artificially) generate a file that uses only the isgraph() ASCII characters and is less than 30kB long that eventually hits the "no free bytes" limit of BPE, even though there is still a byte pair remaining with over 4 repeats.
single pass
It seems like this algorithm can be slightly tweaked in order to do it in one pass.
Assuming 7 bit ASCII plaintext:
Scan over input text, remembering all pairs of bytes that we have seen in some sort of internal data structure, somehow counting the number of unique byte pairs we have seen so far, and copying each byte to the output (with high bit zero).
Whenever we encounter a repeat, emit a special byte that represents a byte pair (with high bit 1, so we don't confuse literal bytes with byte pairs).
Include in the internal list of byte "pairs" that special byte, so that the compressor can later emit some other special byte that represents this special byte plus a literal byte -- so the net effect of that other special byte is to represent a triplet.
As phkahler pointed out, that sounds practically the same as LZW.
EDIT:
Apparently the "no free bytes" limitation I mentioned above is not, after all, an inherent limitation of all byte pair compressors, since there exists at least one byte pair compressor without that limitation.
Have you seen
"SCZ - Simple Compression Utilities and Library"?
SCZ appears to be a kind of byte pair encoder.
SCZ apparently gives better compression than other byte pair compressors I've seen, because
SCZ doesn't have the "no free bytes" limitation I mentioned above.
If any byte pair BP repeats enough times in the plaintext (or, after a few rounds of iteration, the partially-compressed text),
SCZ can do byte-pair compression, even when the text already includes all 256 bytes.
(SCZ uses a special escape byte E in the compressed text, which indicates that the following byte is intended to represent itself literally, rather than expanded as a byte pair.
This allows some byte M in the compressed text to do double-duty:
The two bytes EM in the compressed text represent M in the plain text.
The byte M (without a preceeding escape byte) in the compressed text represents some byte pair BP in the plain text.
If some byte pair BP occurs many more times than M in the plaintext, then the space saved by representing each BP byte pair as the single byte M in the compressed data is more than the space "lost" by representing each M as the two bytes EM.)
You can also optimize the dictionary so that:
AA1BB2CC3DD4EE5FF6GG7HH8 is a sequential run of 8 token.
Rewrite that as:
AA1<255>BBCCDDEEFFGGHH<255> where the <255> tells the program that each of the following byte pairs (up to the next <255>) are sequential and incremented by one. Works great for text
files and any where there are at least 4 sequential tokens.
save 175 bytes on recent test.
Here is a new BPE(http://encode.ru/threads/1874-Alba).
Example for compile,
gcc -O1 alba.c -o alba.exe
It's faster than default.
There is an O(n) version of byte-pair encoding which I describe here. I am getting a compression speed of ~200kB/second in Java.
the easiest efficient structure is a 2 dimensional array like byte_pair(255,255). Drop the counts in there and modify as the file compresses.

Algorithm to find string matches in a sliding window

One of the core steps in file compression like ZIP is to use the previous decoded text as a reference source. For example, the encoded stream might say "the next 219 output characters are the same as the characters from the decoded stream 5161 bytes ago." This lets you represent 219 characters with just 3 bytes or so. (There's more to ZIP than that, like Huffman compression, but I'm just talking about the reference matching.)
My question is what the strategy(ies) for the string matching algorithm is. Even looking at source code from zlib and such don't seem to give a good description of the compression matching algorithm.
The problem might be stated as: Given a block of text, say 30K of it, and an input string, find the longest reference in the 30K of text which exactly matches the front of the input string." The algorithm must be efficient when iterated, ie, the 30K block of text will be updated by deleting some bytes from the front and adding new ones to the rear and a new match performed.
I'm a lot more interested in discussions of the algorithm(s) to do this, not source code or libraries. (zlib has very good source!) I suspect there may be several approaches with different tradeoffs.
Well, I notice that you go into some detail about the problem but don't mention the information provided in section 4 of RFC 1951 (the specification for the DEFLATE Compressed Data Format, i.e. the format used in ZIP) which leads me to believe you might have missed this resource.
Their basic approach is a chained hash table using three-byte sequences as keys. As long as the chain is not empty, all the entries along it are scanned to a) eliminate false collisions, b) eliminate matches that are too old, and c) pick the longest match out of those remaining.
(Note that their recommendation is shaped by the factor of patents; it may be that they knew of a more effective technique but could not be sure that it was not covered by someone's patent. Personally, I've always wondered why one couldn't find the longest matches by examining the matches for the three-byte sequences that start at the second byte of the incoming data, the third byte, etc. and weeding out matches that don't match up. i.e., if your incoming data is "ABCDEFG..." and you've got hash matches for "ABC" at offsets 100, 302 and 416 but your only hash match for "BCD" is at offset 301, you know that unless you have two entirely coincidental overlapping hash matches -- unlikely -- then 302 is your longest match.)
Also note their recommendation of optional "lazy matching" (which ironically does more work): instead of automatically taking the longest match that starts at the first byte of the incoming data, the compressor checks for an even longer match starting at the next byte. If your incoming data is "ABCDE..." and your only matches in the sliding window are for "ABC" and for "BCDE", you're better off encoding the "A" as a literal byte and the "BCDE" as a match.
You could look at the details of the LZMA Algorithm used by 7-zip. The 7-zip author claims to have improved on the algorithm used by zlib et al.
I think you're describing a modified version of the Longest Common Substring Problem.

Resources