Compression/encryption algorithms: output guarantees

My question here is about compression/encryption algorithms in general, and to me it sounds like a complete newbie one. Now, I understand that "in general" "it all depends", but suppose we're talking about algorithms that all have reference implementations/published specs and are about as standard as they come. To be more specific, I'm using the .NET implementations of AES-256 and GZip/Deflate.
So here goes: can it be assumed that, given exactly the same input, both types of algorithms will produce exactly the same output?
For example, will the output of aes(gzip("hello"), key, initVector) on .NET be identical to the output of the same call on a Mac or Linux?

AES is rigorously defined, so given the same input, the same algorithm, and the same key, you will get the same output.
The same cannot be said for zip.
The problem is not the standard. There IS a defined standard: the DEFLATE format is IETF RFC 1951, the zlib stream is RFC 1950, and the gzip stream is RFC 1952, so anyone can produce a compatible compressor/decoder starting from these definitions.
But zip belongs to the large family of LZ compressors, for which, by construction, the mapping from input to compressed representation is not one-to-one. That means that, for a single source, there are many, many ways to describe the same input, all of which are valid although different.
An example.
Let's say my input is: ABCABCABC
Valid outputs can be:
9 literals
3 literals followed by one copy, 6 bytes long, starting at offset -3
3 literals followed by two copies, each 3 bytes long, starting at offset -3
6 literals followed by one copy, 3 bytes long, starting at offset -6
etc.
All these outputs are valid and describe (regenerate) the same input. Obviously, one of them is more efficient (compresses more) than the others, but that's where implementations may differ: some are more powerful than others. For example, it is known that kzip and 7zip generate better (more compressed) zip files than gzip. Even gzip has a lot of compression options that generate different compressed streams from the same input.
Now, if you want to always get exactly the same binary output, you need more than "zip": you need to pin down a precise zip implementation and a precise set of compression parameters. Then you can be sure that you always generate the same binary.
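For illustration, here is a minimal Python sketch using zlib (an illustration only, not the .NET GZipStream/DeflateStream classes from the question): different compression levels may produce different bytes, yet every one of them decompresses back to the same input.

import zlib

data = b"to be or not to be, that is the question. " * 200

# Different levels are all valid DEFLATE streams for the same input,
# but the compressed bytes themselves are not guaranteed to match.
fast = zlib.compress(data, 1)
best = zlib.compress(data, 9)
print(fast == best)                   # may well be False: two different encodings
print(zlib.decompress(fast) == data)  # True: both round-trip to the original input
print(zlib.decompress(best) == data)  # True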

AES is defined by a standard, so any conforming implementation will indeed produce the same output. GZip is a program, so it is possible that different versions of the program will produce different outputs. I would expect a later version to be able to re-inflate the output from an earlier version, but the reverse may not be possible.
As others have said, if you are going to compress, then compress the plaintext, not the ciphertext from AES. Ciphertext won't compress well, as it is designed to appear random.
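You can see that effect with a quick Python sketch (random bytes stand in for ciphertext here; output from a good cipher behaves the same way):

import os
import zlib

text = b"to be or not to be, that is the question. " * 100
noise = os.urandom(len(text))      # stands in for ciphertext, which should look random

print(len(text), len(zlib.compress(text)))    # the plaintext shrinks a lot
print(len(noise), len(zlib.compress(noise)))  # the "ciphertext" doesn't: roughly its own size, or slightly larger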

Related

Why do people sometimes convert numbers or strings to bytes?

Sometimes I encounter questions about converting something to bytes. Are there situations where it is vitally important to convert to bytes, and what would I convert something to bytes for?
In most languages the most common string functions come as part of the language, or in a library/include/import that comes pre-made, often using native code to take advantage of processor-level string operations. However, sometimes you need to do something with a string that isn't natively supported by the language, so ever since the 8-bit days people have viewed strings as arrays of 7- or 8-bit characters, each of which fits in a byte, and used conventions like ASCII to decide which byte value represents which character.
While standard languages often have functions like string.replaceChar(OFFSET, 'a'), this methodology can be painstakingly slow, because each call to the replaceChar method incurs processing overhead that may be greater than the processing that actually needs to be done.
There is also the simplicity factor when designing your own string algorithms, but like I said, most of the common algorithms come prebuilt in modern languages (stringCompare, trimString, reverseString, etc.).
Suppose you want to perform an operation on a string which doesn't come as standard.
Or suppose you want to add two numbers which are represented as decimal digits in strings, and these numbers are bigger than the 64-bit word size of the processor. The RSA encryption/decryption behind the SSL browser padlock uses numbers which don't fit into the word size of a desktop computer, but nonetheless the programs on a desktop which deal with RSA certificates and keys must be able to process that data, which is really just strings.
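As a small illustration of that last point, here is a Python sketch that adds two decimal numbers held as strings, digit by digit, without ever building a machine-sized integer (the function name is just for this example):

def add_decimal_strings(a: str, b: str) -> str:
    # Add two non-negative integers given as decimal digit strings,
    # without ever converting the whole number to a machine integer.
    a, b = a.rjust(len(b), "0"), b.rjust(len(a), "0")   # pad to equal length
    result, carry = [], 0
    for da, db in zip(reversed(a), reversed(b)):        # least significant digit first
        s = int(da) + int(db) + carry
        result.append(str(s % 10))
        carry = s // 10
    if carry:
        result.append(str(carry))
    return "".join(reversed(result))

print(add_decimal_strings("18446744073709551616", "1"))  # 2**64 + 1: wider than a 64-bit word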
There are many and varied reasons you might want to deal with a string as an array of bytes, but each of these reasons is fairly specialised.

Repetition-based, pattern-based data compression algorithm

Suppose I have the following string:
ABCADCADCADABC
I want to compress it by finding repeating substrings.
What's an algorithm that gives the optimal compression?
In the above example it should return
AB*1 CAD*3 ABC*1
For comparison, a greedy algorithm might return
ABC*1 ADC*2 AD*1 ABC*1
Depending on whether you prefer fast and simple or high compression ratio you could take a look into the Lempel-Ziv-Welch (LZW) or Lempel-Ziv-Markov chain (LZMA) algorithms. They both keep dictionaries of recurring strings.
This sounds like a job for suffix arrays/trees!
http://en.wikipedia.org/wiki/Suffix_array
You can use a suffix array built over your string to figure out which patterns repeat. For instance, we can build a suffix array over your example as follows (I'm treating $ as coming after every letter; you can instead sort so that $ comes before every letter ... either way will work):
ABCADCADCADABC$
ABC$
ADABC$
ADCADABC$
ADCADCADABC$
BCADCADCADABC$
BC$
CADABC$
CADCADABC$
CADCADCADABC$
C$
DABC$
DCADABC$
DCADCADABC$
$
From this, we can more easily see the common patterns in the string. Using the information in this suffix array representation, we can see that CAD is repeated 3x in a local area, and we'd likely use this as our choice for compression. ADC and DCA and so on are not as attractive because they compress less of the string.
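To make that concrete, here is a naive Python sketch: a toy O(n^2 log n) suffix-array construction plus a scan of adjacent suffixes for long common prefixes (real implementations use much faster constructions):

def suffix_array(s):
    # Naive construction: sort suffix start positions by the suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])

def repeated_substrings(s, min_len=3):
    # Adjacent suffixes in sorted order share their longest common prefixes,
    # which is exactly where repeated patterns show up.
    sa = suffix_array(s)
    found = set()
    for a, b in zip(sa, sa[1:]):
        lcp = 0
        while (a + lcp < len(s) and b + lcp < len(s)
               and s[a + lcp] == s[b + lcp]):
            lcp += 1
        if lcp >= min_len:
            found.add(s[a:a + lcp])
    return found

print(repeated_substrings("ABCADCADCADABC"))
# includes 'CAD', plus longer repeats such as 'CADCAD' and 'ABC'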
http://en.wikipedia.org/wiki/Suffix_tree
Suffix trees are a more efficient way of doing the same task. Once you wrap your head around how to do something using suffix arrays, it's not too far of a jump to go on to suffix trees. In fact, this kind of structure is used in popular compression algorithms, including LZW and BWT (bzip2).
It may not be practically relevant, but for the particular question you ask there is a dynamic-programming solution. If you have computed the optimal way to compress the prefixes of length 1, 2, 3, ..., n-1 (starting from the first character), then you can compute the optimal way to compress the prefix of length n: for each possible k, look at the last k characters and see whether they form a multiple of a simple string. If they do, the cost is the cost of compressing the first n-k characters plus the cost of expressing the last k characters as a multiple of that string.
So in your example you would finish up by noticing that ABC is a multiple of itself, and that if you expressed it as ABC*1 you could use the answer you had already worked out for the first 11 characters, AB*1 CAD*3, to produce AB*1 CAD*3 ABC*1.
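Here is a rough Python sketch of that dynamic program. The cost model is an assumption (one character per symbol in the unit, plus the '*' and the digits of the repeat count); plug in whatever cost your real encoding has.

def smallest_period(s):
    # Length of the shortest u such that s is u repeated len(s) // len(u) times.
    n = len(s)
    for p in range(1, n + 1):
        if n % p == 0 and s[:p] * (n // p) == s:
            return p
    return n

def compress(s):
    n = len(s)
    best = [None] * (n + 1)   # best[i] = (cost, list of blocks) for the prefix s[:i]
    best[0] = (0, [])
    for i in range(1, n + 1):
        for k in range(1, i + 1):               # let the last block cover s[i-k:i]
            chunk = s[i - k:i]
            p = smallest_period(chunk)
            unit, count = chunk[:p], k // p
            cost = len(unit) + len(str(count)) + 1          # cost of 'unit*count'
            prev_cost, prev_blocks = best[i - k]
            if best[i] is None or prev_cost + cost < best[i][0]:
                best[i] = (prev_cost + cost, prev_blocks + [f"{unit}*{count}"])
    return " ".join(best[n][1])

print(compress("ABCADCADCADABC"))   # AB*1 CAD*3 ABC*1 under this cost model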
Better still would be:
ABCAD(6,3)(3,11)
where (n,d) is a length and distance back of a match. So (6,3) copies six bytes starting from three bytes back. While that may sound a little odd, by the time it gets three bytes in, the next three bytes it needs have been copied. So CADCAD is appended. The (3,11) causes ABC to be appended.
This is called LZ77 compression. It is what is implemented by zip, gzip, and zlib using the deflate compressed data format. That format not only references previous string matches, but also uses Huffman compression on the literals (e.g. ABCAD) as well as the lengths and distances.
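To see those two tokens in action, here is a tiny Python decoder for that kind of stream (the token format here is just for illustration):

def lz77_decode(tokens):
    # Tokens are either literal strings or (length, distance) pairs.
    out = []
    for t in tokens:
        if isinstance(t, str):
            out.extend(t)
        else:
            length, distance = t
            for _ in range(length):
                out.append(out[-distance])   # byte-by-byte copy, so a match may overlap itself
    return "".join(out)

print(lz77_decode(["ABCAD", (6, 3), (3, 11)]))   # ABCADCADCADABC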

Is there a hash function for binary data which produces closer hashes when the data is more similar?

I'm looking for something like a hash function, but one whose output is closer the closer two different inputs are.
Something like:
f(1010101) = 0 #original hash
f(1010111) = 1 #very close to the original hash as they differ by one bit
f(0101010) = 9999 #not very close to the original hash, as all the bits are different
(example outputs for demonstration purposes only)
All of the input data will be of the same length.
I want to compare a file against lots of other files and be able to determine which other file has the fewest differences from it.
You could try the Levenshtein distance:
http://en.wikipedia.org/wiki/Levenshtein_distance
Since it works on strings only, you can convert all your binary data to strings first,
for example:
0 -> "00000000"
1 -> "00000001"
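A minimal Python sketch of that approach (plain textbook edit-distance code; nothing library-specific is assumed):

def to_bits(data: bytes) -> str:
    # Expand each byte into its 8-character binary representation.
    return "".join(format(b, "08b") for b in data)

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein(to_bits(b"\x01"), to_bits(b"\x03")))   # 1: the bit strings differ in one position

Since the question says all inputs are the same length, a plain Hamming distance (see the XOR answer below) does the same job with far less work.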
You might be interested in either simhashing or shingling.
If you are only trying to detect similarity between documents, there are other techniques that may suit you better (like TF-IDF.) The second link is part of a good book whose other chapters delve into general information retrieval topics, including these other techniques.
You should not use a hash for this.
You should instead compute signatures containing several characteristic values, like:
file name
file size
Is binary / Is ascii only
date (if needed)
and some more complex ones, like:
variance of the values of bytes
average value of bytes
average length of same value bits sequence (in compressed files there are no long identical bit sequences)
...
Then you can compare signatures.
But the most important thing is to know what kind of data is in these files. If it is images, the size and main colour are more important. If it is sound, you could analyse only certain frequencies...
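As a toy Python sketch of that idea (the feature set and the way signatures are compared here are assumptions; the right choices depend entirely on your data):

import statistics

def signature(path):
    # A toy signature: file size plus mean and variance of the byte values.
    with open(path, "rb") as f:
        data = f.read()
    if not data:
        return (0, 0.0, 0.0)
    return (len(data), statistics.mean(data), statistics.pvariance(data))

def signature_distance(sig_a, sig_b):
    # Compare two signatures component by component (the equal weights are arbitrary).
    return sum(abs(x - y) for x, y in zip(sig_a, sig_b))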
You might want to look at the source code to unix utilities like cmp or the FileCmp stuff in Python and use that to try to determine a reasonable algorithm.
In my uninformed opinion, calculating a hash is not likely to work well. First, it can be expensive to calculate a hash. Second, what you're trying to do sounds more like a job for encoding than a hash; once you start thinking of it that way, it's not clear that it's even worth transforming the file that way.
If you have some constraints, specifying them might be useful. For example, if all the files are the exact same length, that may simplify things. Or if you are only interested in differences between bits in the same position and not interested in things that are similar only if you compare bits in different positions (e.g., two files are identical, except that one has everything shifted three bits--should those be considered similar or not similar?).
You could calculate the population count of the XOR of the two files, which is exactly the number of bits that are not the same between the two files. So it just does precisely what you asked for, no approximations.
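In Python, that is just a few lines (a sketch; any language with a bit-count primitive will do):

def hamming_distance(a: bytes, b: bytes) -> int:
    # XOR the two equal-length inputs and count the set bits of the result.
    assert len(a) == len(b), "inputs must be the same length"
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

print(hamming_distance(b"\x55", b"\x57"))   # 1: 0b01010101 vs 0b01010111 differ in one bit
print(hamming_distance(b"\x55", b"\xaa"))   # 8: every bit differs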
You can represent your data as a binary vector of features and then use dimensionality reduction either with SVD or with random indexing.
What you're looking for is a file fingerprint of sorts. For plain text, something like Nilsimsa (http://ixazon.dynip.com/~cmeclax/nilsimsa.html) works reasonably well.
There are a variety of different names for this type of technique: fuzzy hashing, locality-sensitive hashing, distance-based hashing, dimensionality reduction and a few others. Tools can generate a fixed-length or variable-length output, but the outputs are generally comparable (e.g. by Levenshtein distance), and similar inputs yield similar outputs.
The link above for nilsimsa gives two similar spam messages and here are the example outputs:
773e2df0a02a319ec34a0b71d54029111da90838cbc20ecd3d2d4e18c25a3025 spam1
47182cf0802a11dec24a3b75d5042d310ca90838c9d20ecc3d610e98560a3645 spam2
* * ** *** * ** ** ** ** * ******* **** ** * * *
Spamsum and sdhash are more useful for arbitrary binary data. There are also algorithms specifically for images that will work regardless of whether it's a JPEG or a PNG; identical images in different formats wouldn't be matched by, e.g., spamsum.

What's the name of this algorithm/routine?

I am writing a utility class which converts strings from one alphabet to another. This is useful in situations where you have a target alphabet you wish to use but a restriction on the number of characters available. For example, if you can use lower-case letters and numbers, but only 12 characters, it's possible to compress a timestamp from the alphabet "0123456789 -:" into "abcdefghijklmnopqrstuvwxyz0123456789", so 2010-10-29 13:14:00 might become 5hhyo9v8mk6avy (19 characters reduced to 14).
The class is designed to convert back and forth between alphabets, and also calculate the longest source string that can safely be stored in a target alphabet given a particular number of characters.
I was thinking of publishing this through Google Code; however, I'd obviously like other people to find it and use it, hence the question of what this is called. I've had to use this approach in two separate projects, with Bloomberg and a proprietary system, when you need to generate unique file names of a certain length but want to keep some plaintext, so GUIDs aren't appropriate.
Your examples bear some similarity to a Dictionary coder with a fixed target and source dictionaries. Also worthwhile to look at is Fibonacci coding, which has a fixed target dictionary (of variable-length bits), which is variably targeted.
I think it also depends on whether it is important that your target alphabet has fixed-width entries. If you allow a fixed alphabet with variable-length codes, your compression ratio can approach the entropy that much more closely. If the source alphabet distribution is known in advance, a static Huffman tree could easily be generated.
Here is a simple algorithm:
Consider that you don't have to transmit the alphabet used for encoding. Also, you don't use (or transmit) the probabilities of the input symbols, as standard compression does, so we just re-encode the data somehow.
In this case we can treat the input data as a number written in a base equal to the cardinality of the input alphabet. We just have to change its representation to another base, which is a simple task.
Example:
input alphabet: ABC; output alphabet: 0123456789
message ABAC will translate to 0102 in base 3, that is 11 (9 + 2) in base 10.
11 written in the output alphabet (base 10): 11
We could have a problem decoding it, because we don't know how many 0s (i.e. As) to put at the beginning of the decoded result, so we have to use one of these modifications:
1) encode the size of the compressed data somewhere in the stream.
2) use a dummy 1 at the start of the stream: in this way our example will become:
10102 (base 3) = 81 + 9 + 2 = 92 (base 10).
Now after decoding we just have to ignore the first 1 (this also provides a basic error detection).
The main problem with this approach is that in most cases (when the two bases have GCD == 1) each new encoded character will completely change the output. This is very inefficient and difficult to implement, and we end up with arithmetic coding as the best solution (actually a simplified version of it).
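For concreteness, here is a Python sketch of the scheme using modification 2, the dummy leading digit (the function names are just for illustration):

def convert(message, src_alphabet, dst_alphabet):
    # Treat the message as a number in base len(src_alphabet), prefix a dummy 1
    # to preserve leading "zero" symbols, then rewrite it in base len(dst_alphabet).
    src_base, dst_base = len(src_alphabet), len(dst_alphabet)
    value = 1
    for ch in message:
        value = value * src_base + src_alphabet.index(ch)
    digits = []
    while value:
        value, r = divmod(value, dst_base)
        digits.append(dst_alphabet[r])
    return "".join(reversed(digits))

def invert(encoded, src_alphabet, dst_alphabet):
    # Reverse the conversion, then drop the dummy leading digit.
    src_base, dst_base = len(src_alphabet), len(dst_alphabet)
    value = 0
    for ch in encoded:
        value = value * dst_base + dst_alphabet.index(ch)
    digits = []
    while value:
        value, r = divmod(value, src_base)
        digits.append(src_alphabet[r])
    return "".join(reversed(digits))[1:]     # strip the dummy leading digit

print(convert("ABAC", "ABC", "0123456789"))   # '92': 10102 in base 3 = 92 in base 10
print(invert("92", "ABC", "0123456789"))      # 'ABAC'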
You probably know about Base64, which does the same thing, just usually the other way around. Too bad there are way too many Google results for BaseX or BaseN...

Algorithm to find string matches in a sliding window

One of the core steps in file compression like ZIP is to use the previous decoded text as a reference source. For example, the encoded stream might say "the next 219 output characters are the same as the characters from the decoded stream 5161 bytes ago." This lets you represent 219 characters with just 3 bytes or so. (There's more to ZIP than that, like Huffman compression, but I'm just talking about the reference matching.)
My question is what the strategy (or strategies) for the string-matching algorithm is. Even looking at source code from zlib and the like doesn't seem to give a good description of the matching algorithm.
The problem might be stated as: "Given a block of text, say 30K of it, and an input string, find the longest reference in the 30K of text which exactly matches the front of the input string." The algorithm must be efficient when iterated, i.e., the 30K block of text will be updated by deleting some bytes from the front and adding new ones to the rear, and a new match performed.
I'm a lot more interested in discussions of the algorithm(s) to do this, not source code or libraries. (zlib has very good source!) I suspect there may be several approaches with different tradeoffs.
Well, I notice that you go into some detail about the problem but don't mention the information provided in section 4 of RFC 1951 (the specification for the DEFLATE Compressed Data Format, i.e. the format used in ZIP) which leads me to believe you might have missed this resource.
Their basic approach is a chained hash table using three-byte sequences as keys. As long as the chain is not empty, all the entries along it are scanned to a) eliminate false collisions, b) eliminate matches that are too old, and c) pick the longest match out of those remaining.
(Note that their recommendation is shaped by the factor of patents; it may be that they knew of a more effective technique but could not be sure that it was not covered by someone's patent. Personally, I've always wondered why one couldn't find the longest matches by examining the matches for the three-byte sequences that start at the second byte of the incoming data, the third byte, etc. and weeding out matches that don't match up. i.e., if your incoming data is "ABCDEFG..." and you've got hash matches for "ABC" at offsets 100, 302 and 416 but your only hash match for "BCD" is at offset 301, you know that unless you have two entirely coincidental overlapping hash matches -- unlikely -- then 302 is your longest match.)
Also note their recommendation of optional "lazy matching" (which ironically does more work): instead of automatically taking the longest match that starts at the first byte of the incoming data, the compressor checks for an even longer match starting at the next byte. If your incoming data is "ABCDE..." and your only matches in the sliding window are for "ABC" and for "BCDE", you're better off encoding the "A" as a literal byte and the "BCDE" as a match.
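Here is a rough Python sketch of that hash-chain idea (not zlib's actual code; using the raw 3-byte sequence as the key, skipping the chain-length limit, and rebuilding the table from scratch are all simplifications):

from collections import defaultdict

MIN_MATCH = 3

def longest_match(window, data):
    # Index every 3-byte sequence in the window, then walk the chain for the
    # 3 bytes at the front of the incoming data and keep the longest extension.
    chains = defaultdict(list)
    for i in range(len(window) - MIN_MATCH + 1):
        chains[window[i:i + MIN_MATCH]].append(i)

    best_len, best_pos = 0, -1
    for pos in chains.get(data[:MIN_MATCH], []):
        length = 0
        while (length < len(data) and pos + length < len(window)
               and window[pos + length] == data[length]):
            length += 1
        if length > best_len:
            best_len, best_pos = length, pos
    if best_len >= MIN_MATCH:
        return best_len, len(window) - best_pos   # (match length, distance back)
    return None

window = b"the quick brown fox jumps over the lazy dog"
print(longest_match(window, b"the lazy cat"))     # (9, 12): 'the lazy ' starts 12 bytes back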
You could look at the details of the LZMA Algorithm used by 7-zip. The 7-zip author claims to have improved on the algorithm used by zlib et al.
I think you're describing a modified version of the Longest Common Substring Problem.

Resources