Finding partly similar files in large archive - algorithm

I have an archive of about 100 million binary files. New files get added regularly. The file sizes range from about 0.1 MB to about 800 MB.
I can easily determine whether files are probably completely identical by comparing their sizes and, if the sizes match, by comparing the hashes of the files.
I want to find files that have partly similar content. By that I mean that I believe they have some parts that are identical and some parts that may differ.
What is the best, or any realistic way to find which files are similar to which other files, and if possible get some measure of how similar they are?
Edit:
The files are mostly executables.
They are similar if, say, somewhere between 10% and 100% of their contents are the same as the contents of another file. The lower limit could also be set to 50%. The exact lower limit is not important.
I guess some form of hashing would be needed for this comparison to be doable over such an archive.

It depends on how you will be determining similarity. If, for example, you could determine similarity by comparing just the first 100 bytes of each file, then I guess this would be achievable, but running an arbitrary content comparison across 100 million files that can each be 800 MB would be quite infeasible.

Not an easy problem. The first step is to map each file into a set of hashes, i.e., integers. Ideally you want to do that by computing the hashes of a set of substrings in each file such that the substrings are uniformly distributed throughout the file, but also such that a given substring is unlikely to occur in dissimilar files. For example, if the files were English text you could choose to split the file into substrings at all the most common English words (the, to, be, of, and, ...). To do that with executables I would first compute the most common byte pairs or triples across all the files and choose the top N as split points, which should hopefully generate substrings that are "not too long." Just what "not too long" means for executables is something I don't have a good idea of.
Once you hash those substrings you have the problem of finding similar sets, which is called the set similarity joins problem in computer science. See my post here for methods/code to solve that problem. Good luck!
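A minimal sketch of that pipeline in Python, assuming content-defined chunking (a crude rolling-value boundary test standing in for the byte-pair splitting described above) and a MinHash signature to estimate set similarity; the chunk parameters, hash choices, and signature length are illustrative assumptions, not the method from the linked post:

    import hashlib

    def chunk_hashes(path, mask=0x0FFF, min_len=256):
        """Hash content-defined chunks of a file into a set of 8-byte digests."""
        with open(path, "rb") as f:
            data = f.read()          # whole-file read, for brevity only
        hashes, start, rolling = set(), 0, 0
        for i, byte in enumerate(data):
            rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF
            # cut a chunk wherever the low bits of the rolling value are zero
            if (rolling & mask) == 0 and i - start >= min_len:
                hashes.add(hashlib.blake2b(data[start:i], digest_size=8).digest())
                start = i
        hashes.add(hashlib.blake2b(data[start:], digest_size=8).digest())
        return hashes

    def minhash_signature(hashes, num_perm=64):
        """Keep the smallest salted hash per 'permutation' as a fixed-size signature."""
        return [min(hashlib.blake2b(h, digest_size=8,
                                    salt=salt.to_bytes(4, "big")).digest()
                    for h in hashes)
                for salt in range(num_perm)]

    def estimated_jaccard(sig_a, sig_b):
        """Fraction of matching signature slots approximates the Jaccard similarity."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

The signatures are small and fixed-size, so candidate pairs can be found by bucketing signature bands (LSH-style) rather than comparing every pair of 100 million files.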

Related

Why isn't random byte comparison a good method of testing equality?

I have two 50G+ files I want to compare for equality.
'diff -a' or 'cmp' would work, but are slow.
Hashing both files and comparing the hashes would be faster(?), but
still fairly slow.
Instead, suppose I randomly selected 10,000 numbers between 1 and 50G,
and compared those specific bytes in the two files, using seek() for speed.
I claim the chance 10,000 randomly selected bytes will match in the
two files by coincidence is about 256^10000 to 1 (or about 1 in
10^24082).
This makes it orders of magnitude better than any known hash function,
and much faster.
So, what's wrong with this argument? Why isn't random byte testing
superior to hashing?
This question inspired by:
What is the fastest way to check if files are identical?
(where I suggest a similar, but slightly different method)
What happens if you have an accidental bit flip somewhere in there? Even just one would be enough to make your checks fail
Your odds calculation is only true if the two files themselves contain random bytes, which is almost certainly not the case. Two large files of the same size on the same system are very likely to be highly correlated. For example, on my system right now there are three files of the same size in the 8 GB range: they are raw dumps of SD cards representing different versions of the same software, so it is likely that only a few hundred bytes of them differ. The same would apply to, say, two database snapshots from consecutive days.
Because large files differing by only a few bytes is a very possible--indeed likely--case, you really have no choice but to read every byte of both. Hashing will at least save you from comparing every byte.
One thing you might be able to do is access the blocks in each file in a pre-determined pseudo-random order, to maximize the likelihood of finding the small patch of difference and being able to abort early on failure.
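A sketch of that suggestion, with the block size and seed chosen arbitrarily for illustration:

    import os
    import random

    def files_equal(path_a, path_b, block_size=1 << 20, seed=0):
        """Compare same-sized files block by block in a seeded pseudo-random order."""
        size = os.path.getsize(path_a)
        if size != os.path.getsize(path_b):
            return False
        order = list(range((size + block_size - 1) // block_size))
        random.Random(seed).shuffle(order)   # pre-determined pseudo-random order
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            for block in order:
                fa.seek(block * block_size)
                fb.seek(block * block_size)
                if fa.read(block_size) != fb.read(block_size):
                    return False             # abort early on the first difference
        return True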

Does the order of data in a text file affect its compression ratio?

I have 2 large text files (csv, to be precise). Both have the exact same content except that the rows in one file are in one order and the rows in the other file are in a different order.
When I compress these 2 files (programmatically, using DotNetZip) I notice that one of the files is always considerably bigger; for example, one file is ~7 MB bigger than the other.
My questions are:
How does the order of data in a text file affect compression, and what measures can one take to guarantee the best compression ratio? I presume that having similar rows grouped together (at least in the case of ZIP files, which is what I am using) would help compression, but I am not familiar with the internals of the different compression algorithms and I'd appreciate a quick explanation on this subject.
Which algorithm handles this sort of scenario better in the sense that would achieve the best average compression regardless of the order of the data?
"How" has already been answered. To answer your "which" question:
The larger the window for matching, the less sensitive the algorithm will be to the order. However all compression algorithms will be sensitive to some degree.
gzip has a 32K window, bzip2 a 900K window, and xz an 8MB window. xz can go up to a 64MB window. So xz would be the least sensitive to the order. Matches that are further away will take more bits to code, so you will always get better compression with, for example, sorted records, regardless of the window size. Short windows simply preclude distant matches.
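A rough illustration of the window-size point, using Python's zlib and lzma modules; the sizes are arbitrary and the exact ratios will vary:

    import lzma
    import os
    import zlib

    block = os.urandom(1 << 20)       # 1 MB of incompressible data
    filler = os.urandom(5 << 20)      # 5 MB of different incompressible data
    data = block + filler + block     # the repeat sits ~6 MB after the original

    # zlib (32 KB window) cannot see the earlier copy: output stays near 7 MB.
    print(len(zlib.compress(data, 9)))
    # LZMA's default 8 MB dictionary covers the distance: output closer to 6 MB.
    print(len(lzma.compress(data)))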
In some sense, it is the entropy of the file that defines how well it will compress. So, yes, the order definitely matters. As a simple example, consider a file filled with the values abcdefgh...zabcd...z repeating over and over. It would compress very well with most algorithms because it is very ordered. However, if you completely randomize the order (but keep the same count of each letter), then you have the exact same data (although with a different "meaning"). It is the same data in a different order, and it will not compress as well.
In fact, because I was curious, I just tried that. I filled an array with 100,000 characters a-z repeating, wrote that to a file, then shuffled that array "randomly" and wrote it again. The first file compressed down to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
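The experiment is easy to reproduce; here is a quick version using zlib (the exact byte counts will differ from the numbers quoted, but the gap should be just as stark):

    import random
    import string
    import zlib

    ordered = (string.ascii_lowercase * 4000)[:100_000]    # abc...z repeating
    shuffled = list(ordered)
    random.shuffle(shuffled)
    shuffled = "".join(shuffled)

    print(len(zlib.compress(ordered.encode(), 9)))    # a few hundred bytes
    print(len(zlib.compress(shuffled.encode(), 9)))   # tens of thousands of bytes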
A typical compression algorithm works as follows. Look at a chunk of data. If it's identical to some other recently seen chunk, don't output the current chunk literally, output a reference to that earlier chunk instead.
It surely helps when similar chunks are close together. The algorithm will only keep a limited amount of look-back data to keep compression speed reasonable. So even if a chunk of data is identical to some other chunk, if that old chunk is too old, it could already be flushed away.
Sure it does. If the input pattern is fixed, there is a 100% chance to predict the character at each position. Given that two parties know this about their data stream (which essentially amounts to saying that they know the fixed pattern), virtually nothing needs to be communicated: total compression is possible (to communicate finite-length strings, rather than unlimited streams, you'd still need to encode the length, but that's sort of beside the point). If the other party doesn't know the pattern, all you'd need to do is to encode it. Total compression is possible because you can encode an unlimited stream with a finite amount of data.
At the other extreme, if you have totally random data - so the stream can be anything, and the next character can always be any valid character - no compression is possible. The stream must be transmitted completely intact for the other party to be able to reconstruct the correct stream.
Finite strings are a little trickier. Since finite strings necessarily contain a fixed number of instances of each character, the probabilities must change once you begin reading off initial tokens. One can read some sort of order into any finite string.
Not sure if this answers your question, but it addresses things a bit more theoretically.

Byte-Pairing for data compression

Question about Byte-Pairing for data compression. If byte pairing converts two byte values to a single byte value, splitting the file in half, then taking a gig file and recursing it 16 times shrinks it to 62,500,000. My question is, is byte-pairing really efficient? Is the creation of a 5,000,000 iteration loop, to be conservative, efficient? I would like some feedback and some incisive opinions, please.
Dave, what I read was:
"The US patent office no longer grants patents on perpetual motion machines, but has recently granted at least two patents on a mathematically impossible process: compression of truly random data."
I was not implying the Patent Office was actually considering what I am inquiring about. I was merely commenting on the notion of a "mathematically impossible process." If someone has, in some way, created a method of having a "single" data byte act as a placeholder for 8 individual bytes of data, that would be a consideration for a patent. Now, about the mathematical impossibility of an 8 to 1 compression method: it is not so much a mathematical impossibility as a matter of the rules and conditions that can be created. As long as there is the rule of 8 or 16 bit representation for storing data on a medium, there are ways to manipulate data that mirror current methods, or to create a new way of thinking.
In general, "recursive compression" as you have described it is a mirage: compression doesn't actually work that way.
First, you should realize that all compression algorithms have the potential to expand the input file instead of compressing it. You can demonstrate this by a simple counting argument: note that the compressed version of any file must be different from the compressed version of any other file (or you will not be able to decompress that file properly). Also, for any file size N, there is a fixed number of possible files of size <=N. If any files of size > N are compressible to size <= N, then an equal number of files of size <= N must expand to size >N when "compressed".
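A tiny numeric check of that counting argument, in one of its simplest forms: there are 2^n files of exactly n bits but only 2^n - 1 files of strictly fewer bits (counting the empty file), so no lossless compressor can shrink every n-bit input:

    # Compare the number of n-bit files with the number of strictly shorter files.
    for n in range(1, 9):
        exactly_n = 2 ** n
        shorter = sum(2 ** k for k in range(n))   # equals 2**n - 1
        print(f"{n} bits: {exactly_n} files, only {shorter} shorter outputs available")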
Second, "truly random" files are uncompressible. Compression works because the compression algorithm expects to receive files with certain kinds of predictable regularities. However, "truly random" files are by definition unpredictable: every random file is as likely as every other random file of the same length, so they don't compress.
Effectively, you have a model which treats some files as more likely than others; to compress such files, you want to choose shorter output files for the input files which are more likely. Information theory tells us the most efficient way to compress files is to assign each input file of probability P an output file of length ~ log2(1/P) bits. This means that, ideally, every output file of a given length has roughly equal probability, just like "truly random" files.
Among completely random files of a given length, each has probability (0.5)^(#original bits). The optimal length from above is ~ log2(1/ 0.5^(#original bits) ) = (#original bits) -- which is to say, the original length is the best you can do.
Because the output of a good compression algorithm is nearly random, re-compressing the compressed file will get you little to no gain. Any further improvements are effectively "leakage" due to suboptimal modeling and encoding; also, compression algorithms tend to scramble any regularity they don't take advantage of, making further compression of such "leakage" more difficult.
For a much longer exposition on this topic, with many examples of failed propositions of this type, see the comp.compression FAQ. Claims of "recursive compression" feature prominently.

file names based on file content

So, in other words, some algorithm to generate a unique, reasonable-length filename based on binary file content. Two files that have the same binary content should have the same name. Obviously there would be limits to this, as presumably you couldn't have unique, reasonable-length filenames for each member of a large set of large files differing only at a handful of bit positions. But presumably there is some heuristic, best-approximation method that, for example, exploits known attributes of typical image files. If I had the name of some algorithm that does this, I could google it and find other approaches as well.
Use an MD5 hash of the contents of the file.
I guess MD5 is worth checking out. Of course it will give you the same result if the content is the same, but I guess you can increment it until you get a unique one.
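A sketch of the content-based naming idea; SHA-256 is substituted for MD5 here purely to reduce the chance of collisions, and the chunked read is just so that large files never need to fit in memory:

    import hashlib

    def content_name(path, extension=""):
        """Derive a filename from the file's bytes: same content, same name."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):   # read 1 MB at a time
                h.update(chunk)
        return h.hexdigest() + extension

    # e.g. content_name("picture.jpg", ".jpg") yields a 64-hex-character name
    # that is identical for any file with byte-for-byte identical contents.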

Log combing algorithm

We get these ~50GB data files consisting of 16 byte codes, and I want to find any code that occurs 1/2% of the time or more. Is there any way I can do that in a single pass over the data?
Edit: There are tons of codes - it's possible that every code is different.
EPILOGUE: I've selected Darius Bacon as best answer, because I think the best algorithm is a modification of the majority element he linked to. The majority algorithm should be modifiable to only use a tiny amount of memory - like 201 codes to get 1/2% I think. Basically you just walk the stream counting up to 201 distinct codes. As soon as you find 201 distinct codes, you drop one of each code (deduct 1 from the counters, forgetting anything that becomes 0). At the end, you have dropped at most N/201 times, so any code occurring more times than that must still be around.
But it's a two-pass algorithm, not a one-pass one. You need a second pass to tally the counts of the candidates. It's actually easy to see that any solution to this problem must use at least two passes (the first batch of elements you load could all be different, and one of those codes could end up being exactly 1/2%).
Thanks for the help!
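A sketch of the two-pass scheme described in the epilogue (a Misra-Gries-style generalisation of the majority-vote algorithm). The 16-byte record size and the 0.5% threshold come from the question; the budget of 200 counters matches the "dropped at most N/201 times" reasoning above:

    CODE_SIZE = 16
    COUNTERS = 200    # any code occurring more than N/201 times must survive pass 1

    def read_codes(path):
        with open(path, "rb") as f:
            while True:
                code = f.read(CODE_SIZE)
                if len(code) < CODE_SIZE:
                    return
                yield code

    def heavy_hitters(path, threshold=0.005):
        # Pass 1: candidate selection with a bounded number of counters.
        counts, total = {}, 0
        for code in read_codes(path):
            total += 1
            if code in counts:
                counts[code] += 1
            elif len(counts) < COUNTERS:
                counts[code] = 1
            else:
                # "drop one of each code": deduct 1 everywhere, forget the zeros
                counts = {c: n - 1 for c, n in counts.items() if n > 1}
        # Pass 2: exact counts for the surviving candidates only.
        exact = dict.fromkeys(counts, 0)
        for code in read_codes(path):
            if code in exact:
                exact[code] += 1
        return {c: n for c, n in exact.items() if n >= threshold * total}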
Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can't find now; but this looks like a good start.
Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you're looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.
This will depend on the distribution of the codes. If there is a small enough number of distinct codes, you can build a frequency distribution (http://en.wikipedia.org/wiki/Frequency_distribution) in core with a map. Otherwise you will probably have to build a histogram (http://en.wikipedia.org/wiki/Histogram) and then make multiple passes over the data, examining the frequencies of the codes in each bucket.
Sort chunks of the file in memory, as if you were performing an external sort. Rather than writing out all of the sorted codes in each chunk, however, you can just write each distinct code and the number of occurrences in that chunk. Finally, merge these summary records to find the number of occurrences of each code.
This process scales to any size data, and it only makes one pass over the input data. Multiple merge passes may be required, depending on how many summary files you want to open at once.
Sorting the file allows you to count the number of occurrences of each code using a fixed amount of memory, regardless of the input size.
You also know the total number of codes (either by dividing the input size by a fixed code size, or by counting the number of variable length codes during the sorting pass in a more general problem).
So, you know the proportion of the input associated with each code.
This is basically the pipeline sort * | uniq -c
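A sketch of that approach; the chunk size is an arbitrary illustrative value, a single merge pass is assumed, and the input is assumed to be a whole number of 16-byte codes:

    import heapq
    import pickle
    from collections import Counter

    CODE_SIZE = 16
    CODES_PER_CHUNK = 10_000_000     # roughly 160 MB of input per in-memory chunk

    def write_summaries(path):
        """Write one sorted (code, count) summary file per in-memory chunk."""
        summaries = []
        with open(path, "rb") as f:
            while True:
                data = f.read(CODE_SIZE * CODES_PER_CHUNK)
                if not data:
                    break
                counts = Counter(data[i:i + CODE_SIZE]
                                 for i in range(0, len(data), CODE_SIZE))
                name = f"{path}.summary{len(summaries)}"
                with open(name, "wb") as out:
                    pickle.dump(sorted(counts.items()), out)
                summaries.append(name)
        return summaries

    def merged_counts(summaries):
        """Stream-merge the sorted summaries, yielding (code, total_count)."""
        streams = []
        for name in summaries:
            with open(name, "rb") as f:
                streams.append(pickle.load(f))
        current, total = None, 0
        for code, count in heapq.merge(*streams):
            if code != current:
                if current is not None:
                    yield current, total
                current, total = code, 0
            total += count
        if current is not None:
            yield current, total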
If every code appears just once, that's no problem; you just need to be able to count them.
That depends on how many different codes exist, and how much memory you have available.
My first idea would be to build a hash table of counters, with the codes as keys. Loop through the entire file, increasing the counter of the respective code, and counting the overall number. Finally, filter all keys with counters that exceed (* overall-counter 1/200).
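The same idea as a short Python sketch, workable only when the number of distinct codes fits in memory:

    from collections import Counter

    def frequent_codes(path, code_size=16, threshold=1 / 200):
        """One pass of exact counting, then a filter at 0.5% of the total."""
        counts, total = Counter(), 0
        with open(path, "rb") as f:
            while True:
                code = f.read(code_size)
                if len(code) < code_size:
                    break
                counts[code] += 1
                total += 1
        return {code: n for code, n in counts.items() if n >= total * threshold}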
If the files consist solely of 16-byte codes, and you know how large each file is, you can calculate the number of codes in each file. Then you can find the 0.5% threshold and follow any of the other suggestions to count the occurrences of each code, recording each one whose frequency crosses the threshold.
Do the contents of each file represent a single data set, or is there an arbitrary cutoff between files? In the latter case, and assuming a fairly constant distribution of codes over time, you can make your life simpler by splitting each file into smaller, more manageable chunks. As a bonus, you'll get preliminary results faster and can pipeline them into the next process earlier.
