file names based on file content - algorithm

In other words, I'm looking for an algorithm that generates a unique, reasonable-length filename from a file's binary content. Two files with the same binary content should get the same name. Obviously there are limits to this: presumably you can't have unique, reasonable-length filenames for each of a large set of large files that differ only at a handful of bit positions. But presumably there is some heuristic, a best approximation, that for example exploits known attributes of typical image files. If I had the name of an algorithm that does this, I could google it and find other approaches as well.

Use an MD5 hash of the contents of the file.

I guess MD5 is worth checking out. Of course, it will give you the same result whenever the content is the same, but I guess you can increment it until you get a unique one.
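As a rough illustration of the MD5 suggestion, here is a minimal sketch that names a file after the hex digest of its bytes. The function name `content_name`, the chunk size, and the choice to keep the original extension are assumptions made for the example, not part of the answers above.

```python
import hashlib
from pathlib import Path


def content_name(path, chunk_size=1 << 20):
    """Return a name built from the MD5 of the file's bytes plus its suffix.

    Hypothetical helper: the file is read in chunks so that large files
    never have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    # 32 hex characters, identical for byte-identical files.
    return digest.hexdigest() + Path(path).suffix
```

Two files with identical bytes get the same 32-character name, whatever their size. Files with different bytes almost always get different names, but a fixed-length digest cannot rule out collisions.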

Related

Build optimal dictionary algorithm

I have a large plain-text ASCII script that can use a dictionary of up to 256 entries. A dictionary entry can be any string of 3-6 letters. Each use of an entry takes 2 bytes in the script (instead of the length of the dictionary string it stands for).
I have thought a lot about how to build an optimal dictionary for such a scheme. The naive solution is to take every 6-letter substring at each position, count them, and pick the most frequent. Then remove this substring from the script and repeat the search to find the second entry, and so on.
All of this looks ugly. I believe there is already serious mathematics behind this task, but I couldn't find any references to it anywhere.
Could you please point me to, or describe, a possible algorithm for this task?
What I was looking for was an algorithm for building MTE codes. It has been implemented in a couple of tools, for instance here (C code is also available inside).
The idea behind it is brute force, just as I described. The important addition is that when you look for the best possible code, you also need to count the actual MTE code length: only the saved length minus the length taken up by the MTE codes themselves should be taken into account.
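The brute-force idea can be sketched roughly as follows. This is not the code from the tool mentioned above. It is a greedy approximation under stated assumptions: each use of an entry costs `code_size` bytes, storing an entry costs its own length, and occurrences are counted without overlap. The function name `build_dictionary` and its parameters are hypothetical.

```python
def build_dictionary(text, max_entries=256, min_len=3, max_len=6, code_size=2):
    """Greedily pick substrings that save the most bytes (illustrative sketch)."""
    entries = []
    segments = [text]  # parts of the script not yet covered by an entry

    for _ in range(max_entries):
        # Collect every candidate substring of an allowed length.
        candidates = set()
        for seg in segments:
            for n in range(min_len, max_len + 1):
                for i in range(len(seg) - n + 1):
                    candidates.add(seg[i:i + n])

        best, best_gain = None, 0
        for cand in sorted(candidates):  # sorted only to make ties deterministic
            uses = sum(seg.count(cand) for seg in segments)  # non-overlapping
            # Bytes saved in the script minus the bytes needed to store the entry.
            gain = uses * (len(cand) - code_size) - len(cand)
            if gain > best_gain:
                best, best_gain = cand, gain

        if best is None:  # nothing left that saves space
            break
        entries.append(best)
        # Exclude the chosen substring before searching for the next entry.
        segments = [p for seg in segments for p in seg.split(best) if p]

    return entries
```

The sketch rescans the text for every candidate, so it is only practical for small inputs. It also makes no claim to optimality, since a greedy choice can block a better combination later.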

List all duplicate files in a filesystem given the root directory.

How would you go about designing an algorithm to list all the duplicate files in a filesystem? My first thought is to use hashing, but I'm wondering if there's a better way to do it. Are there any design tradeoffs to keep in mind?
Hashing all your files will take a very long time because you have to read the contents of every file.
I would recommend a 3-step algorithm:
1. Scan your directories and note down the paths and sizes of the files.
2. Hash only the files that share their size with other files, and only if more than 2 files have that size. If a file has the same size as only one other file, skip the hashing and compare the two contents directly (this saves hashing time, and you won't need the hash value afterwards).
3. Even if the hashes are the same, you still have to compare the files byte by byte, because different files can have identical hashes (although this is very unlikely if the file size is also the same and it's your own filesystem).
You could also do without hashing at all by opening all the files at the same time, if possible, and comparing their contents. That would avoid reading big files more than once. There are a lot of tweaks you could implement to save time, depending on the type of your data (e.g. if 2 compressed/tar files have the same size above x gigabytes (and the same name), don't read the contents: given your process, the files are very likely to be duplicates).
That way, you avoid hashing files whose size is unique in the system, which saves a lot of time.
Note: I don't take names into account here, because I suppose names can be different.
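The three steps could be sketched like this. It is a minimal illustration, not a replacement for a dedicated tool: the names `find_duplicates`, `file_digest` and `confirm` are made up for the example, and SHA-256 is an arbitrary choice of hash.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks (SHA-256 is an assumption; any hash would do)."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def confirm(paths):
    """Step 3: split candidates into groups that are byte-for-byte identical."""
    groups = []
    for p in paths:
        for g in groups:
            if filecmp.cmp(g[0], p, shallow=False):  # full content comparison
                g.append(p)
                break
        else:
            groups.append([p])
    return [g for g in groups if len(g) > 1]


def find_duplicates(root):
    # Step 1: record paths grouped by size.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size: never read the file at all
        if len(paths) == 2:
            candidates = [paths]  # a pair: compare directly, no hashing
        else:
            # Step 2: hash only files that share their size with others.
            by_hash = defaultdict(list)
            for p in paths:
                by_hash[file_digest(p)].append(p)
            candidates = [g for g in by_hash.values() if len(g) > 1]
        for cand in candidates:
            duplicates.extend(confirm(cand))
    return duplicates
```

Calling `find_duplicates("/some/root")` returns a list of groups, each group being a list of paths with identical contents.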
EDIT: I've done a bit of research (too late) and found out that fdupes seems to do exactly that if you are using Un*x-like systems:
https://linux.die.net/man/1/fdupes
seen in that question: List duplicate files in a directory in Unix

Finding partly similar files in large archive

I have an archive of about 100 million binary files. New files get added regularly. The file sizes range from about 0.1 MB to about 800 MB.
I can easily determine if files are probably completely identical by comparing their sizes and if the sizes match, by comparing the hashes of the files.
I want to find files that have partly similar content. With that I mean that I believe they have some parts that are identical and some parts that can be different.
What is the best, or any realistic way to find which files are similar to which other files, and if possible get some measure of how similar they are?
Edit:
The files are mostly executables.
They are similar if, say, somewhere between 10% and 100% of their contents are the same as the contents of another file. The lower limit could also be set to 50%. The exact lower limit is not important.
I guess some form of hashing would be needed for this comparison to be doable over such an archive.
It depends on how you determine similarity. If, for example, you could determine similarity by comparing just the first 100 bytes of each file, then I guess this would be achievable. But searching for a particular string across 100 million files that can each be 800 MB large would be quite infeasible.
Not an easy problem. The first step is to map each file into a set of hashes, i.e., integers. Ideally you want to do that by computing the hashes of a set of substrings in each file, chosen such that the substrings are uniformly distributed throughout the file and a given substring rarely occurs in dissimilar files. For example, if the files were English text you could split the file into substrings at all the most common English words (the, to, be, of, and, ...). To do that with the executables, I would first compute the most common byte pairs or triples across all the files and choose the top N as split points, hopefully generating substrings that are "not too long." Just what "not too long" means for executables is something I don't have a good idea of.
Once you hash those substrings you have the problem of finding similar sets, which is called the set similarity joins problem in computer science. See my post here for methods/code to solve that problem. Good luck!
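To make the "set of hashes" idea concrete, here is a small sketch that splits data at a few delimiter byte sequences, hashes the pieces, and compares two files by Jaccard similarity of the resulting sets. The delimiters below are placeholders, not measured common byte pairs, and the names `chunk_hashes` and `jaccard` are assumptions for illustration. Comparing every pair this way does not scale to 100 million files, which is what set similarity join methods are for.

```python
import hashlib
import re


def chunk_hashes(data, delimiters=(b"\x00\x00", b"\xff\xff")):
    """Split data at the given byte sequences and hash each non-empty piece.

    The default delimiters are placeholders; in practice they would be the
    most common byte pairs or triples measured over the archive.
    """
    pattern = b"|".join(re.escape(d) for d in delimiters)
    pieces = re.split(pattern, data)
    # Keep 8 bytes of each digest as a compact integer fingerprint.
    return {
        int.from_bytes(hashlib.sha1(p).digest()[:8], "big")
        for p in pieces
        if p
    }


def jaccard(a, b):
    """Similarity of two hash sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, two inputs that share two of their three pieces have four distinct pieces between them, so their similarity is 2/4 = 0.5.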

Iterative Hash Algorithm for Fast File Check

I want to create a representation of the state of all files in a folder (ignoring order), so that I can send this state to another computer to check whether we are in sync. This "state representation" is 3 numbers joined by "." which are:
sum . product . number of items
The sum is the numerical addition of the numerical representations of all the files' MD5s.
The product is the multiplication of the numerical representations of all the files' MD5s.
The number of items is just the number of files.
The main reason for doing this is that it lets me update the state iteratively and quickly when I add or delete a file (a modification being a delete followed by an add). Also, one should end up with the same "state" even if the same set of operations is performed in any order.
Adding A File
Generate the file's md5
Calculate the md5's numerical value (x).
Add x to the sum
Multiply the product by x
Increment the number of items.
Removing A File
Generate the file's md5
Calculate the md5's numerical value (x).
Subtract x from the sum
Divide the product by x
Decrement the number of items.
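The add and remove procedures above can be sketched in Python, whose built-in integers have arbitrary precision, so no string-based big-number library is needed. The class name `FolderState` is made up for the example, and file contents are passed in as bytes to keep it self-contained.

```python
import hashlib


class FolderState:
    """Order-independent 'sum.product.count' state (illustrative sketch)."""

    def __init__(self):
        self.total = 0
        self.product = 1
        self.count = 0

    @staticmethod
    def _value(content):
        # Numerical value of the file's MD5 (a 128-bit integer).
        return int(hashlib.md5(content).hexdigest(), 16)

    def add(self, content):
        x = self._value(content)
        self.total += x
        self.product *= x
        self.count += 1

    def remove(self, content):
        x = self._value(content)
        self.total -= x
        # Exact division: the product is a multiple of x if the file was added.
        # An MD5 value of 0 would break this; the sketch ignores that case.
        self.product //= x
        self.count -= 1

    def __str__(self):
        return f"{self.total}.{self.product}.{self.count}"
```

One thing the sketch makes visible: the product grows by up to 128 bits with every file added, so the state string gets longer as the folder grows.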
Problems
Since the numerical representations of hashes can be quite large, I may have to use a library to generate results using strings rather than integers which may be quite slow.
With the limited testing I have done, I have not been able to create "collisions" where a collision is where two different sets of file hashes could produce the same state (remember that we are ignoring the order of the file hashes).
Question
I'm sure that I can't be the first person to want to achieve such a thing. Is there an algorithm or iterative hash function that aims to do the same thing already, preferably in PHP, Java, or Python? Is there a term for this type of thing, all I could think of was "iterative hash"? Is there a flaw with this algorithm that I haven't spotted, such as with "collisions" making generated state representations non-unique?
How many states can your file system take? Infinitely many, for all practical purposes.
How long is your hash? Short enough to be efficient, and finite in any case.
Will I get collisions? Yes.
So your hash approach seems fine, particularly if it spreads nearby points apart, i.e. file system states that differ in the content of just one file hash to very different values.
However, you should expect your hash to produce collisions in the long run: given that the chance of a collision is not 0, it's a mathematical certainty that the probability of eventually hitting one goes to one.
So to be really safe, you probably need a full MD5 exchange. If speed and fast updates are the goal, your scheme sounds good, but I would back it up with less frequent exchanges of longer keys, just to be on the safe side if sync is mission critical.

Sorting algorithm: Big text file with variable-length lines (comma-separated values)

What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (ie. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
Suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Seminumerical Algorithms, vol 2 of The Art of Computer Programming: Knuth: Addison Wesley:ISBN 0-201-03822-6(v.2)
A standard merge sort approach will work. The common scheme is:
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
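The split, sort, and merge steps can be sketched as below. This is a minimal single-pass version under several assumptions: the file has one header line, no field contains an embedded newline, the sort field compares correctly as a string (true for YYYY/MM/DD dates), and one chunk of `lines_per_chunk` lines fits in memory. The function names are hypothetical.

```python
import csv
import heapq
import itertools
import os
import tempfile


def sort_key(line, column=3):
    """Parse one CSV line and return the field to sort by (4th by default)."""
    return next(csv.reader([line]))[column]


def external_sort(src, dst, column=3, lines_per_chunk=100_000):
    key = lambda ln: sort_key(ln, column)
    chunk_paths = []

    # Split the file into parts and sort each part in memory.
    with open(src, newline="") as fh:
        header = fh.readline()
        while True:
            chunk = list(itertools.islice(fh, lines_per_chunk))
            if not chunk:
                break
            # Make sure the last line of the file ends with a newline.
            chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
            chunk.sort(key=key)
            with tempfile.NamedTemporaryFile(
                "w", delete=False, newline="", suffix=".csv"
            ) as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(tmp.name)

    # Merge the sorted parts; heapq.merge keeps one line per part in memory.
    parts = [open(p, newline="") for p in chunk_paths]
    try:
        with open(dst, "w", newline="") as out:
            out.write(header)
            out.writelines(heapq.merge(*parts, key=key))
    finally:
        for f in parts:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

With very many chunks, the number of simultaneously open files can become a limit. In that case the merge would have to run in several passes, which this sketch does not do.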
No need to sort. Read the file ALL.CSV and append each line you read to a file per day, like 19841231.CSV. Then, for each day that has data, in numerical order, read that day's CSV file and append its lines to a new file. Optimizations are possible, for example by processing the original file more than once or by recording which days actually occur in the file ALL.CSV.
So a line containing "1985/02/28" should be added to the file 19850228.CSV. The file 19850228.CSV should be appended to NEW.CSV after the file 19850227.CSV has been appended to NEW.CSV. The numerical order avoids the use of any sort algorithm, although it could torture the file system.
In reality the file ALL.CSV could be split in a file per, for example, year. 1984.CSV, 1985.CSV, and so on.
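A rough sketch of this file-per-day idea follows. It assumes one header line, no embedded newlines or blank lines, and few enough distinct days that one open file handle per day is acceptable. The function name and the `workdir` parameter are made up for the example. The only thing sorted is the small set of day names, not the records.

```python
import csv
import os


def bucket_by_day(src, dst, workdir, column=3):
    """Distribute lines into one file per day, then concatenate in day order."""
    handles = {}
    with open(src, newline="") as fh:
        header = fh.readline()
        for line in fh:
            if not line.endswith("\n"):
                line += "\n"
            # "1985/02/28" becomes the bucket name "19850228".
            day = next(csv.reader([line]))[column].replace("/", "")
            if day not in handles:
                path = os.path.join(workdir, day + ".CSV")
                handles[day] = open(path, "w", newline="")
            handles[day].write(line)
    for h in handles.values():
        h.close()

    with open(dst, "w", newline="") as out:
        out.write(header)
        for day in sorted(handles):  # numerical order of the days seen
            path = os.path.join(workdir, day + ".CSV")
            with open(path, newline="") as part:
                out.writelines(part)
            os.remove(path)
```

Lines that share a day keep their original relative order, because each bucket file is written and read back sequentially.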
