Gensim Word2Vec Model trained but not saved - gensim

I am using gensim and executed the following code (simplified):
model = gensim.models.Word2Vec(...)
mode.build_vocab(sentences)
model.train(...)
model.save('file_name')
After days my code finished model.train(...). However, during saving, I experienced:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
I noticed that there were some npy files generated:
<...>.trainables.syn1neg.npy
<...>.trainables.vectors_lockf.npy
<...>.wv.vectors.npy
Are those intermediate results I can re-use or do I have to rerun the entire process?

Those are parts of the saved model, but unless the master file_name file (a Python-pickled object) exists and is complete, they may be hard to re-use.
However, if your primary interest is the final word-vectors, those are in the .wv.vectors.npy file. If it appears to be full-length (same size at the syn1neg file), it may be complete. What you're missing is the dict that tells you which word is in which index.
So, the following might work:
Repeat the original process, with the exact same corpus & model parameters, but only through the build_vocab() step. At that point, the new model.wv.vocab dict should be identical as the one from the failed-save-run.
Save that model, without ever train()ing it, to a new filename.
Confirming that newmodel.wv.vectors.npy (with randomly-initialized untrained vectors) is the same size as oldmodel.wv.vectors.npy, copy the oldmodel file to the newmodel's name.
Re-load the new model, and run some sanity checks that the words make sense.
Perhaps, save off just the word-vectors, using something like newmodel.wv.save() or newmodel.wvsave_word2vec_format().
Potentially, the resurrected newmodel could also be patched to use the old syn1neg file as well, if it appears complete. It might work to further train the patched model (either with or without having reused the older syn1neg).
Separately: only the very largest corpuses, or an installation missing the gensim cython optimizations, or a machine without enough RAM (and thus swapping during training), would usually require a training session taking days. You might be able to run much faster. Check:
Is any virtual-memory swapping happening during the entire training? If it is, it will be disastrous for training throughput, and you should use a machine with more RAM or be more aggressive about trimming the vocabulary/model size with a higher min_count. (Smaller min_count values mean a larger model, slower training, poor-quality vectors for words with just a few examples, and also counterintuitively worse-quality vectors for more-frequent words too, because of interference from the noisy rare words. It's usually better to ignore lowest-frequency words.)
Is there any warning displayed about a "slow version" (pure Python with no effective multi-threading) being used? If so your training will be ~100X slower than if that problem is resolved. If the optimized code is available, maximum training throughput will likely be achieved with some workers value between 3 and 12 (but never larger than the number of machine CPU cores).
For a very large corpus, the sample parameter can be made more aggressive – such as 1e-04 or 1e-05 instead of the default 1e-03 – and it may both speed training and improve vector quality, by avoiding lots of redundant overtraining of the most-frequent words.
Good luck!

Related

What is the fastest way to intersect two large set of ids

The Problem
On a server, I host ids in a json file. From clients, I need to mandate the server to intersect and sometimes negate these ids (the ids never travel to the client even though the client instructs the server its operations to perform).
I typically have 1000's of ids, often have 100,000's of ids, and have a maximum of 56,000,000 of them, where each value is unique and between -100,000,000 and +100,000,000.
These ids files are stable and do not change (so it is possible to generate a different representation for it that is better adapted for the calculations if needed).
Sample ids
Largest file sizes
I need an algorithm that will intersect ids in the sub-second range for most cases. What would you suggest? I code in java, but do not limit myself to java for the resolution of this problem (I could use JNI to bridge to native language).
Potential solutions to consider
Although you could not limit yourselves to the following list of broad considerations for solutions, here is a list of what I internally debated to resolve the situation.
Neural-Network pre-qualifier: Train a neural-network for each ids list that accepts another list of ids to score its intersection potential (0 means definitely no intersection, 1 means definitely there is an intersection). Since neural networks are good and efficient at pattern recognition, I am thinking of pre-qualifying a more time-consuming algorithm behind it.
Assembly-language: On a Linux server, code an assembly module that does such algorithm. I know that assembly is a mess to maintain and code, but sometimes one need the speed of an highly optimized algorithm without the overhead of a higher-level compiler. Maybe this use-case is simple enough to benefit from an assembly language routine to be executed directly on the Linux server (and then I'd always pay attention to stick with the same processor to avoid having to re-write this too often)? Or, alternately, maybe C would be close enough to assembly to produce clean and optimized assembly code without the overhead to maintain assembly code.
Images and GPU: GPU and image processing could be used and instead of comparing ids, I could BITAND images. That is, I create a B&W image of each ids list. Since each id have unique values between -100,000,000 and +100,000,000 (where a maximum of 56,000,000 of them are used), the image would be mostly black, but the pixel would become white if the corresponding id is set. Then, instead of keeping the list of ids, I'd keep the images, and do a BITAND operation on both images to intersect them. This may be fast indeed, but then to translate the resulting image back to ids may be the bottleneck. Also, each image could be significantly large (maybe too large for this to be a viable solution). An estimate of a 200,000,000 bits sequence is 23MB each, just loading this in memory is quite demanding.
String-matching algorithms: String comparisons have many adapted algorithms that are typically extremely efficient at their task. Create a binary file for each ids set. Each id would be 4 bytes long. The corresponding binary file would have each and every id sequenced as their 4 bytes equivalent into it. The algorithm could then be to process the smallest file to match each 4 bytes sequence as a string into the other file.
Am I missing anything? Any other potential solution? Could any of these approaches be worth diving into them?
I did not yet try anything as I want to secure a strategy before I invest what I believe will be a significant amount of time into this.
EDIT #1:
Could the solution be a map of hashes for each sector in the list? If the information is structured in such a way that each id resides within its corresponding hash key, then, the smaller of the ids set could be sequentially ran and matching the id into the larger ids set first would require hashing the value to match, and then sequentially matching of the corresponding ids into that key match?
This should make the algorithm an O(n) time based one, and since I'd pick the smallest ids set to be the sequentially ran one, n is small. Does that make sense? Is that the solution?
Something like this (where the H entry is the hash):
{
"H780" : [ 45902780, 46062780, -42912780, -19812780, 25323780, 40572780, -30131780, 60266780, -26203780, 46152780, 67216780, 71666780, -67146780, 46162780, 67226780, 67781780, -47021780, 46122780, 19973780, 22113780, 67876780, 42692780, -18473780, 30993780, 67711780, 67791780, -44036780, -45904780, -42142780, 18703780, 60276780, 46182780, 63600780, 63680780, -70486780, -68290780, -18493780, -68210780, 67731780, 46092780, 63450780, 30074780, 24772780, -26483780, 68371780, -18483780, 18723780, -29834780, 46202780, 67821780, 29594780, 46082780, 44632780, -68406780, -68310780, -44056780, 67751780, 45912780, 40842780, 44642780, 18743780, -68220780, -44066780, 46142780, -26193780, 67681780, 46222780, 67761780 ],
"H782" : [ 27343782, 67456782, 18693782, 43322782, -37832782, 46152782, 19113782, -68411782, 18763782, 67466782, -68400782, -68320782, 34031782, 45056782, -26713782, -61776782, 67791782, 44176782, -44096782, 34041782, -39324782, -21873782, 67961782, 18703782, 44186782, -31143782, 67721782, -68340782, 36103782, 19143782, 19223782, 31711782, 66350782, 43362782, 18733782, -29233782, 67811782, -44076782, -19623782, -68290782, 31721782, 19233782, 65726782, 27313782, 43352782, -68280782, 67346782, -44086782, 67741782, -19203782, -19363782, 29583782, 67911782, 67751782, 26663782, -67910782, 19213782, 45992782, -17201782, 43372782, -19992782, -44066782, 46142782, 29993782 ],
"H540" : [...
You can convert each file (list of ids) into a bit-array of length 200_000_001, where bit at index j is set if the list contains value j-100_000_000. It is possible, because the range of id values is fixed and small.
Then you can simply use bitwise and and not operations to intersect and negate lists of ids. Depending on the language and libraries used, it would require operating element-wise: iterating over arrays and applying corresponding operations to each index.
Finally, you should measure your performance and decide whether you need to do some optimizations, such as parallelizing operations (you can work on different parts of arrays on different processors), preloading some of arrays (or all of them) into memory, using GPU, etc.
First, the bitmap approach will produce the required performance, at a huge overhead in memory. You'll need to benchmark it, but I'd expect times of maybe 0.2 seconds, with that almost entirely dominated by the cost of loading data from disk, and then reading the result.
However there is another approach that is worth considering. It will use less memory most of the time. For most of the files that you state, it will perform well.
First let's use Cap'n Proto for a file format. The type can be something like this:
struct Ids {
is_negated #0 :Bool;
ids #1 :List(Int32);
}
The key is that ids are always kept sorted. So list operations are a question of running through them in parallel. And now:
Applying not is just flipping is_negated.
If neither is negated, it is a question of finding IDs in both lists.
If the first is not negated and the second is, you just want to find IDs in the first that are not in the second.
If the first is negated and the second is not, you just want to find IDs in the second that are not in the first.
If both are negated, you just want to find all ids in either list.
If your list has 100k entries, then the file will be about 400k. A not requires copying 400k of data (very fast). And intersecting with another list of the same size involves 200k comparisons. Integer comparisons complete in a clock cycle, and branch mispredictions take something like 10-20 clock cycles. So you should be able to do this operation in the 0-2 millisecond range.
Your worst case 56,000,000 file will take over 200 MB and intersecting 2 of them can take around 200 million operations. This is in the 0-2 second range.
For the 56 million file and a 10k file, your time is almost all spent on numbers in the 56 million file and not in the 10k one. You can speed that up by adding a "galloping" mode where you do a binary search forward in the larger file looking for the next matching number and picking most of them. Do be warned that this code tends to be tricky and involves lots of mispredictions. You'll have to benchmark it to find out how big a size difference is needed.
In general this approach will lose for your very biggest files. But it will be a huge win for most of the sizes of file that you've talked about.

ElasticNet extremely slow

I am runnning an Elastic net model using sklearn. My dataset has 70k observations and 20 features. I want to test different parameters and use the following code:
alpha_plot, l1_ratio_plot = np.linspace(min_xlim, max_xlim, 50), np.linspace(0, 1, 10)
alpha_grid, l1_ratio_grid = np.meshgrid(alpha_plot, l1_ratio_plot)
l1_ratio_alpha_grid = np.array([l1_ratio_grid.ravel(), alpha_grid.ravel()]).T
model_coefficients_analysis = []
for i in l1_ratio_alpha_grid:
model_analysis = ElasticNet(alpha=i[1], l1_ratio=i[0], fit_intercept=True, max_iter=10000).fit(self.features_train_std, self.labels_train)
model_coefficients_analysis.append(model_analysis.coef_)
I am aware that this can be done with GridsearchCV but it doesn't do the job for me as I need to store the coefficients for every combination of parameters tested. The current code snippet is exceptionally slow. It takes roughly 10 minutes for each of the 50*10 iterations. Is there a way to speed up the process? For example in GridsearchCV there is a parameter n_jobs which can be set equal to -1 to speed up the process. But here I do not seem to find it.
It takes roughly 10 minutes for each of the 50*10 iterations
That seems very high, but you also have rather large data; I can't fit a randomized such dataset in memory in Colab (where I usually run examples for answers here). You might not be able to shrink the first fit time very much, but maybe you can reduce the subsequent fit times by using warm-starting.
Setting warm_start=True and using the same model object for each iteration, the coefficients will be saved as a starting point for the solver in the next iteration:
model_analysis = ElasticNet(fit_intercept=True, max_iter=10000)
for i in l1_ratio_alpha_grid:
model_analysis.set_params(alpha=i[1], l1_ratio=i[0])
model_analysis.fit(self.features_train_std, self.labels_train)
model_coefficients_analysis.append(model_analysis.coef_)
You might consider using ElasticNetCV, since it uses warm-starting internally, and it provides some other niceties. You can use a PredefinedSplit if adding k-fold cross-validation is too much of an added expense, but I believe the n_jobs parameter is only useful in splitting up jobs across hyperparameters and folds, so using more cores might mitigate the issues with k-fold (but then you'll also have k times as many coefficients).
Your large max_iter is a bit worrying; do you get nonconvergence? From your independent variable name it seems like you're scaling, but if not that's the place to start: fast (and maybe correct) convergence depends on features with similar scales. You might also consider increasing the convergence criterion tol. I have no experience with the selection parameter, but the docstring suggests changing it to random may speed up convergence?

Assessing doc2vec accuracy

I am trying to assess a doc2vec model based on the code from here. Basically, I want to know the percentual of inferred documents are found to be most similar to itself. This is my current code an:
for doc_id, doc in enumerate(cur.execute('SELECT Text FROM Patents')):
docs += 1
doc = clean_text(doc)
inferred_vector = model.infer_vector(doc)
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims].index(doc_id)
ranks.append(rank)
counter = collections.Counter(ranks)
accuracy = counter[0] / docs
This code works perfectly with smaller datasets. However, since I have a huge file with millions of documents, this code becomes too slow, it would take months to compute. I profiled my code and most of the time is consumed by the following line: sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs)).
If I am not mistaken, this is having to measure each document to every other document. I think computation time might be massively reduced if I change this to topn=1 instead since the only thing I want to know is if the most similar document is itself or not. Doing this will basically take each doc (i.e., inferred_vector), measure its most similar document (i.e., topn=1), and then I just see if it is itself or not. How could I implement this? Any help or idea is welcome.
To have most_similar() return only the single most-similar document, that is as simple as specifying topn=1.
However, to know which one document of the millions is the most-similar to a single target vector, the similarities to all the candidates must be calculated & sorted. (If even one document was left out, it might have been the top-ranked one!)
Making sure absolutely no virtual-memory swapping is happening will help ensure that brute-force comparison happens as fast as possible, all in RAM – but with millions of docs, it will still be time-consuming.
What you're attempting is a fairly simple "self-check" as to whether training led to self-consistent model: whether the re-inference of a document creates a vector "very close to" the same doc-vector left over from bulk training. Failing that will indicate some big problems in doc-prep or training, but it's not a true measure of the model's "accuracy" for any real task, and the model's value is best evaluated against your intended use.
Also, because this "re-inference self-check" is just a crude sanity check, there's no real need to do it for every document. Picking a thousand (or ten thousand, or whatever) random documents will give you a representative idea of whether most of the re-inferred vectors have this quality, or not.
Similarly, you could simply check the similarity of the re-inferred vector against the single in-model vector for that same document-ID, and check whether they are "similar enough". (This will be much faster, but could also be done on just a random sample of docs.) There's no magic proper threshold for "similar enough"; you'd have to pick one that seems to match your other goals. For example, using scikit-learn's cosine_similarity() to compare the two vectors:
from sklearn.metrics.pairwise import cosine_similarity
# ...
inferred_vector = model.infer_vector(doc_words)
looked_up_vector = model.dv[doc_id]
self_similarity = cosine_similarity([inferred_vector], [looked_up_vector])[0]
# then check that value against some threshold
(You have to wrap the single vectors in lists as arguments to cosine_similarity(), then access the 0th element of the return value, because it is designed to usually work on larger lists of vectors.)
With this calculation, you wouldn't know if, for example, some of the other stored-doc-vectors are a little closer to your inferred target - but that may not be that important, anyway. The docs might be really similar! And while the original "closest to itself" self-check will fail miserably if there were major defects in training, even a well-trained model will likely have some cases where natural model jitter prevents a "closest to itself" for every document. (With more documents inside the same number of dimensions, or certain corpuses with lots of very-similar documents, this would become more common... but not be a concerning indicator of any model problems.)

Speed Up Gensim's Word2vec for a Massive Dataset

I'm trying to build a Word2vec (or FastText) model using Gensim on a massive dataset which is composed of 1000 files, each contains ~210,000 sentences, and each sentence contains ~1000 words. The training was made on a 185gb RAM, 36-core machine.
I validated that
gensim.models.word2vec.FAST_VERSION == 1
First, I've tried the following:
files = gensim.models.word2vec.PathLineSentences('path/to/files')
model = gensim.models.word2vec.Word2Vec(files, workers=-1)
But after 13 hours I decided it is running for too long and stopped it.
Then I tried building the vocabulary based on a single file, and train based on all 1000 files as follows:
files = os.listdir['path/to/files']
model = gensim.models.word2vec.Word2Vec(min_count=1, workers=-1)
model.build_vocab(corpus_file=files[0])
for file in files:
model.train(corpus_file=file, total_words=model.corpus_total_words, epochs=1)
But I checked a sample of word vectors before and after the training, and there was no change, which means no actual training was done.
I can use some advise on how to run it quickly and successfully. Thanks!
Update #1:
Here is the code to check vector updates:
file = 'path/to/single/gziped/file'
total_words = 197264406 # number of words in 'file'
total_examples = 209718 # number of records in 'file'
model = gensim.models.word2vec.Word2Vec(iter=5, workers=12)
model.build_vocab(corpus_file=file)
wv_before = model.wv['9995']
model.train(corpus_file=file, total_words=total_words, total_examples=total_examples, epochs=5)
wv_after = model.wv['9995']
so the vectors: wv_before and wv_after are exactly the same
There's no facility in gensim's Word2Vec to accept a negative value for workers. (Where'd you get the idea that would be meaningful?)
So, it's quite possible that's breaking something, perhaps preventing any training from even being attempted.
Was there sensible logging output (at level INFO) suggesting that training was progressing in your trial runs, either against the PathLineSentences or your second attempt? Did utilities like top show busy threads? Did the output suggest a particular rate of progress & let you project-out a likely finishing time?
I'd suggest using a positive workers value and watching INFO-level logging to get a better idea what's happening.
Unfortunately, even with 36 cores, using a corpus iterable sequence (like PathLineSentences) puts gensim Word2Vec in a model were you'll likely get maximum throughput with a workers value in the 8-16 range, using far less than all your threads. But it will do the right thing, on a corpus of any size, even if it's being assembled by the iterable sequence on-the-fly.
Using the corpus_file mode can saturate far more cores, but you should still specify the actual number of worker threads to use – in your case, workers=36 – and it is designed to work on from a single file with all data.
Your code which attempts to train() many times with corpus_file has lots of problems, and I can't think of a way to adapt corpus_file mode to work on your many files. Some of the problems include:
you're only building the vocabulary from the 1st file, which means any words only appearing in other files will be unknown and ignored, and any of the word-frequency-driven parts of the Word2Vec algorithm may be working on unrepresentative
the model builds its estimate of the expected corpus size (eg: model.corpus_total_words) from the build_vocab() step, so every train() will behave as if that size is the total corpus size, in its progress-reporting & management of the internal alpha learning-rate decay. So those logs will be wrong, the alpha will be mismanaged in a fresh decay each train(), resulting in a nonsensical jigsaw up-and-down alpha over all files.
you're only iterating over each file's contents once, which isn't typical. (It might be reasonable in a giant 210-billion word corpus, though, if every file's text is equally and randomly representative of the domain. In that case, the full corpus once might be as good as iterating over a corpus that's 1/5th the size 5 times. But it'd be a problem if some words/patterns-of-usage are all clumped in certain files - the best training interleaves contrasting examples throughout each epoch and all epochs.)
min_count=1 is almost always unwise with this algorithm, and especially so in large corpora of typical natural-language word frequencies. Rare words, and especially those appearing only once or a few times, make the model gigantic but those words won't get good word-vectors, and keeping them in acts like noise interfering with the improvement of other more-common words.
I recommend:
Try the corpus iterable sequence mode, with logging and a sensible workers value, to at least get an accurate read of how long it might take. (The longest step will be the initial vocabulary scan, which is essentially single-threaded and must visit all data. But you can .save() the model after that step, to then later re-.load() it, tinker with settings, and try different train() approaches without repeating the slow vocabulary survey.)
Try aggressively-higher values of min_count (discarding more rare words for a smaller model & faster training). Perhaps also try aggressively-smaller values of sample (like 1e-05, 1e-06, etc) to discard a larger fraction of the most-frequent words, for faster training that also often improves overall word-vector quality (by spending relatively more effort on less-frequent words).
If it's still too slow, consider if you could using a smaller subsample of your corpus might be enough.
Consider the corpus_file method if you can roll much or all of your data into the single file it requires.

Fastest algorithm to detect duplicate files

In the process of finding duplicates in my 2 terabytes of HDD stored images I was astonished about the long run times of the tools fslint and fslint-gui.
So I analyzed the internals of the core tool findup which is implemented as very well written and documented shell script using an ultra-long pipe. Essentially its based on find and hashing (md5 and SHA1).
The author states that it was faster than any other alternative which I couldn't believe. So I found Detecting duplicate files where the topic quite fast slided towards hashing and comparing hashes which is not the best and fastest way in my opinion.
So the usual algorithm seems to work like this:
generate a sorted list of all files (path, Size, id)
group files with the exact same size
calculate the hash of all the files with a same size and compare the hashes
same has means identical files - a duplicate is found
Sometimes the speed gets increased by first using a faster hash algorithm (like md5) with more collision probability and second if the hash is the same use a second slower but less collision-a-like algorithm to prove the duplicates. Another improvement is to first only hash a small chunk to sort out totally different files.
So I've got the opinion that this scheme is broken in two different dimensions:
duplicate candidates get read from the slow HDD again (first chunk) and again (full md5) and again (sha1)
by using a hash instead just comparing the files byte by byte we introduce a (low) probability of a false negative
a hash calculation is a lot slower than just byte-by-byte compare
I found one (Windows) app which states to be fast by not using this common hashing scheme.
Am I totally wrong with my ideas and opinion?
[Update]
There seems to be some opinion that hashing might be faster than comparing. But that seems to be a misconception out of the general use of "hash tables speed up things". But to generate a hash of a file the first time the files needs to be read fully byte by byte. So there a byte-by-byte-compare on the one hand, which only compares so many bytes of every duplicate-candidate function till the first differing position. And there is the hash function which generates an ID out of so and so many bytes - lets say the first 10k bytes of a terabyte or the full terabyte if the first 10k are the same. So under the assumption that I don't usually have a ready calculated and automatically updated table of all files hashes I need to calculate the hash and read every byte of duplicates candidates. A byte-by-byte compare doesn't need to do this.
[Update 2]
I've got a first answer which again goes into the direction: "Hashes are generally a good idea" and out of that (not so wrong) thinking trying to rationalize the use of hashes with (IMHO) wrong arguments. "Hashes are better or faster because you can reuse them later" was not the question.
"Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total." is skewed in favor of hashes and wrong (IMHO) too. Why can't I just read a block from each same-size file and compare it in memory? If I have to compare 100 files I open 100 file handles and read a block from each in parallel and then do the comparison in memory. This seams to be a lot faster then to update one or more complicated slow hash algorithms with these 100 files.
[Update 3]
Given the very big bias in favor of "one should always use hash functions because they are very good!" I read through some SO questions on hash quality e.g. this:
Which hashing algorithm is best for uniqueness and speed? It seams that common hash functions more often produce collisions then we think thanks to bad design and the birthday paradoxon. The test set contained: "A list of 216,553 English words (in lowercase),
the numbers "1" to "216553" (think ZIP codes, and how a poor hash took down msn.com) and 216,553 "random" (i.e. type 4 uuid) GUIDs". These tiny data sets produced from arround 100 to nearly 20k collisions. So testing millions of files on (in)equality only based on hashes might be not a good idea at all.
I guess I need to modify 1 and replace the md5/sha1 part of the pipe with "cmp" and just measure times. I keep you updated.
[Update 4]
Thanks for alle the feedback. Slowly we are converting. Background is what I observed when fslints findup had running on my machine md5suming hundreds of images. That took quite a while and HDD was spinning like hell. So I was wondering "what the heck is this crazy tool thinking in destroying my HDD and taking huge amounts of time when just comparing byte-by-byte" is 1) less expensive per byte then any hash or checksum algorithm and 2) with a byte-by-byte compare I can return early on the first difference so I save tons of time not wasting HDD bandwidth and time by reading full files and calculating hashs over full files. I still think thats true - but: I guess I didn't catch the point that a 1:1 comparison (if (file_a[i] != file_b[i]) return 1;) might be cheaper than is hashing per byte. But complexity wise hashing with O(n) may win when more and files need to be compared against each other. I have set this problem on my list and plan to either replace the md5 part of findup's fslint with cmp or enhance pythons filecmp.py compare lib which only compares 2 files at once with a multiple files option and maybe a md5hash version.
So thank you all for the moment.
And generally the situation is like you guys say: the best way (TM) totally depends on the circumstances: HDD vs SSD, likelyhood of same length files, duplicate files, typical files size, performance of CPU vs. Memory vs. Disk, Single vs. Multicore and so on. And I learned that I should considder more often using hashes - but I'm an embedded developer with most of the time very very limited resources ;-)
Thanks for all your effort!
Marcel
The fastest de-duplication algorithm will depend on several factors:
how frequent is it to find near-duplicates? If it is extremely frequent to find hundreds of files with the exact same contents and a one-byte difference, this will make strong hashing much more attractive. If it is extremely rare to find more than a pair of files that are of the same size but have different contents, hashing may be unnecessary.
how fast is it to read from disk, and how large are the files? If reading from the disk is very slow or the files are very small, then one-pass hashes, however cryptographically strong, will be faster than making small passes with a weak hash and then a stronger pass only if the weak hash matches.
how many times are you going to run the tool? If you are going to run it many times (for example to keep things de-duplicated on an on-going basis), then building an index with the path, size & strong_hash of each and every file may be worth it, because you would not need to rebuild it on subsequent runs of the tool.
do you want to detect duplicate folders? If you want to do so, you can build a Merkle tree (essentially a recursive hash of the folder's contents + its metadata); and add those hashes to the index too.
what do you do with file permissions, modification date, ACLs and other file metadata that excludes the actual contents? This is not related directly to algorithm speed, but it adds extra complications when choosing how to deal with duplicates.
Therefore, there is no single way to answer the original question. Fastest when?
Assuming that two files have the same size, there is, in general, no fastest way to detect whether they are duplicates or not than comparing them byte-by-byte (even though technically you would compare them block-by-block, as the file-system is more efficient when reading blocks than individual bytes).
Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total. Even if it takes k times as much to hash than to compare byte-by-byte, hashing is better when k > (n-1)/2. Hashes may yield false-positives (although strong hashes will only do so with astronomically low probabilities), but testing those byte-by-byte will only increment k by at most 1. With k=3, you will be ahead as soon as n>=7; with a more conservative k=2, you reach break-even with n=3. In practice, I would expect k to be very near to 1: it will probably be more expensive to read from disk than to hash whatever you have read.
The probability that several files will have the same sizes increases with the square of the number of files (look up birthday paradox). Therefore, hashing can be expected to be a very good idea in the general case. It is also a dramatic speedup in case you ever run the tool again, because it can reuse an existing index instead of building it anew. So comparing 1 new file to 1M existing, different, indexed files of the same size can be expected to take 1 hash + 1 lookup in the index, vs. 1M comparisons in the no-hashing, no-index scenario: an estimated 1M times faster!
Note that you can repeat the same argument with a multilevel hash: if you use a very fast hash with, say, the 1st, central and last 1k bytes, it will be much faster to hash than to compare the files (k < 1 above) - but you will expect collisions, and make a second pass with a strong hash and/or a byte-by-byte comparison when found. This is a trade-off: you are betting that there will be differences that will save you the time of a full hash or full compare. I think it is worth it in general, but the "best" answer depends on the specifics of the machine and the workload.
[Update]
The OP seems to be under the impression that
Hashes are slow to calculate
Fast hashes produce collisions
Use of hashing always requires reading the full file contents, and therefore is overkill for files that differ in their 1st bytes.
I have added this segment to counter these arguments:
A strong hash (sha1) takes about 5 cycles per byte to compute, or around 15ns per byte on a modern CPU. Disk latencies for a spinning hdd or an ssd are on the order of 75k ns and 5M ns, respectively. You can hash 1k of data in the time that it takes you to start reading it from an SSD. A faster, non-cryptographic hash, meowhash, can hash at 1 byte per cycle. Main memory latencies are at around 120 ns - there's easily 400 cycles to be had in the time it takes to fulfill a single access-noncached-memory request.
In 2018, the only known collision in SHA-1 comes from the shattered project, which took huge resources to compute. Other strong hashing algorithms are not much slower, and stronger (SHA-3).
You can always hash parts of a file instead of all of it; and store partial hashes until you run into collisions, which is when you would calculate increasingly larger hashes until, in the case of a true duplicate, you would have hashed the whole thing. This gives you much faster index-building.
My points are not that hashing is the end-all, be-all. It is that, for this application, it is very useful, and not a real bottleneck: the true bottleneck is in actually traversing and reading parts of the file-system, which is much, much slower than any hashing or comparing going on with its contents.
The most important thing you're missing is that comparing two or more large files byte-for-byte while reading them from a real spinning disk can cause a lot of seeking, making it vastly slower than hashing each individually and comparing the hashes.
This is, of course, only true if the files actually are equal or close to it, because otherwise a comparison could terminate early. What you call the "usual algorithm" assumes that files of equal size are likely to match. That is often true for large files generally.
But...
When all the files of the same size are small enough to fit in memory, then it can indeed be a lot faster to read them all and compare them without a cryptographic hash. (an efficient comparison will involve a much simpler hash, though).
Similarly when the number of files of a particular length is small enough, and you have enough memory to compare them in chunks that are big enough, then again it can be faster to compare them directly, because the seek penalty will be small compared to the cost of hashing.
When your disk does not actually contain a lot of duplicates (because you regularly clean them up, say), but it does have a lot of files of the same size (which is a lot more likely for certain media types), then again it can indeed be a lot faster to read them in big chunks and compare the chunks without hashing, because the comparisons will mostly terminate early.
Also when you are using an SSD instead of spinning platters, then again it is generally faster to read + compare all the files of the same size together (as long as you read appropriately-sized blocks), because there is no penalty for seeking.
So there are actually a fair number of situations in which you are correct that the "usual" algorithm is not as fast as it could be. A modern de-duping tool should probably detect these situations and switch strategies.
Byte-by-byte comparison may be faster if all file groups of the same size fit in physical memory OR if you have a very fast SSD. It also may still be slower depending on the number and nature of the files, hashing functions used, cache locality and implementation details.
The hashing approach is a single, very simple algorithm that works on all cases (modulo the extremely rare collision case). It scales down gracefully to systems with small amounts of available physical memory. It may be slightly less than optimal in some specific cases, but should always be in the ballpark of optimal.
A few specifics to consider:
1) Did you measure and discover that the comparison within file groups was the expensive part of the operation? For a 2TB HDD walking the entire file system can take a long time on its own. How many hashing operations were actually performed? How big were the file groups, etc?
2) As noted elsewhere, fast hashing doesn't necessarily have to look at the whole file. Hashing some small portions of the file is going to work very well in the case where you have sets of larger files of the same size that aren't expected to be duplicates. It will actually slow things down in the case of a high percentage of duplicates, so it's a heuristic that should be toggled based on knowledge of the files.
3) Using a 128 bit hash is probably sufficient for determining identity. You could hash a million random objects a second for the rest of your life and have better odds of winning the lottery than seeing a collision. It's not perfect, but pragmatically you're far more likely to lose data in your lifetime to a disk failure than a hash collision in the tool.
4) For a HDD in particular (a magnetic disk), sequential access is much faster than random access. This means a sequential operation like hashing n files is going to be much faster than comparing those files block by block (which happens when they don't fit entirely into physical memory).

Resources