Find duplicate images algorithm [closed] - image

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I want to create a program that find the duplicate images into a directory, something like this app does and I wonder what would be the algorithm to determine if two images are the same.
Any suggestion is welcome.

This task can be solved by perceptual-hashing, depending on your use-case, combined with some data-structure responsible for nearest-neighbor search in high-dimensions (kd-tree, ball-tree, ...) which can replace the brute-force search (somewhat).
There are tons of approaches for images: DCT-based, Wavelet-based, Statistics-based, Feature-based, CNNs (and more).
Their designs are usually based on different assumptions about the task, e.g. rotation allowed or not?
A google scholar search on perceptual image hashing will list a lot of papers. You can also look for the term image fingerprinting.
Here is some older ugly python/cython code doing the statistics-based approach.
Remark: Digikam can do that for you too. It's using some older Haar-wavelet based approach i think.

Related

What machine learning algorithm to use for face matching? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 5 years ago.
Improve this question
I want your opinion on what would be the best machine learning algorithm, or even better, library, to use if I wanted to match two faces that look similar. Kind of like how google photos can put photos of the same people in their own album automatically. What's the best way to tackle this?
Face-based user identification isn't only one algorithm, it's a whole process that's still in active research. One way to experiment with it, is to follow these four steps:
Face region detection/extraction using the Histogram of Oriented Gradients (HOG)
Centralize eyes and lips using face landmark estimation
Image encoding using a CNN Model (openface for instance)
Image Search using a voisinage algorithm (KNN, LSHForest, etc)
Here's a blog article that draws a nice walk through the steps required to do face-based user identification:
machine learning is fun part 4 modern face recognition with deep learning

Split text files into two groups - unsupervised learning [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
Imagine, you are a librarian and during time you
have classified a bunch of text files (approx 100)
with a general ambiguous keyword.
Every text file is actually a topic of keyword_meaning1
or a topic of keyword_meaning2.
Which unsupervised learning approach would you use,
to split the text files into two groups?
What precision (in percentage) of correct classification
can be achieved according to a number of text files?
Or can be somehow indicated in one group, that there is
a need of a librarian to check certain files, because
they may be classifed incorrectly?
The easiest starting point would be to use a naive Bayes classifier. It's hard to speculate about the expected precision. You have to test it yourself. Just get a program for e-mail spam detection and try it out. For example, SpamBayes (http://spambayes.sourceforge.net/) is a quite good starting point and easily hackable. SpamBayes has a nice feature that it will label messages as "unsure" when there is no clear separation between two classes.
Edit: When you really want unsupervised clustering method, then perhaps something like Carrot2 (http://project.carrot2.org/) is more appropriate.

Find plagiarism in bulk articles [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I have a 20,000 collection of master articles and I will get about 400,000 articles of one or two pages everyday. Now, I am trying to see if each one of this 400k articles are a copy or modified version of my collection of master articles (a threshold of above 60% plagiarism is fine with me)
What are the algorithms and technologies I should use to tackle the problem in a very efficient and timely manner.
Thanks
Fingerprint the articles (i.e. intelligently hash them based on the word frequency) and then look for statistical connection between the fingerprints. Then if there is a hunch on some of the data set, do a brute force search for matching strings on those.

Algorithms under Plagiarism detection machines [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I'm very impressed to how plagiarism checkers (such as Turnitin website ) works. But how do they do that ? In a very effective way, I'm new to this area thus is there any word matching algorithm or anything that is similar to that is used for detecting alike sentences?
Thank you very much.
I'm sure many real-world plagiarism detection systems use more sophisticated schemes, but the general class of problem of detecting how far apart two things are is called the edit distance. That link includes links to many common algorithms used for this purpose. The gist is effectively answering the question "How many edits must I perform to turn one input into the other?". The challenge for real-world systems is performing this across a large corpus in an efficient manner. A related problem is the longest common subsequence, which might also be useful for such schemes to identify passages that are copied verbatim.

Google similar images algorithm [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 9 years ago.
Improve this question
Does any one have an idea regarding what sort of algorithm might Google be using to find similar images ?
No, but they could be using SIFT.
I'm not sure this has much to do with image processing. When I ask for "similar images" of the Eiffel tower, I get a bunch of photos of Paris Hilton, and street maps from Paris. Curiously, all of these images have the word "Paris" in the file name.
Currently the Google Image Search provides these filtering options:
Image size
Face detection
Continuous-tone ("Photo") vs. Smooth shading ("Clipart") vs. bitonal("Line drawing")
Color histogram
These options can be seen in its Image Search Result page.
I don't know about faces, but see at least:
http://www.incm.cnrs-mrs.fr/LaurentPerrinet/Publications/Perrinet08spie
Compare two images the python/linux way
I have heard, that one should use this when comparing images
(I mean: make the prob model, calc. the probs, use this):
http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Or then it might even be one of those PCFG things that MIT people tend to use with robotics stuff. One I read used some sort of PCFG model made of basic shapes (that you can rotate magically) and searched the best match with
http://en.wikipedia.org/wiki/Inside-outside_algorithm

Resources