Near-Duplicate Image Detection [closed]

Near-Duplicate Image Detection [closed] - image

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
What's a fast way to sort a given set of images by their similarity to each other.
At the moment I have a system that does histogram analysis between two images, but this is a very expensive operation and seems too overkill.
Optimally I am looking for a algorithm that would give each image a score (for example a integer score, such as the RGB Average) and I can just sort by that score. Identical Scores or scores next to each other are possible duplicates.
0299393
0599483
0499994 <- possible dupe
0499999 <- possible dupe
1002039
4995994
6004994
RGB Average per image sucks, is there something similar?

There has been a lot of research on image searching and similarity measures. It's not an easy problem. In general, a single int won't be enough to determine if images are very similar. You'll have a high false-positive rate.
However, since there has been a lot of research done, you might take a look at some of it. For example, this paper (PDF) gives a compact image fingerprinting algorithm that is suitable for finding duplicate images quickly and without storing much data. It seems like this is the right approach if you want something robust.
If you're looking for something simpler, but definitely more ad-hoc, this SO question has a few decent ideas.

I would recommend considering moving away from just using an RGB histogram.
A better digest of your image can be obtained if you take a 2d Haar wavelet of the image (its a lot easier than it sounds, its just a lot of averaging and some square roots used to weight your coefficients) and just retain the k largest weighted coefficients in the wavelet as a sparse vector, normalize it, and save that to reduce its size. You should rescale R G and B using perceptual weights beforehand at least or I'd recommend switching to YIQ (or YCoCg, to avoid quantization noise) so you can sample chrominance information with reduced importance.
You can now use the dot product of two of these sparse normalized vectors as a measure of similarity. The image pairs with the largest dot products are going to be very similar in structure. This has the benefit of being slightly resistant to resizing, hue shifting and watermarking, and being really easy to implement and compact.
You can trade off storage and accuracy by increasing or decreasing k.
Sorting by a single numeric score is going to be intractable for this sort of classification problem. If you think about it it would require images to only be able to 'change' along one axis, but they don't. This is why you need a vector of features. In the Haar wavelet case its approximately where the sharpest discontinuities in the image occur. You can compute a distance between images pairwise, but since all you have is a distance metric a linear ordering has no way to express a 'triangle' of 3 images that are all equally distant. (i.e. think of an image that is all green, an image that is all red and an image that is all blue.)
That means that any real solution to your problem will need O(n^2) operations in the number of images you have. Whereas if it had been possible to linearize the measure, you could require just O(n log n), or O(n) if the measure was suitable for, say, a radix sort. That said, you don't need to spend O(n^2) since in practice you don't need to sift through the whole set, you just need to find the stuff thats nearer than some threshold. So by applying one of several techniques to partition your sparse vector space you can obtain much faster asymptotics for the 'finding me k of the images that are more similar than a given threshold' problem than naively comparing every image against every image, giving you what you likely need... if not precisely what you asked for.
In any event, I used this a few years ago to good effect personally when trying to minimize the number of different textures I was storing, but there has also been a lot of research noise in this space showing its efficacy (and in this case comparing it to a more sophisticated form of histogram classification):
http://www.cs.princeton.edu/cass/papers/spam_ceas07.pdf
If you need better accuracy in detection, the minHash and tf-idf algorithms can be used with the Haar wavelet (or the histogram) to deal with edits more robustly:
http://cmp.felk.cvut.cz/~chum/papers/chum_bmvc08.pdf
Finally, Stanford has an image search based on a more exotic variant of this kind of approach, based on doing more feature extraction from the wavelets to find rotated or scaled sections of images, etc, but that probably goes way beyond the amount of work you'd want to do.
http://wang14.ist.psu.edu/cgi-bin/zwang/regionsearch_show.cgi

I implemented a very reliable algorithm for this called Fast Multiresolution Image Querying. My (ancient, unmaintained) code for that is here.
What Fast Multiresolution Image Querying does is split the image into 3 pieces based on the YIQ colorspace (better for matching differences than RGB). Then the image is essentially compressed using a wavelet algorithm until only the most prominent features from each colorspace are available. These points are stored in a data structure. Query images go through the same process, and the prominent features in the query image are matched against those in the stored database. The more matches, the more likely the images are similar.
The algorithm is often used for "query by sketch" functionality. My software only allowed entering query images via URL, so there was no user interface. However, I found it worked exceptionally well for matching thumbnails to the large version of that image.
Much more impressive than my software is retrievr which lets you try out the FMIQ algorithm using Flickr images as the source. Very cool! Try it out via sketch or using a source image, and you can see how well it works.

A picture has many features, so unless you narrow yourself to one, like average brightness, you are dealing with an n-dimensional problem space.
If I asked you to assign a single integer to the cities of the world, so I could tell which ones are close, the results wouldn't be great. You might, for example, choose time zone as your single integer and get good results with certain cities. However, a city near the north pole and another city near the south pole can also be in the same time zone, even though they are at opposite ends of the planet. If I let you use two integers, you could get very good results with latitude and longitude. The problem is the same for image similarity.
All that said, there are algorithms that try to cluster similar images together, which is effectively what you're asking for. This is what happens when you do face detection with Picasa. Even before you identify any faces, it clusters similar ones together so that it's easy to go through a set of similar faces and give most of them the same name.
There is also a technique called Principle Component Analysis, which lets you reduce n-dimensional data down to any smaller number of dimensions. So a picture with n features could be reduced to one feature. However, this is still not the best approach for comparing images.

There's a C library ("libphash" - http://phash.org/) that will calculate a "perceptual hash" of an image and allow you to detect similar images by comparing hashes (so you don't have to compare each image directly against every other image) but unfortunately it didn't seem to be very accurate when I tried it.

You have to decide what is "similar." Contrast? Hue?
Is a picture "similar" to the same picture upside-down?
I bet you can find a lot of "close calls" by breaking images up into 4x4 pieces and getting an average color for each grid cell. You'd have sixteen scores per image. To judge similarity, you would just do a sum of squares of differences between images.
I don't think a single hash makes sense, unless it's against a single concept like hue, or brightness, or contrast.
Here's your idea:
0299393
0599483
0499994 <- possible dupe
0499999 <- possible dupe
1002039
4995994
6004994
First of all, I'm going to assume these are decimal numbers that are R*(2^16)+G*(2^8)+B, or something like that. Obviously that's no good because red is weighted inordinately.
Moving into HSV space would be better. You could spread the bits of HSV out into the hash, or you could just settle H or S or V individually, or you could have three hashes per image.
One more thing. If you do weight R, G, and B. Weight green highest, then red, then blue to match human visual sensitivity.

In the age of web services you could try http://tineye.com

The question Good way to identify similar images? seems to provide a solution for your question.

i assumed that other duplicate image search software performs an FFT on the images, and stores the values of the different frequencies as a vectors:
Image1 = (u1, u2, u3, ..., un)
Image2 = (v1, v2, v3, ..., vn)
and then you can compare two images for equalness by computing the distance between the weight vectors of two images:
distance = Sqrt(
(u1-v1)^2 +
(u2-v2)^2 +
(u2-v3)^2 +
...
(un-vn)^2);

One solution is to perform a RMS/RSS comparison on every pair of pictures required to perform a bubble sort. Second, you could perform an FFT on each image and do some axis averaging to retrieve a single integer for each image which you would use as an index to sort by. You may consider doing whatever comparison on a resized (25%, 10%) version of the original depending on how small a difference you choose to ignore and how much speedup you require. Let me know if these solutions are interesting, and we can discuss or I can provide sample code.

Most modern approaches to detect Near duplicate image detection use interesting points detection and descriptors describing area around such points. Often SIFT is used. Then you can quatize descriptors and use clusters as visual word vocabulary.
So if we see on ratio of common visual words of two images to all visual words of these images you estimate similarity between images. There are a lot of interesting articles. One of them is Near Duplicate Image Detection: minHash and tf-idf Weighting

For example using IMMI extension and IMMI you can examine many different ways how to measure similarity between images:
http://spl.utko.feec.vutbr.cz/en/component/content/article/46-image-processing-extension-for-rapidminer-5
By defining some threshold and selecting some method you can measure similarity.

Related

How to provide a score value to an image based on pattern information in it?

I saw a few image processing and analysis related questions on this forum and thought I could try this forum for my question. I have a say 30 two-dimensional arrays (to make things simple, although I have a very big data set) which form 30 individual images. Many of these images have similar base structure, but differ in intensities for different pixels. Due to this intensity variation amongst pixels, some images have a prominent pattern (say a larger area with localised intense pixels or high intensity pixels classifying an edge). Some images, also just contain single high intensity pixels randomly distributed without any prominent feature (so basically noise). I am now trying to build an algorithm, which can give a specific score to an image based on different factors like area fraction of high intensity pixels, mean standard deviation, so that I can find out the image with the most prominent pattern (in order words rank them). But these factors depend on a common factor i.e. a user defined threshold, which becomes different for every image. Any inputs on how I can achieve this ranking or a image score in an automated manner (without the use of a threshold)? I initially used Matlab to perform all the processing and area fraction calculations, but now I am using R do the same thing.
Can some amount of machine learning/ random forest stuff help me here? I am not sure. Some inputs would be very valuable.
P.S. If this not the right forum to post, any suggestions on where I can get good advise?

First of all, let me suggest a change in terminology: What you denote as feature is usually called pattern in image prcessing, while what you call factor is usually called feature.
I think that the main weakness of the features you are using (mean, standard deviation) is that they are only based on the statistics of single pixels (1st order statistics) without considering correlations (neighborhood relations of pixels). If you take a highly stuctured image and shuffle the pixels randomly, you will still have the same 1st order statistics.
There are many ways to take these correlations into account. A simple, efficient and therefore popular method is to apply some filters on the image first (high-pass, low-pass etc.) and then get the 1st order statistics of the resulting image. Other methods are based on Fast Fourier Transform (FFT).
Of course machine learning is also an option here. You could try convolutional neural networks for example, but I would try the simple filtering stuff first.

Fastest method to search for a specified item on an image?

Imagine we have a simple 2D drawing, filled it with lots of non-overlapping circles and only a few stars.
If we are to find all the stars among all these circles, I can think of very few methods. Brute force is one of them. Another one is possibly reduce the image size (to the optimal point where you can still distinguish the objects apart) and then apply brute force and map to the original image. The drawback of brute force is of course, it is very time consuming. I am looking for faster methods, possibly the fastest one.
What is the fastest image processing method to search for the specified item on a simple 2D image?

One typical way of looking for an object in an image is through cross correlation. Basically, you look for the position where the cross-correlation between a mask (the object you're attempting to find) and the image is the highest. That position is the likely location of the object you're trying to find.
For the sake of simplicity, I will refer to the object you're attempting to find as a star, but in general it can be any shape.
Some problems with the above approach:
The size of the mask has to match the size of the star. If you don't know the size of the star, then you will have to try different size masks. Image pyramids are more effective than just iteratively trying different size masks, but still require extra effort.
Similarly, the orientations of the mask and the star have to match. If they don't, the cross-correlation won't work.
For these reasons, the more you know about your problem, the simpler it becomes. This is the reason why people have asked you for more information in the comments. A general purpose solution doesn't really exist, to the best of my knowledge. Maybe someone more knowledgeable can correct me on this.
As you've mentioned, reducing the size of the image will help you reduce the computational time of your approach. In my opinion, it's hardly the core element of a solution -- it's just an optional optimization step.

If the shapes are easy to segment from the background, you might be able to compute distinguishing shape/color descriptors. Depending on your problem you could choose descriptors that are invariant to scale, translation or rotation (e.g. compactness, if it is unique to each shape). I do not know if this will be faster, though.
If you already know the exact shape and have an idea about the size, you might want to have a look at the Generalized Hough Transform, which is basically a formalized description of your "brute force algorithm"

As you list a property that the shapes are not overlapping then I assume an efficient algorithm would be able to
cut out all the shapes by scanning the image in some way (I can imagine relatively efficient and simple algorithm for convex shapes)
when you are left with cut out shapes you could use cross relation misha mentioned
You should describe the problem a bit better
can the shapes be rotated or scaled (or some other transform?)
is the background uniform colour
are the shapes uniform colour
are the shapes filled
Depending on the answer on the above questions you might have more less or more simple solutions.
Also, maybe this article might be interesting.
If the shapes are very regular maybe turning them into vectors could fit your needs nicely, but it might be an overkill, really depends what you want to do later.

Step 1: Thresholding - reduce the image to 1 bit (black or white) if the general image set permits it. [For the type of example you cite, my guess is thresholding would work nicely - leaving enough details to find objects].
Step 2: Optionally do some smoothing/noise removal.
Step 3: Use some clustering approach to gather the foreground objects.
Step 4: Use an appropriate heuristic to identify the objects.
The parameters in steps 1/2 will depend a lot on the type of images as well as experimentation/observation. 3 is usually straightforward if you have worked out 1/2 correctly. 4 will depend very much on the problem (for example, in your case identifying stars - which would depend on what is the actual shape of the stars expected in the images).

Image comparison - fast algorithm

I'm looking to create a base table of images and then compare any new images against that to determine if the new image is an exact (or close) duplicate of the base.
For example: if you want to reduce storage of the same image 100's of times, you could store one copy of it and provide reference links to it. When a new image is entered you want to compare to an existing image to make sure it's not a duplicate ... ideas?
One idea of mine was to reduce to a small thumbnail and then randomly pick 100 pixel locations and compare.

Below are three approaches to solving this problem (and there are many others).
The first is a standard approach in computer vision, keypoint matching. This may require some background knowledge to implement, and can be slow.
The second method uses only elementary image processing, and is potentially faster than the first approach, and is straightforward to implement. However, what it gains in understandability, it lacks in robustness -- matching fails on scaled, rotated, or discolored images.
The third method is both fast and robust, but is potentially the hardest to implement.
Keypoint Matching
Better than picking 100 random points is picking 100 important points. Certain parts of an image have more information than others (particularly at edges and corners), and these are the ones you'll want to use for smart image matching. Google "keypoint extraction" and "keypoint matching" and you'll find quite a few academic papers on the subject. These days, SIFT keypoints are arguably the most popular, since they can match images under different scales, rotations, and lighting. Some SIFT implementations can be found here.
One downside to keypoint matching is the running time of a naive implementation: O(n^2m), where n is the number of keypoints in each image, and m is the number of images in the database. Some clever algorithms might find the closest match faster, like quadtrees or binary space partitioning.
Alternative solution: Histogram method
Another less robust but potentially faster solution is to build feature histograms for each image, and choose the image with the histogram closest to the input image's histogram. I implemented this as an undergrad, and we used 3 color histograms (red, green, and blue), and two texture histograms, direction and scale. I'll give the details below, but I should note that this only worked well for matching images VERY similar to the database images. Re-scaled, rotated, or discolored images can fail with this method, but small changes like cropping won't break the algorithm
Computing the color histograms is straightforward -- just pick the range for your histogram buckets, and for each range, tally the number of pixels with a color in that range. For example, consider the "green" histogram, and suppose we choose 4 buckets for our histogram: 0-63, 64-127, 128-191, and 192-255. Then for each pixel, we look at the green value, and add a tally to the appropriate bucket. When we're done tallying, we divide each bucket total by the number of pixels in the entire image to get a normalized histogram for the green channel.
For the texture direction histogram, we started by performing edge detection on the image. Each edge point has a normal vector pointing in the direction perpendicular to the edge. We quantized the normal vector's angle into one of 6 buckets between 0 and PI (since edges have 180-degree symmetry, we converted angles between -PI and 0 to be between 0 and PI). After tallying up the number of edge points in each direction, we have an un-normalized histogram representing texture direction, which we normalized by dividing each bucket by the total number of edge points in the image.
To compute the texture scale histogram, for each edge point, we measured the distance to the next-closest edge point with the same direction. For example, if edge point A has a direction of 45 degrees, the algorithm walks in that direction until it finds another edge point with a direction of 45 degrees (or within a reasonable deviation). After computing this distance for each edge point, we dump those values into a histogram and normalize it by dividing by the total number of edge points.
Now you have 5 histograms for each image. To compare two images, you take the absolute value of the difference between each histogram bucket, and then sum these values. For example, to compare images A and B, we would compute
|A.green_histogram.bucket_1 - B.green_histogram.bucket_1|
for each bucket in the green histogram, and repeat for the other histograms, and then sum up all the results. The smaller the result, the better the match. Repeat for all images in the database, and the match with the smallest result wins. You'd probably want to have a threshold, above which the algorithm concludes that no match was found.
Third Choice - Keypoints + Decision Trees
A third approach that is probably much faster than the other two is using semantic texton forests (PDF). This involves extracting simple keypoints and using a collection decision trees to classify the image. This is faster than simple SIFT keypoint matching, because it avoids the costly matching process, and keypoints are much simpler than SIFT, so keypoint extraction is much faster. However, it preserves the SIFT method's invariance to rotation, scale, and lighting, an important feature that the histogram method lacked.
Update:
My mistake -- the Semantic Texton Forests paper isn't specifically about image matching, but rather region labeling. The original paper that does matching is this one: Keypoint Recognition using Randomized Trees. Also, the papers below continue to develop the ideas and represent the state of the art (c. 2010):
Fast Keypoint Recognition using Random Ferns - faster and more scalable than Lepetit 06
BRIEF: Binary Robust Independent Elementary Features - less robust but very fast -- I think the goal here is real-time matching on smart phones and other handhelds

The best method I know of is to use a Perceptual Hash. There appears to be a good open source implementation of such a hash available at:
http://phash.org/
The main idea is that each image is reduced down to a small hash code or 'fingerprint' by identifying salient features in the original image file and hashing a compact representation of those features (rather than hashing the image data directly). This means that the false positives rate is much reduced over a simplistic approach such as reducing images down to a tiny thumbprint sized image and comparing thumbprints.
phash offers several types of hash and can be used for images, audio or video.

This post was the starting point of my solution, lots of good ideas here so I though I would share my results. The main insight is that I've found a way to get around the slowness of keypoint-based image matching by exploiting the speed of phash.
For the general solution, it's best to employ several strategies. Each algorithm is best suited for certain types of image transformations and you can take advantage of that.
At the top, the fastest algorithms; at the bottom the slowest (though more accurate). You might skip the slow ones if a good match is found at the faster level.
file-hash based (md5,sha1,etc) for exact duplicates
perceptual hashing (phash) for rescaled images
feature-based (SIFT) for modified images
I am having very good results with phash. The accuracy is good for rescaled images. It is not good for (perceptually) modified images (cropped, rotated, mirrored, etc). To deal with the hashing speed we must employ a disk cache/database to maintain the hashes for the haystack.
The really nice thing about phash is that once you build your hash database (which for me is about 1000 images/sec), the searches can be very, very fast, in particular when you can hold the entire hash database in memory. This is fairly practical since a hash is only 8 bytes.
For example, if you have 1 million images it would require an array of 1 million 64-bit hash values (8 MB). On some CPUs this fits in the L2/L3 cache! In practical usage I have seen a corei7 compare at over 1 Giga-hamm/sec, it is only a question of memory bandwidth to the CPU. A 1 Billion-image database is practical on a 64-bit CPU (8GB RAM needed) and searches will not exceed 1 second!
For modified/cropped images it would seem a transform-invariant feature/keypoint detector like SIFT is the way to go. SIFT will produce good keypoints that will detect crop/rotate/mirror etc. However the descriptor compare is very slow compared to hamming distance used by phash. This is a major limitation. There are a lot of compares to do, since there are maximum IxJxK descriptor compares to lookup one image (I=num haystack images, J=target keypoints per haystack image, K=target keypoints per needle image).
To get around the speed issue, I tried using phash around each found keypoint, using the feature size/radius to determine the sub-rectangle. The trick to making this work well, is to grow/shrink the radius to generate different sub-rect levels (on the needle image). Typically the first level (unscaled) will match however often it takes a few more. I'm not 100% sure why this works, but I can imagine it enables features that are too small for phash to work (phash scales images down to 32x32).
Another issue is that SIFT will not distribute the keypoints optimally. If there is a section of the image with a lot of edges the keypoints will cluster there and you won't get any in another area. I am using the GridAdaptedFeatureDetector in OpenCV to improve the distribution. Not sure what grid size is best, I am using a small grid (1x3 or 3x1 depending on image orientation).
You probably want to scale all the haystack images (and needle) to a smaller size prior to feature detection (I use 210px along maximum dimension). This will reduce noise in the image (always a problem for computer vision algorithms), also will focus detector on more prominent features.
For images of people, you might try face detection and use it to determine the image size to scale to and the grid size (for example largest face scaled to be 100px). The feature detector accounts for multiple scale levels (using pyramids) but there is a limitation to how many levels it will use (this is tunable of course).
The keypoint detector is probably working best when it returns less than the number of features you wanted. For example, if you ask for 400 and get 300 back, that's good. If you get 400 back every time, probably some good features had to be left out.
The needle image can have less keypoints than the haystack images and still get good results. Adding more doesn't necessarily get you huge gains, for example with J=400 and K=40 my hit rate is about 92%. With J=400 and K=400 the hit rate only goes up to 96%.
We can take advantage of the extreme speed of the hamming function to solve scaling, rotation, mirroring etc. A multiple-pass technique can be used. On each iteration, transform the sub-rectangles, re-hash, and run the search function again.

My company has about 24million images come in from manufacturers every month. I was looking for a fast solution to ensure that the images we upload to our catalog are new images.
I want to say that I have searched the internet far and wide to attempt to find an ideal solution. I even developed my own edge detection algorithm.
I have evaluated speed and accuracy of multiple models.
My images, which have white backgrounds, work extremely well with phashing. Like redcalx said, I recommend phash or ahash. DO NOT use MD5 Hashing or anyother cryptographic hashes. Unless, you want only EXACT image matches. Any resizing or manipulation that occurs between images will yield a different hash.
For phash/ahash, Check this out: imagehash
I wanted to extend *redcalx'*s post by posting my code and my accuracy.
What I do:
from PIL import Image
from PIL import ImageFilter
import imagehash
img1=Image.open(r"C:\yourlocation")
img2=Image.open(r"C:\yourlocation")
if img1.width<img2.width:
img2=img2.resize((img1.width,img1.height))
else:
img1=img1.resize((img2.width,img2.height))
img1=img1.filter(ImageFilter.BoxBlur(radius=3))
img2=img2.filter(ImageFilter.BoxBlur(radius=3))
phashvalue=imagehash.phash(img1)-imagehash.phash(img2)
ahashvalue=imagehash.average_hash(img1)-imagehash.average_hash(img2)
totalaccuracy=phashvalue+ahashvalue
Here are some of my results:
item1 item2 totalsimilarity
desk1 desk1 3
desk1 phone1 22
chair1 desk1 17
phone1 chair1 34
Hope this helps!

As cartman pointed out, you can use any kind of hash value for finding exact duplicates.
One starting point for finding close images could be here. This is a tool used by CG companies to check if revamped images are still showing essentially the same scene.

I have an idea, which can work and it most likely to be very fast.
You can sub-sample an image to say 80x60 resolution or comparable,
and convert it to grey scale (after subsampling it will be faster).
Process both images you want to compare.
Then run normalised sum of squared differences between two images (the query image and each from the db),
or even better Normalised Cross Correlation, which gives response closer to 1, if
both images are similar.
Then if images are similar you can proceed to more sophisticated techniques
to verify that it is the same images.
Obviously this algorithm is linear in terms of number of images in your database
so even though it is going to be very fast up to 10000 images per second on the modern hardware.
If you need invariance to rotation, then a dominant gradient can be computed
for this small image, and then the whole coordinate system can be rotated to canonical
orientation, this though, will be slower. And no, there is no invariance to scale here.
If you want something more general or using big databases (million of images), then
you need to look into image retrieval theory (loads of papers appeared in the last 5 years).
There are some pointers in other answers. But It might be overkill, and the suggest histogram approach will do the job. Though I would think combination of many different
fast approaches will be even better.

I believe that dropping the size of the image down to an almost icon size, say 48x48, then converting to greyscale, then taking the difference between pixels, or Delta, should work well. Because we're comparing the change in pixel color, rather than the actual pixel color, it won't matter if the image is slightly lighter or darker. Large changes will matter since pixels getting too light/dark will be lost. You can apply this across one row, or as many as you like to increase the accuracy. At most you'd have 47x47=2,209 subtractions to make in order to form a comparable Key.

Picking 100 random points could mean that similar (or occasionally even dissimilar) images would be marked as the same, which I assume is not what you want. MD5 hashes wouldn't work if the images were different formats (png, jpeg, etc), had different sizes, or had different metadata. Reducing all images to a smaller size is a good bet, doing a pixel-for- pixel comparison shouldn't take too long as long as you're using a good image library / fast language, and the size is small enough.
You could try making them tiny, then if they are the same perform another comparison on a larger size - could be a good combination of speed and accuracy...

What we loosely refer to as duplicates can be difficult for algorithms to discern.
Your duplicates can be either:
Exact Duplicates
Near-exact Duplicates. (minor edits of image etc)
perceptual Duplicates (same content, but different view, camera etc)
No1 & 2 are easier to solve. No 3. is very subjective and still a research topic.
I can offer a solution for No1 & 2.
Both solutions use the excellent image hash- hashing library: https://github.com/JohannesBuchner/imagehash
Exact duplicates
Exact duplicates can be found using a perceptual hashing measure.
The phash library is quite good at this. I routinely use it to clean
training data.
Usage (from github site) is as simple as:
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
for img_fn in sorted(image_fns):
hash = imagehash.average_hash(Image.open(image_fn))
if hash in img_hashes:
print( '{} duplicate of {}'.format(image_fn, img_hashes[hash]) )
else:
img_hashes[hash] = image_fn
Near-Exact Duplicates
In this case you will have to set a threshold and compare the hash values for their distance from each
other. This has to be done by trial-and-error for your image content.
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
epsilon = 50
for img_fn1, img_fn2 in zip(image_fns, image_fns[::-1]):
if image_fn1 == image_fn2:
continue
hash1 = imagehash.average_hash(Image.open(image_fn1))
hash2 = imagehash.average_hash(Image.open(image_fn2))
if hash1 - hash2 < epsilon:
print( '{} is near duplicate of {}'.format(image_fn1, image_fn2) )

If you have a large number of images, look into a Bloom filter, which uses multiple hashes for a probablistic but efficient result. If the number of images is not huge, then a cryptographic hash like md5 should be sufficient.

I think it's worth adding to this a phash solution I built that we've been using for a while now: Image::PHash. It is a Perl module, but the main parts are in C. It is several times faster than phash.org and has a few extra features for DCT-based phashes.
We had dozens of millions of images already indexed on a MySQL database, so I wanted something fast and also a way to use MySQL indices (which don't work with hamming distance), which led me to use "reduced" hashes for direct matches, the module doc discusses this.
It's quite simple to use:
use Image::PHash;
my $iph1 = Image::PHash->new('file1.jpg');
my $p1 = $iph1->pHash();
my $iph2 = Image::PHash->new('file2.jpg');
my $p2 = $iph2->pHash();
my $diff = Image::PHash::diff($p1, $p2);

I made a very simple solution in PHP for comparing images several years ago. It calculates a simple hash for each image, and then finds the difference. It works very nice for cropped or cropped with translation versions of the same image.
First I resize the image to a small size, like 24x24 or 36x36. Then I take each column of pixels and find average R,G,B values for this column.
After each column has its own three numbers, I do two passes: first on odd columns and second on even ones. The first pass sums all the processed cols and then divides by their number ( [1] + [2] + [5] + [N-1] / (N/2) ). The second pass works in another manner: ( [3] - [4] + [6] - [8] ... / (N/2) ).
So now I have two numbers. As I found out experimenting, the first one is a major one: if it's far from the values of another image, they are not similar from the human point of view at all.
So, the first one represents the average brightness of the image (again, you can pay most attention to green channel, then the red one, etc, but the default R->G->B order works just fine). The second number can be compared if the first two are very close, and it in fact represents the overall contrast of the image: if we have some black/white pattern or any contrast scene (lighted buildings in the city at night, for example) and if we are lucky, we will get huge numbers here if out positive members of sum are mostly bright, and negative ones are mostly dark, or vice versa. As I want my values to be always positive, I divide by 2 and shift by 127 here.
I wrote the code in PHP in 2017, and seems I lost the code. But I still have the screenshots:
The same image:
Black & White version:
Cropped version:
Another image, ranslated version:
Same color gamut as 4th, but another scene:
I tuned the difference thresholds so that the results are really nice. But as you can see, this simple algorithm cannot do anything good with simple scene translations.
On a side note I can notice that a modification can be written to make cropped copies from each of two images at 75-80 percent, 4 at the corners or 8 at the corners and middles of the edges, and then by comparing the cropped variants with another whole image just the same way; and if one of them gets a significantly better similarity score, then use its value instead of the default one).

What is currently considered the "best" algorithm for 2D point-matching?

I have two lists containing x-y coordinates (of stars). I could also have magnitudes (brightnesses) attached to each star. Now each star has random position jiggles and there can be a few extra or missing points in each image. My question is, "What is the best 2D point matching algorithm for such a dataset?" I guess both for a simple linear (translation, rotation, scale) and non-linear (say, n-degree polynomials in the coordinates). In the lingo of the point matching field, I'm looking for the algorithms that would win in a shootout between 2D point matching programs with noise and spurious points. There may be a different "winners" depending if the labeling info is used (the magnitudes) and/or the transformation is restricted to being linear.
I am aware that there are many classes of 2D point matching algorithms and many algorithms in each class (literally probably hundreds in total) but I don't know which, if any, is the consider the "best" or the "most standard" by people in the field of computer vision. Sadly, many of the articles to papers I want to read don't have online versions and I can only read the abstract. Before I settle on a particular algorithm to implement it would be good to hear from a few experts to separate the wheat from the chaff.
I have a working matching program that uses triangles but it fails somewhat frequently (~5% of the time) such that the solution transformation has obvious distortions but for no obvious reason. This program was not written by me and is from a paper written almost 20 years ago. I want to write a new implementation that performs most robustly. I am assuming (hoping) that there have been some advances in this area that make this plausible.

If you're interested in star matching, check out the Astrometry.net blind astrometry solver and the paper on it here. They use four point quads to solve star configurations in Flickr pictures of the night sky. Check out this interview.

There is no single "best" algorithm for this. There are lots of different techniques, and each work better than others on specific datasets and types of data.
One thing I'd recommend is to read this introduction to image registration from the tutorials of the Insight Toolkit. ITK supports MANY types of image registration (which is what it sounds like you are attempting), and is very robust in many cases. Most of their users are in the medical field, so you'll have to wade through a lot of medical jargon, but the algorithms and code work with any type of image (including 1,2,3, and n dimensional images, of different types,etc).

You can consider applying your algorithm first only on the N brightest stars, then include progressively the others to refine the result, reducing the search range at the same time.
Using RANSAC for robustness to extra points is also very common.

I'm not sure it would work, but worth a try:
For each star do the circle time ray Fourier transform - centered around it - of all the other stars (note: this is not the standard Fourier transform, which is line times line).
The phase space of circle times ray is integers times line, but since we only have finite accuracy, you just get a matrix; the dimensions of the matrix depend on accuracy. Now try to pair the matrices to one another (e.g. using L_2 norm)

I saw a program on tv a while ago about how researchers were taking pictures of whales and using the spots on them (which are unique for each whale) to id each whale. It used the angles between the spots. By using the angles it didn't matter if the image was rotated or scaled or translated. That sounds similar to what you're doing with your triangles.

I think the "best" (most technical) way would to be to take the Fourier Transform of the original image and of the new linearly modified image. By doing some simple filtering, it should be easy to figure out the orientation and scale of your image with respect to the old one. There is a description of the 2d Fourier Transform here.

Algorithm to compare two images

Given two different image files (in whatever format I choose), I need to write a program to predict the chance if one being the illegal copy of another. The author of the copy may do stuff like rotating, making negative, or adding trivial details (as well as changing the dimension of the image).
Do you know any algorithm to do this kind of job?

These are simply ideas I've had thinking about the problem, never tried it but I like thinking about problems like this!
Before you begin
Consider normalising the pictures, if one is a higher resolution than the other, consider the option that one of them is a compressed version of the other, therefore scaling the resolution down might provide more accurate results.
Consider scanning various prospective areas of the image that could represent zoomed portions of the image and various positions and rotations. It starts getting tricky if one of the images are a skewed version of another, these are the sort of limitations you should identify and compromise on.
Matlab is an excellent tool for testing and evaluating images.
Testing the algorithms
You should test (at the minimum) a large human analysed set of test data where matches are known beforehand. If for example in your test data you have 1,000 images where 5% of them match, you now have a reasonably reliable benchmark. An algorithm that finds 10% positives is not as good as one that finds 4% of positives in our test data. However, one algorithm may find all the matches, but also have a large 20% false positive rate, so there are several ways to rate your algorithms.
The test data should attempt to be designed to cover as many types of dynamics as possible that you would expect to find in the real world.
It is important to note that each algorithm to be useful must perform better than random guessing, otherwise it is useless to us!
You can then apply your software into the real world in a controlled way and start to analyse the results it produces. This is the sort of software project which can go on for infinitum, there are always tweaks and improvements you can make, it is important to bear that in mind when designing it as it is easy to fall into the trap of the never ending project.
Colour Buckets
With two pictures, scan each pixel and count the colours. For example you might have the 'buckets':
white
red
blue
green
black
(Obviously you would have a higher resolution of counters). Every time you find a 'red' pixel, you increment the red counter. Each bucket can be representative of spectrum of colours, the higher resolution the more accurate but you should experiment with an acceptable difference rate.
Once you have your totals, compare it to the totals for a second image. You might find that each image has a fairly unique footprint, enough to identify matches.
Edge detection
How about using Edge Detection.
(source: wikimedia.org)
With two similar pictures edge detection should provide you with a usable and fairly reliable unique footprint.
Take both pictures, and apply edge detection. Maybe measure the average thickness of the edges and then calculate the probability the image could be scaled, and rescale if necessary. Below is an example of an applied Gabor Filter (a type of edge detection) in various rotations.
Compare the pictures pixel for pixel, count the matches and the non matches. If they are within a certain threshold of error, you have a match. Otherwise, you could try reducing the resolution up to a certain point and see if the probability of a match improves.
Regions of Interest
Some images may have distinctive segments/regions of interest. These regions probably contrast highly with the rest of the image, and are a good item to search for in your other images to find matches. Take this image for example:
(source: meetthegimp.org)
The construction worker in blue is a region of interest and can be used as a search object. There are probably several ways you could extract properties/data from this region of interest and use them to search your data set.
If you have more than 2 regions of interest, you can measure the distances between them. Take this simplified example:
(source: per2000.eu)
We have 3 clear regions of interest. The distance between region 1 and 2 may be 200 pixels, between 1 and 3 400 pixels, and 2 and 3 200 pixels.
Search other images for similar regions of interest, normalise the distance values and see if you have potential matches. This technique could work well for rotated and scaled images. The more regions of interest you have, the probability of a match increases as each distance measurement matches.
It is important to think about the context of your data set. If for example your data set is modern art, then regions of interest would work quite well, as regions of interest were probably designed to be a fundamental part of the final image. If however you are dealing with images of construction sites, regions of interest may be interpreted by the illegal copier as ugly and may be cropped/edited out liberally. Keep in mind common features of your dataset, and attempt to exploit that knowledge.
Morphing
Morphing two images is the process of turning one image into the other through a set of steps:
Note, this is different to fading one image into another!
There are many software packages that can morph images. It's traditionaly used as a transitional effect, two images don't morph into something halfway usually, one extreme morphs into the other extreme as the final result.
Why could this be useful? Dependant on the morphing algorithm you use, there may be a relationship between similarity of images, and some parameters of the morphing algorithm.
In a grossly over simplified example, one algorithm might execute faster when there are less changes to be made. We then know there is a higher probability that these two images share properties with each other.
This technique could work well for rotated, distorted, skewed, zoomed, all types of copied images. Again this is just an idea I have had, it's not based on any researched academia as far as I am aware (I haven't look hard though), so it may be a lot of work for you with limited/no results.
Zipping
Ow's answer in this question is excellent, I remember reading about these sort of techniques studying AI. It is quite effective at comparing corpus lexicons.
One interesting optimisation when comparing corpuses is that you can remove words considered to be too common, for example 'The', 'A', 'And' etc. These words dilute our result, we want to work out how different the two corpus are so these can be removed before processing. Perhaps there are similar common signals in images that could be stripped before compression? It might be worth looking into.
Compression ratio is a very quick and reasonably effective way of determining how similar two sets of data are. Reading up about how compression works will give you a good idea why this could be so effective. For a fast to release algorithm this would probably be a good starting point.
Transparency
Again I am unsure how transparency data is stored for certain image types, gif png etc, but this will be extractable and would serve as an effective simplified cut out to compare with your data sets transparency.
Inverting Signals
An image is just a signal. If you play a noise from a speaker, and you play the opposite noise in another speaker in perfect sync at the exact same volume, they cancel each other out.
(source: themotorreport.com.au)
Invert on of the images, and add it onto your other image. Scale it/loop positions repetitively until you find a resulting image where enough of the pixels are white (or black? I'll refer to it as a neutral canvas) to provide you with a positive match, or partial match.
However, consider two images that are equal, except one of them has a brighten effect applied to it:
(source: mcburrz.com)
Inverting one of them, then adding it to the other will not result in a neutral canvas which is what we are aiming for. However, when comparing the pixels from both original images, we can definatly see a clear relationship between the two.
I haven't studied colour for some years now, and am unsure if the colour spectrum is on a linear scale, but if you determined the average factor of colour difference between both pictures, you can use this value to normalise the data before processing with this technique.
Tree Data structures
At first these don't seem to fit for the problem, but I think they could work.
You could think about extracting certain properties of an image (for example colour bins) and generate a huffman tree or similar data structure. You might be able to compare two trees for similarity. This wouldn't work well for photographic data for example with a large spectrum of colour, but cartoons or other reduced colour set images this might work.
This probably wouldn't work, but it's an idea. The trie datastructure is great at storing lexicons, for example a dictionarty. It's a prefix tree. Perhaps it's possible to build an image equivalent of a lexicon, (again I can only think of colours) to construct a trie. If you reduced say a 300x300 image into 5x5 squares, then decompose each 5x5 square into a sequence of colours you could construct a trie from the resulting data. If a 2x2 square contains:
FFFFFF|000000|FDFD44|FFFFFF
We have a fairly unique trie code that extends 24 levels, increasing/decreasing the levels (IE reducing/increasing the size of our sub square) may yield more accurate results.
Comparing trie trees should be reasonably easy, and could possible provide effective results.
More ideas
I stumbled accross an interesting paper breif about classification of satellite imagery, it outlines:
Texture measures considered are: cooccurrence matrices, gray-level differences, texture-tone analysis, features derived from the Fourier spectrum, and Gabor filters. Some Fourier features and some Gabor filters were found to be good choices, in particular when a single frequency band was used for classification.
It may be worth investigating those measurements in more detail, although some of them may not be relevant to your data set.
Other things to consider
There are probably a lot of papers on this sort of thing, so reading some of them should help although they can be very technical. It is an extremely difficult area in computing, with many fruitless hours of work spent by many people attempting to do similar things. Keeping it simple and building upon those ideas would be the best way to go. It should be a reasonably difficult challenge to create an algorithm with a better than random match rate, and to start improving on that really does start to get quite hard to achieve.
Each method would probably need to be tested and tweaked thoroughly, if you have any information about the type of picture you will be checking as well, this would be useful. For example advertisements, many of them would have text in them, so doing text recognition would be an easy and probably very reliable way of finding matches especially when combined with other solutions. As mentioned earlier, attempt to exploit common properties of your data set.
Combining alternative measurements and techniques each that can have a weighted vote (dependant on their effectiveness) would be one way you could create a system that generates more accurate results.
If employing multiple algorithms, as mentioned at the begining of this answer, one may find all the positives but have a false positive rate of 20%, it would be of interest to study the properties/strengths/weaknesses of other algorithms as another algorithm may be effective in eliminating false positives returned from another.
Be careful to not fall into attempting to complete the never ending project, good luck!

Read the paper: Porikli, Fatih, Oncel Tuzel, and Peter Meer. “Covariance Tracking Using Model Update Based
on Means on Riemannian Manifolds”. (2006) IEEE Computer Vision and Pattern Recognition.
I was successfully able to detect overlapping regions in images captured from adjacent webcams using the technique presented in this paper. My covariance matrix was composed of Sobel, canny and SUSAN aspect/edge detection outputs, as well as the original greyscale pixels.

An idea:
use keypoint detectors to find scale- and transform- invariant descriptors of some points in the image (e.g. SIFT, SURF, GLOH, or LESH).
try to align keypoints with similar descriptors from both images (like in panorama stitching), allow for some image transforms if necessary (e.g. scale & rotate, or elastic stretching).
if many keypoints align well (exists such a transform, that keypoint alignment error is low; or transformation "energy" is low, etc.), you likely have similar images.
Step 2 is not trivial. In particular, you may need to use a smart algorithm to find the most similar keypoint on the other image. Point descriptors are usually very high-dimensional (like a hundred parameters), and there are many points to look through. kd-trees may be useful here, hash lookups don't work well.
Variants:
Detect edges or other features instead of points.

It is indeed much less simple than it seems :-) Nick's suggestion is a good one.
To get started, keep in mind that any worthwhile comparison method will essentially work by converting the images into a different form -- a form which makes it easier to pick similar features out. Usually, this stuff doesn't make for very light reading ...
One of the simplest examples I can think of is simply using the color space of each image. If two images have highly similar color distributions, then you can be reasonably sure that they show the same thing. At least, you can have enough certainty to flag it, or do more testing. Comparing images in color space will also resist things such as rotation, scaling, and some cropping. It won't, of course, resist heavy modification of the image or heavy recoloring (and even a simple hue shift will be somewhat tricky).
http://en.wikipedia.org/wiki/RGB_color_space
http://upvector.com/index.php?section=tutorials&subsection=tutorials/colorspace
Another example involves something called the Hough Transform. This transform essentially decomposes an image into a set of lines. You can then take some of the 'strongest' lines in each image and see if they line up. You can do some extra work to try and compensate for rotation and scaling too -- and in this case, since comparing a few lines is MUCH less computational work than doing the same to entire images -- it won't be so bad.
http://homepages.inf.ed.ac.uk/amos/hough.html
http://rkb.home.cern.ch/rkb/AN16pp/node122.html
http://en.wikipedia.org/wiki/Hough_transform

In the form described by you, the problem is tough. Do you consider copy, paste of part of the image into another larger image as a copy ? etc.
What we loosely refer to as duplicates can be difficult for algorithms to discern.
Your duplicates can be either:
Exact Duplicates
Near-exact Duplicates. (minor edits of image etc)
perceptual Duplicates (same content, but different view, camera etc)
No1 & 2 are easier to solve. No 3. is very subjective and still a research topic.
I can offer a solution for No1 & 2.
Both solutions use the excellent image hash- hashing library: https://github.com/JohannesBuchner/imagehash
Exact duplicates
Exact duplicates can be found using a perceptual hashing measure.
The phash library is quite good at this. I routinely use it to clean
training data.
Usage (from github site) is as simple as:
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
for img_fn in sorted(image_fns):
hash = imagehash.average_hash(Image.open(image_fn))
if hash in img_hashes:
print( '{} duplicate of {}'.format(image_fn, img_hashes[hash]) )
else:
img_hashes[hash] = image_fn
Near-Exact Duplicates
In this case you will have to set a threshold and compare the hash values for their distance from each
other. This has to be done by trial-and-error for your image content.
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
epsilon = 50
for img_fn1, img_fn2 in zip(image_fns, image_fns[::-1]):
if image_fn1 == image_fn2:
continue
hash1 = imagehash.average_hash(Image.open(image_fn1))
hash2 = imagehash.average_hash(Image.open(image_fn2))
if hash1 - hash2 < epsilon:
print( '{} is near duplicate of {}'.format(image_fn1, image_fn2) )
If you take a step-back, this is easier to solve if you watermark the master images.
You will need to use a watermarking scheme to embed a code into the image. To take a step back, as opposed to some of the low-level approaches (edge detection etc) suggested by some folks, a watermarking method is superior because:
It is resistant to Signal processing attacks
► Signal enhancement – sharpening, contrast, etc.
► Filtering – median, low pass, high pass, etc.
► Additive noise – Gaussian, uniform, etc.
► Lossy compression – JPEG, MPEG, etc.
It is resistant to Geometric attacks
► Affine transforms
► Data reduction – cropping, clipping, etc.
► Random local distortions
► Warping
Do some research on watermarking algorithms and you will be on the right path to solving your problem. (
Note: You can benchmark you method using the STIRMARK dataset. It is an accepted standard for this type of application.

This is just a suggestion, it might not work and I'm prepared to be called on this.
This will generate false positives, but hopefully not false negatives.
Resize both of the images so that they are the same size (I assume that the ratios of widths to lengths are the same in both images).
Compress a bitmap of both images with a lossless compression algorithm (e.g. gzip).
Find pairs of files that have similar file sizes. For instance, you could just sort every pair of files you have by how similar the file sizes are and retrieve the top X.
As I said, this will definitely generate false positives, but hopefully not false negatives. You can implement this in five minutes, whereas the Porikil et. al. would probably require extensive work.

I believe if you're willing to apply the approach to every possible orientation and to negative versions, a good start to image recognition (with good reliability) is to use eigenfaces: http://en.wikipedia.org/wiki/Eigenface
Another idea would be to transform both images into vectors of their components. A good way to do this is to create a vector that operates in x*y dimensions (x being the width of your image and y being the height), with the value for each dimension applying to the (x,y) pixel value. Then run a variant of K-Nearest Neighbours with two categories: match and no match. If it's sufficiently close to the original image it will fit in the match category, if not then it won't.
K Nearest Neighbours(KNN) can be found here, there are other good explanations of it on the web too: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
The benefits of KNN is that the more variants you're comparing to the original image, the more accurate the algorithm becomes. The downside is you need a catalogue of images to train the system first.

If you're willing to consider a different approach altogether to detecting illegal copies of your images, you could consider watermarking. (from 1.4)
...inserts copyright information into the digital object without the loss of quality. Whenever the copyright of a digital object is in question, this information is extracted to identify the rightful owner. It is also possible to encode the identity of the original buyer along with the identity of the copyright holder, which allows tracing of any unauthorized copies.
While it's also a complex field, there are techniques that allow the watermark information to persist through gross image alteration: (from 1.9)
... any signal transform of reasonable strength cannot remove the watermark. Hence a pirate willing to remove the watermark will not succeed unless they debase the document too much to be of commercial interest.
of course, the faq calls implementing this approach: "...very challenging" but if you succeed with it, you get a high confidence of whether the image is a copy or not, rather than a percentage likelihood.

If you're running Linux I would suggest two tools:
align_image_stack from package hugin-tools - is a commandline program that can automatically correct rotation, scaling, and other distortions (it's mostly intended for compositing HDR photography, but works for video frames and other documents too). More information: http://hugin.sourceforge.net/docs/manual/Align_image_stack.html
compare from package imagemagick - a program that can find and count the amount of different pixels in two images. Here's a neat tutorial: http://www.imagemagick.org/Usage/compare/ uising the -fuzz N% you can increase the error tolerance. The higher the N the higher the error tolerance to still count two pixels as the same.
align_image_stack should correct any offset so the compare command will actually have a chance of detecting same pixels.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio