Algorithm to simulate color blindness?

Algorithm to simulate color blindness? - algorithm

There are many tools online that take images and simulate what that image might look like to someone with color blindness. However, I can't find any descriptions of these algorithms.
Is there a standard algorithm used to simulate color blindness? I'm aware that there are many types of color blindness (see the Wikipedia page on the subject for more details), but I'm primarily interested in algorithms for simulating dichromacy.

I had the same frustration and wrote an article comparing opensource color blindness simulations. In short, there are four main algorithms:
Coblis and the "HCIRN Color Blind Simulation function". You'll find this one in many places, and a Javascript implementation by MaPePeR. The full HCIRN simulation function was not properly evaluated but is reasonable in practice. However the "ColorMatrix" approximation by colorjack is very inaccurate and should be totally avoided (the author himself said that). Unfortunately it's still widespread as it was easy to copy/paste.
"Computerized simulation of color appearance for dichromats" by Brettel, Viénot, and Mollon (1997). A very solid reference. Works for all kinds of dichromacies. I wrote a public domain C implementation in libDaltonLens.
"Digital video colourmaps for checking the legibility of displays by dichromats" by Viénot, Brettel and Mollon (1999). A solid reference too, simplifies the 1997 paper for protanopia and deuteranopia (2 of the 3 kinds of color blindness). Also in libDaltonLens.
"A Physiologically-based Model for Simulation of Color Vision Deficiency" by Machado et al. (2009). Precomputed matrices are available on their website, which makes it easy to implement yourself. You just need to add the conversion from sRGB to linearRGB.

Looks like you're answer is in the wikipedia entry you linked.
For example:
Protanopia (1% of males): Lacking the long-wavelength sensitive
retinal cones, those with this condition are unable to distinguish
between colors in the green–yellow–red section of the spectrum. They
have a neutral point at a greenish wavelength around 492 nm – that is,
they cannot discriminate light of this wavelength from white.
So you need to de-saturate any colors in the green-yellow-red spectrum to white.
Image color saturation
The other 2 types of dichromacy can be handled similarly.

First we have to understand how the eye works:
A regular/healthy eye has 3 types of cones and 1 type of rods that have activation functions over the visible spectrum of light.
Their activations then pass through some function to produce the signal that goes to your brain. Roughly speaking, the function takes 4 channels as input and produces 3 channels as output (namely lightness, yellow-blue and red-green).
A colorblind person would have one of those two things be different (afaik usually/always 1.), so for example the person would be missing one type of cone or the cone's activation would be different.
The best thing to do would be:
Convert all pixels from RGB space to a combination of frequencies (with intensities). To do this, first take calculate the activations of each of the three cones (of a healthy person) then find a "natural" solution for a set of frequencies (+ intensities) that would result in the same activation. Of course, one solution is just the original three RGB frequencies with their intensities, but it is unlikely that the original image actually had that. A natural solution would be for example a normal distribution around some frequency (or even just one frequency).
Then, (again for each pixel) calculate the activations of a colorblind person's cones to your combination of frequencies.
Finally, find an RGB value such that a healthy person would have the same activations as the ones the colorblind person has.
Note that, if the way these activations are combined is also different for the relevant type of colorblindness, you might want to carry that out as well in the above steps. (So instead of matching activations, you are matching the result of the function over the activations).

Related

How to compare two contours? (font comparison)

I'm trying to analyse two contours and give a percent corresponding to their similarity. Assuming I have all the point's coordinates describing these contours (just like an SVG path), based on which factor should I tell they're almost identical ?
After some Google searches, I found something related to Fourier descriptors, are they relevant for my case ?
Edit
What I want to do is to compare several fonts to another one. Just like would do What the font, but not with an image. Thanks to the produced algorithm, it would be possible to find a font equivalent according to the similarity percentage.
Some scripts just compare the bounding box for each letters, but it's not enough. I need a way to tell that Arial is closest to Verdana than to Webdings. So assuming I can extract the contour from the fonts, I need a way to compare two contours.
For example (with "logical" percent values):

there are two basic ways to approach the general problem (font matching): symbolic and statistical. a good solution will probably combine both in some way.
a symbolic approach uses your knowledge of the problem in a direct way. for example, you can make a list of the things you (as an intelligent human) would use to characterise fonts. the kind of questions that identifont uses. this approach means writing routines that are smart enough to detect the various properties (eg stroke width, whether certain loops are closed, existence of serifs, etc) plus a decision tree (or "rule engine") that puts the yes/no/unsure answers together and comes up with an answer.
the statistical approach sounds more like what you were thinking about, and is probably how what the font works. here the idea is to find some general properties and use those as weights to find a "best" selection. for example, if you have lots of fonts then you can train a neural net (input being pixels at some sample resolution). there you don't need to know "how" the net decides - just that given enough training data it will find a way to do so. or you could just look at the sum of all the dark pixels - that would likely give you results similar to your percentages above.
this sounds simple, but often it's not so easy to find simple statistical measurements that show differences well in all the ways you want.
so then there's a large middle ground between the two. the idea being that if you can pull in some of the ideas from the first group then you can make the approaches in the second much more efficient. while the simplest neural net approach is "all in one" (it includes the calculations and the decisions) you can separate those out. so instead of just giving the net a bunch of pixels you can give it more "meaningful" inputs - things that you know help detect between different fonts. things like stroke width, or the number of "holes" in the character. you can also add some smarts to remove things that might otherwise confuse results - for example, pre-scaling to the same height (if you have a full font set then you can scale everything so that the height of a lowercase "m", say, is constant).
fourier descriptors are a way of characterising the "outside shape" of something and so could be used as an input to a statistical approach as i've described above. in the example you give the fourier descriptors will pick up the "spikiness" of the serifs in the lower G, and so would indicate that it is very different from the G on the left. but they care much less about stroke width and nothing at all about scale (magnification/zoom) (which can be a good or bad thing - if you're being given random letters of different sizes, you don't want to be sensitive to size, but if you've normalized to a standard "m" for an entire alphabet then you certainly do want to include that). since the output is just a spectrum you can compare different letters by cross-correlation of use something like PCA to categorize different types of letter.
other ideas would be 2d cross-correlation (the maximum of the normalised correlation gives you some idea of how similar two things are) or simply seeing what fraction of pixels are common in both letters.
as the comments say, this is a huge problem (and i am not an expert - the above is just random bullshit from being an interested bystander).
but, to finally answer your question, if what you have is an outline, then a fourier descriptor would be a good place to start. since that focuses on shape rather than "weight" i would combine that with something like total area enclosed by the outline. then write some code to calculate those and see what numbers you get for some example alphabets. if it seems to distinguish some letters, but not others, then look for some other measurements that would help in those cases. you will likely end up combining quite a few approaches to get something both fast and reliable.
alternatively, if you just want something simple, try using some easy-to-measure values like height, width, total number of pixels "inside" the contours, how many strokes you cross along vertical or horizontal lines, etc. combining a bunch of those could get you something "good enough" for some purposes, if you aren't comfortable with the maths involved in fourier transforms etc.

Have you considered using a neural network based approach? This paper uses a Self-Organizing Tree map to perform content based image retrieval. With a good training set, it should be possible to create a multilayer network (or SOM) that can give you an accurate similarity measure.

Calculate how humans perceive similarity between different colours

I'm working on a site where users can describe a physical object using (amongst many other things) any color in the rgb 0-255 range. We offer some simplified palettes for easy clicking but a full color wheel is a requirement.
Behind the scenes, one of the processes compares two user descriptions of the object and scores them for similarity.
What I'm trying to do is get a score for how similar the 2 colors are in terms of human perception . Basically, the algorithm needs to determine if a 2 humans picking 2 different colors could be describing the same object. Thus Light Red->Red should be 100%, Most of the shades of grey will be 100% to each other, etc but red-> green is definitely not a match.
To get a decent look at how the algorithms were working, I plotted grayscale and 3 intensities of each hue against every other color in the set and indicated no match (0%) with black, visually identical (100%) with white and grayscale to indicate the intermediate values.
My first (very simplistic approach) was to simply treat the RGB values as co-ordinates in the colour cube and work out the distance (magnitude of the vector) between them.
This threw out a number of problems with regards to Black->50% Grey being a larger distance than (say) Black->50% Blue. having run hundreds of comparisons and asked for feedback, this doesn't seem to match human perception (shown below)
Method 2 converted the RGB values into HSV. I then generated a score based 80% on hue with the other 20% on Sat/Lum. This seems to be the best method so far but still throws some odd matches
Method 3 was an attempt at a hybrid - HSL Values were calculated but the final score was based upon the distance between the 2 colors in the HSL color cylinder space (as in 3D polar co-ordinates).
I feel like I must be re-inventing the wheel - surely this has been done before? I can't find any decent examples on Google and as you can see my approach leaves something to be desired.
So, my question is:
Is there a standard way to do this? If so, how? If not, can anyone suggest a way to improve my approach? I can provide code snippets if required but be warned it's currently messy as hell due to 3 days of tweaking.
Solution (Delta E 2000):
Using the suggestions provided below, I've implemented a Delta E 2000 comparer. I've had to tweak the weighting values to be quite large - I'm not looking for colors which are imperceptibly different but which are not hugely different. In case anyone's interested, the resulting plot is below...

There are a half dozen or so possibilities. EasyRGB has a page devoted to them. Of those listed, DeltaE 2000 probably has the best correlation with human perception -- and is also extremely complex to compute. Delta CMC is almost as good for something like half the code (though the computation still isn't entirely trivial).

I'm not 100% clear on how your problem is set up, but you may want to read up on: Normalized Cross Correlation, and Lab and CIEXYZ color spaces.

This sounds like a prime example for a neural net based approach (if you are in an experimenting mode :) because it's about creating a decision rule that mimics Human perception. A neural net that has six inputs (r, r', g, g', b, b') and one output (is_similar) can be easily trained by using e.g. your own perception of similarity as the training source!

Explaining the AdaBoost Algorithms to non-technical people

I've been trying to understand the AdaBoost algorithm without much success. I'm struggling with understanding the Viola Jones paper on Face Detection as an example.
Can you explain AdaBoost in laymen's terms and present good examples of when it's used?

Adaboost is an algorithm that combines classifiers with poor performance, aka weak learners, into a bigger classifier with much higher performance.
How does it work? In a very simplified manner:
Train a weak learner.
Add it to the set of weak learners trained so far (with an optimal weight)
Increase the importance of samples that are still miss-classified.
Go to 1.
There is a broad and detailed theory behind the scenes, but the intuition is just that: let each "dumb" classifier focus on the mistakes the previous ones were not able to fix.
AdaBoost is one of the most used algorithms in the machine learning community. In particular, it is useful when you know how to create simple classifiers (possibly many different ones, using different features), and you want to combine them in an optimal way.
In Viola and Jones, each different type of weak-learner is associated to one of the 4 or 5 different Haar features you can have.

AdaBoost uses a number of training sample images (such as faces) to pick a number of good 'features'/'classifiers'. For face recognition a classifiers is typically just a rectangle of pixels that has a certain average color value and a relative size. AdaBoost will look at a number of classifiers and find out which one is the best predictor of a face based on the sample images. After it has chosen the best classifier it will continue to find another and another until some threshold is reached and those classifiers combined together will provide the end result.
This part you may not want to share with non-technical people :) but it is interesting anyway. There are several mathematical tricks which make AdaBoost fast for face recognition such as the ability to add up all the color values of an image and store them in a 2 dimensional array so that the value in any position will be the sum of all the pixels up and to the left of that position. This array can be used to very quickly calculate the average color value of any rectangle within the image by subtracting the value found in the top left corner from the value found in the bottom right corner and dividing by the number of pixels in the rectangle. Using this trick you can quickly scan over an entire image looking for rectangles of different relative sizes that match or are close to a particular color.
Hope this helps.

This is understandable. Most of the papers you can find on Internet retell Viola-Jones and Freund-Shapire papers which are foundation of AdaBoost applied for face recognition in OpenCV. And they mostly consist of difficult formulas and algorithms from several mathematical areas combined.
Here is what can help you (short enough) -
1 - It is used in object and, mostly, in face detection-recognition.The most popular and quite good C++ library is OpenCV from Intel originally. I take the part of Face detection in OpenCV, as an example.
2 - First, a cascade of boosted classifiers working with sample rectangles ("features") is trained on sample of images with faces (called positive) and without faces (negative).
From some Googled paper:
"· Boosting refers to a general and provably effective method of producing a very accurate classifier by combining rough and moderately inaccurate rules of thumb.
· It is based on the observation that finding many rough rules of thumb can be a lot easier than finding a single, highly accurate classifier.
· To begin, we define an algorithm for finding the rules of thumb, which we call a weak learner.
· The boosting algorithm repeatedly calls this weak learner, each time feeding it a different distribution over the training data (in AdaBoost).
· Each call generates a weak classifier and we must combine all of these into a single classifier that, hopefully, is much more accurate than any one of the rules."
During this process the images are scanned to determine the distinctive areas corresponding to certain part of every face. The complex calculation-hypothesis based algorithms are applied (which are not so difficult to understand once you get the main idea).
This can take a week and the output is an XML file which contains the learned information on how to quickly detect the human face, say, in frontal position on any picture (it can be any object in other case).
3 - After that you supply this file to OpenCV face detection program which runs quite fast with up to 99% positive rate (depending on conditions).
As was mentioned here, the scanning speed can be increased greatly with technique known as "integral image".
And finally, these are helpful sources - Object Detection in OpenCV and
Generic Object Detection using AdaBoost from University of California, 2008.

Novel fitness measure for evolutionary image matching simulation

I'm sure many people have already seen demos of using genetic algorithms to generate an image that matches a sample image. You start off with noise, and gradually it comes to resemble the target image more and more closely, until you have a more-or-less exact duplicate.
All of the examples I've seen, however, use a fairly straightforward pixel-by-pixel comparison, resulting in a fairly predictable 'fade in' of the final image. What I'm looking for is something more novel: A fitness measure that comes closer to what we see as 'similar' than the naive approach.
I don't have a specific result in mind - I'm just looking for something more 'interesting' than the default. Suggestions?

I assume you're talking about something like Roger Alsing's program.
I implemented a version of this, so I'm also interested in alternative fitness functions, though I'm coming at it from the perspective of improving performance rather than aesthetics. I expect there will always be some element of "fade-in" due to the nature of the evolutionary process (though tweaking the evolutionary operators may affect how this looks).
A pixel-by-pixel comparison can be expensive for anything but small images. For example, the 200x200 pixel image I use has 40,000 pixels. With three values per pixel (R, G and B), that's 120,000 values that have to be incorporated into the fitness calculation for a single image. In my implementation I scale the image down before doing the comparison so that there are fewer pixels. The trade-off is slightly reduced accuracy of the evolved image.
In investigating alternative fitness functions I came across some suggestions to use the YUV colour space instead of RGB since this is more closely aligned with human perception.
Another idea that I had was to compare only a randomly selected sample of pixels. I'm not sure how well this would work without trying it. Since the pixels compared would be different for each evaluation it would have the effect of maintaining diversity within the population.
Beyond that, you are in the realms of computer vision. I expect that these techniques, which rely on feature extraction, would be more expensive per image, but they may be faster overall if they result in fewer generations being required to achieve an acceptable result. You might want to investigate the PerceptualDiff library. Also, this page shows some Java code that can be used to compare images for similarity based on features rather than pixels.

A fitness measure that comes closer to what we see as 'similar' than the naive approach.
Implementing such a measure in software is definitely nontrivial. Google 'Human vision model', 'perceptual error metric' for some starting points. You can sidestep the issue - just present the candidate images to a human for selecting the best ones, although it might be a bit boring for the human.

I haven't seen such a demo (perhaps you could link one). But a couple proto-ideas from your desription that may trigger an interesting one:
Three different algorithms running in parallel, perhaps RGB or HSV.
Move, rotate, or otherwise change the target image slightly during the run.
Fitness based on contrast/value differences between pixels, but without knowing the actual colour.
...then "prime" a single pixel with the correct colour?

I would agree with other contributors that this is non-trivial. I'd also add that it would be very valuable commercially - for example, companies who wish to protect their visual IP would be extremely happy to be able to trawl the internet looking for similar images to their logos.
My naïve approach to this would be to train a pattern recognizer on a number of images, each generated from the target image with one or more transforms applied to it: e.g. rotated a few degrees either way; a translation a few pixels either way; different scales of the same image; various blurs and effects (convolution masks are good here). I would also add some randomness noise to the each of the images. The more samples the better.
The training can all be done off-line, so shouldn't cause a problem with runtime performance.
Once you've got a pattern recognizer trained, you can point it at the the GA population images, and get some scalar score out of the recognizers.
Personally, I like Radial Basis Networks. Quick to train. I'd start with far too many inputs, and whittle them down with principle component analysis (IIRC). The outputs could just be a similiarity measure and dissimilarity measure.
One last thing; whatever approach you go for - could you blog about it, publish the demo, whatever; let us know how you got on.

Algorithm to compare two images

Given two different image files (in whatever format I choose), I need to write a program to predict the chance if one being the illegal copy of another. The author of the copy may do stuff like rotating, making negative, or adding trivial details (as well as changing the dimension of the image).
Do you know any algorithm to do this kind of job?

These are simply ideas I've had thinking about the problem, never tried it but I like thinking about problems like this!
Before you begin
Consider normalising the pictures, if one is a higher resolution than the other, consider the option that one of them is a compressed version of the other, therefore scaling the resolution down might provide more accurate results.
Consider scanning various prospective areas of the image that could represent zoomed portions of the image and various positions and rotations. It starts getting tricky if one of the images are a skewed version of another, these are the sort of limitations you should identify and compromise on.
Matlab is an excellent tool for testing and evaluating images.
Testing the algorithms
You should test (at the minimum) a large human analysed set of test data where matches are known beforehand. If for example in your test data you have 1,000 images where 5% of them match, you now have a reasonably reliable benchmark. An algorithm that finds 10% positives is not as good as one that finds 4% of positives in our test data. However, one algorithm may find all the matches, but also have a large 20% false positive rate, so there are several ways to rate your algorithms.
The test data should attempt to be designed to cover as many types of dynamics as possible that you would expect to find in the real world.
It is important to note that each algorithm to be useful must perform better than random guessing, otherwise it is useless to us!
You can then apply your software into the real world in a controlled way and start to analyse the results it produces. This is the sort of software project which can go on for infinitum, there are always tweaks and improvements you can make, it is important to bear that in mind when designing it as it is easy to fall into the trap of the never ending project.
Colour Buckets
With two pictures, scan each pixel and count the colours. For example you might have the 'buckets':
white
red
blue
green
black
(Obviously you would have a higher resolution of counters). Every time you find a 'red' pixel, you increment the red counter. Each bucket can be representative of spectrum of colours, the higher resolution the more accurate but you should experiment with an acceptable difference rate.
Once you have your totals, compare it to the totals for a second image. You might find that each image has a fairly unique footprint, enough to identify matches.
Edge detection
How about using Edge Detection.
(source: wikimedia.org)
With two similar pictures edge detection should provide you with a usable and fairly reliable unique footprint.
Take both pictures, and apply edge detection. Maybe measure the average thickness of the edges and then calculate the probability the image could be scaled, and rescale if necessary. Below is an example of an applied Gabor Filter (a type of edge detection) in various rotations.
Compare the pictures pixel for pixel, count the matches and the non matches. If they are within a certain threshold of error, you have a match. Otherwise, you could try reducing the resolution up to a certain point and see if the probability of a match improves.
Regions of Interest
Some images may have distinctive segments/regions of interest. These regions probably contrast highly with the rest of the image, and are a good item to search for in your other images to find matches. Take this image for example:
(source: meetthegimp.org)
The construction worker in blue is a region of interest and can be used as a search object. There are probably several ways you could extract properties/data from this region of interest and use them to search your data set.
If you have more than 2 regions of interest, you can measure the distances between them. Take this simplified example:
(source: per2000.eu)
We have 3 clear regions of interest. The distance between region 1 and 2 may be 200 pixels, between 1 and 3 400 pixels, and 2 and 3 200 pixels.
Search other images for similar regions of interest, normalise the distance values and see if you have potential matches. This technique could work well for rotated and scaled images. The more regions of interest you have, the probability of a match increases as each distance measurement matches.
It is important to think about the context of your data set. If for example your data set is modern art, then regions of interest would work quite well, as regions of interest were probably designed to be a fundamental part of the final image. If however you are dealing with images of construction sites, regions of interest may be interpreted by the illegal copier as ugly and may be cropped/edited out liberally. Keep in mind common features of your dataset, and attempt to exploit that knowledge.
Morphing
Morphing two images is the process of turning one image into the other through a set of steps:
Note, this is different to fading one image into another!
There are many software packages that can morph images. It's traditionaly used as a transitional effect, two images don't morph into something halfway usually, one extreme morphs into the other extreme as the final result.
Why could this be useful? Dependant on the morphing algorithm you use, there may be a relationship between similarity of images, and some parameters of the morphing algorithm.
In a grossly over simplified example, one algorithm might execute faster when there are less changes to be made. We then know there is a higher probability that these two images share properties with each other.
This technique could work well for rotated, distorted, skewed, zoomed, all types of copied images. Again this is just an idea I have had, it's not based on any researched academia as far as I am aware (I haven't look hard though), so it may be a lot of work for you with limited/no results.
Zipping
Ow's answer in this question is excellent, I remember reading about these sort of techniques studying AI. It is quite effective at comparing corpus lexicons.
One interesting optimisation when comparing corpuses is that you can remove words considered to be too common, for example 'The', 'A', 'And' etc. These words dilute our result, we want to work out how different the two corpus are so these can be removed before processing. Perhaps there are similar common signals in images that could be stripped before compression? It might be worth looking into.
Compression ratio is a very quick and reasonably effective way of determining how similar two sets of data are. Reading up about how compression works will give you a good idea why this could be so effective. For a fast to release algorithm this would probably be a good starting point.
Transparency
Again I am unsure how transparency data is stored for certain image types, gif png etc, but this will be extractable and would serve as an effective simplified cut out to compare with your data sets transparency.
Inverting Signals
An image is just a signal. If you play a noise from a speaker, and you play the opposite noise in another speaker in perfect sync at the exact same volume, they cancel each other out.
(source: themotorreport.com.au)
Invert on of the images, and add it onto your other image. Scale it/loop positions repetitively until you find a resulting image where enough of the pixels are white (or black? I'll refer to it as a neutral canvas) to provide you with a positive match, or partial match.
However, consider two images that are equal, except one of them has a brighten effect applied to it:
(source: mcburrz.com)
Inverting one of them, then adding it to the other will not result in a neutral canvas which is what we are aiming for. However, when comparing the pixels from both original images, we can definatly see a clear relationship between the two.
I haven't studied colour for some years now, and am unsure if the colour spectrum is on a linear scale, but if you determined the average factor of colour difference between both pictures, you can use this value to normalise the data before processing with this technique.
Tree Data structures
At first these don't seem to fit for the problem, but I think they could work.
You could think about extracting certain properties of an image (for example colour bins) and generate a huffman tree or similar data structure. You might be able to compare two trees for similarity. This wouldn't work well for photographic data for example with a large spectrum of colour, but cartoons or other reduced colour set images this might work.
This probably wouldn't work, but it's an idea. The trie datastructure is great at storing lexicons, for example a dictionarty. It's a prefix tree. Perhaps it's possible to build an image equivalent of a lexicon, (again I can only think of colours) to construct a trie. If you reduced say a 300x300 image into 5x5 squares, then decompose each 5x5 square into a sequence of colours you could construct a trie from the resulting data. If a 2x2 square contains:
FFFFFF|000000|FDFD44|FFFFFF
We have a fairly unique trie code that extends 24 levels, increasing/decreasing the levels (IE reducing/increasing the size of our sub square) may yield more accurate results.
Comparing trie trees should be reasonably easy, and could possible provide effective results.
More ideas
I stumbled accross an interesting paper breif about classification of satellite imagery, it outlines:
Texture measures considered are: cooccurrence matrices, gray-level differences, texture-tone analysis, features derived from the Fourier spectrum, and Gabor filters. Some Fourier features and some Gabor filters were found to be good choices, in particular when a single frequency band was used for classification.
It may be worth investigating those measurements in more detail, although some of them may not be relevant to your data set.
Other things to consider
There are probably a lot of papers on this sort of thing, so reading some of them should help although they can be very technical. It is an extremely difficult area in computing, with many fruitless hours of work spent by many people attempting to do similar things. Keeping it simple and building upon those ideas would be the best way to go. It should be a reasonably difficult challenge to create an algorithm with a better than random match rate, and to start improving on that really does start to get quite hard to achieve.
Each method would probably need to be tested and tweaked thoroughly, if you have any information about the type of picture you will be checking as well, this would be useful. For example advertisements, many of them would have text in them, so doing text recognition would be an easy and probably very reliable way of finding matches especially when combined with other solutions. As mentioned earlier, attempt to exploit common properties of your data set.
Combining alternative measurements and techniques each that can have a weighted vote (dependant on their effectiveness) would be one way you could create a system that generates more accurate results.
If employing multiple algorithms, as mentioned at the begining of this answer, one may find all the positives but have a false positive rate of 20%, it would be of interest to study the properties/strengths/weaknesses of other algorithms as another algorithm may be effective in eliminating false positives returned from another.
Be careful to not fall into attempting to complete the never ending project, good luck!

Read the paper: Porikli, Fatih, Oncel Tuzel, and Peter Meer. “Covariance Tracking Using Model Update Based
on Means on Riemannian Manifolds”. (2006) IEEE Computer Vision and Pattern Recognition.
I was successfully able to detect overlapping regions in images captured from adjacent webcams using the technique presented in this paper. My covariance matrix was composed of Sobel, canny and SUSAN aspect/edge detection outputs, as well as the original greyscale pixels.

An idea:
use keypoint detectors to find scale- and transform- invariant descriptors of some points in the image (e.g. SIFT, SURF, GLOH, or LESH).
try to align keypoints with similar descriptors from both images (like in panorama stitching), allow for some image transforms if necessary (e.g. scale & rotate, or elastic stretching).
if many keypoints align well (exists such a transform, that keypoint alignment error is low; or transformation "energy" is low, etc.), you likely have similar images.
Step 2 is not trivial. In particular, you may need to use a smart algorithm to find the most similar keypoint on the other image. Point descriptors are usually very high-dimensional (like a hundred parameters), and there are many points to look through. kd-trees may be useful here, hash lookups don't work well.
Variants:
Detect edges or other features instead of points.

It is indeed much less simple than it seems :-) Nick's suggestion is a good one.
To get started, keep in mind that any worthwhile comparison method will essentially work by converting the images into a different form -- a form which makes it easier to pick similar features out. Usually, this stuff doesn't make for very light reading ...
One of the simplest examples I can think of is simply using the color space of each image. If two images have highly similar color distributions, then you can be reasonably sure that they show the same thing. At least, you can have enough certainty to flag it, or do more testing. Comparing images in color space will also resist things such as rotation, scaling, and some cropping. It won't, of course, resist heavy modification of the image or heavy recoloring (and even a simple hue shift will be somewhat tricky).
http://en.wikipedia.org/wiki/RGB_color_space
http://upvector.com/index.php?section=tutorials&subsection=tutorials/colorspace
Another example involves something called the Hough Transform. This transform essentially decomposes an image into a set of lines. You can then take some of the 'strongest' lines in each image and see if they line up. You can do some extra work to try and compensate for rotation and scaling too -- and in this case, since comparing a few lines is MUCH less computational work than doing the same to entire images -- it won't be so bad.
http://homepages.inf.ed.ac.uk/amos/hough.html
http://rkb.home.cern.ch/rkb/AN16pp/node122.html
http://en.wikipedia.org/wiki/Hough_transform

In the form described by you, the problem is tough. Do you consider copy, paste of part of the image into another larger image as a copy ? etc.
What we loosely refer to as duplicates can be difficult for algorithms to discern.
Your duplicates can be either:
Exact Duplicates
Near-exact Duplicates. (minor edits of image etc)
perceptual Duplicates (same content, but different view, camera etc)
No1 & 2 are easier to solve. No 3. is very subjective and still a research topic.
I can offer a solution for No1 & 2.
Both solutions use the excellent image hash- hashing library: https://github.com/JohannesBuchner/imagehash
Exact duplicates
Exact duplicates can be found using a perceptual hashing measure.
The phash library is quite good at this. I routinely use it to clean
training data.
Usage (from github site) is as simple as:
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
for img_fn in sorted(image_fns):
hash = imagehash.average_hash(Image.open(image_fn))
if hash in img_hashes:
print( '{} duplicate of {}'.format(image_fn, img_hashes[hash]) )
else:
img_hashes[hash] = image_fn
Near-Exact Duplicates
In this case you will have to set a threshold and compare the hash values for their distance from each
other. This has to be done by trial-and-error for your image content.
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
epsilon = 50
for img_fn1, img_fn2 in zip(image_fns, image_fns[::-1]):
if image_fn1 == image_fn2:
continue
hash1 = imagehash.average_hash(Image.open(image_fn1))
hash2 = imagehash.average_hash(Image.open(image_fn2))
if hash1 - hash2 < epsilon:
print( '{} is near duplicate of {}'.format(image_fn1, image_fn2) )
If you take a step-back, this is easier to solve if you watermark the master images.
You will need to use a watermarking scheme to embed a code into the image. To take a step back, as opposed to some of the low-level approaches (edge detection etc) suggested by some folks, a watermarking method is superior because:
It is resistant to Signal processing attacks
► Signal enhancement – sharpening, contrast, etc.
► Filtering – median, low pass, high pass, etc.
► Additive noise – Gaussian, uniform, etc.
► Lossy compression – JPEG, MPEG, etc.
It is resistant to Geometric attacks
► Affine transforms
► Data reduction – cropping, clipping, etc.
► Random local distortions
► Warping
Do some research on watermarking algorithms and you will be on the right path to solving your problem. (
Note: You can benchmark you method using the STIRMARK dataset. It is an accepted standard for this type of application.

This is just a suggestion, it might not work and I'm prepared to be called on this.
This will generate false positives, but hopefully not false negatives.
Resize both of the images so that they are the same size (I assume that the ratios of widths to lengths are the same in both images).
Compress a bitmap of both images with a lossless compression algorithm (e.g. gzip).
Find pairs of files that have similar file sizes. For instance, you could just sort every pair of files you have by how similar the file sizes are and retrieve the top X.
As I said, this will definitely generate false positives, but hopefully not false negatives. You can implement this in five minutes, whereas the Porikil et. al. would probably require extensive work.

I believe if you're willing to apply the approach to every possible orientation and to negative versions, a good start to image recognition (with good reliability) is to use eigenfaces: http://en.wikipedia.org/wiki/Eigenface
Another idea would be to transform both images into vectors of their components. A good way to do this is to create a vector that operates in x*y dimensions (x being the width of your image and y being the height), with the value for each dimension applying to the (x,y) pixel value. Then run a variant of K-Nearest Neighbours with two categories: match and no match. If it's sufficiently close to the original image it will fit in the match category, if not then it won't.
K Nearest Neighbours(KNN) can be found here, there are other good explanations of it on the web too: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
The benefits of KNN is that the more variants you're comparing to the original image, the more accurate the algorithm becomes. The downside is you need a catalogue of images to train the system first.

If you're willing to consider a different approach altogether to detecting illegal copies of your images, you could consider watermarking. (from 1.4)
...inserts copyright information into the digital object without the loss of quality. Whenever the copyright of a digital object is in question, this information is extracted to identify the rightful owner. It is also possible to encode the identity of the original buyer along with the identity of the copyright holder, which allows tracing of any unauthorized copies.
While it's also a complex field, there are techniques that allow the watermark information to persist through gross image alteration: (from 1.9)
... any signal transform of reasonable strength cannot remove the watermark. Hence a pirate willing to remove the watermark will not succeed unless they debase the document too much to be of commercial interest.
of course, the faq calls implementing this approach: "...very challenging" but if you succeed with it, you get a high confidence of whether the image is a copy or not, rather than a percentage likelihood.

If you're running Linux I would suggest two tools:
align_image_stack from package hugin-tools - is a commandline program that can automatically correct rotation, scaling, and other distortions (it's mostly intended for compositing HDR photography, but works for video frames and other documents too). More information: http://hugin.sourceforge.net/docs/manual/Align_image_stack.html
compare from package imagemagick - a program that can find and count the amount of different pixels in two images. Here's a neat tutorial: http://www.imagemagick.org/Usage/compare/ uising the -fuzz N% you can increase the error tolerance. The higher the N the higher the error tolerance to still count two pixels as the same.
align_image_stack should correct any offset so the compare command will actually have a chance of detecting same pixels.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio