Obtaining motion vectors from raw video

Obtaining motion vectors from raw video - algorithm

I'd like to know if there is any good (and freely available) text, on how to obtain motion vectors of macro blocks in raw video stream. This is often used in video compression, although my application of it is not video encoding.
Code that does this is available in OSS codecs, but understanding the method by reading the code is kinda hard.
My actual goal is to determine camera motion in 2D projection space, assuming the camera is only changing it's orientation (NOT the position). What I'd like to do is divide the frames into macro blocks, obtain their motion vectors, and get the camera motion by averaging those vectors.
I guess OpenCV could help with this problem, but it's not available on my target platform.

The usual way is simple brute force: Compare a macro block to each macro block from the reference frame and use the one that gives the smallest residual error. The code gets complex primarily because this is usually the slowest part of mv-based compression, so they put a lot of work into optimizing it, often at the expense of anything even approaching readability.
Especially for real-time compression, some reduce the workload a bit by (for example) restricting the search to the original position +/- some maximum delta. This can often gain quite a bit of compression speed in exchange for a fairly small loss of compression.

If you assume only camera motion, I suspect there is something possible with analysis of the FFT of successive images. For frequencies whose amplitudes have not changed much, the phase information will indicate the camera motion. Not sure if this will help with camera rotation, but lateral and vertical motion can probably be computed. There will be difficulties due to new information appearing on one edge and disappearing on the other and I'm not sure how much that will hurt. This is speculative thinking in response to your question, so I have no proof or references :-)

Sounds like you're doing a very limited SLAM project?
Lots of reading matter at Bristol University, Imperial College, Oxford University for example - you might find their approaches to finding and matching candidate features from frame to frame of interest - much more robust than simple sums of absolute differences.

For the most low-level algorithms of this type the term you are looking for is optical flow and one of the easiest algorithms of that class is the Lucas Kanade algorithm.
This is a pretty good overview presentation that should give you plenty of ideas for an algorithm that does what you need

Related

image registration(non-rigid \ nonlinear)

I'm looking for some algorithm (preferably if source code available)
for image registration.
Image deformation can't been described by homography matrix(because I think that distortion not symmetrical and not
homogeneous),more specifically deformations are like barrel/distortion and trapezoid distortion, maybe some rotation of image.
I want to obtain pairs of pixel of two images and so i can obtain representation of "deformation field".
I google a lot and find out that there are some algorithm base on some phisics ideas, but it seems that they can converge
to local maxima, but not global.
I can affort program to be semi-automatic, it means some simple user interation.
maybe some algorithms like SIFT will be appropriate?
but I think it can't provide "deformation field" with regular sufficient density.
if it important there is no scale changes.
example of complicated field
http://www.math.ucla.edu/~yanovsky/Research/ImageRegistration/2DMRI/2DMRI_lambda400_grid_only1.png

What you are looking for is "optical flow". Searching for these terms will yield you numerous results.
In OpenCV, there is a function called calcOpticalFlowFarneback() (in the video module) that does what you want.
The C API does still have an implementation of the classic paper by Horn & Schunck (1981) called "Determining optical flow".
You can also have a look at this work I've done, along with some code (but be careful, there are still some mysterious bugs in the opencl memory code. I will release a corrected version later this year.): http://lts2www.epfl.ch/people/dangelo/opticalflow
Besides OpenCV's optical flow (and mine ;-), you can have a look at ITK on itk.org for complete image registration chains (mostly aimed at medical imaging).
There's also a lot of optical flow code (matlab, C/C++...) that can be found thanks to google, for example cs.brown.edu/~dqsun/research/software.html, gpu4vision, etc
-- EDIT : about optical flow --
Optical flow is divided in two families of algorithms : the dense ones, and the others.
Dense algorithms give one motion vector per pixel, non-dense ones one vector per tracked feature.
Examples of the dense family include Horn-Schunck and Farneback (to stay with opencv), and more generally any algorithm that will minimize some cost function over the whole images (the various TV-L1 flows, etc).
An example for the non-dense family is the KLT, which is called Lucas-Kanade in opencv.
In the dense family, since the motion for each pixel is almost free, it can deal with scale changes. Keep in mind however that these algorithms can fail in the case of large motions / scales changes because they usually rely on linearizations (Taylor expansions of the motion and image changes). Furthermore, in the variational approach, each pixel contributes to the overall result. Hence, parts that are invisible in one image are likely to deviate the algorithm from the actual solution.
Anyway, techniques such as coarse-to-fine implementations are employed to bypass these limits, and these problems have usually only a small impact. Brutal illumination changes, or large occluded / unoccluded areas can also be explicitly dealt with by some algorithms, see for example this paper that computes a sparse image of "innovation" alongside the optical flow field.

i found some software medical specific, but it's complicate and it's not work with simple image formats, but seems that it do that I need.
http://www.csd.uoc.gr/~komod/FastPD/index.html
Drop - Deformable Registration using Discrete Optimization

SURF and SIFT Alternative Object Tracking Algorithm for Augmented Reality

After asking here and trying both SURF and SIFT, neither of them seams to be efficient enough to generate interest points fast enough to track a stream from the camera.
SURF, for example, takes around 3 seconds to generate interest points for an image, that's way too slow to track a video coming from a web cam, and it'll be even worse when using it on a mobile phone.
I just need an algorithm that tracks a certain area, its scale, tilt, etc.. and I can build on top of that.
Thanks

I suspect your SURF usage may need some alteration?
Here is a link to an MIT paper on using SURF for augmented reality applications on mobile devices.
Excerpt:
In this section, we present our
implementation of the SURF al- gorithm
and its adaptation to the mobile
phone. Next, we discuss the impact
that accuracy has on the speed of the
nearest-neighbor search and show that
we can achieve an order of magnitude
speed- up with minimal impact on
matching accuracy. Finally, we dis-
cuss the details of the phone
implementation of the image matching
pipeline. We study the performance,
memory use, and bandwidth consumption
on the phone.
You might also want to look into OpenCV's algorithms because they are tried and tested.
Depending on the constraints of your application, you may be able to reduce the genericness of those algorithms to look for known POIs and markers within the image.
Part of tracking a POI is estimating its vector from one point in the 2D image to another, and then optionally confirming that it still exists there (through pixel characteristics). The same approach can be used to track (not re-scan the entire image) for POI and POI group/object perspective and rotation changes.
There are tons of papers online for tracking objects on a 2D projection (up to a servere skew in many cases).
Good Luck!

You should try FAST detector
http://svr-www.eng.cam.ac.uk/~er258/work/fast.html

We are using SURF for a project and we found OpenSURF to outmatch OpenCV's SURF implementation in raw speed and performance. We still haven´t tested repeatability and accuracy, but it is way faster.
Update:
I just wanted to point out that you needn't perform a SURF match step in each frame, you could simply do it every other frame and interpolate the position of the object in the frame you don't execute SURF on.

You can use a simpler algorithm if you would make stricter restrictions on the area you would like to be tracked. As you surely know, ARToolKit is pretty fast, but only tracks black and white markers with a very distinct frame.
If you want a (somewhat) general purpose tracker, you may want to check PTAM. The site (http://www.robots.ox.ac.uk/~gk/PTAM/) is currently down, but here's a snazzy video of it working on an iPhone (http://www.youtube.com/watch?v=pBI5HwitBX4)

As others have mentioned, three seconds seems unusually long. While testing the SURF implementation in the Mahotas library, I found that it took on average 0.36sec, even with some fairly large images (e.g. 1024x768). And that's with a mix of Python and C, so I'd imagine some other pure-C implementations would be even faster.

I found this nice comparison of each feature detection algorithms at http://computer-vision-talks.com/2011/01/comparison-of-the-opencvs-feature-detection-algorithms-2/
Have a look. It might be useful!
According to that comparison, and as mirror2image has also suggested, FAST is the best choice. But it depends on what you really want to achieve.

One option I've used in constrained embedded systems is to use a simpler interest point detector: FAST or Shi-Tomasi for example. I used Shi-Tomasi, as I was targetting an FPGA and could easily run it at pixel rate with no significant buffering required.
Then use SURF to generate the descriptors for the image patch around the identified features and use those for matching and tracking purposes.

Novel fitness measure for evolutionary image matching simulation

I'm sure many people have already seen demos of using genetic algorithms to generate an image that matches a sample image. You start off with noise, and gradually it comes to resemble the target image more and more closely, until you have a more-or-less exact duplicate.
All of the examples I've seen, however, use a fairly straightforward pixel-by-pixel comparison, resulting in a fairly predictable 'fade in' of the final image. What I'm looking for is something more novel: A fitness measure that comes closer to what we see as 'similar' than the naive approach.
I don't have a specific result in mind - I'm just looking for something more 'interesting' than the default. Suggestions?

I assume you're talking about something like Roger Alsing's program.
I implemented a version of this, so I'm also interested in alternative fitness functions, though I'm coming at it from the perspective of improving performance rather than aesthetics. I expect there will always be some element of "fade-in" due to the nature of the evolutionary process (though tweaking the evolutionary operators may affect how this looks).
A pixel-by-pixel comparison can be expensive for anything but small images. For example, the 200x200 pixel image I use has 40,000 pixels. With three values per pixel (R, G and B), that's 120,000 values that have to be incorporated into the fitness calculation for a single image. In my implementation I scale the image down before doing the comparison so that there are fewer pixels. The trade-off is slightly reduced accuracy of the evolved image.
In investigating alternative fitness functions I came across some suggestions to use the YUV colour space instead of RGB since this is more closely aligned with human perception.
Another idea that I had was to compare only a randomly selected sample of pixels. I'm not sure how well this would work without trying it. Since the pixels compared would be different for each evaluation it would have the effect of maintaining diversity within the population.
Beyond that, you are in the realms of computer vision. I expect that these techniques, which rely on feature extraction, would be more expensive per image, but they may be faster overall if they result in fewer generations being required to achieve an acceptable result. You might want to investigate the PerceptualDiff library. Also, this page shows some Java code that can be used to compare images for similarity based on features rather than pixels.

A fitness measure that comes closer to what we see as 'similar' than the naive approach.
Implementing such a measure in software is definitely nontrivial. Google 'Human vision model', 'perceptual error metric' for some starting points. You can sidestep the issue - just present the candidate images to a human for selecting the best ones, although it might be a bit boring for the human.

I haven't seen such a demo (perhaps you could link one). But a couple proto-ideas from your desription that may trigger an interesting one:
Three different algorithms running in parallel, perhaps RGB or HSV.
Move, rotate, or otherwise change the target image slightly during the run.
Fitness based on contrast/value differences between pixels, but without knowing the actual colour.
...then "prime" a single pixel with the correct colour?

I would agree with other contributors that this is non-trivial. I'd also add that it would be very valuable commercially - for example, companies who wish to protect their visual IP would be extremely happy to be able to trawl the internet looking for similar images to their logos.
My naïve approach to this would be to train a pattern recognizer on a number of images, each generated from the target image with one or more transforms applied to it: e.g. rotated a few degrees either way; a translation a few pixels either way; different scales of the same image; various blurs and effects (convolution masks are good here). I would also add some randomness noise to the each of the images. The more samples the better.
The training can all be done off-line, so shouldn't cause a problem with runtime performance.
Once you've got a pattern recognizer trained, you can point it at the the GA population images, and get some scalar score out of the recognizers.
Personally, I like Radial Basis Networks. Quick to train. I'd start with far too many inputs, and whittle them down with principle component analysis (IIRC). The outputs could just be a similiarity measure and dissimilarity measure.
One last thing; whatever approach you go for - could you blog about it, publish the demo, whatever; let us know how you got on.

Finding an interesting frame in a video

Does anyone know of an algorithm that I could use to find an "interesting" representative thumbnail for a video?
I have say 30 bitmaps and I would like to choose the most representative one as the video thumbnail.
The obvious first step would be eliminate all black frames. Then perhaps look for the "distance" between the various frames and choose something that is close to the avg.
Any ideas here or published papers that could help out?

If the video contains structure, i.e. several shots, then the standard techniques for video summarisation involve (a) shot detection, then (b) use the first, mid, or nth frame to represent each shot. See [1].
However, let us assume you wish to find an interesting frame in a single continuous stream of frames taken from a single camera source. I.e. a shot. This is the "key frame detection" problem that is widely discussed in IR/CV (Information Retrieval, Computer Vision) texts. Some illustrative approaches:
In [2] a mean colour histogram is computed for all frames and the key-frame is that with the closest histogram. I.e. we select the best frame in terms of it's colour distribution.
In [3] we assume that camera stillness is an indicator of frame importance. As suggested by Beds, above. We pick the still frames using optic-flow and use that.
In [4] each frame is projected into some high dimensional content space, we find those frames at the corners of the space and use them to represent the video.
In [5] frames are evaluated for importance using their length and novelty in content space.
In general, this is a large field and there are lots of approaches. You can look at the academic conferences such as The International Conference on Image and Video Retrieval (CIVR) for the latest ideas. I find that [6] presents a useful detailed summary of video abstraction (key-frame detection and summarisation).
For your "find the best of 30 bitmaps" problem I would use an approach like [2]. Compute a frame representation space (e.g. a colour histogram for the frame), compute a histogram to represent all frames, and use the frame with the minimum distance between the two (e.g. pick a distance metric that's best for your space. I would try Earth Mover's Distance).
M.S. Lew. Principles of Visual Information Retrieval. Springer Verlag, 2001.
B. Gunsel, Y. Fu, and A.M. Tekalp. Hierarchical temporal video segmentation and content characterization. Multimedia Storage and Archiving Systems II, SPIE, 3229:46-55, 1997.
W. Wolf. Key frame selection by motion analysis. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1228-1231, 1996.
L. Zhao, W. Qi, S.Z. Li, S.Q. Yang, and H.J. Zhang. Key-frame extraction and shot retrieval using Nearest Feature Line. In IW-MIR, ACM MM, pages 217-220, 2000.
S. Uchihashi. Video Manga: Generating semantically meaningful video summaries.
In Proc. ACM Multimedia 99, Orlando, FL, Nov., pages 383-292, 1999.
Y. Li, T. Zhang, and D. Tretter. An overview of video abstraction techniques. Technical report, HP Laboratory, July 2001.

You asked for papers so I found a few. If you are not on campus or on VPN connection to campus these papers might be hard to reach.
PanoramaExcerpts: extracting and packing panoramas for video browsing
http://portal.acm.org/citation.cfm?id=266396
This one explains a method for generating a comicbook style keyframe representation.
Abstract:
This paper presents methods for automatically creating pictorial video summaries that resem- ble comic books. The relative importance of video segments is computed from their length and novelty. Image and audio analysis is used to automatically detect and emphasize mean- ingful events. Based on this importance mea- sure, we choose relevant keyframes. Selected keyframes are sized by importance, and then efficiently packed into a pictorial summary. We present a quantitative measure of how well a summary captures the salient events in a video, and show how it can be used to improve our summaries. The result is a compact and visually pleasing summary that captures semantically important events, and is suitable for printing or Web access. Such a summary can be further enhanced by including text cap- tions derived from OCR or other methods. We describe how the automatically generated sum- maries are used to simplify access to a large collection of videos.
Automatic extraction of representative keyframes based on scenecontent
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=751008
Abstract:
Generating indices for movies is a tedious and expensive process which we seek to automate. While algorithms for finding scene boundaries are readily available, there has been little work performed on selecting individual frames to concisely represent the scene. In this paper we present novel algorithms for automated selection of representative keyframes, based on scene content. Detailed description of several algorithms is followed by an analysis of how well humans feel the selected frames represent the scene. Finally we address how these algorithms can be integrated with existing algorithms for finding scene boundaries.

I think you should only look at key frames.
If the video is not encoded using a compression which is based on key frames, you create an algorithm based on the following article: Key frame selection by motion analysis.
Depending on the compression of the video you can have key frames every 2 seconds or 30 seconds. Than I think you should use the algorithm in the article to find the "most" keyframe out of all the key frames.

It may also be beneficial to favor frames that are aesthetically pleasing. That is, look for common attributes of photography-- aspect ratio, contrast, balance, etc.
It would be hard to find a representative shot if you don't know what you're looking for. But with some heuristics and my suggestion, at least you could come up with something good looking.

I worked on a project recently where we did some video processing, and we used OpenCV to do the heavy lifting as far as video processing was concerned. We had to extract frames, calculate differences, extract faces, etc. OpenCV has some built-in algorithms that will calculate differences between frames. It works with a variety of video and image formats.

Wow, what a great question - I guess a second step would be to iteratively remove frames where there's little or no change between it and it's successors. But all you're really doing there is reducing the set of potentially interesting frames. How exactly you determine "interestingness" is the special sauce I suppose as you don't have the user interaction statistics to rely on like Flickr does.

Directors will sometimes linger on a particularly 'insteresting' or beautiful shot so how about finding a 5 second section that doesn't change and then eliminating those sections that are almost black?

Algorithm to compare two images

Given two different image files (in whatever format I choose), I need to write a program to predict the chance if one being the illegal copy of another. The author of the copy may do stuff like rotating, making negative, or adding trivial details (as well as changing the dimension of the image).
Do you know any algorithm to do this kind of job?

These are simply ideas I've had thinking about the problem, never tried it but I like thinking about problems like this!
Before you begin
Consider normalising the pictures, if one is a higher resolution than the other, consider the option that one of them is a compressed version of the other, therefore scaling the resolution down might provide more accurate results.
Consider scanning various prospective areas of the image that could represent zoomed portions of the image and various positions and rotations. It starts getting tricky if one of the images are a skewed version of another, these are the sort of limitations you should identify and compromise on.
Matlab is an excellent tool for testing and evaluating images.
Testing the algorithms
You should test (at the minimum) a large human analysed set of test data where matches are known beforehand. If for example in your test data you have 1,000 images where 5% of them match, you now have a reasonably reliable benchmark. An algorithm that finds 10% positives is not as good as one that finds 4% of positives in our test data. However, one algorithm may find all the matches, but also have a large 20% false positive rate, so there are several ways to rate your algorithms.
The test data should attempt to be designed to cover as many types of dynamics as possible that you would expect to find in the real world.
It is important to note that each algorithm to be useful must perform better than random guessing, otherwise it is useless to us!
You can then apply your software into the real world in a controlled way and start to analyse the results it produces. This is the sort of software project which can go on for infinitum, there are always tweaks and improvements you can make, it is important to bear that in mind when designing it as it is easy to fall into the trap of the never ending project.
Colour Buckets
With two pictures, scan each pixel and count the colours. For example you might have the 'buckets':
white
red
blue
green
black
(Obviously you would have a higher resolution of counters). Every time you find a 'red' pixel, you increment the red counter. Each bucket can be representative of spectrum of colours, the higher resolution the more accurate but you should experiment with an acceptable difference rate.
Once you have your totals, compare it to the totals for a second image. You might find that each image has a fairly unique footprint, enough to identify matches.
Edge detection
How about using Edge Detection.
(source: wikimedia.org)
With two similar pictures edge detection should provide you with a usable and fairly reliable unique footprint.
Take both pictures, and apply edge detection. Maybe measure the average thickness of the edges and then calculate the probability the image could be scaled, and rescale if necessary. Below is an example of an applied Gabor Filter (a type of edge detection) in various rotations.
Compare the pictures pixel for pixel, count the matches and the non matches. If they are within a certain threshold of error, you have a match. Otherwise, you could try reducing the resolution up to a certain point and see if the probability of a match improves.
Regions of Interest
Some images may have distinctive segments/regions of interest. These regions probably contrast highly with the rest of the image, and are a good item to search for in your other images to find matches. Take this image for example:
(source: meetthegimp.org)
The construction worker in blue is a region of interest and can be used as a search object. There are probably several ways you could extract properties/data from this region of interest and use them to search your data set.
If you have more than 2 regions of interest, you can measure the distances between them. Take this simplified example:
(source: per2000.eu)
We have 3 clear regions of interest. The distance between region 1 and 2 may be 200 pixels, between 1 and 3 400 pixels, and 2 and 3 200 pixels.
Search other images for similar regions of interest, normalise the distance values and see if you have potential matches. This technique could work well for rotated and scaled images. The more regions of interest you have, the probability of a match increases as each distance measurement matches.
It is important to think about the context of your data set. If for example your data set is modern art, then regions of interest would work quite well, as regions of interest were probably designed to be a fundamental part of the final image. If however you are dealing with images of construction sites, regions of interest may be interpreted by the illegal copier as ugly and may be cropped/edited out liberally. Keep in mind common features of your dataset, and attempt to exploit that knowledge.
Morphing
Morphing two images is the process of turning one image into the other through a set of steps:
Note, this is different to fading one image into another!
There are many software packages that can morph images. It's traditionaly used as a transitional effect, two images don't morph into something halfway usually, one extreme morphs into the other extreme as the final result.
Why could this be useful? Dependant on the morphing algorithm you use, there may be a relationship between similarity of images, and some parameters of the morphing algorithm.
In a grossly over simplified example, one algorithm might execute faster when there are less changes to be made. We then know there is a higher probability that these two images share properties with each other.
This technique could work well for rotated, distorted, skewed, zoomed, all types of copied images. Again this is just an idea I have had, it's not based on any researched academia as far as I am aware (I haven't look hard though), so it may be a lot of work for you with limited/no results.
Zipping
Ow's answer in this question is excellent, I remember reading about these sort of techniques studying AI. It is quite effective at comparing corpus lexicons.
One interesting optimisation when comparing corpuses is that you can remove words considered to be too common, for example 'The', 'A', 'And' etc. These words dilute our result, we want to work out how different the two corpus are so these can be removed before processing. Perhaps there are similar common signals in images that could be stripped before compression? It might be worth looking into.
Compression ratio is a very quick and reasonably effective way of determining how similar two sets of data are. Reading up about how compression works will give you a good idea why this could be so effective. For a fast to release algorithm this would probably be a good starting point.
Transparency
Again I am unsure how transparency data is stored for certain image types, gif png etc, but this will be extractable and would serve as an effective simplified cut out to compare with your data sets transparency.
Inverting Signals
An image is just a signal. If you play a noise from a speaker, and you play the opposite noise in another speaker in perfect sync at the exact same volume, they cancel each other out.
(source: themotorreport.com.au)
Invert on of the images, and add it onto your other image. Scale it/loop positions repetitively until you find a resulting image where enough of the pixels are white (or black? I'll refer to it as a neutral canvas) to provide you with a positive match, or partial match.
However, consider two images that are equal, except one of them has a brighten effect applied to it:
(source: mcburrz.com)
Inverting one of them, then adding it to the other will not result in a neutral canvas which is what we are aiming for. However, when comparing the pixels from both original images, we can definatly see a clear relationship between the two.
I haven't studied colour for some years now, and am unsure if the colour spectrum is on a linear scale, but if you determined the average factor of colour difference between both pictures, you can use this value to normalise the data before processing with this technique.
Tree Data structures
At first these don't seem to fit for the problem, but I think they could work.
You could think about extracting certain properties of an image (for example colour bins) and generate a huffman tree or similar data structure. You might be able to compare two trees for similarity. This wouldn't work well for photographic data for example with a large spectrum of colour, but cartoons or other reduced colour set images this might work.
This probably wouldn't work, but it's an idea. The trie datastructure is great at storing lexicons, for example a dictionarty. It's a prefix tree. Perhaps it's possible to build an image equivalent of a lexicon, (again I can only think of colours) to construct a trie. If you reduced say a 300x300 image into 5x5 squares, then decompose each 5x5 square into a sequence of colours you could construct a trie from the resulting data. If a 2x2 square contains:
FFFFFF|000000|FDFD44|FFFFFF
We have a fairly unique trie code that extends 24 levels, increasing/decreasing the levels (IE reducing/increasing the size of our sub square) may yield more accurate results.
Comparing trie trees should be reasonably easy, and could possible provide effective results.
More ideas
I stumbled accross an interesting paper breif about classification of satellite imagery, it outlines:
Texture measures considered are: cooccurrence matrices, gray-level differences, texture-tone analysis, features derived from the Fourier spectrum, and Gabor filters. Some Fourier features and some Gabor filters were found to be good choices, in particular when a single frequency band was used for classification.
It may be worth investigating those measurements in more detail, although some of them may not be relevant to your data set.
Other things to consider
There are probably a lot of papers on this sort of thing, so reading some of them should help although they can be very technical. It is an extremely difficult area in computing, with many fruitless hours of work spent by many people attempting to do similar things. Keeping it simple and building upon those ideas would be the best way to go. It should be a reasonably difficult challenge to create an algorithm with a better than random match rate, and to start improving on that really does start to get quite hard to achieve.
Each method would probably need to be tested and tweaked thoroughly, if you have any information about the type of picture you will be checking as well, this would be useful. For example advertisements, many of them would have text in them, so doing text recognition would be an easy and probably very reliable way of finding matches especially when combined with other solutions. As mentioned earlier, attempt to exploit common properties of your data set.
Combining alternative measurements and techniques each that can have a weighted vote (dependant on their effectiveness) would be one way you could create a system that generates more accurate results.
If employing multiple algorithms, as mentioned at the begining of this answer, one may find all the positives but have a false positive rate of 20%, it would be of interest to study the properties/strengths/weaknesses of other algorithms as another algorithm may be effective in eliminating false positives returned from another.
Be careful to not fall into attempting to complete the never ending project, good luck!

Read the paper: Porikli, Fatih, Oncel Tuzel, and Peter Meer. “Covariance Tracking Using Model Update Based
on Means on Riemannian Manifolds”. (2006) IEEE Computer Vision and Pattern Recognition.
I was successfully able to detect overlapping regions in images captured from adjacent webcams using the technique presented in this paper. My covariance matrix was composed of Sobel, canny and SUSAN aspect/edge detection outputs, as well as the original greyscale pixels.

An idea:
use keypoint detectors to find scale- and transform- invariant descriptors of some points in the image (e.g. SIFT, SURF, GLOH, or LESH).
try to align keypoints with similar descriptors from both images (like in panorama stitching), allow for some image transforms if necessary (e.g. scale & rotate, or elastic stretching).
if many keypoints align well (exists such a transform, that keypoint alignment error is low; or transformation "energy" is low, etc.), you likely have similar images.
Step 2 is not trivial. In particular, you may need to use a smart algorithm to find the most similar keypoint on the other image. Point descriptors are usually very high-dimensional (like a hundred parameters), and there are many points to look through. kd-trees may be useful here, hash lookups don't work well.
Variants:
Detect edges or other features instead of points.

It is indeed much less simple than it seems :-) Nick's suggestion is a good one.
To get started, keep in mind that any worthwhile comparison method will essentially work by converting the images into a different form -- a form which makes it easier to pick similar features out. Usually, this stuff doesn't make for very light reading ...
One of the simplest examples I can think of is simply using the color space of each image. If two images have highly similar color distributions, then you can be reasonably sure that they show the same thing. At least, you can have enough certainty to flag it, or do more testing. Comparing images in color space will also resist things such as rotation, scaling, and some cropping. It won't, of course, resist heavy modification of the image or heavy recoloring (and even a simple hue shift will be somewhat tricky).
http://en.wikipedia.org/wiki/RGB_color_space
http://upvector.com/index.php?section=tutorials&subsection=tutorials/colorspace
Another example involves something called the Hough Transform. This transform essentially decomposes an image into a set of lines. You can then take some of the 'strongest' lines in each image and see if they line up. You can do some extra work to try and compensate for rotation and scaling too -- and in this case, since comparing a few lines is MUCH less computational work than doing the same to entire images -- it won't be so bad.
http://homepages.inf.ed.ac.uk/amos/hough.html
http://rkb.home.cern.ch/rkb/AN16pp/node122.html
http://en.wikipedia.org/wiki/Hough_transform

In the form described by you, the problem is tough. Do you consider copy, paste of part of the image into another larger image as a copy ? etc.
What we loosely refer to as duplicates can be difficult for algorithms to discern.
Your duplicates can be either:
Exact Duplicates
Near-exact Duplicates. (minor edits of image etc)
perceptual Duplicates (same content, but different view, camera etc)
No1 & 2 are easier to solve. No 3. is very subjective and still a research topic.
I can offer a solution for No1 & 2.
Both solutions use the excellent image hash- hashing library: https://github.com/JohannesBuchner/imagehash
Exact duplicates
Exact duplicates can be found using a perceptual hashing measure.
The phash library is quite good at this. I routinely use it to clean
training data.
Usage (from github site) is as simple as:
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
for img_fn in sorted(image_fns):
hash = imagehash.average_hash(Image.open(image_fn))
if hash in img_hashes:
print( '{} duplicate of {}'.format(image_fn, img_hashes[hash]) )
else:
img_hashes[hash] = image_fn
Near-Exact Duplicates
In this case you will have to set a threshold and compare the hash values for their distance from each
other. This has to be done by trial-and-error for your image content.
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
epsilon = 50
for img_fn1, img_fn2 in zip(image_fns, image_fns[::-1]):
if image_fn1 == image_fn2:
continue
hash1 = imagehash.average_hash(Image.open(image_fn1))
hash2 = imagehash.average_hash(Image.open(image_fn2))
if hash1 - hash2 < epsilon:
print( '{} is near duplicate of {}'.format(image_fn1, image_fn2) )
If you take a step-back, this is easier to solve if you watermark the master images.
You will need to use a watermarking scheme to embed a code into the image. To take a step back, as opposed to some of the low-level approaches (edge detection etc) suggested by some folks, a watermarking method is superior because:
It is resistant to Signal processing attacks
► Signal enhancement – sharpening, contrast, etc.
► Filtering – median, low pass, high pass, etc.
► Additive noise – Gaussian, uniform, etc.
► Lossy compression – JPEG, MPEG, etc.
It is resistant to Geometric attacks
► Affine transforms
► Data reduction – cropping, clipping, etc.
► Random local distortions
► Warping
Do some research on watermarking algorithms and you will be on the right path to solving your problem. (
Note: You can benchmark you method using the STIRMARK dataset. It is an accepted standard for this type of application.

This is just a suggestion, it might not work and I'm prepared to be called on this.
This will generate false positives, but hopefully not false negatives.
Resize both of the images so that they are the same size (I assume that the ratios of widths to lengths are the same in both images).
Compress a bitmap of both images with a lossless compression algorithm (e.g. gzip).
Find pairs of files that have similar file sizes. For instance, you could just sort every pair of files you have by how similar the file sizes are and retrieve the top X.
As I said, this will definitely generate false positives, but hopefully not false negatives. You can implement this in five minutes, whereas the Porikil et. al. would probably require extensive work.

I believe if you're willing to apply the approach to every possible orientation and to negative versions, a good start to image recognition (with good reliability) is to use eigenfaces: http://en.wikipedia.org/wiki/Eigenface
Another idea would be to transform both images into vectors of their components. A good way to do this is to create a vector that operates in x*y dimensions (x being the width of your image and y being the height), with the value for each dimension applying to the (x,y) pixel value. Then run a variant of K-Nearest Neighbours with two categories: match and no match. If it's sufficiently close to the original image it will fit in the match category, if not then it won't.
K Nearest Neighbours(KNN) can be found here, there are other good explanations of it on the web too: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
The benefits of KNN is that the more variants you're comparing to the original image, the more accurate the algorithm becomes. The downside is you need a catalogue of images to train the system first.

If you're willing to consider a different approach altogether to detecting illegal copies of your images, you could consider watermarking. (from 1.4)
...inserts copyright information into the digital object without the loss of quality. Whenever the copyright of a digital object is in question, this information is extracted to identify the rightful owner. It is also possible to encode the identity of the original buyer along with the identity of the copyright holder, which allows tracing of any unauthorized copies.
While it's also a complex field, there are techniques that allow the watermark information to persist through gross image alteration: (from 1.9)
... any signal transform of reasonable strength cannot remove the watermark. Hence a pirate willing to remove the watermark will not succeed unless they debase the document too much to be of commercial interest.
of course, the faq calls implementing this approach: "...very challenging" but if you succeed with it, you get a high confidence of whether the image is a copy or not, rather than a percentage likelihood.

If you're running Linux I would suggest two tools:
align_image_stack from package hugin-tools - is a commandline program that can automatically correct rotation, scaling, and other distortions (it's mostly intended for compositing HDR photography, but works for video frames and other documents too). More information: http://hugin.sourceforge.net/docs/manual/Align_image_stack.html
compare from package imagemagick - a program that can find and count the amount of different pixels in two images. Here's a neat tutorial: http://www.imagemagick.org/Usage/compare/ uising the -fuzz N% you can increase the error tolerance. The higher the N the higher the error tolerance to still count two pixels as the same.
align_image_stack should correct any offset so the compare command will actually have a chance of detecting same pixels.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio