How do font ID algorithms work? - algorithm

I was wondering how automatic font identification services (like WhatTheFont, not question-based ones like Identifont) work. The most basic variant would be a service that lets you upload an image that contains text, and the service returns the name of the font used. How is this done, and how is it done so fast as to be practical? I'm fairly new to this kind of thing, but here's my understanding so far:
Perhaps some pre-processing to reduce noise. I'm not particularly interested in this part.
First the image is run through an OCR to extract the text – simple enough.
Then you go through every font in the tens/hundreds-of-thousands in your database and render the text you have extracted in each one, seeing if it's close to the original. Adjusting for size, alignment, kerning, different weights or italics, etc. How is this possibly fast enough to be practical?
Is this correct?
Please offer some insight into how this is done, and how it's done efficiently.

Let us assume you are doing the match in the raster representation (not on vectorized outlines).
Indeed, the text should be recognized first to reduce the number of comparisons with the characters in the reference fonts; at this stage it matters to avoid any dubious recognition as this would wreak havoc.
Then a stage of normalization is needed: you can transform the character position, size (and possibly italics angle ?) to a standard bounding box so that pixel-by-pixel comparison becomes possible. Then the amount of computation will be proportional to the area of the characters times the number of reference fonts.
Beware that normalization in size is not fully accurate as big character shrunk will differ from a smaller character in the same font in a few details and stroke thickness. It is probably useful to consider two or three representative sizes per font.

Related

How to compare two contours? (font comparison)

I'm trying to analyse two contours and give a percent corresponding to their similarity. Assuming I have all the point's coordinates describing these contours (just like an SVG path), based on which factor should I tell they're almost identical ?
After some Google searches, I found something related to Fourier descriptors, are they relevant for my case ?
Edit
What I want to do is to compare several fonts to another one. Just like would do What the font, but not with an image. Thanks to the produced algorithm, it would be possible to find a font equivalent according to the similarity percentage.
Some scripts just compare the bounding box for each letters, but it's not enough. I need a way to tell that Arial is closest to Verdana than to Webdings. So assuming I can extract the contour from the fonts, I need a way to compare two contours.
For example (with "logical" percent values):
there are two basic ways to approach the general problem (font matching): symbolic and statistical. a good solution will probably combine both in some way.
a symbolic approach uses your knowledge of the problem in a direct way. for example, you can make a list of the things you (as an intelligent human) would use to characterise fonts. the kind of questions that identifont uses. this approach means writing routines that are smart enough to detect the various properties (eg stroke width, whether certain loops are closed, existence of serifs, etc) plus a decision tree (or "rule engine") that puts the yes/no/unsure answers together and comes up with an answer.
the statistical approach sounds more like what you were thinking about, and is probably how what the font works. here the idea is to find some general properties and use those as weights to find a "best" selection. for example, if you have lots of fonts then you can train a neural net (input being pixels at some sample resolution). there you don't need to know "how" the net decides - just that given enough training data it will find a way to do so. or you could just look at the sum of all the dark pixels - that would likely give you results similar to your percentages above.
this sounds simple, but often it's not so easy to find simple statistical measurements that show differences well in all the ways you want.
so then there's a large middle ground between the two. the idea being that if you can pull in some of the ideas from the first group then you can make the approaches in the second much more efficient. while the simplest neural net approach is "all in one" (it includes the calculations and the decisions) you can separate those out. so instead of just giving the net a bunch of pixels you can give it more "meaningful" inputs - things that you know help detect between different fonts. things like stroke width, or the number of "holes" in the character. you can also add some smarts to remove things that might otherwise confuse results - for example, pre-scaling to the same height (if you have a full font set then you can scale everything so that the height of a lowercase "m", say, is constant).
fourier descriptors are a way of characterising the "outside shape" of something and so could be used as an input to a statistical approach as i've described above. in the example you give the fourier descriptors will pick up the "spikiness" of the serifs in the lower G, and so would indicate that it is very different from the G on the left. but they care much less about stroke width and nothing at all about scale (magnification/zoom) (which can be a good or bad thing - if you're being given random letters of different sizes, you don't want to be sensitive to size, but if you've normalized to a standard "m" for an entire alphabet then you certainly do want to include that). since the output is just a spectrum you can compare different letters by cross-correlation of use something like PCA to categorize different types of letter.
other ideas would be 2d cross-correlation (the maximum of the normalised correlation gives you some idea of how similar two things are) or simply seeing what fraction of pixels are common in both letters.
as the comments say, this is a huge problem (and i am not an expert - the above is just random bullshit from being an interested bystander).
but, to finally answer your question, if what you have is an outline, then a fourier descriptor would be a good place to start. since that focuses on shape rather than "weight" i would combine that with something like total area enclosed by the outline. then write some code to calculate those and see what numbers you get for some example alphabets. if it seems to distinguish some letters, but not others, then look for some other measurements that would help in those cases. you will likely end up combining quite a few approaches to get something both fast and reliable.
alternatively, if you just want something simple, try using some easy-to-measure values like height, width, total number of pixels "inside" the contours, how many strokes you cross along vertical or horizontal lines, etc. combining a bunch of those could get you something "good enough" for some purposes, if you aren't comfortable with the maths involved in fourier transforms etc.
Have you considered using a neural network based approach? This paper uses a Self-Organizing Tree map to perform content based image retrieval. With a good training set, it should be possible to create a multilayer network (or SOM) that can give you an accurate similarity measure.

Detecting empty pages in scanned documents

So we need to detect whether an image, created by a scanner, represents an empty page. I'm way out of my depth when it comes to image processing, so I have to run this by the community.
Here's what I have come up with so far:
Empty pages can be glaringly white, gray recycled paper, or yellowed old paper. The current idea is to create a histogram for a page, look for a steep increase of the curve, and get the percentage of pixels are darker than that. If that exceeds a threshold, the page is likely not empty.
Since this would likely classify a page containing a single line of text at the top as empty, we would tile the page and gather statistics about each tile.
We would need to detect scanned staplers and holes from binding (likely only in certain tiles), but this can be put off to some later stage. However, if you have an idea what to look out for besides these two, please mention it in a comment.
This needs to be fast. It's part of a document processing workflow that processes (tens of) thousands of pages per day. If processing a page takes ten seconds longer, than our customers will have to tell their customers that they'll have to wait several days longer for their results. (If this results in more false positives, some customers would rather have someone check a few dozen found "empty" pages, than have their customer wait one more day.)
So here's my questions:
Is it a good idea to take this route or is there something better?
If we do it this way, how would I do this? What's a good (cheap) algorithm for finding a threshold for a page? Could we gain significant speed by assuming a similar threshold for a batch of documents? To which precision could brightness values be rounded, before getting logged? What quirks could we expect?
If you know that a scanned page is going to fill the image entirely, then calculating the standard deviation might be a good way of doing this.
I would suggest blurring page slightly to reduce some noise. Then calculate the SD for the page, in theory, a page the is more or less all one colour will have a low SD and one with lots of text will have a higher SD. Then it's a case of 'training' the system to work out when a page is plain and when it is text. You might find that certain pages are hard for it to tell.
You could have it trained by having it process a vast number of pages, and it goes through them all, and you say if it is plain or not.
EDIT
ok, a white page with black text, if we have just the page and no surrounding stuff, will have a mean colour of grey, probably a fairly light grey. Getting the average is a for loop through all the pixels, adding their values and then dividing by the number of pixels. I'm not good with this o(logN) stuff, but suffice to say, it will not that long. Unless you have HUGE images.
SD is a second for loop, this time we are counting up how different each pixel is from the mean, and then dividing by the mean. This will take a bit longer then the mean, as we have to do something like
diff = thispixel - mean;
if(diff < 0) {
diff = -diff;
}
runningTotal += diff;
For a plain coloured page, each pixel will be close to the mean value, thus our SD will be low. If the SD is below a certain value, we can assume that this means the page is all one colour.
This might have problems if their is very minimal amount of text, as it will not have a large influence on the SD, so maybe like you suggested in the question, break the page into sections. I suggest strips horizontally, as text tends to go this way. If we do one of these strips one at a time, once one strip suggests it has text, we can stop as we don't care if the rest is blank or not.
Blurring the page will help reduce noise, as the odd pixel of noise will be reduced in its impact, thus give you a 'tighter' SD. You could also use it to reduce the resolution of your image.
Say you sauce image is 300 wide by 900 high, you could sample pixels in blocks of nine, 3 *3, and thus end up with an image that is 100 wide by 300 high, so it can actually be used to reduce the amount of calculations you need to do, in this case by a ninth!
The main problem is going to be in working out how high an SD can be with just a plain page. Maybe have it find the SD of a load of blank pages.
By the sounds of it, you are probably going to want to have a middle ground that lets it be unsure and ask for human intervention, possibly letting the human value train the system to get better?
Perform some sort of simple edge detection. If the number of pixels constituting edges is below some threshold, then there's going to be a high probability the page is empty. This could be improved by classifying certain edges that correspond with high certainty (by shape and location) to punched holes and staples as trivial and discounting them from the metric.
When I worked for a document processor (~8 years ago), we handled client projects varying from very "clean" only-US-letter-sized pages to cover-/cardstock of irregular shapes mixed with normal pages. Operators fed pre-sorted files into scanning machines and only had to watch for folded corners and similar mechanical problems. Their output was multiple streams of hundreds of images corresponding to a range of files. A single scanner operator could easily scan 15k pieces of paper in a shift (that's only 0.60 pages/sec, while a scanner at speed could handle 3 pages/sec and still scan both sides). Later operators processed those looking for key pages to mark file start and end. (Image recognition can be used here, sometimes, but people also provide a quality check on the first operators.) We had many variables that could be set per client project.
I'm basing the rough outline below on that experience and how it appears that your goals and workflow are similar.
(Terminology: By client I mean our client, e.g. a specific bank. A project or client project is a set of documents from that client that contains many files, e.g. all mortgages handled by a specific branch in a given year. A file is a logical arrangement that would normally be a physical file folder for one of the client's customers, e.g. all mortgage papers for one address.)
Cut off the top, bottom, sides, and corners. Throw these out of your calculations (even though you'll probably store them in the final image). This will cover staple holes, binder holes, but also (tiny) folded corners and very minute torn edges which appear as black spots. Depending on how you're scanning, the last two may be less of a problem.
Vary the sizes of these cuts for each client project, as required. For example, even a very thin edge slice, say 1-2mm, will eliminate most ragged edges without increasing false positive rate.
Convert to black and white, 1 bit per pixel. I suspect you are already doing this for some client projects anyway, so doing this efficiently and effectively, which can be subtle, should be no extra work. (Even if you don't store the 1bpp image as the deliverable result, the conversion will be helpful in empty page detection.) Eliminate noise by dropping any black pixels with none or only one black neighbor (using all surrounding 8 as neighbors).
After cutting extremities (#1) and this simplistic noise reduction, blank pages will have a very low number of black pixels; most blanks will have no black pixels at all – exempting exceptionally poor page quality, inked stamps (when scanning back-sides, mentioned more below), or other circumstances across the whole project, and so forth.
Depending on client project, you may set hotspots to be watched – the converse of cutting off the sides. For example, watching a 1" strip where a single line at the top of the page would appear may reduce false positives. A low contrast scan or faded hardcopy (perhaps even pencil, which can be common on back-sides) with only one line of text will be distinguished from a blank page this way.
What sections are worth watching depends on each project, but do not try to divide the page up into tiles and then subdivide those tiles into areas of interest. Instead, parallelize this on the page level; e.g. 1 worker per core, each worker handles a full page at a time.
Depending on how you're keying individual files, you may find it helpful to drop blanks (before marking start-of-file pages, which is still often a manual process even at high volume) then watch for blank pages at unexpected points after files have been keyed (e.g. expected would be the last page of the file, without being two blanks in a row, etc.).
For example, if a particular project is only scanning one side of each page, then detecting two blank pages in a row is a good indication that a couple pages in a file were flipped upside-down (clients often hand over hardcopy files like this). Either the sorters (who remove things like staples and paperclips) or the first machine operators should have caught this mistake, but, regardless, it will now need a manual check to verify.
On the other hand, there were projects that had very clean files so sorters could insert (usually colored) blank pages marking file boundaries. In this case, the second set of people still did the keying by file number, but only had to examine the first page of each file. This wasn't rare, but not common either.
Before I start rambling a bit, I hope my main point comes across: you have to decide how to mitigate rates of false positives (= data loss) and false negatives (= annoying blanks and otherwise harmless, but a maximum allowed rate may still be specified in the project contract). That varies drastically by project and the type of files/documents you're handling, but it guides you in how to do the detection. You will get much better results from a tailored approach than trying one-size-fits-all, even if the tailored approaches are 80-98% similar.
If you're delivering 1bpp images to the client, for example, you might not even want/need to eliminate blanks as filesize (and ultimately size of the delivered dataset) won't be an issue. This can be an acceptable trade-off when eliminating most blanks is harder while maintaining a low false positive rate; such as for files with inked stamps ("received on", "accepted", "due date", etc.; they bleed through to the back) or other problems, for example.
My fall class does a bunch of image-processing projects.
Here's what I would try:
Project from color to grayscale
Pour all the pixels into a simple histogram with say 100 buckets between 0 and 1
Find a local minimum in the histogram such that the absoluete value of above - below is as small as possible, where above is the number of brighter pixels and below is the number of darker pixels
Force the above pixels to white and the below pixels to black
If you like, as an extra step you could remove black edges
If there are hardly any black pixels, the page is blank
The first two steps should be combined, and they are the only time-consuming steps; on a 600dpi images you may have to touch many millions of pixels. The rest will be lightning fast. I'd be very surprised if you can't classify multiple images per second—especially if you know there will be no black edges.
The only part that requires training or experiment is the last step. It's also possible that you will need to fiddle around with the number of buckets in the histogram; if there are too many buckets, you may have a bad local minimum.
Good luck, and report back to us how you make out!
Check out this line detection algorithm: http://homepages.inf.ed.ac.uk/rbf/HIPR2/linedet.htm. In addition to a detailed explanation of how the algo works there's a demo where you can use your own image and see the results. I tried two images: 1) a B&W scan of a receipt, 2) the B&W, "blank" back side of that same receipt. All of the edge detection algorithms I tried found edges on the "blank" page. But, this line detection algorithm was the only algorithm that correctly found lines on the front page and yet didn't find anything on the "blank" back page.
It looks as if you're trying to convert all paperwork for a company into digital documents. Some of this paper can be really old.
Say your text is black, and any other color is the background. If you take two weighted averages, one consisting of what you think is the text, and one consisting of the background, you can compare those two and see if they're distant enough to consider further evaluation. This will removing any uneven aging of the paper.
Staple holes and punched holes in paper are pretty standard in size, but they'd show up as gray or not at all if you're scanning on a white background. If not, then you can guess where these are and remove them.
Now, we look at areas of high interest, areas where the black pixels are the most dense. Select a portion of that and OCR it. Place the starting top-left closest to an area where text begins. On a typical document, a solid blank linear area going left-to-right and another going top-to-bottom denotes the top and left sides of a paragraph. You can be sure that you got a line of text because below a line of text is another blank left-to-right area. So you don't need to worry about selecting a portion that will chop text in half.
You could take the mean gray level (integer) of each few rows of the scanned image (depending on the resolution and how many lines are required to capture one line of text), then consider the spread of row means. If there is no text on the page, the spread of means should be small (i.e. background ranges from 250-255), and if there is text on the whole page or on part of the page, the spread would be much larger (i.e. 15 for text to 250 for background).
Seems to me like the solution should be computationally simple due to the large number of pages to check. Approaches requiring further processing (edge detection, filtering, etc) seem like overkill, and will take much longer to run.
There is no need to process pixel by pixel, using matrices will help this be more efficient, for example using Numpy you can calculate means, sums, etc. for entire rows, columns or matrices at once much more efficiently. There is also no need to process EVERY pixel, a good sample of rows should be able to accomplish the task with similar accuracy. 8bit accuracy should be fine, and you could even resample to large pixels before running this processing algorithm.
You can do a noisy trim, i.e. blur the image and do an auto-trim (without actually modifying the image). If the width or height of the trim result is below a threshold (e.g. 80 to 100 for a 600 dpi image) then the page is empty.
A proof of concept using the ImageMagick command line front-end:
$ convert scan.png -shave 300x0 -virtual-pixel White -blur 0x15 -fuzz 15% \
-trim info:
The above command assumes a 600 dpi DIN A4 black and white (1 Bit) image. It also ignores a margin of 300 pixels such that artifacts like perforation holes don't yield false negatives.

Classifying type samples from image files

Which approach would you suggest for automatically classifying type found in images? The samples are likely large, with black text on a white background.
The categories are defined here, with some examples on each (Google Books link): http://bit.ly/9Mnu7P This is an extended version of the VOX-ATypI classification system.
My initial thoughts on this were to train the system with lots of single character samples from each category, but I'm wondering if there's a better way that would eliminate the need to do the comparison one letter at a time.
First, you need to extract features for classification. Typefaces are generally distinguished by the thickness of lines, the presence of serifs, "circularity" of character parts. Thus, the possible features are:
The fraction of the number of black pixels on the fixed area.
Try to apply math morphology erosion few times (and/or use different masks) and compute this fraction
Compute the mean compactness of a character: perimeter^2 / area
After applying erosion, count the number of connected components for a character
Compute the elongation and other image moments, also the direction
etc
I see two options here: either compute mean features for all characters, or try to classify letters first, and than classify the font based on some specific letters (so, you train the different classifier for a different letter). It's hard to say which one is better in your case.
As for specific learning algorithm, Random Forest seems to be a good place to start. There's an implementation in the OpenCV library.

Good compression algorithm for low entropy image

I am currently trying to further compress a very simple image. The image uses 2 sets of colors as well as 1 character per "pixel". each set of color may be 1 of 16 options. Because of this I have already combined both colors into 1 byte per pixel representing both of them. I already implemented MTF and BWT encoding methods to assist in RLE. I am positive I can get some more compression out of it however I am not sure what algorithm to use. I have tried huffman however because of the fact the image tends to be small already and RLE compresses most of it due to the lack of entropy, huffman half the time increases the size by adding its decoding table to the file. Please note this will also be run on a slower system so any really heavy algorithms may not work either.
First off, it sounds like you should compress the background and character color images separately. Second, you say that "the colors don't change too often from pixel to pixel". Are some colors "closer" to each other than others? I.e., when color changes from color x, is it more likely to change to a small subset of the remaining colors? If so, you can map the colors to be more adjacent to those they are likely to change to, and taking differences before coding. Then runs of the same color become runs of zeros, and changes to the "next" color become ones.
Once you have a good representation as a series of bytes with lots of runs and a skewed probability of occurrence of bytes values, e.g. lots of zeros and one, then apply zlib or gzip to take advantage of the apparent redundancy and skew.

Algorithm to compare two images

Given two different image files (in whatever format I choose), I need to write a program to predict the chance if one being the illegal copy of another. The author of the copy may do stuff like rotating, making negative, or adding trivial details (as well as changing the dimension of the image).
Do you know any algorithm to do this kind of job?
These are simply ideas I've had thinking about the problem, never tried it but I like thinking about problems like this!
Before you begin
Consider normalising the pictures, if one is a higher resolution than the other, consider the option that one of them is a compressed version of the other, therefore scaling the resolution down might provide more accurate results.
Consider scanning various prospective areas of the image that could represent zoomed portions of the image and various positions and rotations. It starts getting tricky if one of the images are a skewed version of another, these are the sort of limitations you should identify and compromise on.
Matlab is an excellent tool for testing and evaluating images.
Testing the algorithms
You should test (at the minimum) a large human analysed set of test data where matches are known beforehand. If for example in your test data you have 1,000 images where 5% of them match, you now have a reasonably reliable benchmark. An algorithm that finds 10% positives is not as good as one that finds 4% of positives in our test data. However, one algorithm may find all the matches, but also have a large 20% false positive rate, so there are several ways to rate your algorithms.
The test data should attempt to be designed to cover as many types of dynamics as possible that you would expect to find in the real world.
It is important to note that each algorithm to be useful must perform better than random guessing, otherwise it is useless to us!
You can then apply your software into the real world in a controlled way and start to analyse the results it produces. This is the sort of software project which can go on for infinitum, there are always tweaks and improvements you can make, it is important to bear that in mind when designing it as it is easy to fall into the trap of the never ending project.
Colour Buckets
With two pictures, scan each pixel and count the colours. For example you might have the 'buckets':
white
red
blue
green
black
(Obviously you would have a higher resolution of counters). Every time you find a 'red' pixel, you increment the red counter. Each bucket can be representative of spectrum of colours, the higher resolution the more accurate but you should experiment with an acceptable difference rate.
Once you have your totals, compare it to the totals for a second image. You might find that each image has a fairly unique footprint, enough to identify matches.
Edge detection
How about using Edge Detection.
(source: wikimedia.org)
With two similar pictures edge detection should provide you with a usable and fairly reliable unique footprint.
Take both pictures, and apply edge detection. Maybe measure the average thickness of the edges and then calculate the probability the image could be scaled, and rescale if necessary. Below is an example of an applied Gabor Filter (a type of edge detection) in various rotations.
Compare the pictures pixel for pixel, count the matches and the non matches. If they are within a certain threshold of error, you have a match. Otherwise, you could try reducing the resolution up to a certain point and see if the probability of a match improves.
Regions of Interest
Some images may have distinctive segments/regions of interest. These regions probably contrast highly with the rest of the image, and are a good item to search for in your other images to find matches. Take this image for example:
(source: meetthegimp.org)
The construction worker in blue is a region of interest and can be used as a search object. There are probably several ways you could extract properties/data from this region of interest and use them to search your data set.
If you have more than 2 regions of interest, you can measure the distances between them. Take this simplified example:
(source: per2000.eu)
We have 3 clear regions of interest. The distance between region 1 and 2 may be 200 pixels, between 1 and 3 400 pixels, and 2 and 3 200 pixels.
Search other images for similar regions of interest, normalise the distance values and see if you have potential matches. This technique could work well for rotated and scaled images. The more regions of interest you have, the probability of a match increases as each distance measurement matches.
It is important to think about the context of your data set. If for example your data set is modern art, then regions of interest would work quite well, as regions of interest were probably designed to be a fundamental part of the final image. If however you are dealing with images of construction sites, regions of interest may be interpreted by the illegal copier as ugly and may be cropped/edited out liberally. Keep in mind common features of your dataset, and attempt to exploit that knowledge.
Morphing
Morphing two images is the process of turning one image into the other through a set of steps:
Note, this is different to fading one image into another!
There are many software packages that can morph images. It's traditionaly used as a transitional effect, two images don't morph into something halfway usually, one extreme morphs into the other extreme as the final result.
Why could this be useful? Dependant on the morphing algorithm you use, there may be a relationship between similarity of images, and some parameters of the morphing algorithm.
In a grossly over simplified example, one algorithm might execute faster when there are less changes to be made. We then know there is a higher probability that these two images share properties with each other.
This technique could work well for rotated, distorted, skewed, zoomed, all types of copied images. Again this is just an idea I have had, it's not based on any researched academia as far as I am aware (I haven't look hard though), so it may be a lot of work for you with limited/no results.
Zipping
Ow's answer in this question is excellent, I remember reading about these sort of techniques studying AI. It is quite effective at comparing corpus lexicons.
One interesting optimisation when comparing corpuses is that you can remove words considered to be too common, for example 'The', 'A', 'And' etc. These words dilute our result, we want to work out how different the two corpus are so these can be removed before processing. Perhaps there are similar common signals in images that could be stripped before compression? It might be worth looking into.
Compression ratio is a very quick and reasonably effective way of determining how similar two sets of data are. Reading up about how compression works will give you a good idea why this could be so effective. For a fast to release algorithm this would probably be a good starting point.
Transparency
Again I am unsure how transparency data is stored for certain image types, gif png etc, but this will be extractable and would serve as an effective simplified cut out to compare with your data sets transparency.
Inverting Signals
An image is just a signal. If you play a noise from a speaker, and you play the opposite noise in another speaker in perfect sync at the exact same volume, they cancel each other out.
(source: themotorreport.com.au)
Invert on of the images, and add it onto your other image. Scale it/loop positions repetitively until you find a resulting image where enough of the pixels are white (or black? I'll refer to it as a neutral canvas) to provide you with a positive match, or partial match.
However, consider two images that are equal, except one of them has a brighten effect applied to it:
(source: mcburrz.com)
Inverting one of them, then adding it to the other will not result in a neutral canvas which is what we are aiming for. However, when comparing the pixels from both original images, we can definatly see a clear relationship between the two.
I haven't studied colour for some years now, and am unsure if the colour spectrum is on a linear scale, but if you determined the average factor of colour difference between both pictures, you can use this value to normalise the data before processing with this technique.
Tree Data structures
At first these don't seem to fit for the problem, but I think they could work.
You could think about extracting certain properties of an image (for example colour bins) and generate a huffman tree or similar data structure. You might be able to compare two trees for similarity. This wouldn't work well for photographic data for example with a large spectrum of colour, but cartoons or other reduced colour set images this might work.
This probably wouldn't work, but it's an idea. The trie datastructure is great at storing lexicons, for example a dictionarty. It's a prefix tree. Perhaps it's possible to build an image equivalent of a lexicon, (again I can only think of colours) to construct a trie. If you reduced say a 300x300 image into 5x5 squares, then decompose each 5x5 square into a sequence of colours you could construct a trie from the resulting data. If a 2x2 square contains:
FFFFFF|000000|FDFD44|FFFFFF
We have a fairly unique trie code that extends 24 levels, increasing/decreasing the levels (IE reducing/increasing the size of our sub square) may yield more accurate results.
Comparing trie trees should be reasonably easy, and could possible provide effective results.
More ideas
I stumbled accross an interesting paper breif about classification of satellite imagery, it outlines:
Texture measures considered are: cooccurrence matrices, gray-level differences, texture-tone analysis, features derived from the Fourier spectrum, and Gabor filters. Some Fourier features and some Gabor filters were found to be good choices, in particular when a single frequency band was used for classification.
It may be worth investigating those measurements in more detail, although some of them may not be relevant to your data set.
Other things to consider
There are probably a lot of papers on this sort of thing, so reading some of them should help although they can be very technical. It is an extremely difficult area in computing, with many fruitless hours of work spent by many people attempting to do similar things. Keeping it simple and building upon those ideas would be the best way to go. It should be a reasonably difficult challenge to create an algorithm with a better than random match rate, and to start improving on that really does start to get quite hard to achieve.
Each method would probably need to be tested and tweaked thoroughly, if you have any information about the type of picture you will be checking as well, this would be useful. For example advertisements, many of them would have text in them, so doing text recognition would be an easy and probably very reliable way of finding matches especially when combined with other solutions. As mentioned earlier, attempt to exploit common properties of your data set.
Combining alternative measurements and techniques each that can have a weighted vote (dependant on their effectiveness) would be one way you could create a system that generates more accurate results.
If employing multiple algorithms, as mentioned at the begining of this answer, one may find all the positives but have a false positive rate of 20%, it would be of interest to study the properties/strengths/weaknesses of other algorithms as another algorithm may be effective in eliminating false positives returned from another.
Be careful to not fall into attempting to complete the never ending project, good luck!
Read the paper: Porikli, Fatih, Oncel Tuzel, and Peter Meer. “Covariance Tracking Using Model Update Based
on Means on Riemannian Manifolds”. (2006) IEEE Computer Vision and Pattern Recognition.
I was successfully able to detect overlapping regions in images captured from adjacent webcams using the technique presented in this paper. My covariance matrix was composed of Sobel, canny and SUSAN aspect/edge detection outputs, as well as the original greyscale pixels.
An idea:
use keypoint detectors to find scale- and transform- invariant descriptors of some points in the image (e.g. SIFT, SURF, GLOH, or LESH).
try to align keypoints with similar descriptors from both images (like in panorama stitching), allow for some image transforms if necessary (e.g. scale & rotate, or elastic stretching).
if many keypoints align well (exists such a transform, that keypoint alignment error is low; or transformation "energy" is low, etc.), you likely have similar images.
Step 2 is not trivial. In particular, you may need to use a smart algorithm to find the most similar keypoint on the other image. Point descriptors are usually very high-dimensional (like a hundred parameters), and there are many points to look through. kd-trees may be useful here, hash lookups don't work well.
Variants:
Detect edges or other features instead of points.
It is indeed much less simple than it seems :-) Nick's suggestion is a good one.
To get started, keep in mind that any worthwhile comparison method will essentially work by converting the images into a different form -- a form which makes it easier to pick similar features out. Usually, this stuff doesn't make for very light reading ...
One of the simplest examples I can think of is simply using the color space of each image. If two images have highly similar color distributions, then you can be reasonably sure that they show the same thing. At least, you can have enough certainty to flag it, or do more testing. Comparing images in color space will also resist things such as rotation, scaling, and some cropping. It won't, of course, resist heavy modification of the image or heavy recoloring (and even a simple hue shift will be somewhat tricky).
http://en.wikipedia.org/wiki/RGB_color_space
http://upvector.com/index.php?section=tutorials&subsection=tutorials/colorspace
Another example involves something called the Hough Transform. This transform essentially decomposes an image into a set of lines. You can then take some of the 'strongest' lines in each image and see if they line up. You can do some extra work to try and compensate for rotation and scaling too -- and in this case, since comparing a few lines is MUCH less computational work than doing the same to entire images -- it won't be so bad.
http://homepages.inf.ed.ac.uk/amos/hough.html
http://rkb.home.cern.ch/rkb/AN16pp/node122.html
http://en.wikipedia.org/wiki/Hough_transform
In the form described by you, the problem is tough. Do you consider copy, paste of part of the image into another larger image as a copy ? etc.
What we loosely refer to as duplicates can be difficult for algorithms to discern.
Your duplicates can be either:
Exact Duplicates
Near-exact Duplicates. (minor edits of image etc)
perceptual Duplicates (same content, but different view, camera etc)
No1 & 2 are easier to solve. No 3. is very subjective and still a research topic.
I can offer a solution for No1 & 2.
Both solutions use the excellent image hash- hashing library: https://github.com/JohannesBuchner/imagehash
Exact duplicates
Exact duplicates can be found using a perceptual hashing measure.
The phash library is quite good at this. I routinely use it to clean
training data.
Usage (from github site) is as simple as:
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
for img_fn in sorted(image_fns):
hash = imagehash.average_hash(Image.open(image_fn))
if hash in img_hashes:
print( '{} duplicate of {}'.format(image_fn, img_hashes[hash]) )
else:
img_hashes[hash] = image_fn
Near-Exact Duplicates
In this case you will have to set a threshold and compare the hash values for their distance from each
other. This has to be done by trial-and-error for your image content.
from PIL import Image
import imagehash
# image_fns : List of training image files
img_hashes = {}
epsilon = 50
for img_fn1, img_fn2 in zip(image_fns, image_fns[::-1]):
if image_fn1 == image_fn2:
continue
hash1 = imagehash.average_hash(Image.open(image_fn1))
hash2 = imagehash.average_hash(Image.open(image_fn2))
if hash1 - hash2 < epsilon:
print( '{} is near duplicate of {}'.format(image_fn1, image_fn2) )
If you take a step-back, this is easier to solve if you watermark the master images.
You will need to use a watermarking scheme to embed a code into the image. To take a step back, as opposed to some of the low-level approaches (edge detection etc) suggested by some folks, a watermarking method is superior because:
It is resistant to Signal processing attacks
► Signal enhancement – sharpening, contrast, etc.
► Filtering – median, low pass, high pass, etc.
► Additive noise – Gaussian, uniform, etc.
► Lossy compression – JPEG, MPEG, etc.
It is resistant to Geometric attacks
► Affine transforms
► Data reduction – cropping, clipping, etc.
► Random local distortions
► Warping
Do some research on watermarking algorithms and you will be on the right path to solving your problem. (
Note: You can benchmark you method using the STIRMARK dataset. It is an accepted standard for this type of application.
This is just a suggestion, it might not work and I'm prepared to be called on this.
This will generate false positives, but hopefully not false negatives.
Resize both of the images so that they are the same size (I assume that the ratios of widths to lengths are the same in both images).
Compress a bitmap of both images with a lossless compression algorithm (e.g. gzip).
Find pairs of files that have similar file sizes. For instance, you could just sort every pair of files you have by how similar the file sizes are and retrieve the top X.
As I said, this will definitely generate false positives, but hopefully not false negatives. You can implement this in five minutes, whereas the Porikil et. al. would probably require extensive work.
I believe if you're willing to apply the approach to every possible orientation and to negative versions, a good start to image recognition (with good reliability) is to use eigenfaces: http://en.wikipedia.org/wiki/Eigenface
Another idea would be to transform both images into vectors of their components. A good way to do this is to create a vector that operates in x*y dimensions (x being the width of your image and y being the height), with the value for each dimension applying to the (x,y) pixel value. Then run a variant of K-Nearest Neighbours with two categories: match and no match. If it's sufficiently close to the original image it will fit in the match category, if not then it won't.
K Nearest Neighbours(KNN) can be found here, there are other good explanations of it on the web too: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
The benefits of KNN is that the more variants you're comparing to the original image, the more accurate the algorithm becomes. The downside is you need a catalogue of images to train the system first.
If you're willing to consider a different approach altogether to detecting illegal copies of your images, you could consider watermarking. (from 1.4)
...inserts copyright information into the digital object without the loss of quality. Whenever the copyright of a digital object is in question, this information is extracted to identify the rightful owner. It is also possible to encode the identity of the original buyer along with the identity of the copyright holder, which allows tracing of any unauthorized copies.
While it's also a complex field, there are techniques that allow the watermark information to persist through gross image alteration: (from 1.9)
... any signal transform of reasonable strength cannot remove the watermark. Hence a pirate willing to remove the watermark will not succeed unless they debase the document too much to be of commercial interest.
of course, the faq calls implementing this approach: "...very challenging" but if you succeed with it, you get a high confidence of whether the image is a copy or not, rather than a percentage likelihood.
If you're running Linux I would suggest two tools:
align_image_stack from package hugin-tools - is a commandline program that can automatically correct rotation, scaling, and other distortions (it's mostly intended for compositing HDR photography, but works for video frames and other documents too). More information: http://hugin.sourceforge.net/docs/manual/Align_image_stack.html
compare from package imagemagick - a program that can find and count the amount of different pixels in two images. Here's a neat tutorial: http://www.imagemagick.org/Usage/compare/ uising the -fuzz N% you can increase the error tolerance. The higher the N the higher the error tolerance to still count two pixels as the same.
align_image_stack should correct any offset so the compare command will actually have a chance of detecting same pixels.

Resources