I saw a few image processing and analysis related questions on this forum and thought I could try it for my question. I have, say, 30 two-dimensional arrays (to keep things simple, although I have a very big data set) which form 30 individual images. Many of these images have a similar base structure but differ in intensities for different pixels. Due to this intensity variation amongst pixels, some images have a prominent pattern (say a larger area of localised intense pixels, or high-intensity pixels outlining an edge). Other images just contain single high-intensity pixels randomly distributed without any prominent feature (so basically noise). I am now trying to build an algorithm which can give each image a score based on different factors, like the area fraction of high-intensity pixels or the mean and standard deviation, so that I can find the image with the most prominent pattern (in other words, rank them). But these factors depend on a common factor, i.e. a user-defined threshold, which becomes different for every image. Any input on how I can achieve this ranking or image score in an automated manner (without the use of a threshold)? I initially used Matlab to perform all the processing and area fraction calculations, but I am now using R to do the same thing.
Can some amount of machine learning / random forest stuff help me here? I am not sure. Some input would be very valuable.
P.S. If this is not the right forum to post in, any suggestions on where I can get good advice?
First of all, let me suggest a change in terminology: what you denote as a feature is usually called a pattern in image processing, while what you call a factor is usually called a feature.
I think that the main weakness of the features you are using (mean, standard deviation) is that they are based only on the statistics of single pixels (1st-order statistics) without considering correlations (neighborhood relations of pixels). If you take a highly structured image and shuffle the pixels randomly, you will still have the same 1st-order statistics.
There are many ways to take these correlations into account. A simple, efficient and therefore popular method is to apply some filters on the image first (high-pass, low-pass etc.) and then get the 1st order statistics of the resulting image. Other methods are based on Fast Fourier Transform (FFT).
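To make this concrete, here is a minimal sketch of the filter-then-statistics idea, assuming each image is a 2-D NumPy array; the Gaussian low-pass/high-pass split, the sigma value, and the function name are illustrative choices only.

import numpy as np
from scipy import ndimage

def filtered_stats(img, sigma=2.0):
    """1st-order statistics computed after low-pass and high-pass filtering."""
    img = img.astype(float)
    low = ndimage.gaussian_filter(img, sigma=sigma)   # low-pass: isolated noise pixels get averaged away
    high = img - low                                  # high-pass: edges and fine structure remain
    return {
        "low_mean": low.mean(), "low_std": low.std(),
        "high_mean": np.abs(high).mean(), "high_std": high.std(),
    }

Images whose structure survives the low-pass filter, while pure speckle noise gets flattened by it, should separate more clearly on these features than on the raw per-pixel statistics.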
Of course machine learning is also an option here. You could try convolutional neural networks for example, but I would try the simple filtering stuff first.
I was wondering how automatic font identification services (like WhatTheFont, not question-based ones like Identifont) work. The most basic variant would be a service that lets you upload an image that contains text, and the service returns the name of the font used. How is this done, and how is it done so fast as to be practical? I'm fairly new to this kind of thing, but here's my understanding so far:
Perhaps some pre-processing to reduce noise. I'm not particularly interested in this part.
First the image is run through an OCR to extract the text – simple enough.
Then you go through each of the tens or hundreds of thousands of fonts in your database, render the extracted text in each one, and see if it's close to the original, adjusting for size, alignment, kerning, different weights or italics, etc. How is this possibly fast enough to be practical?
Is this correct?
Please offer some insight into how this is done, and how it's done efficiently.
Let us assume you are doing the match in the raster representation (not on vectorized outlines).
Indeed, the text should be recognized first to reduce the number of comparisons with the characters in the reference fonts; at this stage it is important to avoid any dubious recognition, as errors here would wreak havoc.
Then a normalization stage is needed: you can transform the character position, size (and possibly the italic angle?) to a standard bounding box so that pixel-by-pixel comparison becomes possible. The amount of computation will then be proportional to the area of the characters times the number of reference fonts.
Beware that normalization in size is not fully accurate, as a large character shrunk down will differ from a smaller character in the same font in a few details and in stroke thickness. It is probably useful to consider two or three representative sizes per font.
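To make the normalisation and pixel-by-pixel comparison concrete, here is a hedged sketch assuming each glyph is available as a PIL image with a dark glyph on a light background; the 64x64 box, the mean-absolute-difference score, and the reference_fonts/render names in the usage comment are all hypothetical.

import numpy as np
from PIL import Image, ImageOps

BOX = (64, 64)

def normalise(glyph_img):
    """Crop to the glyph's bounding box and rescale it to a standard box."""
    g = ImageOps.invert(glyph_img.convert("L"))     # glyph becomes bright on a dark background
    g = g.crop(g.getbbox())                         # tight bounding box around the glyph
    return np.asarray(g.resize(BOX), dtype=float) / 255.0

def glyph_distance(a, b):
    """Mean absolute pixel difference between two normalised glyphs."""
    return float(np.abs(normalise(a) - normalise(b)).mean())

# e.g. best_font = min(reference_fonts, key=lambda f: glyph_distance(scanned_glyph, f.render("a")))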
So we need to detect whether an image, created by a scanner, represents an empty page. I'm way out of my depth when it comes to image processing, so I have to run this by the community.
Here's what I have come up with so far:
Empty pages can be glaringly white, gray recycled paper, or yellowed old paper. The current idea is to create a histogram for a page, look for a steep increase in the curve, and get the percentage of pixels that are darker than that. If that exceeds a threshold, the page is likely not empty.
Since this would likely classify a page containing a single line of text at the top as empty, we would tile the page and gather statistics about each tile.
We would need to detect scanned staples and holes from binding (likely only in certain tiles), but this can be put off to some later stage. However, if you have an idea of what to look out for besides these two, please mention it in a comment.
This needs to be fast. It's part of a document processing workflow that processes (tens of) thousands of pages per day. If processing a page takes ten seconds longer, then our customers will have to tell their customers that they'll have to wait several days longer for their results. (If this results in more false positives, some customers would rather have someone check a few dozen found "empty" pages than have their customers wait one more day.)
So here are my questions:
Is it a good idea to take this route or is there something better?
If we do it this way, how would I do this? What's a good (cheap) algorithm for finding a threshold for a page? Could we gain significant speed by assuming a similar threshold for a batch of documents? To which precision could brightness values be rounded, before getting logged? What quirks could we expect?
If you know that a scanned page is going to fill the image entirely, then calculating the standard deviation might be a good way of doing this.
I would suggest blurring the page slightly to reduce some noise, then calculating the SD for the page. In theory, a page that is more or less all one colour will have a low SD, and one with lots of text will have a higher SD. Then it's a case of 'training' the system to work out when a page is plain and when it has text. You might find that certain pages are hard for it to tell.
You could train it by having it process a vast number of pages: it goes through them all, and you say whether each one is plain or not.
EDIT
OK, a white page with black text, if we have just the page and no surrounding stuff, will have a mean colour of grey, probably a fairly light grey. Getting the average is a for loop through all the pixels, adding their values and then dividing by the number of pixels. I'm not good with this O(log N) stuff, but suffice to say, it will not take that long unless you have HUGE images.
The SD is a second for loop: this time we count up how different each pixel is from the mean and then divide by the number of pixels (strictly speaking this gives the mean absolute deviation rather than the true standard deviation, but it works just as well as a measure of spread). This will take a bit longer than the mean, as we have to do something like
diff = thispixel - mean;
if (diff < 0) {
    diff = -diff;        // absolute deviation from the mean
}
runningTotal += diff;
For a plain coloured page, each pixel will be close to the mean value, thus our SD will be low. If the SD is below a certain value, we can assume that this means the page is all one colour.
This might have problems if there is only a very minimal amount of text, as it will not have a large influence on the SD, so maybe, like you suggested in the question, break the page into sections. I suggest horizontal strips, as text tends to run this way. If we do these strips one at a time, then once one strip suggests it has text, we can stop, as we don't care whether the rest is blank or not.
Blurring the page will help reduce noise, as the odd pixel of noise will be reduced in its impact, thus give you a 'tighter' SD. You could also use it to reduce the resolution of your image.
Say your source image is 300 wide by 900 high. You could sample pixels in blocks of nine (3 x 3) and thus end up with an image that is 100 wide by 300 high, so this can actually be used to reduce the amount of calculation you need to do, in this case to a ninth!
The main problem is going to be in working out how high an SD can be with just a plain page. Maybe have it find the SD of a load of blank pages.
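Pulling the blur, the horizontal strips, and the early exit together, here is a minimal sketch assuming the page is a grayscale NumPy array; the strip height, blur sigma, and SD threshold are placeholders that would need tuning on known blank pages.

import numpy as np
from scipy import ndimage

def looks_blank(page, strip_height=64, sd_threshold=8.0, blur_sigma=1.5):
    """True if no horizontal strip shows enough spread to suggest text."""
    page = ndimage.gaussian_filter(page.astype(float), sigma=blur_sigma)  # knock down speckle noise
    for top in range(0, page.shape[0], strip_height):
        strip = page[top:top + strip_height]
        if strip.std() > sd_threshold:   # enough variation in this strip -> probably text
            return False                 # early exit, as suggested above
    return True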
By the sounds of it, you are probably going to want a middle ground that lets it be unsure and ask for human intervention, possibly letting the human's answer train the system to get better.
Perform some sort of simple edge detection. If the number of pixels constituting edges is below some threshold, then there's going to be a high probability the page is empty. This could be improved by classifying certain edges that correspond with high certainty (by shape and location) to punched holes and staples as trivial and discounting them from the metric.
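A hedged sketch of this edge-count idea using PIL's built-in FIND_EDGES kernel; the edge-strength cut-off and the pixel budget are placeholders, and the classification of hole/staple shapes is left out.

import numpy as np
from PIL import Image, ImageFilter

def probably_blank(path, edge_pixel_limit=500):
    """True if the page contains very few pronounced edge pixels."""
    edges = Image.open(path).convert("L").filter(ImageFilter.FIND_EDGES)
    strong = np.asarray(edges) > 64            # keep only pronounced edges
    return int(strong.sum()) < edge_pixel_limit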
When I worked for a document processor (~8 years ago), we handled client projects varying from very "clean" only-US-letter-sized pages to cover-/cardstock of irregular shapes mixed with normal pages. Operators fed pre-sorted files into scanning machines and only had to watch for folded corners and similar mechanical problems. Their output was multiple streams of hundreds of images corresponding to a range of files. A single scanner operator could easily scan 15k pieces of paper in a shift (that's only 0.60 pages/sec, while a scanner at speed could handle 3 pages/sec and still scan both sides). Later operators processed those looking for key pages to mark file start and end. (Image recognition can be used here, sometimes, but people also provide a quality check on the first operators.) We had many variables that could be set per client project.
I'm basing the rough outline below on that experience and how it appears that your goals and workflow are similar.
(Terminology: By client I mean our client, e.g. a specific bank. A project or client project is a set of documents from that client that contains many files, e.g. all mortgages handled by a specific branch in a given year. A file is a logical arrangement that would normally be a physical file folder for one of the client's customers, e.g. all mortgage papers for one address.)
Cut off the top, bottom, sides, and corners. Throw these out of your calculations (even though you'll probably store them in the final image). This will cover staple holes, binder holes, but also (tiny) folded corners and very minute torn edges which appear as black spots. Depending on how you're scanning, the last two may be less of a problem.
Vary the sizes of these cuts for each client project, as required. For example, even a very thin edge slice, say 1-2mm, will eliminate most ragged edges without increasing false positive rate.
Convert to black and white, 1 bit per pixel. I suspect you are already doing this for some client projects anyway, so doing this efficiently and effectively, which can be subtle, should be no extra work. (Even if you don't store the 1bpp image as the deliverable result, the conversion will be helpful in empty page detection.) Eliminate noise by dropping any black pixels with none or only one black neighbor (using all surrounding 8 as neighbors).
After cutting extremities (#1) and this simplistic noise reduction, blank pages will have a very low number of black pixels; most blanks will have no black pixels at all – exempting exceptionally poor page quality, inked stamps (when scanning back-sides, mentioned more below), or other circumstances across the whole project, and so forth.
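A minimal sketch of this clean-up step (drop black pixels with none or only one black neighbour), assuming the 1 bpp page core (extremities already cut off) is a boolean NumPy array with True for black; the final pixel budget in the usage comment is a placeholder.

import numpy as np
from scipy import ndimage

def denoise_1bpp(black):
    """Remove black pixels that have at most one black 8-neighbour."""
    kernel = np.ones((3, 3), dtype=int)
    kernel[1, 1] = 0                                            # count the 8 neighbours only
    neighbours = ndimage.convolve(black.astype(int), kernel, mode="constant")
    return black & (neighbours >= 2)

# e.g. treat the page as blank if almost nothing survives:
# is_blank = denoise_1bpp(black_core).sum() <= 5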
Depending on client project, you may set hotspots to be watched – the converse of cutting off the sides. For example, watching a 1" strip where a single line at the top of the page would appear may reduce false positives. A low contrast scan or faded hardcopy (perhaps even pencil, which can be common on back-sides) with only one line of text will be distinguished from a blank page this way.
What sections are worth watching depends on each project, but do not try to divide the page up into tiles and then subdivide those tiles into areas of interest. Instead, parallelize this on the page level; e.g. 1 worker per core, each worker handles a full page at a time.
Depending on how you're keying individual files, you may find it helpful to drop blanks (before marking start-of-file pages, which is still often a manual process even at high volume) then watch for blank pages at unexpected points after files have been keyed (e.g. expected would be the last page of the file, without being two blanks in a row, etc.).
For example, if a particular project is only scanning one side of each page, then detecting two blank pages in a row is a good indication that a couple pages in a file were flipped upside-down (clients often hand over hardcopy files like this). Either the sorters (who remove things like staples and paperclips) or the first machine operators should have caught this mistake, but, regardless, it will now need a manual check to verify.
On the other hand, there were projects that had very clean files so sorters could insert (usually colored) blank pages marking file boundaries. In this case, the second set of people still did the keying by file number, but only had to examine the first page of each file. This wasn't rare, but not common either.
Before I start rambling a bit, I hope my main point comes across: you have to decide how to mitigate rates of false positives (= data loss) and false negatives (= annoying blanks and otherwise harmless, but a maximum allowed rate may still be specified in the project contract). That varies drastically by project and the type of files/documents you're handling, but it guides you in how to do the detection. You will get much better results from a tailored approach than trying one-size-fits-all, even if the tailored approaches are 80-98% similar.
If you're delivering 1bpp images to the client, for example, you might not even want/need to eliminate blanks as filesize (and ultimately size of the delivered dataset) won't be an issue. This can be an acceptable trade-off when eliminating most blanks is harder while maintaining a low false positive rate; such as for files with inked stamps ("received on", "accepted", "due date", etc.; they bleed through to the back) or other problems, for example.
My fall class does a bunch of image-processing projects.
Here's what I would try:
Project from color to grayscale
Pour all the pixels into a simple histogram with say 100 buckets between 0 and 1
Find a local minimum in the histogram such that the absolute value of above - below is as small as possible, where above is the number of brighter pixels and below is the number of darker pixels
Force the above pixels to white and the below pixels to black
If you like, as an extra step you could remove black edges
If there are hardly any black pixels, the page is blank
The first two steps should be combined, and they are the only time-consuming steps; on a 600 dpi image you may have to touch many millions of pixels. The rest will be lightning fast. I'd be very surprised if you can't classify multiple images per second, especially if you know there will be no black edges.
The only part that requires training or experiment is the last step. It's also possible that you will need to fiddle around with the number of buckets in the histogram; if there are too many buckets, you may have a bad local minimum.
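A rough sketch of the histogram and balanced-local-minimum steps, assuming the grayscale values have already been scaled to [0, 1]; the 100 buckets follow the suggestion above, and the rest of the names are mine.

import numpy as np

def balanced_threshold(gray, buckets=100):
    """Pick the local minimum whose darker/brighter pixel counts are most balanced."""
    hist, edges = np.histogram(gray, bins=buckets, range=(0.0, 1.0))
    total = hist.sum()
    best, best_score = None, None
    for i in range(1, buckets - 1):
        if hist[i] <= hist[i - 1] and hist[i] <= hist[i + 1]:   # local minimum in the histogram
            below = hist[:i].sum()                              # darker pixels
            above = total - below - hist[i]                     # brighter pixels
            score = abs(above - below)
            if best_score is None or score < best_score:
                best, best_score = edges[i + 1], score
    return best   # None means no interior local minimum was found

# black = gray < balanced_threshold(gray); the page is blank if black.sum() is tiny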
Good luck, and report back to us how you make out!
Check out this line detection algorithm: http://homepages.inf.ed.ac.uk/rbf/HIPR2/linedet.htm. In addition to a detailed explanation of how the algo works there's a demo where you can use your own image and see the results. I tried two images: 1) a B&W scan of a receipt, 2) the B&W, "blank" back side of that same receipt. All of the edge detection algorithms I tried found edges on the "blank" page. But, this line detection algorithm was the only algorithm that correctly found lines on the front page and yet didn't find anything on the "blank" back page.
It looks as if you're trying to convert all paperwork for a company into digital documents. Some of this paper can be really old.
Say your text is black, and any other color is the background. If you take two weighted averages, one consisting of what you think is the text and one consisting of the background, you can compare those two and see if they're distant enough to consider further evaluation. This removes the effect of any uneven aging of the paper.
Staple holes and punched holes in paper are pretty standard in size, but they'd show up as gray or not at all if you're scanning on a white background. If not, then you can guess where these are and remove them.
Now, we look at areas of high interest, areas where the black pixels are the most dense. Select a portion of that and OCR it. Place the starting top-left closest to an area where text begins. On a typical document, a solid blank linear area going left-to-right and another going top-to-bottom denotes the top and left sides of a paragraph. You can be sure that you got a line of text because below a line of text is another blank left-to-right area. So you don't need to worry about selecting a portion that will chop text in half.
You could take the mean gray level (integer) of each few rows of the scanned image (depending on the resolution and how many lines are required to capture one line of text), then consider the spread of row means. If there is no text on the page, the spread of means should be small (i.e. background ranges from 250-255), and if there is text on the whole page or on part of the page, the spread would be much larger (i.e. 15 for text to 250 for background).
Seems to me like the solution should be computationally simple due to the large number of pages to check. Approaches requiring further processing (edge detection, filtering, etc) seem like overkill, and will take much longer to run.
There is no need to process pixel by pixel; using matrices will make this more efficient. For example, using NumPy you can calculate means, sums, etc. for entire rows, columns, or matrices at once much more efficiently. There is also no need to process EVERY pixel; a good sample of rows should accomplish the task with similar accuracy. 8-bit accuracy should be fine, and you could even resample to larger pixels before running this processing algorithm.
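A possible NumPy sketch of the row-mean-spread idea that samples rows rather than touching every pixel; the band height, sampling step, and the threshold in the usage comment are placeholders to tune.

import numpy as np

def row_mean_spread(gray, rows_per_band=4, sample_step=2):
    """Spread of the mean grey level across horizontal bands of the page."""
    g = gray[::sample_step].astype(float)                       # sample every other row
    usable = (g.shape[0] // rows_per_band) * rows_per_band
    bands = g[:usable].reshape(-1, rows_per_band, g.shape[1])
    means = bands.mean(axis=(1, 2))                             # one mean grey level per band
    return float(means.max() - means.min())                     # small spread -> likely blank

# e.g. blank = row_mean_spread(gray) < 20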
You can do a noisy trim, i.e. blur the image and do an auto-trim (without actually modifying the image). If the width or height of the trim result is below a threshold (e.g. 80 to 100 for a 600 dpi image) then the page is empty.
A proof of concept using the ImageMagick command line front-end:
$ convert scan.png -shave 300x0 -virtual-pixel White -blur 0x15 -fuzz 15% \
-trim info:
The above command assumes a 600 dpi DIN A4 black and white (1 Bit) image. It also ignores a margin of 300 pixels such that artifacts like perforation holes don't yield false negatives.
I'm looking to create a base table of images and then compare any new images against that to determine if the new image is an exact (or close) duplicate of the base.
For example: if you want to reduce storage of the same image 100's of times, you could store one copy of it and provide reference links to it. When a new image is entered you want to compare to an existing image to make sure it's not a duplicate ... ideas?
One idea of mine was to reduce to a small thumbnail and then randomly pick 100 pixel locations and compare.
Below are three approaches to solving this problem (and there are many others).
The first is a standard approach in computer vision, keypoint matching. This may require some background knowledge to implement, and can be slow.
The second method uses only elementary image processing, and is potentially faster than the first approach, and is straightforward to implement. However, what it gains in understandability, it lacks in robustness -- matching fails on scaled, rotated, or discolored images.
The third method is both fast and robust, but is potentially the hardest to implement.
Keypoint Matching
Better than picking 100 random points is picking 100 important points. Certain parts of an image have more information than others (particularly at edges and corners), and these are the ones you'll want to use for smart image matching. Google "keypoint extraction" and "keypoint matching" and you'll find quite a few academic papers on the subject. These days, SIFT keypoints are arguably the most popular, since they can match images under different scales, rotations, and lighting. Some SIFT implementations can be found here.
One downside to keypoint matching is the running time of a naive implementation: O(n^2m), where n is the number of keypoints in each image, and m is the number of images in the database. Some clever algorithms might find the closest match faster, like quadtrees or binary space partitioning.
Alternative solution: Histogram method
Another less robust but potentially faster solution is to build feature histograms for each image and choose the image with the histogram closest to the input image's histogram. I implemented this as an undergrad, and we used 3 color histograms (red, green, and blue) and two texture histograms: direction and scale. I'll give the details below, but I should note that this only worked well for matching images VERY similar to the database images. Re-scaled, rotated, or discolored images can fail with this method, but small changes like cropping won't break the algorithm.
Computing the color histograms is straightforward -- just pick the range for your histogram buckets, and for each range, tally the number of pixels with a color in that range. For example, consider the "green" histogram, and suppose we choose 4 buckets for our histogram: 0-63, 64-127, 128-191, and 192-255. Then for each pixel, we look at the green value, and add a tally to the appropriate bucket. When we're done tallying, we divide each bucket total by the number of pixels in the entire image to get a normalized histogram for the green channel.
For the texture direction histogram, we started by performing edge detection on the image. Each edge point has a normal vector pointing in the direction perpendicular to the edge. We quantized the normal vector's angle into one of 6 buckets between 0 and PI (since edges have 180-degree symmetry, we converted angles between -PI and 0 to be between 0 and PI). After tallying up the number of edge points in each direction, we have an un-normalized histogram representing texture direction, which we normalized by dividing each bucket by the total number of edge points in the image.
To compute the texture scale histogram, for each edge point, we measured the distance to the next-closest edge point with the same direction. For example, if edge point A has a direction of 45 degrees, the algorithm walks in that direction until it finds another edge point with a direction of 45 degrees (or within a reasonable deviation). After computing this distance for each edge point, we dump those values into a histogram and normalize it by dividing by the total number of edge points.
Now you have 5 histograms for each image. To compare two images, you take the absolute value of the difference between each histogram bucket, and then sum these values. For example, to compare images A and B, we would compute
|A.green_histogram.bucket_1 - B.green_histogram.bucket_1|
for each bucket in the green histogram, and repeat for the other histograms, and then sum up all the results. The smaller the result, the better the match. Repeat for all images in the database, and the match with the smallest result wins. You'd probably want to have a threshold, above which the algorithm concludes that no match was found.
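As an illustration of the colour-histogram part of this method (the two texture histograms are omitted), here is a small sketch using the 4-bucket example above; the function names are mine.

import numpy as np
from PIL import Image

def colour_histograms(path, buckets=4):
    """Normalised red, green, and blue histograms, concatenated."""
    rgb = np.asarray(Image.open(path).convert("RGB"))
    hists = []
    for channel in range(3):
        h, _ = np.histogram(rgb[..., channel], bins=buckets, range=(0, 256))
        hists.append(h / h.sum())                   # normalise by the number of pixels
    return np.concatenate(hists)

def histogram_distance(path_a, path_b):
    """Sum of absolute bucket differences; smaller means a better match."""
    return float(np.abs(colour_histograms(path_a) - colour_histograms(path_b)).sum())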
Third Choice - Keypoints + Decision Trees
A third approach that is probably much faster than the other two is using semantic texton forests (PDF). This involves extracting simple keypoints and using a collection of decision trees to classify the image. This is faster than simple SIFT keypoint matching because it avoids the costly matching process, and the keypoints are much simpler than SIFT, so keypoint extraction is much faster. However, it preserves the SIFT method's invariance to rotation, scale, and lighting, an important feature that the histogram method lacked.
Update:
My mistake -- the Semantic Texton Forests paper isn't specifically about image matching, but rather region labeling. The original paper that does matching is this one: Keypoint Recognition using Randomized Trees. Also, the papers below continue to develop the ideas and represent the state of the art (c. 2010):
Fast Keypoint Recognition using Random Ferns - faster and more scalable than Lepetit 06
BRIEF: Binary Robust Independent Elementary Features - less robust but very fast -- I think the goal here is real-time matching on smart phones and other handhelds
The best method I know of is to use a Perceptual Hash. There appears to be a good open source implementation of such a hash available at:
http://phash.org/
The main idea is that each image is reduced down to a small hash code or 'fingerprint' by identifying salient features in the original image file and hashing a compact representation of those features (rather than hashing the image data directly). This means that the false positive rate is much reduced compared with a simplistic approach such as reducing images down to a tiny thumbnail-sized image and comparing thumbnails.
phash offers several types of hash and can be used for images, audio or video.
This post was the starting point of my solution, lots of good ideas here, so I thought I would share my results. The main insight is that I've found a way to get around the slowness of keypoint-based image matching by exploiting the speed of phash.
For the general solution, it's best to employ several strategies. Each algorithm is best suited for certain types of image transformations and you can take advantage of that.
At the top, the fastest algorithms; at the bottom the slowest (though more accurate). You might skip the slow ones if a good match is found at the faster level.
file-hash based (md5,sha1,etc) for exact duplicates
perceptual hashing (phash) for rescaled images
feature-based (SIFT) for modified images
I am having very good results with phash. The accuracy is good for rescaled images. It is not good for (perceptually) modified images (cropped, rotated, mirrored, etc). To deal with the hashing speed we must employ a disk cache/database to maintain the hashes for the haystack.
The really nice thing about phash is that once you build your hash database (which for me is about 1000 images/sec), the searches can be very, very fast, in particular when you can hold the entire hash database in memory. This is fairly practical since a hash is only 8 bytes.
For example, if you have 1 million images it would require an array of 1 million 64-bit hash values (8 MB). On some CPUs this fits in the L2/L3 cache! In practical usage I have seen a Core i7 compare at over 1 giga-hamming-distance/sec; it is only a question of memory bandwidth to the CPU. A 1-billion-image database is practical on a 64-bit CPU (8 GB RAM needed) and searches will not exceed 1 second!
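As a rough illustration of the in-memory hamming search (a tuned C/SIMD loop is what reaches the throughput quoted above), here is a plain NumPy sketch; the names and distance cut-off are placeholders.

import numpy as np

def hamming_search(db_hashes, needle_hash, max_distance=10):
    """db_hashes: uint64 array of stored pHashes; returns indices within max_distance."""
    xor = np.bitwise_xor(db_hashes, np.uint64(needle_hash))
    # popcount via the unpacked bit view; fine for millions of hashes held in memory
    bits = np.unpackbits(xor.view(np.uint8)).reshape(len(db_hashes), 64)
    distances = bits.sum(axis=1)
    return np.nonzero(distances <= max_distance)[0]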
For modified/cropped images it would seem a transform-invariant feature/keypoint detector like SIFT is the way to go. SIFT will produce good keypoints that will detect crop/rotate/mirror etc. However, the descriptor compare is very slow compared to the hamming distance used by phash. This is a major limitation. There are a lot of compares to do, since there are at most IxJxK descriptor compares to look up one image (I = num haystack images, J = target keypoints per haystack image, K = target keypoints per needle image).
To get around the speed issue, I tried using phash around each found keypoint, using the feature size/radius to determine the sub-rectangle. The trick to making this work well, is to grow/shrink the radius to generate different sub-rect levels (on the needle image). Typically the first level (unscaled) will match however often it takes a few more. I'm not 100% sure why this works, but I can imagine it enables features that are too small for phash to work (phash scales images down to 32x32).
Another issue is that SIFT will not distribute the keypoints optimally. If there is a section of the image with a lot of edges the keypoints will cluster there and you won't get any in another area. I am using the GridAdaptedFeatureDetector in OpenCV to improve the distribution. Not sure what grid size is best, I am using a small grid (1x3 or 3x1 depending on image orientation).
You probably want to scale all the haystack images (and needle) to a smaller size prior to feature detection (I use 210px along maximum dimension). This will reduce noise in the image (always a problem for computer vision algorithms), also will focus detector on more prominent features.
For images of people, you might try face detection and use it to determine the image size to scale to and the grid size (for example largest face scaled to be 100px). The feature detector accounts for multiple scale levels (using pyramids) but there is a limitation to how many levels it will use (this is tunable of course).
The keypoint detector is probably working best when it returns less than the number of features you wanted. For example, if you ask for 400 and get 300 back, that's good. If you get 400 back every time, probably some good features had to be left out.
The needle image can have less keypoints than the haystack images and still get good results. Adding more doesn't necessarily get you huge gains, for example with J=400 and K=40 my hit rate is about 92%. With J=400 and K=400 the hit rate only goes up to 96%.
We can take advantage of the extreme speed of the hamming function to solve scaling, rotation, mirroring etc. A multiple-pass technique can be used. On each iteration, transform the sub-rectangles, re-hash, and run the search function again.
My company has about 24 million images coming in from manufacturers every month. I was looking for a fast solution to ensure that the images we upload to our catalog are new images.
I want to say that I have searched the internet far and wide to attempt to find an ideal solution. I even developed my own edge detection algorithm.
I have evaluated speed and accuracy of multiple models.
My images, which have white backgrounds, work extremely well with phashing. Like redcalx said, I recommend phash or ahash. DO NOT use MD5 hashing or any other cryptographic hash unless you want only EXACT image matches. Any resizing or manipulation that occurs between images will yield a different hash.
For phash/ahash, Check this out: imagehash
I wanted to extend redcalx's post by posting my code and my accuracy.
What I do:
from PIL import Image
from PIL import ImageFilter
import imagehash

img1 = Image.open(r"C:\yourlocation")
img2 = Image.open(r"C:\yourlocation")

# resize the larger image down to the smaller one's dimensions
if img1.width < img2.width:
    img2 = img2.resize((img1.width, img1.height))
else:
    img1 = img1.resize((img2.width, img2.height))

# a light blur suppresses noise before hashing
img1 = img1.filter(ImageFilter.BoxBlur(radius=3))
img2 = img2.filter(ImageFilter.BoxBlur(radius=3))

# hash differences (hamming distances); lower totals mean more similar images
phashvalue = imagehash.phash(img1) - imagehash.phash(img2)
ahashvalue = imagehash.average_hash(img1) - imagehash.average_hash(img2)
totalaccuracy = phashvalue + ahashvalue
Here are some of my results:
item1 item2 totalsimilarity
desk1 desk1 3
desk1 phone1 22
chair1 desk1 17
phone1 chair1 34
Hope this helps!
As cartman pointed out, you can use any kind of hash value for finding exact duplicates.
One starting point for finding close images could be here. This is a tool used by CG companies to check if revamped images are still showing essentially the same scene.
I have an idea which can work and is most likely to be very fast. You can sub-sample an image to, say, 80x60 resolution or comparable, and convert it to greyscale (after subsampling it will be faster). Process both images you want to compare this way. Then run the normalised sum of squared differences between the two images (the query image and each one from the db), or even better normalised cross-correlation, which gives a response closer to 1 if both images are similar. Then if the images are similar you can proceed to more sophisticated techniques to verify that it is the same image. Obviously this algorithm is linear in the number of images in your database, but even so it is going to be very fast, up to 10000 images per second on modern hardware.
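A small sketch of the sub-sample plus normalised cross-correlation step, assuming PIL and NumPy; the 80x60 size comes from the answer, everything else is illustrative.

import numpy as np
from PIL import Image

def tiny_gray(path, size=(80, 60)):
    """Sub-sample to a small greyscale array."""
    return np.asarray(Image.open(path).convert("L").resize(size), dtype=float)

def ncc(a, b):
    """Normalised cross-correlation; close to 1 when the images are similar."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0

# similar = ncc(tiny_gray("query.png"), tiny_gray("candidate.png")) > 0.9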
If you need invariance to rotation, then a dominant gradient can be computed for this small image, and then the whole coordinate system can be rotated to a canonical orientation; this, though, will be slower. And no, there is no invariance to scale here.
If you want something more general, or to use big databases (millions of images), then you need to look into image retrieval theory (loads of papers have appeared in the last 5 years). There are some pointers in other answers. But it might be overkill, and the suggested histogram approach will do the job. Though I would think a combination of many different fast approaches will be even better.
I believe that dropping the size of the image down to an almost icon size, say 48x48, then converting to greyscale, then taking the difference between pixels, or Delta, should work well. Because we're comparing the change in pixel color, rather than the actual pixel color, it won't matter if the image is slightly lighter or darker. Large changes will matter since pixels getting too light/dark will be lost. You can apply this across one row, or as many as you like to increase the accuracy. At most you'd have 47x47=2,209 subtractions to make in order to form a comparable Key.
Picking 100 random points could mean that similar (or occasionally even dissimilar) images would be marked as the same, which I assume is not what you want. MD5 hashes wouldn't work if the images were different formats (png, jpeg, etc), had different sizes, or had different metadata. Reducing all images to a smaller size is a good bet; doing a pixel-for-pixel comparison shouldn't take too long as long as you're using a good image library / fast language, and the size is small enough.
You could try making them tiny, then if they are the same perform another comparison on a larger size - could be a good combination of speed and accuracy...
What we loosely refer to as duplicates can be difficult for algorithms to discern.
Your duplicates can be either:
Exact Duplicates
Near-exact duplicates (minor edits of the image, etc.)
Perceptual duplicates (same content, but different view, camera, etc.)
Nos. 1 and 2 are easier to solve. No. 3 is very subjective and still a research topic.
I can offer a solution for Nos. 1 and 2.
Both solutions use the excellent imagehash hashing library: https://github.com/JohannesBuchner/imagehash
Exact duplicates
Exact duplicates can be found using a perceptual hashing measure.
The phash library is quite good at this. I routinely use it to clean training data.
Usage (from github site) is as simple as:
from PIL import Image
import imagehash

# image_fns : List of training image files
img_hashes = {}

for image_fn in sorted(image_fns):
    hash = imagehash.average_hash(Image.open(image_fn))
    if hash in img_hashes:
        print('{} duplicate of {}'.format(image_fn, img_hashes[hash]))
    else:
        img_hashes[hash] = image_fn
Near-Exact Duplicates
In this case you will have to set a threshold and compare the hash values for their distance from each other. This has to be done by trial and error for your image content.
from PIL import Image
import imagehash

# image_fns : List of training image files
epsilon = 50

for image_fn1, image_fn2 in zip(image_fns, image_fns[::-1]):
    if image_fn1 == image_fn2:
        continue
    hash1 = imagehash.average_hash(Image.open(image_fn1))
    hash2 = imagehash.average_hash(Image.open(image_fn2))
    if hash1 - hash2 < epsilon:
        print('{} is near duplicate of {}'.format(image_fn1, image_fn2))
If you have a large number of images, look into a Bloom filter, which uses multiple hashes for a probabilistic but efficient result. If the number of images is not huge, then a cryptographic hash like md5 should be sufficient.
I think it's worth adding to this a phash solution I built that we've been using for a while now: Image::PHash. It is a Perl module, but the main parts are in C. It is several times faster than phash.org and has a few extra features for DCT-based phashes.
We had dozens of millions of images already indexed on a MySQL database, so I wanted something fast and also a way to use MySQL indices (which don't work with hamming distance), which led me to use "reduced" hashes for direct matches, the module doc discusses this.
It's quite simple to use:
use Image::PHash;
my $iph1 = Image::PHash->new('file1.jpg');
my $p1 = $iph1->pHash();
my $iph2 = Image::PHash->new('file2.jpg');
my $p2 = $iph2->pHash();
my $diff = Image::PHash::diff($p1, $p2);
I made a very simple solution in PHP for comparing images several years ago. It calculates a simple hash for each image and then finds the difference. It works very nicely for cropped (or cropped with translation) versions of the same image.
First I resize the image to a small size, like 24x24 or 36x36. Then I take each column of pixels and find average R,G,B values for this column.
After each column has its own three numbers, I do two passes: the first on odd columns and the second on even ones. The first pass sums all the processed columns and then divides by their number: ([1] + [3] + [5] + ... + [N-1]) / (N/2). The second pass works in another manner: ([2] - [4] + [6] - [8] ...) / (N/2).
So now I have two numbers. As I found out experimenting, the first one is a major one: if it's far from the values of another image, they are not similar from the human point of view at all.
So, the first one represents the average brightness of the image (again, you can pay most attention to the green channel, then the red one, etc., but the default R->G->B order works just fine). The second number can be compared if the first ones are very close, and it in fact represents the overall contrast of the image: if we have some black/white pattern or any contrasty scene (lighted buildings in a city at night, for example) and if we are lucky, we will get huge numbers here if our positive members of the sum are mostly bright and the negative ones are mostly dark, or vice versa. As I want my values to be always positive, I divide by 2 and shift by 127 here.
I wrote the code in PHP in 2017, and it seems I have lost it. But I still have the screenshots:
(The screenshots, not included here, showed the scores for: the same image; a black & white version; a cropped version; a translated version of another image; and a different scene with the same colour gamut as the fourth.)
I tuned the difference thresholds so that the results are really nice. But as you can see, this simple algorithm cannot do anything good with simple scene translations.
On a side note, a modification could be written to make cropped copies of each of the two images at 75-80 percent (4 at the corners, or 8 at the corners and the middles of the edges) and then compare the cropped variants with the other whole image in the same way; if one of them gets a significantly better similarity score, use its value instead of the default one.
I would like to compare a screenshot of one application (could be a Web page) with a previously taken screenshot to determine whether the application is displaying itself correctly. I don't want an exact match comparison, because the aspect could be slightly different (in the case of a Web app, depending on the browser, some element could be at a slightly different location). It should give a measure of how similar the screenshots are.
Is there a library / tool that already does that? How would you implement it?
This depends entirely on how smart you want the algorithm to be.
For instance, here are some issues:
cropped images vs. an uncropped image
images with a text added vs. another without
mirrored images
The easiest and simplest algorithm I've seen for this is just to do the following steps to each image:
scale to something small, like 64x64 or 32x32, disregard aspect ratio, use a combining scaling algorithm instead of nearest pixel
scale the color ranges so that the darkest is black and lightest is white
rotate and flip the image so that the lightest color is top left, and then top-right is next darker, bottom-left is next darker (as far as possible of course)
Edit A combining scaling algorithm is one that when scaling 10 pixels down to one will do it using a function that takes the color of all those 10 pixels and combines them into one. Can be done with algorithms like averaging, mean-value, or more complex ones like bicubic splines.
Then calculate the mean distance pixel-by-pixel between the two images.
To look up a possible match in a database, store the pixel colors as individual columns in the database, index a bunch of them (but not all, unless you use a very small image), and do a query that uses a range for each pixel value, ie. every image where the pixel in the small image is between -5 and +5 of the image you want to look up.
This is easy to implement, and fairly fast to run, but of course won't handle most advanced differences. For that you need much more advanced algorithms.
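A hedged sketch of the first two steps plus the mean pixel distance (the canonical rotate/flip step is omitted), assuming PIL and NumPy; the 32x32 size and the use of autocontrast for the colour-range stretch are my own choices.

import numpy as np
from PIL import Image, ImageOps

def fingerprint(path, size=(32, 32)):
    """Small, contrast-normalised greyscale version of the image."""
    g = Image.open(path).convert("L").resize(size, Image.LANCZOS)  # combining scale, not nearest pixel
    g = ImageOps.autocontrast(g)                                   # darkest -> black, lightest -> white
    return np.asarray(g, dtype=float)

def mean_distance(path_a, path_b):
    """Mean per-pixel distance; small values suggest a likely match."""
    return float(np.abs(fingerprint(path_a) - fingerprint(path_b)).mean())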
The 'classic' way of measuring this is to break the image up into some canonical number of sections (say a 10x10 grid), compute a histogram of RGB values inside each cell, and compare the corresponding histograms. This type of algorithm is preferred because of both its simplicity and its invariance to scaling and (small!) translation.
Use a normalised colour histogram (read the section on applications here); they are commonly used in image retrieval/matching systems and are a standard way of matching images that is very reliable, relatively fast and very easy to implement.
Essentially a colour histogram will capture the colour distribution of the image. This can then be compared with another image to see if the colour distributions match.
This type of matching is pretty resilient to scaling (once the histogram is normalised), and rotation/shifting/movement etc.
Avoid pixel-by-pixel comparisons as if the image is rotated/shifted slightly it may lead to a large difference being reported.
Histograms would be straightforward to generate yourself (assuming you can get access to pixel values), but if you don't feel like it, the OpenCV library is a great resource for doing this kind of stuff. Here is a powerpoint presentation that shows you how to create a histogram using OpenCV.
Don't video encoding algorithms like MPEG compute the difference between each frame of a video so they can just encode the delta? You might look into how video encoding algorithms compute those frame differences.
Look at this open source image search application http://www.semanticmetadata.net/lire/. It describes several image similarity algorithms, three of which are from the MPEG-7 standard (ScalableColor, ColorLayout, EdgeHistogram), plus Auto Color Correlogram.
You could use a pure mathematical approach of O(n^2), but it will be useful only if you are certain that there's no offset or something like that. (Although if you have a few objects with homogeneous coloring, it will still work pretty well.)
Anyway, the idea is to compute the normalized dot-product of the two matrices:
C = sum(Pij*Qij)^2 / (sum(Pij^2) * sum(Qij^2))
This formula is actually the squared "cosine" of the angle between the matrices (weird).
The greater the similarity (let's say Pij = Qij), the closer C will be to 1; and if they're completely different, let's say for every i,j Qij = 1 (avoiding zero-division) and Pij = 255, then for size nxn, the bigger n is, the closer to zero we get (by rough calculation, C = 1/n^2).
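For concreteness, a tiny NumPy sketch that evaluates exactly the formula above on two equal-sized arrays; as noted, it only makes sense when the images are aligned (no offset).

import numpy as np

def normalised_dot(P, Q):
    """C = sum(P*Q)^2 / (sum(P^2) * sum(Q^2)), the formula quoted above."""
    P = P.astype(float)
    Q = Q.astype(float)
    return (P * Q).sum() ** 2 / ((P * P).sum() * (Q * Q).sum())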
You'll need pattern recognition for that. To determine small differences between two images, Hopfield nets work fairly well and are quite easy to implement. I don't know any available implementations, though.
A ruby solution can be found here
From the readme:
Phashion is a Ruby wrapper around the pHash library, "perceptual hash", which detects duplicate and near duplicate multimedia files
How to measure similarity between two images entirely depends on what you would like to measure, for example: contrast, brightness, modality, noise... and then choose the most suitable similarity measure for your case. You can choose from MAD (mean absolute difference) and MSD (mean squared difference), which are good for measuring brightness; there is also CR (correlation coefficient), which is good at representing the correlation between two images. You could also choose from histogram-based similarity measures like SDH (standard deviation of difference image histogram) or multimodality similarity measures like MI (mutual information) or NMI (normalized mutual information).
Because these similarity measures are costly in time, it is advised to scale images down before applying them.
I wonder (and I'm really just throwing the idea out there to be shot down) if something could be derived by subtracting one image from the other, compressing the resulting difference image as a JPEG or GIF, and taking the file size as a measure of similarity.
If you had two identical images, you'd get a uniform box, which would compress really well. The more the images differed, the more complex it would be to represent, and hence the less compressible.
Probably not an ideal test, and probably much slower than necessary, but it might work as a quick and dirty implementation.
You might look at the code for the open source tool findimagedupes, though it appears to have been written in Perl, so I can't say how easy it will be to parse...
Reading the findimagedupes page that I linked, I see that there is a C++ implementation of the same algorithm. Presumably this will be easier to understand.
And it appears you can also use gqview.
Well, not to answer your question directly, but I have seen this happen. Microsoft recently launched a tool called PhotoSynth which does something very similar to determine overlapping areas in a large number of pictures (which could be of different aspect ratios).
I wonder if they have any available libraries or code snippets on their blog.
To expand on Vaibhav's note, hugin is an open-source 'autostitcher' which should have some insight on the problem.
There's software for content-based image retrieval, which does (partially) what you need. All references and explanations are linked from the project site and there's also a short text book (Kindle): LIRE
You can use a Siamese network to see if the two images are similar or dissimilar, following this tutorial. The tutorial clusters similar images, whereas you can use the L2 distance to measure the similarity of two images.
Beyond Compare has pixel-by-pixel comparison for images.
If this is something you will be doing on an occasional basis and doesn't need automating, you can do it in an image editor that supports layers, such as Photoshop or Paint Shop Pro (probably GIMP or Paint.Net too, but I'm not sure about those). Open both screen shots, and put one as a layer on top of the other. Change the layer blending mode to Difference, and everything that's the same between the two will become black. You can move the top layer around to minimize any alignment differences.
Well, a really base-level method would be to go through every pixel colour and compare it with the corresponding pixel colour in the second image, but that's probably a very, very slow solution.