similar images search solution - image

I've got a really big problem with my image storage server.
There are about 2,000,000 product images on it and the number keeps increasing, but a lot of them are very similar. For example, there are iPad photos in many similar sizes: 120 * 120, 118 * 120, 131 * 125, etc. They take up a lot of unnecessary disk space and make for a bad user experience on my website (similar images in the gallery).
Those images are indexed in a database, so I can find them by conditions such as product or category. I need a way to mark these similar images in the database and remove them.
What I have done:
I found a library named pHash that can calculate the similarity of two images, so I can use it to compare images one by one. But this way it takes a lot of time to find those images, and I don't know how to make the process faster.
Any ideas?

Use pHash to calculate the perceptual hash of all your images (not of the crossproduct of each combination),
then sort that hash (while keeping the reference to the images),
then define a critical value for the difference between two perceptual hashes, below which you consider that "the pictures are equivalent",
then replace references to equivalent pictures with the reference to the one picture you want to keep.
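
A minimal sketch of the sort-and-threshold steps, assuming the perceptual hash of every image has already been computed (for example with pHash) and is available as a Python integer. The function names, the sample hashes and the threshold are made up for illustration and are not part of pHash. Note the shortcut it takes: each hash is only compared with the hash kept for the current run in sorted order, so two near-duplicates whose hashes differ in a high-order bit do not end up adjacent and are missed.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def map_to_kept(hashes, threshold):
    """Map every image id to the id of the image that is kept in its place.

    hashes: dict of image id -> integer perceptual hash (assumed precomputed).
    threshold: largest Hamming distance still treated as "equivalent".
    """
    ordered = sorted(hashes.items(), key=lambda item: item[1])
    kept = {}
    current_id, current_hash = None, None
    for image_id, h in ordered:
        if current_id is not None and hamming(h, current_hash) <= threshold:
            kept[image_id] = current_id  # duplicate of the current run
        else:
            current_id, current_hash = image_id, h  # start a new run
            kept[image_id] = image_id
    return kept


example = {"a": 0b11110000, "b": 0b11110001, "c": 0b00001111}
print(map_to_kept(example, threshold=1))
```

The resulting mapping can then be used to rewrite the references in the database before the duplicate files are deleted.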

You're right, a naive algorithm would be O(n^2) because you're doing a pairwise comparison across all of your n-sized dataset.
There is a technique called blocking, an implementation of which is canopy clustering, that gets around the full pairwise comparison by partitioning the dataset into a set of 'blocks' of potentially similar records, so that comparisons only happen within a block.
You can cluster your images by extracting and sorting on a feature vector (which I'm not sure how to do on images).
Then, define a window of comparison, w, such that w < n.
Then apply a technique, called the sorted neighborhood method, which moves a window of fixed size w sequentially over the sorted records. Each image within the window is then paired with its "neighbor" and included in the candidate record pair list.
This basically reduces the comparison complexity to O(w * n), resulting in a linear algorithm with a constant w.
After you've performed the comparisons, you should take the transitive closure over matching pairs.
Your resulting pairs are now what would be considered similar images.
Note, this algorithm is embarrassingly parallel.
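
The sorted neighborhood method and the transitive closure step could be sketched as follows. This is only an illustration under assumptions: the sort key here is a plain number per record, `is_match` is a hypothetical comparison function supplied by the caller, and the closure is computed with a small union-find structure.

```python
def sorted_neighborhood_pairs(keys, w):
    """Yield index pairs (i, j) whose sorted positions are fewer than w apart."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + w]:
            yield i, j


def transitive_closure(n, matches):
    """Group n records into clusters using union-find over matching pairs."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def cluster_similar(keys, w, is_match):
    """O(w * n) comparisons instead of O(n^2), followed by the closure."""
    candidates = sorted_neighborhood_pairs(keys, w)
    matches = [(i, j) for i, j in candidates if is_match(keys[i], keys[j])]
    return transitive_closure(len(keys), matches)


keys = [10, 52, 11, 50, 12, 90]
print(cluster_similar(keys, w=3, is_match=lambda a, b: abs(a - b) <= 1))
```

Records 0, 2 and 4 end up in one cluster even though 10 and 12 do not match directly; that is the effect of the transitive closure.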

Related

Own fast Gamma Index implementation

My friends and I are writing our own implementation of the Gamma Index algorithm. It should compute the result within 1 s for standard-size 2D pictures (512 x 512), though it could also handle 3D pictures, and it should be portable and easy to install and maintain.
The Gamma Index, in case you haven't come across this topic, is a method for comparing pictures. As input we provide two pictures (reference and target); every picture consists of points distributed over a regular fine grid; every point has a location and a value. As output we receive a picture of Gamma Index values. For each point of the target picture we calculate some function (called gamma) against every point from the reference picture (in the original version) or against the points from the reference picture that are closest to the one from the target picture (in the version that is usually used in Gamma Index calculation software). The Gamma Index for a given target point is the minimum of the gamma function values calculated for it.
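
For readers who want something concrete, here is a rough NumPy sketch of the local-window variant described above. It assumes the usual form of the gamma function from the paper linked at the end of this question: the squared distance between the two points divided by a squared distance criterion, plus the squared value difference divided by a squared difference criterion, all under a square root. The parameter names (`spacing`, `dta`, `dd`, `search_radius`) are chosen for this sketch, and the square search window is an assumption standing in for "the closest reference points".

```python
import numpy as np


def gamma_index(reference, target, spacing, dta, dd, search_radius):
    """Brute-force gamma map on a regular 2D grid, restricted to a local window.

    spacing and dta share one length unit; dd is in the units of the values.
    search_radius is the half-width of the search window in grid points.
    Both pictures must have the same shape.
    """
    reference = np.asarray(reference, dtype=float)
    target = np.asarray(target, dtype=float)
    rows, cols = target.shape
    r = search_radius
    # Pad with +inf so that shifts past the border never win the minimum.
    padded = np.pad(reference, r, mode="constant", constant_values=np.inf)
    gamma_sq = np.full(target.shape, np.inf)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = padded[r + dy : r + dy + rows, r + dx : r + dx + cols]
            dist_sq = (dy * dy + dx * dx) * spacing**2 / dta**2
            diff_sq = (shifted - target) ** 2 / dd**2
            gamma_sq = np.minimum(gamma_sq, dist_sq + diff_sq)
    return np.sqrt(gamma_sq)
```

Each of the (2r+1)^2 shifts is one vectorized pass over the whole picture, so the cost grows with the window area rather than with the square of the number of points.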
So far we have tried the following ideas, with these results:
use a GPU - the calculation time decreased 10 times. The problem is that it's fairly difficult to install on machines with a non-NVIDIA graphics card
use a supercomputer or cluster - the problem is the maintenance of this solution. Plus, every picture has to be encrypted for travel through the network due to data sensitivity
iterate over points ordered by their distance to the target point, with an extra stop criterion - this way we got 15 seconds in the best case (and the result is actually not perfectly precise)
Currently we are writing in Python because of NumPy's awesome optimizations for matrix calculations, but we are open to other languages too.
Do you have any ideas how we can accelerate our algorithm(s) in order to meet these objectives? Do you think this level of performance is achievable?
Some more information about GI for anyone interested:
http://lcr.uerj.br/Manual_ABFM/A%20technique%20for%20the%20quantitative%20evaluation%20of%20dose%20distributions.pdf

Algorithm for finding similar images using an index

There are some surprisingly good image comparison tools which find similar images even if they are not exactly the same (e.g. changes in size, wallpaper, brightness/contrast). I have some example applications here:
Unique Filer 1.4 (shareware): https://web.archive.org/web/20010309014927/http://uniquefiler.com/
Fast Duplicate File Finder (Freeware): http://www.mindgems.com/products/Fast-Duplicate-File-Finder/Fast-Duplicate-File-Finder-About.htm
Visual similarity duplicate image finder (payware): http://www.mindgems.com/products/VS-Duplicate-Image-Finder/VSDIF-About.htm
Duplicate Checker (payware): http://www.duplicatechecker.com/
I only tried the first one, but all of them are developed for Windows and none is open source. Unique Filer was released in 2000 and the homepage seems to have disappeared. It was surprisingly fast (even on computers from that year) because it used an index: comparing some 10,000 images using the index took only a few seconds (and updating the index was a scalable process).
Since this algorithm has existed in a very effective form for at least 15 years, I assume it is well documented and possibly already implemented as an open source library. Does anyone know more about which algorithm or image detection theory was used to implement these applications? Maybe there is even an open source implementation of it available?
I already checked the question Algorithm for finding similar images, but all of its answers solve the problem by comparing one image to another. For 1000+ images this results in on the order of 1000^2 comparison operations, which is just not what I'm looking for.
The problem you are describing is more generally called Nearest Neighbor Search. Since you are asking for high efficiency on large datasets, Approximate Nearest Neighbor Search is what you are after.
An efficient technique for this is Locality-Sensitive Hashing (LSH), for which these slides give a great overview. Its basic idea is the use of hashing functions which project all data to a low-dimensional space, with the constraint that the hash of similar data collides with a high probability and dissimilar data collides with low probability. These probabilities are parameters to the algorithm, with which the trade-off between accuracy and efficiency can be changed.
LSHKIT is an open-source implementation of LSH.
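
To make the idea concrete, here is a small sketch of one classic LSH family, bit sampling for Hamming distance. It does not use LSHKIT; it assumes each image is already represented by a 64-bit integer fingerprint, and all names and parameter values (`n_tables`, `bits_per_key`) are illustrative. Each table keys the fingerprints by a random subset of their bits, so fingerprints that differ in only a few bits are likely to share a bucket in at least one table.

```python
import random
from collections import defaultdict


def build_lsh_index(hashes, n_bits=64, n_tables=8, bits_per_key=12, seed=0):
    """Build n_tables hash tables, each keyed by a random subset of bits."""
    rng = random.Random(seed)
    masks = [rng.sample(range(n_bits), bits_per_key) for _ in range(n_tables)]
    tables = [defaultdict(list) for _ in range(n_tables)]
    for item_id, h in hashes.items():
        for positions, table in zip(masks, tables):
            key = tuple((h >> p) & 1 for p in positions)
            table[key].append(item_id)
    return masks, tables


def query(h, masks, tables):
    """Return ids that share a bucket with h in at least one table."""
    candidates = set()
    for positions, table in zip(masks, tables):
        key = tuple((h >> p) & 1 for p in positions)
        candidates.update(table.get(key, ()))
    return candidates


fingerprints = {
    "a": 0xFFFF0000FFFF0000,
    "b": 0xFFFF0000FFFF0000,  # exact duplicate of "a"
    "c": 0x0000FFFF0000FFFF,  # every bit differs from "a"
}
masks, tables = build_lsh_index(fingerprints)
print(query(fingerprints["a"], masks, tables))
```

More tables raise the chance of finding a true neighbor, and more bits per key lower the number of false candidates; the candidates still need an exact distance check afterwards.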
Meanwhile, I analyzed the algorithm of UniqueFiler:
size reduction
First, it reduces all images to 10x10 pixel grayscale images (likely without using interpolation)
rotation
Probably based on the brightness of the 4 quadrants, some rotation is done (this step is dangerous because it sometimes 'overlooks' similarities if images are too symmetric)
range reduction
The image brightness range is fully extended (brightest -> white, darkest -> black) and then reduced to 2 bit (4 values) per pixel
database
The values get stored as arrays of 100 bytes per image (plus file metadata)
comparison
... is done one-by-one (two nested loops over the whole database plus a third for the 100 bytes). Today, we would probably index the sorted sums of all 4 quadrants for a fast pre-selection of similar candidates.
matcher
The comparison is done byte by byte: the difference between each pair of bytes is weighted, but by less than its square. The sum of these 100 results is the final difference between two images.
I have more detailed information at home. If I find the time, I will add it to this answer. I found this out after discovering that the database format is actually a gzipped file without header, containing fixed-size records per image.
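
A rough sketch of the size reduction, range reduction and matcher steps, written from the description above rather than from UniqueFiler itself. It assumes the image is already a grayscale 2D list of integer brightness values, skips the rotation step, and uses an exponent of 1.5 as a guess at "weighted but less than the square"; the function names are made up.

```python
def signature(gray, size=10):
    """Reduce a grayscale image (2D list of ints) to size*size 2-bit values."""
    height, width = len(gray), len(gray[0])
    # Size reduction: pick one source pixel per cell, no interpolation.
    small = [
        gray[y * height // size][x * width // size]
        for y in range(size)
        for x in range(size)
    ]
    # Range reduction: stretch to the full range, then quantize to 4 levels.
    lo, hi = min(small), max(small)
    span = (hi - lo) or 1  # flat images map to level 0 everywhere
    return bytes(min(3, (v - lo) * 4 // span) for v in small)


def difference(sig_a, sig_b, exponent=1.5):
    """Sum of per-byte differences, weighted by an assumed exponent."""
    return sum(abs(a - b) ** exponent for a, b in zip(sig_a, sig_b))


# The same two-tone picture at two sizes, the larger one also brighter.
small_img = [[0] * 10 + [200] * 10 for _ in range(20)]
large_img = [[50] * 20 + [250] * 20 for _ in range(40)]
print(difference(signature(small_img), signature(large_img)))
```

Because of the range stretch, the resized and brightened copy produces the same 100-byte signature as the original, which is what makes the later comparison cheap.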

What happens in Hopscotch Hash Tables when there are more than sizeof(Neighborhood) actual hash collisions?

Relevant link: http://en.wikipedia.org/wiki/Hopscotch_hashing
Hopscotch hash tables seem great, but I haven't found an answer to this question in the literature: what happens if my neighborhood size is N and (due to malfeasance or extremely bad luck) I insert N+1 elements which all hash to the same exact value?
The original article says that the table needs to be resized:
Finally, notice that if more than a constant number of items are hashed by h into
a given bucket, the table needs to be resized. Luckily, as we show, for a universal
hash function h, the probability of this type of resize happening given H = 32 is
1/32!.
There are two cases where a hopscotch hash table needs to be resized:
you have H collisions for the given bucket
the load factor is too high to find a free bucket. In practice, you should set an upper limit on the search for a free bucket.
Given a universal hash function, you only have a 1/32! chance of getting into case #1. In other words, since 32! is roughly 2.6 * 10^35 (about 2^118), you would expect on the order of 10^35 insertions before a single resize due to collisions.
Case #2 is the more common reason for a resize in practice. You could refer to some quadratic probing implementations to see how they decide to resize [C# hashmap and Google sparse hashmap]; there is no real implementation for linear probing due to its clustering drawback, i.e. it can't guarantee constant lookup.
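
To get a feeling for how small the 1/32! figure from the quoted article is, the arithmetic can be checked directly. This only evaluates the number; the variable names are mine.

```python
import math

H = 32  # neighborhood size used in the quoted article
orderings = math.factorial(H)

print(orderings)             # exact value of 32!
print(1 / orderings)         # the quoted probability, roughly 3.8e-36
print(math.log2(orderings))  # roughly 117.7, i.e. 32! is close to 2^118
```

So with a well-behaved hash function, a resize caused purely by collisions in one bucket is practically never observed; an adversary who can choose keys that all hash to the same value is a different matter and does force the resize.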

Metric for SURF

I'm searching for a usable metric for SURF: something like how well one image matches another on a scale of, say, 0 to 1, where 0 means no similarities and 1 means the same image.
SURF provides the following data:
interest points (and their descriptors) in query image (set Q)
interest points (and their descriptors) in target image (set T)
using a nearest neighbor algorithm, pairs can be created from the two sets above
I have tried a few things so far, but nothing seemed to work too well:
metric using the size of the different sets: d = N / min(size(Q), size(T)) where N is the number of matched interest points. This gives pretty similar images a pretty low rating, e.g. 0.32 even when 70 interest points were matched from about 600 in Q and 200 in T. I think 70 is a really good result. I was thinking about using some logarithmic scaling so only really low numbers would get low results, but can't seem to find the right equation. With d = log(9*d0+1) I get a result of 0.59, which is pretty good, but still, it kind of destroys the power of SURF.
metric using the distances within pairs: I did something like finding the K best matches and adding up their distances. The smaller the distance, the more similar the two images are. The problem with this is that I don't know the maximum and minimum values of an interest point descriptor element, from which the distance is calculated, so I can only rank results relative to each other (which of many inputs is the best). As I said, I would like the metric to lie exactly between 0 and 1. I need this to compare SURF to other image metrics.
The biggest problem with these two is that each excludes the other. One does not take into account the number of matches, the other the distance between matches. I'm lost.
EDIT: For the first one, an equation of log(x*10^k)/k where k is 3 or 4 gives a nice result most of the time. The min is not good: it can make d bigger than 1 in some rare cases, but without it the small results are back.
You can easily create a metric that is the weighted sum of both metrics. Use machine learning techniques to learn the appropriate weights.
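
As a starting point, the two metrics from the question and their weighted sum might look like this. It is a sketch under assumptions: the logarithm is taken as base 10 (which reproduces the 0.59 from the question), the distance-based score is assumed to be already scaled to [0, 1] with 1 meaning best, and the weight of 0.5 is only a placeholder for a value that would be learned from labelled image pairs.

```python
import math


def match_ratio(n_matches, size_q, size_t):
    """d0 = N / min(size(Q), size(T)) from the question."""
    return n_matches / min(size_q, size_t)


def log_scaled(d0):
    """Logarithmic rescaling log10(9 * d0 + 1); maps 0 to 0 and 1 to 1."""
    return math.log10(9 * d0 + 1)


def combined_score(ratio_score, distance_score, weight=0.5):
    """Weighted sum of both metrics; both inputs are expected in [0, 1]."""
    return weight * ratio_score + (1 - weight) * distance_score


print(round(log_scaled(0.32), 2))
print(combined_score(log_scaled(0.32), 0.8))
```

With both inputs in [0, 1] and a weight in [0, 1], the combined score also stays in [0, 1], which is the property asked for in the question.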
What you're describing is closely related to the field of Content-Based Image Retrieval, which is a very rich and diverse field. Googling that will get you lots of hits. While SURF is an excellent general-purpose low-to-mid-level feature detector, it is far from sufficient. SURF and SIFT (which SURF was derived from) are great at duplicate or near-duplicate detection but not that great at capturing perceptual similarity.
The best performing CBIR systems usually utilize an ensemble of features optimally combined via some training set. Some interesting detectors to try include GIST (a fast and cheap detector best used for detecting man-made vs. natural environments) and Object Bank (a histogram-based detector itself made of hundreds of object detector outputs).

Graph plotting: only keeping most relevant data

In order to save bandwidth, and so as not to have to generate pictures/graphs ourselves, I plan on using Google's charting API:
http://code.google.com/apis/chart/
which works by simply issuing a (potentially long) GET (or a POST), after which Google generates and serves the graph itself.
As of now I've got graphs made of about two thousand entries and I'd like to trim this down to some arbitrary number of entries (e.g. by keeping only 50% of the original entries, or 10% of the original entries).
How can I decide which entries I should keep so as to have my new graph the closest to the original graph?
Is this some kind of curve-fitting problem?
Note that I know that I can POST to Google's chart API with up to 16K of data and this may be enough for my needs, but I'm still curious.
The flot-downsample plugin for the Flot JavaScript graphing library could do what you are looking for, up to a point.
The purpose is to try to retain the visual characteristics of the original line using considerably fewer data points.
The research behind this algorithm is documented in the author's thesis.
Note that it doesn't work for every kind of series, and in my experience it won't give meaningful results when you want a downsampling factor beyond 10.
The problem is that it cuts the series into windows of equal size and then keeps one point per window. Since you may have denser data in some windows than in others, the result is not necessarily optimal. But it's efficient (runs in linear time).
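
For illustration, here is a sketch of a "one point per equal-sized window" selection in Python. It picks, in each window, the point that forms the largest triangle with the previously kept point and the average of the next window, which is my understanding of the approach the plugin is based on; it is not the plugin's code, and the function name is made up.

```python
def largest_triangle_downsample(points, n_out):
    """Keep n_out of the (x, y) points: the first, the last, one per window."""
    n = len(points)
    if n_out >= n or n_out < 3:
        return list(points)
    every = (n - 2) / (n_out - 2)  # window width, excluding both end points
    sampled = [points[0]]
    a = 0  # index of the previously kept point
    for i in range(n_out - 2):
        # Average of the next window (for the last window: the final point).
        nxt_start = int((i + 1) * every) + 1
        nxt_end = min(int((i + 2) * every) + 1, n)
        nxt = points[nxt_start:nxt_end]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)
        start = int(i * every) + 1
        end = int((i + 1) * every) + 1
        ax, ay = points[a]
        best_area, best_j = -1.0, start
        for j in range(start, end):
            x, y = points[j]
            # Twice the triangle area; the factor does not affect the choice.
            area = abs((ax - avg_x) * (y - ay) - (ax - x) * (avg_y - ay))
            if area > best_area:
                best_area, best_j = area, j
        sampled.append(points[best_j])
        a = best_j
    sampled.append(points[-1])
    return sampled


series = [(i, 10 if i == 50 else 0) for i in range(100)]
print(largest_triangle_downsample(series, 10))
```

In this flat series with a single spike, the spike at x = 50 survives the reduction from 100 to 10 points, which is the kind of visual feature a plain "every Nth point" selection would usually drop.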
What you are looking to do is known as downsampling or decimation. Essentially you filter the data and then drop N - 1 out of every N samples (decimation or down-sampling by factor of N). A crude filter is just taking a local moving average. E.g. if you want to decimate by a factor of N = 10 then replace every 10 points by the average of those 10 points.
Note that with the above scheme you may lose some high frequency data from your plot (since you are effectively low pass filtering the data) - if it's important to see short term variability then an alternative approach is to plot every N points as a single vertical bar which represents the range (i.e. min..max) of those N points.
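
Both variants are only a few lines each. The sketch below assumes the data is a plain list of y values at regular intervals; the function names are made up for the example.

```python
def decimate_mean(values, n):
    """Replace every run of n samples by its average (crude low-pass filter)."""
    chunks = [values[i : i + n] for i in range(0, len(values), n)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


def decimate_min_max(values, n):
    """Replace every run of n samples by its (min, max) range."""
    chunks = [values[i : i + n] for i in range(0, len(values), n)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


data = [1, 2, 3, 4, 5, 6, 100]
print(decimate_mean(data, 3))
print(decimate_min_max(data, 3))
```

The last run may be shorter than n when the length is not a multiple of n; here it is kept as its own, smaller group rather than dropped.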
Graph (time series data) summarization is a very hard problem. It's like deciding, in a text, what is the "relevant" part to keep in an automatic summarization of it. I suggest you use one of the most respected libraries for finding "patterns of interest" in time series data by Eamonn Keogh

Resources