How does the image recognition work in Google Shopper? - algorithm

I am amazed at how well (and fast) this software works. I hovered my phone's camera over a small area of a book cover in dim light and it only took a couple of seconds for Google Shopper to identify it. It's almost magical. Does anyone know how it works?

I have no idea how Google Shopper actually works. But it could work like this:
Take your image and convert to edges (using an edge filter, preserving color information).
Find points where edges intersect and make a list of them (including colors and perhaps angles of intersecting edges).
Convert to a rotation-independent metric by selecting pairs of high-contrast points and measuring distance between them. Now the book cover is represented as a bunch of numbers: (edgecolor1a,edgecolor1b,edgecolor2a,edgecolor2b,distance).
Pick pairs of the most notable distance values and ratio the distances.
Send this data as a query string to Google, where it finds the most similar vector (possibly with direct nearest-neighbor computation, or perhaps with an appropriately trained classifier--probably a support vector machine).
Google Shopper could also send the entire picture, at which point Google could use considerably more powerful processors to crunch on the image processing data, which means it could use more sophisticated preprocessing (I've chosen the steps above to be so easy as to be doable on smartphones).
Anyway, the general steps are very likely to be (1) extract scale and rotation-invariant features, (2) match that feature vector to a library of pre-computed features.

In any case, the Pattern Recognition/Machine Learning methods often are based on:
Extract features from the image that can be described as numbers. For instance, using edges (as Rex Kerr explained before), color, texture, etc. A set of numbers that describes or represents an image is called "feature vector" or sometimes "descriptor". After extracting the "feature vector" of an image it is possible to compare images using a distance or (dis)similarity function.
Extract text from the image. There are several method to do it, often based on OCR (optical character recognition)
Perform a search on a database using the features and the text in order to find the closest related product.
It is also likely that the image is also cuted into subimages, since the algorithm often finds a specific logo on the image.
In my opinion, the image features are send to different pattern classifiers (algorithms that are able to predict a "class" using as input a feature vector), in order to recognize logos and, afterwards, the product itself.
Using this approach, it can be: local, remote or mixed. If local, all processing is carried out on the device, and just the "feature vector" and "text" are sent to a server where the database is. If remote, the whole image goes to the server. If mixed (I think this is the most probable one), partially executed locally and partially at the server.
Another interesting software is the Google Googles, that uses CBIR (content-based image retrieval) in order to search for other images that are related to the picture taken by the smartphone. It is related to the problem that is addressed by Shopper.

Pattern Recognition.

Related

Resources & approaches for detecting content-aware fill

How would one approach the problem of telling apart images (and highlighting areas) that have been content-aware filled?
Positive class image (original image without edits):
Negative class image (Tampered image with insignia removed):
I understand that this can be quite meaningfully tackled by training a segmentation CNN network. Is there a working solution that I haven't been able to find or an approach that in relatively simple to implement.
These images are samples from a codebase that throws Runtime errors.
I am no expert on this, but as far as I know structural image manipulation algorithms like Photoshop's content aware fill (which is based on the PatchMatch Algorithm) are dividing an image up into so-called patches defined by a common intensity statistic. The manipulation is then carried out by replacing the target patch by a source patch with statistics matching the adjacent patches as well as possible (minimizing some distance metric).
Therefore, in a forged image you would find cloned regions, e.g. patches with identical or almost identical intensity distributions. To detect this, my naive approach would be to split the image up in many small subimages and use each one as a filter mask to be cross-correlated with the full image. I would expect authentic images to show only one more or less clear global maximum. A forged image would potentially produce two or more "global" maxima having a similar height.
But there seems to be a whole bunch of people having released some open source code for forgery detection, most probably using smarter approaches then mine -> Github topic: Fogery detection.

Object detection + segmentation

I 'm trying to find an efficient way of acceptable complexity to
detect an object in an image so I can isolate it from its surroundings
segment that object to its sub-parts and label them so I can then fetch them at will
It's been 3 weeks since I entered the image processing world and I've read about so many algorithms (sift, snakes, more snakes, fourier-related, etc.), and heuristics that I don't know where to start and which one is "best" for what I'm trying to achieve. Having in mind that the image dataset in interest is a pretty large one, I don't even know if I should use some algorithm implemented in OpenCV or if I should implement one my own.
Summarize:
Which methodology should I focus on? Why?
Should I use OpenCV for that kind of stuff or is there some other 'better' alternative?
Thank you in advance.
EDIT -- More info regarding the datasets
Each dataset consists of 80K images of products sharing the same
concept e.g. t-shirts, watches, shoes
size
orientation (90% of them)
background (95% of them)
All pictures in each datasets look almost identical apart from the product itself, apparently. To make things a little more clear, let's consider only the 'watch dataset':
All the pictures in the set look almost exactly like this:
(again, apart form the watch itself). I want to extract the strap and the dial. The thing is that there are lots of different watch styles and therefore shapes. From what I've read so far, I think I need a template algorithm that allows bending and stretching so as to be able to match straps and dials of different styles.
Instead of creating three distinct templates (upper part of strap, lower part of strap, dial), it would be reasonable to create only one and segment it into 3 parts. That way, I would be confident enough that each part was detected with respect to each other as intended to e.g. the dial would not be detected below the lower part of the strap.
From all the algorithms/methodologies I've encountered, active shape|appearance model seem to be the most promising ones. Unfortunately, I haven't managed to find a descent implementation and I'm not confident enough that that's the best approach so as to go ahead and write one myself.
If anyone could point out what I should be really looking for (algorithm/heuristic/library/etc.), I would be more than grateful. If again you think my description was a bit vague, feel free to ask for a more detailed one.
From what you've said, here are a few things that pop up at first glance:
Simplest thing to do it binarize the image and do Connected Components using OpenCV or CvBlob library. For simple images with non-complex background this usually yeilds objects
HOwever, looking at your sample image, texture-based segmentation techniques may work better - the watch dial, the straps and the background are wisely variant in texture/roughness, and this could be an ideal way to separate them.
The roughness of a portion can be easily found by the Eigen transform (explained a bit on SO, check the link to the research paper provided there), then the Mean Shift filter can be applied on the output of the Eigen transform. This will give regions clearly separated according to texture. Both the pyramidal Mean Shift and finding eigenvalues by SVD are implemented in OpenCV, so unless you can optimize your own code its better (and easier) to use inbuilt functions (if present) as far as speed and efficiency is concerned.
I think I would turn the problem around. Instead of hunting for the dial, I would use a set of robust features from the watch to 'stitch' the target image onto a template. The first watch has a set of squares in the dial that are white, the second watch has a number of white circles. I would per type of watch:
Segment out the squares or circles in the dial. Segmentation steps can be tricky as they are usually both scale and light dependent
Estimate the centers or corners of the above found feature areas. These are the new feature points.
Use the Hungarian algorithm to match features between the template watch and the target watch. Alternatively, one can take the surroundings of each feature point in the original image and match these using cross correlation
Use matching features between the template and the target to estimate scaling, rotation and translation
Stitch the image
As the image is now in a known form, one can extract the regions simply via pre set coordinates

image feature identification

I am looking for a solution to do the following:
( the focus of my question is step 2. )
a picture of a house including the front yard
extract information from the picture like the dimensions and location of the house, trees, sidewalk, and car. Also, the textures and colors of the house, cars, trees, and sidewalk.
use extracted information to generate a model
How can I extract that information?
You could also consult Tatiana Jaworska research on this. As I understood, this details at least 1 new algorithm to feature extraction (targeted at roofs, doors, ...) by colour (RGB). More intriguing, the last publication also uses parameterized objects to be identified in the house images... that must might be a really good starting point for what you're trying to do.
link to her publications:
http://www.springerlink.com/content/w518j70542780r34/
http://portal.acm.org/citation.cfm?id=1578785
http://www.ibspan.waw.pl/~jaworska/TJ_BOS2010.pdf
Yes. You can extract these information from a picture.
1. You just identify these objects in a picture using some detection algorithms.
2. Measure these objects dimensions and generate a model using extracted information.
well actually your desired goal is not so easy to achieve. First of all you'll need a good way to figure what what is what and what is where on your image. And there simply is no easy "algorithm" for detecting houses/cars/whatsoever on an image. There are ways to segment different objects (like cars) from an image, but those don't work generally. Especially on houses this would be hard since each house looks different and it's hard to find one solid measurement for "this is house and this is not"...
Am I assuming it right that you are trying to simply photograph a house (with front yard) and build a texturized 3D-model out of it? This is not going to work since you need several photos of the house to get positions of walls/corners and everything in 3D space (There are approaches that try a mesh reconstruction with one image only but they lack of depth information and results are fairly poor). So if you would like to create 3D-mdoels you will need several photos of different angles of the house.
There are several different approaches that use this kind of technique to reconstruct real world objects to triangle-meshes.
Basically they work after the principle:
Try to find points in images of different viewpoint which are the same on an object. Considering you are photographing a house this could be salient structures likes corners of windows/doors or corners or edges on the walls/roof/...
Knowing where one and the same point of your house is in several different photos and knowing the position of the camera of both photos you can reconstruct this point in 3D-space.
Doing this for a lot of equal points will "empower" you to reconstruct the shape of your house as a 3D-model by triangulating the points.
Taking parts of the image as textures and mapping them on the generated model would work as well since you know where what is.
You should have a look at these papers:
http://www.graphicon.ru/1999/3D%20Reconstruction/Valiev.pdf
http://people.csail.mit.edu/wojciech/pubs/LabeledRec.pdf
http://people.csail.mit.edu/sparis/publi/2006/oceans/Paris_06_3D_Reconstruction.ppt
The second paper even has an example of doing exactly what you try to achieve, namely reconstruct a textured 3D-model of a house photographed from different angles.
The third link is a powerpoint presentation that shows how the reconstruction works and shows the drawbacks there are.
So you should get familiar with these papers to see what problems you are up to... If you then want to try this on your own have a look at OpenCV. This library provides some methods for feature extraction in images. You then can try to find salient points in each image and try to match them.
Good luck on your project... If you have problems, please keep asking!
I suggest to look at this blog
https://jwork.org/main/node/35
that shows how to identify certain features on images using a convolutional neural network. This particular blog discusses how to identify human faces on images from a large set of random images. You can adjust this example to train neural network using some other images. Note that even in the case of human faces, the identification rate is about 85%, therefore, more complex objects can be even harder to identify

What algorithm could be used to identify if images are the "same" or similar, regardless of size?

TinEye, the "reverse image search engine", allows you to upload/link to an image and it is able to search through the billion images it has crawled and it will return links to images it has found that are the same image.
However, it isn't a naive checksum or anything related to that. It is often able to find both images of a higher resolution and lower resolution and larger and smaller size than the original image you supply. This is a good use for the service because I often find an image and want the highest resolution version of it possible.
Not only that, but I've had it find images of the same image set, where the people in the image are in a different position but the background largely stays the same.
What type of algorithm could TinEye be using that would allow it to compare an image with others of various sizes and compression ratios and yet still accurately figure out that they are the "same" image or set?
These algorithms are usually fingerprint-based. Fingerprint is a reasonably small data structure, something like a long hash code. However, the goals of fingerprint function are opposite to the goals of hash function. A good hash function should generate very different codes for very similar (but not equal) objects. The fingerprint function should, on contrary, generate the same fingerprint for similar images.
Just to give you an example, this is a (not particularly good) fingerprint function: resize the picture to 32x32 square, normalize and and quantize the colors, reducing the number of colors to something like 256. Then, you have 1024-byte fingerprint for the image. Just keep a table of fingerprint => [list of image URLs]. When you need to look images similar to a given image, just calculate its fingerprint value and find the corresponding image list. Easy.
What is not easy - to be useful in practice, the fingerprint function needs to be robust against crops, affine transforms, contrast changes, etc. Construction of good fingerprint functions is a separate research topic. Quite often they are hand-tuned and uses a lot of heuristics (i.e. use the knowledge about typical photo contents, about image format / additional data in EXIF, etc.)
Another variation is to use more than one fingerprint function, try to apply each of them and combine the results. Actually, it's similar to finding similar texts. Just instead of "bag of words" the image similarity search uses a "bag of fingerprints" and finds how many elements from one bag are the same as elements from another bag. How to make this search efficient is another topic.
Now, regarding the articles/papers. I couldn't find a good article that would give an overview of different methods. Most of the public articles I know discuss specific improvement to specific methods. I could recommend to check these:
"Content Fingerprinting Using Wavelets". This article is about audio fingerprinting using wavelets, but the same method can be adapted for image fingerprinting.
PERMUTATION GROUPING:
INTELLIGENT HASH FUNCTION DESIGN FOR AUDIO & IMAGE RETRIEVAL. Info on Locality-Sensitive Hashes.
Bundling Features for Large Scale Partial-Duplicate Web Image Search. A very good article, talks about SIFT and bundling features for efficiency. It also has a nice bibliography at the end
The creator of the FotoForensics site posted this blog post on this topic, it was very useful to me, and showed algorithms that may be good enough for you and that require a lot less work than wavelets and feature extraction.
http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
aHash (also called Average Hash or Mean Hash). This approach crushes the image into a grayscale 8x8 image and sets the 64 bits in
the hash based on whether the pixel's value is greater than the
average color for the image.
pHash (also called "Perceptive Hash"). This algorithm is similar to aHash but use a discrete cosine transform (DCT) and compares based
on frequencies rather than color values.
dHash Like aHash and pHash, dHash is pretty simple to implement and is far more accurate than it has any right to be. As an
implementation, dHash is nearly identical to aHash but it performs
much better. While aHash focuses on average values and pHash evaluates
frequency patterns, dHash tracks gradients.
It's probably based on improvements of feature extraction algorithms, taking advantage of features which are scale invariant.
Take a look at
Feature extraction
SIFT, other site
or, if you are REALLY interested, you can shell out some 70 bucks (or at least look at the Google preview) for
Feature Extraction & Image Processing
http://tineye.com/faq#how
Based on this, Igor Krivokon's answer seems to be on the mark.
The Hough Transform is a very old feature extraction algorithm, that you mind find interesting. I doubt it's what tinyeye uses, but it's a good, simple starting place for learning about feature extraction.
There are also slides to a neat talk from some University of Toronto folks about their work at astrometry.net. They developed an algorithm for matching telescoping images of the night sky to locations in star catalogs in order to identify the features in the image. It's a more specific problem than what tinyeye tries to solve, but I'd expect that a lot of the basic ideas that they talk about are applicable to the more general problem.
Check out this blog post (not mine) for a very understandable description of a very understandable algorithm which seems to get good results for how simple it is. It basically partitions the respective pictures into a very coarse grid, sorts the grid by red:blue and green:blue ratios, and checks whether the sorts were the same. This naturally works for color images only.
The pros most likely get better results using vastly more advanced algorithms. As mentioned in the comments on that blog, a leading approach seems to be wavelets.
They may well be doing a Fourier Transform to characterize the complexity of the image, as well as a histogram to characterize the chromatic distribution, paired with a region categorization algorithm to assure that similarly complex and colored images don't get wrongly paired. Don't know if that's what they're using, but it seems like that would do the trick.
What about resizing the pictures to a standard small size and checking for SSIM or luma-only PSNR values? that's what I would do.

Computing the difference between images

Do you guys know of any algorithms that can be used to compute difference between images?
Take this webpage for example http://tineye.com/ You give it a link or upload an image and it finds similiar images. I doubt that it compares the image in question against all of them (or maybe it does).
By compute I mean like what the Levenshtein_distance or the Hamming distance is for strings.
By no means do I need to the correct answer for a project or anything, I just found the website and got very curious. I know digg pays for a similiar service for their website.
The very simplest measures are going to be RMS-error based approaches, for example:
Root Mean Square Deviation
Peak Signal to Noise Ratio
These probably gel with your notions of distance measures, but their results are really only meaningful if you've got two images that are very close already, like if you're looking at how well a particular compression scheme preserved the original image. Also, the same result from either comparison can mean a lot of different things, depending on what kind of artifacts there are (take a look at the paper I cite below for some example photos of RMS/PSNR can be misleading).
Beyond these, there's a whole field of research devoted to image similarity. I'm no expert, but here are a few pointers:
A lot of work has gone into approaches using dimensionality reduction (PCA, SVD, eigenvalue analysis, etc) to pick out the principal components of the image and compare them across different images.
Other approaches (particularly medical imaging) use segmentation techniques to pick out important parts of images, then they compare the images based on what's found
Still others have tried to devise similarity measures that get around some of the flaws of RMS error and PSNR. There was a pretty cool paper on the spatial domain structural similarity (SSIM) measure, which tries to mimic peoples' perceptions of image error instead of direct, mathematical notions of error. The same guys did an improved translation/rotation-invariant version using wavelet analysis in this paper on WSSIM.
It looks like TinEye uses feature vectors with values for lots of attributes to do their comparison. If you hunt around on their site, you eventually get to the Ideé Labs page, and their FAQ has some (but not too many) specifics on the algorithm:
Q: How does visual search work?
A: Idée’s visual search technology uses sophisticated algorithms to analyze hundreds of image attributes such as colour, shape, texture, luminosity, complexity, objects, and regions.These attributes form a compact digital signature that describes the appearance of each image, and these signatures are calculated by and indexed by our software. When performing a visual search, these signatures are quickly compared by our search engine to return visually similar results.
This is by no means exhaustive (it's just a handful of techniques I've encountered in the course of my own research), but if you google for technical papers or look through proceedings of recent conferences on image processing, you're bound to find more methods for this stuff. It's not a solved problem, but hopefully these pointers will give you an idea of what's involved.
One technique is to use color histograms. You can use machine learning algorithms to find similar images based on the repesentation you use. For example, the commonly used k-means algorithm. I have seen other solutions trying to analyze the vertical and horizontal lines in the image after using edge detection. Texture analysis is also used.
A recent paper clustered images from picasa web. You can also try the clustering algorithm that I am working on.
Consider using lossy wavelet compression and comparing the highest relevance elements of the images.
What TinEye does is a sort of hashing over the image or parts of it (see their FAQ). It's probably not a real hash function since they want similar "hashes" for similar (or nearly identical) images. But all they need to do is comparing that hash and probably substrings of it, to know whether the images are similar/identical or whether one is contained in another.
Heres an image similarity page, but its for polygons. You could convert your image into a finite number of polygons based on color and shape, and run these algorithm on each of them.
here is some code i wrote, 4 years ago in java yikes that does image comparisons using histograms. dont look at any part of it other than buildHistograms()
https://jpicsort.dev.java.net/source/browse/jpicsort/ImageComparator.java?rev=1.7&view=markup
maybe its helpful, atleast if you are using java
Correlation techniques will make a match jump out. If they're JPEGs you could compare the dominant coefficients for each 8x8 block and get a decent match. This isn't exactly correlation but it's based on a cosine transfore, so it's a first cousin.

Resources