Trainable "Spam-Filter" for Images - image

I remember a story about someone filtering images with a spam filter that he fed with training data.
I've now come to the point where I need exactly something like this.
I have a lot of different types of images (mainly of people, e.g. selfies, group pictures, portraits, ...), but I only want a certain type of them (e.g. only male subjects).
With the right algorithm and training data, I think it's possible to get to the point where I can pass an image to it and get back true or false depending on whether it matches my type.
I had a look at a few face/gender detection APIs, but none of them worked for me, which is why I want to try the spam-filter approach - it seems like a fun idea.
Here's what I need:
a trainable spam-filter algorithm/code sample/API
has to work offline
preferably for C# or Java
I've already spent a few hours trying different things and googling. Now I'm here, and I'd like to get your opinion on this problem and on whatever solution you think is appropriate.
Buddha

There is a simple image comparison algorithm that you can read about here: compareImages php class.
Basically the way it works is this:
it takes an image (a cropped image would be best), scales it down to an 8x8 pixel image, converts it to black-and-white / greyscale, and then calculates the mean (average) value of the pixels.
Then it goes over all 64 pixels of the scaled image and, for every pixel whose value is >= the mean, writes a "1", and for every pixel whose value is < the mean, writes a "0", resulting in a 64-bit "signature" of 0s and 1s.
This signature value is what identifies the image, and then you can save this signature value in some kind of a database, as your "learned" filter.
Then if an email arrives with some images, you can just crop them, scan them, produce a signature for each, and see if it matches any known signature in your database.
The good things about this algorithm are:
It is very fast and scalable (scaling an image down to 8x8 is fast, and scanning the pixels as described is fast too).
Because it converts the image to greyscale and scales it down, it can match the same image across colour variations and different sizes.
Because you use 64-bit signatures, it doesn't take a lot of space in your database either.
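If it helps, here is a rough sketch of that signature idea in Java (the question mentions C#/Java); it uses java.awt / javax.imageio image handling, and the class and method names are purely illustrative:
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class AverageHash {

    // Scale to 8x8, convert to greyscale, and pack one bit per pixel
    // (1 if the pixel is >= the mean, 0 otherwise) into a 64-bit long.
    public static long signature(BufferedImage source) {
        BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = small.createGraphics();
        g.drawImage(source, 0, 0, 8, 8, null);
        g.dispose();

        int[] grey = new int[64];
        int sum = 0;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                int rgb = small.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, gr = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                grey[y * 8 + x] = (r + gr + b) / 3;
                sum += grey[y * 8 + x];
            }
        }
        int mean = sum / 64;

        long sig = 0L;
        for (int i = 0; i < 64; i++) {
            sig <<= 1;
            if (grey[i] >= mean) sig |= 1L;
        }
        return sig;
    }

    // Hamming distance between two signatures; a small value (say, under 6)
    // suggests the same picture or a close variant of it.
    public static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
Load the images with ImageIO.read, store the signatures, and compare new images by Hamming distance rather than requiring an exact match, so minor recompression or resizing doesn't break the lookup.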
Hope this helps.

Related

Equalize contrast and brightness across multiple images

I have roughly 160 images for an experiment. Some of the images, however, have clearly different levels of brightness and contrast compared to others. For instance, I have something like the two pictures below:
I would like to equalize the two pictures in terms of brightness and contrast (probably find some level in the middle and not equate one image to another - though this could be okay if that makes things easier). Would anyone have any suggestions as to how to go about this? I'm not really familiar with image analysis in Matlab so please bear with my follow-up questions should they arise. There is a question for Equalizing luminance, brightness and contrast for a set of images already on here but the code doesn't make much sense to me (due to my lack of experience working with images in Matlab).
Currently, I use GIMP to manipulate the images, but it's time-consuming with 160 of them, and going by subjective eye judgment isn't very reliable anyway. Thank you!
You can use histeq to perform histogram specification, where the algorithm tries its best to make the target image match the distribution of intensities / histogram of a source image. This is also called histogram matching, and you can read up on it in my previous answer.
In effect, the distribution of intensities between the two images should hopefully be the same. If you want to take advantage of this using histeq, you can specify an additional parameter that specifies the target histogram. Therefore, the input image would try and match itself to the target histogram. Something like this would work assuming you have the images stored in im1 and im2:
out = histeq(im1, imhist(im2));
However, imhistmatch is the better option to use. You call it almost the same way you'd call histeq, except you don't have to manually compute the histogram - you just specify the image to match against:
out = imhistmatch(im1, im2);
Here's a running example using your two images. Note that I'll opt to use imhistmatch instead. I read the two images directly from StackOverflow, perform histogram matching so that the first image matches the intensity distribution of the second, and show all of the results in one window.
im1 = imread('http://i.stack.imgur.com/oaopV.png');
im2 = imread('http://i.stack.imgur.com/4fQPq.png');
out = imhistmatch(im1, im2);
figure;
subplot(1,3,1);
imshow(im1);
subplot(1,3,2);
imshow(im2);
subplot(1,3,3);
imshow(out);
This is what I get:
Note that the first image now more or less matches in distribution with the second image.
We can also flip it around and make the first image the source and we can try and match the second image to the first image. Just flip the two parameters with imhistmatch:
out = imhistmatch(im2, im1);
Repeating the above code to display the figure, I get this:
That looks a little more interesting. We can definitely see the shape of the second image's eyes, and some of the facial features are more pronounced.
As such, what you can finally do in the end is choose a good representative image that has the best brightness and contrast, then loop over each of the other images and call imhistmatch each time using this source image as the reference so that the other images will try and match their distribution of intensities to this source image. I can't really write code for this because I don't know how you are storing these images in MATLAB. If you share some of that code, I'd love to write more.

How to know if an images is similar to another (slightly different angle but same point of view)

I've checked methods like Phasher to get similar images. Basically you resize images to 8x8, convert them to greyscale, get the average pixel value, and create a binary hash by comparing each pixel against that average.
This method is very well explained here:
http://hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
Example working:
- image 1 of a computer on a table
- image 2, the same, but with a coin
This would work since, using the hash of a very reduced greyscale image, both of them will be almost the same, or even identical. So you can conclude they are similar when 90% or more of the pixels are the same (in the same place!).
My problem is in images that are taken from the same point of view but different angle, for example this ones:
In this case, the generated hash "fingerprints" are so shifted relative to each other that we cannot compare the hashes bit by bit; they will be very different.
The pixels are "similar", but they are not in the same place, since in this case there is more sky, and the houses "start" lower down than in the first one.
So the hash comparison concludes "they are different images".
Possible solution:
I was thinking about creating a larger hash for the first image, then taking 10 random "sub-hashes" from the second one, and checking whether the 10 sub-hashes appear "somewhere" in the first, bigger hash (i.e. whether a substring is contained in a bigger one).
The problem here, I think, is the CPU time when working with thousands of images, since you have to compare 1 image to 1000, and for each one compare 10 sub-hashes against a big one.
Other solutions ? ;-)
One option is to detect a set of "interesting" points for each image and store that alongside your hash. It's somewhat similar to the solution you suggested.
We want those points to be unlikely to vary between images like yours, which have shifts in perspective. These lecture slides give a good overview of how to find points like that with fairly straightforward linear algebra. I'm using Mathematica because it has built-in functions for a lot of this stuff. ImageKeypoints does what we want here.
After we have our interesting points, we need to find which ones match between the images we're comparing. If your images are very similar, like the ones in your examples, you could probably just take an 8x8 greyscale image around each interesting point and compare each one from the first image with those of the nearby interesting points in the other image. I think you could reuse your existing algorithm.
If you wanted to use a more advanced algorithm like SIFT you'd need to have a look at ImageKeypoint's properties like scale and orientation.
The ImageKeypoints documentation has this example you can use to get a small piece of the image for each interesting point (it uses the scale property instead of a fixed size):
MapThread[ImageTrim[img, {#1}, 2.5 #2] &,
 Transpose@
  ImageKeypoints[img, {"Position", "Scale"},
   "KeypointStrength" -> .001]]
Finding a certain number of matching points might be enough to say that the images are similar, but if not you can use something like RANSAC to figure out the transformation you need to align your hash images (the 8x8 images you're already able to generate) enough that your existing algorithm works.
I should point out that Mathematica has ImageCorrespondingPoints, which does all of this stuff (using ImageKeypoints) much better. But I don't know how you could have it cache the intermediate results so that scales for what you're trying to do. You might want to look into its ability to constrain matching points to a perspective transform, though.
Here's a plot of the matching points for your example images to give you an idea of what parts end up matching:
So you can precalculate the interesting points for your database of images, and the greyscale hashes for each point. You'll have to compare several hash images for each image in your database, rather than just two, but it will scale to within a constant factor of your current algorithm.
As an upper bound: if the hashes don't match, you can compare how many pixels of the 8x8 grid match. Maybe you can also try matching the colours like in a photo mosaic: Photo Mosaic Algorithm. How to create a mosaic photo given the basic image and a list of tiles?

Mysql Algorithm for Determining Closest Colour Match

I'm attempting to create a true mosaic application. At the moment I have one mosaic image, i.e. the one the mosaic is based on, and about 4000 images from my iPhoto library that act as the image library. I have already done my research and analysed the mosaic image. I've converted it into 64x64 slices, each of 8 pixels. I've calculated the average colour for each slice and ascertained the r, g, b and brightness (Luminance (perceived option 1) = 0.299*R + 0.587*G + 0.114*B) values. I have done the same for each of the image library photos.
The mosaic slices table looks like so.
slice_id, slice_image_id, slice_slice_id, slice_image_column, slice_image_row, slice_colour_hex, slice_rgb_red, slice_rgb_blue, slice_rgb_green, slice_rgb_brightness
The image library table looks like so.
upload_id, upload_file, upload_colour_hex, upload_rgb_red, upload_rgb_green, upload_rgb_blue, upload_rgb_brightness
So basically I'm reading the image slices from the slices table into PHP and then pulling out the appropriate images from the library table based on the colour hex codes. My trouble is that I've been on this too long (and have probably had too many energy drinks), so I'm not concentrating properly: I can't figure out how to pick the nearest colour neighbour when an exact hex match doesn't exist.
Any ideas on the perfect query?
NB: I know pulling out the slices one by one is not ideal; however, the mosaic is only rebuilt periodically, so a sudden burst in the MySQL load doesn't really bother me. That said, if there is a way to pull the images out all at once, that would be a massive bonus.
Update: brightness comparisons.
With brightness (source: buggedcom.co.uk)
Without brightness (source: buggedcom.co.uk)
One way to minimize the difference between the colours (in terms of their RGB components) is to minimize the sum of the squared differences of the individual components. That is, you're looking for the entry with the lowest
(targetRed - rowRed)^2 + (targetGreen - rowGreen)^2 + (targetBlue - rowBlue)^2
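As a sketch of what that minimization looks like in code (Java here, with a made-up LibraryImage type standing in for a row of the upload table):
import java.util.List;

class LibraryImage {
    String file;
    int red, green, blue;
}

class NearestColour {
    // Return the library entry whose average colour has the smallest
    // squared RGB distance to the target slice colour.
    static LibraryImage closest(int targetRed, int targetGreen, int targetBlue,
                                List<LibraryImage> library) {
        LibraryImage best = null;
        long bestDistance = Long.MAX_VALUE;
        for (LibraryImage candidate : library) {
            long dr = targetRed - candidate.red;
            long dg = targetGreen - candidate.green;
            long db = targetBlue - candidate.blue;
            long distance = dr * dr + dg * dg + db * db;
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }
}
In MySQL you can put the same expression directly into an ORDER BY ... LIMIT 1 over the upload table's RGB columns, which lets the database do the work in a single query per slice.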
I think that you may be better off using HSL instead of RGB as the color space. Formulas to compute HSL from RGB are available on the internet (and in the linked Wikipedia article); they may give you what you need to compute the best match.
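For reference, a sketch of the standard RGB-to-HSL conversion (in Java, following the usual textbook formulas rather than any particular library):
public class ColourSpaces {
    // Convert 8-bit RGB to HSL:
    // returns {hue in degrees [0, 360), saturation [0, 1], lightness [0, 1]}.
    public static double[] rgbToHsl(int red, int green, int blue) {
        double r = red / 255.0, g = green / 255.0, b = blue / 255.0;
        double max = Math.max(r, Math.max(g, b));
        double min = Math.min(r, Math.min(g, b));
        double l = (max + min) / 2.0;
        double h, s;
        if (max == min) {
            h = 0; // achromatic: hue is undefined, use 0
            s = 0;
        } else {
            double d = max - min;
            s = l > 0.5 ? d / (2.0 - max - min) : d / (max + min);
            if (max == r)      h = ((g - b) / d + (g < b ? 6 : 0)) * 60;
            else if (max == g) h = ((b - r) / d + 2) * 60;
            else               h = ((r - g) / d + 4) * 60;
        }
        return new double[] { h, s, l };
    }
}
Matching primarily on hue (with lightness as a tiebreaker) often looks more natural for mosaics than a raw RGB distance.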

Image fingerprint to compare similarity of many images

I need to create fingerprints of many images (about 100,000 existing, 1,000 new per day, RGB, JPEG, max size 800x800) to compare every image to every other image very fast. I can't use binary comparison methods because images which are only nearly identical should also be recognized.
Best would be an existing library, but also some hints to existing algorithms would help me a lot.
Normal hashing or CRC calculation algorithms do not work well with image data. The dimensional nature of the information must be taken into account.
If you need extremely robust fingerprinting, such that affine transformations (scaling, rotation, translation, flipping) are accounted for, you can use a Radon transformation on the image source to produce a normative mapping of the image data - store this with each image and then compare just the fingerprints. This is a complex algorithm and not for the faint of heart.
A few simple solutions are possible:
Create a luminosity histogram for the image as a fingerprint
Create scaled down versions of each image as a fingerprint
Combine technique (1) and (2) into a hybrid approach for improved comparison quality
A luminosity histogram (especially one that is separated into RGB components) is a reasonable fingerprint for an image - and it can be implemented quite efficiently. Subtracting one histogram from another produces a new histogram which you can process to decide how similar two images are. Because histograms only evaluate the distribution and occurrence of luminosity/color information, they handle affine transformations quite well. If you quantize each color component's luminosity information down to an 8-bit value, 768 bytes of storage are sufficient for the fingerprint of an image of almost any reasonable size. Luminosity histograms produce false negatives when the color information in an image is manipulated: transformations such as contrast/brightness adjustments, posterizing, or color shifting change the luminosity information. False positives are also possible with certain types of images, such as landscapes and images where a single color dominates.
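As a minimal sketch of that kind of fingerprint (Java, one 256-bin histogram per RGB channel; to get down to the 768 bytes mentioned above you would additionally quantize each bin count to a single byte):
import java.awt.image.BufferedImage;

public class HistogramFingerprint {

    // 3 x 256 bins: counts for the red, green and blue channels.
    public static int[] fingerprint(BufferedImage img) {
        int[] bins = new int[3 * 256];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                bins[(rgb >> 16) & 0xFF]++;          // red
                bins[256 + ((rgb >> 8) & 0xFF)]++;   // green
                bins[512 + (rgb & 0xFF)]++;          // blue
            }
        }
        return bins;
    }

    // Sum of absolute differences between the normalized histograms,
    // as a rough dissimilarity score (0 means identical distributions).
    public static double difference(int[] a, int[] b) {
        double totalA = 0, totalB = 0;
        for (int v : a) totalA += v;
        for (int v : b) totalB += v;
        totalA /= 3; // each pixel is counted once per channel
        totalB /= 3;
        double diff = 0;
        for (int i = 0; i < a.length; i++) {
            diff += Math.abs(a[i] / totalA - b[i] / totalB);
        }
        return diff;
    }
}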
Using scaled images is another way to reduce the information density of the image to a level that is easier to compare. Reductions below 10% of the original image size generally lose too much information to be useful, so an 800x800 pixel image can be scaled down to 80x80 and still provide enough information to perform decent fingerprinting. Unlike histogram data, this requires anisotropic scaling of the image data when the source resolutions have varying aspect ratios. In other words, reducing a 300x800 image to an 80x80 thumbnail deforms the image, such that comparing it with a (very similar) 300x500 image will cause false negatives. Thumbnail fingerprints also often produce false negatives when affine transformations are involved: if you flip or rotate an image, its thumbnail will be quite different from the original and may fail to match.
Combining both techniques is a reasonable way to hedge your bets and reduce the occurrence of both false positives and false negatives.
There is a much less ad-hoc approach than the scaled-down image variants proposed here, one that retains their general flavor but gives a much more rigorous mathematical basis for what is going on.
Take a Haar wavelet of the image. Basically the Haar wavelet is the succession of differences from the lower resolution images to each higher resolution image, but weighted by how deep you are in the 'tree' of mipmaps. The calculation is straightforward. Then once you have the Haar wavelet appropriately weighted, throw away all but the k largest coefficients (in terms of absolute value), normalize the vector and save it.
If you take the dot product of two of those normalized vectors it gives you a measure of similarity with 1 being nearly identical. I posted more information over here.
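Here is a bare-bones sketch of that idea in Java, assuming the image has already been reduced to an n x n greyscale array with n a power of two; it skips the per-level weighting mentioned above and simply keeps the k largest-magnitude coefficients:
import java.util.Arrays;

public class HaarSignature {

    // In-place 2D Haar decomposition of an n x n array (n a power of two),
    // values assumed to be in [0, 1].
    static void haar2d(double[][] a) {
        int n = a.length;
        double[] tmp = new double[n];
        for (int len = n; len > 1; len /= 2) {
            // transform rows of the current top-left block
            for (int y = 0; y < len; y++) {
                for (int i = 0; i < len / 2; i++) {
                    tmp[i] = (a[y][2 * i] + a[y][2 * i + 1]) / 2;            // average
                    tmp[len / 2 + i] = (a[y][2 * i] - a[y][2 * i + 1]) / 2;  // detail
                }
                System.arraycopy(tmp, 0, a[y], 0, len);
            }
            // transform columns of the current top-left block
            for (int x = 0; x < len; x++) {
                for (int i = 0; i < len / 2; i++) {
                    tmp[i] = (a[2 * i][x] + a[2 * i + 1][x]) / 2;
                    tmp[len / 2 + i] = (a[2 * i][x] - a[2 * i + 1][x]) / 2;
                }
                for (int i = 0; i < len; i++) a[i][x] = tmp[i];
            }
        }
    }

    // Flatten, keep only the k largest-magnitude coefficients (k <= n*n),
    // and normalize to unit length so a dot product gives a similarity.
    static double[] signature(double[][] coeffs, int k) {
        int n = coeffs.length;
        double[] flat = new double[n * n];
        for (int y = 0; y < n; y++)
            for (int x = 0; x < n; x++) flat[y * n + x] = coeffs[y][x];

        double[] sorted = new double[flat.length];
        for (int i = 0; i < flat.length; i++) sorted[i] = Math.abs(flat[i]);
        Arrays.sort(sorted);
        double threshold = sorted[sorted.length - k];

        double norm = 0;
        for (int i = 0; i < flat.length; i++) {
            if (Math.abs(flat[i]) < threshold) flat[i] = 0; // discard small coefficients
            norm += flat[i] * flat[i];
        }
        norm = Math.sqrt(norm);
        if (norm > 0) for (int i = 0; i < flat.length; i++) flat[i] /= norm;
        return flat;
    }

    // Dot product of two normalized signatures; close to 1 means nearly identical.
    static double similarity(double[] a, double[] b) {
        double dot = 0;
        for (int i = 0; i < a.length; i++) dot += a[i] * b[i];
        return dot;
    }
}
The kept signatures are sparse, so in practice you would store only the indices and values of the retained coefficients rather than the full vector.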
You should definitely take a look at phash.
For image comparison there is this PHP project:
https://github.com/kennethrapp/phasher
And my little javascript clone:
https://redaktor.me/phasher/demo_js/index.html
Unfortunately this is "bitcount"-based, but it will recognize rotated images.
Another approach in JavaScript was to build a luminosity histogram from the image with the help of canvas. You can visualize a polygon histogram on the canvas and compare that polygon against the ones in your database (e.g. MySQL spatial ...).
A long time ago I worked on a system that had some similar characteristics, and this is an approximation of the algorithm we followed:
Divide the picture into zones. In our case we were dealing with 4:3 resolution video, so we used 12 zones. Doing this takes the resolution of the source images out of the picture.
For each zone, calculate an overall color - the average of all pixels in the zone
For the entire image, calculate an overall color - the average of all zones
So for each image, you're storing n + 1 integer values, where n is the number of zones you're tracking.
For comparisons, you also need to look at each color channel individually.
For the overall image, compare the color channels for the overall colors to see if they are within a certain threshold - say, 10%
If the images are within the threshold, next compare each zone. If all zones also are within the threshold, the images are a strong enough match that you can at least flag them for further comparison.
This lets you quickly discard images that are not matches; you can also use more zones and/or apply the algorithm recursively to get stronger match confidence.
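A rough sketch of this zone signature in Java (grid size and threshold are parameters; a threshold of about 26 levels corresponds to roughly 10% of 255; the names are illustrative):
import java.awt.image.BufferedImage;

public class ZoneSignature {

    // Average RGB for each zone of a gridWidth x gridHeight grid, with the
    // overall average (the mean of the zone averages) stored as the last entry.
    public static int[][] zoneAverages(BufferedImage img, int gridWidth, int gridHeight) {
        int[][] zones = new int[gridWidth * gridHeight + 1][3];
        long[] overall = new long[3];
        for (int zy = 0; zy < gridHeight; zy++) {
            for (int zx = 0; zx < gridWidth; zx++) {
                int x0 = zx * img.getWidth() / gridWidth;
                int x1 = (zx + 1) * img.getWidth() / gridWidth;
                int y0 = zy * img.getHeight() / gridHeight;
                int y1 = (zy + 1) * img.getHeight() / gridHeight;
                long r = 0, g = 0, b = 0, count = 0;
                for (int y = y0; y < y1; y++) {
                    for (int x = x0; x < x1; x++) {
                        int rgb = img.getRGB(x, y);
                        r += (rgb >> 16) & 0xFF;
                        g += (rgb >> 8) & 0xFF;
                        b += rgb & 0xFF;
                        count++;
                    }
                }
                int zone = zy * gridWidth + zx;
                zones[zone][0] = (int) (r / count);
                zones[zone][1] = (int) (g / count);
                zones[zone][2] = (int) (b / count);
                for (int c = 0; c < 3; c++) overall[c] += zones[zone][c];
            }
        }
        for (int c = 0; c < 3; c++) {
            zones[zones.length - 1][c] = (int) (overall[c] / (gridWidth * gridHeight));
        }
        return zones;
    }

    // Compare the overall colour first (cheap rejection), then every zone.
    public static boolean matches(int[][] a, int[][] b, int threshold) {
        int overall = a.length - 1;
        if (!within(a[overall], b[overall], threshold)) return false;
        for (int i = 0; i < overall; i++) {
            if (!within(a[i], b[i], threshold)) return false;
        }
        return true;
    }

    private static boolean within(int[] a, int[] b, int threshold) {
        for (int c = 0; c < 3; c++) {
            if (Math.abs(a[c] - b[c]) > threshold) return false;
        }
        return true;
    }
}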
Similar to Ic's answer - you might try comparing the images at multiple resolutions. So each image gets saved as 1x1, 2x2, 4x4 ... 800x800. If the lowest resolution doesn't match (subject to a threshold), you can immediately reject the image. If it does match, you can compare the two at the next higher resolution, and so on.
Also - if the images share any similar structure, such as medical images, you might be able to extract that structure into a description that is easier/faster to compare.
As of 2015 (back to the future... on this 2009 question which is now high-ranked in Google) image similarity can be computed using Deep Learning techniques. The family of algorithms known as Auto Encoders can create a vector representation which is searchable for similarity. There is a demo here.
One way you can do this is to resize the image and drop the resolution significantly (to 200x200 maybe?), storing a smaller (pixel-averaged) version for doing the comparison. Then define a tolerance threshold and compare each pixel. If the RGB of all pixels are within the tolerance, you've got a match.
Your initial run-through is O(n^2), but if you catalog all matches, each new image only needs an O(n) comparison pass (you only have to compare it against each previously inserted image). It will eventually break down as the list of images to compare grows, but I think you're safe for a while.
After 400 days of running, you'll have 500,000 images, which means (discounting the time to resize the image down) 200(H)*200(W)*500,000(images)*3(RGB) = 60,000,000,000 comparisons. If every image is an exact match, you're going to be falling behind, but that's probably not going to be the case, right? Remember, you can discount an image as a match as soon as a single comparison falls outside your threshold.
Do you literally want to compare every image against the others? What is the application? Maybe you just need some kind of indexing and retrieval of images based on certain descriptors? Then for example you can look at MPEG-7 standard for Multimedia Content Description Interface. Then you could compare the different image descriptors, which will be not that accurate but much faster.
So you want to do "fingerprint matching", which is quite different from "image matching". Fingerprint analysis has been studied in depth over the past 20 years, and several interesting algorithms have been developed to ensure the right detection rate (with respect to FAR and FRR measures - False Acceptance Rate and False Rejection Rate).
I suggest you look at the LFA (Local Feature Analysis) class of detection techniques, mostly built on minutiae inspection. Minutiae are specific characteristics of any fingerprint and have been classified into several classes. Mapping a raster image to a minutiae map is what most public authorities actually do to file criminals or terrorists.
See here for further references
For iPhone image comparison and image similarity development check out:
http://sites.google.com/site/imagecomparison/
To see it in action, check out eyeBuy Visual Search on the iTunes AppStore.
It seems that specialised image hashing algorithms are an area of active research, but perhaps a normal hash calculation of the image bytes would do the trick.
Are you seeking byte-identical images, rather than images derived from the same source that may be in a different format or resolution (which strikes me as a rather hard problem)?

Detecting if two images are visually identical

Sometimes two image files may be different on a file level, but a human would consider them perceptively identical. Given that, now suppose you have a huge database of images, and you wish to know if a human would think some image X is present in the database or not. If all images had a perceptive hash / fingerprint, then one could hash image X and it would be a simple matter to see if it is in the database or not.
I know there is research around this issue, and some algorithms exist, but is there any tool, like a UNIX command line tool or a library I could use to compute such a hash without implementing some algorithm from scratch?
edit: relevant code from findimagedupes, using ImageMagick
try $image->Sample("160x160!");
try $image->Modulate(saturation=>-100);
try $image->Blur(radius=>3,sigma=>99);
try $image->Normalize();
try $image->Equalize();
try $image->Sample("16x16");
try $image->Threshold();
try $image->Set(magick=>'mono');
($blob) = $image->ImageToBlob();
edit: Warning! ImageMagick $image object seems to contain information about the creation time of an image file that was read in. This means that the blob you get will be different even for the same image, if it was retrieved at a different time. To make sure the fingerprint stays the same, use $image->getImageSignature() as the last step.
findimagedupes is pretty good. You can run "findimagedupes -v fingerprint images" to let it print "perceptive hash", for example.
Cross-correlation or phase correlation will tell you if the images are the same, even with noise, degradation, and horizontal or vertical offsets. Using the FFT-based methods will make it much faster than the algorithm described in the question.
The usual algorithm doesn't work for images that are not the same scale or rotation, though. You could pre-rotate or pre-scale them, but that's really processor intensive. Apparently you can also do the correlation in a log-polar space and it will be invariant to rotation, translation, and scale, but I don't know the details well enough to explain that.
MATLAB example: Registering an Image Using Normalized Cross-Correlation
Wikipedia calls this "phase correlation" and also describes making it scale- and rotation-invariant:
The method can be extended to determine rotation and scaling differences between two images by first converting the images to log-polar coordinates. Due to properties of the Fourier transform, the rotation and scaling parameters can be determined in a manner invariant to translation.
A colour histogram is good for the same image that has been resized, resampled, etc.
If you want to match different people's photos of the same landmark, it's trickier - look at Haar classifiers. OpenCV is a great free library for image processing.
I don't know the algorithm behind it, but Microsoft Live Image Search just added this capability. Picasa also has the ability to identify faces in images, and groups faces that look similar. Most of the time, it's the same person.
Some machine learning technology like a support vector machine, neural network, naive Bayes classifier or Bayesian network would be best at this type of problem. I've written one each of the first three to classify handwritten digits, which is essentially image pattern recognition.
Resize the image to a 1x1 pixel... if they are exact, there is a small probability they are the same picture...
Now resize it to a 2x2 pixel image; if all 4 pixels are exact, there is a larger probability they are the same...
Then 3x3; if all 9 pixels are exact... good chance, etc.
Then 4x4; if all 16 pixels are exact... better chance.
etc...
Doing it this way, you can make efficiency improvements... if the 1x1 pixel grid is off by a lot, why bother checking the 2x2 grid? etc.
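A hedged sketch of that progression in Java, simplified to power-of-two sizes and a per-channel tolerance, with the cheap rejection happening at the lowest resolutions:
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class ProgressiveCompare {

    // Compare two images at 1x1, 2x2, 4x4, ... up to maxSize, bailing out
    // as soon as any pixel pair differs by more than `tolerance` per channel.
    public static boolean probablySame(BufferedImage a, BufferedImage b,
                                       int maxSize, int tolerance) {
        for (int size = 1; size <= maxSize; size *= 2) {
            BufferedImage sa = scale(a, size);
            BufferedImage sb = scale(b, size);
            for (int y = 0; y < size; y++) {
                for (int x = 0; x < size; x++) {
                    int p = sa.getRGB(x, y), q = sb.getRGB(x, y);
                    if (Math.abs(((p >> 16) & 0xFF) - ((q >> 16) & 0xFF)) > tolerance
                     || Math.abs(((p >> 8) & 0xFF) - ((q >> 8) & 0xFF)) > tolerance
                     || Math.abs((p & 0xFF) - (q & 0xFF)) > tolerance) {
                        return false; // cheap rejection at a low resolution
                    }
                }
            }
        }
        return true; // survived every resolution up to maxSize
    }

    private static BufferedImage scale(BufferedImage src, int size) {
        BufferedImage out = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.drawImage(src, 0, 0, size, size, null);
        g.dispose();
        return out;
    }
}
Caching the small rescaled versions of every catalogued image, as the earlier multi-resolution answer suggests, keeps the per-comparison cost dominated by the cheap low-resolution checks.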
If you have lots of images, a color histogram could be used to get rough closeness of images before doing a full image comparison of each image against each other one (i.e. O(n^2)).
There is DPEG, "The" Duplicate Media Manager, but its code is not open. It's a very old tool - I remember using it in 2003.
You could use diff to see if they are REALLY different... I guess it would eliminate a lot of useless comparisons. Then, for the algorithm, I would use a probabilistic approach: what are the chances that they look the same? I'd base that on the amount of RGB in each pixel. You could also find some other metrics such as luminosity and the like.
