I am trying to come up with a way to classify if the given image contains a Red Car.
The possible outcomes of the classifier should be:
Image contains a CAR and it is RED. (Desired case)
All others where Image contains a CAR but it is NOT RED or image does not contain any Cars at all.
I know how to implement a Convolutionary NN that can classify if an image contains a CAR or not.
But I am having trouble on how to implement a fine grained image classification for this where the classifier should only identify Red Cars and ignore all other images that may contain Cars or no cars in the image.
I read the following papers but since my use case is much limited than finding similarities as proposed in the papers I am trying to see if there is a simple approach to implement this.
Fast Training of Triplet-based Deep Binary Embedding Networks
Learning Fine-grained Image Similarity with Deep Ranking
Thanks for your help.
Just treat it as a classification problem with two classes: "Red car" - "No red car". Label every instance of your training data this way. There is no need to train a "car" classifier first.
I know how to implement a Convolutionary NN that can classify if an image contains a CAR or not.
Good. Then this should be done within seconds (+ time for labeling).
I read the following papers but since my use case is much limited than finding similarities as proposed in the papers I am trying to see if there is a simple approach to implement this.
Fast Training of Triplet-based Deep Binary Embedding Networks
Learning Fine-grained Image Similarity with Deep Ranking
Yes, simply treating it as a classification problem as described above. If you need a starter, have a look at the Tensorflow Cifar10 tutorial.
Related
I am studying neural networks, or more specifically image classification at the moment. While I was reading, I was wondering if the following has ever been done/ is doable. If anybody could point me to some sources or ideas, I'd appreciate it!
In a traditional neural network, you have a training data set of images and the weights of the neurons in the network. The goal is to optimize the weights so that the classification of the images is accurate for training data and new images is as good as possible.
I was wondering if you could reverse this:
Given a neural network and the weights to its neurons, generate a set of images corresponding to the classes that the network separates, i.e., a proto-type of the kinds of images this specific network is able to classify well.
In my mind it would work as follows (I'm sure this is not quite achievable, but just to get the idea across):
Imagine a neural network that is able to classify images containing labels cat, dog and neither of those.
What I want is the "inverse", i.e. an image of a cat, an image of a dog and one that is "furthest away" from the other two classes.
I think that this could be done by generating images and minimizing the loss function for one specific class, while maximizing it for all other classes at the same time.
Is this kind of how Google Deep Dream visualizes what it is "dreaming"?
I hope it is clear what I mean, if not I will answer any questions.
Is this kind of how Google Deep Dream visualizes what it is "dreaming"?
Pretty much, it seems, at least that's how the people behind it explain it:
One way to visualize what goes on [in a neural network layer] is to turn the network upside down and ask it to enhance an input image in such a way as to elicit a particular interpretation. Say you want to know what sort of image would result in “Banana.” Start with an image full of random noise, then gradually tweak the image towards what the neural net considers a banana (see related work [...]). By itself, that doesn’t work very well, but it does if we impose a prior constraint that the image should have similar statistics to natural images, such as neighboring pixels needing to be correlated.
Source - The whole blog post is worth reading.
I think you can understand the mainstream approach from Karpathy's blog:
http://karpathy.github.io/2015/03/30/breaking-convnets/
Normal ConvNet training: "What happens to the score of the correct class when I wiggle this parameter?"
Creating fooling images: "What happens to the score of (whatever class you want) when I wiggle this pixel?"
Fooling the classifier with an image is very close to what you ask. For your goal you need to add some regularization to your loss function to avoid fully misleading results - the absolute minimum loss can be very distorted picture.
I have a hard problem to solve which is about automatic image keywording. You can assume that I have a database with 100000+ keyworded low quality jpeg images for training (low quality = low resolution about 300x300px + low compression ratio). Each image has about 40 mostly accurate keywords (data may contain slight "noise"). I can also extract some data on keyword correlations.
Given a color image and a keyword, I want to determine the probability that the keyword is related to this image.
I need a creative understandable solution which I could implement on my own in about a month or less (I plan to use python). What I found so far is machine learning, neural networks and genetic algorithms. I was also thinking about generating some kind of signatures for each keyword which I could then use to check against not yet seen images.
Crazy/novel ideas are appreciated as well if they are practicable. I'm also open to using other python libraries.
My current algorithm is extremely complex and computationally heavy. It suggests keywords instead of calculating probability and 50% of suggested keywords are not accurate.
Given the hard requirements of the application, only gross and brainless solutions can be proposed.
For every image, use some segmentation method and keep, say, the four largest segments. Distinguish one or two of them as being background (those extending to the image borders), and the others as foreground, or item of interest.
Characterize the segments in terms of dominant color (using a very rough classification based on color primaries), and in terms of shape (size relative to the image, circularity, number of holes, dominant orientation and a few others).
Then for every keyword you can build a classifier that decides if a given image has/hasn't this keyword. After training, the classifiers will tell you if the image has/hasn't the keyword(s). If you use a fuzzy classification, you get a "probability".
I want to use bag of words for content-based image retrieval.
I'm confused as to how to apply bag-of-words to content based image retrieval.
To clarify:
I've trained my program using SURF features and extract the BoW descriptors. I feed this to a support vector machine as training data. Then, given a query image, the support vector machine can predict which class a given image belongs to.
In other words, given a query image it can find a class. For example, given a query image of a car, the program would return 'car'. How would one find similar images?
Would I, given the class, return images from the training set? Or would the program - given a query image - also return a subset of a test-set on which the SVM predicts the same class?
The title only mentions BoW, but in your text you also use SVMs.
I think the core idea of CBIR is, to find the most similar image, according to some distance measure. You can do this with BoW-features. The SVM is not necessary.
The main purpose of using additional classification is to speed up the process. Because after you obtained a class label for your test image, you only need to search this subgroup of your images for the best match. And of course, if the SVM is better in distinguishing certain classes than your distance measure, it might help to reduce errors.
So the standard workflow would be:
obtain the class
return the best match from the training samples of this class
I am looking for a solution to do the following:
( the focus of my question is step 2. )
a picture of a house including the front yard
extract information from the picture like the dimensions and location of the house, trees, sidewalk, and car. Also, the textures and colors of the house, cars, trees, and sidewalk.
use extracted information to generate a model
How can I extract that information?
You could also consult Tatiana Jaworska research on this. As I understood, this details at least 1 new algorithm to feature extraction (targeted at roofs, doors, ...) by colour (RGB). More intriguing, the last publication also uses parameterized objects to be identified in the house images... that must might be a really good starting point for what you're trying to do.
link to her publications:
http://www.springerlink.com/content/w518j70542780r34/
http://portal.acm.org/citation.cfm?id=1578785
http://www.ibspan.waw.pl/~jaworska/TJ_BOS2010.pdf
Yes. You can extract these information from a picture.
1. You just identify these objects in a picture using some detection algorithms.
2. Measure these objects dimensions and generate a model using extracted information.
well actually your desired goal is not so easy to achieve. First of all you'll need a good way to figure what what is what and what is where on your image. And there simply is no easy "algorithm" for detecting houses/cars/whatsoever on an image. There are ways to segment different objects (like cars) from an image, but those don't work generally. Especially on houses this would be hard since each house looks different and it's hard to find one solid measurement for "this is house and this is not"...
Am I assuming it right that you are trying to simply photograph a house (with front yard) and build a texturized 3D-model out of it? This is not going to work since you need several photos of the house to get positions of walls/corners and everything in 3D space (There are approaches that try a mesh reconstruction with one image only but they lack of depth information and results are fairly poor). So if you would like to create 3D-mdoels you will need several photos of different angles of the house.
There are several different approaches that use this kind of technique to reconstruct real world objects to triangle-meshes.
Basically they work after the principle:
Try to find points in images of different viewpoint which are the same on an object. Considering you are photographing a house this could be salient structures likes corners of windows/doors or corners or edges on the walls/roof/...
Knowing where one and the same point of your house is in several different photos and knowing the position of the camera of both photos you can reconstruct this point in 3D-space.
Doing this for a lot of equal points will "empower" you to reconstruct the shape of your house as a 3D-model by triangulating the points.
Taking parts of the image as textures and mapping them on the generated model would work as well since you know where what is.
You should have a look at these papers:
http://www.graphicon.ru/1999/3D%20Reconstruction/Valiev.pdf
http://people.csail.mit.edu/wojciech/pubs/LabeledRec.pdf
http://people.csail.mit.edu/sparis/publi/2006/oceans/Paris_06_3D_Reconstruction.ppt
The second paper even has an example of doing exactly what you try to achieve, namely reconstruct a textured 3D-model of a house photographed from different angles.
The third link is a powerpoint presentation that shows how the reconstruction works and shows the drawbacks there are.
So you should get familiar with these papers to see what problems you are up to... If you then want to try this on your own have a look at OpenCV. This library provides some methods for feature extraction in images. You then can try to find salient points in each image and try to match them.
Good luck on your project... If you have problems, please keep asking!
I suggest to look at this blog
https://jwork.org/main/node/35
that shows how to identify certain features on images using a convolutional neural network. This particular blog discusses how to identify human faces on images from a large set of random images. You can adjust this example to train neural network using some other images. Note that even in the case of human faces, the identification rate is about 85%, therefore, more complex objects can be even harder to identify
I've been trying to understand the AdaBoost algorithm without much success. I'm struggling with understanding the Viola Jones paper on Face Detection as an example.
Can you explain AdaBoost in laymen's terms and present good examples of when it's used?
Adaboost is an algorithm that combines classifiers with poor performance, aka weak learners, into a bigger classifier with much higher performance.
How does it work? In a very simplified manner:
Train a weak learner.
Add it to the set of weak learners trained so far (with an optimal weight)
Increase the importance of samples that are still miss-classified.
Go to 1.
There is a broad and detailed theory behind the scenes, but the intuition is just that: let each "dumb" classifier focus on the mistakes the previous ones were not able to fix.
AdaBoost is one of the most used algorithms in the machine learning community. In particular, it is useful when you know how to create simple classifiers (possibly many different ones, using different features), and you want to combine them in an optimal way.
In Viola and Jones, each different type of weak-learner is associated to one of the 4 or 5 different Haar features you can have.
AdaBoost uses a number of training sample images (such as faces) to pick a number of good 'features'/'classifiers'. For face recognition a classifiers is typically just a rectangle of pixels that has a certain average color value and a relative size. AdaBoost will look at a number of classifiers and find out which one is the best predictor of a face based on the sample images. After it has chosen the best classifier it will continue to find another and another until some threshold is reached and those classifiers combined together will provide the end result.
This part you may not want to share with non-technical people :) but it is interesting anyway. There are several mathematical tricks which make AdaBoost fast for face recognition such as the ability to add up all the color values of an image and store them in a 2 dimensional array so that the value in any position will be the sum of all the pixels up and to the left of that position. This array can be used to very quickly calculate the average color value of any rectangle within the image by subtracting the value found in the top left corner from the value found in the bottom right corner and dividing by the number of pixels in the rectangle. Using this trick you can quickly scan over an entire image looking for rectangles of different relative sizes that match or are close to a particular color.
Hope this helps.
This is understandable. Most of the papers you can find on Internet retell Viola-Jones and Freund-Shapire papers which are foundation of AdaBoost applied for face recognition in OpenCV. And they mostly consist of difficult formulas and algorithms from several mathematical areas combined.
Here is what can help you (short enough) -
1 - It is used in object and, mostly, in face detection-recognition.The most popular and quite good C++ library is OpenCV from Intel originally. I take the part of Face detection in OpenCV, as an example.
2 - First, a cascade of boosted classifiers working with sample rectangles ("features") is trained on sample of images with faces (called positive) and without faces (negative).
From some Googled paper:
"· Boosting refers to a general and provably effective method of producing a very accurate classifier by combining rough and moderately inaccurate rules of thumb.
· It is based on the observation that finding many rough rules of thumb can be a lot easier than finding a single, highly accurate classifier.
· To begin, we define an algorithm for finding the rules of thumb, which we call a weak learner.
· The boosting algorithm repeatedly calls this weak learner, each time feeding it a different distribution over the training data (in AdaBoost).
· Each call generates a weak classifier and we must combine all of these into a single classifier that, hopefully, is much more accurate than any one of the rules."
During this process the images are scanned to determine the distinctive areas corresponding to certain part of every face. The complex calculation-hypothesis based algorithms are applied (which are not so difficult to understand once you get the main idea).
This can take a week and the output is an XML file which contains the learned information on how to quickly detect the human face, say, in frontal position on any picture (it can be any object in other case).
3 - After that you supply this file to OpenCV face detection program which runs quite fast with up to 99% positive rate (depending on conditions).
As was mentioned here, the scanning speed can be increased greatly with technique known as "integral image".
And finally, these are helpful sources - Object Detection in OpenCV and
Generic Object Detection using AdaBoost from University of California, 2008.