Scikit-learn. Classify disordered jpgs - image

How would you approach the following problem? I have 5 classes of images (500 images in total): car, house, trees, chair and face. I also have a folder with 20 unsorted images: I know each of them belongs to one of the 5 classes, but not which one, and I want my system to classify them into the 5 known classes. I am using several feature extractors (hue, edge) for this task, but I am struggling to find a suitable classification approach. In particular, some Python libraries require the unlabeled image folder to be named the same way as the class folder (e.g. /dir/controlled/car and /dir/uncontrolled/car), which is simply not feasible for my analysis. Since I am looking for alternative approaches, can you give some methodological advice or a workaround within sklearn?

Maybe it would be easier to use a labeled dataset such as ImageNet to first train a classifier on those 5 classes (+1 additional "misc" class that you would fill with random images not from those 5 classes).
Take as many examples as you can from ImageNet to build your training set while keeping the classes approximately balanced. For instance, ImageNet has almost 8000 car pictures: http://www.image-net.org/synset?wnid=n02958343 but only around 1500 faces: http://www.image-net.org/synset?wnid=n02958343 . Some classifiers might not work well with such an imbalance, so subsampling the car class might yield better results in terms of F1 score, unless you find another source of face pictures.
Once you find a set of parameters for feature extraction + classifier chain that yields good cross validated score on your ImageNet subset, retrain a model on that full subset and apply it to predict the labels of your own dataset.
Choose a classifier that gives you confidence scores (e.g. via a method such as predict_proba or decision_function) and inspect the quality of the classifications with the highest and lowest confidence scores:
if all the highest-confidence classifications are correct, add all the pictures above some safe threshold to a "stage two" training set that comprises the original ImageNet subset and those new pictures.
manually re-annotate the worst mistakes among the lowest-confidence predictions and add them to the "stage two" training set.
Iterate by retraining a new model on this enriched dataset until the classifier is able to annotate most of your pictures correctly.
BTW, don't change the parameters too much once you start annotating your data and iterating with the classifier, to avoid overfitting. If you want to redo parameter selection, you should do cross-validation again.
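As a rough illustration of that confidence-based loop, here is a minimal sketch. It assumes the features have already been extracted into arrays; the synthetic data, the LogisticRegression choice and the THRESHOLD value are placeholders of mine, not something prescribed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data: in practice X_labeled would hold features extracted from the
# ImageNet subset and X_unlabeled features from your own unsorted pictures.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_labeled, y_labeled = X[:500], y[:500]
X_unlabeled = X[500:]

THRESHOLD = 0.9  # hypothetical "safe" confidence level; tune it by inspection

clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
proba = clf.predict_proba(X_unlabeled)
confidence = proba.max(axis=1)
predicted = clf.classes_[proba.argmax(axis=1)]

# Most confident predictions: candidates for the second-stage training set
# (after you have checked by eye that they are actually correct).
confident = confidence >= THRESHOLD

# Least confident predictions first: review and re-annotate these by hand.
review_order = np.argsort(confidence)

X_stage2 = np.vstack([X_labeled, X_unlabeled[confident]])
y_stage2 = np.concatenate([y_labeled, predicted[confident]])
clf_stage2 = LogisticRegression(max_iter=1000).fit(X_stage2, y_stage2)
```

The manually corrected low-confidence pictures would be appended to X_stage2 / y_stage2 in the same way before retraining.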

Related

machine learning: Image classification into 3 classes (Dog or Cat or Neither) using Convolutional NN

I would appreciate a bit of help in thinking this through. I have a classifier that can categorize the images into either dog or cat successfully with good accuracy. I have a good data set to train the classifier on. So far no problem.
I have about 20,000 dog and 20,000 cat images.
However, when I present other images, like a car, a building or a tiger, that contain neither a dog nor a cat, I would like the output of the classifier to be "Neither". Right now, obviously, the classifier tries to classify everything as Dog or Cat, which is not correct.
Question 1:
How can I achieve this? Do I need a third set of images that do not contain a dog or a cat, and train the classifier on these additional images to recognize everything else as "Neither"?
At a high level, approximately how many images of the non-dog/cat category would I need to get good accuracy? Would about 50,000 images do, given that the non-dog/cat image domain is so huge, or do I need even more?
Question 2:
Instead of training my own classifier on my own image data, can I use the ImageNet-trained VGG16 Keras model for the initial layers and add the Dog/Cat/Neither classifier on top as the fully connected layer?
See this example to load a pre-trained ImageNet model
Thanks much for your help.
Question 2
I'll take the "killer" heuristic first. Yes, use the existing trained model. Simply conglomerate all of the dog classifications into your class 1, the cats into class 2, and everything else into class 0. This will solve virtually all of your problem.
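A minimal sketch of that collapsing step, assuming you already have the 1000-way probability vector from a pretrained ImageNet model. The index ranges below (151-268 for dog breeds, 281-285 for domestic cats) are an assumption based on the usual ILSVRC-2012 class ordering; check them against the class index file that ships with your model. The function name and the fake input are hypothetical.

```python
import numpy as np

# Assumption: indices follow the usual ILSVRC-2012 1000-class ordering, where
# 151-268 are dog breeds and 281-285 are domestic cats. Verify before use.
DOG_IDX = np.arange(151, 269)
CAT_IDX = np.arange(281, 286)

def collapse_to_three(probs):
    """Map a (1000,) ImageNet probability vector to (label, 3-class probs)."""
    p_dog = probs[DOG_IDX].sum()
    p_cat = probs[CAT_IDX].sum()
    p_neither = 1.0 - p_dog - p_cat
    three = np.array([p_neither, p_dog, p_cat])  # class 0, 1, 2 as in the answer
    return ("neither", "dog", "cat")[int(three.argmax())], three

# Fake network output that puts most of its mass on index 207 (in the dog range).
probs = np.full(1000, 0.1 / 999)
probs[207] = 0.9
label, three = collapse_to_three(probs)
print(label, three.round(3))
```

Note that wild relatives such as wolves or lynxes sit outside these ranges, so they fall into "neither" by construction.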
Question 1
The problem is that your initial model has been trained that everything in the world (all 40,000 images) is either a dog or a cat. Yes, you have to train a third set, unless your training method is a self-limiting algorithm, such as a single-class SVM (run once on each classification). Even then, I expect that you'd have some trouble excluding a lynx or a wolf.
You're quite right that you'll need plenty of examples for the "neither" class, given the high dimension of the input space: it's not so much the quantity of images, but their placement just "over the boundary" from a cat or dog. I'd be interested in a project to determine how to do this with minimal additional input.
In short, don't simply grab 50K images from the ImageNet type of the world; choose those that will give your model the best discrimination: other feline and canine examples, other objects you find in similar environments (end table, field rodent, etc.).

What estimator to use in scikit-learn?

This is my first brush with machine learning, so I'm trying to figure out how this all works. I have a dataset where I've compiled all the statistics of each player to play with my high school baseball team. I also have a list of all the players that have ever made it to the MLB from my high school. What I'd like to do is split the data into a training set and a test set, and then feed it to some algorithm in the scikit-learn package and predict the probability of making the MLB.
So I looked through a number of sources and found this cheat sheet that suggests I start with linear SVC.
So, as I understand it, I need to break my data into training samples where each row is a player and each column is a piece of data about the player (batting average, on-base percentage, yada, yada), X_train; and a corresponding truth vector with one entry per player that is simply 1 (played in MLB) or 0 (did not play in MLB), Y_train. From there, I just call fit(X_train, Y_train) and then I can use predict(X_test) to see if it gets the right values for Y_test.
Does this seem a logical choice of algorithm, method, and application?
EDIT to provide more information:
The data is made of 20 features such as number of games played, number of hits, number of Home Runs, number of Strike Outs, etc. Most are basic counting statistics about the players career; a few are rates such as batting average.
I have about 10k total rows to work with, so I can split the data based on that; but I have no idea how to optimally split the data, given that <1% have made the MLB.
Alright, here are a few steps that you might want to take:
Prepare your data set. In practice, you might want to scale the features, but we'll leave that out to make the first working model as simple as possible, so we just need to split the dataset into train and test sets. You could shuffle the records manually and take the first X% of the examples as the train set, but there's already a function for it in the scikit-learn library: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. You might want to make sure that both positive and negative examples are present in the train and test sets. To do so, you can separate them before the train/test split so that, say, 70% of the negative examples and 70% of the positive examples go to the training set.
Let's pick a simple classifier. I'll use logistic regression here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, but other classifiers have a similar API.
Creating the classifier and training it is easy:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
Now it's time to make our first predictions:
y_pred = clf.predict(X_test)
A very important part of the model is its evaluation. Using accuracy is not a good idea here: the number of positive examples is very small, so the model that unconditionally returns 0 can get a very high score. We can use the f1 score instead: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
If you want to predict probabilities instead of labels, you can just use the predict_proba method of the classifier.
That's it. We have a working model! Of course, there are a lot of things you may try in order to improve it, such as scaling the features, trying different classifiers, and tuning their hyperparameters, but this should be enough to get started.
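Putting those steps together, a minimal end-to-end sketch could look like this. The synthetic data stands in for the real player table; stratify=y is one way to get the same share of positives in both splits, and class_weight="balanced" is an optional extra that is not part of the steps above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the player table: 10k rows, 20 features, ~1% positives.
X, y = make_classification(n_samples=10_000, n_features=20, n_informative=8,
                           weights=[0.99], random_state=0)

# stratify=y keeps the positive/negative ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight="balanced" is an optional extra for the heavy class imbalance.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("F1:", f1_score(y_test, y_pred))

# Probability of the positive class (column 1) for each test row.
p_mlb = clf.predict_proba(X_test)[:, 1]
```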
If you don't have a lot of experience in ML: scikit-learn offers classification algorithms (for when the target of your dataset is a boolean or categorical variable) and regression algorithms (for when the target is a continuous variable).
If you have a classification problem and your variables are on very different scales, a good starting point is a decision tree:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
The classifier is a tree, and you can inspect the decisions made at each node.
After that you can use a random forest, which is a group of decision trees whose results are averaged:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
After that you can bring every feature to the same scale:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
And you can use other algorithms like SVMs.
For every algorithm you need a technique to select its parameters, for example cross validation:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
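As a sketch of how these pieces fit together, the following compares a decision tree, a random forest and a scaled SVM with 5-fold cross-validation. The synthetic data and the specific settings (max_depth=5, 100 trees, F1 scoring) are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in data.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           weights=[0.9], random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Scaling lives inside the pipeline so it is re-fit on each training fold.
    "scaled SVM": make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: mean F1 = {scores[name]:.3f}")
```

To select parameters rather than just compare models, the same cross-validation can be driven by sklearn.model_selection.GridSearchCV.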
But a good course is the best way to learn. On Coursera you can find several good courses, such as this one:
https://www.coursera.org/learn/machine-learning

Negative Training Image Examples for CNN

I am using the Caffe framework for CNN training. My aim is to perform simple object recognition for a few basic object categories. Since pretrained networks are not an option for my intended use, I prepared my own training and test sets with about 1000 images for each of 2 classes (say chairs and cars).
The results are quite good. If I present a previously unseen image of a chair, it is likely to be classified as such; the same goes for a car image. My problem is that miscellaneous images that show neither of these classes often get a very high confidence (=1) for one random class (which is not surprising given the one-sided training data, but is a problem for my application). I thought about different solutions:
1) Adding a third class with also about 1000 negative examples that shows any objects except a chair and a car.
2) Adding more object categories in general, just to let the network classify other objects as such and not any more as a chair or car (of course this would require much effort). Maybe also the broader prediction results would show a more uniform distribution at negative images, allowing to evaluate the target objects presence based on a threshold?
Because grabbing random images from the internet as negative examples was not very time-consuming, I already tested my first solution with about 1200 negative examples. It helped, but the problem remains, perhaps because there were just too few. My concern is that if I increase the number of negative examples, the imbalance in the number of examples per class will lead to less accurate detection of the original classes.
After some research I found one person with a similar problem, but there was no solution:
Convolutional Neural Networks with Caffe and NEGATIVE IMAGES
My question is: Has anyone had the same problem and knows how to deal with it? What way would you recommend, adding more negative examples or more object categories or do you have any other recommendation?
The problem is not unique to Caffe or ConvNets. Any Machine Learning technique runs this risk. In the end, all classifiers take a vector in some input space (usually very high-dimensional), which means they partition that input space. You've given examples of two partitions, which helps to estimate the boundary between the two, but only that boundary. Both partitions have very, very large boundaries, precisely because the input space is so high-dimensional.
ConvNets do try to tackle the high-dimensionality of image data by having fairly small convolution kernels. Realistic negative data helps in training those, and the label wouldn't really matter. You could even use the input image as goal (i.e. train it as an autoencoder) when training the convolution kernels.
One general reason why you don't want to lump all counterexamples together is that they may be too varied. If you have a class A with some feature value from the range [-1,+1] on some scale, with counterexamples B [-2,-1] and C [+1,+2], lumping B and C together creates a range [-2,+2] for counterexamples which overlaps the real range. Given enough data and powerful enough classifiers, this is not fatal, but for instance an SVM can fail badly on this.
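The A/B/C example can be reproduced in a few lines. This is a toy sketch with one feature and a linear-kernel SVM; the sample sizes and the random seed are arbitrary choices of mine.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, 200)   # class A
b = rng.uniform(-2, -1, 100)  # counterexamples B
c = rng.uniform(1, 2, 100)    # counterexamples C
X = np.concatenate([a, b, c]).reshape(-1, 1)

y_lumped = np.array([1] * 200 + [0] * 200)                # A vs. "everything else"
y_separate = np.array([0] * 200 + [1] * 100 + [2] * 100)  # A, B, C kept apart

lumped = SVC(kernel="linear").fit(X, y_lumped)
separate = SVC(kernel="linear").fit(X, y_separate)

acc_lumped = lumped.score(X, y_lumped)
# Score the 3-class model on the same A / not-A question for a fair comparison.
acc_separate = np.mean((separate.predict(X) == 0) == (y_lumped == 1))
print(f"lumped: {acc_lumped:.2f}, separate: {acc_separate:.2f}")
```

With a single threshold on one feature, the lumped model cannot fence off A from both sides, while the model trained on separate B and C labels can. A non-linear kernel would also cope with the lumped labels, which is the "powerful enough classifier" case mentioned above.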

Steps for age classification

I am working on age (or gender) classification using images of human faces. I have decided to use the LBP (Local Binary Patterns) approach for feature extraction and Support Vector Machines (SVM) for feature classification. The whole process is shown in Fig. 1 below.
As I understand it, the procedure is as follows:
Start with a training set that includes 3 groups: Children, Young, Senior. Each group has 50 images (150 images total). Use LBP to prepare the 150 images for classification.
Train an SVM on the 150 LBP images with labels:
0: Child
1: Young Adult
2: Senior
Test the system using a set of new images. If all goes according to plan, the system should properly classify images based on the groups defined in step 2.
The algorithm:
for i = 1 to N  // N is the number of training images
    LBP_feature[i] = LBP_extract(image[i])
end
// Training stage
SVM.train(LBP_feature, label)
// Test stage
face = getFromCamera()
// Extract LBP from the face
face_LBP = LBP_extract(face)
label = SVM.predict(face_LBP)
if label == 0 then Child
if label == 1 then Young Adult
if label == 2 then Senior
Does the proposed system make sense for this task?
If you want to use support vector machines, and you also want to consider an image to be a "sample" of subregions, then so-called "support distribution machines" developed by Jeff Schneider and Barnabas Poczos might be best suited for your problem (paper and documentation available online). They actually showed that with some tweaks, support distribution machines outperformed all state-of-the-art methods for a certain popular image classification data set. They used SIFT (sp?) features and then each image was a collection of samples (subregion patches) from the feature space, and then "support distribution machines" are kernel-based SVMs that estimate a divergence kernel between two distributions by using a sample-based estimator.
If you want to use SVMs like support distribution machines, there is one final point to consider. SVMs are two-class classifiers. To extend them to more than 2 classes, you can train an SVM that classifies one class versus the union of the rest of the classes, for each choice of class (so N SVMs if you have N classes), and then run each SVM and choose the class with the highest classification score. Another method is to train an SVM for each pair of classes (so N(N-1)/2 SVMs for N classes) and then choose the best class by getting a "consensus" of all the pairwise comparisons. You can read about all this online and choose whichever method you think is best, or whichever method gives the best leave-one-out cross-validation performance on the training data (which should be easy to calculate because you only have 150 training points).
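In scikit-learn, both multi-class strategies are available as wrappers, so comparing them by leave-one-out cross-validation takes only a few lines. The sketch below uses hypothetical blob data with 150 points and 4 classes (4 rather than 3, so that the two strategies train a different number of SVMs).

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Hypothetical data: 150 points in 4 classes, 10 features each.
X, y = make_blobs(n_samples=150, centers=4, n_features=10, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_), "one-vs-rest SVMs")  # N of them
print(len(ovo.estimators_), "one-vs-one SVMs")   # N(N-1)/2 of them

loo = LeaveOneOut()
for name, model in [("one-vs-rest", ovr), ("one-vs-one", ovo)]:
    acc = cross_val_score(model, X, y, cv=loo).mean()
    print(f"{name}: leave-one-out accuracy = {acc:.3f}")
```

A plain sklearn.svm.SVC already handles more than two classes internally using the one-vs-one scheme, so the wrappers are mainly useful when you want to pick the strategy explicitly.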
On paper, the approach makes sense. The most important point is whether the LBP is the right feature for this task. You can first extract the LBP using different parameters (image size, bin count if you are using LBP histogram, etc.) and observe the data using a tool like Weka or R to see if your sample data for different classes exhibit different distributions.
You can also refer to a few research papers on age estimation to see what other features are suitable. I have tried Radon transform with some success, for seniors. The wrinkles in faces are well represented in Radon transform.
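For reference, the LBP-histogram plus SVM pipeline from the question can be sketched as follows. This is a simplified illustration: it implements only the basic 3x3 LBP operator in NumPy (libraries such as scikit-image offer more variants), and random arrays stand in for real cropped grayscale faces, so the predictions here are meaningless.

```python
import numpy as np
from sklearn.svm import SVC

# The 8 neighbours of a pixel, clockwise from the top-left corner.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_histogram(img):
    """Normalised 256-bin histogram of basic 3x3 LBP codes for a 2-D array."""
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(OFFSETS):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes += (neighbour >= center).astype(np.uint8) * np.uint8(1 << bit)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# Placeholder data: random arrays stand in for 150 cropped grayscale faces.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(150, 32, 32))
labels = np.repeat([0, 1, 2], 50)  # 0: Child, 1: Young Adult, 2: Senior

features = np.array([lbp_histogram(im) for im in images])
svm = SVC(kernel="linear").fit(features, labels)

new_face = rng.integers(0, 256, size=(32, 32))
names = {0: "Child", 1: "Young Adult", 2: "Senior"}
print(names[int(svm.predict([lbp_histogram(new_face)])[0])])
```

Plotting or comparing the per-class mean of `features` is a quick way to check whether the classes actually differ in LBP space before investing in the classifier.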

Classifying Multivariate Time Series

I am currently working on a time series with 430 attributes and approx. 80k instances. Now I would like to binary classify each instance (not the whole time series). Everything I found about classifying time series talked about labeling the whole thing.
Is it possible to classify each instance with something like a SVM completely disregarding the sequential nature of the data or would that only result in a really bad classifier?
Which other options are there which classify each instance but still look at the data as a time series?
If the data is labeled, you may have luck by concatenating the attributes together, so that each instance becomes a single long time series, and by applying the so-called Shapelet Transform. This results in a vector of values for each time series, which can be fed into an SVM, Random Forest, or any other classifier. It could be that picking the right shapelets will allow you to focus on a single attribute when classifying instances.
If it is not labeled, you may try the unsupervised shapelets application first to explore your data and proceed with aforementioned shapelet transform after.
It certainly depends on the data within the 430 attributes, the data types, and especially the problem you want to solve.
In time series analysis, you usually want to exploit the dependencies between the neighboring points, i.e., how they change in time. The examples you may find in books usually talk about a single function f(t): Time -> Real. If I understand it correctly, you want to focus just on the dependencies among the 430 attributes (vertical dependencies) and disregard the horizontal dependencies.
If I were you, I would first try to train multiple classifiers (SVM, Maximum entropy model, Multi-layer perceptron, Random forest, Probabilistic Neural Network, ...) and compare their prediction performance in the frame of your problem.
For training, you can start by feeding all 430 attributes as features to Maxent classifier (can easily handle millions of features).
You also need to perform some N-fold cross-validation to check whether the classifiers are overfitting. Then pick the best one that solves your problem "good enough".
Other ideas if this approach does not perform well:
include features from t-1, t-2...
perform feature selection by trying different subsets of features
derive new time series such as moving averages, wavelet spectrum ... and use them as new features
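The first and third ideas can be sketched with plain NumPy. The helper names below (add_lags, moving_average) are hypothetical, the data is synthetic, and LogisticRegression is used as the scikit-learn counterpart of a maximum entropy classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def add_lags(X, n_lags):
    """Return rows [x_t, x_{t-1}, ..., x_{t-n_lags}]; the first n_lags rows are dropped."""
    T = X.shape[0]
    return np.hstack([X[n_lags - k:T - k] for k in range(n_lags + 1)])

def moving_average(X, window):
    """Trailing mean over `window` rows, aligned to the last row of each window."""
    csum = np.cumsum(np.vstack([np.zeros((1, X.shape[1])), X]), axis=0)
    return (csum[window:] - csum[:-window]) / window

# Small synthetic stand-in for the 80k x 430 series.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + np.roll(X[:, 0], 1) > 0).astype(int)  # label depends on t and t-1

n_lags = 2
features = np.hstack([add_lags(X, n_lags), moving_average(X, n_lags + 1)])
target = y[n_lags:]  # drop the labels of the rows that have no full history

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, features, target, cv=5).mean())
```

Because neighbouring rows overlap once lags are added, it may be worth comparing these scores with a time-ordered split such as sklearn.model_selection.TimeSeriesSplit.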
A nice implementation of Maxent classifier can be found in openNLP.

Resources