Low probabilities when using xgboost on multiclass problem - probability

I am using an XGBClassifier to do text classification with more than two classes. After reaching about 65% accuracy, I looked into the probabilities the model outputs. For no test example does the model output more than 0.3 for any given class. Even when the model is correct, it is therefore choosing a class based on a difference of about 20%.
Is that something I should be worried about? I would expect the model to be sure (therefore outputting around 90%) at least in some cases. Is there even such an easy interpretation of the output probabilities? Or should I not be worried about the output probabilities as long as the class is correct?
Edit: I have around 100 classes, which are also imbalanced; roughly 3 categories make up 70% of the whole data. The sizes decrease more or less linearly.
The data itself consists of German texts, if anybody is interested.

Welcome to SO!
In the absence of any data sample or code it is hard to comment on what the issues are.
What are the class distributions in your data?
Say, for example, you had five classes equally distributed, i.e. about 20% each; then 0.20 is just the base rate.
With many more classes, or for a rare class, getting an output of 0.20 for some observations in a particular class could well be highly significant, i.e. the model is pretty sure about this allocation.
Is it possible to post some data and code? If the data is sensitive, anonymize it.
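One way to read such outputs is to compare the winning probability with how common that class is in the data. Here is a minimal sketch of that comparison; the function name `top_class_lift` and all the numbers are made up for illustration and are not taken from the asker's model.

```python
def top_class_lift(probs, priors):
    """Return (index of the most probable class, its probability / its prior)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best] / priors[best]

n_classes = 100

# Hypothetical case 1: 100 equally frequent classes, the model puts 0.30 on class 7.
uniform_priors = [1.0 / n_classes] * n_classes
probs = [0.70 / (n_classes - 1)] * n_classes
probs[7] = 0.30
rare_best, rare_lift = top_class_lift(probs, uniform_priors)

# Hypothetical case 2: class 0 alone makes up 25% of the data and gets 0.30.
skewed_priors = [0.25] + [0.75 / (n_classes - 1)] * (n_classes - 1)
probs = [0.30] + [0.70 / (n_classes - 1)] * (n_classes - 1)
common_best, common_lift = top_class_lift(probs, skewed_priors)

print(rare_best, round(rare_lift, 1))
print(common_best, round(common_lift, 1))
```

The same 0.30 is a large multiple of the base rate in the first case and barely above it in the second, so the class distribution matters when judging whether the model is "sure".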

Related

Which model to pick from K fold Cross Validation

I was reading about cross-validation and how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and run 10-fold cross-validation. I think each of the 10 fits will have different coefficient values. Which of the 10 should I pick as my final model, or as my parameter estimates?
Or do we use cross-validation only to find an average error (the average over the 10 models in our case) and compare it against another model?
If you build a linear regression model and run 10-fold cross-validation, each of the 10 fits will indeed have different coefficient values. The reason to use cross-validation is that it gives you a robust estimate of the error of your linear model, rather than evaluating it on a single train/test split, which could be unluckily hard or luckily easy. CV is more robust because it is very unlikely that all ten splits are lucky or all ten are unlucky.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
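To make the two roles concrete, here is a rough sketch in plain Python, assuming a one-feature linear regression and synthetic data invented for the example. The ten fold models are only used to estimate the error and are then thrown away; the coefficients you keep come from one final fit on all the training data.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(model, xs, ys):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k=10, seed=0):
    """Validation MSE of each of the k throwaway fold models."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(k):
        held_out = idx[f::k]                      # every k-th shuffled index
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return scores

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

fold_scores = k_fold_mse(xs, ys, k=10)
cv_error = sum(fold_scores) / len(fold_scores)    # the error estimate
final_model = fit_line(xs, ys)                    # the coefficients you keep
```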
Cross-validation is used to see how good your model's predictions are. It cleverly runs multiple tests on the same data by splitting it in different ways, as you probably know (which is useful if you don't have much training data).
As an example, it can be used to make sure you aren't overfitting. Basically, once your model is finished you evaluate it with cross-validation, and if you see that the error grows a lot somewhere, you go back to tweaking the parameters.
Edit:
Read the wikipedia for deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we split the data randomly in several different ways and validate the model's performance on each split. Note that the hyperparameters of the model remain the same throughout the cross-validation process.
In grid search we try to find the parameters that give the best results over a specific split of the data (say 70% train and 30% test). So in this case the dataset stays fixed while the parameter combinations of the same model vary.
Read more about cross-validation here.
Cross Validation is mainly used for the comparison of different models.
For each model, you can compute the average generalization error over the k validation sets. You can then choose the model with the lowest average generalization error as your optimal model.
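As a small sketch of that selection rule: the two candidate "models" below (predict the mean, fit a straight line) and the synthetic data are stand-ins chosen for the example, not anything from the question. Each candidate gets the same folds, and the one with the lower average validation error wins.

```python
import random

def cv_error(fit, xs, ys, k=5, seed=0):
    """Mean validation MSE of the learner `fit` over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    total = 0.0
    for f in range(k):
        held = idx[f::k]
        held_set = set(held)
        train = [i for i in idx if i not in held_set]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((predict(xs[i]) - ys[i]) ** 2 for i in held) / len(held)
    return total / k

def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return lambda x: a * x + (my - a * mx)

# Synthetic data (assumption): a noisy linear relationship.
rng = random.Random(1)
xs = [i / 5 for i in range(60)]
ys = [3 * x - 2 + rng.gauss(0, 1.0) for x in xs]

candidates = {"mean-only": fit_mean, "linear": fit_linear}
errors = {name: cv_error(fit, xs, ys) for name, fit in candidates.items()}
best = min(errors, key=errors.get)
print(best)
```

Using the same seed for every candidate keeps the folds identical, so the comparison is not affected by how the data happened to be split.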
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to know which method (SVM, Random Forest, etc) will perform best and we can pick that method to work further.
(Each method is trained and evaluated on every fold, an average metric is computed per method, and the best average metric identifies the method to select.)
After getting the information about the best method/ or best parameters we can train/retrain our model on the training dataset.
For parameters or coefficients, these can be determined by grid search techniques. See grid search
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing on it. Dividing such a small amount of data into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV comes to the rescue here. In this case, we don't need a separate validation set, but we still need to hold out the test data.
A model is trained on k-1 folds of the training data and the remaining fold is used for validation. The mean and standard deviation of the metric across folds show how well the model will perform in practice.

Keras «Powerful image classification with little data»: disparity between training and validation

I followed this post and first made it work on the dataset «Cats vs dogs». Then I substituted this set with my own images, which show the presence of an object vs the absence of that object. My dataset is even smaller than the one in the post. I only have 496 images containing that object for training and 160 images with that object for validation. For the «absent» class I have numerous samples (without that object in an image).
So far I haven't tried class_weight to tackle the imbalanced data problem. I just randomly chose 496 and 160 images without that object for training and validation, respectively. Basically, I do two-class image classification with a smaller dataset using the techniques in this post. I therefore expected worse performance in comparison, due to the insufficient data. But the actual problem is that the performance does not converge, as shown in the figures.
Could you tell me possible reasons for the lack of convergence? I guess the problem is related to my dataset, as the model works perfectly for «cats vs dogs». But I don't know how to address it. Are there any good techniques to make it converge?
Thank you.
This performance plot is based on VGG16, keeping all layers up to fully connected layer and training a small fully connected layer with 256 neurons.
This performance plot is also based on VGG16, but using 128 neurons instead of 256 neurons. Also I set epochs to 80.
Based on the suggestions provided so far, I'm thinking of building a custom convnet to fight the overfitting problem. But how do I do this? One of my worries is that a model with fewer layers will reduce training performance. Are there any guidelines for designing a good model for little data? Thank you.
Updates:
Now I think I know half of the reason for the convergence problem. I actually only have 100+ images of my own; the rest were downloaded from Flickr. I thought those images, with centered objects and better quality, would help the model. But later I found that they do not contribute to the accuracy and even worsen the output class probabilities. After removing the downloaded images, the performance went up a little and the convergence problem is gone. Note that I now only use 64*2 images for training and 48*2 images for testing. I also found that image augmentation does not improve the performance on my dataset. Without image augmentation, the training accuracy can reach 1. But if I add some image augmentation, the training accuracy is only around 85%. Has anybody had a similar experience? Why doesn't data augmentation always work? Is it because of our specific dataset? Thank you very much.
Your model is working great, but it's "overfitting". It means it's capable of memorizing all your training data without really "thinking". That leads to great training results and bad test results.
Common ways to avoid overfitting are:
More data - If you have little data, the chance of overfitting increases
Fewer units/layers - Make the model less capable, so it will stop memorizing and start thinking.
Add "dropout" to your layers (something that randomly discards part of the results during training to prevent the model from being too powerful)
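To show what the dropout item does mechanically, here is a framework-free sketch of the common "inverted dropout" variant. It is written in plain Python rather than as a Keras layer, and the function name and the rate of 0.5 are only illustrative choices.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: during training, zero each unit with probability
    `rate` (must be < 1) and rescale the survivors by 1 / (1 - rate) so the
    expected output stays the same. At test time the input passes through."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 10000

dropped = dropout(layer_output, rate=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
untouched = dropout(layer_output, rate=0.5, rng=rng, training=False)
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit to memorize a particular training image.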
Do more layers mean more power and performance?
If by performance you mean capability of learning, yes. (If you mean "speed", no)
Yes, more layers mean more power. But too much power leads to overfitting: the model is so capable that it can memorize training data.
So there is an optimal point:
A model that is not very capable will not give you the proper results (both training and test results will be bad)
A model that is too capable will memorize the training data (excellent training results, but bad test results)
A balanced model will learn the right things (good training and test results)
That's exactly why we use test data: it is data that is not presented during training, so the model cannot learn from it.

Negative Training Image Examples for CNN

I am using the Caffe framework for CNN training. My aim is to perform simple object recognition for a few basic object categories. Since pretrained networks are not an option for my intended use, I prepared my own training and test sets with about 1000 images for each of 2 classes (say chairs and cars).
The results are quite good. If I present a previously unseen image of a chair, it is likely classified as such; the same goes for a car image. My problem is that miscellaneous images that show neither class often get a very high confidence (=1) for one random class (which is not surprising given the one-sided training data, but it is a problem for my application). I thought about different solutions:
1) Adding a third class with about 1000 negative examples that show any objects except chairs and cars.
2) Adding more object categories in general, so that the network classifies other objects as such and no longer as a chair or car (of course this would require much effort). Maybe the broader prediction results would also show a more uniform distribution on negative images, making it possible to decide whether a target object is present based on a threshold?
Because grabbing random images from the internet as negative examples did not take much time, I already tested my first solution with about 1200 negative examples. It helped, but the problem remains, perhaps because there were just too few? My concern is that if I increase the number of negative examples, the class imbalance will lead to less accurate detection of the original classes.
After some research I found one person with a similar problem, but there was no solution:
Convolutional Neural Networks with Caffe and NEGATIVE IMAGES
My question is: Has anyone had the same problem and knows how to deal with it? What way would you recommend, adding more negative examples or more object categories or do you have any other recommendation?
The problem is not unique to Caffe or ConvNets. Any Machine Learning technique runs this risk. In the end, all classifiers take a vector in some input space (usually very high-dimensional), which means they partition that input space. You've given examples of two partitions, which helps to estimate the boundary between the two, but only that boundary. Both partitions have very, very large boundaries, precisely because the input space is so high-dimensional.
ConvNets do try to tackle the high-dimensionality of image data by having fairly small convolution kernels. Realistic negative data helps in training those, and the label wouldn't really matter. You could even use the input image as goal (i.e. train it as an autoencoder) when training the convolution kernels.
One general reason why you don't want to lump all counterexamples together is that they may be too varied. If you have a class A with some feature value in the range [-1,+1] on some scale, with counterexamples B in [-2,-1] and C in [+1,+2], lumping B and C together creates a range [-2,+2] for counterexamples which overlaps the real range. Given enough data and powerful enough classifiers, this is not fatal, but an SVM, for instance, can fail badly on this.
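The A/B/C example can be checked with a toy experiment. The sketch below uses a single threshold on one feature as a stand-in for a linear classifier; the sample points are invented, evenly spaced values inside the three ranges.

```python
def best_threshold_accuracy(points, labels):
    """Best accuracy any single-threshold rule can reach on 1-D data,
    trying both orientations (positive above or positive below)."""
    best = 0.0
    for t in sorted(set(points)):
        for positive_above in (True, False):
            correct = sum(
                ((x >= t) == positive_above) == y
                for x, y in zip(points, labels)
            )
            best = max(best, correct / len(points))
    return best

# Toy samples (assumption): A inside (-1, 1), B inside (-2, -1), C inside (1, 2).
a = [i / 10 for i in range(-9, 10, 2)]
b = [i / 10 for i in range(-19, -10, 2)]
c = [i / 10 for i in range(11, 20, 2)]

# B and C lumped into one "not A" class: A sits in the middle of it.
lumped = best_threshold_accuracy(
    a + b + c, [True] * len(a) + [False] * (len(b) + len(c)))

# B and C kept apart: each is separable from A by one threshold.
a_vs_b = best_threshold_accuracy(a + b, [True] * len(a) + [False] * len(b))
a_vs_c = best_threshold_accuracy(a + c, [True] * len(a) + [False] * len(c))

print(lumped, a_vs_b, a_vs_c)
```

With the lumped labels, any single threshold has to give up one whole side of the counterexamples, whereas the separate classes each have a clean boundary against A.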

4 fold cross validation | Caffe

So I am trying to perform 4-fold cross-validation on my training set. I have divided my training data into four quarters. I use three quarters for training and one quarter for validation. I repeat this three more times until all the quarters have been given a chance to be the validation set at least once.
Now after training I have four caffemodels. I test the models on my validation sets. I am getting different accuracy in each case. How should I proceed from here? Should I just choose the model with the highest accuracy?
Maybe it is a late reply, but in any case...
The short answer is that, if the performances of the four models are similar and good enough, then you re-train the model on all the data available, because you don't want to waste any of them.
The n-fold cross validation is a practical technique to get some insights on the learning and generalization properties of the model you are trying to train, when you don't have a lot of data to start with. You can find details everywhere on the web, but I suggest the open-source book Introduction to Statistical Learning, Chapter 5.
The general rule says that after you trained your n models, you average the prediction error (MSE, accuracy, or whatever) to get a general idea of the performance of that particular model (in your case maybe the network architecture and learning strategy) on that dataset.
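For the four caffemodels this boils down to something like the following. The accuracies are placeholder numbers typed in for the example, not measured results.

```python
from statistics import mean, stdev

# Placeholder validation accuracies of the four fold models (made-up values).
fold_accuracy = [0.81, 0.79, 0.84, 0.80]

cv_mean = mean(fold_accuracy)       # the performance estimate you report
cv_spread = stdev(fold_accuracy)    # how much it varies from fold to fold

print(f"accuracy: {cv_mean:.3f} +/- {cv_spread:.3f}")
```

A small spread suggests the four models behave similarly, which is the situation in which retraining once on all the data makes sense.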
The main idea is to assess the models learned on the training splits by checking whether they perform acceptably on the validation set. If they do not, then your models probably overfitted the training data. If the errors on both the training and validation splits are high, then the models should be reconsidered, since they have no predictive capacity.
In any case, I would also consider the advice of Yoshua Bengio, who says that for the kind of problem deep learning is meant for, you usually have enough data to simply go with a training/test split. In this case this answer on Stack Overflow could be useful to you.

Mahout - Naive Bayes Model Very Slow

I have about 44 Million training examples across about 6200 categories.
After training, the model comes out to be ~ 450MB
And while testing, with 5 parallel mappers (each given enough RAM), the classification proceeds at a rate of ~ 4 items a second which is WAY too slow.
How can I speed things up?
One way I can think of is to reduce the word corpus, but I fear losing accuracy. I had maxDFPercent set to 80.
Another way I thought of was to run the items through a clustering algorithm and empirically maximize the number of clusters while keeping the items within each category restricted to a single cluster. This would allow me to build separate models for each cluster and thereby (possibly) decrease training and testing time.
Any other thoughts?
Edit :
After some of the answers given below, I started contemplating some form of down-sampling: running a clustering algorithm, identifying groups of items that are "highly" close to one another, and then taking a union of a few samples from those "highly" close groups and other samples that are not that close to one another.
I also started thinking about data normalization techniques that incorporate edit distances while using n-grams (http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/NGramDistance.html)
I'm also considering using the Hadoop streaming API to leverage some of the ML libraries available in Python, such as those listed here http://pydata.org/downloads/ , and here http://scikit-learn.org/stable/modules/svm.html#svm (these, I think, use liblinear, mentioned in one of the answers below)
Prune stopwords and otherwise useless words (too low support etc.) as early as possible.
Depending on how you use clustering, it may actually make the test phase in particular even more expensive.
Try tools other than Mahout. I found Mahout to be really slow in comparison. It seems to carry a really high overhead somewhere.
Using fewer training examples would be an option. You will see that beyond a certain number of training examples, your classification accuracy on unseen examples won't increase. I would recommend training with 100, 500, 1000, 5000, ... examples per category and using 20% for cross-validating the accuracy. When it stops increasing, you have found the amount of data you need, which may be a lot less than you use now.
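Once you have measured accuracy at each training size, picking the point where it stops improving can be automated. This is only a sketch: the helper name, the tolerance of 0.005 and the accuracy values are all invented for illustration.

```python
def smallest_sufficient_size(curve, tolerance=0.005):
    """curve: (examples_per_category, validation_accuracy) pairs sorted by size.
    Return the first size after which no larger size improves accuracy
    by more than `tolerance`."""
    for i, (size, acc) in enumerate(curve):
        best_later = max((a for _, a in curve[i + 1:]), default=acc)
        if best_later - acc <= tolerance:
            return size
    raise ValueError("curve must not be empty")

# Placeholder learning curve (made-up numbers, not real measurements).
curve = [
    (100, 0.61),
    (500, 0.70),
    (1000, 0.74),
    (5000, 0.78),
    (10000, 0.781),
    (50000, 0.782),
]

enough = smallest_sufficient_size(curve)
print(enough)
```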
Another approach would be to use another library. For document classification I find liblinear very, very fast. It may be more low-level than Mahout.
"but i fear losing accuracy" Have you actually tried using fewer features or fewer documents? You may not lose as much accuracy as you fear. There may be a few things at play here:
Such a high number of documents is not likely to be from the same time period. Over time, the content of a stream will inevitably drift and words indicative of one class may become indicative of another. In a way, adding data from this year to a classifier trained on last year's data is just confusing it. You may get much better performance if you train on less data.
The majority of features are not helpful, as @Anony-Mousse said already. You might want to perform some form of feature selection before you train your classifier. This will also speed up training. I've had good results in the past with mutual information.
I've previously trained classifiers for a data set of similar scale and found the system worked best with only 200k features, and using any more than 10% of the data for training did not improve accuracy at all.
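For the mutual information suggestion above, here is a minimal sketch of scoring terms by the mutual information between "term occurs in the document" and the class label. The four-document corpus is a toy invented for the example, and documents are assumed to be already tokenized into sets of words.

```python
from collections import Counter
from math import log2

def mutual_information(docs, labels, term):
    """MI in bits between term presence and class label."""
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    present = Counter(term in doc for doc in docs)
    classes = Counter(labels)
    mi = 0.0
    for (has_term, label), count in joint.items():
        p_joint = count / n
        p_independent = (present[has_term] / n) * (classes[label] / n)
        mi += p_joint * log2(p_joint / p_independent)
    return mi

# Toy corpus (assumption): each document is a set of tokens.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]

vocabulary = sorted(set().union(*docs))
scores = {t: mutual_information(docs, labels, t) for t in vocabulary}
top_terms = sorted(vocabulary, key=scores.get, reverse=True)
print(top_terms[:2])
```

Keeping only the highest-scoring terms shrinks the model and the per-document work at classification time; how many to keep is something to tune against held-out accuracy.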
PS Could you tell us a bit more about your problem and data set?
Edit after question was updated:
Clustering is a good way of selecting representative documents, but it will take a long time. You will also have to re-run it periodically as new data come in.
I don't think edit distance is the way to go. Typical algorithms are quadratic in the length of the input strings, and you might have to run for each pair of words in the corpus. That's a long time!
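To see where the cost comes from, here is the textbook dynamic program for Levenshtein distance, given as a generic sketch and not as the Lucene NGramDistance implementation. The nested loops visit every pair of positions, and the pair count at the end assumes a hypothetical vocabulary of 100,000 words.

```python
def edit_distance(a, b):
    """Levenshtein distance; time is proportional to len(a) * len(b)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]

distance = edit_distance("kitten", "sitting")

# Hypothetical vocabulary size, only to show how fast the pair count grows.
vocab_size = 100_000
word_pairs = vocab_size * (vocab_size - 1) // 2
print(distance, word_pairs)
```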
I would again suggest that you give random sampling a shot. You say you are concerned about accuracy, but are using Naive Bayes. If you wanted the best model money can buy, you would go for a non-linear SVM, and you probably wouldn't live to see it finish training. People resort to classifiers with known issues (there's a reason Naive Bayes is called Naive) because they are much faster than the alternative but performance will often be just a tiny bit worse. Let me give you an example from my experience:
RBF SVM- 85% F1 score - training time ~ month
Linear SVM- 83% F1 score - training time ~ day
Naive Bayes- 82% F1 score - training time ~ day
You find the same thing in the literature: paper. Out of curiosity, what kind of accuracy are you getting?

Resources