Multiclass classification confidence scores using predict_proba of SGDClassifier

I am using logistic regression in SGDClassifier to perform multi-class classification over ~10k categories.
To get a confidence score for the predicted result I am using the predict_proba function.
But I am getting prediction probability values of 0.00026091, 0.00049697, 0.00019632 for both correct and wrong predictions.
Please suggest a way to normalize the score so that I can filter results on the probability value.

If the probability values of all classes are very low, it might mean that your classifier has a hard time classifying the samples. You might want to do some feature engineering or try another model.
To normalize the values, have a look at scikit-learn's MinMaxScaler. This will scale the data to numbers between 0 and 1. But as I said, if the probabilities for all classes are very low, you won't get a good classification result.
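A minimal sketch of that idea, assuming a recent scikit-learn (where the logistic loss is spelled loss="log_loss") and toy data from make_classification standing in for your ~10k-category problem; the 0.5 cut-off is just an illustrative choice, not a recommendation:

    # Sketch: get predict_proba scores from SGDClassifier and rescale them
    # with MinMaxScaler; X, y and the threshold are placeholder assumptions.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import MinMaxScaler

    X, y = make_classification(n_samples=2000, n_features=50,
                               n_informative=30, n_classes=10, random_state=0)

    # loss="log_loss" gives logistic regression, which supports predict_proba
    clf = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)
    proba = clf.predict_proba(X)              # shape (n_samples, n_classes)

    # MinMaxScaler rescales each class column to the range [0, 1]
    scaled = MinMaxScaler().fit_transform(proba)

    # Keep only predictions whose scaled top score clears a threshold
    top_score = scaled.max(axis=1)
    confident = top_score > 0.5
    print(confident.mean())

Keep in mind that with ~10k classes the per-class probabilities sum to 1, so each raw value is necessarily tiny (around 1/10000 on average); any filtering has to work on relative rather than absolute values.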
Hope that helps

Related

How can I measure the probability of error of a trained model, in particular the random forest?

To do binary classification of a set of images, I trained a random forest on a set of data.
I now want to evaluate the error probability of my model.
For that, I did two things and I don't know what corresponds to this error probability:
I calculated the accuracy using k-fold cross-validation.
I tested my model and then calculated the ratio of misclassified images to the total number of images.
What is the correct way to calculate the probability of error for my trained model?
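For concreteness, here is a sketch of the two procedures described above, assuming scikit-learn's RandomForestClassifier and synthetic data in place of the real image features (X, y, the split size and cv=5 are placeholder choices):

    # Two error estimates mentioned in the question:
    # 1) k-fold cross-validated accuracy, 2) hold-out misclassification ratio.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    rf = RandomForestClassifier(random_state=0)

    # 1) cross-validated accuracy; the error estimate is 1 - mean accuracy
    cv_acc = cross_val_score(rf, X_tr, y_tr, cv=5, scoring="accuracy").mean()
    print("CV error estimate:", 1 - cv_acc)

    # 2) misclassification ratio on a separate test set
    rf.fit(X_tr, y_tr)
    test_err = (rf.predict(X_te) != y_te).mean()
    print("Hold-out error estimate:", test_err)

Both are estimates of the misclassification rate computed on data not used for fitting.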

Typical scoring parameter choices for cross-validation of ranking classifier rank:pairwise

I am building an XGBoost ranking classifier using Python's xgboost.sklearn.XGBClassifier (XGBClassifier). In my problem, I try to classify ranking labels that take the values 0, 1, 2, 3. In the classifier setup, I used objective = "rank:pairwise". I now want to run cross-validation with sklearn.model_selection.cross_val_score (cross_val_score).
Are there any canonical choices of scoring function to assess the rank outcome classification performance?
I am thinking scoring = "neg_mean_squared_error" seems like an OK choice, as it weights the distance between the predicted and true labels, i.e. accounts for the ranking character of the outcome.
I hope to get other comments/opinions/experiences on that.
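For reference, this is roughly the setup I have in mind (synthetic data; depending on the xgboost version, XGBClassifier may not accept objective="rank:pairwise" and XGBRanker might be needed instead, so treat this only as a sketch):

    # Sketch of cross-validating the setup described above with a
    # distance-aware scorer; X and y are synthetic stand-ins.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from xgboost.sklearn import XGBClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = rng.integers(0, 4, size=500)          # ranking labels 0, 1, 2, 3

    model = XGBClassifier(objective="rank:pairwise")

    # neg_mean_squared_error penalizes a prediction by its squared distance
    # from the true label, which respects the ordinal nature of the outcome
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(scores.mean())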

How does a bagging classifier (averaging) work?

How does a bagging classifier work when it averages rather than votes? I am working on a bagging classifier and want to use an average of the models, but when I bag models the result is a continuous value rather than a categorical value. Can I use averaging here? If yes, how?
You have to give more details on what programming language and library you are using.
If you are doing regression, the bagging model can give you the average or a weighted average.
If you are doing classification, then it can be voting or weighted voting.
However, if you are doing binary classification, then the average of the 1s and 0s can be used to give you a pseudo-probability or confidence for the prediction (see the sketch below).
You can do this for non-binary classification using the one-vs-all method to get probabilities for all possible classes.
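To make the binary case concrete, here is a small sketch assuming scikit-learn (the question does not name a library, so BaggingClassifier, the decision-tree base model and the 0.5 cut-off are all my assumptions; the parameter is estimator in scikit-learn >= 1.2, base_estimator in older releases):

    # Averaging in binary bagging: the mean of the base models' 0/1
    # predictions acts as a pseudo-probability that can be thresholded.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=25, random_state=0).fit(X, y)

    # Average of the individual trees' hard 0/1 predictions per sample
    votes = np.mean([est.predict(X) for est in bag.estimators_], axis=0)

    # Threshold the averaged score to get back a categorical prediction
    pred = (votes >= 0.5).astype(int)
    print(votes[:5], pred[:5])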

ROC for predictions - how to define class labels

I have a set of predictions from a model, and a set of true values of the observations, and I want to create an ROC.
The quality of the prediction (in absolute error terms) is independent of the magnitude of the prediction. So I have a set of predictions (pred(1), pred(2), ..., pred(n)) and observations (obs(1), obs(2), ..., obs(n)).
Someone told me to create the elements of my binary classification vector label(i) as label(i) = ifelse(|obs(i) - pred(i)| < tol, 1, 0) and then calculate the AUC (tol is some prespecified tolerance). So for each prediction, if it is close to the corresponding observation, the corresponding label is 1, otherwise it is 0.
But I don't see how the suggested labeling is valid, as higher pred() values will not necessarily discriminate my binary classification, i.e. prediction values do not serve to "rank" the quality of my predictions (i.e., a given threshold does not divide my data naturally). Can someone please shed some light on what to do here? Is the suggestion given above a valid one? Or is an ROC inappropriate to use here?
ROC analysis is defined for binary classification, where the observed labels can take two values (binary), and your predictions can be any sort of numbers. There are extensions of ROC analysis to multi-class classification, but your question suggests that your observations are some sort of continuous measurement. You could binarize them (something like label(i) = ifelse(obs(i) > someValue, 1, 0)), but it would be invalid for the labels to depend on the classifier's output: they must be some sort of truth that is independent of your classifier.
Alternatively, if your observations are continuous, you should assess the quality of your predictions with a correlation coefficient or a similar measure.
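As an illustration of the distinction, a hedged sketch with made-up numbers (someValue and the noise level are arbitrary placeholders):

    # ROC/AUC needs binary truth that does not depend on the predictions;
    # for continuous observations a correlation measure is the alternative.
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    obs = rng.normal(size=200)                     # continuous observations
    pred = obs + rng.normal(scale=0.5, size=200)   # noisy predictions

    # Valid ROC use: binarize the observations against a fixed cut-off
    some_value = 0.0
    labels = (obs > some_value).astype(int)        # ifelse(obs > someValue, 1, 0)
    print("AUC:", roc_auc_score(labels, pred))

    # Alternative for continuous observations: a correlation coefficient
    print("Pearson r:", pearsonr(obs, pred)[0])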

What does a Bayesian Classifier score represent?

I'm using the ruby classifier gem whose classifications method returns the scores for a given string classified against the trained model.
Is the score a percentage? If so, is the maximum difference 100 points?
It's the logarithm of a probability. With a large trained set, the actual probabilities are very small numbers, so the logarithms are easier to compare. Theoretically, scores will range from infinitesimally close to zero down to negative infinity. 10**score * 100.0 will give you the actual probability, which indeed has a maximum difference of 100.
Actually, to calculate the probability for a typical naive Bayes classifier where b is the base of the scores, it is b^score / (1 + b^score). This is the inverse logit (http://en.wikipedia.org/wiki/Logit). However, given the independence assumptions of the NBC, these scores tend to be too high or too low, and probabilities calculated this way will accumulate at the boundaries. It is better to calculate the scores on a holdout set and do a logistic regression of accurate (1 or 0) on score to get a better feel for the relationship between score and probability.
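A sketch of both steps with made-up scores (the base b = 10 matches the other answer's log-base assumption; the holdout scores and accurate/inaccurate labels are purely illustrative):

    # Turn log-scores into probabilities via the inverse logit, then
    # calibrate by regressing accurate(1/0) on score over a holdout set.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    b = 10.0                                   # assumed base of the log-scores
    scores = np.array([-1.2, -0.3, 0.4, 1.5, 2.1])

    # Inverse logit: b**score / (1 + b**score) maps scores into (0, 1)
    probs = b**scores / (1.0 + b**scores)
    print(probs)

    # Calibration: fit whether each holdout prediction was accurate (1/0)
    # against its score, then read off calibrated probabilities
    holdout_scores = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
    accurate = np.array([0, 0, 1, 0, 1, 1])
    calib = LogisticRegression().fit(holdout_scores, accurate)
    print(calib.predict_proba(holdout_scores)[:, 1])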
From a Jason Rennie paper:
2.7 Naive Bayes Outputs Are Often Overconfident
Text databases frequently have 10,000 to 100,000 distinct vocabulary words; documents often contain 100 or more terms. Hence, there is great opportunity for duplication.
To get a sense of how much duplication there is, we trained a MAP Naive Bayes model with 80% of the 20 Newsgroups documents. We produced p(c|d;D) (posterior) values on the remaining 20% of the data and show statistics on max_c p(c|d;D) in table 2.3. The values are highly overconfident. 60% of the test documents are assigned a posterior of 1 when rounded to 9 decimal digits. Unlike logistic regression, Naive Bayes is not optimized to produce reasonable probability values. Logistic regression performs joint optimization of the linear coefficients, converging to the appropriate probability values with sufficient training data. Naive Bayes optimizes the coefficients one-by-one. It produces realistic outputs only when the independence assumption holds true. When the features include significant duplicate information (as is usually the case with text), the posteriors provided by Naive Bayes are highly overconfident.
