How to calculate AUC when processing in batches in deep learning - metrics

I have a bit of a problem when calculating the AUC score.
How should I compute the AUC score when I evaluate in batches during the validation phase?
I am wondering about the following options:
1) Calculate the AUC score for each batch; the final result is the average of the per-batch scores.
2) Given the logits of each batch, calculate the probabilities for that batch (using the softmax function) -> concatenate all the probabilities -> calculate the AUC score.
3) Concatenate the logits of all batches -> calculate the probabilities only once (using the softmax function) -> calculate the AUC score.
How should I do it?
Thank you.
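For reference, here is a minimal sketch of option 3 (concatenate all logits, apply softmax once, compute the AUC once), assuming the per-batch logits and labels have been collected as NumPy arrays and that scikit-learn's roc_auc_score is available; the array shapes and class count are made up for illustration. Since softmax is applied row by row, options 2 and 3 produce the same probabilities, while averaging per-batch AUCs (option 1) generally does not equal the AUC over the full validation set.

import numpy as np
from scipy.special import softmax
from sklearn.metrics import roc_auc_score

# Hypothetical per-batch outputs collected during the validation loop:
# each batch gives a (batch_size, n_classes) array of logits and a
# (batch_size,) array of integer class labels.
rng = np.random.default_rng(0)
logit_batches = [rng.normal(size=(32, 5)) for _ in range(10)]
label_batches = [rng.integers(0, 5, size=32) for _ in range(10)]

# Option 3: concatenate all logits, apply softmax once, compute the AUC once.
logits = np.concatenate(logit_batches, axis=0)         # (n_samples, n_classes)
labels = np.concatenate(label_batches, axis=0)         # (n_samples,)
probs = softmax(logits, axis=1)                        # row-wise softmax

auc = roc_auc_score(labels, probs, multi_class="ovr")  # one-vs-rest multiclass AUC
print(auc)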

Related

How to calculate the standard errors of ERGM predicted probabilities?

I'm having trouble estimating the standard errors of the predicted probabilities from an ERGM model, in order to calculate a confidence interval. Getting the predicted probabilities is not a problem, but I want to get a sense of the uncertainty surrounding the predictions.
Below is a reproducible example based on the data set of marriage and business ties among Renaissance Florentine families.
library(statnet)
data(flo)
# Build an undirected network from the Florentine marriage adjacency matrix
flomarriage <- network(flo, directed = FALSE)
flomarriage
# Attach family wealth as a vertex attribute
flomarriage %v% "wealth" <- c(10,36,27,146,55,44,20,8,42,103,48,49,10,48,32,3)
flomarriage
# Fit an ERGM with an edges term and the absolute wealth difference
gest <- ergm(flomarriage ~ edges + absdiff("wealth"))
summary(gest)
# Predicted probability of a tie for a wealth difference of 10
plogis(coef(gest)[['edges']] + coef(gest)[['absdiff.wealth']]*10)
Based on the model, it is estimated that a wealth difference of 10 corresponds to a 0.182511 probability of a tie. My first question is, is this a correct interpretation? And my second question is, how does one calculate the standard error of this probability?
This is the correct interpretation for a dyad-independent model such as this one. For a dyad-dependent model, it would be the conditional probability given the rest of the network.
You can obtain the standard error of the prediction on the logit scale by rewriting your last line as a dot product of a weight vector and the coefficient vector:
eta <- sum(c(1,10)*coef(gest))
plogis(eta)
then, since vcov(gest) is the covariance matrix of parameter estimates, and using the variance formulas (e.g., http://www.stat.cmu.edu/~larry/=stat401/lecture-13.pdf),
(var.eta <- c(1,10)%*%vcov(gest)%*%c(1,10))
You can then get the variance (and the standard error) of the predicted probability using the Delta Method (e.g., https://blog.methodsconsultants.com/posts/delta-method-standard-errors/). However, unless you require the standard error as such, as opposed to a confidence interval, my advice would be to first calculate the interval for eta (i.e., eta + qnorm(c(0.025,0.0975))*sqrt(var.eta)), then call plogis() on the endpoints.
Really very helpful answer! Many thanks for it.
Here's just a small, almost cosmetic note about the suggested formula: 1) the upper bound in qnorm should be 0.975, not 0.0975, and 2) use matrix multiplication before sqrt --> eta + qnorm(c(0.025,0.975))%*%sqrt(var.eta).
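For anyone who wants to see the arithmetic spelled out numerically, here is a small sketch of the same calculation in Python/NumPy with made-up coefficient and covariance values standing in for coef(gest) and vcov(gest) (the real numbers come from the fitted ergm object); scipy's expit plays the role of R's plogis.

import numpy as np
from scipy.special import expit   # inverse logit, the analogue of R's plogis
from scipy.stats import norm

# Made-up values standing in for coef(gest) and vcov(gest).
beta = np.array([-2.5, 0.011])                  # (edges, absdiff.wealth)
V = np.array([[0.09, -0.0006],
              [-0.0006, 0.00001]])

w = np.array([1.0, 10.0])                       # weight vector for a wealth difference of 10

eta = w @ beta                                  # prediction on the logit scale
var_eta = w @ V @ w                             # w' V w, variance of eta
se_eta = np.sqrt(var_eta)

# Confidence interval on the logit scale, then mapped to a probability.
ci_eta = eta + norm.ppf([0.025, 0.975]) * se_eta
print(expit(eta), expit(ci_eta))

# Delta-method standard error of the probability itself, if it is needed:
se_prob = se_eta * expit(eta) * (1 - expit(eta))
print(se_prob)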

Multiclass classification confidence score using predict_proba of SGDClassifier

I am using logistic regression in SGDClassifier to perform multi-class classification over ~10k categories.
To get a confidence score for the predicted result I am using the predict_proba function.
But I am getting prediction probability values like 0.00026091, 0.00049697, 0.00019632 for both correct and wrong predictions.
Please suggest a way to normalize the score so that I can filter results by their probability values.
If the probability values of all classes are very low, it might mean that your classifier has a hard time classifying the samples. You might want to do some feature engineering or try another model.
To normalize the values, have a look at scikit-learn's MinMaxScaler. This will scale the data to numbers between 0 and 1. But as I said, if the probability for all values is very low, you won't get a good classification result.
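A minimal sketch of that suggestion, using a toy dataset as a stand-in for the ~10k-category problem (the data, sizes, and threshold here are all placeholders): fit SGDClassifier with the logistic loss, take the top predict_proba value per sample as the confidence, rescale with MinMaxScaler, and filter on a threshold. Recent scikit-learn versions spell the loss "log_loss"; older releases use "log".

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in data (the real problem has ~10k categories and its own features).
X, y = make_classification(n_samples=2000, n_features=50, n_informative=30,
                           n_classes=10, random_state=0)

# The logistic-regression loss is what enables predict_proba on SGDClassifier
# ("log_loss" in recent scikit-learn, "log" in older releases).
clf = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)

proba = clf.predict_proba(X)                  # (n_samples, n_classes), rows sum to 1
confidence = proba.max(axis=1)                # per-sample confidence = top probability

# The suggestion above: rescale the confidences to [0, 1] over the whole set.
# This changes only the scale, not the ranking of the predictions.
scaled = MinMaxScaler().fit_transform(confidence.reshape(-1, 1)).ravel()

keep = scaled > 0.5                           # arbitrary example threshold
print(keep.sum(), "of", len(y), "predictions kept")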
Hope that helps

Large Residual-Online Outlier Detection for Kalman Filter

I am trying to find outliers in the residuals. I used three algorithms; basically, if the residual magnitudes are small the algorithms perform well, but if the residual magnitudes are large they do not.
1) X^2 = (y - h(x))^T S^(-1) (y - h(x)) - Chi-Square Test
If the matrix is 3x3, the number of degrees of freedom is 4, so the test is
X^2 > 13.277
2) Residual(i) > 3*sqrt(H P H^T + R) - Measurement Noise Covariance
3) Residual(i) > 3-Sigma
I have applied these three algorithms to find the outliers: the first is the Chi-Square test, the second checks against the measurement noise covariance, and the third uses the 3-sigma rule.
Can you give any suggestions about these algorithms, or suggest a new approach I could implement?
The third case cannot be correct in all cases, because it will fail whenever there is a large residual. The second one is more stable because it is tied to the measurement noise covariance, so your residual threshold changes according to the measurement covariance error.
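To make the first check concrete, here is a minimal sketch of the chi-square (Mahalanobis) gating test; the H, P, R, and residual values are all made up and only illustrate the formula above.

import numpy as np
from scipy.stats import chi2

# Hypothetical filter quantities for a 3-dimensional measurement.
H = np.eye(3)                          # measurement Jacobian
P = np.diag([0.5, 0.5, 0.2])           # state covariance
R = np.diag([0.1, 0.1, 0.05])          # measurement noise covariance
residual = np.array([0.3, -2.5, 0.1])  # y - h(x)

# Innovation covariance and normalized squared residual.
S = H @ P @ H.T + R
chi_sq = residual @ np.linalg.solve(S, residual)   # X^2 = r^T S^(-1) r

dof = 3                                 # often the measurement dimension; the question uses 4
threshold = chi2.ppf(0.99, df=dof)      # chi2.ppf(0.99, 4) is roughly 13.277
print("X^2 =", chi_sq, "threshold =", threshold, "outlier:", chi_sq > threshold)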

ROC for predictions - how to define class labels

I have a set of predictions from a model, and a set of true values of the observations, and I want to create an ROC.
The quality of the prediction (in absolute error terms) is independent of the magnitude of the prediction. So I have a set of predictions (pred(1), pred(2), ..., pred(n)) and observations (obs(1), obs(2), ..., obs(n)).
Someone told me to create the elements of my binary classification vector label(i) as label(i) = ifelse(|obs(i) - pred(i)| < tol, 1, 0) and then calculate the AUC (tol is some prespecified tolerance). So for each prediction, if it is close to the corresponding observation, the corresponding label is 1; otherwise it is 0.
But I don't see how the suggested labeling is valid, as higher pred() values will not necessarily discriminate my binary classification, i.e. the prediction values do not serve to "RANK" the quality of my predictions (a given threshold does not divide my data naturally). Can someone please shed some light on what to do here? Is the suggestion given above a valid one? Or is an ROC inappropriate to use here?
ROC analysis is defined for binary classification, where the observed labels can take two values (binary) and your predictions are any sort of numbers. There are extensions of ROC analysis to multi-class classification, but your question suggests that your observations are some sort of continuous measurement. You could binarize them (something like label(i) = ifelse(obs(i) > someValue, 1, 0)), but it would be invalid for the labels to depend on the classification: they must be some sort of truth that is independent of your classifier.
Alternatively, if your observations are continuous, you should assess the quality of your predictions with a correlation coefficient or a similar measure.
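For illustration, a small sketch of both suggestions with simulated data standing in for obs() and pred(): binarize the observations with a cutoff that does not depend on the predictions and let the predictions do the ranking, or score the continuous observations with a correlation. The cutoff chosen here is arbitrary.

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

# Simulated stand-ins for the observations and predictions.
rng = np.random.default_rng(0)
obs = rng.normal(size=200)
pred = obs + rng.normal(scale=0.5, size=200)     # noisy predictions

# Binarize the *observations* with a cutoff that does not depend on the
# predictions, then let the predictions rank the cases.
some_value = np.median(obs)                      # hypothetical cutoff
labels = (obs > some_value).astype(int)
print("AUC:", roc_auc_score(labels, pred))

# For continuous observations, a correlation-type measure is the alternative.
r, _ = pearsonr(obs, pred)
print("Pearson r:", r)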

What does a Bayesian Classifier score represent?

I'm using the ruby classifier gem whose classifications method returns the scores for a given string classified against the trained model.
Is the score a percentage? If so, is the maximum difference 100 points?
It's the logarithm of a probability. With a large trained set, the actual probabilities are very small numbers, so the logarithms are easier to compare. Theoretically, scores will range from infinitesimally close to zero down to negative infinity. 10**score * 100.0 will give you the actual probability, which indeed has a maximum difference of 100.
Actually, to calculate the probability for a typical naive Bayes classifier where b is the base, it is b^score/(1+b^score). This is the inverse logit (http://en.wikipedia.org/wiki/Logit). However, given the independence assumptions of the NBC, these scores tend to be too high or too low, and probabilities calculated this way will accumulate at the boundaries. It is better to calculate the scores on a holdout set and do a logistic regression of accurate (1 or 0) on score to get a better feel for the relationship between score and probability.
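To see the two readings above side by side numerically, here is a tiny sketch with a made-up score, assuming base-10 logs as in the answers:

# A made-up log-score as might be returned for one class.
score = -7.3

# Reading 1: the score is log10 of the class probability.
prob_direct = 10 ** score
print(prob_direct, prob_direct * 100.0)   # as a fraction and as a percentage

# Reading 2 (inverse logit with base b = 10): p = b^score / (1 + b^score).
b = 10
prob_logit = b ** score / (1 + b ** score)
print(prob_logit)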
From a Jason Rennie paper:
2.7 Naive Bayes Outputs Are Often Overconfident
Text databases frequently have 10,000 to 100,000 distinct vocabulary words; documents often contain 100 or more terms. Hence, there is great opportunity for duplication.
To get a sense of how much duplication there is, we trained a MAP Naive Bayes model with 80% of the 20 Newsgroups documents. We produced p(c|d;D) (posterior) values on the remaining 20% of the data and show statistics on max_c p(c|d;D) in table 2.3. The values are highly overconfident. 60% of the test documents are assigned a posterior of 1 when rounded to 9 decimal digits. Unlike logistic regression, Naive Bayes is not optimized to produce reasonable probability values. Logistic regression performs joint optimization of the linear coefficients, converging to the appropriate probability values with sufficient training data. Naive Bayes optimizes the coefficients one-by-one. It produces realistic outputs only when the independence assumption holds true. When the features include significant duplicate information (as is usually the case with text), the posteriors provided by Naive Bayes are highly overconfident.
