How can I get the appropriate metric (accuracy, F1 etc.) for each label?
I use the trainer from Transformers.
I would like to have an output similar to the sklearn.metrics.classification_report
Thanks for your help!
You can print the sklearn classification report during the training phase by adjusting the compute_metrics() function and passing it to the trainer. For a small demo, you can change the function in the official Hugging Face example to the following:
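import numpy as np
from sklearn.metrics import classification_report

def compute_metrics(eval_pred):
    # `task` and `metric` are defined earlier in the Hugging Face example
    predictions, labels = eval_pred
    if task != "stsb":
        # classification: pick the highest-scoring class
        predictions = np.argmax(predictions, axis=1)
    else:
        # stsb is a regression task with a single output column
        predictions = predictions[:, 0]
    print(classification_report(labels, predictions))
    return metric.compute(predictions=predictions, references=labels)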
After each epoch you get the following output:
              precision    recall  f1-score   support

           0       0.76      0.36      0.49       322
           1       0.77      0.95      0.85       721

    accuracy                           0.77      1043
   macro avg       0.77      0.66      0.67      1043
weighted avg       0.77      0.77      0.74      1043
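If you want the per-label numbers in the metrics dictionary rather than only printed, one option is to have compute_metrics return them as a flat dictionary of floats. The following is a minimal sketch, not the official example: the helper per_label_metrics and the key names such as f1_0 are made up for illustration, and the scores are computed in plain Python so the snippet runs without numpy or sklearn.

```python
from collections import Counter


def per_label_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    tp, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            tp[t] += 1
    out = {}
    for label in sorted(set(true_count) | set(pred_count)):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[label] = (precision, recall, f1, true_count[label])
    return out


def compute_metrics(eval_pred):
    """Hypothetical Trainer hook: flatten the per-label scores into one dict."""
    logits, labels = eval_pred
    # argmax over the class dimension, written without numpy
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(label) for label in labels]
    metrics = {}
    for label, (p, r, f1, _support) in per_label_metrics(labels, preds).items():
        metrics[f"precision_{label}"] = p
        metrics[f"recall_{label}"] = r
        metrics[f"f1_{label}"] = f1
    return metrics


# Toy logits for four samples and two classes (illustrative data only)
demo = compute_metrics(([[2.0, 0.1], [0.3, 1.2], [0.9, 0.4], [0.2, 0.7]], [0, 1, 1, 1]))
print(demo)
```

In a real run you would probably keep sklearn for the computation and only copy the flattening step.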
For more fine-grained control during your training phase, you can also define a callback to customise the behaviour of the training loop at different stages.
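from transformers import Trainer, TrainerCallback

class PrintClassificationCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, logs=None, **kwargs):
        print("Called after evaluation phase")

trainer = Trainer(
    ...,  # your existing model, args, datasets and compute_metrics (elided here)
    callbacks=[PrintClassificationCallback],
)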
After your training phase you can also use your trained model in a classification pipeline to pass one or more samples to your model and get the corresponding prediction labels. For example
from transformers import pipeline
from sklearn.metrics import classification_report

text_classification_pipeline = pipeline("text-classification", model="MyFinetunedModel")
X = ["This is a cat sentence", "This is a dog sentence", "This is a fish sentence"]
y_act = ["LABEL_1", "LABEL_2", "LABEL_3"]
labels = ["LABEL_1", "LABEL_2", "LABEL_3"]
y_pred = [result["label"] for result in text_classification_pipeline(X)]
# classification_report expects (y_true, y_pred), in that order
print(classification_report(y_act, y_pred, labels=labels))

              precision    recall  f1-score   support

     LABEL_1       0.33      1.00      0.50         1
     LABEL_2       0.00      0.00      0.00         1
     LABEL_3       0.00      0.00      0.00         1

    accuracy                           0.33         3
   macro avg       0.11      0.33      0.17         3
weighted avg       0.11      0.33      0.17         3
Hope it helps.
I am working on a dataset to test the association between empirical antibiotics (variable emp, the antibiotics are cefuroxime or ceftriaxone compared with a reference antibiotic) and 30-day mortality (variable mort30). The data comes from patients admitted in 6 hospitals (variable site2) with a specific type of infection. Therefore, I would like to adjust for this clustering of patients on hospital level.
First I did this using the coxme() function for mixed models. However, based on visual inspection of the Schoenfeld residuals, there were violations of the proportional hazards assumption, so I tried adding a time transformation (tt) to the model. Unfortunately, coxme() does not offer time transformations.
Therefore, I tried other options to adjust for the clustering, including coxph() combined with frailty() and cluster(). Surprisingly, the standard errors I get using the cluster() option are much smaller than those from coxme() or frailty().
Does anyone know the explanation for this, and which option would provide the most reliable estimates?
1) Using coxme:
> uni.mort <- coxme(Surv(FUdur30, mort30num) ~ emp + (1 | site2), data = total.pop)
> summary(uni.mort)
Cox mixed-effects model fit by maximum likelihood
  Data: total.pop
  events, n = 58, 253
  Iterations= 24 147
                    NULL Integrated    Fitted
Log-likelihood -313.8427  -307.6543 -305.8967

                  Chisq   df         p  AIC  BIC
Integrated loglik 12.38 3.00 0.0061976 6.38 0.20
 Penalized loglik 15.89 3.56 0.0021127 8.77 1.43

Model:  Surv(FUdur30, mort30num) ~ emp + (1 | site2)
Fixed coefficients
                    coef exp(coef)  se(coef)    z    p
empCefuroxime  0.5879058  1.800214 0.6070631 0.97 0.33
empCeftriaxone 1.3422317  3.827576 0.5231278 2.57 0.01

Random effects
 Group Variable  Std Dev   Variance
 site2 Intercept 0.2194737 0.0481687
> confint(uni.mort)
                    2.5 %   97.5 %
empCefuroxime  -0.6019160 1.777728
empCeftriaxone  0.3169202 2.367543
2) Using frailty()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp + frailty(site2), data = total.pop)
> summary(uni.mort)
coxph(formula = Surv(FUdur30, mort30num) ~ emp + frailty(site2),
    data = total.pop)

  n= 253, number of events= 58

                 coef se(coef)    se2 Chisq  DF      p
empCefuroxime  0.6302   0.6023 0.6010  1.09 1.0 0.3000
empCeftriaxone 1.3559   0.5221 0.5219  6.75 1.0 0.0094
frailty(site2)                         0.40 0.3 0.2900

               exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime      1.878     0.5325    0.5768     6.114
empCeftriaxone     3.880     0.2577    1.3947    10.796

Iterations: 7 outer, 27 Newton-Raphson
     Variance of random effect= 0.006858179   I-likelihood = -307.8
Degrees of freedom for terms= 2.0 0.3
Concordance= 0.655  (se = 0.035 )
Likelihood ratio test= 12.87  on 2.29 df,   p=0.002
3) Using cluster()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp, cluster = site2, data = total.pop)
> summary(uni.mort)
coxph(formula = Surv(FUdur30, mort30num) ~ emp, data = total.pop,
    cluster = site2)

  n= 253, number of events= 58

                 coef exp(coef) se(coef) robust se     z Pr(>|z|)
empCefuroxime  0.6405    1.8975   0.6009    0.3041 2.106 0.035209 *
empCeftriaxone 1.3594    3.8937   0.5218    0.3545 3.834 0.000126 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

               exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime      1.897     0.5270     1.045     3.444
empCeftriaxone     3.894     0.2568     1.944     7.801

Concordance= 0.608  (se = 0.027 )
Likelihood ratio test= 12.08  on 2 df,   p=0.002
Wald test            = 15.38  on 2 df,   p=5e-04
Score (logrank) test = 10.69  on 2 df,   p=0.005,   Robust = 5.99  p=0.05

  (Note: the likelihood ratio and score tests assume independence of
     observations within a cluster, the Wald and robust score tests do not).
I downloaded Stanford NLP 3.5.2 and ran the sentiment analysis with the default configuration (i.e. I did not change anything, I just unzipped it and ran it).
java -cp "*" edu.stanford.nlp.sentiment.Evaluate -model edu/stanford/nlp/models/sentiment/sentiment.ser.gz -treebank test.txt
Tested 82600 labels
66258 correct
16342 incorrect
0.802155 accuracy
Tested 2210 roots
976 correct
1234 incorrect
0.441629 accuracy
Label confusion matrix
  Guess/Gold        0       1       2       3       4    Marg. (Guess)
           0      323     161      27       3       3      517
           1     1294    5498    2245     652     148     9837
           2      292    2993   51972    2868     282    58407
           3       99     602    2283    7247    2140    12371
           4        0       1      21     228    1218     1468
Marg. (Gold)     2008    9255   56548   10998    3791
0 prec=0.62476, recall=0.16086, spec=0.99759, f1=0.25584
1 prec=0.55891, recall=0.59406, spec=0.94084, f1=0.57595
2 prec=0.88982, recall=0.91908, spec=0.75299, f1=0.90421
3 prec=0.58581, recall=0.65894, spec=0.92844, f1=0.62022
4 prec=0.8297, recall=0.32129, spec=0.99683, f1=0.46321
Root label confusion matrix
  Guess/Gold        0       1       2       3       4    Marg. (Guess)
           0       44      39       9       0       0       92
           1      193     451     190     131      36     1001
           2       23      62      82      30       8      205
           3       19      81     101     299     255      755
           4        0       0       7      50     100      157
Marg. (Gold)      279     633     389     510     399
0 prec=0.47826, recall=0.15771, spec=0.97514, f1=0.2372
1 prec=0.45055, recall=0.71248, spec=0.65124, f1=0.55202
2 prec=0.4, recall=0.2108, spec=0.93245, f1=0.27609
3 prec=0.39603, recall=0.58627, spec=0.73176, f1=0.47273
4 prec=0.63694, recall=0.25063, spec=0.96853, f1=0.35971
Approximate Negative label accuracy: 0.646009
Approximate Positive label accuracy: 0.732504
Combined approximate label accuracy: 0.695110
Approximate Negative root label accuracy: 0.797149
Approximate Positive root label accuracy: 0.774477
Combined approximate root label accuracy: 0.785832
The test.txt file was downloaded from (contains train.txt, dev.txt and test.txt). The download link was taken from
However, in the paper "Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y. and Potts, C., 2013, October. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (Vol. 1631, p. 1642).", on which the sentiment analysis tool is based, the authors reported that the accuracy when classifying into 5 classes is 0.807.
Are the results I obtained normal?
I get the same results when I run it out of the box. It would not surprise me if the version of their system they made for Stanford CoreNLP differs slightly from the version in the paper.
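When comparing against published numbers, it helps to be clear about which figure in the output is which: 0.802155 is computed over all 82600 labeled nodes, while 0.441629 is computed over the 2210 roots only. As a sanity check, the root figures can be reproduced from the root confusion matrix alone. This is a small sketch in Python; the function name summarize is illustrative and not part of the Stanford tool.

```python
# Root label confusion matrix copied from the output above
# (rows = guessed label, columns = gold label).
ROOT_CONFUSION = [
    [44, 39, 9, 0, 0],
    [193, 451, 190, 131, 36],
    [23, 62, 82, 30, 8],
    [19, 81, 101, 299, 255],
    [0, 0, 7, 50, 100],
]


def summarize(matrix):
    """Return (correct, total, accuracy, per_class) for a guess-by-gold matrix."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(size))
    per_class = []
    for i in range(size):
        guessed = sum(matrix[i])                       # row margin
        gold = sum(matrix[r][i] for r in range(size))  # column margin
        per_class.append((matrix[i][i] / guessed, matrix[i][i] / gold))
    return correct, total, correct / total, per_class


correct, total, accuracy, per_class = summarize(ROOT_CONFUSION)
print(f"{correct} of {total} roots correct, accuracy {accuracy:.6f}")
for label, (precision, recall) in enumerate(per_class):
    print(f"{label} prec={precision:.5f}, recall={recall:.5f}")
```

The diagonal sums to 976 out of 2210, which matches the "Tested 2210 roots" block, and the per-class precision and recall match the lines printed under the root matrix.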
I created a pcolor image with each grid shaded in based on a value in the matrix C.
h1 = pcolor(C);
h = colorbar;
ylabel(h,'Monthly Correlation (r-value)');
shading flat
Each grid cell corresponds to a particular year on the x axis and a particular site name on the y axis. How can I add axis tick labels to show this?
I tried the following, but it didn't do anything. I'd also like to put each label in the middle of its grid cell, not on the edges.
x axes labels: years' looks like this (size 15x1 double)
y axes labels: a looks like this (12x1 cell):
Current image looks like this:
You are using the wrong handle. For setting labels you need the axes handle and not the pcolor-handle:
%// get axes handle
ax = gca;
%// set labels
set(ax,'XTickLabel',years,'YTickLabel',a)
%// example data
C = [...
0.06 -0.22 -0.10 0.68 NaN -0.33;
0.04 -0.07 0.12 0.23 NaN -0.47;
NaN NaN NaN NaN NaN 0.28;
0.37 0.36 0.14 0.58 -0.14 -0.15;
NaN 0.11 0.24 0.71 -0.13 NaN;
0.57 0.53 0.41 0.65 -0.43 0.03 ];
%// original plot
h1 = pcolor(C);
h = colorbar;
ylabel(h,'Monthly Correlation (r-value)');
shading flat
%// get axes handle
ax = gca;
%// labels (shortened to fit data)
years = [1999, 2000, 2001, 2002, 2003, 2004];
a = {'09-003-1003-88101', '09-009-0027-88101', '25-013-0008-88101', ...
'25-025-0042-88101', '33-005-0007-88101', '33-009-0010-88101'};
%// adjust position of ticks
set(ax,'XTick', (1:size(C,2))+0.5 )
set(ax,'YTick', (1:size(C,1))+0.5 )
%// set labels
set(ax,'XTickLabel',years,'YTickLabel',a)
I need to sort a matrix so that all elements stay in their columns and each column is in ascending order. Is there a vectorized column-wise sort for a matrix or a data frame in R? (My matrix is all-positive and bounded by B, so I can add j*B to each cell in column j and do a regular one-dimensional sort:
> set.seed(100523); m <- matrix(round(runif(30),2), nrow=6); m
     [,1] [,2] [,3] [,4] [,5]
[1,] 0.47 0.32 0.29 0.54 0.38
[2,] 0.38 0.91 0.76 0.43 0.92
[3,] 0.71 0.32 0.48 0.16 0.85
[4,] 0.88 0.83 0.61 0.95 0.72
[5,] 0.16 0.57 0.70 0.82 0.05
[6,] 0.77 0.03 0.75 0.26 0.05
> offset <- rep(seq_len(5), rep(6, 5)); offset
 [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
> m <- matrix(sort(m + offset), nrow=nrow(m)) - offset; m
     [,1] [,2] [,3] [,4] [,5]
[1,] 0.16 0.03 0.29 0.16 0.05
[2,] 0.38 0.32 0.48 0.26 0.05
[3,] 0.47 0.32 0.61 0.43 0.38
[4,] 0.71 0.57 0.70 0.54 0.72
[5,] 0.77 0.83 0.75 0.82 0.85
[6,] 0.88 0.91 0.76 0.95 0.92
But is there something more beautiful already included?) Otherwise, what would be the fastest way if my matrix has around 1M (10M, 100M) entries (roughly a square matrix)? I'm worried about the performance penalty of apply and friends.
Actually, I don't need "sort", just "top n", with n being around 30 or 100, say. I am thinking about using apply and the partial parameter of sort, but I wonder if this is cheaper than just doing a vectorized sort. So, before doing benchmarks on my own, I'd like to ask for advice by experienced users.
If you want to use sort, ?sort indicates that method = "quick" can be twice as fast as the default method with on the order of 1 million elements.
Start with apply(m, 2, sort, method = "quick") and see if that provides sufficient speed.
Do note the comments on this in ?sort though; ties are sorted in a non-stable manner.
I have put down a quick testing framework for the solutions proposed so far.
library(rbenchmark)

sort.q <- function(m) {
  sort(m, method='quick')
}
sort.p <- function(m) {
  mm <- sort(m, partial=TOP)[1:TOP]
  sort(mm)  # assumed final step: order the TOP smallest values
}
sort.all.g <- function(f) {
  function(m) {
    o <- matrix(rep(seq_len(SIZE), rep(SIZE, SIZE)), nrow=SIZE)
    matrix(f(m+o), nrow=SIZE)[1:TOP,]-o[1:TOP,]
  }
}
sort.all <- sort.all.g(sort)
sort.all.q <- sort.all.g(sort.q)
apply.sort.g <- function(f) {
  function(m) {
    apply(m, 2, f)[1:TOP,]
  }
}
apply.sort <- apply.sort.g(sort)
apply.sort.p <- apply.sort.g(sort.p)
apply.sort.q <- apply.sort.g(sort.q)

# assumed ranges, inferred from the SIZE and TOP values in the results table
SIZE_LIMITS <- 3:9
TOP_LIMITS <- 2:5

bb <- NULL
for (SIZE in floor(sqrt(10)^SIZE_LIMITS)) {
  for (TOP in floor(sqrt(10)^TOP_LIMITS)) {
    print(c(SIZE, TOP))
    TOP <- min(TOP, SIZE)
    m <- matrix(runif(SIZE*SIZE), floor(SIZE))
    if (SIZE < 1000) {
      # check that all variants agree before timing them
      mr <- apply.sort(m)
      stopifnot(apply.sort.q(m) == mr)
      stopifnot(apply.sort.p(m) == mr)
      stopifnot(sort.all(m) == mr)
      stopifnot(sort.all.q(m) == mr)
    }
    # test list taken from the results table; replications/order are assumed
    b <- benchmark(apply.sort(m), apply.sort.q(m), apply.sort.p(m),
                   sort.all(m), sort.all.q(m),
                   columns= c("test", "elapsed", "relative",
                              "user.self", "sys.self"),
                   replications=1, order=NULL)
    b$SIZE <- SIZE
    b$TOP <- TOP
    b$test <- factor(x=b$test, levels=b$test)
    bb <- rbind(bb, b)
  }
}
ftable(xtabs(user.self ~ SIZE+test+TOP, bb))
The results so far indicate that for all but the biggest matrices, apply really hurts performance unless doing a "top n". For "small" matrices < 1e6, just sorting the whole thing without apply is competitive. For "huge" matrices, sorting the whole array becomes slower than apply. Using partial works best for "huge" matrices and is only a slight loss for "small" matrices.
Please feel free to add your own sorting routine :-)
                      TOP      10      31     100     316
SIZE  test
31    apply.sort(m)         0.004   0.012   0.000   0.000
      apply.sort.q(m)       0.008   0.016   0.000   0.000
      apply.sort.p(m)       0.008   0.020   0.000   0.000
      sort.all(m)           0.000   0.008   0.000   0.000
      sort.all.q(m)         0.000   0.004   0.000   0.000
100   apply.sort(m)         0.012   0.016   0.028   0.000
      apply.sort.q(m)       0.016   0.016   0.036   0.000
      apply.sort.p(m)       0.020   0.020   0.040   0.000
      sort.all(m)           0.000   0.004   0.008   0.000
      sort.all.q(m)         0.004   0.004   0.004   0.000
316   apply.sort(m)         0.060   0.060   0.056   0.060
      apply.sort.q(m)       0.064   0.060   0.060   0.072
      apply.sort.p(m)       0.064   0.068   0.108   0.076
      sort.all(m)           0.016   0.016   0.020   0.024
      sort.all.q(m)         0.020   0.016   0.024   0.024
1000  apply.sort(m)         0.356   0.276   0.276   0.292
      apply.sort.q(m)       0.348   0.316   0.288   0.296
      apply.sort.p(m)       0.256   0.264   0.276   0.320
      sort.all(m)           0.268   0.244   0.213   0.244
      sort.all.q(m)         0.260   0.232   0.200   0.208
3162  apply.sort(m)         1.997   1.948   2.012   2.108
      apply.sort.q(m)       1.916   1.880   1.892   1.901
      apply.sort.p(m)       1.300   1.316   1.376   1.544
      sort.all(m)           2.424   2.452   2.432   2.480
      sort.all.q(m)         2.188   2.184   2.265   2.244
10000 apply.sort(m)        18.193  18.466  18.781  18.965
      apply.sort.q(m)      15.837  15.861  15.977  16.313
      apply.sort.p(m)       9.005   9.108   9.304   9.925
      sort.all(m)          26.030  25.710  25.722  26.686
      sort.all.q(m)        23.341  23.645  24.010  24.073
31622 apply.sort(m)       201.265 197.568 196.181 196.104
      apply.sort.q(m)     163.190 160.810 158.757 160.050
      apply.sort.p(m)      82.337  81.305  80.641  82.490
      sort.all(m)         296.239 288.810 289.303 288.954
      sort.all.q(m)       260.872 249.984 254.867 252.087
Does
apply(m, 2, sort)
do the job? :)
Or for top-10, say, use:
apply(m, 2, function(x) sort(x, decreasing = TRUE)[1:10])
Performance is strong - for 1e7 rows and 5 cols (5e7 numbers in total), my computer took around 9 or 10 seconds.
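The per-column "top n" selection does not strictly need a full sort of each column. As a rough illustration of that idea, here is a minimal sketch written in Python rather than R, using the standard library's heapq.nlargest on each column. The function name top_n_per_column is made up for the example, and the small matrix reuses the first rows and columns of the matrix from the question.

```python
import heapq


def top_n_per_column(rows, n):
    """For a row-major matrix, return each column's n largest values, descending."""
    columns = zip(*rows)  # transpose: iterate over columns instead of rows
    return [heapq.nlargest(n, column) for column in columns]


m = [
    [0.47, 0.32, 0.29],
    [0.38, 0.91, 0.76],
    [0.71, 0.32, 0.48],
    [0.88, 0.83, 0.61],
]
top2 = top_n_per_column(m, 2)
print(top2)  # [[0.88, 0.71], [0.91, 0.83], [0.76, 0.61]]
```

Whether a selection like this beats a plain sort depends on n and on the column length, so it is worth timing on your own data.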
R is very fast at matrix calculations. A matrix with 1e7 elements in 1e4 columns gets sorted in under 3 seconds on my machine
m <- matrix(runif(1e7), ncol=1e4)
system.time(sm <- apply(m, 2, sort))
user system elapsed
2.62 0.14 2.79
The first 5 columns:
sm[1:15, 1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 2.607703e-05 0.0002085913 9.364448e-05 0.0001937598 1.157424e-05
[2,] 9.228056e-05 0.0003156713 4.948019e-04 0.0002542199 2.126186e-04
[3,] 1.607228e-04 0.0003988042 5.015987e-04 0.0004544661 5.855639e-04
[4,] 5.756689e-04 0.0004399747 5.762535e-04 0.0004621083 5.877446e-04
[5,] 6.932740e-04 0.0004676797 5.784736e-04 0.0004749235 6.470268e-04
[6,] 7.856274e-04 0.0005927107 8.244428e-04 0.0005443178 6.498618e-04
[7,] 8.489799e-04 0.0006210336 9.249109e-04 0.0005917936 6.548134e-04
[8,] 1.001975e-03 0.0006522120 9.424880e-04 0.0007702231 6.569310e-04
[9,] 1.042956e-03 0.0007237203 1.101990e-03 0.0009826915 6.810103e-04
[10,] 1.246256e-03 0.0007968422 1.117999e-03 0.0009873926 6.888523e-04
[11,] 1.337960e-03 0.0009294956 1.229132e-03 0.0009997757 8.671272e-04
[12,] 1.372295e-03 0.0012221676 1.329478e-03 0.0010375632 8.806398e-04
[13,] 1.583430e-03 0.0012781983 1.433513e-03 0.0010662393 8.886999e-04
[14,] 1.603961e-03 0.0013518191 1.458616e-03 0.0012068383 8.903167e-04
[15,] 1.673268e-03 0.0013697683 1.590524e-03 0.0013617468 1.024081e-03
They say there's a fine line between genius and madness... take a look at this and see what you think of the idea. As in the question, the goal is to find the top 30 elements of a vector vec that might be long (1e7, 1e8, or more elements).
topn = 30
sdmult = max(1,qnorm(1-(topn/length(vec))))
sdmin = 1e-5
acceptmult = 10
calcsd = max(sd(vec),sdmin)
calcmn = mean(vec)
thresh = calcmn + sdmult*calcsd
subs = which(vec > thresh)
while (length(subs) > topn * acceptmult) {
  thresh = thresh + calcsd
  subs = which(vec > thresh)
}
while (length(subs) < topn) {
  thresh = thresh - calcsd
  subs = which(vec > thresh)
}
topvals = sort(vec[subs], decreasing=TRUE)[1:topn]
The basic idea is that even if we don't know much about the distribution of vec, we'd certainly expect the highest values in vec to be several standard deviations above the mean. If vec were normally distributed, then the qnorm expression on line 2 gives a rough idea how many sd's above the mean we'd need to look to find the highest topn values (e.g. if vec contains 1e8 values, the top 30 values are likely to be located in the region starting 5 sd's above the mean.) Even if vec isn't normal, this assumption is unlikely to be massively far away from the truth.
Ok, so we compute the mean and sd of vec, and use these to propose a threshold to look above - a certain number of sd's above the mean. We're hoping to find in this upper tail a subset of slightly more than topn values. If we do, we can sort it and easily identify the highest topn values - which will be the highest topn values in vec overall.
Now the exact rules here can probably be tweaked a bit, but the idea is that we need to guard against the original threshold being "out" for some reason. We therefore exploit the fact that it's quick to check how many elements lie above a certain threshold. So, we first raise the threshold, in increments of calcsd, until there are fewer than 10 * topn elements above the threshold. Then, if needed, we reduce thresh (again in steps of calcsd) until we definitely have at least topn elements above the threshold. This bi-directional search should always lead to a "threshold set" whose size is fairly close to topn (hopefully within a factor of 10 or 100). As topn is relatively small (typical value 30), it will be really fast to sort this threshold set, which of course immediately gives us the highest topn elements in the original vector vec.
My claim is that the calculations involved in generating a decent threshold set are all quick in R, so if only the top 30 or so elements of a very large vector are required, this indirect approach will beat any approach that involves sorting the whole vector.
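For anyone who wants to check the logic against a full sort, here is the same threshold search sketched in Python using only the standard library. It is a rough translation, not a benchmark: the function name top_n_by_threshold and the early return for short vectors are additions for the example, and in pure Python the speed argument made above for R does not carry over.

```python
import random
import statistics


def top_n_by_threshold(vec, topn=30, accept_mult=10, sd_min=1e-5):
    """Find the topn largest values by sorting only a thresholded upper tail."""
    if len(vec) <= topn:
        # added guard: nothing to gain from thresholding a short vector
        return sorted(vec, reverse=True)
    sd = max(statistics.stdev(vec), sd_min)
    mean = statistics.fmean(vec)
    # Normal-quantile guess at how many sd's above the mean the tail starts
    sd_mult = max(1.0, statistics.NormalDist().inv_cdf(1 - topn / len(vec)))
    thresh = mean + sd_mult * sd
    above = [x for x in vec if x > thresh]
    # Raise the threshold while the candidate set is far larger than needed
    while len(above) > topn * accept_mult:
        thresh += sd
        above = [x for x in vec if x > thresh]
    # Lower it again until at least topn candidates remain
    while len(above) < topn:
        thresh -= sd
        above = [x for x in vec if x > thresh]
    return sorted(above, reverse=True)[:topn]


random.seed(0)
normal_vec = [random.gauss(0.0, 1.0) for _ in range(20000)]
uniform_vec = [random.random() for _ in range(20000)]
print(top_n_by_threshold(normal_vec)[:3])
```

Because every value above the final threshold is kept and at least topn of them exist, the result is always the true top topn, whatever the distribution; only the size of the candidate set, and hence the speed, depends on how good the initial guess is.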
What do you think?! If you think it's an interesting idea, please like/vote up :) I'll look at doing some proper timings but my initial tests on randomly generated data were really promising - it'd be great to test it out on "real" data though...!
Cheers :)