Precision, Recall and Correctly Classified Instances - precision

How are the Precision and Recall values for the yes and no classes calculated?
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.444 0.8 0.5 0.444 0.471 0.522 yes
0.2 0.556 0.167 0.2 0.182 0.522 no
Weighted Avg. 0.357 0.713 0.381 0.357 0.367 0.522
And how is the Correctly Classified Instances value of 35.714% obtained?
The data is the weather dataset, run in Weka:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
No. outlook temperature humidity windy play
1 sunny 85.0 85.0 FALSE no
2 sunny 80.0 90.0 TRUE no
3 overcast 83.0 86.0 FALSE yes
4 rainy 70.0 96.0 FALSE yes
5 rainy 68.0 80.0 FALSE yes
6 rainy 65.0 70.0 TRUE no
7 overcast 64.0 65.0 TRUE yes
8 sunny 72.0 95.0 FALSE no
9 sunny 69.0 70.0 FALSE yes
10 rainy 75.0 80.0 FALSE yes
11 sunny 75.0 70.0 TRUE yes
12 overcast 72.0 90.0 TRUE yes
13 overcast 81.0 75.0 FALSE yes
14 rainy 71.0 91.0 TRUE no
===========================
=== Run information ===
Scheme:weka.classifiers.rules.PART -M 2 -C 0.25 -Q 1
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
PART decision list
------------------
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules : 4
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 5 35.7143 %
Incorrectly Classified Instances 9 64.2857 %
Kappa statistic -0.3404
Mean absolute error 0.5518
Root mean squared error 0.6935
Relative absolute error 115.875 %
Root relative squared error 140.5649 %
Total Number of Instances 14
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.444 0.8 0.5 0.444 0.471 0.522 yes
0.2 0.556 0.167 0.2 0.182 0.522 no
Weighted Avg. 0.357 0.713 0.381 0.357 0.367 0.522
=== Confusion Matrix ===
a b <-- classified as
4 5 | a = yes
4 1 | b = no
Thanks and best regards

From the confusion matrix:
=== Confusion Matrix ===
a b <-- classified as
4 5 | a = yes
4 1 | b = no
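Precision for yes is 4/8 = 0.5: the number of instances correctly classified as a (yes) divided by the number of instances predicted as a (column a: 4 + 4). Recall for yes is 4/9 = 0.444: the number correctly classified as a divided by the total number of true a (row a: 4 + 5). The other class works the same way with its own row and column: precision for no is 1/6 = 0.167 and recall is 1/5 = 0.2. Correctly Classified Instances is the sum of the diagonal divided by the total number of instances, (4 + 1)/14 = 35.714%.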
See the definitions of all those criteria in one single cheatsheet.
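As a minimal sketch, the per-class figures in the Weka output can be reproduced from the confusion matrix alone. The function names `per_class_metrics` and `accuracy` and the nested-list layout (rows are actual classes, columns are predicted classes) are assumptions made for this illustration; they are not part of Weka.

```python
def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} for a square confusion matrix.

    matrix[i][j] counts instances of actual class i predicted as class j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column total
        actually_k = sum(matrix[k])                     # row total
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        metrics[label] = (precision, recall)
    return metrics


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrix copied from the Weka output in the question
weka_matrix = [[4, 5],   # actual yes: 4 predicted yes, 5 predicted no
               [4, 1]]   # actual no:  4 predicted yes, 1 predicted no
metrics = per_class_metrics(weka_matrix, ["yes", "no"])
overall_accuracy = accuracy(weka_matrix)

for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")
print(f"correctly classified: {overall_accuracy:.4%}")
```

The printed values should line up with the Precision, Recall and Correctly Classified Instances entries Weka reports for this run.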

Related

Why does coxph() combined with cluster() give much smaller standard errors than other methods to adjust for clustering (e.g. coxme() or frailty())?

I am working on a dataset to test the association between empirical antibiotics (variable emp, the antibiotics are cefuroxime or ceftriaxone compared with a reference antibiotic) and 30-day mortality (variable mort30). The data comes from patients admitted in 6 hospitals (variable site2) with a specific type of infection. Therefore, I would like to adjust for this clustering of patients on hospital level.
First I did this using the coxme() function for mixed models. However, based on visual inspection of the Schoenfeld residuals there were violations of the proportional hazards assumption and I tried adding a time transformation (tt) to the model. Unfortunately, the coxme() does not offer the possibility for time transformations.
Therefore, I tried other options to adjust for the clustering, including coxph() combined with frailty() and cluster(). Surprisingly, the standard errors I get using the cluster() option are much smaller than those from coxme() or frailty().
**Does anyone know the explanation for this, and which option would provide the most reliable estimates?**
1) Using coxme:
> uni.mort <- coxme(Surv(FUdur30, mort30num) ~ emp + (1 | site2), data = total.pop)
> summary(uni.mort)
Cox mixed-effects model fit by maximum likelihood
Data: total.pop
events, n = 58, 253
Iterations= 24 147
NULL Integrated Fitted
Log-likelihood -313.8427 -307.6543 -305.8967
Chisq df p AIC BIC
Integrated loglik 12.38 3.00 0.0061976 6.38 0.20
Penalized loglik 15.89 3.56 0.0021127 8.77 1.43
Model: Surv(FUdur30, mort30num) ~ emp + (1 | site2)
Fixed coefficients
coef exp(coef) se(coef) z p
empCefuroxime 0.5879058 1.800214 0.6070631 0.97 0.33
empCeftriaxone 1.3422317 3.827576 0.5231278 2.57 0.01
Random effects
Group Variable Std Dev Variance
site2 Intercept 0.2194737 0.0481687
> confint(uni.mort)
2.5 % 97.5 %
empCefuroxime -0.6019160 1.777728
empCeftriaxone 0.3169202 2.367543
2) Using frailty()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp + frailty(site2), data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp + frailty(site2),
data = total.pop)
n= 253, number of events= 58
coef se(coef) se2 Chisq DF p
empCefuroxime 0.6302 0.6023 0.6010 1.09 1.0 0.3000
empCeftriaxone 1.3559 0.5221 0.5219 6.75 1.0 0.0094
frailty(site2) 0.40 0.3 0.2900
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.878 0.5325 0.5768 6.114
empCeftriaxone 3.880 0.2577 1.3947 10.796
Iterations: 7 outer, 27 Newton-Raphson
Variance of random effect= 0.006858179 I-likelihood = -307.8
Degrees of freedom for terms= 2.0 0.3
Concordance= 0.655 (se = 0.035 )
Likelihood ratio test= 12.87 on 2.29 df, p=0.002
3) Using cluster()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp, cluster = site2, data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp, data = total.pop,
cluster = site2)
n= 253, number of events= 58
coef exp(coef) se(coef) robust se z Pr(>|z|)
empCefuroxime 0.6405 1.8975 0.6009 0.3041 2.106 0.035209 *
empCeftriaxone 1.3594 3.8937 0.5218 0.3545 3.834 0.000126 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.897 0.5270 1.045 3.444
empCeftriaxone 3.894 0.2568 1.944 7.801
Concordance= 0.608 (se = 0.027 )
Likelihood ratio test= 12.08 on 2 df, p=0.002
Wald test = 15.38 on 2 df, p=5e-04
Score (logrank) test = 10.69 on 2 df, p=0.005, Robust = 5.99 p=0.05
(Note: the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests do not).
>
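To see how much the choice of standard error matters in practice, the Wald intervals can be recomputed by hand from the coef and se columns above. This is only an arithmetic sketch in Python (the helper `wald_interval` is a name invented here, and the normal quantile 1.959964 is assumed to be the one used for the 95% limits); it does not say which standard error is the more trustworthy one.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile (assumed)


def wald_interval(coef, se, z=Z_95):
    """Hazard-ratio confidence interval exp(coef +/- z * se)."""
    return math.exp(coef - z * se), math.exp(coef + z * se)


# (coef, se) for empCefuroxime, copied from the three outputs above
fits = {
    "coxme": (0.5879058, 0.6070631),
    "frailty": (0.6302, 0.6023),
    "cluster (robust se)": (0.6405, 0.3041),
    "cluster (naive se)": (0.6405, 0.6009),
}
intervals = {name: wald_interval(coef, se) for name, (coef, se) in fits.items()}

for name, (lower, upper) in intervals.items():
    print(f"{name:20s} HR 95% CI: {lower:.3f} to {upper:.3f}")
```

The point estimates are nearly identical across the three fits, so the difference in the reported intervals comes almost entirely from the standard error: the naive se from the cluster() fit gives an interval close to the coxme() and frailty() ones, while the robust se roughly halves its width on the log scale.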

Perfect scores in multiclass classification?

I am working on a multiclass classification problem with 3 classes (1, 2, 3) that are perfectly balanced: 70 instances of each class, giving a (210, 8) dataframe.
The data is ordered by class, i.e. the first 70 instances are class 1, the next 70 are class 2 and the last 70 are class 3. I know that splitting data ordered like this without shuffling leads to a good score on the train set but a poor score on the test set, because the test set contains classes that the model has not seen. So I used the stratify parameter in train_test_split. Below is my code:
import numpy as np
import optuna
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# data2 (features), y (labels) and pipe (a Pipeline whose estimator step is
# named 'model') are assumed to be defined earlier.

# SPLITTING
train_x, test_x, train_y, test_y = train_test_split(
    data2, y, test_size=0.2, random_state=69, stratify=y)

# Cross-validated macro F1 before tuning
cross_val_model = cross_val_score(pipe, train_x, train_y, cv=5,
                                  n_jobs=-1, scoring='f1_macro')
s_score = cross_val_model.mean()

def objective(trial):
    # Sample one candidate set of hyperparameters for the 'model' step
    model__n_neighbors = trial.suggest_int('model__n_neighbors', 1, 20)
    model__metric = trial.suggest_categorical(
        'model__metric', ['euclidean', 'manhattan', 'minkowski'])
    model__weights = trial.suggest_categorical(
        'model__weights', ['uniform', 'distance'])
    params = {'model__n_neighbors': model__n_neighbors,
              'model__metric': model__metric,
              'model__weights': model__weights}
    pipe.set_params(**params)
    # Score the candidate by its mean cross-validated macro F1
    return np.mean(cross_val_score(pipe, train_x, train_y, cv=5,
                                   n_jobs=-1, scoring='f1_macro'))

# Tune, then refit the best pipeline and evaluate it on the held-out test set
knn_study = optuna.create_study(direction='maximize')
knn_study.optimize(objective, n_trials=10)
knn_study.best_params
optuna_gave_score = knn_study.best_value
pipe.set_params(**knn_study.best_params)
pipe.fit(train_x, train_y)
pred = pipe.predict(test_x)
c_matrix = confusion_matrix(test_y, pred)
c_report = classification_report(test_y, pred)
Now the problem is that I am getting perfect scores on the test set, while the macro F1 score from cross-validation is 0.898. Below are my confusion matrix and classification report:
14  0  0
 0 14  0
 0  0 14
Classification Report:
              precision    recall  f1-score   support
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00        14
           3       1.00      1.00      1.00        14
    accuracy                           1.00        42
   macro avg       1.00      1.00      1.00        42
weighted avg       1.00      1.00      1.00        42
Am I overfitting or what?
Finally got the answer. The dataset I was using was the issue. It was tailor-made for the KNN algorithm, which is why I was getting perfect scores with that algorithm.
I came to this conclusion after performing a clustering exercise on the dataset, in which the K-Means algorithm recovered the classes perfectly.
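The effect of the stratify parameter described in the question can be sketched in plain Python. This is an illustrative stand-in and not scikit-learn's implementation: the helper `stratified_split` is hypothetical, and it simply shuffles the indices of each class separately and takes the same fraction from each.

```python
import random
from collections import Counter


def stratified_split(labels, test_size, seed):
    """Split indices so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)                    # shuffle within the class
        n_test = round(len(class_indices) * test_size)
        test_idx.extend(class_indices[:n_test])
        train_idx.extend(class_indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Same layout as in the question: 70 instances per class, ordered by class
labels = [1] * 70 + [2] * 70 + [3] * 70
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=69)
test_counts = Counter(labels[i] for i in test_idx)
print(dict(test_counts))
```

Each class ends up with 14 test instances, which matches the support column of the classification report, so the ordering of the data is not what drives the perfect test score.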

H2o: Is there a way to fix threshold in H2ORandomForestEstimator performance during training and testing?

I have built a model with H2ORandomForestEstimator and the results look like the output below.
The threshold keeps changing (0.5 on training and 0.313725489027 on validation), and I would like to fix the threshold in H2ORandomForestEstimator so that results are comparable during fine-tuning. Is there a way to set the threshold?
From http://h2o-release.s3.amazonaws.com/h2o/master/3484/docs-website/h2o-py/docs/modeling.html#h2orandomforestestimator, there is no such parameter.
If there is no way to set this, how do we know what threshold our model is built on?
rf_v1
** Reported on train data. **
MSE: 2.75013548238e-05
RMSE: 0.00524417341664
LogLoss:0.000494320913199
Mean Per-Class Error: 0.0188802936476
AUC: 0.974221763605
Gini: 0.948443527211
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.5:
0 1 Error Rate
----- ------ --- ------- --------------
0 161692 1 0 (1.0/161693.0)
1 3 50 0.0566 (3.0/53.0)
Total 161695 51 0 (4.0/161746.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- -------- -----
max f1 0.5 0.961538 19
max f2 0.25 0.955056 21
max f0point5 0.571429 0.983936 18
max accuracy 0.571429 0.999975 18
max precision 1 1 0
max recall 0 1 69
max specificity 1 1 0
max absolute_mcc 0.5 0.961704 19
max min_per_class_accuracy 0.25 0.962264 21
max mean_per_class_accuracy 0.25 0.98112 21
Gains/Lift Table: Avg response rate: 0.03 %
** Reported on validation data. **
MSE: 1.00535766226e-05
RMSE: 0.00317073755183
LogLoss: 4.53885183426e-05
Mean Per-Class Error: 0.0
AUC: 1.0
Gini: 1.0
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.313725489027:
0 1 Error Rate
----- ----- --- ------- -------------
0 53715 0 0 (0.0/53715.0)
1 0 16 0 (0.0/16.0)
Total 53715 16 0 (0.0/53731.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- ------- -----
max f1 0.313725 1 5
max f2 0.313725 1 5
max f0point5 0.313725 1 5
max accuracy 0.313725 1 5
max precision 1 1 0
max recall 0.313725 1 5
max specificity 1 1 0
max absolute_mcc 0.313725 1 5
max min_per_class_accuracy 0.313725 1 5
max mean_per_class_accuracy 0.313725 1 5
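The reported threshold is the one that maximizes F1 on the dataset being scored, which is why it differs between the training and validation metrics.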
If you want to apply your own threshold, you will have to take the probability of the positive class and compare it yourself to produce the label you want.
If you use your web browser to connect to the H2O Flow Web UI inside of H2O-3, you can mouse over the ROC curve and visually browse the confusion matrix for each threshold, which is convenient.
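Applying your own threshold, as described above, can be sketched as follows. The sketch works on plain Python lists so that it does not depend on any particular H2O call; the probabilities and labels are made up for illustration, and the helper names are hypothetical. In the H2O Python API the positive-class probability is usually the `p1` column of the prediction frame, but treat that column name as an assumption to check against your version.

```python
def labels_at_threshold(positive_probs, threshold):
    """Label an example 1 when its positive-class probability reaches the threshold."""
    return [1 if prob >= threshold else 0 for prob in positive_probs]


def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for truth, guess in zip(actual, predicted):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 1 and guess == 0:
            fn += 1
        elif truth == 0 and guess == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical positive-class probabilities and true labels, for illustration only
probs = [0.02, 0.10, 0.31, 0.45, 0.60, 0.90]
actual = [0, 0, 1, 0, 1, 1]

FIXED_THRESHOLD = 0.5  # the same cut-off is reused for every dataset
predicted = labels_at_threshold(probs, FIXED_THRESHOLD)
counts = confusion_counts(actual, predicted)
print(predicted, counts)
```

Using one fixed cut-off for both training and validation predictions makes the resulting confusion matrices directly comparable across tuning runs.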

Getting uncalibrated probability outputs with Vowpal Wabbit, ad-conversion prediction

I'm trying to use Vowpal Wabbit to predict the conversion rate for display ads, and I'm getting non-intuitive probability outputs: they are centered at around 36%, while the global frequency of the positive class is less than 1%.
The positive/negative imbalance I have in my dataset is 1/100 (I already undersampled the negative class), so I use a weight of 100 in the positive examples.
Negative examples have label -1, and positive ones 1. I used shuf to shuffle positive and negative examples for online learning to work properly.
Sample lines in the vw file:
1 100 'c4ac3440|i search_delay_log:3.58351893846 click_count_log:3.58351893846 banner_impression_count_log:3.98898404656 |c es i_type_2 xvertical_1_61 vertical_1 creat_size_728x90 retargeting
-1 1 'a4d25cf1|i search_delay_log:11.2825684591 click_count_log:11.2825684591 banner_impression_count_log:4.48863636973 |c br i_type_1 xvertical_1_960 vertical_1 creat_size_300x600 retargeting
Now I use the following to create a model from a training set:
vw -d impressions_rand.aa --loss_function logistic -c -k --passes 12 -f model.vw
Output:
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = impressions_rand.aa.cache
Reading datafile = impressions_rand.aa
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 -1.0000 0.0000 11
0.510760 0.328374 2 2.0 -1.0000 -0.9449 11
0.387521 0.264282 4 4.0 -1.0000 -1.1825 11
1.765374 1.818883 8 107.0 1.0000 -1.7020 11
2.152669 2.444504 51 249.0 1.0000 -3.2953 11
1.289870 0.427071 201 498.0 -1.0000 -3.5498 11
0.878843 0.528943 588 1083.0 1.0000 -1.3394 9
0.852358 0.825872 1176 2166.0 -1.0000 -6.7918 11
0.871977 0.891597 2451 4332.0 -1.0000 -2.7031 11
0.689428 0.506878 4110 8664.0 -1.0000 -2.7525 11
0.638008 0.586589 8517 17328.0 -1.0000 -5.8017 11
0.580220 0.522713 17515 34741.0 1.0000 2.1519 11
0.526281 0.472343 35525 69482.0 -1.0000 -6.2931 9
0.497601 0.468921 71050 138964.0 -1.0000 -7.6245 9
0.479305 0.461008 143585 277928.0 -1.0000 -0.8296 11
0.443734 0.443734 288655 555856.0 -1.0000 -2.5795 11 h
0.438806 0.433925 578181 1111791.0 1.0000 0.8503 11 h
finished run
number of examples per pass = 216000
passes used = 5
weighted example sum = 2072475.000000
weighted label sum = -67475.000000
average loss = 0.432676 h
best constant = -0.065138
best constant's loss = 0.692617
total feature number = 11548690
Now to predict on a test set. The --link logistic should transform the vw outputs to probabilities in the range [0, 1].
vw -d impressions_rand.ab --link logistic -i model.vw -p preds_ab.txt
Output:
predictions = preds_ab.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average since example example current current current
loss last counter weight label predict features
68.282379 68.282379 1 1.0 -1.0000 0.0001 9
38.748867 9.215355 2 2.0 -1.0000 0.0174 11
21.256140 3.763414 4 4.0 -1.0000 0.8345 11
11.685329 2.114518 8 8.0 -1.0000 0.3508 11
9.457854 7.230378 16 16.0 -1.0000 0.0069 11
7.371087 5.284320 32 32.0 -1.0000 0.3561 11
7.061980 6.752873 64 64.0 -1.0000 0.6549 11
5.423309 3.784638 128 128.0 -1.0000 0.2597 11
3.252394 1.725597 211 310.0 1.0000 0.7686 11
2.140099 1.052366 330 627.0 1.0000 0.7143 11
1.671550 1.203000 660 1254.0 -1.0000 0.8054 11
1.788466 1.905383 1320 2508.0 -1.0000 0.0676 9
1.508163 1.234410 2502 5076.0 1.0000 0.3921 11
1.282862 1.060063 5061 10209.0 1.0000 0.4258 9
1.119420 0.955977 11013 20418.0 -1.0000 0.6892 11
1.017911 0.916403 22323 40836.0 -1.0000 0.5301 9
0.888435 0.758960 42171 81672.0 -1.0000 0.3500 11
0.787709 0.686983 84243 163344.0 -1.0000 0.2360 9
0.703270 0.618831 170268 326688.0 -1.0000 0.5707 11
finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 397936.000000
weighted label sum = -12936.000000
average loss = 0.684043
best constant = -0.032508
best constant's loss = 0.998943
total feature number = 2216941
This produces a predictions file preds_ab.txt like:
0.000095 7c14ae23
0.017367 3e9558bd
0.139393 6a1cd72f
0.834518 dfe76f6e
0.089810 2b88b547
If I calculate the ROC-AUC score of these predictions, I get a value of 0.85 which is close to what I get using scikit-learn (0.90). However the probability outputs are not calibrated at all, since they are much higher than what I would expect (close to 1%). This is the histogram.
This is the reliability curve:
And this is a plot of mean probabilities and positive frequencies when examples are binned by probabilities:
It's obvious that output probabilities are much higher than what would be expected from a well-calibrated classifier.
What am I doing wrong here? What should I investigate?
UPDATE
If I don't use a weight of 100 for the positive class examples, I get similarly non-intuitive results. The mean probability output is 0.27 (still very far from 1%), the reliability plot looks even worse, and ROC-AUC is 0.76.
I can confirm I have 237805 negative examples and 2195 positive ones.
Output training:
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = impressions_rand.aa.cache
Reading datafile = impressions_rand.aa
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 -1.0000 0.0000 11
0.546724 0.400300 2 2.0 -1.0000 -0.7087 11
0.398553 0.250382 4 4.0 -1.0000 -1.3963 11
0.284506 0.170460 8 8.0 -1.0000 -2.2595 11
0.181406 0.078306 16 16.0 -1.0000 -2.8225 11
0.108136 0.034865 32 32.0 -1.0000 -4.2696 11
0.063156 0.018176 64 64.0 -1.0000 -4.7412 11
0.036415 0.009675 128 128.0 -1.0000 -4.2940 11
0.020325 0.004235 256 256.0 -1.0000 -5.9903 11
0.043248 0.066171 512 512.0 -1.0000 -5.5540 11
0.045276 0.047304 1024 1024.0 -1.0000 -4.7065 11
0.044606 0.043935 2048 2048.0 -1.0000 -6.6253 11
0.048938 0.053270 4096 4096.0 -1.0000 -5.9119 11
0.048711 0.048485 8192 8192.0 -1.0000 -2.3949 11
0.048157 0.047603 16384 16384.0 -1.0000 -9.6219 11
0.044306 0.040454 32768 32768.0 -1.0000 -8.8800 11
0.044029 0.043752 65536 65536.0 -1.0000 -5.9218 9
0.042739 0.041450 131072 131072.0 -1.0000 -3.8306 11
0.042986 0.042986 262144 262144.0 -1.0000 -6.0941 11 h
0.042321 0.041655 524288 524288.0 -1.0000 -4.0276 11 h
0.042654 0.042988 1048576 1048576.0 -1.0000 -9.9169 11 h
finished run
number of examples per pass = 216000
passes used = 7
weighted example sum = 1512000.000000
weighted label sum = -1484504.000000
average loss = 0.042763 h
best constant = -4.691161
best constant's loss = 0.051789
total feature number = 16166472
The testing output follows. I've read that an average loss larger than the best constant's loss indicates that something is wrong with the model's learning.
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average since example example current current current
loss last counter weight label predict features
78.141266 78.141266 1 1.0 -1.0000 0.0001 11
54.228148 30.315029 2 2.0 -1.0000 0.0015 11
33.279501 12.330854 4 4.0 1.0000 0.0472 11
20.358767 7.438034 8 8.0 -1.0000 0.0527 11
15.780043 11.201319 16 16.0 -1.0000 0.1657 11
13.783271 11.786498 32 32.0 -1.0000 0.0012 9
9.318714 4.854158 64 64.0 -1.0000 0.7268 11
6.797651 4.276587 128 128.0 -1.0000 0.1404 9
4.674237 2.550824 256 256.0 -1.0000 0.0516 11
3.269198 1.864159 512 512.0 -1.0000 0.4092 11
2.153033 1.036868 1024 1024.0 -1.0000 0.0425 11
1.481920 0.810807 2048 2048.0 -1.0000 0.2792 11
1.005869 0.529817 4096 4096.0 -1.0000 0.2422 11
0.676574 0.347279 8192 8192.0 -1.0000 0.3003 11
0.452924 0.229274 16384 16384.0 -1.0000 0.2579 11
0.295262 0.137600 32768 32768.0 -1.0000 0.2833 11
0.191513 0.087763 65536 65536.0 -1.0000 0.2616 9
0.126758 0.062003 131072 131072.0 -1.0000 0.2670 11
finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 207361.000000
weighted label sum = -203423.000000
average loss = 0.099565
best constant = -0.981009
best constant's loss = 0.037621
total feature number = 2217159
You say you have one positive example per 100 negative examples on average in the training set. However, you put 100 times more weight on the positive examples, which is (almost) equivalent to repeating each positive example 100 times in the training set. This way the average predicted probability should be around 50%. So you should not be surprised it is not around 1%.
According to the vw output you provided, it seems that there are more than 100 negative examples per one positive in the training set impressions_rand.aa, so the "weighted label sum" is negative (otherwise it should be around 0). Thus, the average predicted probability is not 50% but around 36%.
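The effect of the weights can be checked from the two summary lines vw prints after training. With labels in {-1, +1}, "weighted example sum" is W_pos + W_neg and "weighted label sum" is W_pos - W_neg, so the share of total weight carried by positives follows directly. The sketch below is plain Python with hypothetical helper names; the numbers are copied from the weighted and unweighted training outputs in the question.

```python
import math


def weighted_positive_share(weighted_example_sum, weighted_label_sum):
    """Share of total weight carried by +1 examples, for labels in {-1, +1}.

    example_sum = W_pos + W_neg and label_sum = W_pos - W_neg,
    hence W_pos = (example_sum + label_sum) / 2.
    """
    w_pos = (weighted_example_sum + weighted_label_sum) / 2.0
    return w_pos / weighted_example_sum


def log_odds(share):
    """Log-odds of a probability, the natural scale for a logistic model."""
    return math.log(share / (1.0 - share))


# Training run with weight 100 on positives
weighted_share = weighted_positive_share(2072475.0, -67475.0)
# Training run without example weighting
unweighted_share = weighted_positive_share(1512000.0, -1484504.0)

print(f"weighted:   share={weighted_share:.4f} log-odds={log_odds(weighted_share):.4f}")
print(f"unweighted: share={unweighted_share:.4f} log-odds={log_odds(unweighted_share):.4f}")
```

With the weights, positives carry a little under half of the total weight, so a model trained this way should predict far above 1% on average. The two log-odds values come out close to the "best constant" lines in the respective training outputs, which is what one would expect under logistic loss.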
I solved it thanks to the comments from Martin Popel and arielf. :)
I forgot to use -t when generating the predictions.
I didn't specify --loss_function logistic when generating the predictions.
As a result, the model was being updated while testing, using the default loss function instead of the logistic one, which destroyed the model and produced wrong results.
Takeaways:
Use --loss_function logistic during testing as well, to see correct loss outputs.
Remember to use -t if you don't want to update your model while predicting.
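The huge losses in the earlier test runs are consistent with squared loss being applied to the raw score. The sketch below, in plain Python, assumes that vw computes the reported loss on the raw (pre-link) score and that the default loss is squared loss; under those assumptions, the first prediction in preds_ab.txt (0.000095, label -1) gives a loss close to the 68.28 printed on the first line of that run.

```python
import math


def logit(prob):
    """Inverse of the logistic link: probability -> raw score."""
    return math.log(prob / (1.0 - prob))


def sigmoid(raw):
    """Logistic link: raw score -> probability."""
    return 1.0 / (1.0 + math.exp(-raw))


def squared_loss(raw, label):
    return (raw - label) ** 2


def logistic_loss(raw, label):
    """Logistic loss for labels in {-1, +1}."""
    return math.log1p(math.exp(-label * raw))


first_prediction = 0.000095  # first line of preds_ab.txt
label = -1.0

raw_score = logit(first_prediction)
loss_if_squared = squared_loss(raw_score, label)
loss_if_logistic = logistic_loss(raw_score, label)

print(f"raw score      = {raw_score:.4f}")
print(f"squared loss   = {loss_if_squared:.2f}")
print(f"logistic loss  = {loss_if_logistic:.6f}")
```

The same prediction is nearly perfect under logistic loss but very costly under squared loss, because a raw score of about -9 is far from the label -1 on the raw scale.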
This is how the output looks now when testing (without example weighting):
$ vw -d impressions_rand.ab --link logistic --loss_function logistic -i model.vw -t -p preds.txt
only testing
predictions = preds.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.000053 0.000053 1 1.0 -1.0000 0.0001 11
0.000370 0.000687 2 2.0 -1.0000 0.0007 11
1.252868 2.505366 4 4.0 1.0000 0.0067 11
0.638249 0.023630 8 8.0 -1.0000 0.0036 11
0.322060 0.005872 16 16.0 -1.0000 0.0031 11
0.164750 0.007439 32 32.0 -1.0000 0.0000 9
0.084911 0.005072 64 64.0 -1.0000 0.0081 11
0.076905 0.068899 128 128.0 -1.0000 0.0004 9
0.055126 0.033347 256 256.0 -1.0000 0.0000 11
0.052986 0.050847 512 512.0 -1.0000 0.0133 11
0.038351 0.023715 1024 1024.0 -1.0000 0.0000 11
0.037059 0.035767 2048 2048.0 -1.0000 0.0167 11
0.038848 0.040637 4096 4096.0 -1.0000 0.0112 11
0.038903 0.038957 8192 8192.0 -1.0000 0.0281 11
0.041625 0.044348 16384 16384.0 -1.0000 0.0001 11
0.042526 0.043426 32768 32768.0 -1.0000 0.0218 11
0.042538 0.042551 65536 65536.0 -1.0000 0.0000 9
0.042150 0.041763 131072 131072.0 -1.0000 0.0019 11
finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 207361.000000
weighted label sum = -203423.000000
average loss = 0.042438
best constant = -4.647395
best constant's loss = 0.053670
total feature number = 2217159
As you can see, the reported average loss is now less than the best constant's loss, and the running average losses also lie in the expected interval.
Also, the output probabilities now make perfect sense:

Why average loss goes up when training using Vowpal Wabbit

I tried to use VW to train a regression model on a small set of examples (about 3112). I think I'm doing it correctly, yet it showed me weird results. Dug around but didn't find anything helpful.
$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.064936 0.077711 16 16.0 -0.1800 0.0547 77
0.060507 0.056079 32 32.0 0.0000 0.3164 79
0.136933 0.213358 64 64.0 -0.5900 -0.0850 79
0.151692 0.166452 128 128.0 0.0700 0.0060 79
0.133965 0.116238 256 256.0 0.0900 -0.0446 78
0.179995 0.226024 512 512.0 0.3700 -0.0217 79
0.109296 0.038597 1024 1024.0 0.1200 -0.0728 79
0.579360 1.049425 2048 2048.0 -0.3700 -0.0084 79
0.485389 0.485389 4096 4096.0 1.9600 0.3934 79 h
0.517748 0.550036 8192 8192.0 0.0700 0.0334 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
$ wc model
41 48 657 model
Questions:
1) Why is the number of features in the output (readable) model less than the number of actual features? I counted that the training data contains 78 features (plus the bias, that's 79 as shown during the training). The number of feature bits is 24, which should be far more than enough to avoid collisions.
2) Why does the average loss actually go up during training, as you can see in the above example?
3) (Minor) I tried to increase the number of feature bits to 32, and it output an empty model. Why?
EDIT:
I tried shuffling the input file and using --holdout_off, as suggested. But the result is still almost the same: the average loss goes up.
$ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.071332 0.090504 16 16.0 0.0300 0.1203 79
0.043720 0.016108 32 32.0 -0.2200 -0.1971 78
0.142895 0.242071 64 64.0 0.0100 -0.1531 79
0.158564 0.174232 128 128.0 0.0500 -0.0439 79
0.150691 0.142818 256 256.0 0.3200 0.1466 79
0.197050 0.243408 512 512.0 0.2300 -0.0459 79
0.117398 0.037747 1024 1024.0 0.0400 0.0284 79
0.636949 1.156501 2048 2048.0 1.2500 -0.0152 79
0.363364 0.089779 4096 4096.0 0.1800 0.0071 79
0.477569 0.591774 8192 8192.0 -0.4800 0.0065 79
0.411068 0.344567 16384 16384.0 0.0700 0.0450 77
finished run
number of examples per pass = 3112
passes used = 10
weighted example sum = 31120
weighted label sum = -105.5
average loss = 0.423404
best constant = -0.0033901
total feature number = 2451800
The training examples are all distinct, so I doubt there is an over-fitting problem (which, as I understand it, usually happens when the number of input examples is too small compared with the number of features).
EDIT2:
Tried to print the average loss for every pass of examples, and see that it mostly remains constant.
$ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.498822 0.498822 3112 3112.0 0.0800 0.0015 79 h
0.476677 0.454595 6224 6224.0 -0.2200 -0.0085 79 h
0.466413 0.445856 9336 9336.0 0.0200 -0.0022 79 h
0.490221 0.561506 12448 12448.0 0.0700 -0.1113 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
Also another try without the --l1, --l2 and -b parameters:
$ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.520286 0.520286 3112 3112.0 0.0800 -0.0021 79 h
0.488581 0.456967 6224 6224.0 -0.2200 -0.0137 79 h
0.474247 0.445538 9336 9336.0 0.0200 -0.0299 79 h
0.496580 0.563450 12448 12448.0 0.0700 -0.1727 79 h
0.533413 0.680958 15560 15560.0 -0.1700 0.0322 79 h
0.524531 0.480201 18672 18672.0 -0.9800 -0.0573 79 h
finished run
number of examples per pass = 2801
passes used = 7
weighted example sum = 19608
weighted label sum = -212.58
average loss = 0.491739 h
best constant = -0.0108415
total feature number = 1544713
Does that mean it is normal for the average loss to go up during a single pass, and that it is fine as long as multiple passes reach the same loss?
1) The model file stores only non-zero weights, so the other weights most likely ended up at zero, especially since you are using --l1.
2) There can be many reasons. Perhaps your dataset isn't shuffled well enough. If the dataset is sorted so that examples labeled -1 are in the first half and examples labeled 1 are in the second, the model will show very good convergence on the first half, but you'll see the average loss jump when it reaches the second half. So it may be an imbalance in the dataset. As for the last two losses: these are holdout losses (marked with 'h' at the end of the line) and may indicate that the model is overfitted. Please refer to my other answer.
3) In the master branch, -b 32 is currently blocked; you can use up to -b 31. In practice, -b 24 to -b 28 is usually enough, even for tens of thousands of features.
I would recommend getting an up-to-date VW version from GitHub.
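When reading the progress table it helps to separate the two loss columns: "average loss" appears to be the running mean over all examples seen so far, while "since last" covers only the examples since the previous printed line. The sketch below, in plain Python with a hypothetical helper name, rebuilds the first column from the second for the first run in the question. It assumes unit example weights, which holds there because the counter and weight columns are equal.

```python
def running_average(progress):
    """Rebuild the cumulative average loss from (counter, since_last) pairs.

    Assumes every example has weight 1, so the number of examples in a
    block is the difference between consecutive counters.
    """
    averages = []
    total_loss = 0.0
    previous_counter = 0
    for counter, since_last in progress:
        block_size = counter - previous_counter
        total_loss += since_last * block_size
        averages.append(total_loss / counter)
        previous_counter = counter
    return averages


# (example counter, since-last loss) copied from the first run, before the 'h' lines
progress = [(1, 0.040000), (2, 0.062310), (4, 0.042056), (8, 0.057715),
            (16, 0.077711), (32, 0.056079), (64, 0.213358), (128, 0.166452),
            (256, 0.116238), (512, 0.226024), (1024, 0.038597), (2048, 1.049425)]
# "average loss" column from the same lines
reported = [0.040000, 0.051155, 0.046606, 0.052160, 0.064936, 0.060507,
            0.136933, 0.151692, 0.133965, 0.179995, 0.109296, 0.579360]

recomputed = running_average(progress)
for (counter, _), ours, theirs in zip(progress, recomputed, reported):
    print(f"{counter:5d}  recomputed={ours:.6f}  reported={theirs:.6f}")
```

Seen this way, the jump in average loss at counter 2048 comes entirely from the block of examples 1025 to 2048, whose since-last loss is far higher than that of any earlier block.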
