Getting uncalibrated probability outputs with Vowpal Wabbit for ad-conversion prediction

I'm trying to use Vowpal Wabbit to predict the conversion rate for display ads, and I'm getting non-intuitive probability outputs, centered at around 36% even though the global frequency of the positive class is less than 1%.
The positive/negative imbalance in my dataset is 1/100 (I already undersampled the negative class), so I put a weight of 100 on the positive examples.
Negative examples have label -1 and positive ones 1. I used shuf to shuffle the positive and negative examples together so that online learning works properly.
Sample lines in the vw file:
1 100 'c4ac3440|i search_delay_log:3.58351893846 click_count_log:3.58351893846 banner_impression_count_log:3.98898404656 |c es i_type_2 xvertical_1_61 vertical_1 creat_size_728x90 retargeting
-1 1 'a4d25cf1|i search_delay_log:11.2825684591 click_count_log:11.2825684591 banner_impression_count_log:4.48863636973 |c br i_type_1 xvertical_1_960 vertical_1 creat_size_300x600 retargeting
Now I use the following to create a model from a training set:
vw -d impressions_rand.aa --loss_function logistic -c -k --passes 12 -f model.vw
Output:
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = impressions_rand.aa.cache
Reading datafile = impressions_rand.aa
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 -1.0000 0.0000 11
0.510760 0.328374 2 2.0 -1.0000 -0.9449 11
0.387521 0.264282 4 4.0 -1.0000 -1.1825 11
1.765374 1.818883 8 107.0 1.0000 -1.7020 11
2.152669 2.444504 51 249.0 1.0000 -3.2953 11
1.289870 0.427071 201 498.0 -1.0000 -3.5498 11
0.878843 0.528943 588 1083.0 1.0000 -1.3394 9
0.852358 0.825872 1176 2166.0 -1.0000 -6.7918 11
0.871977 0.891597 2451 4332.0 -1.0000 -2.7031 11
0.689428 0.506878 4110 8664.0 -1.0000 -2.7525 11
0.638008 0.586589 8517 17328.0 -1.0000 -5.8017 11
0.580220 0.522713 17515 34741.0 1.0000 2.1519 11
0.526281 0.472343 35525 69482.0 -1.0000 -6.2931 9
0.497601 0.468921 71050 138964.0 -1.0000 -7.6245 9
0.479305 0.461008 143585 277928.0 -1.0000 -0.8296 11
0.443734 0.443734 288655 555856.0 -1.0000 -2.5795 11 h
0.438806 0.433925 578181 1111791.0 1.0000 0.8503 11 h
finished run
number of examples per pass = 216000
passes used = 5
weighted example sum = 2072475.000000
weighted label sum = -67475.000000
average loss = 0.432676 h
best constant = -0.065138
best constant's loss = 0.692617
total feature number = 11548690
Now to predict on a test set. The --link logistic should transform the vw outputs to probabilities in the range [0, 1].
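(As far as I understand, the logistic link just applies the sigmoid to vw's raw score; here is a quick Python check on a few raw predictions from the training log above, not vw code:)
import math

def to_probability(raw_score):
    # my understanding: --link logistic maps vw's raw margin to [0, 1] via the sigmoid
    return 1.0 / (1.0 + math.exp(-raw_score))

for s in (-6.7918, -2.7031, 0.8503, 2.1519):   # raw predictions seen in the training log
    print(f"{s:+.4f} -> {to_probability(s):.4f}")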
vw -d impressions_rand.ab --link logistic -i model.vw -p preds_ab.txt
Output:
predictions = preds_ab.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average since example example current current current
loss last counter weight label predict features
68.282379 68.282379 1 1.0 -1.0000 0.0001 9
38.748867 9.215355 2 2.0 -1.0000 0.0174 11
21.256140 3.763414 4 4.0 -1.0000 0.8345 11
11.685329 2.114518 8 8.0 -1.0000 0.3508 11
9.457854 7.230378 16 16.0 -1.0000 0.0069 11
7.371087 5.284320 32 32.0 -1.0000 0.3561 11
7.061980 6.752873 64 64.0 -1.0000 0.6549 11
5.423309 3.784638 128 128.0 -1.0000 0.2597 11
3.252394 1.725597 211 310.0 1.0000 0.7686 11
2.140099 1.052366 330 627.0 1.0000 0.7143 11
1.671550 1.203000 660 1254.0 -1.0000 0.8054 11
1.788466 1.905383 1320 2508.0 -1.0000 0.0676 9
1.508163 1.234410 2502 5076.0 1.0000 0.3921 11
1.282862 1.060063 5061 10209.0 1.0000 0.4258 9
1.119420 0.955977 11013 20418.0 -1.0000 0.6892 11
1.017911 0.916403 22323 40836.0 -1.0000 0.5301 9
0.888435 0.758960 42171 81672.0 -1.0000 0.3500 11
0.787709 0.686983 84243 163344.0 -1.0000 0.2360 9
0.703270 0.618831 170268 326688.0 -1.0000 0.5707 11
finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 397936.000000
weighted label sum = -12936.000000
average loss = 0.684043
best constant = -0.032508
best constant's loss = 0.998943
total feature number = 2216941
This gives me a predictions file preds_ab.txt like:
0.000095 7c14ae23
0.017367 3e9558bd
0.139393 6a1cd72f
0.834518 dfe76f6e
0.089810 2b88b547
If I calculate the ROC-AUC score of these predictions, I get a value of 0.85, which is close to what I get using scikit-learn (0.90). However, the probability outputs are not calibrated at all: they are much higher than what I would expect (close to 1%). This is the histogram:
This is the reliability curve:
And this is a plot of mean probabilities and positive frequencies when examples are binned by probabilities:
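For reference, the AUC and these calibration plots come from something along these lines (a rough sketch; preds_ab.txt is the file shown above, and labels_ab.txt is a hypothetical file with the true 0/1 labels in the same order):
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

probs = np.loadtxt("preds_ab.txt", usecols=0)   # first column: predicted probability
y_true = np.loadtxt("labels_ab.txt")            # hypothetical: true 0/1 labels, same order

print("ROC-AUC:", roc_auc_score(y_true, probs))

# reliability curve: observed positive frequency vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_true, probs, n_bins=10)
for m, f in zip(mean_pred, frac_pos):
    print(f"mean predicted {m:.3f} -> observed positive rate {f:.3f}")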
It's obvious that output probabilities are much higher than what would be expected from a well-calibrated classifier.
What am I doing wrong here? What should I investigate?
UPDATE
If I don't use a weight of 100 for the positive-class examples, I get similarly non-intuitive results. The mean probability output is 0.27 (still very far from the ~1% I'd expect), the reliability plot looks even worse, and the ROC-AUC is 0.76.
I can confirm I have 237805 negative examples and 2195 positive ones.
Output training:
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = impressions_rand.aa.cache
Reading datafile = impressions_rand.aa
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 -1.0000 0.0000 11
0.546724 0.400300 2 2.0 -1.0000 -0.7087 11
0.398553 0.250382 4 4.0 -1.0000 -1.3963 11
0.284506 0.170460 8 8.0 -1.0000 -2.2595 11
0.181406 0.078306 16 16.0 -1.0000 -2.8225 11
0.108136 0.034865 32 32.0 -1.0000 -4.2696 11
0.063156 0.018176 64 64.0 -1.0000 -4.7412 11
0.036415 0.009675 128 128.0 -1.0000 -4.2940 11
0.020325 0.004235 256 256.0 -1.0000 -5.9903 11
0.043248 0.066171 512 512.0 -1.0000 -5.5540 11
0.045276 0.047304 1024 1024.0 -1.0000 -4.7065 11
0.044606 0.043935 2048 2048.0 -1.0000 -6.6253 11
0.048938 0.053270 4096 4096.0 -1.0000 -5.9119 11
0.048711 0.048485 8192 8192.0 -1.0000 -2.3949 11
0.048157 0.047603 16384 16384.0 -1.0000 -9.6219 11
0.044306 0.040454 32768 32768.0 -1.0000 -8.8800 11
0.044029 0.043752 65536 65536.0 -1.0000 -5.9218 9
0.042739 0.041450 131072 131072.0 -1.0000 -3.8306 11
0.042986 0.042986 262144 262144.0 -1.0000 -6.0941 11 h
0.042321 0.041655 524288 524288.0 -1.0000 -4.0276 11 h
0.042654 0.042988 1048576 1048576.0 -1.0000 -9.9169 11 h
finished run
number of examples per pass = 216000
passes used = 7
weighted example sum = 1512000.000000
weighted label sum = -1484504.000000
average loss = 0.042763 h
best constant = -4.691161
best constant's loss = 0.051789
total feature number = 16166472
Testing output follows. I've read that an average loss larger than the best constant's loss is an indicator that something is wrong with the model's learning.
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average since example example current current current
loss last counter weight label predict features
78.141266 78.141266 1 1.0 -1.0000 0.0001 11
54.228148 30.315029 2 2.0 -1.0000 0.0015 11
33.279501 12.330854 4 4.0 1.0000 0.0472 11
20.358767 7.438034 8 8.0 -1.0000 0.0527 11
15.780043 11.201319 16 16.0 -1.0000 0.1657 11
13.783271 11.786498 32 32.0 -1.0000 0.0012 9
9.318714 4.854158 64 64.0 -1.0000 0.7268 11
6.797651 4.276587 128 128.0 -1.0000 0.1404 9
4.674237 2.550824 256 256.0 -1.0000 0.0516 11
3.269198 1.864159 512 512.0 -1.0000 0.4092 11
2.153033 1.036868 1024 1024.0 -1.0000 0.0425 11
1.481920 0.810807 2048 2048.0 -1.0000 0.2792 11
1.005869 0.529817 4096 4096.0 -1.0000 0.2422 11
0.676574 0.347279 8192 8192.0 -1.0000 0.3003 11
0.452924 0.229274 16384 16384.0 -1.0000 0.2579 11
0.295262 0.137600 32768 32768.0 -1.0000 0.2833 11
0.191513 0.087763 65536 65536.0 -1.0000 0.2616 9
0.126758 0.062003 131072 131072.0 -1.0000 0.2670 11
finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 207361.000000
weighted label sum = -203423.000000
average loss = 0.099565
best constant = -0.981009
best constant's loss = 0.037621
total feature number = 2217159

You say you have one positive example per 100 negative examples on average in the training set. However, you put 100 times more weight on the positive examples, which is (almost) equivalent to repeating each positive example 100 times in the training set. This way, the average predicted probability should be around 50%, so you should not be surprised that it is not around 1%.
According to the vw output you provided, it seems there are more than 100 negative examples per positive example in the training set impressions_rand.aa, so the "weighted label sum" is negative (otherwise it would be around 0). That is why the average predicted probability is not 50% but around 36%.
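A quick back-of-the-envelope check of the weighting effect (plain Python, using the roughly 1:100 ratio from the question):
# ~1 positive per 100 negatives, positives weighted by 100
pos_weight_share = (1 * 100) / (1 * 100 + 100 * 1)
print(pos_weight_share)   # 0.5: the weighted "base rate" the model sees is ~50%, not ~1%
# with a bit more than 100 negatives per positive, this share drops below 50%,
# which is consistent with the negative weighted label sum in the training output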

I solved it thanks to Martin Popel's and arielf's comments. :)
I forgot to use -t when generating the predictions.
I didn't specify --loss_function logistic when generating the predictions.
As a result, the model was being updated while testing, using the default squared loss instead of the logistic one, which corrupted the model and produced wrong results.
Takeaways:
Use --loss_function logistic also during test to see correct loss outputs.
Remember to use -t if you don't want to update your model while predicting.
This is how the output looks now when testing (without example weighting):
$ vw -d impressions_rand.ab --link logistic --loss_function logistic -i model.vw -t -p preds.txt
only testing
predictions = preds.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.000053 0.000053 1 1.0 -1.0000 0.0001 11
0.000370 0.000687 2 2.0 -1.0000 0.0007 11
1.252868 2.505366 4 4.0 1.0000 0.0067 11
0.638249 0.023630 8 8.0 -1.0000 0.0036 11
0.322060 0.005872 16 16.0 -1.0000 0.0031 11
0.164750 0.007439 32 32.0 -1.0000 0.0000 9
0.084911 0.005072 64 64.0 -1.0000 0.0081 11
0.076905 0.068899 128 128.0 -1.0000 0.0004 9
0.055126 0.033347 256 256.0 -1.0000 0.0000 11
0.052986 0.050847 512 512.0 -1.0000 0.0133 11
0.038351 0.023715 1024 1024.0 -1.0000 0.0000 11
0.037059 0.035767 2048 2048.0 -1.0000 0.0167 11
0.038848 0.040637 4096 4096.0 -1.0000 0.0112 11
0.038903 0.038957 8192 8192.0 -1.0000 0.0281 11
0.041625 0.044348 16384 16384.0 -1.0000 0.0001 11
0.042526 0.043426 32768 32768.0 -1.0000 0.0218 11
0.042538 0.042551 65536 65536.0 -1.0000 0.0000 9
0.042150 0.041763 131072 131072.0 -1.0000 0.0019 11
finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 207361.000000
weighted label sum = -203423.000000
average loss = 0.042438
best constant = -4.647395
best constant's loss = 0.053670
total feature number = 2217159
You can see that the reported average loss is now less than the best constant's loss, and the running average losses also lie in the expected range.
Also, the output probabilities now make perfect sense.

Related

Amazon QuickSight - Running Difference

I have the following table and want to add a column with running difference.
time_gap  amounts
0         150
0.5       19
1.5       2
6         10
7         4
My desired output is:
time_gap  amounts  diff
0         150      150
0.5       19       131
1.5       2        129
6         10       119
7         4        115
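In other words, the diff column keeps subtracting each new amount from the previous diff (a plain Python sketch of the arithmetic, not a QuickSight expression):
amounts = [150, 19, 2, 10, 4]
diff = []
for a in amounts:
    diff.append(a if not diff else diff[-1] - a)
print(diff)   # [150, 131, 129, 119, 115]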
What I've tried:
I duplicated the amounts column and used the 'difference' table calculation, but got the difference between consecutive rows instead:
time_gap  amounts  diff
0         150
0.5       19       -131
1.5       2        -17
6         10       8
7         4        -6
I tried some calculated-field formulas, but that didn't work either.
Thank you!

Error while using .cache file with vowpal wabbit

I am trying the examples given in the Vowpal Wabbit tutorial, but I am getting an error when using the *.cache file for training. Error: 6 is too many tokens for a simple label: 8.3.0c�?�p�k>���>���L=��O�?#
second_house�p�Q8>�ޙ�>�33�>��O�??
third_house�p�?��
V$ cat house_dataset
0 | price:.23 sqft:.25 age:.05 2006
1 2 'second_house | price:.18 sqft:.15 age:.35 1976
0 1 0.5 'third_house | price:.53 sqft:.32 age:.87 1924
V$ ls -lrth
total 4.0K
-rw-r--r-- 1 A users 144 May 3 06:28 house_dataset
V$ vw --version
8.3.0
V$ vw house_dataset -c
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
creating cache_file = house_dataset.cache
Reading datafile = house_dataset
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.000000 0.000000 1 1.0 0.0000 0.0000 5
0.666667 1.000000 2 3.0 1.0000 0.0000 5
finished run
number of examples per pass = 4
passes used = 1
weighted example sum = 5.000000
weighted label sum = 2.000000
average loss = 0.600000
best constant = 0.500000
best constant's loss = 0.250000
total feature number = 16
V$ vw house_dataset.cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = house_dataset.cache
num sources = 1
average since example example current current current
loss last counter weight label predict features
Error: 6 is too many tokens for a simple label: 8.3.0c�?�p�k>���>���L=��O�?#
second_house�p�Q8>�ޙ�>�33�>��O�??
third_house�p�?��
0.000000 0.000000 1 1.0 unknown 0.0000 1
0.000000 0.000000 2 2.0 unknown 0.0000 1
finished run
number of examples per pass = 2
passes used = 1
weighted example sum = 2.000000
weighted label sum = 0.000000
average loss = 0.000000
total feature number = 2
It should be:
$ vw --cache_file house_dataset.cache
The cache is a binary file, so passing it as the regular datafile makes vw try to parse its binary contents as text examples, which is what produces the "too many tokens for a simple label" error above. You can check the command-line arguments description here.

Why average loss goes up when training using Vowpal Wabbit

I tried to use VW to train a regression model on a small set of examples (about 3112). I think I'm doing it correctly, yet it showed me weird results. I dug around but didn't find anything helpful.
$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.064936 0.077711 16 16.0 -0.1800 0.0547 77
0.060507 0.056079 32 32.0 0.0000 0.3164 79
0.136933 0.213358 64 64.0 -0.5900 -0.0850 79
0.151692 0.166452 128 128.0 0.0700 0.0060 79
0.133965 0.116238 256 256.0 0.0900 -0.0446 78
0.179995 0.226024 512 512.0 0.3700 -0.0217 79
0.109296 0.038597 1024 1024.0 0.1200 -0.0728 79
0.579360 1.049425 2048 2048.0 -0.3700 -0.0084 79
0.485389 0.485389 4096 4096.0 1.9600 0.3934 79 h
0.517748 0.550036 8192 8192.0 0.0700 0.0334 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
$ wc model
41 48 657 model
Questions:
Why is the number of features in the output (readable) model less than the number of actual features? I counted 78 features in the training data (79 with the bias term, as shown during training). The number of feature bits is 24, which should be far more than enough to avoid collisions.
Why does the average loss actually go up during training, as you can see in the example above?
(Minor) I tried increasing the number of feature bits to 32, and it output an empty model. Why?
EDIT:
I tried shuffling the input file, as well as using --holdout_off, as suggested. But the result is still almost the same: the average loss goes up.
$ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.071332 0.090504 16 16.0 0.0300 0.1203 79
0.043720 0.016108 32 32.0 -0.2200 -0.1971 78
0.142895 0.242071 64 64.0 0.0100 -0.1531 79
0.158564 0.174232 128 128.0 0.0500 -0.0439 79
0.150691 0.142818 256 256.0 0.3200 0.1466 79
0.197050 0.243408 512 512.0 0.2300 -0.0459 79
0.117398 0.037747 1024 1024.0 0.0400 0.0284 79
0.636949 1.156501 2048 2048.0 1.2500 -0.0152 79
0.363364 0.089779 4096 4096.0 0.1800 0.0071 79
0.477569 0.591774 8192 8192.0 -0.4800 0.0065 79
0.411068 0.344567 16384 16384.0 0.0700 0.0450 77
finished run
number of examples per pass = 3112
passes used = 10
weighted example sum = 31120
weighted label sum = -105.5
average loss = 0.423404
best constant = -0.0033901
total feature number = 2451800
The training examples are all unique, so I doubt there is an over-fitting problem (which, as I understand it, usually happens when the number of examples is too small compared to the number of features).
EDIT2:
I tried printing the average loss after every pass over the examples, and it mostly remains constant.
$ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.498822 0.498822 3112 3112.0 0.0800 0.0015 79 h
0.476677 0.454595 6224 6224.0 -0.2200 -0.0085 79 h
0.466413 0.445856 9336 9336.0 0.0200 -0.0022 79 h
0.490221 0.561506 12448 12448.0 0.0700 -0.1113 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
Also another try without the --l1, --l2 and -b parameters:
$ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.520286 0.520286 3112 3112.0 0.0800 -0.0021 79 h
0.488581 0.456967 6224 6224.0 -0.2200 -0.0137 79 h
0.474247 0.445538 9336 9336.0 0.0200 -0.0299 79 h
0.496580 0.563450 12448 12448.0 0.0700 -0.1727 79 h
0.533413 0.680958 15560 15560.0 -0.1700 0.0322 79 h
0.524531 0.480201 18672 18672.0 -0.9800 -0.0573 79 h
finished run
number of examples per pass = 2801
passes used = 7
weighted example sum = 19608
weighted label sum = -212.58
average loss = 0.491739 h
best constant = -0.0108415
total feature number = 1544713
Does that mean it's normal for the average loss to go up within one pass, but it's fine as long as multiple passes converge to the same loss?
The model file stores only non-zero weights, so most likely the others were zeroed out, especially since you are using --l1.
It may be caused by many reasons. Perhaps your dataset isn't shuffled well enough. If you sort your dataset so that the examples labeled -1 are in the first half and the examples labeled 1 in the second half, the model will show very good convergence on the first half, but you'll see the average loss bump up as it reaches the second half. So it may be class imbalance in the dataset. As for the last two losses: these are holdout losses (marked with an 'h' at the end of the line) and may indicate that the model is overfitting. Please refer to my other answer.
Well, in the master branch the use of -b 32 is currently blocked; you can use up to -b 31. In practice, -b 24 to -b 28 is usually enough even for tens of thousands of features.
I would recommend getting an up-to-date VW version from GitHub.

MATLAB Greyscale 12 bit to 8 bit

I'm trying to create an algorithm to convert a greyscale image from 12-bit to 8-bit.
I have a greyscale like this one:
The scale is represented in a matrix. The problem is that the simple multiplication by 1/16 destroys the first grey columns.
Here is the code example:
in =[
1 1 1 3 3 3 15 15 15 63 63 63;
1 1 1 3 3 3 15 15 15 63 63 63;
1 1 1 3 3 3 15 15 15 63 63 63;
1 1 1 3 3 3 15 15 15 63 63 63
];
[zeilen spalten] = size(in);
eight = round(in/16);
imshow(uint8(eight));
By "destroy" I mean that the new columns are black now.
Simply rescale the image so that you divide every single element by the maximum possible intensity that corresponds to a 12-bit (or 2^12 - 1 = 4095) unsigned integer and then multiply by the maximum possible intensity that corresponds to an 8-bit unsigned integer (or 2^8 - 1 = 255).
Therefore:
out = uint8((255.0/4095.0)*(double(in)));
You need to cast to double to maintain floating-point precision when performing this scaling, and then cast to uint8 so that the image type is guaranteed to be 8-bit. You have cleverly deduced that this scaling factor is roughly 1/16 (since 255.0/4095.0 ~ 1/16). However, the first 6 columns of your test image will surely be zero, because intensities of 1 and 3 in a 12-bit image are simply too small to be represented in the equivalent 8-bit form, so they get rounded down to 0. If you think about it, every increase of roughly 16 in 12-bit intensity registers as a single intensity increase in the 8-bit image, or:
12-bit --> 8-bit
0 --> 0
15 --> 1
31 --> 2
47 --> 3
63 --> 4
... --> ...
4095 --> 255
Because your values of 1 and 3 are not high enough to get to the next level, these get rounded down to 0. However, your values of 15 get mapped to 1, and the values of 63 get mapped to 4, which is what we expect when you run the above code on your test input.
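Independent of MATLAB, the arithmetic is easy to verify with a couple of lines of Python:
# 12-bit -> 8-bit linear rescale with rounding, same arithmetic as the MATLAB one-liner above
for v in (0, 1, 3, 15, 31, 47, 63, 4095):
    print(v, "->", round(v * 255.0 / 4095.0))
# 1 and 3 round down to 0, while 15 -> 1, 63 -> 4 and 4095 -> 255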

Precision, Recall and Correctly Classified Instances

How do I calculate the Precision and Recall for the yes and no classes?
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.444 0.8 0.5 0.444 0.471 0.522 yes
0.2 0.556 0.167 0.2 0.182 0.522 no
Weighted Avg. 0.357 0.713 0.381 0.357 0.367 0.522
and Correctly Classified Instances is 35.714%
Data weather with Weka
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
No. outlook temperature humidity windy play
1 sunny 85.0 85.0 FALSE no
2 sunny 80.0 90.0 TRUE no
3 overcast 83.0 86.0 FALSE yes
4 rainy 70.0 96.0 FALSE yes
5 rainy 68.0 80.0 FALSE yes
6 rainy 65.0 70.0 TRUE no
7 overcast 64.0 65.0 TRUE yes
8 sunny 72.0 95.0 FALSE no
9 sunny 69.0 70.0 FALSE yes
10 rainy 75.0 80.0 FALSE yes
11 sunny 75.0 70.0 TRUE yes
12 overcast 72.0 90.0 TRUE yes
13 overcast 81.0 75.0 FALSE yes
14 rainy 71.0 91.0 TRUE no
===========================
=== Run information ===
Scheme:weka.classifiers.rules.PART -M 2 -C 0.25 -Q 1
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
PART decision list
------------------
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules : 4
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 5 35.7143 %
Incorrectly Classified Instances 9 64.2857 %
Kappa statistic -0.3404
Mean absolute error 0.5518
Root mean squared error 0.6935
Relative absolute error 115.875 %
Root relative squared error 140.5649 %
Total Number of Instances 14
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.444 0.8 0.5 0.444 0.471 0.522 yes
0.2 0.556 0.167 0.2 0.182 0.522 no
Weighted Avg. 0.357 0.713 0.381 0.357 0.367 0.522
=== Confusion Matrix ===
a b <-- classified as
4 5 | a = yes
4 1 | b = no
Thanks and best regards
From the confusion matrix:
=== Confusion Matrix ===
a b <-- classified as
4 5 | a = yes
4 1 | b = no
The Precision is computed as 4/8, i.e. the number of correctly classified a (yes) divided by the number of instances predicted as a, while the Recall is 4/9, the number of correctly classified a (yes) divided by the total number of true a. The precision and recall for the other class are computed analogously.
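A quick worked check of those numbers in Python (rows of the confusion matrix are the actual classes, columns the predicted ones):
# confusion matrix:        predicted yes  predicted no
#   actual yes (9 total):        4             5
#   actual no  (5 total):        4             1
precision_yes = 4 / (4 + 4)   # 0.5   (correct yes / all predicted yes)
recall_yes    = 4 / (4 + 5)   # 0.444 (correct yes / all actual yes)
precision_no  = 1 / (1 + 5)   # 0.167 (correct no  / all predicted no)
recall_no     = 1 / (1 + 4)   # 0.2   (correct no  / all actual no)
print(precision_yes, recall_yes, precision_no, recall_no)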
See the definitions of all those criteria in one single cheatsheet.
