Error while using .cache file with vowpal wabbit - vowpalwabbit

I am trying the examples given in the Vowpal Wabbit tutorial, but I am getting an error when training from a *.cache file. Error: 6 is too many tokens for a simple label: 8.3.0c�?�p�k>���>���L=��O�?#
second_house�p�Q8>�ޙ�>�33�>��O�??
third_house�p�?��
V$ cat house_dataset
0 | price:.23 sqft:.25 age:.05 2006
1 2 'second_house | price:.18 sqft:.15 age:.35 1976
0 1 0.5 'third_house | price:.53 sqft:.32 age:.87 1924
V$ ls -lrth
total 4.0K
-rw-r--r-- 1 A users 144 May 3 06:28 house_dataset
V$ vw --version
8.3.0
V$ vw house_dataset -c
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
creating cache_file = house_dataset.cache
Reading datafile = house_dataset
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.000000 0.000000 1 1.0 0.0000 0.0000 5
0.666667 1.000000 2 3.0 1.0000 0.0000 5
finished run
number of examples per pass = 4
passes used = 1
weighted example sum = 5.000000
weighted label sum = 2.000000
average loss = 0.600000
best constant = 0.500000
best constant's loss = 0.250000
total feature number = 16
V$ vw house_dataset.cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = house_dataset.cache
num sources = 1
average since example example current current current
loss last counter weight label predict features
Error: 6 is too many tokens for a simple label: 8.3.0c�?�p�k>���>���L=��O�?#
second_house�p�Q8>�ޙ�>�33�>��O�??
third_house�p�?��
0.000000 0.000000 1 1.0 unknown 0.0000 1
0.000000 0.000000 2 2.0 unknown 0.0000 1
finished run
number of examples per pass = 2
passes used = 1
weighted example sum = 2.000000
weighted label sum = 0.000000
average loss = 0.000000
total feature number = 2

It should be
$ vw --cache_file house_dataset.cache
When the cache file is passed as a plain data file (vw house_dataset.cache), vw tries to parse its binary contents as text, which produces the garbage "labels" shown in the error. You can check the command-line arguments description here.
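For reference, the full sequence then looks like:
V$ vw house_dataset -c
V$ vw --cache_file house_dataset.cache
The first command trains from the text file and writes house_dataset.cache; the second reads the binary cache correctly instead of misparsing it as text.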

Related

H2o: Is there a way to fix threshold in H2ORandomForestEstimator performance during training and testing?

I have built a model with H2ORandomForestEstimator and the results show something like this below.
The threshold keeps changing (0.5 from training and 0.313725489027 from validation), and I would like to fix the threshold in H2ORandomForestEstimator for comparison during fine-tuning. Is there a way to set the threshold?
From http://h2o-release.s3.amazonaws.com/h2o/master/3484/docs-website/h2o-py/docs/modeling.html#h2orandomforestestimator, there is no such parameter.
If there is no way to set this, how do we know what threshold our model is built on?
rf_v1
** Reported on train data. **
MSE: 2.75013548238e-05
RMSE: 0.00524417341664
LogLoss:0.000494320913199
Mean Per-Class Error: 0.0188802936476
AUC: 0.974221763605
Gini: 0.948443527211
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.5:
0 1 Error Rate
----- ------ --- ------- --------------
0 161692 1 0 (1.0/161693.0)
1 3 50 0.0566 (3.0/53.0)
Total 161695 51 0 (4.0/161746.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- -------- -----
max f1 0.5 0.961538 19
max f2 0.25 0.955056 21
max f0point5 0.571429 0.983936 18
max accuracy 0.571429 0.999975 18
max precision 1 1 0
max recall 0 1 69
max specificity 1 1 0
max absolute_mcc 0.5 0.961704 19
max min_per_class_accuracy 0.25 0.962264 21
max mean_per_class_accuracy 0.25 0.98112 21
Gains/Lift Table: Avg response rate: 0.03 %
** Reported on validation data. **
MSE: 1.00535766226e-05
RMSE: 0.00317073755183
LogLoss: 4.53885183426e-05
Mean Per-Class Error: 0.0
AUC: 1.0
Gini: 1.0
Confusion Matrix (Act/Pred) for max f1 # threshold = 0.313725489027:
0 1 Error Rate
----- ----- --- ------- -------------
0 53715 0 0 (0.0/53715.0)
1 0 16 0 (0.0/16.0)
Total 53715 16 0 (0.0/53731.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- ------- -----
max f1 0.313725 1 5
max f2 0.313725 1 5
max f0point5 0.313725 1 5
max accuracy 0.313725 1 5
max precision 1 1 0
max recall 0.313725 1 5
max specificity 1 1 0
max absolute_mcc 0.313725 1 5
max min_per_class_accuracy 0.313725 1 5
max mean_per_class_accuracy 0.313725 1 5
The threshold is max-F1.
If you want to apply your own threshold, you will have to take the probability of the positive class and compare it yourself to produce the label you want.
If you use your web browser to connect to the H2O Flow Web UI inside of H2O-3, you can mouse over the ROC curve and visually browse the confusion matrix for each threshold, which is convenient.
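For example, a minimal sketch using the h2o R client (rf_v1 and valid are hypothetical model/frame names; for a binomial model h2o.predict returns the columns predict, p0 and p1):
pred <- h2o.predict(rf_v1, valid)       # per-row class probabilities
fixed_threshold <- 0.5                  # the threshold you want to hold fixed
my_label <- pred$p1 > fixed_threshold   # 1/0 label from the positive-class probability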

deeplearning and deepwater models give very different logloss (0.4 vs 0.6)

In AWS, I followed the instructions here and launched a g2.2xlarge EC2 instance using the community AMI ami-97591381 (h2o version: 3.13.0.356).
This is my code, which you can run as I made the S3 links public:
library(h2o)
library(jsonlite)
library(curl)
localH2O = h2o.init()
df.truth <- h2o.importFile("https://s3.amazonaws.com/nw.data.test.us.east/df.truth.zeroed", header = T, sep=",")
df.truth$isFemale <- h2o.asfactor(df.truth$isFemale)
hotnames.truth <- fromJSON("https://s3.amazonaws.com/nw.data.test.us.east/hotnames.json", simplifyVector = T)
# Training and validation sets
splits <- h2o.splitFrame(df.truth, c(0.9), seed=1234)
train.truth <- h2o.assign(splits[[1]], "train.truth.hex")
valid.truth <- h2o.assign(splits[[2]], "valid.truth.hex")
# Train a model using non-GPU deeplearning
dl.2 <- h2o.deeplearning(
training_frame = train.truth, model_id="dl.2",
validation_frame = valid.truth,
x=setdiff(hotnames.truth[1:(length(hotnames.truth)/2)], c("isFemale", "nwtcs")),
y="isFemale", stopping_metric = "AUTO", seed = 1,
sparse = F, mini_batch_size = 20)
# Train a model using GPU-enabled deepwater
dw.2 <- h2o.deepwater(
training_frame = train.truth, model_id="dw.2",
validation_frame = valid.truth,
x=setdiff(hotnames.truth[1:(length(hotnames.truth)/2)], c("isFemale", "nwtcs")),
y="isFemale", stopping_metric = "AUTO", seed = 1,
sparse = F, mini_batch_size = 20)
When I inspect the two models, to my surprise I see a large difference in logloss:
Non-GPU
print(dl.2)
Model Details:
==============
H2OBinomialModel: deeplearning
Model ID: dl.2
Status of Neuron Layers: predicting isFemale, 2-class classification, bernoulli distribution, CrossEntropy loss, 160,802 weights/biases, 2.0 MB, 1,041,465 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum
1 1 600 Input 0.00 %
2 2 200 Rectifier 0.00 % 0.000000 0.000000 0.104435 0.102760 0.000000
3 3 200 Rectifier 0.00 % 0.000000 0.000000 0.031395 0.055490 0.000000
4 4 2 Softmax 0.000000 0.000000 0.001541 0.001438 0.000000
mean_weight weight_rms mean_bias bias_rms
1
2 0.018904 0.144034 0.150630 0.415525
3 -0.023333 0.081914 0.545394 0.251275
4 0.029091 0.295439 -0.004396 0.357609
H2OBinomialMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 9877 samples **
MSE: 0.1213733
RMSE: 0.3483868
LogLoss: 0.388214
Mean Per-Class Error: 0.2563669
AUC: 0.8433182
Gini: 0.6866365
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 6546 1079 0.141508 =1079/7625
1 836 1416 0.371226 =836/2252
Totals 7382 2495 0.193885 =1915/9877
H2OBinomialMetrics: deeplearning
** Reported on validation data. **
** Metrics reported on full validation frame **
MSE: 0.126671
RMSE: 0.3559087
LogLoss: 0.4005941
Mean Per-Class Error: 0.2585051
AUC: 0.8309913
Gini: 0.6619825
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 11746 3134 0.210618 =3134/14880
1 1323 2995 0.306392 =1323/4318
Totals 13069 6129 0.232160 =4457/19198
GPU-enabled
print(dw.2)
Model Details:
==============
H2OBinomialModel: deepwater
Model ID: dw.2b
Status of Deep Learning Model: MLP: [200, 200], 630.8 KB, predicting isFemale, 2-class classification, 1,708,160 training samples, mini-batch size 20
input_neurons rate momentum
1 600 0.000369 0.900000
H2OBinomialMetrics: deepwater
** Reported on training data. **
** Metrics reported on temporary training frame with 9877 samples **
MSE: 0.1615781
RMSE: 0.4019677
LogLoss: 0.629549
Mean Per-Class Error: 0.3467246
AUC: 0.7289561
Gini: 0.4579122
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 4843 2782 0.364852 =2782/7625
1 740 1512 0.328597 =740/2252
Totals 5583 4294 0.356586 =3522/9877
H2OBinomialMetrics: deepwater
** Reported on validation data. **
** Metrics reported on full validation frame **
MSE: 0.1651776
RMSE: 0.4064205
LogLoss: 0.6901861
Mean Per-Class Error: 0.3476629
AUC: 0.7187362
Gini: 0.4374724
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 8624 6256 0.420430 =6256/14880
1 1187 3131 0.274896 =1187/4318
Totals 9811 9387 0.387697 =7443/19198
As seen above, the difference in logloss between the non-GPU and GPU models is huge:
Logloss
+-----------------+---------+------+
|                 | non-GPU | GPU  |
+-----------------+---------+------+
| training data   |  0.39   | 0.63 |
| validation data |  0.40   | 0.69 |
+-----------------+---------+------+
I understand that due to the stochastic nature of the training I will get different results, but I wouldn't expect such a huge difference between non-GPU and GPU.
h2o.deeplearning is H2O's built-in deep-learning algorithm. It parallelizes very well, works well with large data, but does not use GPUs.
h2o.deepwater is a wrapper around (probably) Tensorflow, and (probably) using your GPU (but it can use the CPU, and it can use different back-ends).
In other words, this is not a difference in using the CPU or using the GPU: you are using two different implementations of deep learning.
BTW, I'd suggest you increase the number of epochs (from the default of 10, to something like 200 - bearing in mind this means it will take 20x longer to run), and see if the difference is still there. Or compare the score history charts, and see if Tensorflow is getting there, but just needs, say, 50% more epochs to get the same logloss score.
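A minimal sketch of that suggestion (the same call as dw.2 above with epochs added; 200 is just an illustrative value):
dw.3 <- h2o.deepwater(
  training_frame = train.truth, model_id = "dw.3",
  validation_frame = valid.truth,
  x = setdiff(hotnames.truth[1:(length(hotnames.truth)/2)], c("isFemale", "nwtcs")),
  y = "isFemale", stopping_metric = "AUTO", seed = 1,
  sparse = F, mini_batch_size = 20,
  epochs = 200)   # default is 10, so expect roughly 20x the training time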

Getting uncalibrated probability outputs with Vowpal Wabbit, ad-conversion prediction

I'm trying to use Vowpal Wabbit to predict the conversion rate for ad displays, and I'm getting non-intuitive probability outputs, centered at around 36% when the global frequency of the positive class is less than 1%.
The positive/negative imbalance I have in my dataset is 1/100 (I already undersampled the negative class), so I use a weight of 100 in the positive examples.
Negative examples have label -1, and positive ones 1. I used shuf to shuffle positive and negative examples for online learning to work properly.
Sample lines in the vw file:
1 100 'c4ac3440|i search_delay_log:3.58351893846 click_count_log:3.58351893846 banner_impression_count_log:3.98898404656 |c es i_type_2 xvertical_1_61 vertical_1 creat_size_728x90 retargeting
-1 1 'a4d25cf1|i search_delay_log:11.2825684591 click_count_log:11.2825684591 banner_impression_count_log:4.48863636973 |c br i_type_1 xvertical_1_960 vertical_1 creat_size_300x600 retargeting
Now I use the following to create a model from a training set:
vw -d impressions_rand.aa --loss_function logistic -c -k --passes 12 -f model.vw
Output:
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = impressions_rand.aa.cache
Reading datafile = impressions_rand.aa
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 -1.0000 0.0000 11
0.510760 0.328374 2 2.0 -1.0000 -0.9449 11
0.387521 0.264282 4 4.0 -1.0000 -1.1825 11
1.765374 1.818883 8 107.0 1.0000 -1.7020 11
2.152669 2.444504 51 249.0 1.0000 -3.2953 11
1.289870 0.427071 201 498.0 -1.0000 -3.5498 11
0.878843 0.528943 588 1083.0 1.0000 -1.3394 9
0.852358 0.825872 1176 2166.0 -1.0000 -6.7918 11
0.871977 0.891597 2451 4332.0 -1.0000 -2.7031 11
0.689428 0.506878 4110 8664.0 -1.0000 -2.7525 11
0.638008 0.586589 8517 17328.0 -1.0000 -5.8017 11
0.580220 0.522713 17515 34741.0 1.0000 2.1519 11
0.526281 0.472343 35525 69482.0 -1.0000 -6.2931 9
0.497601 0.468921 71050 138964.0 -1.0000 -7.6245 9
0.479305 0.461008 143585 277928.0 -1.0000 -0.8296 11
0.443734 0.443734 288655 555856.0 -1.0000 -2.5795 11 h
0.438806 0.433925 578181 1111791.0 1.0000 0.8503 11 h
finished run
number of examples per pass = 216000
passes used = 5
weighted example sum = 2072475.000000
weighted label sum = -67475.000000
average loss = 0.432676 h
best constant = -0.065138
best constant's loss = 0.692617
total feature number = 11548690
Now to predict on a test set. The --link logistic should transform the vw outputs to probabilities in the range [0, 1].
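(For reference, the logistic link maps a raw score s to p = 1 / (1 + exp(-s)), which is how the raw linear predictions get squashed into [0, 1].)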
vw -d impressions_rand.ab --link logistic -i model.vw -p preds_ab.txt
Output:
predictions = preds_ab.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average since example example current current current
loss last counter weight label predict features
68.282379 68.282379 1 1.0 -1.0000 0.0001 9
38.748867 9.215355 2 2.0 -1.0000 0.0174 11
21.256140 3.763414 4 4.0 -1.0000 0.8345 11
11.685329 2.114518 8 8.0 -1.0000 0.3508 11
9.457854 7.230378 16 16.0 -1.0000 0.0069 11
7.371087 5.284320 32 32.0 -1.0000 0.3561 11
7.061980 6.752873 64 64.0 -1.0000 0.6549 11
5.423309 3.784638 128 128.0 -1.0000 0.2597 11
3.252394 1.725597 211 310.0 1.0000 0.7686 11
2.140099 1.052366 330 627.0 1.0000 0.7143 11
1.671550 1.203000 660 1254.0 -1.0000 0.8054 11
1.788466 1.905383 1320 2508.0 -1.0000 0.0676 9
1.508163 1.234410 2502 5076.0 1.0000 0.3921 11
1.282862 1.060063 5061 10209.0 1.0000 0.4258 9
1.119420 0.955977 11013 20418.0 -1.0000 0.6892 11
1.017911 0.916403 22323 40836.0 -1.0000 0.5301 9
0.888435 0.758960 42171 81672.0 -1.0000 0.3500 11
0.787709 0.686983 84243 163344.0 -1.0000 0.2360 9
0.703270 0.618831 170268 326688.0 -1.0000 0.5707 11
finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 397936.000000
weighted label sum = -12936.000000
average loss = 0.684043
best constant = -0.032508
best constant's loss = 0.998943
total feature number = 2216941
This outputs me a predictions file preds_ab.txt like:
0.000095 7c14ae23
0.017367 3e9558bd
0.139393 6a1cd72f
0.834518 dfe76f6e
0.089810 2b88b547
If I calculate the ROC-AUC score of these predictions, I get a value of 0.85, which is close to what I get using scikit-learn (0.90). However, the probability outputs are not calibrated at all, since they are much higher than what I would expect (close to 1%). This is the histogram.
This is the reliability curve:
And this is a plot of mean probabilities and positive frequencies when examples are binned by probabilities:
It's obvious that output probabilities are much higher than what would be expected from a well-calibrated classifier.
What am I doing wrong here? What should I investigate?
UPDATE
If I don't use a weight of 100 for the positive class examples, I get similarly non-intuitive results. The mean probability output is 0.27 (still very far from 1%), the reliability plot looks even worse, and the ROC-AUC is 0.76.
I can confirm I have 237805 negative examples and 2195 positive ones.
Output training:
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = impressions_rand.aa.cache
Reading datafile = impressions_rand.aa
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 -1.0000 0.0000 11
0.546724 0.400300 2 2.0 -1.0000 -0.7087 11
0.398553 0.250382 4 4.0 -1.0000 -1.3963 11
0.284506 0.170460 8 8.0 -1.0000 -2.2595 11
0.181406 0.078306 16 16.0 -1.0000 -2.8225 11
0.108136 0.034865 32 32.0 -1.0000 -4.2696 11
0.063156 0.018176 64 64.0 -1.0000 -4.7412 11
0.036415 0.009675 128 128.0 -1.0000 -4.2940 11
0.020325 0.004235 256 256.0 -1.0000 -5.9903 11
0.043248 0.066171 512 512.0 -1.0000 -5.5540 11
0.045276 0.047304 1024 1024.0 -1.0000 -4.7065 11
0.044606 0.043935 2048 2048.0 -1.0000 -6.6253 11
0.048938 0.053270 4096 4096.0 -1.0000 -5.9119 11
0.048711 0.048485 8192 8192.0 -1.0000 -2.3949 11
0.048157 0.047603 16384 16384.0 -1.0000 -9.6219 11
0.044306 0.040454 32768 32768.0 -1.0000 -8.8800 11
0.044029 0.043752 65536 65536.0 -1.0000 -5.9218 9
0.042739 0.041450 131072 131072.0 -1.0000 -3.8306 11
0.042986 0.042986 262144 262144.0 -1.0000 -6.0941 11 h
0.042321 0.041655 524288 524288.0 -1.0000 -4.0276 11 h
0.042654 0.042988 1048576 1048576.0 -1.0000 -9.9169 11 h
finished run
number of examples per pass = 216000
passes used = 7
weighted example sum = 1512000.000000
weighted label sum = -1484504.000000
average loss = 0.042763 h
best constant = -4.691161
best constant's loss = 0.051789
total feature number = 16166472
Output from testing follows. I've read that an average loss larger than the best constant's loss indicates that something is wrong with the model's learning.
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average since example example current current current
loss last counter weight label predict features
78.141266 78.141266 1 1.0 -1.0000 0.0001 11
54.228148 30.315029 2 2.0 -1.0000 0.0015 11
33.279501 12.330854 4 4.0 1.0000 0.0472 11
20.358767 7.438034 8 8.0 -1.0000 0.0527 11
15.780043 11.201319 16 16.0 -1.0000 0.1657 11
13.783271 11.786498 32 32.0 -1.0000 0.0012 9
9.318714 4.854158 64 64.0 -1.0000 0.7268 11
6.797651 4.276587 128 128.0 -1.0000 0.1404 9
4.674237 2.550824 256 256.0 -1.0000 0.0516 11
3.269198 1.864159 512 512.0 -1.0000 0.4092 11
2.153033 1.036868 1024 1024.0 -1.0000 0.0425 11
1.481920 0.810807 2048 2048.0 -1.0000 0.2792 11
1.005869 0.529817 4096 4096.0 -1.0000 0.2422 11
0.676574 0.347279 8192 8192.0 -1.0000 0.3003 11
0.452924 0.229274 16384 16384.0 -1.0000 0.2579 11
0.295262 0.137600 32768 32768.0 -1.0000 0.2833 11
0.191513 0.087763 65536 65536.0 -1.0000 0.2616 9
0.126758 0.062003 131072 131072.0 -1.0000 0.2670 11
finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 207361.000000
weighted label sum = -203423.000000
average loss = 0.099565
best constant = -0.981009
best constant's loss = 0.037621
total feature number = 2217159
You say you have one positive example per 100 negative examples on average in the training set. However, you put 100 times more weight on the positive examples, which is (almost) equivalent to repeating each positive example 100 times in the training set. This way the average predicted probability should be around 50%. So you should not be surprised it is not around 1%.
According to the vw output you provided, it seems that there are more than 100 negative examples for each positive one in the training set impressions_rand.aa, so the "weighted label sum" is negative (otherwise it would be around 0). Thus, the average predicted probability is not 50% but around 36%.
I solved it thanks to Martin Popel's and arielf's comments. :)
I forgot to use -t when generating the predictions.
I didn't specify --loss_function logistic when generating the predictions.
As a result, the model was being updated during testing, using the default loss function instead of the logistic one, destroying the model and producing wrong results.
Takeaways:
Use --loss_function logistic during testing as well, to see correct loss outputs.
Remember to use -t if you don't want to update your model while predicting.
This is how the output looks now when testing (without example weighting):
$ vw -d impressions_rand.ab --link logistic --loss_function logistic -i model.vw -t -p preds.txt
only testing
predictions = preds.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = impressions_rand.ab
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.000053 0.000053 1 1.0 -1.0000 0.0001 11
0.000370 0.000687 2 2.0 -1.0000 0.0007 11
1.252868 2.505366 4 4.0 1.0000 0.0067 11
0.638249 0.023630 8 8.0 -1.0000 0.0036 11
0.322060 0.005872 16 16.0 -1.0000 0.0031 11
0.164750 0.007439 32 32.0 -1.0000 0.0000 9
0.084911 0.005072 64 64.0 -1.0000 0.0081 11
0.076905 0.068899 128 128.0 -1.0000 0.0004 9
0.055126 0.033347 256 256.0 -1.0000 0.0000 11
0.052986 0.050847 512 512.0 -1.0000 0.0133 11
0.038351 0.023715 1024 1024.0 -1.0000 0.0000 11
0.037059 0.035767 2048 2048.0 -1.0000 0.0167 11
0.038848 0.040637 4096 4096.0 -1.0000 0.0112 11
0.038903 0.038957 8192 8192.0 -1.0000 0.0281 11
0.041625 0.044348 16384 16384.0 -1.0000 0.0001 11
0.042526 0.043426 32768 32768.0 -1.0000 0.0218 11
0.042538 0.042551 65536 65536.0 -1.0000 0.0000 9
0.042150 0.041763 131072 131072.0 -1.0000 0.0019 11
finished run
number of examples per pass = 207361
passes used = 1
weighted example sum = 207361.000000
weighted label sum = -203423.000000
average loss = 0.042438
best constant = -4.647395
best constant's loss = 0.053670
total feature number = 2217159
You can see that the reported average loss is now less than the best constant's loss, and the running average losses also lie in the expected interval.
Also, the output probabilities now make perfect sense.

Why average loss goes up when training using Vowpal Wabbit

I tried to use VW to train a regression model on a small set of examples (about 3112). I think I'm doing it correctly, yet it shows me weird results. I dug around but didn't find anything helpful.
$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.064936 0.077711 16 16.0 -0.1800 0.0547 77
0.060507 0.056079 32 32.0 0.0000 0.3164 79
0.136933 0.213358 64 64.0 -0.5900 -0.0850 79
0.151692 0.166452 128 128.0 0.0700 0.0060 79
0.133965 0.116238 256 256.0 0.0900 -0.0446 78
0.179995 0.226024 512 512.0 0.3700 -0.0217 79
0.109296 0.038597 1024 1024.0 0.1200 -0.0728 79
0.579360 1.049425 2048 2048.0 -0.3700 -0.0084 79
0.485389 0.485389 4096 4096.0 1.9600 0.3934 79 h
0.517748 0.550036 8192 8192.0 0.0700 0.0334 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
$ wc model
41 48 657 model
Questions:
Why is the number of features in the output (readable) model less than the number of actual features? I counted that the training data contains 78 features (plus the bias, that's 79 as shown during training). The number of feature bits is 24, which should be far more than enough to avoid collisions.
Why does the average loss actually go up in the training as you can see in the above example?
(Minor) I tried to increase the number of feature bits to 32, and it output an empty model. Why?
EDIT:
I tried to shuffle the input file, as well as using --holdout_off, as suggested. But the result is still almost the same - the average loss goes up.
$ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.071332 0.090504 16 16.0 0.0300 0.1203 79
0.043720 0.016108 32 32.0 -0.2200 -0.1971 78
0.142895 0.242071 64 64.0 0.0100 -0.1531 79
0.158564 0.174232 128 128.0 0.0500 -0.0439 79
0.150691 0.142818 256 256.0 0.3200 0.1466 79
0.197050 0.243408 512 512.0 0.2300 -0.0459 79
0.117398 0.037747 1024 1024.0 0.0400 0.0284 79
0.636949 1.156501 2048 2048.0 1.2500 -0.0152 79
0.363364 0.089779 4096 4096.0 0.1800 0.0071 79
0.477569 0.591774 8192 8192.0 -0.4800 0.0065 79
0.411068 0.344567 16384 16384.0 0.0700 0.0450 77
finished run
number of examples per pass = 3112
passes used = 10
weighted example sum = 31120
weighted label sum = -105.5
average loss = 0.423404
best constant = -0.0033901
total feature number = 2451800
The training examples are unique, so I doubt there is an over-fitting problem (which, as I understand it, usually happens when the number of inputs is too small compared to the number of features).
EDIT2:
I tried printing the average loss for every pass over the examples, and I see that it mostly remains constant.
$ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.498822 0.498822 3112 3112.0 0.0800 0.0015 79 h
0.476677 0.454595 6224 6224.0 -0.2200 -0.0085 79 h
0.466413 0.445856 9336 9336.0 0.0200 -0.0022 79 h
0.490221 0.561506 12448 12448.0 0.0700 -0.1113 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
Also another try without the --l1, --l2 and -b parameters:
$ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.520286 0.520286 3112 3112.0 0.0800 -0.0021 79 h
0.488581 0.456967 6224 6224.0 -0.2200 -0.0137 79 h
0.474247 0.445538 9336 9336.0 0.0200 -0.0299 79 h
0.496580 0.563450 12448 12448.0 0.0700 -0.1727 79 h
0.533413 0.680958 15560 15560.0 -0.1700 0.0322 79 h
0.524531 0.480201 18672 18672.0 -0.9800 -0.0573 79 h
finished run
number of examples per pass = 2801
passes used = 7
weighted example sum = 19608
weighted label sum = -212.58
average loss = 0.491739 h
best constant = -0.0108415
total feature number = 1544713
Does that mean it's normal for the average loss to go up during one pass, but that as long as multiple passes give the same loss it's fine?
The model file stores only non-zero weights, so most likely the others were zeroed out, especially since you are using --l1.
It may be caused by many reasons. Perhaps your dataset isn't shuffled well enough. If you sort your dataset so that examples labeled -1 are in the first half and examples labeled 1 are in the second, then your model will show very good convergence on the first half, but you'll see the average loss bump up as it reaches the second half. So it may be an imbalance in the dataset. As for the last two losses - these are holdout losses (marked with 'h' at the end of the line) and may indicate that the model is overfitted. Please refer to my other answer.
Well, in the master branch, use of -b 32 is currently even blocked. You should use up to -b 31. In practice, -b 24-28 is usually enough even for tens of thousands of features.
I would recommend getting an up-to-date VW version from GitHub.

Generate a list of primes up to a certain number

I'm trying to generate a list of primes below 1 billion. I'm trying this, but this kind of structure is pretty shitty. Any suggestions?
a <- 1:1000000000
d <- 0
b <- for (i in a) {for (j in 1:i) {if (i %% j !=0) {d <- c(d,i)}}}
That sieve posted by George Dontas is a good starting point. Here's a much faster version, with a running time of 0.095s for primes up to 1e6 as opposed to 30s for the original version.
sieve <- function(n)
{
n <- as.integer(n)
if(n > 1e8) stop("n too large")
primes <- rep(TRUE, n)
primes[1] <- FALSE
last.prime <- 2L
fsqr <- floor(sqrt(n))
while (last.prime <= fsqr)
{
primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
sel <- which(primes[(last.prime+1):(fsqr+1)])
if(any(sel)){
last.prime <- last.prime + min(sel)
}else last.prime <- fsqr+1
}
which(primes)
}
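For example:
sieve(30)
# [1]  2  3  5  7 11 13 17 19 23 29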
Here are some alternate algorithms below, coded about as fast as possible in R. They are slower than the sieve but a heck of a lot faster than the questioner's original post.
Here's a recursive function that uses mod but is vectorized. It returns results for 1e5 almost instantaneously and for 1e6 in under 2s.
primes <- function(n){
primesR <- function(p, i = 1){
f <- p %% p[i] == 0 & p != p[i]
if (any(f)){
p <- primesR(p[!f], i+1)
}
p
}
primesR(2:n)
}
The next one isn't recursive and is faster again. The code below does primes up to 1e6 in about 1.5s on my machine.
primest <- function(n){
p <- 2:n
i <- 1
while (p[i] <= sqrt(n)) {
p <- p[p %% p[i] != 0 | p==p[i]]
i <- i+1
}
p
}
BTW, the spuRs package has a number of prime-finding functions, including a sieve of Eratosthenes. I haven't checked to see what their speed is like.
And while I'm writing a very long answer... here's how you'd check in R if one value is prime.
isPrime <- function(x){
div <- 2:ceiling(sqrt(x))
!any(x %% div == 0)
}
This is an implementation of the Sieve of Eratosthenes algorithm in R.
sieve <- function(n)
{
n <- as.integer(n)
if(n > 1e6) stop("n too large")
primes <- rep(TRUE, n)
primes[1] <- FALSE
last.prime <- 2L
for(i in last.prime:floor(sqrt(n)))
{
primes[seq.int(2L*last.prime, n, last.prime)] <- FALSE
last.prime <- last.prime + min(which(primes[(last.prime+1):n]))
}
which(primes)
}
sieve(1000000)
Prime Numbers in R
The OP asked to generate all prime numbers below one billion. All of the answers provided thus far are either not capable of doing this, will take a long time to execute, or are currently not available in R (see the answer by #Charles). The package RcppAlgos (I am the author) is capable of generating the requested output in just over 1 second using only one thread. It is based on the segmented sieve of Eratosthenes by Kim Walisch.
RcppAlgos
library(RcppAlgos)
system.time(primeSieve(1e9)) ## using 1 thread
user system elapsed
1.099 0.077 1.176
Using Multiple Threads
And in recent versions (i.e. >= 2.3.0), we can utilize multiple threads for even faster generation. For example, now we can generate the primes up to 1 billion in under half a second!
system.time(primeSieve(10^9, nThreads = 8))
user system elapsed
2.046 0.048 0.375
Summary of Available Packages in R for Generating Primes
library(schoolmath)
library(primefactr)
library(sfsmisc)
library(primes)
library(numbers)
library(spuRs)
library(randtoolbox)
library(matlab)
## and 'sieve' from #John
Before we begin, we note that the problems pointed out by #Henrik in schoolmath still exist. Observe:
## 1 is NOT a prime number
schoolmath::primes(start = 1, end = 20)
[1] 1 2 3 5 7 11 13 17 19
## This should return 1, however it is saying that 52
## "prime" numbers less than 10^4 are divisible by 7!!
sum(schoolmath::primes(start = 1, end = 10^4) %% 7L == 0)
[1] 52
The point is, don't use schoolmath for generating primes at this point (no offense to the author... In fact, I have filed an issue with the maintainer).
Let's look at randtoolbox as it appears to be incredibly efficient. Observe:
library(microbenchmark)
## the argument for get.primes is for how many prime numbers you need
## whereas most packages get all primes less than a certain number
microbenchmark(priRandtoolbox = get.primes(78498),
priRcppAlgos = RcppAlgos::primeSieve(10^6), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
priRandtoolbox 1.00000 1.00000 1.000000 1.000000 1.000000 1.0000000 100
priRcppAlgos 12.79832 12.55065 6.493295 7.355044 7.363331 0.3490306 100
A closer look reveals that it is essentially a lookup table (found in the file randtoolbox.c from the source code).
#include "primes.h"
void reconstruct_primes()
{
int i;
if (primeNumber[2] == 1)
for (i = 2; i < 100000; i++)
primeNumber[i] = primeNumber[i-1] + 2*primeNumber[i];
}
Where primes.h is a header file that contains an array of "halves of differences between prime numbers". Thus, you will be limited by the number of elements in that array for generating primes (i.e. the first one hundred thousand primes). If you are only working with smaller primes (less than 1,299,709 (i.e. the 100,000th prime)) and you are working on a project that requires the nth prime, randtoolbox is the way to go.
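For instance, assuming randtoolbox is attached as above (get.primes(n) returns the first n primes):
get.primes(5)
# [1]  2  3  5  7 11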
Below, we perform benchmarks on the rest of the packages.
Primes up to One Million
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^6),
priNumbers = numbers::Primes(10^6),
priSpuRs = spuRs::primesieve(c(), 2:10^6),
priPrimes = primes::generate_primes(1, 10^6),
priPrimefactr = primefactr::AllPrimesUpTo(10^6),
priSfsmisc = sfsmisc::primes(10^6),
priMatlab = matlab::primes(10^6),
priJohnSieve = sieve(10^6),
unit = "relative")
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.000000 1.00000 1.00000 1.000000 1.00000 1.00000 100
priNumbers 21.550402 23.19917 26.67230 23.140031 24.56783 53.58169 100
priSpuRs 232.682764 223.35847 233.65760 235.924538 236.09220 212.17140 100
priPrimes 46.591868 43.64566 40.72524 39.106107 39.60530 36.47959 100
priPrimefactr 39.609560 40.58511 42.64926 37.835497 38.89907 65.00466 100
priSfsmisc 9.271614 10.68997 12.38100 9.761438 11.97680 38.12275 100
priMatlab 21.756936 24.39900 27.08800 23.433433 24.85569 49.80532 100
priJohnSieve 10.630835 11.46217 12.55619 10.792553 13.30264 38.99460 100
Primes up to Ten Million
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^7),
priNumbers = numbers::Primes(10^7),
priSpuRs = spuRs::primesieve(c(), 2:10^7),
priPrimes = primes::generate_primes(1, 10^7),
priPrimefactr = primefactr::AllPrimesUpTo(10^7),
priSfsmisc = sfsmisc::primes(10^7),
priMatlab = matlab::primes(10^7),
priJohnSieve = sieve(10^7),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 20
priNumbers 30.57896 28.91780 31.26486 30.47751 29.81762 40.43611 20
priSpuRs 533.99400 497.20484 490.39989 494.89262 473.16314 470.87654 20
priPrimes 125.04440 114.71349 112.30075 113.54464 107.92360 103.74659 20
priPrimefactr 52.03477 50.32676 52.28153 51.72503 52.32880 59.55558 20
priSfsmisc 16.89114 16.44673 17.48093 16.64139 18.07987 22.88660 20
priMatlab 30.13476 28.30881 31.70260 30.73251 32.92625 41.21350 20
priJohnSieve 18.25245 17.95183 19.08338 17.92877 18.35414 32.57675 20
Primes up to One Hundred Million
For the next two benchmarks, we only consider RcppAlgos, numbers, sfsmisc, matlab, and the sieve function by #John.
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^8),
priNumbers = numbers::Primes(10^8),
priSfsmisc = sfsmisc::primes(10^8),
priMatlab = matlab::primes(10^8),
priJohnSieve = sieve(10^8),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 20
priNumbers 35.64097 33.75777 32.83526 32.25151 31.74193 31.95457 20
priSfsmisc 21.68673 20.47128 20.01984 19.65887 19.43016 19.51961 20
priMatlab 35.34738 33.55789 32.67803 32.21343 31.56551 31.65399 20
priJohnSieve 23.28720 22.19674 21.64982 21.27136 20.95323 21.31737 20
Primes up to One Billion
N.B. We must remove the condition if(n > 1e8) stop("n too large") in the sieve function.
## See top section
## system.time(primeSieve(10^9))
## user system elapsed
## 1.099 0.077 1.176 ## RcppAlgos single-threaded
## gc()
system.time(matlab::primes(10^9))
user system elapsed
31.780 12.456 45.549 ## ~39x slower than RcppAlgos
## gc()
system.time(numbers::Primes(10^9))
user system elapsed
32.252 9.257 41.441 ## ~35x slower than RcppAlgos
## gc()
system.time(sieve(10^9))
user system elapsed
26.266 3.906 30.201 ## ~26x slower than RcppAlgos
## gc()
system.time(sfsmisc::primes(10^9))
user system elapsed
24.292 3.389 27.710 ## ~24x slower than RcppAlgos
From these comparisons, we see that RcppAlgos scales much better as n gets larger.
_________________________________________________________
| | 1e6 | 1e7 | 1e8 | 1e9 |
| |---------|----------|-----------|-----------
| RcppAlgos | 1.00 | 1.00 | 1.00 | 1.00 |
| sfsmisc | 9.76 | 16.64 | 19.66 | 23.56 |
| JohnSieve | 10.79 | 17.93 | 21.27 | 25.68 |
| numbers | 23.14 | 30.48 | 32.25 | 34.86 |
| matlab | 23.43 | 30.73 | 32.21 | 38.73 |
---------------------------------------------------------
The difference is even more dramatic when we utilize multiple threads:
microbenchmark(ser = primeSieve(1e6),
par = primeSieve(1e6, nThreads = 8), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
ser 1.741342 1.492707 1.481546 1.512804 1.432601 1.275733 100
par 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
microbenchmark(ser = primeSieve(1e7),
par = primeSieve(1e7, nThreads = 8), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
ser 2.632054 2.50671 2.405262 2.418097 2.306008 2.246153 100
par 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 100
microbenchmark(ser = primeSieve(1e8),
par = primeSieve(1e8, nThreads = 8), unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
ser 2.914836 2.850347 2.761313 2.709214 2.755683 2.438048 20
par 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
microbenchmark(ser = primeSieve(1e9),
par = primeSieve(1e9, nThreads = 8), unit = "relative", times = 10)
Unit: relative
expr min lq mean median uq max neval
ser 3.081841 2.999521 2.980076 2.987556 2.961563 2.841023 10
par 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
And multiplying the table above by the respective median times for the serial results:
_____________________________________________________________
| | 1e6 | 1e7 | 1e8 | 1e9 |
| |---------|----------|-----------|-----------
| RcppAlgos-Par | 1.00 | 1.00 | 1.00 | 1.00 |
| RcppAlgos-Ser | 1.51 | 2.42 | 2.71 | 2.99 |
| sfsmisc | 14.76 | 40.24 | 53.26 | 70.39 |
| JohnSieve | 16.32 | 43.36 | 57.62 | 76.72 |
| numbers | 35.01 | 73.70 | 87.37 | 104.15 |
| matlab | 35.44 | 74.31 | 87.26 | 115.71 |
-------------------------------------------------------------
Primes Over a Range
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^9, 10^9 + 10^6),
priNumbers = numbers::Primes(10^9, 10^9 + 10^6),
priPrimes = primes::generate_primes(10^9, 10^9 + 10^6),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.0000 1.0000 1.000 1.0000 1.0000 1.0000 20
priNumbers 115.3000 112.1195 106.295 110.3327 104.9106 81.6943 20
priPrimes 983.7902 948.4493 890.243 919.4345 867.5775 708.9603 20
Primes up to 10 billion in Under 6 Seconds
## primes less than 10 billion
system.time(tenBillion <- RcppAlgos::primeSieve(10^10, nThreads = 8))
user system elapsed
26.077 2.063 5.602
length(tenBillion)
[1] 455052511
## Warning!!!... Large object created
tenBillionSize <- object.size(tenBillion)
print(tenBillionSize, units = "Gb")
3.4 Gb
Primes Over a Range of Very Large Numbers:
Prior to version 2.3.0, we were simply using the same algorithm for numbers of every magnitude. This is okay for smaller numbers, when most of the sieving primes have at least one multiple in each segment (generally, the segment size is limited by the size of the L1 cache, ~32 KiB). However, when we are dealing with larger numbers, the sieving primes will contain many numbers that have fewer than one multiple per segment. This situation creates a lot of overhead, as we are performing many worthless checks that pollute the cache. Thus, we observe much slower generation of primes when the numbers are very large. Observe for version 2.2.0 (see Installing older version of R package):
## Install version 2.2.0
## packageurl <- "http://cran.r-project.org/src/contrib/Archive/RcppAlgos/RcppAlgos_2.2.0.tar.gz"
## install.packages(packageurl, repos=NULL, type="source")
system.time(old <- RcppAlgos::primeSieve(1e15, 1e15 + 1e9))
user system elapsed
7.932 0.134 8.067
And now, using the cache-friendly improvement originally developed by Tomás Oliveira, we see drastic improvements:
## Reinstall current version from CRAN
## install.packages("RcppAlgos"); library(RcppAlgos)
system.time(cacheFriendly <- primeSieve(1e15, 1e15 + 1e9))
user system elapsed
2.258 0.166 2.424 ## Over 3x faster than older versions
system.time(primeSieve(1e15, 1e15 + 1e9, nThreads = 8))
user system elapsed
4.852 0.780 0.911 ## Over 8x faster using multiple threads
Take Away
There are many great packages available for generating primes
If you are looking for speed in general, there is no match for RcppAlgos::primeSieve, especially for larger numbers.
If you are working with small primes, look no further than randtoolbox::get.primes.
If you need primes in a range, the packages numbers, primes, & RcppAlgos are the way to go.
The importance of good programming practices cannot be overemphasized (e.g. vectorization, using correct data types, etc.). This is most aptly demonstrated by the pure base R solution provided by #John. It is concise, clear, and very efficient.
The best way that I know of to generate all primes (without getting into crazy math) is to use the Sieve of Eratosthenes.
It is pretty straightforward to implement and allows you to calculate primes without using division or modulus. The only downside is that it is memory intensive, but various optimizations can be made to improve memory usage (ignoring all even numbers, for instance; see the sketch below).
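For illustration, here is a minimal sketch of that even-skipping optimization in R (the function name oddSieve and the index arithmetic are mine, not from any package): index i of the logical vector stands for the odd number 2*i + 1, which halves the memory compared to the full sieves above.
oddSieve <- function(n) {
  if (n < 2) return(integer(0))
  m <- (n - 1) %/% 2                # how many odd numbers 3, 5, ..., <= n
  odd <- rep(TRUE, m)               # odd[i] <=> 2*i + 1 is assumed prime
  i <- 1L
  while ((2 * i + 1)^2 <= n) {
    if (odd[i]) {
      p <- 2 * i + 1
      # first multiple to cross off is p^2, whose index is (p^2 - 1)/2 = 2*i*(i + 1)
      odd[seq.int(2 * i * (i + 1), m, p)] <- FALSE
    }
    i <- i + 1L
  }
  c(2L, 2L * which(odd) + 1L)       # put 2 back in front
}
oddSieve(30)
# [1]  2  3  5  7 11 13 17 19 23 29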
This method should be faster and simpler.
allPrime <- function(n) {
primes <- rep(TRUE, n)
primes[1] <- FALSE
for (i in 1:sqrt(n)) {
if (primes[i]) primes[seq(i^2, n, i)] <- FALSE
}
which(primes)
}
0.12 seconds on my computer for n = 1e6
I implemented this in function AllPrimesUpTo in package primefactr.
I recommend primegen, Dan Bernstein's implementation of the Atkin-Bernstein sieve. It's very fast and will scale well to other problems. You'll need to pass data out to the program to use it, but I imagine there are ways to do that?
You can also cheat and use the primes() function in the schoolmath package :D
The isPrime() function posted above could use sieve(). One only needs to check whether any of the primes up to ceiling(sqrt(x)) divide x with no remainder. The cases 1 and 2 also need to be handled.
isPrime <- function(x) {
div <- sieve(ceiling(sqrt(x)))
(x > 1) & ((x == 2) | !any(x %% div == 0))
}
No suggestions, but allow me an extended comment of sorts. I ran this experiment with the following code:
get_primes <- function(n_min, n_max){
  options(scipen = 999)
  result <- vector()
  for (x in seq(max(n_min, 2), n_max)){
    has_factor <- F
    for (p in seq(2, ceiling(sqrt(x)))){
      # p != x guards the edge case x = 2, where ceiling(sqrt(2)) = 2 would
      # otherwise wrongly mark 2 itself as having a factor
      if (x %% p == 0 && p != x) has_factor <- T
      if (has_factor == T) break
    }
    if (has_factor == F) result <- c(result, x)
  }
  result
}
and after almost 24 hours of uninterrupted computation, I got a list of 5,245,897 primes. Since π(1,000,000,000) = 50,847,534, it would have taken about 10 days to complete this calculation.
Here is the file of these first ~ 5 million prime numbers.
for (i in 2:1000) {
  a <- 2:(i - 1)           # candidate divisors (for i = 2 this is c(2, 1))
  b <- as.matrix(i %% a)   # remainders of i modulo each candidate
  c <- colSums(b != 0)     # count of candidates that do not divide i
  if (c == i - 2) {        # i is prime iff no candidate in 2:(i-1) divides it
    print(i)
  }
}
Every candidate number (i) is checked against the list of primes (n) accumulated while checking the numbers before it.
Thanks for the suggestions:
prime <- function(a){
  n <- c(2)
  i <- 3
  while (i <= a){
    r <- 0                        # initialize here: the loop below may be empty
    for (j in n[n <= sqrt(i)]){
      if (i %% j == 0){
        r <- 1                    # found a prime divisor, so i is composite
        break
      }
    }
    if (r != 1){ n <- c(n, i) }   # no prime divisor found: i is prime
    i <- i + 2                    # even numbers > 2 are never prime
  }
  print(n)
}
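For example, with the initialization fix above:
prime(20)
# [1]  2  3  5  7 11 13 17 19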
