Iterative training using VowpalWabbit - vowpalwabbit

I am trying to perform iterative training using VW.
Ideally I would be able to:
Train and save a model initial_model.vw (I have tested this and it works)
Load this model, add additional data to it, and save it again (to new_model.vw)
Use this new model to make predictions that the first model was not able to make to prove the iterative training has been successful.
I found one person also trying to do this (how to retrain the model for sequence of files in vowpal wabbit) but when I run my code and try to retrain with additional data, it seems to overwrite the old data instead of adding to it.
Here is the basic outline of the code I am using:
Initial training and saving:
vw initial_data.txt -b 26 --learning_rate 1.5 --passes 10 --probabilities --loss_function=logistic --oaa 80 --save_resume --kill_cache --cache_file a.cache -f initial_model.vw
Retraining with new data:
vw new_data.txt -b 26 --learning_rate 1.5 --passes 10 -i initial_model.vw --probabilities --loss_function=logistic --oaa 80 --save_resume --kill_cache --cache_file a.cache -f new_model.vw
I know that this is not enough to reproduce what I am doing but I just want to know if there are any problems with my arguments and if this should be working in theory. When I use my retrained model to make predictions, it is only accurate for test cases which are included in the new data, not anything that was covered in the original training file. Help appreciated!

I can see 2 potential issues with the arguments given in the question.
These may be OK if you actually meant to use them this way and you really know what you're doing, but they seem a bit suspect.
1) Whenever you run vw with multiple passes over the same data (--passes <n>), vw implicitly switches to holdout mode, holding out 1 in every 10 examples. Held-out examples are used only for error estimation, not for learning, in order to avoid over-fitting. If this is what you meant to do, fine; but if you don't want to hold out any of your examples, you should use the option --holdout_off, and be aware that the chances of over-fitting increase.
2) The initial learning rate (--learning_rate 1.5) seems high; it increases the chances of over-fitting. If you use it because you end up with a lower training loss, that is the wrong thing to do: in ML the goal is not to minimize training loss but generalization loss.
Also: setting the initial learning rate on the 2nd batch seems to contradict the --save_resume option. The goal of --save_resume is to start a new batch with the low (already decayed, as saved in the model) per-feature learning rates (AdaGrad style). Making the learning rate jump at the start may make the first examples in the 2nd batch much more important than all the decayed features from the 1st batch.
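Putting these together, a minimal sketch of the two runs (assuming the vw binary is on your PATH and the file names from the question) could look like this; the second invocation drops --learning_rate so the saved, already-decayed rates carry over, and --holdout_off is included only if you really want every example used for learning:

import subprocess

common = ["-b", "26", "--passes", "10", "--probabilities",
          "--loss_function=logistic", "--oaa", "80", "--save_resume",
          "--kill_cache", "--cache_file", "a.cache",
          "--holdout_off"]  # only if you accept the higher over-fitting risk

# first batch: set the initial learning rate here if you must
subprocess.run(["vw", "initial_data.txt", "--learning_rate", "1.5",
                *common, "-f", "initial_model.vw"], check=True)

# second batch: resume from the saved model; no --learning_rate, so the
# already-decayed per-feature rates stored in the model are used
subprocess.run(["vw", "new_data.txt", "-i", "initial_model.vw",
                *common, "-f", "new_model.vw"], check=True)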
Tip: you can get a feel of how well you're doing, by piping the progress output into the plotting utility vw-convergence:
vw -P 1.1 ... data.txt 2>&1 | vw-convergence
(note: vw-convergence requires R)

Related

Proper way to evaluate policy + exploration offline in Vowpal Wabbit

My use case is to retrain/make predictions using VW CB in batch mode (retrain/inference occurs nightly).
I'm reading this tutorial for offline policy evaluation in the batch scenario. I'm training on a logged dataset using:
--cb_adf --save_resume -f {MODEL_PATH} -d ./data/train.txt
and in order to tune the hyperparameter epsilon on batch predictions, I run the following command 3 times on a separate dataset:
-i {MODEL_PATH} -t --cb_explore_adf --epsilon 0.1/0.2/0.3 -d ./data/eval.txt
whichever gives the lowest average loss is the optimal epsilon.
Am I using the right options? My confusion mostly comes from another option, --explore_eval. What is the difference between --explore_eval and --cb_explore_adf, and what is the right way to evaluate model + exploration offline? Should I just run
--explore_eval --epsilon 0.1/0.2/0.3 -d ./data/train+eval.txt
and take whichever epsilon gives the lowest average loss as the optimal one?
-i {MODEL_PATH} -t --cb_explore_adf --epsilon 0.1/0.2/0.3 -d ./data/eval.txt
I predict the result of this experiment: the optimal epsilon is the smallest. This is because after data has been collected, there is no value to exploration. In order to assess exploration, you have to change the data available at training in a manner sensitive to the exploration algorithm. Which brings us to ...
--explore_eval --epsilon 0.1/0.2/0.3 -d ./data/train+eval.txt
'--explore_eval' is designed to assess exploration. It requires more data to work well (since it discards the data if the exploration doesn't match) but allows you to assess exploration since it simulates the fog of war.
If you are testing other model hyperparameters such as base learning algorithm or interactions, the extra data overhead of '--explore_eval' is unnecessary.
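For the batch sweep itself, a rough sketch (assuming the vw CLI is installed and the file layout from the question; vw prints its summary, including the average loss, to stderr) could be:

import subprocess

for eps in ["0.1", "0.2", "0.3"]:
    # --explore_eval simulates online exploration against the logged data,
    # so the training and evaluation examples go in together
    result = subprocess.run(
        ["vw", "--explore_eval", "--epsilon", eps,
         "-d", "./data/train+eval.txt"],
        capture_output=True, text=True, check=True)
    print(f"epsilon={eps}")
    print(result.stderr)  # inspect the reported average loss for each run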

Gensim Word2Vec Model trained but not saved

I am using gensim and executed the following code (simplified):
model = gensim.models.Word2Vec(...)
model.build_vocab(sentences)
model.train(...)
model.save('file_name')
After days my code finished model.train(...). However, during saving, I experienced:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
I noticed that there were some npy files generated:
<...>.trainables.syn1neg.npy
<...>.trainables.vectors_lockf.npy
<...>.wv.vectors.npy
Are those intermediate results I can re-use or do I have to rerun the entire process?
Those are parts of the saved model, but unless the master file_name file (a Python-pickled object) exists and is complete, they may be hard to re-use.
However, if your primary interest is the final word-vectors, those are in the .wv.vectors.npy file. If it appears to be full-length (the same size as the syn1neg file), it may be complete. What you're missing is the dict that tells you which word is in which index.
So, the following might work:
Repeat the original process, with the exact same corpus & model parameters, but only through the build_vocab() step. At that point, the new model.wv.vocab dict should be identical to the one from the failed-save run.
Save that model, without ever train()ing it, to a new filename.
After confirming that newmodel.wv.vectors.npy (with randomly-initialized, untrained vectors) is the same size as oldmodel.wv.vectors.npy, copy the old model's file over to the new model's name.
Re-load the new model, and run some sanity checks that the words make sense.
Perhaps, save off just the word-vectors, using something like newmodel.wv.save() or newmodel.wv.save_word2vec_format().
Potentially, the resurrected newmodel could also be patched to use the old syn1neg file as well, if it appears complete. It might work to further train the patched model (either with or without having reused the older syn1neg).
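A rough sketch of those recovery steps, using hypothetical file names and placeholder parameters (match them to your original run; parameter and attribute names also vary a little between gensim versions):

import shutil
import gensim

# must be the exact same corpus and parameters as the failed run
sentences = gensim.models.word2vec.LineSentence('corpus.txt')  # hypothetical corpus file
new_model = gensim.models.Word2Vec(min_count=5, workers=4)     # same params, no training yet

new_model.build_vocab(sentences)   # rebuilds the same word-to-index mapping
new_model.save('new_model')        # large arrays go to companion .npy files

# after checking that the file sizes match, overwrite the random vectors
# with the old trained ones from the interrupted run
shutil.copyfile('old_model.wv.vectors.npy', 'new_model.wv.vectors.npy')

patched = gensim.models.Word2Vec.load('new_model')
print(patched.wv.most_similar('some_frequent_word'))  # sanity check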
Separately: only the very largest corpuses, or an installation missing the gensim cython optimizations, or a machine without enough RAM (and thus swapping during training), would usually require a training session taking days. You might be able to run much faster. Check:
Is any virtual-memory swapping happening during the entire training? If it is, it will be disastrous for training throughput, and you should use a machine with more RAM or be more aggressive about trimming the vocabulary/model size with a higher min_count. (Smaller min_count values mean a larger model, slower training, poor-quality vectors for words with just a few examples, and also counterintuitively worse-quality vectors for more-frequent words too, because of interference from the noisy rare words. It's usually better to ignore lowest-frequency words.)
Is there any warning displayed about a "slow version" (pure Python with no effective multi-threading) being used? If so your training will be ~100X slower than if that problem is resolved. If the optimized code is available, maximum training throughput will likely be achieved with some workers value between 3 and 12 (but never larger than the number of machine CPU cores).
For a very large corpus, the sample parameter can be made more aggressive – such as 1e-04 or 1e-05 instead of the default 1e-03 – and it may both speed training and improve vector quality, by avoiding lots of redundant overtraining of the most-frequent words.
Good luck!

How do Statistica's 75%/25% Data Sampling & 10-fold Cross Validation work together?

I made an analysis on some data using Dell's Statistica software. I am using this analysis in a scientific paper. Although data mining is not my primary topic I took Data Mining class before and have some knowledge.
I know that data is either separated into 75%/25% (numbers may change) training and test parts, or n-fold cross-validation is used to test the model performance.
In Statistica SVM modeling, prior to execution of the model there are tabs to make configurations. In the data sampling tab I entered a 75%/25% separation, and in the cross-validation tab I entered 10-fold cross-validation. In the output, I see that the data was actually separated into training and test sets (model predictions are given for the test values).
There is also a cross-validation error. I will copy the results below. I have difficulty understanding and interpreting this output. I hope someone who knows statistics better than I do, and/or who is more experienced with this tool, can explain to me how it works.
Ferda
Support Vector Machine results
SVM type: Regression type 1 (capacity=9.000, epsilon=0.100)
Kernel type: Radial Basis Function (gamma=0.053)
Number of support vectors = 705 (674 bounded)
Cross-validation error = 0.244
Mean error squared = 1.830 (Train), 0.193 (Test), 1.267 (Overall)
S.D. ratio = 0.952 (Train), 37076026627971.336 (Test), 0.977 (Overall)
Correlation coefficient = 0.314 (Train), -0.000 (Test), 0.272 (Overall)
I found out that the Statistica website has an answer to my misunderstanding. In the Sampling tab the data may be separated into training and test sets, and in the Cross-validation tab, if for example 10 is selected, then 10-fold cross-validation is used to decide the proper SVM parameters (nu, epsilon, etc.) for the execution of the SVM modeling.
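For readers more used to code than to the Statistica GUI, the same idea expressed with scikit-learn (an analogy only, not what Statistica does internally) looks roughly like this: cross-validation on the 75% training portion picks the SVM hyperparameters, and the held-out 25% measures the final model once:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=0.1)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# 10-fold cross-validation on the training portion chooses capacity (C) and epsilon
search = GridSearchCV(SVR(kernel='rbf'),
                      param_grid={'C': [1, 3, 9], 'epsilon': [0.05, 0.1, 0.2]},
                      cv=10)
search.fit(X_train, y_train)

# the 25% test portion is used only once, to report final performance
print(search.best_params_, search.score(X_test, y_test))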
The explanation from the Statistica site cleared up my problem. I hope it helps people in similar situations...
Ferda

Understanding Perceptrons

I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified until none are misclassified anymore, I guess what I'm having trouble understanding is
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the training data?
Think about a Perceptron as a young child. You want to teach the child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shapes) while telling it what it sees at every turn ("this is an apple", "this is an orange"). Assuming the child has perfect memory, it will learn what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shape) without you actually telling it. This is what a Perceptron does. After you have shown it all the examples, you start at the beginning again; this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new: a green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child exactly the same data as during training? Because the child has perfect memory, it will only tell you what you told it; you won't see how well it generalizes from known to unseen data unless you test it on data it never saw during training. If the child performs horribly on the test data but gets 100% on the training data, you know it has learned nothing: it is simply repeating what it was told during training. You trained it too long and it only memorized your examples without understanding what makes an apple an apple; this is called overfitting. To prevent your Perceptron from only (!) recognizing training data, you'll have to stop training at a reasonable time and find a good balance between the sizes of the training and test sets.
How do I know if a point is misclassified?
If its output is different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading up on single/multi-layer Perceptrons and how neural networks of multiple Perceptrons work). The network takes your input; how it's encoded is irrelevant here, so let's say the input is the string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1), ...}. Since you know the class beforehand, you can check the output: the network will output either 1 or 0 for the input "apple1". If it outputs 1, you compute (targetValue - actualValue) = (0 - 1) = -1. A non-zero result means the network gave a wrong output. Compare this to the delta rule and you will see that this small difference is part of the larger update equation. Whenever the result is non-zero, you perform a weight update; if the target and actual value are the same, you always get 0 and you know the network didn't misclassify.
How do I go about choosing test points, training points, threshold or a bias?
Practically, the bias and threshold aren't "chosen" per se. The bias is trained like any other weight using a simple "trick": treat the bias as an additional input unit with constant value 1. The actual bias value is then encoded in this extra unit's weight, and the learning algorithm learns the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, classification works as follows: compute the weighted sum of the inputs (including the bias unit), pass it through the activation function, and output class 1 if the result is above the threshold, class 0 otherwise. Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5, since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: this is quite difficult, and you learn it by experience. At the stage you're at, you start off by implementing simple logical functions like AND, OR, XOR, etc. There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y, etc., there are only 4 possible inputs: 00, 10, 01, 11). For complex data like images, audio, etc., you'll have to experiment and tweak your data and features until you feel the network can work with them as well as you want it to.
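To make the update rule concrete, here is a minimal sketch of a perceptron on made-up 2-D points; the data, learning rate and iteration cap are arbitrary illustrative choices, not part of the assignment:

import numpy as np

# made-up, linearly separable 2-D points: class 1 above the line y = x, class 0 below
X = np.array([[0, 1], [1, 2], [2, 3], [1, 3], [3, 4],
              [1, 0], [2, 1], [3, 2], [3, 1], [4, 3]], dtype=float)
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

X = np.hstack([X, np.ones((len(X), 1))])   # bias trick: constant 1 as an extra input
w = np.zeros(3)                            # initial weights (and bias) set to zero
lr, updates = 0.1, 0

for epoch in range(100):                   # iterations over the training set
    errors = 0
    for xi, target in zip(X, y):
        pred = 1 if xi.dot(w) > 0 else 0   # step activation
        if pred != target:                 # misclassified -> delta-rule update
            w += lr * (target - pred) * xi
            updates += 1
            errors += 1
    if errors == 0:                        # converged: the data is linearly separable
        break

print("weights:", w, "updates:", updates, "epochs:", epoch + 1)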
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.

What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + (totalSize - sizeDone) / currAvg
(where currAvg is the average copy speed over those last n seconds)
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of the prediction is important, the way to go about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
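If you'd rather stay in Python than R, a sketch of that fit with ordinary least squares (the numbers below are hypothetical logged measurements, purely for illustration):

import numpy as np

# each row: one moment during past copies
# eta_overall = seconds remaining predicted by the overall-average method
# eta_recent  = seconds remaining predicted by the last-n-seconds method
# actual      = seconds that actually remained at that moment
eta_overall = np.array([300.0, 250.0, 200.0, 120.0, 60.0])
eta_recent  = np.array([280.0, 270.0, 150.0, 100.0, 70.0])
actual      = np.array([290.0, 260.0, 170.0, 110.0, 65.0])

# least-squares fit of actual ~ w1*eta_overall + w2*eta_recent
A = np.column_stack([eta_overall, eta_recent])
(w1, w2), *_ = np.linalg.lstsq(A, actual, rcond=None)
print("weights:", w1, w2)

# blended prediction for a new pair of estimates
print("blended ETA:", w1 * 90.0 + w2 * 80.0)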
An excellent resource for studying statistical learning methods is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
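As a rough illustration (not a tuned implementation), a scalar Kalman filter tracking the copy speed could look like this; the process noise q and measurement noise r are made-up values you would tune for your transfers:

def kalman_speed(measured_speeds, q=0.5, r=4.0):
    # x: current speed estimate, p: its variance (uncertainty)
    x, p = measured_speeds[0], 1.0
    estimates = []
    for z in measured_speeds:
        p += q                    # predict: the true speed may have drifted
        k = p / (p + r)           # Kalman gain: how much to trust the new measurement
        x += k * (z - x)          # pull the estimate toward the measurement
        p *= (1 - k)              # reduced uncertainty after the update
        estimates.append(x)
    return estimates

# noisy per-second throughput samples (MB/s), including a brief stall
samples = [10.2, 9.8, 10.5, 0.0, 0.3, 9.9, 10.1, 10.4]
print(kalman_speed(samples))      # smooths over the stall instead of dropping to zero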
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and then it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
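A minimal sketch of the first idea (the names and the simple linear weighting are mine, not the original implementation):

def blended_speed(historic_speed, current_speed, fraction_done):
    # early on, trust the historic speed; near the end, trust this transfer's own speed
    return (1.0 - fraction_done) * historic_speed + fraction_done * current_speed

def eta_seconds(total_size, size_done, historic_speed, current_speed):
    speed = blended_speed(historic_speed, current_speed, size_done / total_size)
    return (total_size - size_done) / speed

# e.g. historic 8 MB/s, this transfer currently measuring 5 MB/s, 40% done
print(eta_seconds(1000.0, 400.0, 8.0, 5.0))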
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Other than the statistics approach, one simple way to get a good estimate of the current speed, while smoothing out noise and spikes, is to take a weighted approach.
You already experimented with the sliding window; the idea here is to take a fairly large sliding window, but instead of a plain average, give more weight to the more recent measures, since they are more indicative of the evolution (a bit like a derivative).
Example: suppose you have the amounts copied in the 10 previous windows (most recent x0, least recent x9). You could then compute the speed as a weighted average, where the weights 10, 9, ..., 1 sum to 55:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + 1 * x9) / (55 * window-time)
When you have a good assessment of the likely speed, then you are close to get a good estimated time.
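In code, that weighted recent-speed estimate might look like this (assuming chunks holds the amount copied in each of the last 10 windows, most recent first, each window lasting window_time seconds):

def weighted_recent_speed(chunks, window_time):
    # chunks[0] is the most recent window; weights n, n-1, ..., 1
    n = len(chunks)
    weights = range(n, 0, -1)
    weighted_sum = sum(w * c for w, c in zip(weights, chunks))
    return weighted_sum / (sum(weights) * window_time)

# MB copied in the last 10 one-second windows, most recent first (note the stall)
chunks = [9.0, 10.0, 11.0, 10.0, 0.0, 0.0, 10.0, 10.0, 9.0, 10.0]
print(weighted_recent_speed(chunks, window_time=1.0))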
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have demonstrated that users react very badly to slow-downs and very positively to speed-ups. Therefore, a good progress bar / estimated time should at first be conservative in the estimates it presents (reserving time for a potential slow-down).
A simple way to get that is to have a factor, a function of the real completion percentage, that you use to tweak the presented completion (and hence the estimated remaining time). For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, the cubic factor(x) = x^3 produces a nice apparent speed-up toward completion; other convex choices, such as a normalized exponential like (e^x - 1)/(e - 1), work as well.
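As a tiny sketch of that presentation trick (the cubic factor is just one choice satisfying the constraints above):

def presented_completion(real_completion, factor=lambda x: x ** 3):
    # factor maps [0,1] -> [0,1], stays at or below the diagonal, and factor(1) == 1,
    # so the displayed progress is conservative early and catches up at the end
    return real_completion * factor(real_completion)

for real in (0.1, 0.4, 0.7, 1.0):
    print(real, "->", round(presented_completion(real), 3))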
