In contextual bandits for Vowpal Wabbit, does the --cb_explore option include training an optimal predictor (--cb option) as well? - vowpalwabbit

When using Vowpal Wabbit for contextual bandits, here is my understanding so far:
1. We can build a predictor model for predicting the rewards.
2. We can then use an exploration strategy to choose actions (each action's reward estimate is obtained from the predictions of the predictor model of #1 above).
I can use the --cb option to optimize a predictor based on already collected contextual bandit data. The --cb option is only for building a model that predicts the rewards; it doesn't involve any exploration in choosing actions (it always picks the action with the best predicted reward). Hence this is the functionality for #1 above. Doubly robust is the default for --cb, and you can specify other methods using the --cb_type flag.
The --cb_explore option performs exploration over the actions (#2 above). What I am not sure about is which method it uses for predicting the actions' rewards when I specify --cb_explore. All the examples refer to the exploration strategies and don't specify the default prediction strategy used for --cb_explore.

If no exploration strategy is provided, the default will be epsilon-greedy. You can see some of the other alternatives here.
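As a concrete sketch (train.dat is a placeholder file of contextual bandit examples, and the flag values are arbitrary illustrations), the default and a few of the alternative strategies look like:

vw --cb_explore 4 -d train.dat
vw --cb_explore 4 --epsilon 0.1 -d train.dat
vw --cb_explore 4 --first 10 -d train.dat
vw --cb_explore 4 --bag 5 -d train.dat
vw --cb_explore 4 --cover 3 -d train.dat

The first two are equivalent up to the epsilon value (epsilon-greedy is the default); the others select tau-first, bagging, and online cover exploration respectively.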

Related

How does `vw --audit` internally compute the weights of the features?

In VowpalWabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
From what I understand, VowpalWabbit tries to fit one linear model to each arm.
So if weights were calculated using an average across all the arms, then they would correlate with getting a reward generally, instead of which features makes the model pick one variant from another.
I am interested in finding out how they are calculated so I can interpret the results obtained. I tried searching the project's GitHub repository but could not find anything meaningful.
I am interested in finding out how they are calculated so I can interpret the results obtained.
Unfortunately knowing the first does not lead to knowing the second.
Your question is concerned with contextual bandits, but it is important to note that interpreting model parameters is an issue that also occurs in supervised learning. Machine learning has made progress recently (i.e., during my lifetime) largely by focusing concern on the quality of predictions rather than the meaningfulness of model parameters. In a blog post, Phoebe Wong outlines the issue while being entertaining.
The bottom line is that our models are not causal, so you simply cannot conclude that "the weight of feature X for arm A is large" means "if I were to intervene in the system and increase this feature value, I would get more reward for playing arm A".
We are currently working on tools for model inspection that leverage techniques such as permutation importance that will help you answer questions like "if I were to stop using a particular feature how would the frequency of playing each arm change for the trained policy". We're hoping that is helpful information.
Having said all that, let me try to answer your original question ...
In VowpalWabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
The format is documented here. Assuming you are using --cb (not --cb_adf), there is a fixed number of arms, and so the offset field will increment over the arms. So for an example like
1:2:0.4 |foo bar
with --cb 4 you'll get an audit output with namespace of foo, feature of bar, and offset of 0, 1, 2, and 3.
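To reproduce this yourself (data.dat is a placeholder file containing such examples), something like the following prints per-feature audit information as VW trains:

vw --cb 4 --audit -d data.dat

Each audited line includes (roughly) the namespace, feature name, hash index, feature value, and current weight; the offset distinguishes the per-arm copies of the same feature.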
Interpreting the output when using --cb_adf is possible but difficult to explain succinctly.
From what I understand, VowpalWabbit tries to fit one linear model to each arm.
Shorter answer: With --cb_type dm, essentially VW independently tries to predict the average reward for each arm using only examples where the policy played that arm. So the weight you get from audit at a particular offset N is analogous to what you would get from a supervised learning model trained to predict reward on a subset of the historical data consisting solely of times the historical policy played arm N. With other --cb_type settings the interpretation is more complicated.
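As a concrete sketch of that shorter answer (file names are placeholders), training with the direct-method reduction looks like:

vw --cb 4 --cb_type dm -d logged.dat -f model.vw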
Longer answer: "Linear model" refers to the representation being used. VW can incorporate nonlinearities into the model, but let's ignore that for now. "Fit" is where some important details are. VW takes the partial-feedback information of a CB problem (partial feedback = "for this example you don't know the reward of the arms not pulled") and reduces it to a full-feedback supervised learning problem (full feedback = "for this example you do know the reward of all arms"). The --cb_type argument selects the reduction strategy. There are several papers on the topic; a good place to start is Dudik et al., and then look for papers that cite this paper. In terms of code, ultimately things are grounded here, but the code is written more for performance than intelligibility.
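As a worked illustration of one such reduction (this is the standard inverse-propensity-scoring construction, not a description of VW internals): with --cb_type ips, a logged example where arm 3 was played with probability 0.2 and incurred cost 0.5 is converted into a full-feedback example with estimated cost 0.5 / 0.2 = 2.5 for arm 3 and 0 for every unplayed arm. Dividing by the logging probability compensates for how often the logging policy chose that arm, which is what makes the estimate unbiased in expectation.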

Vowpal Wabbit: question on training contextual bandit on historical data

I know from this page that there is an option to train a Contextual Bandit VW model based on historical contextual bandit data collected using some exploration policy:
VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not implement exploration, it assumes it can only use the currently available data logged using an exploration policy.
And it is done by specifying --cb and passing data formatted like action:cost:probability | features:
1:2:0.4 | a c
3:0.5:0.2 | b d
4:1.2:0.5 | a b c
2:1:0.3 | b c
3:1.5:0.7 | a d
My question is, is there a way to leverage historical data that was not based on a contextual bandit policy using --cb (or some other method) and some policy evaluation method? Let's say actions were chosen according to some deterministic, non-exploratory (Edit: biased) heuristic? In this case, I would have the action and the cost, but I wouldn't have the probability (or it would be equal to 1).
I've tried a method where I use an exploratory approach and assume that the historical data is fully labelled (assigning a reward of zero for unknown rewards), but it seemed that the PMF collapsed to zero over most actions.
My question is, is there a way to leverage historical data that was not based on a contextual bandit policy using --cb (or some other method) and some policy evaluation method? Let's say actions were chosen according to some deterministic, non-exploratory heuristic? In this case, I would have the action and the cost, but I wouldn't have the probability (or it would be equal to 1).
Yes, set the probability to 1. With a degenerate logging policy there are no theoretical guarantees but in practice this can be helpful for initialization. Going forward you'll want to have some nondeterminism in your logging policy or you will never improve.
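Concretely, the logged data from the question would be rewritten with every probability set to 1, e.g.:

1:2:1 | a c
3:0.5:1 | b d
4:1.2:1 | a b c

and fed to --cb as before.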
I've tried a method where I use an exploratory approach and assume that the historical data is fully labelled (assign reward of zero for unknown rewards) but the PMF collapses to zero over most actions.
If you actually have historical data that is fully labeled you can use the warm start functionality. If you are pretending you have fully labeled data I'm not sure it's better than just setting the probability to 1.

Is H2O DAI's MLI display menu dependent on the algorithms used in its experiments?

I see H2O DAI picks the optimized algorithm for the dataset automatically. I hear that the contents of MLI (machine learning interpretation) from other platforms (like SAS Viya) are dependent on the algorithm used. For example, LOCO is not available for GBM, etc. (Of course, this is a purely hypothetical example.)
Is it the same with H2O DriverlessAI? Or does it always show the same MLI menus regardless of the algorithms used?
Currently, MLI will always display the same dashboard for any DAI algorithm, with the following exceptions: Shapley plots are not currently supported for RuleFit and TensorFlow models, multinomial experiments currently only show Shapley and Feature Importance (global and local), and Time Series experiments are not yet supported for MLI. What this means is you can expect to always see: K-LIME/LIME-SUP, Surrogate Decision Tree, Partial Dependence Plot, and Individual Conditional Expectation. Note this may change in the future; for the most up-to-date details please see the documentation.

Contextual Bandit using Vowpal Wabbit

In this case, one of the inputs is the probability of choosing an arm/action but how do we find that probability?
Isn't finding that probability itself a big task in hand?
Supplying the probability means you are taking a scenario where you are feeding in actions taken historically, e.g. from a log, rather than performing the real online scenario. This is useful because (at least some of) Vowpal's Contextual Bandits models can be bootstrapped from historical data. That is, a Contextual Bandits policy learnt over historical data can outperform one that learns online from scratch, something you can do only if you have historical data relevant to your online scenario.
The Wiki page has recently been edited to better reflect that this format generalizes to this case.
Another (contrived) use case for including probabilities might be that you are acting against multiple environments, but in any event, to the best of my understanding, the probability here can be interpreted as a mere frequency.
As such, my understanding is that you do not have to supply the probability part of your input when not feeding in historical interaction data. Just skip it, as in the example here.
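To illustrate (the feature names here are invented): a historical interaction carries the full action:cost:probability label, while a line sent purely for a prediction carries only features, and VW treats the unlabeled line as a test example:

1:2:0.4 | user_morning region_eu
| user_morning region_eu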

VowpalWabbit: Differences and scalability

I am trying to ascertain how VowpalWabbit's "state" is maintained as the size of our input set grows. In a typical machine learning environment, if I have 1000 input vectors, I would expect to send all of those at once, wait for a model building phase to complete, and then use the model to create new predictions.
In VW, it appears that the "online" nature of the algorithm shifts this paradigm to be more performant and capable of adjusting in real-time.
How is this real-time model modification implemented?
Does VW take increasing resources with respect to total input data size over time? That is, as I add more data to my VW model (when it is small), do the real-time adjustment calculations begin to take longer once the cumulative number of feature vector inputs increases to thousands, tens of thousands, or millions?
Just to add to carlosdc's good answer.
Some of the features that set vowpal wabbit apart, and allow it to scale to tera-feature (10^12) data-sizes, are:
The online weight vector:
vowpal wabbit maintains an in-memory weight vector which is essentially the vector of weights for the model that it is building. This is what you call "the state" in your question.
Unbounded data size:
The size of the weight vector is proportional to the number of features (independent input variables), not the number of examples (instances). This is what makes vowpal wabbit, unlike many other (non-online) learners, scale in space. Since it doesn't need to load all the data into memory like a typical batch learner does, it can still learn from data sets that are too big to fit in memory.
Cluster mode:
vowpal wabbit supports running on multiple hosts in a cluster, imposing a binary tree graph structure on the nodes and using the all-reduce reduction from leaves to root.
Hash trick:
vowpal wabbit employs what's called the hashing trick. All feature names get hashed into an integer using murmurhash-32. This has several advantages: it is very simple and time-efficient, not having to deal with hash-table management and collisions, while allowing features to occasionally collide. It turns out (in practice) that a small number of feature collisions in a training set with thousands of distinct features acts like an implicit regularization term. This, counter-intuitively, often improves model accuracy rather than decreasing it. It is also agnostic to the sparseness (or density) of the feature space. Finally, it allows the input feature names to be arbitrary strings, unlike most conventional learners which require feature names/IDs to be both a) numeric and b) unique.
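For a sense of scale (train.dat is a placeholder), the size of the hashed weight space is controlled by the -b (bit precision) flag, e.g.:

vw -b 24 -d train.dat

which hashes all feature names into 2^24 (about 16.8 million) weight slots; raising -b reduces collisions at the cost of memory.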
Parallelism:
vowpal wabbit exploits multi-core CPUs by running the parsing and learning in two separate threads, adding further to its speed. This is what makes vw able to learn as fast as it reads data. It turns out that most supported algorithms in vw, counter-intuitively, are bottlenecked by IO speed rather than by learning speed.
Checkpointing and incremental learning:
vowpal wabbit allows you to save your model to disk while you learn, and then to load the model and continue learning where you left off with the --save_resume option.
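A minimal sketch of that workflow (file names are placeholders):

vw -d day1.dat --save_resume -f model_day1.vw
vw -d day2.dat --save_resume -i model_day1.vw -f model_day2.vw

The second invocation loads the saved state with -i and continues learning where the first left off.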
Test-like error estimate:
The average loss calculated by vowpal wabbit "as it goes" is always on unseen (out-of-sample) data (*). This eliminates the need to bother with pre-planned hold-outs or do cross validation. The error rate you see during training is 'test-like'.
Beyond linear models:
vowpal wabbit supports several algorithms, including matrix factorization (roughly sparse matrix SVD), Latent Dirichlet Allocation (LDA), and more. It also supports on-the-fly generation of term interactions (bi-linear, quadratic, cubic, and feed-forward sigmoid neural-net with user-specified number of units), multi-class classification (in addition to basic regression and binary classification), and more.
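A few illustrative invocations of these options (data.dat is a placeholder, and the parameter values are arbitrary):

vw -q ab -d data.dat          # quadratic interactions between namespaces a and b
vw --cubic abc -d data.dat    # cubic interactions across namespaces a, b, c
vw --nn 10 -d data.dat        # single hidden layer of 10 sigmoid units
vw --lda 20 -d data.dat       # Latent Dirichlet Allocation with 20 topics
vw --oaa 5 -d data.dat        # 5-class one-against-all classification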
There are tutorials and many examples in the official vw wiki on GitHub.
(*) One exception is if you use multiple passes with the --passes N option.
VW is a (very) sophisticated implementation of stochastic gradient descent. You can read more about stochastic gradient descent here.
It turns out that a good implementation of stochastic gradient descent is basically I/O bound; it goes as fast as you can get the data to it, so VW has some sophisticated data structures to "compile" the data.
Therefore the answer to question (1) is: by doing stochastic gradient descent, and the answer to question (2) is: definitely not.
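A sketch of that "compilation" in practice (train.dat is a placeholder): the -c flag makes VW write a binary cache file on the first pass, which subsequent passes read much faster than the original text:

vw -d train.dat -c --passes 5

Whether this is exactly the data structure carlosdc had in mind is an assumption, but the cache file is VW's standard mechanism for fast repeated reads.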
