I know from this page that there is an option to train a contextual bandit VW model based on historical contextual bandit data collected using some exploration policy:
VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not implement exploration, it assumes it can only use the currently available data logged using an exploration policy.
This is done by specifying --cb and passing data formatted as action:cost:probability | features:
1:2:0.4 | a c
3:0.5:0.2 | b d
4:1.2:0.5 | a b c
2:1:0.3 | b c
3:1.5:0.7 | a d
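For concreteness, here is roughly how I feed such data to VW from Python (a minimal sketch assuming the vowpalwabbit Python bindings, where the class is called Workspace in recent releases; the command-line equivalent would be something like vw --cb 4 -d train.dat):

# Sketch: training a --cb model on logged bandit data with 4 possible actions.
from vowpalwabbit import Workspace

train_lines = [
    "1:2:0.4 | a c",
    "3:0.5:0.2 | b d",
    "4:1.2:0.5 | a b c",
    "2:1:0.3 | b c",
    "3:1.5:0.7 | a d",
]

vw = Workspace("--cb 4 --quiet")
for line in train_lines:
    vw.learn(line)              # each line is action:cost:probability | features

print(vw.predict("| a b"))      # action the learned policy would choose for this context
vw.finish()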
My question is, is there a way to leverage historical data that was not based on a contextual bandit policy using --cb (or some other method) and some policy evaluation method? Let's say actions were chosen according to some deterministic, non-exploratory (Edit: biased) heuristic? In this case, I would have the action and the cost, but I wouldn't have the probability (or it would be equal to 1).
I've tried a method where I use an exploratory approach and assume that the historical data is fully labelled (assign reward of zero for unknown rewards) but it seemed that the PMF collapses to zero over most actions.
My question is, is there a way to leverage historical data that was not based on a contextual bandit policy using --cb (or some other method) and some policy evaluation method? Let's say actions were chosen according to some deterministic, non-exploratory heuristic? In this case, I would have the action and the cost, but I wouldn't have the probability (or it would be equal to 1).
Yes, set the probability to 1. With a degenerate logging policy there are no theoretical guarantees but in practice this can be helpful for initialization. Going forward you'll want to have some nondeterminism in your logging policy or you will never improve.
I've tried a method where I use an exploratory approach and assume that the historical data is fully labelled (assign reward of zero for unknown rewards) but the PMF collapses to zero over most actions.
If you actually have historical data that is fully labeled you can use the warm start functionality. If you are pretending you have fully labeled data I'm not sure it's better than just setting the probability to 1.
When using Vowpal Wabbit for contextual bandits, here is my understanding so far:
We can build a predictor model for predicting the rewards
We can also then use an exploration strategy to choose actions (each action's reward estimate is obtained from the predictor model of #1 above)
I can use the --cb option to optimize a predictor based on the already collected contextual bandit data. The --cb option is only for building a model that predicts the rewards; it doesn't perform any exploration when choosing actions (it always picks the action with the best predicted reward). Hence this is the functionality for #1 above. Doubly robust is the default for --cb, and you can specify another method using the --cb_type flag.
The --cb_explore option performs exploration over the actions (#2 above). What I am not sure about is which method it uses for predicting the actions' rewards when I specify --cb_explore. All the examples refer to the exploration strategies and don't specify the default prediction strategy used for --cb_explore.
If no exploration strategy is provided, the default will be epsilon-greedy. You can see some of the other alternatives here.
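For example (a minimal sketch assuming the vowpalwabbit Python bindings; the exact class name varies across versions):

# Sketch: --cb_explore layers epsilon-greedy exploration on top of the --cb reduction.
from vowpalwabbit import Workspace

vw = Workspace("--cb_explore 4 --epsilon 0.2 --quiet")   # epsilon-greedy over 4 actions
vw.learn("1:2:0.4 | a c")

# With --cb_explore, predict returns a probability distribution (PMF) over the
# 4 actions rather than a single chosen action.
print(vw.predict("| a c"))
vw.finish()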
In Vowpal Wabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
From what I understand, Vowpal Wabbit tries to fit one linear model to each arm.
So if weights were calculated using an average across all the arms, then they would correlate with getting a reward generally, instead of indicating which features make the model pick one arm over another.
I am interested in finding out how they are calculated so I can interpret the results obtained. I tried searching its GitHub repository but could not find anything meaningful.
I am interested in finding out how they are calculated so I can interpret the results obtained.
Unfortunately knowing the first does not lead to knowing the second.
Your question is concerned with contextual bandits, but it is important to note that interpreting model parameters is an issue that also occurs in supervised learning. Machine learning has made progress recently (i.e., within my lifetime) largely by focusing on the quality of predictions rather than the meaningfulness of model parameters. In a blog post, Phoebe Wong outlines the issue while being entertaining.
The bottom line is that our models are not causal, so you simply cannot conclude that "the weight of feature X for arm A is large" means "if I were to intervene in the system and increase this feature value, I will get more reward for playing arm A".
We are currently working on tools for model inspection that leverage techniques such as permutation importance that will help you answer questions like "if I were to stop using a particular feature how would the frequency of playing each arm change for the trained policy". We're hoping that is helpful information.
Having said all that, let me try to answer your original question ...
In Vowpal Wabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
The format is documented here. Assuming you are using --cb (not --cb_adf), there is a fixed number of arms, and so the offset field will increment over the arms. So for an example like
1:2:0.4 |foo bar
with --cb 4 you'll get an audit output with namespace of foo, feature of bar, and offset of 0, 1, 2, and 3.
Interpreting the output when using --cb_adf is possible but difficult to explain succinctly.
From what I understand, Vowpal Wabbit tries to fit one linear model to each arm.
Shorter answer: With --cb_type dm, essentially VW independently tries to predict the average reward for each arm using only examples where the policy played that arm. So the weight you get from audit at a particular offset N is analogous to what you would get from a supervised learning model trained to predict reward on a subset of the historical data consisting solely of times the historical policy played arm N. With other --cb_type settings the interpretation is more complicated.
Longer answer: "Linear model" refers to the representation being used. VW can incorporate nonlinearities into the model, but let's ignore that for now. "Fit" is where some important details are. VW takes the partial feedback information of a CB problem (partial feedback = "for this example you don't know the reward of the arms not pulled") and reduces it to a full feedback supervised learning problem (full feedback = "for this example you do know the reward of all arms"). The --cb_type argument selects the reduction strategy. There are several papers on the topic; a good place to start is Dudik et al. and then look for papers that cite this paper. In terms of code, ultimately things are grounded here, but the code is written more for performance than intelligibility.
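To illustrate the per-arm analogy from the shorter answer, here is a conceptual sketch with made-up data (this is not what VW actually does internally):

# Conceptual sketch of the --cb_type dm analogy: fit one cost regressor per arm,
# using only the logged rows where that arm was played.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]])   # made-up context features
arm = np.array([0, 1, 0, 1])                                     # arm played by the logging policy
cost = np.array([2.0, 0.5, 1.2, 1.0])                            # observed cost for that arm

per_arm_models = {}
for a in np.unique(arm):
    mask = arm == a
    per_arm_models[a] = LinearRegression().fit(X[mask], cost[mask])

# The coefficients of per_arm_models[a] play the role of the weights that
# --audit would report at offset a.
for a, model in per_arm_models.items():
    print(a, model.coef_)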
I am building a recommendation system for my company and have a question about the formula to calculate precision@K and recall@K, which I couldn't find on Google.
With precision@K, the general formula would be the proportion of recommended items in the top-k set that are relevant.
My question is how to define which items are relevant and which are not, because a user doesn't necessarily have interactions with all available items, only a small subset of them. What if there is a lack of ground truth for the top-k recommended items, meaning that the user hasn't interacted with some of them, so we don't have the actual rating? Should we exclude them from the calculation or consider them irrelevant items?
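To make the two options concrete, here is a minimal sketch of how I would compute it either way (plain Python; the item IDs below are made up):

# Sketch: precision@K and recall@K for a single user, with two ways of handling
# recommended items the user never interacted with (no ground truth available).
def precision_recall_at_k(recommended, relevant, rated, k, ignore_unjudged=True):
    top_k = recommended[:k]
    if ignore_unjudged:
        # Option 1: drop items the user never interacted with from the calculation.
        top_k = [item for item in top_k if item in rated]
    # Option 2 (ignore_unjudged=False): unjudged items stay in top_k and count as irrelevant.
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ["i1", "i7", "i3", "i9", "i2"]   # model's ranked recommendations
rated = {"i1", "i2", "i3", "i4"}               # items the user actually interacted with
relevant = {"i1", "i2"}                        # interacted items counted as relevant
print(precision_recall_at_k(recommended, relevant, rated, k=5, ignore_unjudged=True))
print(precision_recall_at_k(recommended, relevant, rated, k=5, ignore_unjudged=False))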
The following article suggests ignoring these non-interaction items, but I am not really sure about that.
https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
Thanks a lot in advance.
You mention "recommended items" so I'll assume you're talking about calculating precision for a recommender engine, i.e. the number of predictions in the top k that are accurate predictions of the user's future interactions.
The objective of a recommender engine is to model future interactions from past interactions. Such a model is trained on a dataset of interactions such that the last interaction is the target and n past interactions are the features.
The precision would therefore be calculated by running the model on a test set where the ground truth (last interaction) was known, and dividing the number of predictions where the ground truth was within the top k predictions by the total number of test items.
Items that the user has not interacted with do not come up because we are training the model on behaviour of other users.
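In code, that calculation looks roughly like this (a sketch; predict_top_k is a hypothetical placeholder for whatever your model's prediction API is):

# Sketch: fraction of test users whose held-out last interaction appears in the
# model's top-k predictions.
def precision_at_k(test_set, predict_top_k, k):
    hits = 0
    for past_interactions, last_interaction in test_set:
        top_k = predict_top_k(past_interactions, k)   # list of predicted item IDs
        if last_interaction in top_k:
            hits += 1
    return hits / len(test_set)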
In this case, one of the inputs is the probability of choosing an arm/action but how do we find that probability?
Isn't finding that probability itself a big task in hand?
Supplying the probability means you are taking a scenario where you are feeding in actions taken historically, e.g. from a log, rather than performing the real online scenario. This is useful because (at least some of) Vowpal Wabbit's contextual bandit models can be bootstrapped from historical data. That is, a contextual bandit policy learnt over historical data can outperform one that learns online from scratch, something you can do only if you have historical data relevant to your online scenario.
The Wiki page has been recently edited to better reflect that this format generalizes for this case.
Another (contrived) use case for including probabilities might be that you are acting against multiple environments, but in any event to the best of my understanding the probability here can be interpreted as a mere frequency.
As such, my understanding is that you do not have to supply the probability part of your input when not feeding in historical interaction data. Just skip it, as in the example here.
I am slightly confused as to what "feature selection / extraction / weights" mean and the difference between them. As I read the literature I sometimes feel lost, since I find the terms used quite loosely. My primary concerns are:
When people talk of Feature Frequency, Feature Presence - is it feature selection?
When people talk of algorithms such as Information Gain, Maximum Entropy - is it still feature selection?
If I train the classifier - with a feature set that asks the classifier to note the position of a word within a document as an example - would one still call this feature selection?
Thanks
Rahul Dighe
Rahul-
All of these are good answers. The one thing I would mention is that the fundamental difference between selection and extraction has to do with how you are treating the data.
Feature Extraction methods are transformative -- that is, you are applying a transformation to your data to project it into a new feature space with lower dimension. PCA and SVD are examples of this.
Feature Selection methods choose features from the original set based on some criteria; Information Gain, Correlation, and Mutual Information are criteria used to filter out unimportant or redundant features. Embedded or wrapper methods, as they are called, can use specialized classifiers to achieve feature selection and classify the dataset at the same time.
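To make the distinction concrete, here is a small sketch using scikit-learn with made-up data:

# Feature extraction (PCA: project onto new, lower-dimensional features) versus
# feature selection (mutual information: keep a subset of the original features).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # 100 samples, 10 original features
y = (X[:, 0] + X[:, 3] > 0).astype(int)   # labels that depend on features 0 and 3

X_extracted = PCA(n_components=3).fit_transform(X)                       # 3 new, transformed features
X_selected = SelectKBest(mutual_info_classif, k=3).fit_transform(X, y)   # 3 original features kept

print(X_extracted.shape, X_selected.shape)   # (100, 3) (100, 3)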
A really nice overview of the problem space is given here.
Good Luck!
Feature extraction: reduce dimensionality by (linear or non-linear) projection of a D-dimensional vector onto a d-dimensional vector (d < D).
Example: principal component analysis.
Feature selection: reduce dimensionality by selecting a subset of the original variables.
Example: forward or backward feature selection.
Feature Selection is the process of choosing "interesting" features from your set for further processing.
Feature Frequency is just that: the frequency with which a feature appears.
Information Gain, Maximum Entropy, etc. are weighting methods that use feature frequency, which in turn allows you to perform feature selection.
Think of it like this:
You parse a corpus and create a term/document matrix. This matrix starts out as a count of the terms and the documents in which they appear (simple frequency).
To make that matrix more meaningful, you weight the terms based on some function of the frequency (like term frequency-inverse document frequency, information gain, or maximum entropy). Now the matrix contains the weights, or importance, of each term in relation to the other terms in the matrix.
Once you have that, you can use feature selection to keep only the most important terms (if you are doing stuff like classification or categorization) and perform further analysis.
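A minimal sketch of that pipeline, using scikit-learn with tf-idf as the weighting function (the corpus and labels are made up):

# Count matrix -> tf-idf weights -> keep only the most informative terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock prices rose sharply today",
    "the market fell on bad earnings",
]
labels = [0, 0, 1, 1]   # made-up categories: 0 = pets, 1 = finance

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)        # weighted term/document matrix

selector = SelectKBest(chi2, k=5)      # keep the 5 highest-scoring terms
X_reduced = selector.fit_transform(X, labels)

kept_terms = [t for t, keep in zip(tfidf.get_feature_names_out(), selector.get_support()) if keep]
print(kept_terms)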